Why eigenvectors of the covariance matrix are principal components in PCA

Given a set of random variables (multivariate data), many of us know how to find their principal components (PCs): compute the eigenvectors of their covariance matrix. But why does this method guarantee that we find the PCs? Why is variance maximized along those eigenvectors? Many people ask these questions on the Internet [Refs. 2, 3], but many answers are useless, merely restating the conclusion. Hence, I decided to give a 10-minute answer to this question.

First of all, how do we formulate Principal Component Analysis (PCA)? Saying "find vectors such that variance is maximized along them" is not enough. We need to express this idea mathematically.

If we choose new axes one at a time, each capturing the maximum remaining variance, then the resulting variables are uncorrelated. Equivalently, their covariance matrix is diagonal, i.e., all cross-covariance terms are zero.

Therefore, the essence of PCA is to change the basis of the data (i.e., rotate the coordinate system if the data is zero-mean) such that the covariance matrix of the data becomes diagonal (namely, the cross-covariances are zero in the new basis/coordinate system).

In the language of matrices, here is a formal definition of PCA [Ref. 1]: Given an $m$-by-$n$ matrix $\mathbf{X}$ representing $n$ $m$-dimensional zero-mean data points (each column of $\mathbf{X}$ is one data point), find a linear transformation (denoted as matrix $\mathbf{P}$) such that the covariance matrix $\mathbf{C_Y}$ of $\mathbf{Y}$ is diagonal, where $\mathbf{Y}=\mathbf{PX}$ and $\mathbf{C_Y}=(\mathbf{Y}\mathbf{Y}^T)/n$.

Note that here the data are centered (the means are 0). That's why we can use $\mathbf{Y}\mathbf{Y}^{T}$ to compute the covariance matrix. Similarly, the original covariance matrix is $\mathbf{C_X}=\mathbf{X}\mathbf{X}^{T}/n$.
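To make this concrete, here is a minimal sketch in NumPy (the toy data and its dimensions are my own assumptions, not from Ref. 1) computing $\mathbf{C_X}$ for centered data stored column-wise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n = 500 two-dimensional points, deliberately correlated.
# Columns of X are data points, matching the convention in the text.
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500).T

# Center the data so each variable has zero mean (required for X @ X.T / n).
X = X - X.mean(axis=1, keepdims=True)

n = X.shape[1]
C_X = X @ X.T / n  # covariance matrix; off-diagonal entries are nonzero here
```

Because the toy data is correlated by construction, the off-diagonal entries of `C_X` are far from zero, which is exactly what PCA will fix.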

So the last step is to solve for $\mathbf{P}$. How? From the problem formulation, we get

$$\begin{split}
\mathbf{C_Y} & = (\mathbf{Y}\mathbf{Y}^T)/n \\
& = \mathbf{PX}(\mathbf{PX})^T/n \\
& = \mathbf{PXX^TP^T}/n \\
& = \mathbf{PC_XP}^T
\end{split}$$

There is a famous theorem on matrix diagonalization (the spectral theorem): any symmetric matrix $\mathbf{A}$ can be written as $\mathbf{A}=\mathbf{EDE^{-1}}$, where $\mathbf{D}$ is a diagonal matrix and each column of $\mathbf{E}$ is an eigenvector of $\mathbf{A}$. Moreover, because $\mathbf{A}$ is symmetric, its eigenvectors can be chosen orthonormal, so $\mathbf{E}$ is an orthogonal matrix and its inverse equals its transpose: $\mathbf{E}^{-1}=\mathbf{E}^{T}$.
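A quick numerical check of this theorem (the small matrix here is an arbitrary example I made up):

```python
import numpy as np

# Any symmetric matrix has an orthonormal eigenbasis (spectral theorem).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # symmetric by construction

eigvals, E = np.linalg.eigh(A)  # eigh is for symmetric/Hermitian matrices
D = np.diag(eigvals)

# E is orthogonal: its transpose is its inverse.
assert np.allclose(E.T @ E, np.eye(2))
# A = E D E^T reconstructs the original matrix.
assert np.allclose(E @ D @ E.T, A)
```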

Remember that we want $\mathbf{C_Y}$ to be diagonal. Let's see what magic happens if we simply let $\mathbf{P}=\mathbf{E}^T$, where $\mathbf{C_X}=\mathbf{EDE^{-1}}$:

$$
\begin{split}
\mathbf{C_Y} & = \mathbf{PC_XP}^T \\
& = \mathbf{E}^T(\mathbf{EDE^{-1}})\mathbf{E}\\
& = (\mathbf{E}^{-1}\mathbf{E})\mathbf{D}(\mathbf{E}^{-1}\mathbf{E})\\
& = \mathbf{D}
\end{split}
$$

So, if we simply apply the eigendecomposition (EVD; the V in the abbreviation is just to make it look like SVD) to $\mathbf{C_X}$, then $\mathbf{E}^T$ is the $\mathbf{P}$ we want and $\mathbf{D}$ is the new covariance matrix $\mathbf{C_Y}$. The diagonal entries of $\mathbf{D}$ are the eigenvalues of $\mathbf{C_X}$, i.e., the variances along the new axes, so sorting them in decreasing order ranks the principal components by how much variance each captures.
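The whole derivation fits in a few lines of NumPy. This sketch regenerates the same synthetic data as above (the data itself is an assumption for illustration) and verifies that $\mathbf{C_Y}=\mathbf{D}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same synthetic zero-mean data as before: columns are data points.
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500).T
X = X - X.mean(axis=1, keepdims=True)
n = X.shape[1]

C_X = X @ X.T / n

# Eigendecomposition of the symmetric matrix C_X: columns of E are eigenvectors.
# np.linalg.eigh returns eigenvalues in ascending order.
eigvals, E = np.linalg.eigh(C_X)

P = E.T            # the change of basis we were looking for
Y = P @ X          # data expressed in the eigenvector basis
C_Y = Y @ Y.T / n  # diagonal (up to floating-point error), with eigvals on the diagonal
```

Each row of `P` is a principal component; the corresponding entry of `eigvals` is the variance of the data along that component.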

The End.

References:

1. Jonathon Shlens, A Tutorial on PCA, http://arxiv.org/pdf/1404.1100v1.pdf
2. StackExchange: http://math.stackexchange.com/questions/23596/why-is-the-eigenvector-of-a-covariance-matrix-equal-to-a-principal-component
3. Quora: http://www.quora.com/Machine-Learning/What-is-an-eigenvector-of-a-covariance-matrix
4. Duncan Fyfe Gillies, http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture15.pdf
5. David Forsyth, PCA, http://luthuli.cs.uiuc.edu/~daf/courses/CS-498-DAF-PS/Lecture%209%20-%20PCA.pdf
