In this chapter, we shall explore the concept of principal component analysis (PCA), which is closely related to the SVD. We shall look at the similarity between PCA and the SVD along with some applications.
Subsection 10.1.1 Introduction
Large datasets with a large number of features/variables are very common and widespread. Interpreting such a large dataset is a complex task. To interpret such datasets one requires a method that reduces the dimension/features drastically, while at the same time preserving most of the information in the dataset. Principal component analysis (PCA) is one of the most widely used dimensionality reduction techniques. The main idea of PCA is to reduce the dimensionality of the dataset while preserving as much of the variability as possible. It does so by creating a new set of uncorrelated variables that successively maximize the variance. Finding these new variables, also known as principal components, reduces the problem to solving an eigenvalue-eigenvector problem.
Let us look at the set of points in the plane (data with two features) in Figure 10.1.1. In this case the data has maximum spread or variability along the \(y\)-axis. Thus if we project the points onto the \(y\)-axis, the variability in the data is captured; in particular, we can ignore the \(x\)-coordinates. On the other hand, if we look at the set of points in Figure 10.1.2, the maximum spread or variability lies along the \(x\)-axis. Thus if we project the points onto the \(x\)-axis, the variability in the data is captured; in particular, we can ignore the \(y\)-coordinates. Thus in these two examples, we are able to reduce the dimension by 1.
Figure 10.1.1. Variability along \(y\)-axis
Figure 10.1.2. Variability along \(x\)-axis
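To make this concrete, here is a minimal sketch in Python (the data is synthetic and generated only for illustration; the figures above use their own point sets) showing that keeping the coordinate with the larger spread preserves most of the variability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2D data with a small spread along x and a large spread along y,
# mimicking the situation of Figure 10.1.1.
points = np.column_stack([
    rng.normal(scale=0.2, size=100),   # x-coordinates (small variance)
    rng.normal(scale=2.0, size=100),   # y-coordinates (large variance)
])

var_x, var_y = points.var(axis=0, ddof=1)

# Projecting onto the y-axis keeps only the y-coordinates, yet retains
# almost all of the variability of the 2D data.
projected = points[:, 1]
print(var_x, var_y, projected.var(ddof=1))
```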
Now suppose we have 12 points as shown in Figure 10.1.3, again in \(\R^2\text{,}\) that is, having two features/dimensions. The spread of this data is not along the \(x\)-axis but roughly along the axis shown in Figure 10.1.4, that is, along the vector \(u_1\text{.}\) So if we project these points onto the line along \(u_1\) as shown in Figure 10.1.4, we will have the maximum spread or variation of the data. Thus \(u_1\) is the new axis along which the data has maximum variation.
Figure 10.1.3. Data with two features.
Figure 10.1.4. Variability along vector \(u_1\text{.}\)
Next, if one looks carefully at the data points, one can see that the data also has some dispersion or variation along the line in the direction \(u_2\) shown in Figure 10.1.5, which is not captured by the line along \(u_1\text{.}\) In other words, we need to create another axis perpendicular to the first one.
Thus we have two perpendicular coordinate axes, a new coordinate system, along which all the variation in the data can be captured: the maximum variation along \(u_1\) and the second maximum along \(u_2\text{.}\) Here \(u_1\) is called the first principal direction and \(u_2\) is called the second principal direction. Thus we can work with the new coordinate axes and forget about the original \(x\)- and \(y\)-axes, as shown in Figure 10.1.6. We can even rotate the new coordinate system so that it coincides with the original \(x\)- and \(y\)-axes.
Figure 10.1.5. 1st and 2nd principal components \(u_1\) and \(u_2\text{.}\)
Figure 10.1.6. 1st and 2nd principal components \(u_1\) and \(u_2\text{.}\)
The above two examples geometrically explain the essence of PCA. The idea is to project the original high dimensional data onto a new coordinate system and choose only the first few coordinate axes, also called principal components. How many principal components to take depends upon how much variation we wish to capture.
Subsection 10.1.2 Mathematics behind PCA
Let us assume that we have data with \(d\) features and \(n\) samples. This data can be represented by an \(n\times d\) matrix, say \(X\text{.}\) Thus
\begin{equation*}
X = \begin{bmatrix} x_{11}\amp x_{12}\amp \cdots \amp x_{1d}\\ x_{21}\amp x_{22}\amp \cdots \amp x_{2d}\\ \vdots \amp \vdots \amp \ddots \amp \vdots\\ x_{n1}\amp x_{n2}\amp \cdots \amp x_{nd} \end{bmatrix}\text{.}
\end{equation*}
Thus each column of \(X\) represents a feature and there are \(n\) samples of each feature.
Now we are looking for a unit vector \(u_1\) such that when we project the data onto \(u_1\text{,}\) the variance of the projected data is maximum.
Before we do this in general, let us look at the meaning of the projection of data in two dimensions (that is, in \(\R^2\)) onto a unit vector. Suppose \(u=(a,b)\) is a unit vector and \(p_1=(x_1,y_1)\) is a point/vector in \(\R^2\text{.}\) Then the length of the projection of \(p_1\) onto \(u\) is the dot product \(p_1\cdot u = x_1a+y_1b\text{.}\) If we have another point, say \(p_2 =(x_2,y_2)\text{,}\) then the projections of both these points can be captured as
\begin{equation*}
\begin{bmatrix} x_1\amp y_1\\ x_2\amp y_2 \end{bmatrix}\begin{bmatrix} a\\ b \end{bmatrix} = \begin{bmatrix} x_1a+y_1b\\ x_2a+y_2b \end{bmatrix}\text{.}
\end{equation*}
Thus in general the projection of the data \(X\text{,}\) which is an \(n\times d\) matrix, onto a unit vector \(u_1=\begin{bmatrix}u_{11}\amp u_{12}\amp \cdots \amp u_{1d} \end{bmatrix}^T\) is the \(n\times 1\) vector \(Xu_1\text{.}\)
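As a quick illustration (the two points and the unit vector below are made up), the lengths of the projections of all data points onto a unit vector are obtained by a single matrix-vector product:

```python
import numpy as np

# Two points p1 = (x1, y1) and p2 = (x2, y2) stored as the rows of X.
X = np.array([[1.0, 2.0],
              [3.0, 1.0]])

# A unit vector u = (a, b).
u = np.array([1.0, 1.0]) / np.sqrt(2.0)

# The length of the projection of each point onto u is its dot product with u;
# stacking the points as rows of X gives all of them at once as X @ u.
lengths = X @ u
print(lengths)      # same as [X[0] @ u, X[1] @ u]
```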
Next we deal with the second ingredient in PCA, namely, variance. For this we take the centered data \(X_c =X-\overline{X}\text{,}\) where \(\overline{X}\) is the \(n\times d\) matrix each of whose rows is the vector of column means \(\begin{bmatrix}\overline{x_1}\amp \overline{x_2}\amp \cdots \amp \overline{x_d}\end{bmatrix}\) with \(\overline{x_j}=\frac{1}{n}\sum_{i=1}^{n}x_{ij}\text{.}\) The covariance matrix of \(X\) is given by
\begin{equation*}
S =\text{Cov}(X)=\frac{1}{n-1}X_c^TX_c\text{.}
\end{equation*}
Note that (i) \(S\) is symmetric and (ii) \(S\) is positive semi-definite, that is, all eigenvalues of \(S\) are non-negative. Also \(S\) is orthogonally diagonalizable. In particular, there exists an orthogonal matrix \(U = \begin{bmatrix}u_1\amp u_2\amp \cdots \amp u_d \end{bmatrix}\) such that \(U^TSU = \text{diag}(\lambda_1,\cdots,\lambda_d)\text{.}\) What we want is to maximize the variance of the projection of the data onto a unit vector \(u\text{.}\) That is, we want to find a unit vector \(u\) such that the variance of \(X_cu\) is maximum. In other words,
\begin{align*}
\text{maximize }\amp \frac{1}{n-1}(X_cu)^T{X_cu}=\frac{1}{n-1}u^T(X_c^TX_c)u=u^TSu\\
\text{subject to } \amp \norm{u}=1\text{.}
\end{align*}
It turns out that the solution of this optimization problem is the unit eigenvector \(u\) of \(S\) corresponding to its largest eigenvalue, and the maximum value of the variance is that eigenvalue. Thus the variance of the data projected onto a unit vector is maximum when \(u\) is such an eigenvector of the covariance matrix \(S\text{.}\)
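The following sketch (with a random stand-in data matrix) checks this numerically: the eigenvector of \(S\) belonging to the largest eigenvalue maximizes the variance of the projected data, and the maximum variance equals that eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))            # 200 samples, 3 features (illustrative)

# Center the data and form the covariance matrix S = X_c^T X_c / (n - 1).
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (Xc.shape[0] - 1)

# Eigen-decomposition of the symmetric matrix S (eigh returns ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

# The variance of the projected data X_c u1 equals the largest eigenvalue.
print(np.var(Xc @ u1, ddof=1), eigvals[-1])
```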
Note that \(S\) is of order \(d\times d\) and, being symmetric, has \(d\) orthonormal eigenvectors. We arrange these eigenvectors in decreasing order of their eigenvalues. That is, \(u_1\) is the eigenvector corresponding to the largest eigenvalue \(\lambda_1\) and is called the first principal component. The eigenvector \(u_2\) corresponding to the second largest eigenvalue \(\lambda_2\) is called the second principal component. Thus if we project the data onto the second principal component, it will have the second largest variance. Look at Figure 10.1.7, in which the data is plotted along with the principal components. In Figure 10.1.8, the projection of the data onto the first principal component is plotted along with the data.
Figure 10.1.7. Data set with principal components
Figure 10.1.8. Projection onto the 1st principal component
The next question is how many principal components we should choose. This depends upon what percentage of the variance of the data we wish to capture. Suppose we want to capture 90% of the variation; then we choose the first \(k\) components such that
\begin{equation*}
\frac{\lambda_1+\lambda_2+\cdots+\lambda_k}{\lambda_1+\lambda_2+\cdots+\lambda_d}\geq 0.90\text{.}
\end{equation*}
Let \(V=\begin{bmatrix}u_1\amp u_2\amp \cdots \amp u_k \end{bmatrix}\) be the \(d\times k\) matrix whose columns are the first \(k\) principal components. Here \(V\) is called the loading matrix. The new or transformed data is \(Z=XV\text{.}\) Once we know the transformed data, we can approximately reconstruct the original data as \(X\approx ZV^T\) (the reconstruction is exact when \(k=d\)).
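A minimal sketch of the whole procedure, assuming the samples are the rows of a NumPy array `X` (the function name `pca_reduce` and its threshold argument are my own; the data is centered before projecting, so the mean is added back in the reconstruction):

```python
import numpy as np

def pca_reduce(X, var_threshold=0.90):
    """Keep the fewest principal components explaining at least
    `var_threshold` of the total variance (a sketch, not a library API)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # centered data
    S = Xc.T @ Xc / (Xc.shape[0] - 1)              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]              # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = np.cumsum(eigvals) / eigvals.sum() # cumulative variance ratio
    k = int(np.searchsorted(explained, var_threshold)) + 1

    V = eigvecs[:, :k]                             # loading matrix (d x k)
    Z = Xc @ V                                     # transformed data (n x k)
    X_approx = Z @ V.T + mean                      # approximate reconstruction
    return Z, V, X_approx

# Illustrative use on random data; any n x d array works.
rng = np.random.default_rng(2)
Z, V, X_approx = pca_reduce(rng.normal(size=(100, 5)))
```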
Example 10.1.9.
Consider the following two-dimensional data.
\(x_1\): 2.5, 0.5, 2.2, 1.9, 3.0, 2.3, 2.0, 1.0, 1.5, 1.1
\(x_2\): 2.0, 0.7, 2.9, 2.2, 2.8, 2.7, 1.6, 1.1, 1.6, 0.9
Find the first and the second principal components of this data set. Explain what percentage of the variance is explained by the 1st principal component.
The means are \(\overline{x_1} =1.8\) and \(\overline{x_2}=1.85\text{.}\) The centered data set is
\(x_1-\overline{x_1}\): 0.7, -1.3, 0.4, 0.1, 1.2, 0.5, 0.2, -0.8, -0.3, -0.7
\(x_2-\overline{x_2}\): 0.15, -1.15, 1.05, 0.35, 0.95, 0.85, -0.25, -0.75, -0.25, -0.95
The covariance matrix is
\begin{equation*}
S=\frac{1}{9}X_c^TX_c \approx \begin{pmatrix}0.5889\amp 0.5456\\0.5456\amp 0.6428\end{pmatrix}\text{.}
\end{equation*}
The eigenvalues of \(S\) are \(\lambda_1 =1.1620\) and \(\lambda_2=0.0696\text{,}\) and the corresponding eigenvectors are \(u_1 = \begin{pmatrix}0.6894\\0.7243 \end{pmatrix}\) and \(u_2 = \begin{pmatrix}0.7243\\-0.6894 \end{pmatrix}\text{.}\) Hence the first principal component explains \(\frac{\lambda_1}{\lambda_1+\lambda_2}=\frac{1.1620}{1.2316}\approx 0.943\text{,}\) that is, about 94.3% of the variance.
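These numbers can be reproduced with a few lines of NumPy (the signs of the eigenvectors may come out flipped, which does not matter for principal directions):

```python
import numpy as np

x1 = [2.5, 0.5, 2.2, 1.9, 3.0, 2.3, 2.0, 1.0, 1.5, 1.1]
x2 = [2.0, 0.7, 2.9, 2.2, 2.8, 2.7, 1.6, 1.1, 1.6, 0.9]
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)                  # centered data (means 1.8 and 1.85)
S = Xc.T @ Xc / (X.shape[0] - 1)         # covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
print(eigvals[::-1])                     # approximately [1.1620, 0.0696]
print(eigvecs[:, ::-1])                  # columns approximately u1 and u2

# Fraction of the variance explained by the first principal component.
print(eigvals[-1] / eigvals.sum())       # approximately 0.943
```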