Principal Component Analysis and Dimension Reduction – Part II

Category: Data Engineering

In the previous post, we saw why Principal Component Analysis is worth caring about. In this post, we will do a deep dive into how it is implemented.

Now that you have some idea of how higher-dimensional data can be mapped to lower dimensions, we will walk through the implementation below, which comes from a Jupyter notebook.

I have downloaded data for three companies listed on the Indian stock market from Quandl; we will use this data throughout the post.

The data are for Hitech Corporation Ltd., Housing and Urban Development Corporation Ltd. (HUDCO) and Bhagyanagar India Ltd. The features of each data set are the date, the opening price, the highest and lowest prices reached during the day, the last traded price, the closing price, the total traded quantity and the turnover in lakhs.

To load the data sets, we will use the pandas library. If you are interested in the data, you can create a free account on the Quandl website and download the data sets yourself.

 import pandas as pd
 df_hitech = pd.read_csv("NSE-HITECHCORP.csv")
 df_hitech.head()

 df_hudco = pd.read_csv("NSE-HUDCO.csv")
 df_hudco.head()
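The Bhagyanagar data, which the rest of the post refers to as df_bhagyanar, is loaded the same way. Here is a minimal sketch; the exact CSV filename is an assumption, so use whatever Quandl gave you.

 # Filename assumed to follow the same NSE-<code>.csv pattern as the others.
 df_bhagyanar = pd.read_csv("NSE-BHAGYANAGAR.csv")
 df_bhagyanar.head()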

Let us take a look at the shapes of the three matrices.

 print(df_hitech.shape, df_bhagyanar.shape, df_hudco.shape)
view rawpca3.py hosted with ❤ by GitHub

((51, 8), (55, 8), (53, 8))

So our three data sets are matrices of dimensions (51 x 8), (55 x 8) and (53 x 8), where the columns are the different features and every row is a separate data sample.

Visualization

To get a feel for how the different stocks are distributed across opening prices and total traded quantities, let us visualize them.
First, get the relevant data. Right now we are only interested in visualization, and since we cannot show all the features in a single plot, we will focus on a few specific columns.

 import numpy as np
 from bokeh.charts import Line, show
 from bokeh.io import output_notebook
 
 # Render the plots inline in the notebook.
 output_notebook()
  
 X1 = df_hitech.loc[0:5, ['Date', 'Open', 'Total Trade Quantity']].values
 X2 = df_bhagyanar.loc[0:5,['Date', 'Open', 'Total Trade Quantity']].values
 X3 = df_hudco.loc[0:5, ['Date', 'Open', 'Total Trade Quantity']].values
  
 # Get a feel for the data.
 print([x for x in X1])
 print([x for x in X2])

[array(['2017-08-02', 209.1, 2935.0], dtype=object), array(['2017-08-01', 212.0, 5094.0], dtype=object), array(['2017-07-31', 212.0, 6803.0], dtype=object), array(['2017-07-28', 208.0, 2023.0], dtype=object), array(['2017-07-27', 213.05, 6714.0], dtype=object), array(['2017-07-26', 218.95, 3389.0], dtype=object)]
[array(['2017-08-02', 24.5, 5857.0], dtype=object), array(['2017-08-01', 24.75, 12622.0], dtype=object), array(['2017-07-31', 25.05, 31276.0], dtype=object), array(['2017-07-28', 25.15, 11483.0], dtype=object), array(['2017-07-27', 26.0, 19290.0], dtype=object), array(['2017-07-26', 26.5, 17808.0], dtype=object)]

Show how the opening prices are distributed over time.

 data = dict(
     hitech=[x[1] for x in X1],
     bhagyanagar=[x[1] for x in X2],
     date=[x[0] for x in X1],
 )
 
 line_plot = Line(data, x='date', title="Opening prices", legend="top_left",
                  xlabel='date', ylabel='opening price')
 show(line_plot)
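Note that the bokeh.charts module used above has since been removed from Bokeh. If you are on a recent Bokeh version, a rough equivalent using bokeh.plotting (my sketch, not the original code) would be:

 import pandas as pd
 from bokeh.plotting import figure, show
 
 # The dates come in as strings, so convert them for a datetime x-axis.
 p = figure(x_axis_type="datetime", title="Opening prices",
            x_axis_label="date", y_axis_label="opening price")
 p.line(pd.to_datetime([x[0] for x in X1]), [x[1] for x in X1],
        legend_label="hitech", color="navy")
 p.line(pd.to_datetime([x[0] for x in X2]), [x[1] for x in X2],
        legend_label="bhagyanagar", color="green")
 show(p)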

 

Standardizing
We want to compare oranges to oranges and apples to apples. Generally, we are more interested in how the values change than in the scale on which they were published. Since this data set is made up of numbers gathered on very different scales, it is wise to standardize it to unit scale, with mean 0 and variance 1.

 from sklearn.preprocessing import StandardScaler
  
 column_names = ['Open', 'High', 'Low', 'Last', 'Close', 'Total Trade Quantity', 'Turnover (Lacs)']
  
 X_hitech = df_hitech.loc[:, column_names].values
 X_bhagyanagar = df_bhagyanar.loc[:, column_names].values
 X_hudco = df_hudco.loc[:, column_names].values
  
 X = np.concatenate([X_hitech, X_bhagyanagar, X_hudco], axis=0)
  
 X_std = StandardScaler().fit_transform(X)
  
 print(X_std.shape)

(159, 7)

Covariance matrix
We can perform the eigendecomposition on the covariance matrix.
In code, the covariance can be computed as shown below.
Note that the shape of the standardized matrix is 159 x 7, so we should get a 7 x 7 covariance matrix.

 mean_vec = np.mean(X_std, axis=0)
  
 cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
  
 print('Covariance matrix \n{}'.format(cov_mat))
 print('Covariance matrix shape is: {}'.format(cov_mat.shape))

Covariance matrix
[[ 1.00632911  1.00550536  1.00597911  1.00542205  1.00545316 -0.06315605 -0.05901863]
 [ 1.00550536  1.00632911  1.00541311  1.00598897  1.00602285 -0.0542054  -0.0497978 ]
 [ 1.00597911  1.00541311  1.00632911  1.00562847  1.00565687 -0.06570958 -0.06198619]
 [ 1.00542205  1.00598897  1.00562847  1.00632911  1.00628689 -0.06051605 -0.05646143]
 [ 1.00545316  1.00602285  1.00565687  1.00628689  1.00632911 -0.05997139 -0.05587568]
 [-0.06315605 -0.0542054  -0.06570958 -0.06051605 -0.05997139  1.00632911  0.99963087]
 [-0.05901863 -0.0497978  -0.06198619 -0.05646143 -0.05587568  0.99963087  1.00632911]]
Covariance matrix shape is: (7, 7)

The code above is only meant to show how the covariance is computed. In practice, NumPy's cov function is used.

 covariance_matrix = np.cov(X_std.T)
 print(covariance_matrix)

[[ 1.00632911  1.00550536  1.00597911  1.00542205  1.00545316 -0.06315605 -0.05901863]
 [ 1.00550536  1.00632911  1.00541311  1.00598897  1.00602285 -0.0542054  -0.0497978 ]
 [ 1.00597911  1.00541311  1.00632911  1.00562847  1.00565687 -0.06570958 -0.06198619]
 [ 1.00542205  1.00598897  1.00562847  1.00632911  1.00628689 -0.06051605 -0.05646143]
 [ 1.00545316  1.00602285  1.00565687  1.00628689  1.00632911 -0.05997139 -0.05587568]
 [-0.06315605 -0.0542054  -0.06570958 -0.06051605 -0.05997139  1.00632911  0.99963087]
 [-0.05901863 -0.0497978  -0.06198619 -0.05646143 -0.05587568  0.99963087  1.00632911]]

Next, we perform an eigendecomposition on this matrix. Eigendecomposition is the process of decomposing a square matrix into its eigenvectors and eigenvalues. The eigenvectors define the new axes (directions) onto which the data gets projected. Take a look at Wikipedia to learn more.

We find the Eigen vectors and the Eigen values using the numpy library. In PCA the Eigen vectors are called loadings. Let’s call them the matrix W.

 eigen_vals, eigen_vecs = np.linalg.eig(covariance_matrix)
  
 print(eigen_vals)
 print(eigen_vecs)

[ 5.04062300e+00  1.99469506e+00  6.70989466e-03  1.52221819e-03  5.08344416e-04  2.04592597e-04  4.06913481e-05]
[[-0.44636953 -0.0255697  -0.00233638  0.61091357  0.39241169 -0.52200819  0.02000256]
 [-0.44626454 -0.03201096 -0.03801748 -0.31356635  0.66065443  0.51116882 -0.04791904]
 [-0.44644488 -0.02362002  0.04386902  0.47716468 -0.50791923  0.55804311 -0.03353175]
 [-0.44641063 -0.02742468  0.0018537  -0.3959805  -0.28922772 -0.3236239  -0.67437107]
 [-0.44640923 -0.02782594 -0.0024592  -0.37808739 -0.25605417 -0.22367243  0.73579569]
 [ 0.04405545 -0.70563431  0.70623273 -0.01919751  0.03123937 -0.00518154  0.0018351 ]
 [ 0.04224484 -0.705916   -0.70558526  0.02560183 -0.03707468  0.00362514 -0.00206853]]

To understand why the eigenvalues and eigenvectors are important, take a look at the math below. If this sounds too heavy, you can skip this section and just focus on the code.
We know that after performing the eigendecomposition we get the eigenvalues and the eigenvectors; let us call them λ and W. In terms of principal component analysis, the scores T are the product of the matrices X and W, i.e.

T = XW

Singular Value Decomposition
Finding the eigenvalues and eigenvectors is not the most computationally efficient way of obtaining the loadings, the matrix W. There is a shortcut that does the same thing: Singular Value Decomposition, or SVD.
It works by breaking the same input matrix X down into three matrices U, Σ and V*:

X = UΣV*   (equation 1)

The columns of U are the left singular vectors and the columns of V are the right singular vectors, while Σ contains the singular values, in decreasing order, along its diagonal. Strictly speaking V* is the conjugate transpose of V, but since we are dealing with real-valued data you can think of it simply as the transpose of V.
Now, interestingly, the matrix V turns out to be identical to the loadings W.
How is this transformation beneficial to us? Let us explore that. We know that the scores T are given by

T = XW

Substituting equation 1 for X, and using W = V, we can write

T = UΣV*V

Since V*V is the identity matrix, we get

T = UΣ

So we can find the scores if we can compute U and Σ, which can be done efficiently. If you are interested in trying this right away, take a look at this SVD computation example.
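As a quick check of the relationship above, here is a minimal sketch (my addition, not part of the original notebook) that computes the SVD of the standardized matrix with NumPy and verifies that the scores UΣ match X·W:

 # Economy-size SVD of the standardized data: X_std = U @ np.diag(S) @ Vt
 U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
 
 # The columns of V (rows of Vt) play the role of the loadings W.
 W_svd = Vt.T
 
 # Scores computed two ways: T = U @ diag(S) and T = X @ W.
 T_svd = U * S                      # broadcasting is equivalent to U @ np.diag(S)
 T_proj = X_std.dot(W_svd)
 print(np.allclose(T_svd, T_proj))  # True
 
 # Compared with the eigenvector-based loadings, individual columns may only
 # agree up to a sign flip (here eig happened to return the eigenvalues in
 # decreasing order already, so the columns line up).
 print(np.allclose(np.abs(W_svd), np.abs(eigen_vecs)))

Internally, scikit-learn's PCA also takes this SVD route rather than an explicit eigendecomposition of the covariance matrix.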

Selecting Principal Components

Now we will select the principal components. This is partly a judgment call, but several rules of thumb can be followed, and knowing how many components to keep ultimately comes with experience and experimentation.

With the help of the eigendecomposition, we have transformed our data into a new coordinate system whose axes are the eigenvectors.

We need to decide which eigenvectors can be dropped without losing too much information. The eigenvectors with the smallest eigenvalues carry the least information about the distribution of the data. Why is that? Remember the definition of eigenvectors and eigenvalues.

An eigenvector v of a matrix A satisfies Av = λv; the eigenvalue λ is the factor by which v is stretched or shrunk. So if λ is small, the matrix A has little effect along the direction v, which for a covariance matrix means there is little variance in that direction.
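As a quick sanity check (this snippet is my addition, not part of the original notebook), we can verify the defining relation Av = λv for the largest eigenpair computed above:

 # Check A v = λ v for the first (largest) eigenpair from np.linalg.eig.
 v = eigen_vecs[:, 0]
 lam = eigen_vals[0]
 print(np.allclose(covariance_matrix.dot(v), lam * v))  # should print True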
We will try to see how much each value contributes to the overall variance.


 total = sum(eigen_vals)
 distribution = [100 * (i / total) for i in sorted(eigen_vals, reverse=True)]
 
 # Running cumulative sum of the percentage contributions.
 cumulative_distribution = []
 running_total = 0
 for i in distribution:
     running_total += i
     cumulative_distribution.append(running_total)
 
 bottom = [0] + cumulative_distribution[:-1]
 left = [2 * x for x in range(7)]
 right = [x + 1 for x in left]
 print(cumulative_distribution)
 print(len(cumulative_distribution))
 print(bottom)
 print(len(bottom))
 print(list(left))
 print(list(right))

[71.556013843924504, 99.872439612636256, 99.967692385037282, 99.989301591992117, 99.996517981727536, 99.999422351033218, 99.999999999999986]
7
[0, 71.556013843924504, 99.872439612636256, 99.967692385037282, 99.989301591992117, 99.996517981727536, 99.999422351033218]
7
[0, 2, 4, 6, 8, 10, 12]
[1, 3, 5, 7, 9, 11, 13]

 from bokeh.plotting import figure, show
 
 p = figure(width=400, height=400)
 p.quad(top=cumulative_distribution,
        bottom=bottom,
        left=[1, 2, 3, 4, 5, 6, 7],
        right=[1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2],
        color="#B3DE69")
 
 show(p)

You can see that after the first two components, the contribution of the remaining components is so small that it can barely be seen; they are there, but you would have to zoom in to find them. The first two principal components explain more than 99% of the variance in our data.

Now that you know where the majority of the variance in your data set comes from, it makes sense to trim the extra dimensions. Effectively, we will construct a projection matrix to transform our data into a new feature subspace. In this case we keep only two components, reducing our 7-dimensional feature space to a 2-dimensional feature subspace. We do this by choosing only the top 2 eigenvectors with the highest eigenvalues to construct our (7 x 2) projection matrix W.

 # Map each eigenvalue to its eigenvector.
 eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]
 
 # Sort the (eigenvalue, eigenvector) pairs from high to low.
 # Sorting on the eigenvalue only avoids comparing numpy arrays.
 eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)
 
 # Check that the values are decreasing.
 print([e[0] for e in eigen_pairs])

[5.0406230005245538, 1.9946950557326693, 0.0067098946634397017, 0.0015222181861223813, 0.00050834441617230038, 0.00020459259729253709, 4.0691348102287684e-05]

The code for this step did not embed here; it is available at https://gist.github.com/infinite-Joy/91177e07b9916e2f268e33a4154c6360/.
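As a minimal sketch of what that gist most likely does (an assumed reconstruction based on the output below), we stack the two leading eigenvectors column-wise into the projection matrix matrix_w, which the later code relies on:

 # Assumed reconstruction of the linked gist: build the (7 x 2) projection
 # matrix from the two eigenvectors with the largest eigenvalues.
 matrix_w = np.hstack((eigen_pairs[0][1].reshape(7, 1),
                       eigen_pairs[1][1].reshape(7, 1)))
 print(matrix_w)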

[[-0.44636953 -0.0255697 ]
[-0.44626454 -0.03201096]
[-0.44644488 -0.02362002]
[-0.44641063 -0.02742468]
[-0.44640923 -0.02782594]
[ 0.04405545 -0.70563431]
[ 0.04224484 -0.705916 ]]

Dimensionality Reduction

Using this new loading matrix W, we can reduce the dimensions of the original matrix X. This can be done using the matrix multiplication property, whereby if you multiply two matrices of dimensions (m x n) and (n x p), you get a new matrix of dimensions (m x p). So, in this case, we multiply matrix X, which is (159 x 7) and matrix W, which is (7 x 2) to get a new matrix (159 x 2).

Take a look at the code below.

 Y = X_std.dot(matrix_w)
  
 print(Y[:4])

array([[-3.12570525, 0.1778062 ],
[-3.22460186, 0.17146108],
[-3.26460636, 0.16888734],
[-3.227314 , 0.17157395]])

If you remember, the shapes of the hitech, bhagyanagar and hudco data sets were (51, 8), (55, 8) and (53, 8). Hence, to map each projected row back to its company, we create a Z vector/list of labels with shape (159, 1).

The plotting code did not embed here; it is available in the gist at https://gist.github.com/infinite-Joy/446130b2e02e914359ecca1e5610916d.
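That gist presumably builds the label vector and the scatter plot of the projected data. Here is a minimal sketch of how it could look; the colors and the use of bokeh.plotting are my assumptions, not the original code:

 from bokeh.plotting import figure, show
 
 # One label per row, in the same order the matrices were concatenated.
 Z = (['hitech'] * df_hitech.shape[0]
      + ['bhagyanagar'] * df_bhagyanar.shape[0]
      + ['hudco'] * df_hudco.shape[0])
 
 colors = {'hitech': 'navy', 'bhagyanagar': 'green', 'hudco': 'firebrick'}
 
 p = figure(width=400, height=400,
            title="Projection onto the first two principal components")
 p.scatter(Y[:, 0], Y[:, 1], color=[colors[z] for z in Z], alpha=0.6)
 show(p)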

You can see that two of the companies are tightly packed, indicating that there is not much variation in them, while the third company shows a lot of variation. That company has had a rocky year.

In production
The code shown above is mainly for learning purposes. In production we would use the built-in PCA implementation from the sklearn library.

 from sklearn.decomposition import PCA as productionPCA
 prod_pca = productionPCA(n_components=2)
 Y_prod = prod_pca.fit_transform(X_std)
  
 print(Y_prod[:2])
  
 print(Y[:2])

The resulting plots may look mirrored, because the signs of the components can be flipped. When you implement this in production, you will probably not compute the eigenvalues yourself to check whether you are capturing most of the variance in the data set, as we did earlier. To find out how much variance your new feature subspace captures, you can use the explained_variance_ratio_ attribute.

 print(prod_pca.explained_variance_)
 print(prod_pca.explained_variance_ratio_)
 print(np.sum(prod_pca.explained_variance_ratio_.tolist()))

[ 5.00892097 1.9821498 ]
[ 0.71556014 0.28316426]
0.99872439612636232

Generally, people inspect the explained variance ratios or keep a hard cutoff threshold, say 95%. In our case, the cumulative variance chart we plotted earlier captured most of the variation and the shoulder of the curve was quite steep. But even in cases where the shoulder is shallow, having such a cutoff gives you a principled and consistent way to decide how many components to keep.
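As an illustration of such a cutoff (my addition, not from the original post), scikit-learn's PCA accepts a float between 0 and 1 for n_components and keeps just enough components to explain that fraction of the variance:

 from sklearn.decomposition import PCA
 
 # Keep the smallest number of components explaining at least 95% of the variance.
 pca_95 = PCA(n_components=0.95)
 Y_95 = pca_95.fit_transform(X_std)
 print(pca_95.n_components_)                    # number of components actually kept
 print(pca_95.explained_variance_ratio_.sum())  # variance explained by those components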

You can also download the whole blog as a jupyter notebook if you want to try out the code yourself.

Now that you know about Principal Component Analysis, take a look at the decision systems around you where data is the primary input, and try to observe whether they are using more data than necessary. If they are, there is your gold mine.

 

References:
Siraj Raval. Dimensionality Reduction. YouTube.

Importance of Principal Component Analysis. Quora.

Principal Component Analysis. Washington University. YouTube.

Minimizing reconstruction error is equivalent to maximizing variance. Statistics StackExchange.

Making sense of Eigen Vectors and Eigen Values and PCA. stats.stackexchange.com.

Plotly. Principal Component Analysis.
