Principal Component Analysis and Dimension Reduction


When we apply machine learning techniques, we usually have to work with large matrices. Each matrix may have many features, or dimensions, and the associated computation can be heavy. Running all of these computations in a production environment may be prohibitive, on top of the added problem of overfitting. On many occasions it is also very useful to visualize the data, but as human beings we cannot visualize more than three dimensions. For these reasons, we often resort to Principal Component Analysis, or PCA, to reduce the number of dimensions in our dataset.

What is principal component analysis?

Generally, when we collect data, we try to gather as many types of data points as possible, and in many cases the resulting variables are correlated with one another. Principal Component Analysis converts our original variables into a new set of variables, each a linear combination of the original ones. These new variables are uncorrelated with each other and are called principal components. Through PCA, we keep only those components that capture the maximum variance in the data. Not all of this discussion may make sense right now, so we will work through some simple mathematics to illustrate what we mean as we go forward.
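
To make this concrete, here is a minimal NumPy sketch of the idea; the synthetic data and variable names are ours, purely for illustration. We build two correlated measurements, decompose their covariance matrix, and check that the resulting principal components are uncorrelated and ordered by the variance they capture.

import numpy as np

# Minimal sketch of PCA on synthetic, correlated 2-D data.
# The data and variable names are illustrative assumptions, not from the original post.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)    # y is strongly correlated with x
X = np.column_stack([x, y])

# 1. Center the data.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition: the eigenvectors are the principal components,
#    the eigenvalues are the variance captured by each component.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # sort by variance, descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project the data onto the principal components.
Z = X_centered @ eigenvectors

print("variance captured by each component:", eigenvalues)
print("correlation between the new variables:")
print(np.corrcoef(Z, rowvar=False).round(3))     # off-diagonal entries are ~0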

Why are we interested in PCA?

Data science relies heavily on matrix multiplication, which is, in general, computationally expensive.
Take a simple data matrix with n samples and m measurements. Multiplying an n × m matrix by an m × p matrix with the naive algorithm (see the pseudocode below) takes O(nmp) time, which is O(n^3) when the matrices are square.

Input: matrices A and B
Let C be a new matrix of the appropriate size
For i from 1 to n:
    For j from 1 to p:
        Let sum = 0
        For k from 1 to m:
            Set sum ← sum + A[i][k] × B[k][j]
        Set C[i][j] ← sum
Return C

Source: Wikipedia
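
For readers who prefer something they can run, below is a direct Python translation of the pseudocode above. The function name multiply_matrices and the tiny example matrices are ours, purely for illustration.

# Direct Python translation of the pseudocode above.
def multiply_matrices(A, B):
    n, m = len(A), len(A[0])          # A is n x m
    p = len(B[0])                     # B is m x p
    C = [[0] * p for _ in range(n)]   # result is n x p
    for i in range(n):
        for j in range(p):
            total = 0
            for k in range(m):
                total += A[i][k] * B[k][j]
            C[i][j] = total
    return C

# Example usage with tiny matrices.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(multiply_matrices(A, B))        # [[19, 22], [43, 50]]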

You can, however, optimize this algorithm: the best known algorithms bring the exponent down to roughly 2.3737, i.e., O(n^2.3737). Another approach is to parallelize the matrix multiplication, but that requires a significant time investment in understanding your data set and segregating the parts that are mutually independent. What if I told you that, in many cases, there is a third and better approach: you can simply reduce the number of dimensions in your data set, reducing the computational complexity while still preserving most of the characteristics of the data.
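As a rough sketch of this third approach, assuming scikit-learn is available (the shapes, names, and the choice of 20 components below are arbitrary illustrations), we can project 300 correlated features down to 20 principal components and still retain almost all of the variance, making every downstream matrix product far cheaper.

import numpy as np
from sklearn.decomposition import PCA

# Illustrative only: build 1000 samples with 300 features that are really
# driven by 20 underlying factors plus a little noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(1000, 20))
mixing = rng.normal(size=(20, 300))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 300))

pca = PCA(n_components=20)                  # keep only 20 principal components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)       # (1000, 300) -> (1000, 20)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 4))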

Preparation
What is in store for you may be a bit heady, so to prepare I highly encourage you to watch the following video by the great Carl Sagan himself, in which he explains higher dimensions and how reducing the number of dimensions helps us think about and understand them.

In the next post, we will look at dimensionality reduction in more detail and at Principal Component Analysis, the most popular method for achieving it.
