Unsupervised Learning

Emergence

Till now we have mostly concerned ourselves with supervised machine learning and this is the field with the most practical applications right now.

Having said that, it is unsupervised machine learning that holds the promise for the future. Why? Because getting data is cheap; it is getting labelled data that is relatively hard.

In this post, we will take a view of the common unsupervised machine learning algorithms and techniques.

K Means clustering
K-means clustering aims to cluster or group n observations into k clusters such that the centers correspond to the respective means of the respective groups. To find the mean Euclidean distance is used but you can use any valid measure for computing the distance. The algorithm for learning using k-means clustering is below.

1.Guess cluster centers at random.
2.Assign each data point to the cluster centers. Even though they are randomly chosen, assign the most likely corresponding data points. Here, Euclidean distance can be used.

Now that we have a correspondence between the data points and the cluster centers, find the optimal cluster center that corresponds to the points associated with the current groups of data points. You will now have a new set of cluster centers.

3. With this new set of cluster centers jump to step 2 and iterate through the rest of the processes either by defining the number of iterations or by defining if the step jump of the cluster centers is below a small threshold.

Expectation Maximization
K-Means algorithms are a simple and scalable machine learning algorithm. But it does not have the satisfaction and beauty of other machine learning algorithms as there is sound mathematical background involved in the understanding of why K-Means works. We see the result of this in practice where K Means is quite prone to converging to local minima. This is where Expectation Minimization algorithm comes into the picture.

In Expectation Maximization all the cluster points have mapping to all the data points, the difference is that the correspondence is more loosely based and probabilistic. On the flip side, due to the correspondence being a probability distribution, the computation can be very slow.

For example, for a two clusters, the probability distribution of the two clusters may look like the figure on the right. Thus in EM, the aim is to compute the probability distributions and arrive at the final cumulative distribution.

Dimensionality Reduction

Dimensionality Reduction techniques can be used for finding the underlying relationships between data. In the real world the data that we have correlates to each other a lot.

The steps in principal component analysis are shown below.

1.Take the whole data set with d dimensions and N samples.
2.Compute the d dimensional mean vector (the mean for every dimension).
3.Compute the covariance matrix of the whole data set.
4.Compute the eigen vectors and the eigen values.
5.Sort the eigen vectors based on decreasing eigen values.
6.Choose only the first k eigen vectors based on how many dimensions you want to keep, where k ≤ d.7
7.Use this kxN matrix to transform the samples into the new space.

Through the application of dimensionality reduction, clusters in the data may emerge and hence, we will be able to bucket them. PCA helps us in understanding latent interaction between the variables in an unsupervised setting.

Spectral Clustering
Till now all the techniques for unsupervised learning that we have discussed are based on linear transformations. These linear algorithms fail when the relationships are non-linear. Non-linearity in case of supervised algorithms have been handled using kernel methods but we don’t have that luxury in case of unsupervised learning.

For example although the below curve is two dimension we can express it in one dimension as it is a sine curve. We can do so because we know the relationship between the x axis and the y axis. But if the y axis relationship is not known as is the case with unsupervised learning, then reducing the dimension will be a tough job.

Hence, to tackle such issues spectral clustering is used. The fundamental idea of spectral clustering is to cluster by affinity. So, you will create an affinity matrix which will be a sparse matrix and will show the distance of all the points relative to all the other points. Affinity matrix can be created in various ways, either by defining an adjacency matrix, or by using a Gaussian kernel. Scikit-learn uses the following formula for finding the similarity.

similarity = np.exp(-beta * distance / distance.std())

Once done, we can run K-means as the final step.

References:

Expectation Maximisation Algorithm
Gaussian Mixture Models
Multivariate Normal Distributions
Principal Component Analysis
spectral clustering
scikitlearn: spectral clustering
http://www.statisticshowto.com/em-algorithm-expectation-maximization/