Basics Of Statistics For Machine Learning Engineers II

More essential statistical concepts

In my previous post, I hope I drove home how important basic statistical concepts are in machine learning and shed some light on concepts such as mean, median and mode.

In this post, we will look at some more essential concepts: variance and standard deviation, percentiles and moments, covariance and correlation, and conditional probability and Bayes’ theorem.

Variance and Standard Deviation


Variance and standard deviation both measure the spread of the data in a data set.

Variance is the average of the squared differences from the mean. In mathematical terms:

σ² = (1/N) Σ (X − μ)²

where σ² is the variance, N is the number of observations (the whole population), X is each individual observation and μ is the mean. Taking the same example as before, if x is given by the following numbers,

x = [23, 40, 6, 74, 38, 1, 70]

then the variance can be calculated as shown below.

observations = [23, 40, 6, 74, 38, 1, 70]
mean = 36
difference_from_the_mean = [13, 4, 30, 38, 2, 35, 34]
square_of_the_differences = [169, 16, 900, 1444, 4, 1225, 1156]
variance = (169+16+900+1444+4+1225+1156)/7 = 4914/7 = 702

One question that may come to mind is why the variance is written as σ². This is because the standard deviation is denoted by σ, and the standard deviation is the square root of the variance. So in the above case, the standard deviation is σ = √702 ≈ 26.5.
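
The same calculation can be sketched in Python. This is only a minimal illustration using the standard library; the variable names are mine, and it computes the population variance (dividing by N), which is the definition used above.

```python
import math
import statistics

observations = [23, 40, 6, 74, 38, 1, 70]

# Population mean: sum of the values divided by N.
mean = sum(observations) / len(observations)

# Population variance: average of the squared differences from the mean.
variance = sum((x - mean) ** 2 for x in observations) / len(observations)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev, 2))

# The standard library's population functions agree with the manual result.
assert math.isclose(variance, statistics.pvariance(observations))
assert math.isclose(std_dev, statistics.pstdev(observations))
```

Note that statistics.variance and statistics.stdev divide by N - 1 instead, which is the usual choice when the data is a sample rather than the whole population.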

Standard deviation is a useful way to identify outliers. Data points that lie more than one standard deviation from the mean can be considered somewhat unusual, and in many cases data points that are more than two standard deviations away from the mean are left out of the analysis. As a rule of thumb for roughly normal data, about 68% of the points fall within one standard deviation of the mean and about 95% within two. We can talk about how extreme a data point is by asking the question “how many sigmas away from the mean is this?”
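
The “how many sigmas away” question can be answered with a few lines of Python. The sketch below is an assumption-laden illustration, not a recommended outlier rule: the function name sigmas_from_mean and the threshold value are hypothetical, and the population standard deviation is used.

```python
import statistics

def sigmas_from_mean(values):
    """Return how many population standard deviations each value lies from the mean."""
    mean = statistics.fmean(values)
    std_dev = statistics.pstdev(values)
    return [(v - mean) / std_dev for v in values]

observations = [23, 40, 6, 74, 38, 1, 70]
z_scores = sigmas_from_mean(observations)

# threshold is a hypothetical cut-off; change it to 2.0 for the stricter rule.
threshold = 1.0
unusual = [v for v, z in zip(observations, z_scores) if abs(z) > threshold]
print(unusual)
```

On this small data set the one-sigma cut-off flags four of the seven points, while a two-sigma cut-off flags none, which shows how much the choice of threshold matters.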

Percentiles and Moments

When a value is at the x-th percentile, it means that x percent of the values in the distribution are below that value.
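
As a rough sketch, the definition can be turned directly into code. The helper name percentile_rank is hypothetical, and it counts values strictly below the given one; libraries use several slightly different conventions for percentiles, so results may differ a little from theirs.

```python
def percentile_rank(values, value):
    """Percentage of entries in `values` that are strictly below `value`."""
    below = sum(1 for v in values if v < value)
    return 100.0 * below / len(values)

observations = [23, 40, 6, 74, 38, 1, 70]

# Four of the seven observations (23, 6, 38, 1) are below 40.
print(percentile_rank(observations, 40))
```

Here 40 sits at about the 57th percentile of the observations.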

Moments describe the shape of a probability distribution. The zeroth moment is the total probability of the distribution, which is 1. The first moment is the mean. The second moment, taken about the mean, is the variance. The third moment, once standardized, is the skew, which measures how lopsided the distribution is. The fourth moment, once standardized, is the kurtosis, which measures how heavy the tails are and is often described as how sharp the peak of the graph is.
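
A minimal sketch of these quantities in Python, assuming the common convention that skew and kurtosis are the third and fourth central moments divided by the matching power of the standard deviation. The function names are hypothetical and the data is treated as a whole population.

```python
import math

def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def describe_shape(values):
    """Return (mean, variance, skew, kurtosis) for a list of numbers."""
    mean = sum(values) / len(values)
    variance = central_moment(values, 2)
    std_dev = math.sqrt(variance)
    skew = central_moment(values, 3) / std_dev ** 3
    kurtosis = central_moment(values, 4) / std_dev ** 4
    return mean, variance, skew, kurtosis

observations = [23, 40, 6, 74, 38, 1, 70]
print(describe_shape(observations))
```

A perfectly symmetric data set such as [1, 2, 3, 4, 5] gives a skew of 0, which is a quick sanity check for the code.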


Moments are important because, under some assumptions, the moments of a sample are good estimates of the moments of the population it was drawn from. Under some realistic assumptions, we can even get a good feel for how far off our sample moments are from the population moments. And once the population moments are known, the shape of the population probability distribution is, in most practical cases, known as well.

Covariance and Correlation

Let’s say we have two different attributes of something. Covariance and Correlation are the tools that we have to measure if the two attributes are related to each other or not.

Covariance measures how two variables vary together around their respective means. The formula for covariance is:

Cov(X, Y) = (1/n) Σ (x_i − E(x)) (y_i − E(y))

where x_i and y_i are the individual values of X and Y for i = 1, 2, ..., n, each value is equally likely and occurs with probability 1/n, and E(x) and E(y) are the means of X and Y.
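
Here is a small Python sketch of that definition, assuming equally likely values (so the sum is divided by n). The function name and the example data (hours and scores) are made up purely for illustration.

```python
def covariance(xs, ys):
    """Population covariance: mean of (x - E(x)) * (y - E(y))."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical data: the two attributes rise together.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

print(covariance(hours, scores))        # positive: they move in the same direction
print(covariance(hours, scores[::-1]))  # negative: they move in opposite directions
```

The sign tells us the direction of the relationship, but the size depends on the units of the data, so it is hard to interpret on its own.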

Correlation also measures how two variables move with respect to each other, but on a fixed scale from -1 to 1. A perfect positive correlation means that the correlation coefficient is 1. A perfect negative correlation means that the correlation coefficient is -1. A correlation coefficient of 0 means that there is no linear relationship between the two variables; it does not by itself mean that they are independent of each other. The correlation coefficient is the covariance divided by the product of the two standard deviations:

ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)

Both correlation and covariance measure only the linear relationship between two variables. They can fail to discover a higher-order (nonlinear) relationship between the two. Correlation is the covariance of the data after it has been standardized. If we are interested in knowing whether there is a relationship, correlation is the better measure, because it does not depend on the units of the data and so also measures the strength of the relationship.
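
The limitation to linear relationships is easy to see with a small experiment. The sketch below assumes the Pearson correlation coefficient (covariance divided by the product of the population standard deviations); the function name and the toy data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # y is a straight-line function of x
squared = [x ** 2 for x in xs]     # y is fully determined by x, but not linearly

print(correlation(xs, linear))   # very close to 1
print(correlation(xs, squared))  # 0, even though y depends entirely on x
```

The second result is the important one: a correlation of 0 rules out a linear relationship only, not a relationship of any kind.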

Probability and Statistics

We use a lot of probability concepts in statistics, and hence in machine learning, because the two fields share the same methods and differ mainly in direction. In probability, the model is given and we need to predict the data. In statistics, we start with the data and infer the model. We look through the distributions known from probability theory for one that closely matches the distribution of the data that we have. Then we assume that our data was generated by the same function or model that we studied in probability theory.
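
The two directions can be sketched with a biased coin. Everything here is an illustrative assumption: the function names, the value 0.7 and the sample size are made up, and a fixed random seed is used only to make the run repeatable.

```python
import random

def simulate_flips(p_heads, n, rng):
    """Probability direction: the model (p_heads) is known, generate data from it."""
    return [1 if rng.random() < p_heads else 0 for _ in range(n)]

def estimate_p_heads(flips):
    """Statistics direction: only the data is known, estimate the model parameter."""
    return sum(flips) / len(flips)

rng = random.Random(0)  # fixed seed so the example is repeatable
flips = simulate_flips(p_heads=0.7, n=10_000, rng=rng)

# The estimate should land close to the 0.7 that generated the data.
estimate = estimate_p_heads(flips)
print(estimate)
```

With more flips the estimate tends to get closer to the true parameter, which is the sense in which the data lets us recover the model.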


Conditional Probability and Bayes’ theorem


Conditional probability is the probability of one thing happening given that another thing has already happened. Bayes’ theorem provides a simple way of calculating conditional probabilities.

Speaking mathematically, the probability of the model given the data is the probability of the data given the model, times the ratio of the independent probability of the model to the independent probability of the data:

P(model | data) = P(data | model) × P(model) / P(data)

Bayes’ theorem is simple but has profound implications. The degree of belief in a machine learning model can be thought of as a probability, and machine learning can be thought of as learning models of data. Thus, we can consider multiple models, find the probability of each one given the data, and then choose the model with the highest probability. In practice this may not be that simple, but it at least gives us a clear framework and keeps us from spending effort on models that have zero probability.
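
To make the idea of comparing models concrete, here is a minimal sketch with two candidate models of a coin. The model names, their probabilities of heads, the equal priors and the observed data (8 heads in 10 flips) are all hypothetical values chosen for illustration.

```python
from math import comb

def binomial_likelihood(p_heads, heads, flips):
    """P(data | model): probability of seeing `heads` heads in `flips` flips."""
    return comb(flips, heads) * p_heads ** heads * (1 - p_heads) ** (flips - heads)

def posteriors(priors, likelihoods):
    """Apply Bayes' theorem to every candidate model.

    P(model | data) = P(data | model) * P(model) / P(data), where P(data)
    is the sum of P(data | model) * P(model) over all candidate models.
    """
    evidence = sum(priors[m] * likelihoods[m] for m in priors)
    return {m: priors[m] * likelihoods[m] / evidence for m in priors}

models = {"fair": 0.5, "biased": 0.8}   # hypothetical P(heads) under each model
priors = {"fair": 0.5, "biased": 0.5}   # assumed equal belief before seeing data
heads, flips = 8, 10                    # hypothetical observed data

likelihoods = {name: binomial_likelihood(p, heads, flips) for name, p in models.items()}
result = posteriors(priors, likelihoods)
best = max(result, key=result.get)
print(result, best)
```

With these numbers the biased model ends up with a posterior probability of roughly 0.87, so it is the one we would pick. A model with a prior of zero would keep a posterior of zero no matter what the data says.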


Joydeep Bhattacharjee