Basics Of Statistics For Machine Learning Engineers

Essential statistical concepts if you are interested in machine learning

These days there is a Cambrian explosion of data science and machine learning tools that make it very easy to get started in machine learning. Perhaps you have heard the buzzword and want to try it out yourself. Maybe you have gone through tutorials for one of the popular machine learning libraries, such as scikit-learn, and have an idea of how to implement machine learning. You recognize that your problem has all the prerequisites that make it suitable for machine learning: you have a data set, and the problem seems to follow a pattern that you cannot pin down with a hand-written algorithm. So you throw the data set at the machine learning library and get something as an outcome.

Now the problem starts. You don’t understand what this ‘something’ is. More often than not, it is a collection of numbers in a weird format. Besides wanting better interpretability, the business for which your product or insights are built wants to know whether this is the best that can be done. As you are the one who implemented it, they want you to prove that these are state-of-the-art results and, if they are not, you are the one who needs to improve them. And secretly you are starting to realize that you have no clue what is happening around you.

Machine learning is a buzzword right now, and many people are trying their hand at it. It will therefore help you a great deal to understand a little about the intricacies that make machine learning tools and libraries so powerful. In this post, we will look at the basic statistical concepts that form the bedrock of analysis in machine learning.

Types of Data
Understanding the different types of data will help you choose suitable techniques for getting insights from the data.
The major types of data are:
Numerical:

This represents a quantifiable thing that you can measure. It can be discrete, which is normally a count of some event: for example, how many times a person visits the same doctor when sick, or the number of clothes bought by a customer on an eCommerce platform. Numerical data can also be continuous. A characteristic of continuous data is that it can take any value within a range, so the number of possible values is infinite. An example is how much rainfall there was in a given year.

Categorical:

These are data that have no inherent numerical meaning, such as man and woman, or the state of birth of a group of people. Numbers are often assigned to these values, but the numbers themselves don’t mean anything. One good way to represent categorical values is through graphs. In the figure, the scientists ‘Chandrasekhara Venkata Raman’ and ‘Jagadish Chandra Bose’ both have India as their birthplace, which is shown by the edges.

Ordinal:

This is a mixture of numerical and categorical data: the values are categories, but they have a meaningful order. An example is the ratings given by a customer, where 5 stars is better than 1 star.
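
To make the distinction concrete, here is a minimal Python sketch of the three types. All values and variable names are hypothetical and chosen only for illustration. The point it tries to show is that integer codes for categorical data are arbitrary labels, while the positions assigned to ordinal data can be compared.

```python
# Hypothetical records for illustration; none of these values come from a real data set.
doctor_visits = [2, 0, 5, 1]             # numerical, discrete: counts of events
rainfall_mm = [812.4, 1033.9, 640.0]     # numerical, continuous: measurements

# Categorical: the integer codes are arbitrary labels, so arithmetic on them is meaningless.
birthplaces = ["India", "Poland", "India", "Germany"]
codes = {name: code for code, name in enumerate(sorted(set(birthplaces)))}
encoded_birthplaces = [codes[place] for place in birthplaces]

# Ordinal: categories with a meaningful order, so comparing positions is valid.
rating_order = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
rank = {label: position for position, label in enumerate(rating_order)}
ratings = ["5 stars", "1 star", "3 stars"]
best_rating = max(ratings, key=rank.__getitem__)

print(encoded_birthplaces)  # [1, 2, 1, 0]
print(best_rating)          # 5 stars
```

Note that averaging encoded_birthplaces would produce a number with no interpretation, whereas picking the highest rating is a sensible operation.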

Mean, Median and Mode
These are the measures of central tendency of a data set.

Mean is given by the total of the values of the samples divided by the number of samples. In mathematical terms this would mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where x̄ is the mean and there are n values of x. Σ means that you need to take the sum of all the data points.
Below is an example where the mean of a set of data points is calculated.

This is a simple example taken to explain discrete data. The mean can be calculated in case of continuous data as well.
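
The definition of the mean can also be sketched in a few lines of Python. The data below is hypothetical (items bought by five customers) and the function name is only for illustration; the standard library's statistics.mean is used as a cross-check.

```python
import statistics


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


# Hypothetical discrete data: number of items bought by five customers.
purchases = [2, 3, 5, 1, 4]

print(mean(purchases))             # 3.0
print(statistics.mean(purchases))  # cross-check with the standard library
```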

To calculate the median, sort the values and take the middle value. To illustrate, let x denote the following numbers:

x = [23, 40, 6, 74, 38, 1, 70]

In this case, you would need to sort the list first and then take the midpoint to find the median.

sorted_x = [1, 6, 23, 38, 40, 70, 74]

In this case, the median is 38, since it divides sorted_x into three values that are smaller than 38 and three values that are larger. If there is an even number of values, the average of the two middle values is taken as the median.

The advantage of the median over the mean is that the median is less susceptible to outliers. So, in situations where outliers are likely to be present in the data set, it is wiser to use the median instead of the mean. For example, to understand the typical income in a country, the median is used because the rich may be extremely rich, which would skew the average and paint a different picture from what most people actually experience. In that case, the median would probably give better insight.
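
The steps above can be sketched in Python as follows. The list x is the one from this post; the income figures are hypothetical and exist only to show how a single outlier moves the mean far more than the median.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


x = [23, 40, 6, 74, 38, 1, 70]
print(median(x))          # 38
print(median(x + [100]))  # 39.0, the average of 38 and 40

# Hypothetical incomes (in thousands) to illustrate the effect of one outlier.
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [5000]

print(sum(incomes) / len(incomes), median(incomes))                 # 35.0 35
print(sum(with_outlier) / len(with_outlier), median(with_outlier))  # 862.5 36.5
```

Adding one very large income pushes the mean from 35.0 to 862.5, while the median only moves from 35 to 36.5.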

Mode represents the most common value in a data set. It is most useful when you need to understand clustering or the number of ‘hits’. For example, a retailer may want to know the mode of the sizes purchased in order to set stocking levels optimally. Say, store A has a mode of ‘small’ while store B has a mode of ‘XXL’.
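
As a rough sketch, the mode can be found in Python by counting occurrences with collections.Counter. The purchase lists for the two stores are hypothetical, made up so that the modes match the example above.

```python
from collections import Counter


def mode(values):
    """Most common value; if several values tie, the one seen first is returned."""
    counts = Counter(values)
    if not counts:
        raise ValueError("mode of an empty data set is undefined")
    value, _count = counts.most_common(1)[0]
    return value


# Hypothetical sizes purchased at two stores.
store_a = ["small", "medium", "small", "large", "small"]
store_b = ["XXL", "large", "XXL", "XXL", "medium"]

print(mode(store_a))  # small
print(mode(store_b))  # XXL
```

Unlike the mean and the median, the mode works on categorical data such as clothing sizes, because it only needs to count values and never has to add or order them.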

In this post, we looked at some of the basic statistics that you will encounter while looking at data for machine learning. There are more concepts that you need to know, so head over to my next post.

 

Joydeep Bhattacharjee