Linear regression is one of the simplest and most widely used methods of machine learning. In this post, we will take an in-depth look at machine learning through linear regression. The code shown is in Python, but if you are an R user, we would highly appreciate it if you shared the equivalent code in the comments below.

Linear regression is a supervised machine learning technique. In supervised learning, the right answers for the input variables are given. It is also a regression problem because the output variable is continuous. Before moving on, I highly recommend reading my introductory post on machine learning and coming back to this post afterwards, as it should help in understanding the concepts better. We will assume that you have already read that post and build on the concepts discussed there.

Let’s take an example dataset. We will be considering the diabetes dataset that comes preloaded with scikit-learn. In this post, we will consider the essential mathematics and then do a custom implementation in Python. We will also see how to implement this using the scikit-learn library.

#### Dataset and Linear Regression using scikit-learn

We will take a sample dataset and get a feel for making some real predictions using the scikit-learn library. Once we do that, we should have some idea of what can be achieved using linear regression. We can then look at the implementation, and the math concepts should come to us faster. So, first we import the libraries we need and load the diabetes dataset.
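As a minimal sketch of this loading step, assuming scikit-learn is installed (the variable names `X` and `y` are chosen here for illustration):

```python
from sklearn import datasets

# Load the diabetes dataset that is bundled with scikit-learn.
diabetes = datasets.load_diabetes()

X = diabetes.data    # feature matrix, one row per sample
y = diabetes.target  # continuous target value for each sample

print(X.shape)  # (442, 10)
print(y.shape)  # (442,)
```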

Let’s take a look at the dataset. It consists of 442 samples with 10 features, which scikit-learn ships already mean-centred and scaled. To keep the example simple, we will reduce the dimensions and take only the third feature.
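One possible way to inspect the features and keep only the third one is sketched below. The slice `2:3` (rather than the plain index `2`) is a choice made here so that `X` stays two-dimensional, which scikit-learn estimators expect.

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Names of the 10 feature columns.
print(diabetes.feature_names)

# Keep only the third feature (column index 2) as a single-column matrix.
X = diabetes.data[:, 2:3]
y = diabetes.target

print(X.shape)  # (442, 1)
```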

We can then split the dataset into training and testing sets. Once done, we create a linear regression object and fit it to the training set. This enables us to make predictions on the testing set.
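A sketch of these steps might look as follows. The 80/20 split and `random_state=0` are assumptions made for this illustration, not values taken from the original code.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X = diabetes.data[:, 2:3]  # third feature only
y = diabetes.target

# Hold out 20% of the samples for testing (illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the model and fit it to the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the target for the unseen test samples.
y_pred = model.predict(X_test)

print("intercept:", model.intercept_)
print("slope:", model.coef_[0])
```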

We can then compare how our predictions fared against the real `y_test`, and also plot the real values and the prediction line together so that we can see the result.
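The comparison could be sketched like this, assuming matplotlib is available. The metrics chosen (mean squared error and R²), the split settings and the output file name `diabetes_fit.png` are all illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X = diabetes.data[:, 2:3]
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Numerical comparison of predictions against the real values.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("mean squared error:", mse)
print("R^2 score:", r2)

# Real values as points, predictions as a line.
order = X_test[:, 0].argsort()
plt.scatter(X_test[:, 0], y_test, color="black", label="real values")
plt.plot(X_test[order, 0], y_pred[order], color="blue", label="prediction")
plt.xlabel("third feature")
plt.ylabel("target")
plt.legend()
plt.savefig("diabetes_fit.png")
plt.close()
```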

#### The Analysis

In linear regression, we have the training set and the hypothesis. We already have the training set from above, and our hypothesis will be:

$$h_\theta(x) = \theta_0 + \theta_1 x \tag{1}$$

which, as you can note, is a linear function. Here the θ’s are called the parameters of the function.

Please note that different values of the θ’s give different lines. In code, we will call `θ0` `a` and `θ1` `b` for clarity, so two different choices of `a` and `b` produce two different lines.
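To make the role of the parameters concrete, here is a small sketch of a linear hypothesis evaluated for two parameter choices. The function name `hypothesis` and the sample values are made up for illustration.

```python
import numpy as np

def hypothesis(x, a, b):
    """Linear hypothesis h(x) = a + b * x, where a is theta0 and b is theta1."""
    return a + b * x

x = np.array([0.0, 1.0, 2.0, 3.0])

# Two different parameter choices give two different lines.
line_1 = hypothesis(x, a=1.0, b=2.0)
line_2 = hypothesis(x, a=0.0, b=0.5)

print(line_1.tolist())  # [1.0, 3.0, 5.0, 7.0]
print(line_2.tolist())  # [0.0, 0.5, 1.0, 1.5]
```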

The goal of the machine learning exercise is to find the values of these θ’s so that the function `h` shown above is close to `y` for the training examples. In mathematical terms, we want to minimize the squared difference between `h(x)` and the corresponding value of `y`. More precisely, we characterize this difference by a cost function: a function that assigns a cost to instances where the model deviates from the observed data. In this case, our cost is the sum of squared errors. The goal of any supervised learning exercise is to minimize whatever cost we choose. Our cost function can be written as:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \tag{2}$$

Here `m` is the number of samples, which in our previous example is `442`. The factor of 1/2 is a common convention that simplifies the derivative and does not change where the minimum lies. In terms of the above equation, our goal is to minimize `J`.
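The cost can be sketched in a few lines of NumPy. This assumes the common convention of dividing the sum of squared errors by `2m`; the function name `cost` and the toy data are illustrative.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Sum of squared errors between h(x) and y, divided by 2m."""
    m = len(x)
    errors = (theta0 + theta1 * x) - y
    return np.sum(errors ** 2) / (2 * m)

# Toy data lying exactly on the line y = 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(cost(0.0, 2.0, x, y))  # 0.0, a perfect fit has zero cost
print(cost(0.0, 1.0, x, y))  # errors are -1, -2, -3, so the cost is 14 / 6
```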

Before we look at how to minimize `J`, let’s look at the shape of the function `J`. In machine learning we start with the dataset, so x and y are known and will not change. Hence, we can consider them to be constants in the equation. The real variables in our equation are `θ0` and `θ1`. Since `J` is quadratic in `θ0` and `θ1`, its surface is a bowl-shaped paraboloid with a single minimum at the bottom.
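One way to get a feel for this shape is to evaluate `J` on a grid of parameter values. The sketch below uses made-up data generated from `θ0 = 1` and `θ1 = 2`, and assumes the sum-of-squared-errors cost divided by `2m`; the lowest grid value should then sit at those parameters, with the cost rising in every direction away from them.

```python
import numpy as np

# Synthetic data generated from theta0 = 1, theta1 = 2 (illustrative only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

def cost(theta0, theta1):
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * len(x))

# Grid of candidate parameter values with a step of 0.1.
theta0_values = np.linspace(-3.0, 5.0, 81)
theta1_values = np.linspace(-2.0, 6.0, 81)

# J[i, j] is the cost at (theta0_values[i], theta1_values[j]).
J = np.array([[cost(t0, t1) for t1 in theta1_values] for t0 in theta0_values])

# Location of the lowest point on the grid.
i, j = np.unravel_index(np.argmin(J), J.shape)
print("minimum near theta0 =", theta0_values[i], "theta1 =", theta1_values[j])

# Moving away from the minimum in theta0 makes the cost grow.
print(J[i, j], J[i + 5, j], J[i + 10, j])
```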

#### Gradient Descent

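So you can see that we need to move towards the bottom, where the value of `J` is the lowest. One way to do that is the gradient descent algorithm. As per the algorithm, we repeat the following update until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{for } j = 0, 1 \tag{3}$$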

Here `α` is the learning rate, and we multiply it with the derivative, or gradient, of `J`. The gradient of a function is given by its derivative. For example, if `f(x) = x²` then the gradient of `f(x)` is given by the derivative of `f(x)`, which is `2x`. Hence, to find the gradient at a particular point, we evaluate the derivative at that point. For example, to find the gradient of `f(x)` when `x` is `1`, we evaluate the derivative of `f(x)` at that point, which in this case is `2 · 1 = 2`. Since there are two variables, `θ0` and `θ1`, we take partial derivatives, and hence the step shown in equation 3 is a two-step process.

Now one question may arise: why is the sign before the learning rate negative? To understand the answer, let’s take a look at the above paraboloid plot. In places where the value of `θ` is less than the `θ` corresponding to the minimum value of `J`, the gradient (point B) is negative, and hence the new value of `θ` after running equation 3 will be more than the old `θ`. Similarly, in places where the value of `θ` is more than the `θ` corresponding to the minimum value of `J`, the gradient is positive (point C), and hence the new value of `θ` will be less than the old `θ`. Thus, the values of `θ` progressively move towards where `J` is minimum.

The value of `α` is a hyperparameter and needs to be fixed by the user. It should be small enough for the updates to converge, but not so small that convergence takes needlessly many iterations. Too large a value can overshoot the minimum and diverge.
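The behaviour described above can be tried out on the one-variable example `f(x) = x²`. The sketch below starts on either side of the minimum at `x = 0` and also tries a very small and a too-large learning rate; the function names and the specific values of `α` are picked for illustration.

```python
def gradient(x):
    """Derivative of f(x) = x**2."""
    return 2 * x

def gradient_descent(x_start, alpha, steps):
    """Repeat the update x := x - alpha * f'(x) a fixed number of times."""
    x = x_start
    for _ in range(steps):
        x = x - alpha * gradient(x)
    return x

# Left of the minimum the gradient is negative, so x increases towards 0.
print(gradient_descent(-4.0, alpha=0.1, steps=50))

# Right of the minimum the gradient is positive, so x decreases towards 0.
print(gradient_descent(4.0, alpha=0.1, steps=50))

# A very small learning rate has barely moved after the same number of steps.
print(gradient_descent(4.0, alpha=0.001, steps=50))

# A learning rate that is too large overshoots further on every step.
print(gradient_descent(4.0, alpha=1.1, steps=50))
```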

#### Applying gradient descent to linear regression

Now the final step is to find the derivative of the cost function and apply it to linear regression. The linear regression model is given by equation 1 and equation 2 above, and we want to find the minimum of `J` as given in equation 2. The gradient descent algorithm needs the partial derivatives of `J` with respect to the two variables `θ0` and `θ1`. With `J` defined as the sum of squared errors divided by `2m`, they are:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

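Hence, plugging these into equation 3, we get the modified gradient descent algorithm as follows (repeat until convergence):

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$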

And this is the gradient descent algorithm for running linear regression over some training data. An interesting thing to note is that each step of the process runs through all the training data and updates `θ0` and `θ1` together, and hence this algorithm is called the **batch** gradient descent algorithm.
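Putting the pieces together, a minimal sketch of batch gradient descent for one-feature linear regression could look like this. It assumes the cost is the sum of squared errors divided by `2m`, so each gradient is an average over all `m` samples. The function name, the synthetic data, the learning rate and the iteration count are all assumptions for this illustration, and the result is checked against NumPy's least-squares fit `np.polyfit`.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors over the whole training set: this is the "batch" part.
        errors = theta0 + theta1 * x - y
        grad0 = errors.sum() / m
        grad1 = (errors * x).sum() / m
        # Update both parameters together, using the same errors.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Synthetic data: y = 3 + 2x plus a little noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.1, size=200)

theta0, theta1 = batch_gradient_descent(x, y)

# Closed-form least-squares fit for comparison (returns slope first).
slope, intercept = np.polyfit(x, y, 1)

print("gradient descent:", theta0, theta1)
print("least squares:   ", intercept, slope)
```

On data with a very different feature scale, the learning rate and the number of iterations would likely need to be adjusted for the updates to converge.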

As a final step, to test your understanding, create a custom implementation of this machine learning model in any language of your choice and post the links below.

Thanks for reading and let me know what algorithms you would want to be covered in a future post. If you are interested in talking more on this, just drop me a message @alt227Joydeep. I would be glad to discuss this further.