Logistic regression, despite its name, is a classification technique and not a regression technique. It is called regression because the method is quite similar to linear regression, which I discussed in this post. Have a look at that post, as it will help you understand the concepts, and keep the equations mentioned there handy. The term “logistic” comes from the logistic (sigmoid) function used in this method of classification; its inverse is the logit function.
In this post we will look at the math behind logistic regression, code a custom implementation of the model, and see how to implement it using scikit-learn.
What is Classification?
Classification is what you need when you want the output to fall into discrete buckets. For example, you have an email and you want to classify it as spam or not spam. Or you have a lot of pictures of dogs and cats and you want a model that can sort the pictures into dogs and cats and label them appropriately. In more serious science, maybe you have scans of tumors and you want to identify them as malignant or benign. All of these examples are classification problems, and logistic regression can be employed for them.
Problem with using regression in classification
Now you may say, “OK, but why should I take your word for it? I already know one machine learning technique, which is linear regression. I will use that and be done with it.” To answer that question, let’s use the principle of negation: assume linear regression is good enough for classification and see where that assumption breaks. Say we are trying to fit a straight line through a sample dataset.
In the above example, malignant tumors get 1 and non-malignant tumors get 0, and we are trying to fit the green line. When making predictions, we say that if the value on the line lies above 0.5 on the y axis (malignant?), the tumor is malignant; otherwise we say it’s benign. We are happy with our predictions and we go home.
But wait, the doctor (our customer) comes back to us and shows us the following problem in the model. He ran the model on his dataset, the model was trained a bit differently, and it is giving erroneous results. You look at the dataset and the resultant green line and you see this.
The new data is valid, because tumors that are very large are also malignant. But those extra points flatten the fitted line, so some malignant tumors now fall below 0.5 on it, and our rule of “malignant if y > 0.5” no longer works.
We cannot change the hypothesis every time a new dataset comes. That would defeat the whole purpose. In technical terms, our model does not generalize. We have to find a better way of defining our model. Fortunately, we have Logistic Regression, which is Regression analysis but with a twist.
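To see the problem in numbers, here is a minimal sketch using NumPy. The tumor sizes and labels are made up for illustration (they are not the doctor's data), and `boundary_from_line` is a hypothetical helper name. It fits a least-squares line and reports the size at which that line crosses 0.5.

```python
import numpy as np

def boundary_from_line(sizes, labels):
    """Fit labels ~ slope * size + intercept by least squares and
    return the tumor size at which the fitted line crosses 0.5."""
    slope, intercept = np.polyfit(sizes, labels, 1)
    return (0.5 - intercept) / slope

# Made-up data: small tumors are benign (0), larger ones malignant (1).
sizes = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
first_boundary = boundary_from_line(sizes, labels)

# Second dataset: the same tumors plus two very large malignant ones.
sizes_2 = np.append(sizes, [30.0, 40.0])
labels_2 = np.append(labels, [1.0, 1.0])
second_boundary = boundary_from_line(sizes_2, labels_2)

print(first_boundary, second_boundary)
```

On the first dataset the line crosses 0.5 at size 4.5, exactly between the two groups. With the two large tumors added, the crossing point moves above 5, so the malignant tumor of size 5 is now labeled benign even though nothing about it changed.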
Logistic Regression Model
In the logistic regression model, we want our classifier to output values that are between 0 and 1. So we are going to come up with a hypothesis that has this limiting property.
Linear regression gives us the below form to be used for fitting our model.
For logistic regression the above equation is modified a bit to yield:
where the value of g is the sigmoid function as follows:
Therefore the function h becomes:
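For reference, the four forms described above can be written out as follows. This uses the common textbook notation, which is an assumption on my part: θ is the parameter vector and x is the feature vector with a leading 1 for the intercept.

```latex
\begin{align*}
h_\theta(x) &= \theta^{T} x
  && \text{(linear regression)} \\
h_\theta(x) &= g(\theta^{T} x)
  && \text{(logistic regression)} \\
g(z) &= \frac{1}{1 + e^{-z}}
  && \text{(sigmoid)} \\
h_\theta(x) &= \frac{1}{1 + e^{-\theta^{T} x}}
  && \text{(hypothesis after substituting } z = \theta^{T} x\text{)}
\end{align*}
```

Since e^{-z} is always positive, g(z) always lies strictly between 0 and 1, which is the limiting property we asked for.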
Now, in case you are wondering why this function is chosen for g and why we could not have taken some other function, the answer is that the logistic function gives you the simplicity of the linear regression methodology without its disadvantages. Its output always lies between 0 and 1, so it can be read as a probability, and the independent variables don’t have to be normally distributed or have equal variance in each group. The function owes its shape to the properties of the number e, and I would highly recommend that you look at this video to gain a little insight into the beauty of e. Hence, for classification where the outcome is binary, it is a natural choice.
The heart of logistic regression is the sigmoid function, so that is what we will define first.
The hypothesis (h) starts from the same linear form as in linear regression, as in equation 2. The only difference is that instead of using z directly we take the sigmoid of z, as in equation 4.
Now we need to define the error function. The error function should punish larger errors more heavily, so that following its gradient pushes the parameters towards the minimum value. This is done in the Cost_function, where we compare the actual and predicted values through a logarithm. Here Y can take values of 0 or 1. Keep in mind that the logarithm of x is 0 at x = 1 and approaches negative infinity as x approaches 0. If Y is 1, we take the logarithm of the hypothesis, and if Y is 0, we take the logarithm of 1 minus the hypothesis; negating the result gives a non-negative error. Notice that if h is 1 or close to 1 when Y is 0, the error becomes very large, so a huge penalty is introduced. Then we just take the sum of the errors and say this is our cumulative error.
Now we need to take the derivative of the cost function with respect to each parameter. This is the basis of the gradient descent algorithm: we repeatedly step against the derivative until it approaches 0, which means that the cost_function is pushed towards its minimum value and the hypothesis approaches the true value, which is Y.
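A minimal NumPy sketch of the sigmoid could look like this. The function name and the use of NumPy are my assumptions; any implementation of 1 / (1 + e^(-z)) will do.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied elementwise.

    Accepts a scalar or an array and maps every value into (0, 1).
    """
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Quick look at the shape: 0 maps to 0.5, large positive values
# approach 1 and large negative values approach 0.
print(sigmoid(np.array([-5.0, 0.0, 5.0])))
```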
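As a sketch, the hypothesis can be written in a couple of lines. I am assuming here that X is a 2-D NumPy array with one row per sample and a leading column of ones for the intercept, and that theta is a 1-D parameter vector; the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def hypothesis(theta, X):
    """Predicted probability of class 1 for every row of X.

    z = X @ theta is the same linear combination used in linear
    regression; the sigmoid squashes it into (0, 1).
    """
    return sigmoid(X @ theta)

# Two samples, each with an intercept term and one feature.
X = np.array([[1.0, 2.0],
              [1.0, -3.0]])
theta = np.zeros(2)
print(hypothesis(theta, X))  # all-zero parameters give 0.5 for every row
```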
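The description above corresponds to the usual log loss. Here is a hedged sketch of it, where I average over the samples instead of summing (a scaling choice that does not move the minimum) and clip the predictions with a small `eps` so that log(0) never occurs. Both choices and the function names are my assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def cost(theta, X, y, eps=1e-12):
    """Average log loss of the hypothesis sigmoid(X @ theta).

    For y = 1 the per-sample error is -log(h); for y = 0 it is
    -log(1 - h). Clipping keeps h away from exactly 0 or 1.
    """
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# One sample with true label 0 and a single feature equal to 1.
X = np.array([[1.0]])
y = np.array([0.0])
print(cost(np.array([-4.0]), X, y))  # h close to 0: small error
print(cost(np.array([4.0]), X, y))   # h close to 1: large error
```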
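For the averaged log loss, the derivative works out to X^T (h - Y) / m, where m is the number of samples. The sketch below assumes that form of the cost, with illustrative function names, and checks the formula against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def cost(theta, X, y, eps=1e-12):
    """Average log loss of the hypothesis sigmoid(X @ theta)."""
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient(theta, X, y):
    """Derivative of the average log loss with respect to theta."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

def numerical_gradient(theta, X, y, step=1e-6):
    """Central finite differences, used only to check `gradient`."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        shift = np.zeros_like(theta)
        shift[i] = step
        grad[i] = (cost(theta + shift, X, y)
                   - cost(theta - shift, X, y)) / (2 * step)
    return grad

X = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 1.0])
theta = np.array([0.1, -0.2])
print(gradient(theta, X, y))
print(numerical_gradient(theta, X, y))
```

The two printed vectors should agree to several decimal places, which is a cheap way to catch sign or indexing mistakes in a hand-derived gradient.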
Putting all the above ideas together, we write a high-level logistic regression function that takes X and Y as input, along with some hyperparameters. The logistic regression function shown here runs for a fixed number of iterations for faster execution, but you can easily change the logic to stop once the cost falls below some threshold.
You can find all the code and the Jupyter notebook shown here at this link. Shoutout to perborgen, whose work has been the inspiration for this post. I would also urge readers to take a look at this alternate implementation of logistic regression from first principles, which computes the Hessian, by Siraj. The code for that is here.
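Here is a minimal sketch of such a function using plain batch gradient descent. The names `alpha`, `num_iters`, `tol`, and `predict` are my own choices, not taken from the notebook, and the tiny dataset is made up. I center the single feature before training, since plain gradient descent tends to behave better on centered inputs.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def cost(theta, X, y, eps=1e-12):
    """Average log loss of the hypothesis sigmoid(X @ theta)."""
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient(theta, X, y):
    """Derivative of the average log loss with respect to theta."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def logistic_regression(X, y, alpha=0.1, num_iters=1000, tol=None):
    """Fit theta by batch gradient descent.

    Runs for `num_iters` steps of size `alpha`. If `tol` is given,
    training stops early once the cost drops below it.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        if tol is not None and cost(theta, X, y) < tol:
            break
        theta -= alpha * gradient(theta, X, y)
    return theta

def predict(theta, X):
    """Label a sample 1 when its predicted probability is at least 0.5."""
    return (sigmoid(X @ theta) >= 0.5).astype(float)

# Made-up tumor sizes: small ones benign (0), larger ones malignant (1).
sizes = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
# Intercept column plus the centered size feature.
X = np.column_stack([np.ones_like(sizes), sizes - sizes.mean()])

theta = logistic_regression(X, y, alpha=0.1, num_iters=500)
print(theta, cost(theta, X, y), predict(theta, X))
```

Passing something like `tol=0.1` switches the stopping rule from a fixed iteration count to a cost threshold, with `num_iters` acting as an upper bound.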