If you are not familiar with logistic regression, feel free to check out Understanding Logistic regression
Logistic regression is very similar to linear models such as linear regression, but with one big difference: logistic regression uses the log(odds) on the y-axis. In logistic regression, the fitted curve constrains the response variable y to the range between 0 and 1.
If you are not familiar with linear regression, feel free to check out Linear Regression in a Nutshell
As we know, in linear regression we find the best-fitting line by starting with some data and fitting a line to it using least squares: we measure the residuals (the distances between the data points and the line), square them so that negative values do not cancel out positive values, and add them up.
Then we rotate the line a little and do the same. The line with the smallest sum of squared residuals is chosen as the best fit.
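As a quick illustration of this least-squares idea (the data values below are made up for the example), NumPy can fit the line and compute the sum of squared residuals directly:

```python
import numpy as np

# Hypothetical data points, roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a line y = a*x + b by minimizing the sum of squared residuals
a, b = np.polyfit(x, y, deg=1)

# Residuals: distances between the data and the fitted line
residuals = y - (a * x + b)
sse = np.sum(residuals ** 2)  # sum of squared residuals
```

Rotating the line (changing `a` and `b`) in any direction away from this fit can only increase `sse`.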
Why can’t we make use of least-squares to find the best fitting line in logistic regression?
Well, to answer this we need to recall logistic regression. Our goal in logistic regression is to draw the best-fitting S-curve for the given data points. In logistic regression we transform the y-axis from probabilities to log(odds). The problem is that this transformation pushes the data points out to positive and negative infinity.
So we can’t use least squares to find the best-fitting line, because the residuals are also equal to positive and negative infinity. Instead of least squares, we use maximum likelihood to find the best-fitting line in logistic regression.
In maximum likelihood estimation, a probability distribution for the target variable (class label) is assumed, and a likelihood function is defined that calculates the probability of observing the outcome given the input data and the model. This function can then be optimized to find the set of parameters that results in the largest likelihood over the training dataset.
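To make the idea concrete, here is a small sketch (the labels and predicted probabilities are made up) of how the likelihood of a set of observed labels is computed from a model's predicted probabilities:

```python
import numpy as np

# Hypothetical 0/1 labels and the model's predicted probabilities of class 1
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.8, 0.7, 0.1])

# Likelihood of the observed labels: product over samples of
# p for a '1' label and (1 - p) for a '0' label
likelihood = np.prod(np.where(y == 1, p, 1 - p))

# In practice we work with the log-likelihood, which turns the product into a sum
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

A better set of parameters is simply one whose predicted probabilities make this likelihood larger.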
To apply maximum likelihood estimation, the first thing we do is project the data points onto a candidate line. This gives each data point a log(odds) value.
We then transform these log(odds) values to probabilities using the formula

p = e^(log(odds)) / (1 + e^(log(odds)))

Derivation of the above formula: as we already know,

log(p / (1 − p)) = log(odds)

Exponentiating both sides gives

p / (1 − p) = e^(log(odds))

and solving for p yields the formula above.
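This log(odds)-to-probability transformation is exactly the sigmoid function; a minimal NumPy sketch:

```python
import numpy as np

def log_odds_to_prob(log_odds):
    # p = e^(log odds) / (1 + e^(log odds)), i.e. the sigmoid function
    return np.exp(log_odds) / (1 + np.exp(log_odds))
```

A log(odds) of 0 maps to a probability of 0.5, and the transformation is invertible: taking log(p / (1 − p)) recovers the original log(odds).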
Once we calculate the probabilities, we plot them to get an S-curve. We then keep rotating the log(odds) line, projecting the data points onto it, transforming the projections to probabilities, and calculating the log-likelihood. We repeat this process until the log-likelihood is maximized.
The algorithm that finds the line with the maximum likelihood is pretty smart: each time it rotates the line, it does so in a way that increases the log-likelihood. Thus, the algorithm can find the optimal fit after only a few rotations.
Estimation of Log-likelihood function
As explained, logistic regression uses maximum likelihood for parameter estimation. The logistic regression equation is given as

p(x) = e^(βᵀx) / (1 + e^(βᵀx))
The parameters to be estimated in the equation of a logistic regression are the components of the β vector.
To estimate the β vector, consider N samples with labels either 0 or 1.
Mathematically, for samples labeled as ‘1’, we try to estimate β such that the product of all the probabilities p(x) is as close to 1 as possible. And for samples labeled as ‘0’, we try to estimate β such that each probability p(x) is as close to 0 as possible; in other words, the product of all (1 − p(x)) should be as close to 1 as possible.
The above intuition is represented as

∏ᵢ: yᵢ=1 p(xᵢ) · ∏ᵢ: yᵢ=0 (1 − p(xᵢ))

where xᵢ represents the feature vector for the iᵗʰ sample.
On combining the above conditions we want to find β parameters such that the product of both of these products is maximum over all elements of the dataset.
This function is the one we need to optimize and is called the likelihood function.
Now, we combine the products and take the log of the likelihood to simplify it further:

ℓ(β) = Σᵢ [ yᵢ log p(xᵢ) + (1 − yᵢ) log(1 − p(xᵢ)) ]

Substituting p(x) with its exponential form and simplifying, we end up with the final form of the log-likelihood function, which is the one to be optimized:

ℓ(β) = Σᵢ [ yᵢ βᵀxᵢ − log(1 + e^(βᵀxᵢ)) ]
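The final form of the log-likelihood translates directly into NumPy; in the sketch below (names are illustrative), `X` holds one feature vector per row and `y` holds the 0/1 labels:

```python
import numpy as np

def log_likelihood(beta, X, y):
    # ℓ(β) = Σᵢ [ yᵢ·(βᵀxᵢ) − log(1 + e^(βᵀxᵢ)) ]
    z = X @ beta
    return np.sum(y * z - np.log(1 + np.exp(z)))
```

This agrees term by term with the probability form Σᵢ [ yᵢ log p(xᵢ) + (1 − yᵢ) log(1 − p(xᵢ)) ], and is always negative, since each per-sample probability is below 1.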
Maximizing Log-likelihood function
The goal here is to find the value of β that maximizes the log-likelihood function. There are many methods to do so like
- Fixed-point iteration
- Bisection method
- Newton-Raphson method
- Muller’s method
In this article, we will be using the Newton-Raphson method to estimate the β vector. The Newton-Raphson update is given as

βᵗ⁺¹ = βᵗ − H⁻¹ ∇ℓ(βᵗ)

where ∇ℓ is the gradient and H is the Hessian of the log-likelihood.
Let’s determine the gradient first. To do so, we take the first-order derivative of our log-likelihood function:

∂ℓ/∂β = Σᵢ xᵢ ( yᵢ − e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ)) )

Now, replacing the exponential term with the probability p(xᵢ):

∂ℓ/∂β = Σᵢ xᵢ ( yᵢ − p(xᵢ) )

The matrix representation of the gradient is

∇ℓ = Xᵀ (y − p)
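In matrix form the gradient is a one-liner; the sketch below (assuming `X` has one sample per row) can be verified against a finite-difference approximation of the log-likelihood:

```python
import numpy as np

def gradient(beta, X, y):
    # ∇ℓ = Xᵀ(y − p), where p are the predicted probabilities
    p = 1 / (1 + np.exp(-(X @ beta)))
    return X.T @ (y - p)
```

Checking the analytic gradient against a numerical one is a standard sanity test before plugging it into an optimizer.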
We are done with the numerator term of Newton-Raphson. Now we will calculate the denominator, i.e. the second-order derivative, which is also called the Hessian matrix.
To do so, we take the derivative of the gradient:

H = ∂/∂βᵀ Σᵢ xᵢ ( yᵢ − p(xᵢ) )

Replacing the probability with its equivalent exponential term and computing the derivative gives

H = −Σᵢ xᵢ xᵢᵀ · e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ))²

Resubstituting the exponential term as a probability:

H = −Σᵢ xᵢ xᵢᵀ p(xᵢ)(1 − p(xᵢ))

The matrix representation of the Hessian matrix is

H = −Xᵀ W X, where W = diag( p(xᵢ)(1 − p(xᵢ)) )
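A minimal sketch of the Hessian in this matrix form; since the diagonal weights p(xᵢ)(1 − p(xᵢ)) are positive, the result is symmetric and, for a full-rank X, negative definite, which is what makes the log-likelihood concave:

```python
import numpy as np

def hessian(beta, X):
    # H = −XᵀWX with W = diag(p(xᵢ)(1 − p(xᵢ)))
    p = 1 / (1 + np.exp(-(X @ beta)))
    W = np.diag(p * (1 - p))
    return -X.T @ W @ X
```

Negative definiteness guarantees that each Newton-Raphson step moves toward the unique maximum of the log-likelihood.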
As we have calculated the gradient and the Hessian matrix, plugging these two terms into the Newton-Raphson equation gives the final form

βᵗ⁺¹ = βᵗ + (Xᵀ W X)⁻¹ Xᵀ (y − p)

Now, we execute this update for t iterations until the value of β converges.
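Putting the pieces together, the whole iteration can be sketched as a short NumPy routine. This is a minimal illustration, not a production implementation; real solvers add regularization and safeguards for perfectly separable data, where β diverges to infinity:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    # Repeated Newton-Raphson updates: β ← β + (XᵀWX)⁻¹ Xᵀ(y − p)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                       # current probabilities
        W = np.diag(p * (1 - p))                    # weight matrix
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:              # β has converged
            break
    return beta
```

On a small, non-separable dataset this typically converges in a handful of iterations.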
Once the coefficients have been estimated, we can plug in the values of a feature vector x to estimate the probability of it belonging to a specific class.
We then choose a threshold: probabilities above it are assigned to class 1, and those below it to class 0.
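For example, with hypothetical estimated coefficients (the numbers below are made up for illustration), classifying a new sample looks like this:

```python
import numpy as np

# Hypothetical estimated coefficients (intercept first) and a new sample
beta_hat = np.array([-1.0, 2.0])
x_new = np.array([1.0, 0.8])  # leading 1 for the intercept term

log_odds = x_new @ beta_hat
p = 1 / (1 + np.exp(-log_odds))   # probability of belonging to class 1
predicted_class = int(p >= 0.5)   # threshold at 0.5
```

The 0.5 threshold is only the default choice; applications with asymmetric costs of false positives and false negatives often move it.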
Maximum likelihood estimation (MLE) is a method of estimating the parameters of a logistic regression model, and it is one of the most widely used. The method of maximum likelihood selects the set of model parameter values that maximizes the likelihood function.
The likelihood function is the probability of observing the values of the dependent variable given the observed values of the independent variables and the model. The likelihood varies from 0 to 1.
The MLE is the value that maximizes the probability of the observed data, and it is an example of a point estimate because it gives a single value for the unknown parameter.