Maximum Likelihood Estimation in Logistic Regression

Arun Addagatla
7 min read · Apr 26, 2021

If you are not familiar with logistic regression, feel free to check out Understanding Logistic regression

Logistic regression is very similar to regular old linear models like linear regression, but with one big difference: logistic regression works with the log(odds) on the y-axis. In logistic regression, we fit a curve that keeps the predicted response variable y between 0 and 1.

If you are not familiar with linear regression, feel free to check out Linear Regression in a Nutshell

As we know, in linear regression we find the best-fitting line by starting with some data and fitting a line to it using least squares, i.e., we measure the residuals (the distances between the data points and the line), square them so that negative values do not cancel out positive ones, and add them up.

Then we rotate the line a little bit and do the same. The line with the smallest sum of squared residuals is chosen as the best fit.

Why can’t we make use of least squares to find the best-fitting line in logistic regression?

Well, to answer this we need to recall logistic regression. Our goal in logistic regression is to draw the best-fitting S-curve for the given data points, and to do so we transform the y-axis from probabilities to log(odds). The problem is that this transformation pushes the data points out to positive and negative infinity: a sample with probability 1 maps to log(odds) = +∞, and a sample with probability 0 maps to log(odds) = −∞.

So we can’t use least squares to find the best-fitting line, because the residuals of those points are themselves infinite, and so is their sum of squares. Instead of least squares, we make use of maximum likelihood to find the best-fitting line in logistic regression.

In maximum likelihood estimation, a probability distribution for the target variable (class label) is assumed, and a likelihood function is defined that calculates the probability of observing the outcomes given the input data and the model. This function can then be optimized to find the set of parameters that yields the largest likelihood over the training dataset.

To apply maximum likelihood estimation, the first thing we do is project the data points onto a candidate line. This gives each data point a log(odds) value.

We then transform these log(odds) values back to probabilities using the formula
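p = e^(log(odds)) / (1 + e^(log(odds)))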

Derivation of the above formula

We already know that
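log( p / (1 − p) ) = log(odds)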

Exponentiating both sides gives
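p / (1 − p) = e^(log(odds))

Solving for p then recovers the formula above:

p = e^(log(odds)) · (1 − p)  ⇒  p = e^(log(odds)) / (1 + e^(log(odds)))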

Once we calculate the probabilities, we plot them to get an S-curve. Then we just keep rotating the log(odds) line, projecting the data points onto it, transforming them into probabilities, and calculating the log-likelihood. We repeat this process until the log-likelihood is maximized.

The algorithm that finds the line with the maximum likelihood is pretty smart: each time it rotates the line, it does so in a way that increases the log-likelihood. Thus, the algorithm can find the optimal fit after only a few rotations.
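To make this search concrete, here is a minimal NumPy sketch of one scoring step; the toy data and the two candidate lines are illustrative choices of mine, not from the original article. Each candidate log(odds) line is scored by its log-likelihood, and the higher score wins.

```python
import numpy as np

def log_likelihood(beta0, beta1, x, y):
    """Score a candidate log(odds) line by its log-likelihood."""
    log_odds = beta0 + beta1 * x          # project each point onto the line
    p = 1.0 / (1.0 + np.exp(-log_odds))   # transform log(odds) to probabilities
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # feature values
y = np.array([0, 0, 1, 0, 1, 1])              # observed class labels

print(log_likelihood(-3.0, 1.0, x, y))  # approx. -2.62
print(log_likelihood(-7.0, 2.0, x, y))  # approx. -2.74: lower, so a worse fit
```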

Estimation of Log-likelihood function

As explained, logistic regression uses maximum likelihood for parameter estimation. The logistic regression equation is given as
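p(x) = e^(βᵀx) / (1 + e^(βᵀx))

where βᵀx = β₀ + β₁x₁ + … + βₖxₖ.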

The parameters to be estimated in the logistic regression equation are the components of the β vector.

To estimate the β vector, consider N samples whose labels yᵢ are either 0 or 1.

Mathematically, for samples labeled ‘1’ we try to estimate β such that the product of all the probabilities p(xᵢ) is as close to 1 as possible. And for samples labeled ‘0’ we try to estimate β such that each probability is as close to 0 as possible; in other words, the product of all the (1 − p(xᵢ)) terms should be as close to 1 as possible.

The above intuition is represented as
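∏_{i: yᵢ = 1} p(xᵢ) → as close to 1 as possible

∏_{i: yᵢ = 0} (1 − p(xᵢ)) → as close to 1 as possible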

xᵢ represents the feature vector for the iᵗʰ sample.

On combining the above conditions, we want to find the β parameters such that the product of these two products is maximized over all elements of the dataset:
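L(β) = ∏_{i: yᵢ = 1} p(xᵢ) × ∏_{i: yᵢ = 0} (1 − p(xᵢ))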

This function is the one we need to optimize and is called the likelihood function.

Now we combine the two products into one and take the log-likelihood to simplify it further.
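Because each yᵢ is either 0 or 1, the two products collapse into a single product over all N samples, and taking the log turns the product into a sum:

L(β) = ∏ᵢ p(xᵢ)^yᵢ · (1 − p(xᵢ))^(1 − yᵢ)

ℓ(β) = Σᵢ [ yᵢ log p(xᵢ) + (1 − yᵢ) log(1 − p(xᵢ)) ]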

Let’s substitute p(x) with its exponential form:
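ℓ(β) = Σᵢ [ yᵢ log( e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ)) ) + (1 − yᵢ) log( 1 / (1 + e^(βᵀxᵢ)) ) ]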

Simplifying, we end up with the final form of the log-likelihood function, which is the one to be optimized:
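ℓ(β) = Σᵢ [ yᵢ βᵀxᵢ − log(1 + e^(βᵀxᵢ)) ]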

Maximizing Log-likelihood function

The goal here is to find the value of β that maximizes the log-likelihood function. There are many methods to do so, such as:

  • Fixed-point iteration
  • Bisection method
  • Newton-Raphson method
  • Muller’s method

In this article, we will be using the Newton-Raphson method to estimate the β vector. The Newton-Raphson update is given as
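βₜ₊₁ = βₜ − H⁻¹ ∇ℓ(βₜ)

where ∇ℓ(βₜ) is the gradient (first derivative) and H is the Hessian matrix (second derivative) of the log-likelihood, evaluated at the current estimate βₜ.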

Let’s determine the gradient first by taking the first-order derivative of our log-likelihood function:
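∂ℓ/∂β = Σᵢ [ yᵢ xᵢ − xᵢ · e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ)) ]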

Now we replace the exponential term with the probability p(xᵢ):
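∇ℓ(β) = Σᵢ xᵢ ( yᵢ − p(xᵢ) )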

The matrix representation of the gradient is
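∇ℓ(β) = Xᵀ (y − p)

where X is the matrix of feature vectors (one row per sample), y is the vector of labels, and p is the vector of predicted probabilities p(xᵢ).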

We are done with the numerator term of the Newton-Raphson update. Now we calculate the denominator, i.e., the second-order derivative, which is also called the Hessian matrix.

To do so, we take the derivative of the gradient:
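H = ∂²ℓ / ∂β∂βᵀ = −Σᵢ xᵢ · ∂p(xᵢ)/∂βᵀ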

Now we replace the probability with its equivalent exponential term and compute the derivative:
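∂/∂βᵀ [ e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ)) ] = xᵢᵀ · e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ))²

H = −Σᵢ xᵢ xᵢᵀ · e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ))²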

Re-substituting the exponential term as a probability gives
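H = −Σᵢ xᵢ xᵢᵀ p(xᵢ) (1 − p(xᵢ))

since e^(βᵀxᵢ) / (1 + e^(βᵀxᵢ))² = p(xᵢ)(1 − p(xᵢ)).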

The matrix representation of the Hessian is
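H = −Xᵀ W X

where W is the N×N diagonal matrix with entries Wᵢᵢ = p(xᵢ)(1 − p(xᵢ)).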

Having calculated the gradient and the Hessian matrix, we plug these two terms into the Newton-Raphson equation to get the final form:
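βₜ₊₁ = βₜ + (Xᵀ W X)⁻¹ Xᵀ (y − p)

(the minus sign in front of the Hessian cancels the minus sign in the Newton-Raphson update).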

Now we run this update for as many iterations as needed, until the value of β converges.

Once the coefficients have been estimated, we can plug in the values of a feature vector x to estimate the probability that it belongs to a specific class.

We then choose a threshold value: samples with a predicted probability above it are assigned to class 1, and those below it to class 0.
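Putting it all together, here is a minimal NumPy sketch of the procedure; the function name, the toy data, and the 0.5 threshold are illustrative choices of mine, assuming the design matrix X carries a leading column of ones for the intercept:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Maximum likelihood for logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # p(x) for every sample
        W = np.diag(p * (1 - p))              # diagonal Hessian weights
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
        beta = beta + step                    # beta <- beta + (X'WX)^-1 X'(y - p)
        if np.max(np.abs(step)) < tol:        # stop once beta has converged
            break
    return beta

# Toy data: a leading column of ones supplies the intercept term.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

beta = fit_logistic_newton(X, y)
probs = 1.0 / (1.0 + np.exp(-X @ beta))   # estimated P(class 1) for each sample
labels = (probs >= 0.5).astype(int)       # threshold at 0.5
```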

Conclusion

Maximum likelihood estimation (MLE) is a method of estimating the parameters of a logistic regression model, and one of the most widely used estimation methods. The method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function.

The likelihood function is the probability of observing the actual values of the dependent variable given the observed values of the independent variables. The likelihood varies from 0 to 1.

The MLE is the value that maximizes the probability of the observed data. It is an example of a point estimate because it gives a single value for the unknown parameter.

Thanks for reading this article! Leave a comment below if you have any questions. Be sure to follow @ArunAddagatla to get notified about the latest Data Science and Deep Learning articles.

You can connect with me on LinkedIn, Github, Kaggle, or by visiting Medium.com.
