Logistic regression is a model that applies the logistic function to the output of a linear regression model. Because the output of the logistic function is restricted to the range $[0, 1]$, we can interpret the output as a probability. We can then use Maximum Likelihood Estimation (MLE) to estimate the model parameters $\theta$.

The logistic function is defined as

$$
h_{\theta}(x) = g(z) = \frac{1}{1 + e^{-z}},
$$
where $z = \theta ^\top x$, $x \in \mathbb{R}^d$, $\theta \in \mathbb{R}^d$.
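
As a quick illustration, here is a minimal NumPy sketch of these two definitions (the names `sigmoid` and `hypothesis` are just illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x); X can be a single sample (d,) or a batch (n, d)."""
    return sigmoid(X @ theta)
```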

Consider a binary classification problem. We interpret the output $h_{\theta}(x)$ as the probability of the positive class: $\mathbb{P}[y=1 | x; \theta] = h_{\theta}(x)$, and thus $\mathbb{P}[y=0 | x; \theta] = 1 - h_{\theta}(x)$. We can write both cases as a single conditional probability distribution function (a probability mass function here):

$$
p(y | x; \theta) = h_{\theta}(x)^{y} \left(1 - h_{\theta}(x)\right)^{1 - y}
$$
This formula is for a single sample $x$. For a dataset $\mathbf{X}, \mathbf{y}$ whose samples are drawn independently from an identical distribution (i.i.d.), the corresponding distribution function is the product of the pmf/pdf of each of the $n$ samples:

$$
p(\mathbf{y} | \mathbf{X}; \theta) = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; \theta)
$$
The term $ p(\mathbf{y}|\mathbf{X}; \theta) $ here is the conditional likelihood, since it is conditioned on $ \mathbf{X} $. We can derive this likelihood $L(\theta)$ by substituting in the pmf of each sample:

$$
L(\theta) = p(\mathbf{y} | \mathbf{X}; \theta) = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; \theta) = \prod_{i=1}^{n} h_{\theta}(x^{(i)})^{y^{(i)}} \left(1 - h_{\theta}(x^{(i)})\right)^{1 - y^{(i)}}
$$
The MLE procedure is to adjust $\theta$ to maximize the likelihood function. The general idea is to take the derivative of the likelihood function and then perform gradient ascent. Since taking the gradient of a product is difficult (a very long chain rule), we instead maximize the log-likelihood function, which turns the big product into a big sum:

$$
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_{\theta}(x^{(i)})\right) \right]
$$
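Building on the `sigmoid`/`hypothesis` sketch above, the log-likelihood could be computed like this (the clipping constant is my own guard against `log(0)`, not part of the derivation):

```python
def log_likelihood(theta, X, y, eps=1e-12):
    """Sum over samples of y*log(h) + (1 - y)*log(1 - h)."""
    h = np.clip(hypothesis(theta, X), eps, 1.0 - eps)  # avoid log(0)
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```
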
We can compute the gradient for a single sample $(x^{(i)}, y^{(i)})$ and then sum the contributions. Following the Matrix Cookbook, we use $\frac{\partial a^\top b}{\partial a} = b$, together with $g'(z) = g(z)\left(1 - g(z)\right)$:

$$
\frac{\partial \ell^{(i)}}{\partial \theta} = \left( \frac{y^{(i)}}{h_{\theta}(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_{\theta}(x^{(i)})} \right) h_{\theta}(x^{(i)}) \left(1 - h_{\theta}(x^{(i)})\right) x^{(i)} = \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x^{(i)}
$$
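A quick way to sanity-check that the gradient really is $\left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x^{(i)}$ is to compare the analytic gradient (summed over a batch, i.e. $X^\top (\mathbf{y} - h)$) against finite differences of the log-likelihood sketch above; the data below is random and purely illustrative:

```python
def gradient(theta, X, y):
    """Analytic gradient of the log-likelihood: sum_i (y_i - h_i) x_i = X^T (y - h)."""
    return X.T @ (y - hypothesis(theta, X))

# Compare against a central finite-difference estimate on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
theta = rng.normal(size=3)

eps = 1e-6
numeric = np.array([
    (log_likelihood(theta + eps * e, X, y) - log_likelihood(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(gradient(theta, X, y), numeric, atol=1e-5))  # expect True
```
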
Thus, the gradient ascent update for $\theta$ is as follows, where $\alpha$ is the learning rate:

$$
\theta := \theta + \alpha \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x^{(i)}
$$
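Putting the pieces together, a minimal batch gradient-ascent loop might look like the sketch below (the learning rate and iteration count are arbitrary choices, not tuned values):

```python
def fit(X, y, lr=0.1, n_iters=1000):
    """Fit theta by batch gradient ascent on the log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta + lr * gradient(theta, X, y)  # ascend: move along the gradient
    return theta
```
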
It is also possible to derive an update rule from the squared loss function $(h_{\theta}(x) - y)^2$, but the derivatives are not as clean. And I have seen people argue that its computational efficiency is not as good.
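
For comparison, differentiating the squared loss for a single sample (using the same identity $g'(z) = g(z)\left(1 - g(z)\right)$) leaves an extra $h_{\theta}(x)\left(1 - h_{\theta}(x)\right)$ factor, which is why the result is considered less clean:

$$
\frac{\partial}{\partial \theta} \left( h_{\theta}(x) - y \right)^2 = 2 \left( h_{\theta}(x) - y \right) h_{\theta}(x) \left( 1 - h_{\theta}(x) \right) x
$$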