Logistic regression

In words...

Despite its name, logistic regression is a linear classification method. Its name refers to the fact that this method models the log-odds, i.e., the logarithm of ratios of posterior probabilities, as linear functions of the input vector that have to be regressed, i.e., estimated.

A logistic regression model is trained by application of the maximum likelihood inference principle.

In pictures...

Why linear approximation works

Here is a plot illustrating why approximating the odds by exponential functions, i.e., assumming the log-odds to be linear, works even when this assumption is clearly violated.

This binary classification problem is linearly separable, since the Bayes classifier is a linear classifier simply applying a threshold on the one-dimensional input $x$.
This is the case because the true log-odd, $$\log \frac{P(Y=+1\ |\ X=x) }{ P(Y = -1\ |\ X=x)} ,$$ plottted with a green solid line intersects the $x$-axis at a single location (and thus is positive on one side and negative on the other).

Training a logistic regression model

The plot below shows the classification function of the logistic regression model trained by maximizing its likelihood with gradient ascent.

This model has been trained with data randomly drawn according to a uniform distribution of $X$ and the posterior class-membership probabilities shown in the plot above.

In maths...

The model

Logistic regression is a classification method based on the optimal decision rule used by the Bayes classifier: $$ f(\g x) = \arg\max_{k\in \Y} P(Y = k \ |\ X=\g x) = \arg\max_{k\in \Y} \log P(Y = k \ |\ X=\g x) $$ (where taking the log does not change the argument of the maximum but only the maximum itself).
While the Bayes classifier is expressed in terms of the true posterior probabilities $P(Y = k \ |\ X=\g x)$ induced by the unknown distribution $P$, logistic regression uses estimates of these probabilities.

More precisely, in logistic regression we assume that the posterior probabilities can be modeled via linear functions as $$ \log \frac{P(Y = k \ |\ X=\g x)}{P(Y = Q \ |\ X=\g x)} = \g w_k^T \g x + b_k, \quad k=1,\dots,Q-1 , $$ for a problem with $Q$ categories. In this model, we arbitrarily chose the last category (of label $Q$) as the denominator of the odds, i.e, the ratios $\frac{P(Y = k \ |\ X=\g x)}{P(Y = Q \ |\ X=\g x)}$. Doing so we already notice that only $Q-1$ linear models are involved in the global model (the $Q$th of the log-odds could actually be interpreted as a constant linear model equal to 0).

For this model to be reasonable, we must ensure that the probabilities sum to one: $$ \sum_{k=1}^Q P(Y = k \ |\ X=\g x) = 1 . $$ This constraint actually removes a degree of freedom in the model, in the sense that knowing the values of all the probabilities but one directly yields the remaining one. This is why we can consider only $Q-1$ linear functions that model the log-odds with respect to the probability of a category left aside.
The posterior probability for this $Q$th category is given by the other probabilities $$ P(Y = k \ |\ X=\g x) = e^{ \g w_k^T \g x + b_k} P(Y=Q\ |\ X=\g x),\quad k=1,\dots,Q-1 $$ and the constraint \begin{align*} & \sum_{k=1}^Q P(Y = k \ |\ X=\g x) = P(Y = Q \ |\ X=\g x) + \sum_{k=1}^{Q-1} P(Y = k \ |\ X=\g x) = P(Y=Q\ |\ X=\g x)\left(1 + \sum_{k=1}^{Q-1} e^{ \g w_k^T \g x + b_k} \right) = 1 \\ \Rightarrow\quad & P(Y=Q\ |\ X=\g x) = \frac{1}{1 + \sum_{k=1}^{Q-1} e^{ \g w_k^T \g x + b_k} } . \end{align*} This calculation also leads to the expression $$ P(Y=k\ |\ X=\g x) = \frac{e^{ \g w_k^T \g x + b_k}}{1 + \sum_{l=1}^{Q-1} e^{ \g w_l^T \g x + b_l} }, \quad k=1,\dots,Q-1, $$ involving only the model parameters for the other probabilities.

Binary classification

For binary classification, $Q=2$ and thus there is a single linear model $$ \log \frac{P(Y = +1 \ |\ X=\g x)}{P(Y = -1 \ |\ X=\g x)} = \g w^T \g x + b $$ to estimate. Note that here we use the typical set of labels for binary classification, $\Y=\{-1,+1\}$.
Moreover, the final classification function $f$ can be expressed as $$ f(\g x) = \sign\left( \g w^T \g x + b \right) $$ since $$ f(\g x) = \sign\left( \log \frac{P(Y = +1 \ |\ X=\g x)}{P(Y = -1 \ |\ X=\g x)} \right) = \begin{cases} +1,& \mbox{if } P(Y = +1 \ |\ X=\g x) \geq P(Y = -1 \ |\ X=\g x)\\ -1, &\mbox{otherwise}. \end{cases} $$ In this case, we also have the following expression of the posterior probability $$ P(Y = -1 \ |\ X=\g x) = 1 - P(Y = +1 \ |\ X=\g x) = \frac{1}{1 + e^{\g w^T \g x + b}} . $$

Training via maximum likelihood inference

For the sake of clarity, we here restrain the discussion to binary classification, where the calculations drastically simplify.

Training the logistic regression model on a training set $\{(\g x_i, y_i)\}_{i=1}^N$ amounts to applying the maximum likelihood inference principle, i.e., to maximizing the log-likelihood of the model $$ \sum_{i=1}^N \log P(Y=y_i, X=\g x_i) = \sum_{i=1}^N \log P(Y=y_i\ |\ X=\g x_i) + \log P(X=\g x_i) . $$ Removing the constant terms that do not influence the maximization leads to maximize $$ \sum_{i=1}^N \log P(Y=y_i\ |\ X=\g x_i) . $$ This objective function can be rewritten in terms of the linear model parameters $\g w$ and $b$ by noting that \begin{align*} \log P(Y=+1\ |\ X=\g x_i)& = \log \frac{P(Y=+1\ |\ X=\g x_i)}{P(Y=-1\ |\ X=\g x_i)}P(Y=-1\ |\ X=\g x_i) \\ &= \log \frac{P(Y=+1\ |\ X=\g x_i)}{P(Y=-1\ |\ X=\g x_i)} + \log P(Y=-1\ |\ X=\g x_i) \\ & = \g w^T\g x_i + b + \log P(Y=-1\ |\ X=\g x_i) \end{align*} and $$ \log P(Y=-1\ |\ X=\g x_i) = \log \frac{1}{1 + e^{\g w^T \g x + b}} = -\log (1 + e^{\g w^T \g x + b}) . $$ This yields the optimization problem $$ (\hat{\g w}, \hat{b}) = \arg\max_{\g w\in\R^d, b\in\R} \ L(\g w, b) = \sum_{y_i = +1} (\g w^T\g x_i +b) -\sum_{i=1}^N \log (1 + e^{\g w^T \g x + b}) $$ The objective function here is concave, so we can use a local optimization method to obtain the solution. Since there are no constraints, the most simple of such methods is the gradient ascent.

Gradient ascent

The derivatives of the objective function are $$ \drond{L(\g w, b)}{\g w} = \sum_{y_i = +1} \g x_i - \sum_{i=1}^N \g x_i \frac{e^{\g w^T \g x_i + b}}{ 1 + e^{\g w^T \g x_i + b}} = \sum_{y_i = +1} \g x_i - \sum_{i=1}^N P(Y=+1\ |\ X=\g x_i) \g x_i $$ and \begin{align*} \drond{L(\g w, b)}{b} &= \sum_{y_i = +1} 1 - \sum_{i=1}^N\frac{e^{\g w^T \g x_i + b}}{ 1 + e^{\g w^T \g x_i + b}} = \sum_{y_i = +1} 1 - \sum_{i=1}^N P(Y=+1\ |\ X=\g x_i) \end{align*} .........