Linear discriminant analysis (LDA) is a straightforward method applying the generative approach to classification. It is based on the assumption that each class can be modeled by a Gaussian distribution and that all the classes share the same covariance matrix.

These assumptions make LDA a linear classification method. If they happen to hold for the true distribution of the data, then LDA is optimal in the sense that it converges to the Bayes classifier as the number of data points tends to infinity (at which point the parameter estimates coincide with the true distribution parameters).

In practice, LDA requires little computation: estimating the classifier parameters amounts to computing class proportions and sample means, plus a single matrix inversion.

Quadratic discriminant analysis (QDA) is similar to LDA but without the assumption that the classes share the same covariance matrix, i.e., each class has its own covariance matrix. In this case, the boundary between classes is a quadratic surface instead of a hyperplane.
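To make the contrast with LDA concrete, here is a minimal sketch in Python of a per-class QDA score (the function name `qda_score` is illustrative, not from the original page). Because each class keeps its own covariance matrix, the $-\frac{1}{2}\log|\g\Sigma_y|$ term and the quadratic term in $\g x$ no longer cancel between classes, which is what makes the boundary quadratic.

```python
import numpy as np

def qda_score(x, mu, Sigma, prior):
    """Log of p(x | y) P(y) for one class, with a class-specific Sigma.

    Unlike in LDA, the -1/2 log|Sigma| term and the quadratic term in x
    differ across classes, so pairwise boundaries are quadratic surfaces.
    """
    d = len(mu)
    diff = x - mu
    return (-0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * diff @ np.linalg.inv(Sigma) @ diff
            + np.log(prior))
```

Classification then picks the label with the largest score; with a shared `Sigma`, this reduces to the linear LDA rule discussed below.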

With knowledge of the true covariance matrix and true means, we get the Bayes classifier of parameters $\g w$ and $b$. With $N$ data points, the estimated covariance matrix and means lead to the LDA classifier of parameters $\g w$ and $b$. Try to observe the convergence of LDA towards the Bayes classifier by changing the number of data points.
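This convergence can also be checked numerically. The sketch below (a Python/NumPy experiment under assumed true parameters `mu_pos`, `mu_neg`, `Sigma` chosen for illustration, with equal class priors so the $\log\frac{N_1}{N_{-1}}$ term vanishes) compares the LDA estimate of $\g w$ with the Bayes value as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# True class distributions assumed for this experiment (equal priors)
mu_pos, mu_neg = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

# Bayes classifier weight vector, computed from the true parameters
w_bayes = np.linalg.inv(Sigma) @ (mu_pos - mu_neg)

errs = []
for N in (100, 10000):
    X_pos = rng.multivariate_normal(mu_pos, Sigma, N // 2)
    X_neg = rng.multivariate_normal(mu_neg, Sigma, N // 2)
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Pooled covariance estimate with the N - |Y| = N - 2 normalization
    D_pos, D_neg = X_pos - m_pos, X_neg - m_neg
    S = (D_pos.T @ D_pos + D_neg.T @ D_neg) / (N - 2)
    w_lda = np.linalg.inv(S) @ (m_pos - m_neg)
    errs.append(np.linalg.norm(w_lda - w_bayes))
    print(N, errs[-1])
```

The printed error $\|\g w_{\mathrm{LDA}} - \g w_{\mathrm{Bayes}}\|$ shrinks as $N$ increases, in line with the optimality statement above.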

Linear discriminant analysis follows the generative approach while assuming that the data of each class is generated by a Gaussian distribution of probability density function $$ p_{X|Y = y} (\g x\ |\ Y = y) = \frac{1}{(2\pi)^{d/2} |\g\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2} ( \g x - \g \mu_y)^T \g \Sigma_y^{-1}( \g x - \g \mu_y)\right), $$ and that the covariance matrix $\g \Sigma_y$ is the same across all the labels: $$ \forall y\in\Y,\quad \g\Sigma_y = \g \Sigma. $$
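As a sketch of this density (in Python/NumPy; the helper name `gaussian_pdf` is ours, not from the original page), the formula above translates directly to:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x | y), as in the formula above."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm
```

For instance, in one dimension with $\g\mu = 0$ and $\g\Sigma = 1$ this recovers the standard normal density $1/\sqrt{2\pi}$ at the origin.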

The parameters are estimated as follows. The prior probabilities are simply the fractions of data points of each category: $$ \forall y\in\Y,\quad P(Y=y) = \frac{N_y}{N}, \quad \mbox{with } N_y = \sum_{i=1}^N \I{y_i = y}. $$ The means of the Gaussians are estimated by the sample means: $$ \forall y\in\Y,\quad \g \mu_y = \frac{1} {N_y} \sum_{y_i = y} \g x_i $$ and the covariance matrix by $$ \g\Sigma = \frac{1}{N-|\Y|} \sum_{y\in\Y} \sum_{y_i = y} (\g x_i - \g\mu_y)(\g x_i - \g\mu_y)^T . $$ This formula comes from the weighted average of the covariance matrix estimates within each class: $$ \g\Sigma = \frac{\sum_{y\in\Y} (N_y-1) \g\Sigma_y}{\sum_{y\in\Y} (N_y-1)}, $$ where $N = \sum_{y\in\Y} N_y$ and $$ \g\Sigma_y = \frac{1}{N_y-1} \sum_{y_i = y} (\g x_i - \g\mu_y)(\g x_i - \g\mu_y)^T . $$
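These estimation formulas can be sketched as follows (a Python/NumPy illustration; the function name `lda_fit` and the dictionary-based return values are our choices, not part of the original text):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate LDA parameters: priors N_y/N, class means, pooled covariance."""
    N, d = X.shape
    labels = np.unique(y)
    priors = {c: np.mean(y == c) for c in labels}
    means = {c: X[y == c].mean(axis=0) for c in labels}
    # Pooled covariance with the N - |Y| normalization from the text
    S = np.zeros((d, d))
    for c in labels:
        diff = X[y == c] - means[c]
        S += diff.T @ diff
    S /= N - len(labels)
    return priors, means, S
```

Note that the pooled estimate sums the scatter matrices of all classes before dividing by $N-|\Y|$, which is exactly the weighted average of the per-class estimates given above.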

Then, the classification of new data points is obtained by $$ f(\g x) = \arg\max_{y\in\Y} p_{X | Y=y} (\g x\ |\ Y = y) P(Y=y) . $$
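A minimal sketch of this decision rule (Python/NumPy; `lda_predict` is an illustrative name) maximizes the log of $p_{X|Y=y}(\g x\,|\,Y=y)P(Y=y)$, dropping the terms that are constant across labels since $\g\Sigma$ is shared:

```python
import numpy as np

def lda_predict(x, priors, means, Sigma):
    """Classify x by maximizing p(x | y) P(y) over the labels."""
    Sigma_inv = np.linalg.inv(Sigma)
    def log_joint(c):
        diff = x - means[c]
        # log p(x|y) + log P(y), up to terms constant in y
        return -0.5 * diff @ Sigma_inv @ diff + np.log(priors[c])
    return max(priors, key=log_joint)
```

With equal priors this reduces to picking the class whose mean is closest to $\g x$ in the Mahalanobis distance induced by $\g\Sigma$.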

To see why LDA is a linear classification method, consider first the case of binary classification with $\Y=\{-1,+1\}$. Then, the classification function $f$ can be written as $$ f(\g x) = \begin{cases} 1 , & \mbox{if } p_{X | Y= 1} (\g x\ |\ Y = 1) P(Y=1) \geq p_{X | Y= -1} (\g x\ |\ Y = -1) P(Y=-1) \\ -1, & \mbox{otherwise.} \end{cases} $$ The condition for predicting $f(\g x) = +1$ is equivalent to $$ \frac{p_{X | Y= 1} (\g x\ |\ Y = 1) P(Y=1) }{ p_{X | Y= -1} (\g x\ |\ Y = -1) P(Y=-1)} \geq 1, $$ or, after taking the logarithm, $$ \log \frac{p_{X | Y= 1} (\g x\ |\ Y = 1) P(Y=1) }{ p_{X | Y= -1} (\g x\ |\ Y = -1) P(Y=-1)} \geq 0 . $$ Computing this log-ratio with the LDA assumptions gives \begin{align} \log \frac{p_{X | Y= 1} (\g x\ |\ Y = 1) P(Y=1) }{ p_{X | Y= -1} (\g x\ |\ Y = -1) P(Y=-1)} &= \log \frac{p_{X | Y= 1} (\g x\ |\ Y = 1) }{ p_{X | Y= -1} (\g x\ |\ Y = -1) } + \log \frac{ P(Y=1)}{ P(Y=-1)} \\ &= -\frac{1}{2} ( \g x - \g \mu_1)^T \g \Sigma^{-1}( \g x - \g \mu_1) + \frac{1}{2}( \g x - \g \mu_{-1})^T \g \Sigma^{-1}( \g x - \g \mu_{-1}) + \log \frac{N_1}{N_{-1}} \\ &= -\frac{1}{2} \g \mu_{1}^T \g \Sigma^{-1} \g \mu_{1} + \g x^T\g \Sigma^{-1}\g\mu_1 + \frac{1}{2} \g \mu_{-1}^T \g \Sigma^{-1} \g \mu_{-1} - \g x^T\g \Sigma^{-1}\g\mu_{-1} + \log \frac{N_1}{N_{-1}} \\ &= \frac{1}{2} (\g \mu_{-1} + \g \mu_1)^T \g \Sigma^{-1}(\g \mu_{-1} - \g \mu_1)+ \g x^T\g \Sigma^{-1} ( \g\mu_1 - \g\mu_{-1}) + \log \frac{N_1}{N_{-1}} \\ &= \g w^T \g x + b \end{align} with $$ \g w = \g \Sigma^{-1}(\g\mu_1 - \g\mu_{-1} ) $$ and $$ b = \frac{1}{2} (\g \mu_{-1} + \g \mu_1)^T \g \Sigma^{-1}(\g \mu_{-1} - \g \mu_1)+\log \frac{N_1}{N_{-1}} = \frac{-1}{2} (\g \mu_{-1} + \g \mu_1)^T \g w +\log \frac{N_1}{N_{-1}} . $$ Thus, we can write the classification function as that of a linear classifier, i.e., $$ f( \g x ) = \sign ( \g w^T \g x + b) . $$
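The algebra above can be verified numerically: $\g w^T \g x + b$ must coincide with the log-ratio computed directly from the two Gaussian exponents. A sketch in Python/NumPy (the helper name `lda_binary_params` and the test values are ours):

```python
import numpy as np

def lda_binary_params(mu_pos, mu_neg, Sigma, N_pos, N_neg):
    """Parameters w, b of the equivalent linear classifier sign(w^T x + b)."""
    w = np.linalg.inv(Sigma) @ (mu_pos - mu_neg)
    b = -0.5 * (mu_neg + mu_pos) @ w + np.log(N_pos / N_neg)
    return w, b

# Check at an arbitrary point: w^T x + b equals the log-ratio itself
mu_pos, mu_neg = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
w, b = lda_binary_params(mu_pos, mu_neg, Sigma, 30, 70)

x = np.array([0.3, -0.7])
log_ratio = (-0.5 * (x - mu_pos) @ Sigma_inv @ (x - mu_pos)
             + 0.5 * (x - mu_neg) @ Sigma_inv @ (x - mu_neg)
             + np.log(30 / 70))
```

Since the quadratic terms in $\g x$ cancel exactly, the two quantities agree for any $\g x$, not just on the decision boundary.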

For a multi-class problem, the same reasoning applies to the log-ratio of any pair of labels, and LDA divides the input space by cutting through it with several hyperplanes.
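Equivalently, each label can be given its own score that is linear in $\g x$, namely $\delta_y(\g x) = \g x^T\g\Sigma^{-1}\g\mu_y - \frac{1}{2}\g\mu_y^T\g\Sigma^{-1}\g\mu_y + \log P(Y=y)$, so every pairwise boundary $\delta_y = \delta_{y'}$ is a hyperplane. A sketch (Python/NumPy; `lda_multiclass_predict` is an illustrative name):

```python
import numpy as np

def lda_multiclass_predict(x, means, priors, Sigma):
    """Each class gets a score linear in x; pairwise boundaries are hyperplanes."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = {c: x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[c])
              for c, mu in means.items()}
    return max(scores, key=scores.get)
```

The prediction is the argmax of these linear scores, which matches the density-based rule given earlier up to terms constant across labels.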