The generative approach to classification is based on the factorization of the joint probability distribution of the inputs and labels as a product of a prior probability and a conditional probability.

More precisely, the prior probability refers to the probability of drawing an example with a particular label *a priori*, i.e., without observing its input pattern, while the conditional probability is the probability of drawing a particular input pattern in a given class, i.e., with a given label.

A generative learning algorithm starts by estimating these two terms. The classification of a new input is then computed with the same decision rule as the optimal Bayes classifier, i.e., the input pattern is assigned to the class with maximal posterior class-membership probability. The only difference is that the posterior probabilities are estimates rather than the true ones.

In particular, the posterior class-membership probabilities can be obtained by application of Bayes' rule to the estimates of the prior and class-conditional probabilities.

Two very popular generative methods are linear discriminant analysis and the naive Bayes classifier.

The plot below shows the true conditional probability density functions of the data as dashed lines and the estimated ones as solid lines (the prior probabilities are taken as equal here). A generative classifier outputs the predicted label corresponding to the maximal estimated pdf.

For a discrete random variable $X$ taking values in $\X$, the generative approach is based on the following factorization of the joint probability distribution of $(X,Y)$:
$$
P(X=\g x,\ Y=y) = P(X = \g x\ |\ Y = y) P(Y=y) ,
$$
where $P(X=\g x\ |\ Y=y)$ is the conditional probability of drawing a particular input pattern in a given class $y$ and $P(Y=y)$ is the **prior probability** of drawing an example of this class.
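This factorization can be checked numerically on a toy example. The joint distribution below is made up for illustration; marginalizing it gives the prior, dividing gives the conditional, and their product recovers the joint:

```python
# A numeric check of the factorization above, using a made-up
# joint distribution over X in {0, 1} and Y in {0, 1}.
joint = {(0, 0): 0.3, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.4}

# Prior P(Y=y): marginalize the joint over x.
prior = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

# Conditional P(X=x | Y=y): joint divided by prior.
cond = {(x, y): joint[(x, y)] / prior[y] for (x, y) in joint}

# The product of conditional and prior recovers the joint for every pair.
for (x, y), p in joint.items():
    assert abs(cond[(x, y)] * prior[y] - p) < 1e-12
```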

Using Bayes' rule this decomposition also provides access to the posterior class-membership probabilities which are useful for classification. Overall, the generative approach amounts to the following steps.

- Estimate the prior probabilities $P(Y=y)$ for all labels $y\in\Y$, e.g., via simple counts: $$ \hat{P}(Y=y) = \frac{1}{N}\sum_{i=1}^N \I{y_i = y}. $$
- Choose a family of distributions $\mathcal{P}$ for the input pattern and assume that $P(X=\g x\ |\ Y = y) \in \mathcal{P}$.
- Estimate the conditional distribution in all classes.
- Classify new input patterns with \begin{align*} f(\g x) &= \arg\max_{y\in\Y} P(Y=y\ |\ X =\g x) & (\mbox{optimal classification rule}^*)\\ & = \arg\max_{y\in\Y}\frac{P(X=\g x\ |\ Y = y) P(Y=y)}{P(X=\g x)} & (\mbox{due to Bayes' rule}) \\ & = \arg\max_{y\in\Y} P(X=\g x\ |\ Y = y) P(Y=y) & (\mbox{due to } P(X=\g x) \mbox{ being a constant wrt. } y ) \end{align*} $^*$: this classification rule would be optimal if we had access to the true probabilities.
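The steps above can be sketched on a toy discrete problem. Here the binary data and the chosen family (independent Bernoulli features within each class, i.e., the naive Bayes assumption) are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

# Toy training data: binary input patterns (rows) and their labels.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])
labels = np.unique(y)

# Step 1: estimate the prior probabilities by simple counts.
priors = {c: np.mean(y == c) for c in labels}

# Steps 2-3: assume features are independent Bernoulli variables within
# each class and estimate P(X_j = 1 | Y = c) by per-class feature means.
theta = {c: X[y == c].mean(axis=0) for c in labels}

def classify(x):
    """Step 4: assign x to the class maximizing P(X=x | Y=c) P(Y=c)."""
    def joint(c):
        cond = np.prod(theta[c] ** x * (1 - theta[c]) ** (1 - x))
        return cond * priors[c]
    return max(labels, key=joint)

print(classify(np.array([1, 0])))  # → 0
```

Note that the argmax is taken over the product of the estimated conditional and prior probabilities, which avoids computing the denominator $P(X=\g x)$.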

Depending on the choice of distribution family $\mathcal{P}$, we obtain different classifiers.

If the input is a real vector and $X$ is a continuous random variable, the generative approach is stated in terms of the conditional probability density functions $p_{X|Y=y} ( \g x\ |\ Y =y)$ instead of the conditional probabilities $P(X=\g x\ |\ Y = y)$. Specifically, we work from the decomposition of the mixed density:
$$
p_{X,Y}(\g x,\ Y=y) = p_{X|Y=y}( \g x\ |\ Y = y) P(Y=y) ,
$$
and assume $p_{X|Y=y}( \g x\ |\ Y = y) \in\mathcal{P}$ with $\mathcal{P}$ a family of densities.

A popular example of this case is linear discriminant analysis, which assumes Gaussian class-conditional densities with a covariance matrix shared across the classes.
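A minimal sketch of this continuous case follows, estimating class means and a pooled covariance in the spirit of linear discriminant analysis. The synthetic two-class Gaussian data are an illustrative assumption:

```python
import numpy as np

# Synthetic data: two Gaussian classes with unit covariance (assumed).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Estimate priors, per-class means, and the pooled (shared) covariance.
priors = {c: np.mean(y == c) for c in (0, 1)}
means = {c: X[y == c].mean(axis=0) for c in (0, 1)}
pooled = sum((X[y == c] - means[c]).T @ (X[y == c] - means[c])
             for c in (0, 1)) / (len(X) - 2)
pooled_inv = np.linalg.inv(pooled)

def classify(x):
    """Assign x to the class maximizing log p(x | y) + log P(Y=y)."""
    def score(c):
        d = x - means[c]
        # Terms constant across classes (normalization) are dropped.
        return -0.5 * d @ pooled_inv @ d + np.log(priors[c])
    return max((0, 1), key=score)

print(classify(np.array([-1.5, 0.3])))  # → 0, near the class-0 mean
```

Working with log-densities, as above, is numerically safer than multiplying small density values directly.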