The naive Bayes classifier

In words...

First of all, the naive Bayes classifier should not be confused with the (optimal) Bayes classifier: despite the similar name, it is in fact far from optimal.

The naive Bayes classifier is one of the most basic generative classifiers. It relies on a very crude assumption: the conditional independence of the features given the class label. This means that, within a given class, the classifier assumes that the value taken by a feature does not depend on the values of the other features.

This assumption is obviously wrong in most applications. For instance, in spam filtering, it amounts to considering that, within a spam message, the presence of the word "earn" does not depend on the presence of the word "money". But despite this very crude approximation, the naive Bayes classifier remains rather effective.
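As a rough illustration of this spam-filtering setting (and not the code behind the demo below), the following Python sketch trains such a classifier with scikit-learn's BernoulliNB on made-up word-presence data; the two words, the labels, and the choice of library are only assumptions for the example.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy emails encoded by the presence (1) or absence (0) of the words
# "earn" and "money"; y = 1 labels spam, y = 0 labels legitimate mail.
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
y = np.array([1, 1, 1, 0, 1, 0])

clf = BernoulliNB()                   # models each word independently within a class
clf.fit(X, y)
print(clf.predict([[1, 1]]))          # an email containing both words
print(clf.predict_proba([[1, 1]]))    # estimated class probabilities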

In pictures...

Demo application

This demo shows how the naive Bayes classifier can be used to filter spam in a mailbox.

In maths...

For a random variable $X$ taking values in $\X\subset\R^d$, the conditional independence assumption at the core of the naive Bayes classifier results in the following factorization of the conditional probability of $X$ given $Y$: $$ P(X=\g x\ |\ Y=y) = \prod_{j=1}^d P(X_j = x_j\ |\ Y = y) , $$ where $X_j$ is the $j$th component of $X$.
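To make the factorization concrete, here is a small numerical sketch in Python with assumed conditional probability tables for $d=3$ categorical features taking values in $\{0,1,2\}$ (all numbers are invented for the illustration).

import numpy as np

# cpt[y][j, v] plays the role of P(X_j = v | Y = y); each row sums to 1.
cpt = {
    0: np.array([[0.7, 0.2, 0.1],
                 [0.5, 0.3, 0.2],
                 [0.6, 0.3, 0.1]]),
    1: np.array([[0.1, 0.3, 0.6],
                 [0.2, 0.2, 0.6],
                 [0.3, 0.3, 0.4]]),
}

def cond_prob(x, y):
    # Product over the d features of P(X_j = x_j | Y = y).
    return np.prod([cpt[y][j, xj] for j, xj in enumerate(x)])

x = (2, 0, 1)
print(cond_prob(x, 0), cond_prob(x, 1))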

This assumption drastically eases the estimation of the generative model as shown in the example below.

Naive Bayes classifier with real inputs

For a continuous random variable $X$ with conditional density $p_X(\g x\ |\ Y=y)$ in class $y$, the factorization above reads $$ p_X(\g x\ |\ Y = y) = \prod_{j=1}^d p_{X_j}(x_j\ |\ Y = y) , $$ where $p_{X_j}(x_j\ |\ Y = y)$ is the conditional density of the $j$th input variable (or feature) $X_j$.

Assume now that we would like to fit a Gaussian distribution over $X$ in each class $y\in\Y$. Fitting a multi-dimensional Gaussian distribution in $\R^d$ requires estimating the $d$ components of the mean vector and the covariance matrix $\g \Sigma \in\R^{d\times d}$. But with the factorization above, this reduces to the estimation of $d$ one-dimensional densities $$ p_{X_j}(x_j\ |\ Y = y) = \frac{1}{\sqrt{2\pi\sigma_{yj}^2}}\exp\left( \frac{-(x_j - \mu_{yj})^2}{2\sigma_{yj}^2}\right),\quad j=1,\dots,d, $$ with $d$ means $$ \mu_{yj} = \frac{1}{N_y} \sum_{y_i = y} x_{ij} $$ and $d$ variances $$ \sigma_{yj}^2 = \frac{1}{N_y-1} \sum_{y_i = y} (x_{ij} - \mu_{yj})^2 $$ for each class $y$ having $N_y$ examples in the training set $\{(\g x_i, y_i)\}_{i=1}^N$.
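The following Python sketch (an illustration under the assumptions above, not a reference implementation) estimates these per-class means and variances from made-up data and classifies a new point by maximizing $P(Y=y)\prod_{j=1}^d p_{X_j}(x_j\ |\ Y=y)$, computed in log scale for numerical stability; the data, the uniform use of class frequencies as priors, and the function names are assumptions for the example.

import numpy as np

def fit_gaussian_nb(X, y):
    # Estimate mu_{yj}, sigma_{yj}^2 and P(Y = y) for each class y and feature j.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "mean": Xc.mean(axis=0),              # mu_{yj}
            "var": Xc.var(axis=0, ddof=1),        # sigma_{yj}^2 with the 1/(N_y - 1) factor
            "prior": Xc.shape[0] / X.shape[0],    # P(Y = y) estimated by the class frequency
        }
    return params

def predict(params, x):
    # Pick the class maximizing log P(Y = y) + sum_j log p_{X_j}(x_j | Y = y).
    best_class, best_log_post = None, -np.inf
    for c, p in params.items():
        log_lik = np.sum(-0.5 * np.log(2 * np.pi * p["var"])
                         - (x - p["mean"]) ** 2 / (2 * p["var"]))
        log_post = np.log(p["prior"]) + log_lik
        if log_post > best_log_post:
            best_class, best_log_post = c, log_post
    return best_class

# Toy 2-dimensional data with two classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, size=(50, 2)),
               rng.normal([3, 3], 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_nb(X, y)
print(predict(params, np.array([2.5, 2.8])))   # expected: class 1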