First, we need to find a vector representation of e-mails. For this demo, we restrict ourselves to a simple ``bag of words'' model in which we only take into account the presence of words in e-mails, not their number of occurrences nor their order in the text.
Specifically, we encode a text as a binary vector in which each component is associated with a word of a dictionary. In order to apply the naive Bayes classifier, we need to choose a probability distribution for these vectors. For binary vectors, the most common choice is the multivariate Bernoulli model: since the naive Bayes classifier assumes the components of the input to be independent, the vector reduces to a sequence of independent Bernoulli variables whose parameters can simply be estimated from the data as frequencies, i.e., by counting the number of e-mails of each category containing a specific word.
An algorithmic advantage of the method used here is that it can be trained online, meaning that the classifier is updated after each new labeled e-mail. This saves resources: there is no need to store all the e-mails, and the classifier can be improved without redoing all the computations.
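To make this concrete, here is a minimal Python sketch (not the actual code of this page; the class and method names are hypothetical) of such an online training scheme: the classifier only stores per-class counts, which are incremented for each new labeled e-mail.

    class OnlineCounts:
        """Sufficient statistics for the Bernoulli naive Bayes classifier, updated online."""
        def __init__(self, d):
            self.n = {"SPAM": 0, "HAM": 0}         # number of labeled e-mails of each class
            self.counts = {"SPAM": [0] * d,        # counts[c][j]: e-mails of class c containing word j
                           "HAM": [0] * d}

        def update(self, x, label):
            """One online training step for a labeled e-mail with binary feature vector x."""
            self.n[label] += 1
            for j, xj in enumerate(x):
                self.counts[label][j] += xj

        def frequency(self, label, j):
            """Estimated probability of word j appearing in an e-mail of the given class.
            (The Smoothing section below refines this raw frequency.)"""
            return self.counts[label][j] / self.n[label] if self.n[label] > 0 else 0.0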
In pictures...
[Interactive demo: label the example e-mails as SPAM or HAM (counters show the number of labeled SPAMs and HAMs); after labeling a number of e-mails, write a new e-mail and classify it. The page then displays the processed e-mail with words from the dictionary and others left aside, the corresponding feature vector $\g x$, the estimated probability of each word appearing in a SPAM or in a HAM, the prior probabilities of the two classes, the probability of an e-mail like this one under each class, and the resulting posterior probabilities.]
In maths...
Encoding e-mails as vectors
We use the ``bag of words'' encoding of e-mails in which, given a dictionary $D=(D_j)_{1\leq j\leq d}$ of words $D_j$, a piece of text $T$ is represented by a binary vector $\g x\in\{0,1\}^d$ such that
$$
x_j = \begin{cases}
1, & \mbox{if } T \mbox{ contains the word } D_j\\
0, & \mbox{otherwise.}
\end{cases}
$$
This rather simple model allows us to fix the dimension of the input for the classifier.
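As an illustration, a minimal Python sketch of this encoding could look as follows (the whitespace tokenization is a simplifying assumption; the actual page may process the text differently):

    def encode(text, dictionary):
        """Binary bag-of-words vector: x[j] = 1 if the j-th dictionary word appears in the text."""
        words = set(text.lower().split())     # crude tokenization by whitespace
        return [1 if w.lower() in words else 0 for w in dictionary]

    # Example: encode("Win a FREE prize now", ["free", "prize", "meeting"]) returns [1, 1, 0].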
Probabilistic models of e-mails
We assume the following Bernoulli model of SPAMs:
$$
P(X=\g x\ |\ Y=SPAM) = \prod_{j=1}^d ( b_j^{SPAM} )^{x_j} ( 1 - b_j^{SPAM} )^{(1-x_j)} ,
$$
where $d$ is the number of words in the dictionary and $b_j^{SPAM}$ is a parameter estimating the probability of the $j$th word appearing in an e-mail given that it is a SPAM, and a similar model of HAMs:
$$
P(X=\g x\ |\ Y=HAM) = \prod_{j=1}^d ( b_j^{HAM} )^{x_j} ( 1 - b_j^{HAM} )^{(1-x_j)} .
$$
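In code, these class-conditional probabilities can be evaluated directly from the formulas above (a Python sketch; b_spam and b_ham denote the estimated parameter vectors $b^{SPAM}$ and $b^{HAM}$):

    def likelihood(x, b):
        """P(X = x | Y) under the Bernoulli model with parameter vector b."""
        p = 1.0
        for xj, bj in zip(x, b):
            p *= bj ** xj * (1 - bj) ** (1 - xj)
        return p

    # The classifier compares likelihood(x, b_spam) and likelihood(x, b_ham),
    # weighted by the prior probability of each class (Bayes' rule).

This direct translation is only meant to illustrate the formulas; the implementation details below (smoothing, numbers below machine precision, speed-ups) refine it.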
Implementation details
There are a few implementation details to pay attention to in order for such applications to work in practice.
Smoothing
...
Numbers below machine precision
...
Online training
...
Speed-ups
In addition, the implementation can be made faster by using tests instead of powers to compute expressions like
$$
( b_j^{SPAM} )^{x_j} ( 1 - b_j^{SPAM} )^{(1-x_j)} =
\begin{cases}
b_j^{SPAM},& \mbox{if } x_j = 1\\
1-b_j^{SPAM},& \mbox{otherwise.}
\end{cases}
$$
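In the Python sketch above, this amounts to replacing the powers by a test (the function name is hypothetical):

    def likelihood_fast(x, b):
        """Same Bernoulli likelihood as above, selecting b_j or 1 - b_j with a test instead of powers."""
        p = 1.0
        for xj, bj in zip(x, b):
            p *= bj if xj == 1 else 1 - bj
        return p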
Notes
The algorithm does not know that we are trying to detect spam: we could just as well use this demo to discriminate between e-mails from your mother and those from a friend... let's try it!
You can check the implementation details discussed above by looking at the source code of this page.
The spam filter can be improved by using an extended/dedicated dictionary allowing the detection of embedded links, zip attachments, and so on.
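For example, such meta-features could be appended to the feature vector as extra dictionary entries (a hypothetical sketch; the detection rules are illustrative only):

    def extra_features(raw_email):
        """Binary meta-features, treated by the classifier exactly like word indicators."""
        text = raw_email.lower()
        return [
            1 if "http://" in text or "https://" in text else 0,   # embedded link
            1 if ".zip" in text else 0,                            # zip attachment name
        ]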