Spam-filtering with the naive Bayes classifier

In words...

First, we need a vector representation of e-mails. For this demo, we restrict ourselves to a simple ``bag of words'' model in which we only take into account the presence of words in an e-mail, not their number of occurrences nor their order in the text.

Specifically, we encode a text as a binary vector in which each component is associated with a word of a dictionary. In order to apply the naive Bayes classifier, we need to choose a probability distribution for these vectors. For binary vectors, the most common choice is the multivariate Bernoulli model: since the naive Bayes classifier assumes the components of the input to be independent, the distribution factorizes into a product of independent Bernoulli variables whose parameters can simply be estimated from the data as frequencies, i.e., by counting the number of e-mails of each category containing a specific word.
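For instance, the parameter associated with the $j$th word of the dictionary in the SPAM category is estimated by the frequency $$ \hat{b}_j^{SPAM} = \frac{\mbox{number of labeled SPAMs containing the } j\mbox{th word}}{\mbox{number of labeled SPAMs}} , $$ and similarly for the HAM category (these notations are made precise in the maths section below).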

An algorithmic advantage of the method used here is that it can be trained online, meaning that the classifier is updated after each new labeled e-mail. This saves resources: there is no need to store all the e-mails, and the classifier can be improved without redoing all the computations.

In pictures...

[Interactive demo: label example e-mails as SPAM or HAM; counters keep track of the number of labeled SPAMs and the number of labeled HAMs. After labeling a number of e-mails, you can write a new e-mail and classify it. The demo then shows the processed e-mail with words from the dictionary (others are left aside), and a table with the dictionary, the corresponding feature vector $\g x$, the probability of having each word in a SPAM, and the probability of having it in a HAM.]


In maths...

Encoding e-mails as vectors

We use the ``bag of words'' encoding of e-mails in which a piece of text $T$ is represented by a dictionary $D=(D_j)_{1\leq j\leq d}$ of words $D_j$ and a binary vector $\g x\in\{0,1\}^d$ such that $$ x_j = \begin{cases} 1, & \mbox{if } T \mbox{ contains the word } D_j\\ 0, & \mbox{otherwise.} \end{cases} $$ This rather simple model allows us to fix the dimension of the input to the classifier.
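As a minimal sketch of this encoding (in Python, with a small hypothetical dictionary and plain whitespace tokenization standing in for real text processing):

    # Bag-of-words encoding of a text as a binary vector (sketch).
    # The dictionary below is a hypothetical example, not the one used in the demo.
    D = ["money", "free", "meeting", "report", "offer"]

    def encode(text, dictionary):
        """Return the binary feature vector x of a piece of text T."""
        words = set(text.lower().split())  # presence only: counts and order are ignored
        return [1 if w in words else 0 for w in dictionary]

    x = encode("Get free money now", D)  # -> [1, 1, 0, 0, 1]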

Probabilistic models of e-mails

We assume the following multivariate Bernoulli model of SPAMs: $$ P(X=\g x\ |\ Y=SPAM) = \prod_{j=1}^d ( b_j^{SPAM} )^{x_j} ( 1 - b_j^{SPAM} )^{(1-x_j)} , $$ where $d$ is the number of words in the dictionary and $b_j^{SPAM}$ is a parameter estimating the probability that the $j$th word appears in an e-mail given that it is a SPAM, and a similar model of HAMs: $$ P(X=\g x\ |\ Y=HAM) = \prod_{j=1}^d ( b_j^{HAM} )^{x_j} ( 1 - b_j^{HAM} )^{(1-x_j)} . $$
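Classification then follows from Bayes' rule: the e-mail is labeled with the category maximizing the product of the class prior and the likelihood above. A rough sketch (in Python, with hypothetical variable names and priors estimated by the class frequencies):

    # Naive Bayes classification of a binary feature vector x (sketch).
    # b_spam[j], b_ham[j]: estimated Bernoulli parameters of the j-th word;
    # n_spam, n_ham: numbers of labeled SPAMs and HAMs.

    def likelihood(x, b):
        """P(X = x | Y) under the multivariate Bernoulli model."""
        p = 1.0
        for xj, bj in zip(x, b):
            p *= bj if xj == 1 else (1.0 - bj)
        return p

    def classify(x, b_spam, b_ham, n_spam, n_ham):
        """Return the category maximizing prior times likelihood."""
        n = n_spam + n_ham
        score_spam = (n_spam / n) * likelihood(x, b_spam)
        score_ham = (n_ham / n) * likelihood(x, b_ham)
        return "SPAM" if score_spam > score_ham else "HAM"

In practice, the products should be computed in the log domain; see the implementation details below.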

Implementation details

A few implementation details require attention for such an application to work in practice.

Smoothing

Without smoothing, a word that never appears in any labeled SPAM gets the estimate $b_j^{SPAM} = 0$, so any e-mail containing that word has $P(X=\g x\ |\ Y=SPAM) = 0$ regardless of the other words. Additive (Laplace) smoothing avoids such degenerate estimates.
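A standard choice, assuming additive smoothing with constant 1, is $$ \hat{b}_j^{SPAM} = \frac{1 + \mbox{number of SPAMs containing } D_j}{2 + \mbox{number of SPAMs}} , $$ and similarly for HAMs; this keeps every estimate strictly between 0 and 1.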

Numbers below machine precision

The likelihoods are products of $d$ numbers in $[0,1]$; for a realistically large dictionary, such a product quickly underflows below machine precision and is rounded to zero. The standard remedy is to compare log-probabilities instead, which replaces the product by a sum.
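Concretely, taking the logarithm turns the product into the sum $$ \log P(X=\g x\ |\ Y=SPAM) = \sum_{j=1}^d \left[ x_j \log b_j^{SPAM} + (1-x_j) \log ( 1 - b_j^{SPAM} ) \right] , $$ and since the logarithm is increasing, comparing the log-scores of the two categories yields the same decision as comparing the original posteriors.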

Online training

To train online, it suffices to store, for each category, the number of labeled e-mails and, for each word, the number of those e-mails containing it. A new labeled e-mail only increments these counters, and the parameters $b_j^{SPAM}$ and $b_j^{HAM}$ are recovered as ratios of counts, so each update costs $O(d)$ operations.
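A minimal sketch of such an update (in Python, with hypothetical state variables):

    # Online update of the classifier with one newly labeled e-mail (sketch).
    # counts[c][j]: number of e-mails of category c containing word j;
    # n[c]: number of e-mails of category c seen so far.

    def update(x, label, counts, n):
        """Incorporate one labeled e-mail without revisiting the previous ones."""
        n[label] += 1
        for j, xj in enumerate(x):
            counts[label][j] += xj
        # With Laplace smoothing, the parameters are then recovered on the fly as
        # b[label][j] = (1 + counts[label][j]) / (2 + n[label]).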

Speed-ups

In addition, the implementation can be made faster by using tests instead of powers to compute expressions like $$ ( b_j^{SPAM} )^{x_j} ( 1 - b_j^{SPAM} )^{(1-x_j)} = \begin{cases} b_j^{SPAM},& \mbox{if } x_j = 1\\ 1-b_j^{SPAM},& \mbox{otherwise.} \end{cases} $$
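Combined with the log-domain computation above, this amounts to one branch per word (a sketch with hypothetical precomputed arrays of logarithms):

    # Log-likelihood accumulated with a test per word instead of powers (sketch).
    # log_b[j] = log(b_j) and log_1mb[j] = log(1 - b_j) are precomputed per category.

    def log_likelihood(x, log_b, log_1mb):
        score = 0.0
        for xj, lb, l1mb in zip(x, log_b, log_1mb):
            score += lb if xj == 1 else l1mb
        return score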

Notes