First, we need to find a vector representation of e-mails. For this demo, we restrict ourselves to a simple ``bag of words'' model in which we only take into account the presence of words in an e-mail, not their number of occurrences or their order in the text.
Specifically, we encode a text as a binary vector in which each component is associated with a word of a dictionary. In order to apply the naive Bayes classifier, we need to choose a probability distribution for these vectors. For binary vectors, the most common choice is the multivariate Bernoulli model: since the naive Bayes classifier assumes the components of the input to be independent, the vector is modeled as a sequence of independent Bernoulli variables whose parameters can be estimated from the data as frequencies, i.e., by counting the number of e-mails of each category that contain a specific word.
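As a concrete illustration, here is a minimal sketch of this encoding and estimation step in Python (not the demo's own code); the dictionary, e-mails, and function names are made up and only meant to make the counting explicit.

```python
# Sketch: binary bag-of-words encoding and Bernoulli parameter estimation by counting.
# The dictionary and e-mails below are hypothetical.

def encode(text, dictionary):
    """Binary vector x with x[j] = 1 if the j-th dictionary word appears in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in dictionary]

def estimate(X, y, label):
    """b_j = fraction of e-mails labeled `label` that contain word j."""
    rows = [x for x, lab in zip(X, y) if lab == label]
    return [sum(x[j] for x in rows) / len(rows) for j in range(len(rows[0]))]

dictionary = ["money", "free", "meeting", "report"]     # hypothetical dictionary

emails = ["Get free money now", "Free money !!!",
          "Meeting about the report", "Report for the next meeting"]
labels = ["SPAM", "SPAM", "HAM", "HAM"]

X = [encode(e, dictionary) for e in emails]
b_spam = estimate(X, labels, "SPAM")
b_ham  = estimate(X, labels, "HAM")
print(b_spam)   # [1.0, 1.0, 0.0, 0.0]
print(b_ham)    # [0.0, 0.0, 1.0, 1.0]
```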
An algorithmic advantage of the method used here is that it can be trained online, meaning that the classifier is updated after each new labeled e-mail. This saves resources: there is no need to store all the e-mails, and the classifier can be improved without redoing all the computations.
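A sketch of such an online update, assuming the same binary feature vectors as above: only the counts are stored, so each newly labeled e-mail is folded in with a few additions and can then be discarded.

```python
# Hypothetical online version: the classifier keeps only per-class counts,
# so each new labeled e-mail updates it in O(number of words) and can then be discarded.

n_words = 4                                     # size of the (hypothetical) dictionary
counts = {"SPAM": 0, "HAM": 0}                  # e-mails seen per class
word_counts = {"SPAM": [0] * n_words,           # how many e-mails of each class
               "HAM":  [0] * n_words}           # contain each dictionary word

def update(x, label):
    """Fold one labeled binary feature vector into the counts."""
    counts[label] += 1
    for j, xj in enumerate(x):
        word_counts[label][j] += xj

update([1, 1, 0, 0], "SPAM")    # first labeled e-mail
update([1, 0, 0, 0], "SPAM")    # a later one: no retraining from scratch needed

# The Bernoulli parameters are recomputed from the counts whenever they are needed:
b_spam = [c / counts["SPAM"] for c in word_counts["SPAM"]]
print(b_spam)    # [1.0, 0.5, 0.0, 0.0]
```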
Label this e-mail as SPAM or HAM.
Number of labeled SPAMs:
Number of labeled HAMs:
After labeling a number of e-mails, you can write a new e-mail and let the classifier predict whether it is a SPAM or a HAM.
Corresponding feature vector and probabilities
Dictionary | Feature vector $\g x$ | Probability of having this word in a SPAM | Probability of having this word in a HAM
... | ... | ... | ...
... | ... | ... | ...
... | ... | ... | ...
In addition, the implementation can be made faster by using a simple test instead of computing powers in expressions like $$ ( b_j^{SPAM} )^{x_j} ( 1 - b_j^{SPAM} )^{(1-x_j)} = \begin{cases} b_j^{SPAM},& \mbox{if } x_j = 1\\ 1-b_j^{SPAM},& \mbox{otherwise.} \end{cases} $$
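For instance, in the following sketch (with `b_spam[j]` standing in for $b_j^{SPAM}$ and made-up parameter values), both functions return the same value, but the second avoids the two exponentiations:

```python
# Two equivalent ways of evaluating one factor of the naive Bayes likelihood
# for a binary feature x_j; the parameter values below are hypothetical.

def term_with_power(b, xj):
    return (b ** xj) * ((1 - b) ** (1 - xj))

def term_with_test(b, xj):
    return b if xj == 1 else 1 - b

b_spam = [0.8, 0.6, 0.1, 0.2]      # made-up estimates of b_j^SPAM
x = [1, 0, 0, 1]                   # feature vector of the e-mail to classify

likelihood = 1.0
for j, xj in enumerate(x):
    likelihood *= term_with_test(b_spam[j], xj)
print(likelihood)    # 0.8 * 0.4 * 0.9 * 0.2 ≈ 0.0576
```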