STATS 415

Naive Bayes with the categorical event model

This post supplements the classification slides. Please see the slides for the setup.

In this post, we develop the naive Bayes classifier with categorical event model (also known as multinomial event model) for natural language processing. As a running example, consider a spam classification task: we have a training dataset consisting of \(n\) emails labeled as spam or not spam, and we wish to train a classifier to predict whether an email is spam or not.

First, we describe a way of representing emails numerically. Let \(\cV = \{\text{a},\text{aardvark},\dots\}\) be an enumerated set of possible words appearing in the emails (so each word has a unique index) and \(V\) be the size of \(\cV\). We call this enumerated set of words the vocabulary. An email is merely a sequence of words, so we can represent an email with \(d\) words by a vector \(X\) of length \(d\) whose \(j\)-th entry is the index of the \(j\)-th word in the email (e.g. if the first word in an email is the second word in the vocabulary, “aardvark”, then \(X_1 = 2\)). We encode the emails in the training dataset numerically to obtain \((X_1,Y_1),\dots,(X_n,Y_n)\), in which \(X_i\) is the numeric encoding of the \(i\)-th email (a vector of length \(d_i\), the number of words in that email) and \(Y_i\in\{0,1\}\) is a label that indicates whether the email is spam or not.
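To make the encoding concrete, here is a minimal Python sketch. The toy vocabulary, the whitespace tokenizer, and the 1-based indexing are illustrative assumptions, not part of the model.

```python
# A minimal sketch of the numeric encoding, under assumed choices of vocabulary
# and tokenizer (whitespace splitting); real preprocessing would be more careful.
vocabulary = ["a", "aardvark", "free", "money", "meeting", "tomorrow"]
word_to_index = {word: j + 1 for j, word in enumerate(vocabulary)}  # 1-based indices
V = len(vocabulary)

def encode_email(text):
    """Map an email to the vector X of vocabulary indices, one entry per word."""
    return [word_to_index[w] for w in text.lower().split() if w in word_to_index]

print(encode_email("free money tomorrow"))  # [3, 4, 6]
```

Words outside the vocabulary are simply dropped here; how to handle them is a separate design choice.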

Second, we prescribe a naive Bayes model for the \((X,Y)\) pairs. An email is either spam or not, so \(Y\) is a Bernoulli random variable: \(\def\Ber{\mathrm{Ber}}Y\sim\Ber(\pi_1)\), where \(\pi_1\) is the probability that an email is spam. In naive Bayes models, we assume the features are conditionally independent given the label. Thus the entries of \(X\) given \(Y\) are IID random variables. Since each entry of \(X\) takes values in the vocabulary, they are categorical (also known as multinomial) random variables:

\[\Pr\{X=x\mid Y=y\} = \prod_{j=1}^d\Pr\{X_j=x_j\mid Y=y\} = \prod_{j=1}^df_y(x_j),\]

where \(f_y\) is the pmf of \(X_j\mid Y=y\). Intuitively, given whether an email is spam or not (and the length of the email \(d\)), we toss one of two \(V\)-sided dice (which die we toss depends on whether the email is spam or not) \(d\) times to generate the words in the email. Mathematically, the generative model for the \((X,Y)\) pairs is

\[\begin{aligned} \Pr\{X=x, Y=y\} &= \Pr\{X=x\mid Y=y\}\Pr\{Y=y\} \\ &= \Big(\prod_{j=1}^{d}\Pr\{X_j=x_j\mid Y=y\}\Big)\Pr\{Y=y\} \\ &= \Big(\prod_{j=1}^{d}f_y(x_j)\Big)\pi_y. \end{aligned}\]
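The generative story is easy to simulate. The sketch below uses made-up values of \(\pi_1\), \(f_1\), and \(f_0\) purely for illustration: flip the label coin, then roll the corresponding \(V\)-sided die \(d\) times.

```python
import random

# Assumed (made-up) parameter values for a vocabulary of size V = 6.
V = 6
pi_1 = 0.4                                     # probability that an email is spam
f = {1: [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],  # pmf over the V words given spam
     0: [0.20, 0.10, 0.10, 0.10, 0.30, 0.20]}  # pmf over the V words given not spam

def generate_email(d):
    """Sample (X, Y): flip the label coin, then roll the label's die d times."""
    y = 1 if random.random() < pi_1 else 0
    x = random.choices(range(1, V + 1), weights=f[y], k=d)  # d IID categorical draws
    return x, y

print(generate_email(d=5))  # e.g. ([3, 4, 4, 3, 6], 1)
```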

Taking a step back, the naive Bayes assumption seems ridiculous: words in a sentence are not independent. If you know a word in a sentence is a verb, then the subsequent word is generally not another verb. Surprisingly, this seemingly ridiculous model is good enough for many text classification tasks (including spam classification).

The parameters of the model are \(\pi_1,f_1,f_0\). We estimate the parameters from the training dataset by maximum likelihood:

\[\def\hf{\widehat{f}} \def\hpi{\widehat{\pi}} \begin{aligned} (\hpi_1,\hf_1,\hf_0) &\in \argmax_{\pi_1,f_1,f_0}\prod_{i=1}^n\Pr\{X=X_i, Y=Y_i\} \\ &= \argmax_{\pi_1,f_1,f_0}\prod_{i=1}^n\Big(\prod_{j=1}^{d_i}f_{Y_i}(X_{i,j})\Big)\pi_{Y_i}, \end{aligned}\]

where \(X_i\) and \(Y_i\) are the encoding and label of the \(i\)-th email in the training dataset, and \(\pi_0\) is shorthand for \(1 - \pi_1\). We leave it as an exercise to show that the maximum likelihood estimates (MLE's) are

\[\begin{aligned} \hpi_1 &\gets \frac{\sum_{i=1}^n\ones\{Y_i=1\}}{n}, \\ \hf_1(k) &\gets \frac{\sum_{i=1}^n\sum_{j=1}^{d_i}\ones\{X_{i,j}=k,Y_i=1\}}{\sum_{i=1}^nd_i\ones\{Y_i=1\}}, \\ \hf_0(k) &\gets \frac{\sum_{i=1}^n\sum_{j=1}^{d_i}\ones\{X_{i,j}=k,Y_i=0\}}{\sum_{i=1}^nd_i\ones\{Y_i=0\}}. \end{aligned}\]
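Translated directly into code, the MLE's are just counts and ratios. The sketch below assumes the training data is given as a list of (x, y) pairs, where x is a 1-based index vector and y is 0 or 1; that data layout is an assumption made for illustration.

```python
def fit_naive_bayes(data, V):
    """Compute the MLE's pi_1_hat, f_1_hat, f_0_hat from encoded training data."""
    n = len(data)
    pi_1 = sum(y for _, y in data) / n          # fraction of spam emails
    f = {}
    for label in (0, 1):
        counts = [0] * (V + 1)                  # counts[k] = occurrences of word k (index 0 unused)
        total = 0                               # total number of words in emails with this label
        for x, y in data:
            if y == label:
                total += len(x)
                for k in x:
                    counts[k] += 1
        f[label] = [c / total for c in counts]  # normalized histogram of words
    return pi_1, f

# Two tiny "emails": one spam, one not spam.
data = [([3, 4, 3], 1), ([5, 6, 1, 5], 0)]
pi_1_hat, f_hat = fit_naive_bayes(data, V=6)
print(pi_1_hat)      # 0.5
print(f_hat[1][3])   # 2/3: word 3 accounts for two of the three words appearing in spam
```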

Intuitively, \(\hpi_1\) is simply the fraction of spam emails in the training data, and \(\hf_y\) is a histogram of all the words in emails with label \(y\) (normalized so that the total probability mass is 1). Both estimates match the generative story: to generate an email, we first decide whether it is spam or not by flipping a (possibly unfair) coin with probability of heads \(\pi_1\), so a natural estimate of \(\pi_1\) is the fraction of times the coin comes up heads, which is the fraction of spam emails.

To make a prediction on a new email with features \(X_{n+1}\), we estimate the regression function \(\Pr\{Y=1\mid X=X_{n+1}\}\) by expressing it in terms of the parameters of the naive Bayes model \(\pi_1,f_1,f_0\) and plugging in their MLE’s:

\[\def\hPr{\widehat{\Pr}} \begin{aligned} &\hPr\{Y=1\mid X=X_{n+1}\} \\ &\quad= \frac{\hPr\{X=X_{n+1}\mid Y=1\}\hPr\{Y=1\}}{\hPr\{X=X_{n+1}\mid Y=1\}\hPr\{Y=1\} + \hPr\{X=X_{n+1}\mid Y=0\}\hPr\{Y=0\}} \\ &\quad= \frac{(\prod_{j=1}^{d_{n+1}}\hf_1(X_{n+1,j}))\hpi_1}{(\prod_{j=1}^{d_{n+1}}\hf_1(X_{n+1,j}))\hpi_1 + (\prod_{j=1}^{d_{n+1}}\hf_0(X_{n+1,j}))(1-\hpi_1)}. \end{aligned}\]
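In code, the plug-in rule is a pair of products and one ratio. The sketch below reuses pi_1_hat and f_hat from the fitting sketch above (so the same assumed data layout applies); in practice one would typically sum logarithms rather than multiply many small probabilities, but the direct products mirror the formula.

```python
def predict_spam_probability(x_new, pi_1, f):
    """Estimate Pr{Y = 1 | X = x_new} by plugging the fitted parameters into Bayes' rule."""
    p_spam = pi_1
    p_ham = 1.0 - pi_1
    for k in x_new:
        p_spam *= f[1][k]   # accumulates the product of f_1(x_j) over the new email's words
        p_ham *= f[0][k]    # accumulates the product of f_0(x_j)
    return p_spam / (p_spam + p_ham)

print(predict_spam_probability([3, 4], pi_1_hat, f_hat))  # 1.0: words 3 and 4 never appear in non-spam
```

Note that if every word in x_new has zero estimated probability under both classes, both products vanish and the ratio is 0/0; that is exactly the issue discussed next.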

This works, but it suffers from an issue that (unfortunately) arises often in text classification. To see the issue, suppose there is a word (say the \(k\)-th word in the vocabulary) that does not appear in any email in the training dataset. This causes the MLE's of \(f_1(k)\) and \(f_0(k)\) to be 0. When the classifier encounters a new email that includes this word, it tries to evaluate \(\hPr\{Y=1\mid X=X_{n+1}\}\), but it obtains \(\frac00\) because the numerator and denominator include the products \(\prod_{j=1}^{d_{n+1}}\hf_1(X_{n+1,j})\) and \(\prod_{j=1}^{d_{n+1}}\hf_0(X_{n+1,j})\), and these products include \(\hf_1(k) = 0\) and \(\hf_0(k) = 0\) as factors.

The issue is caused by the MLE estimating the probability of seeing the \(k\)-th word as zero simply because the \(k\)-th word does not appear in the training data. Generally speaking, it is a bad idea to estimate the probability of an event as zero merely because the event does not occur in the training data. To avoid this issue, we replace the MLE's of \(f_1\) and \(f_0\) with smoothed versions:

\[\def\tf{\widetilde{f}} \begin{aligned} \tf_1(k) &\gets \frac{1+\sum_{i=1}^n\sum_{j=1}^{d_i}\ones\{X_{i,j}=k,Y_i=1\}}{V+\sum_{i=1}^nd_i\ones\{Y_i=1\}}, \\ \tf_0(k) &\gets \frac{1+\sum_{i=1}^n\sum_{j=1}^{d_i}\ones\{X_{i,j}=k,Y_i=0\}}{V+\sum_{i=1}^nd_i\ones\{Y_i=0\}}. \end{aligned}\]

This is called Laplace smoothing: we add 1 to the numerator and \(V\) to the denominator to keep the smoothed histogram normalized. This avoids the issue because \(\tf_y(k)\) is never zero, even if the \(k\)-th word never appears in the training dataset.
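In code, the smoothing is a small change to the fitting sketch above: start every word count at 1 and every denominator at \(V\).

```python
def fit_naive_bayes_smoothed(data, V):
    """Compute pi_1_hat and the Laplace-smoothed estimates of f_1 and f_0."""
    n = len(data)
    pi_1 = sum(y for _, y in data) / n
    f = {}
    for label in (0, 1):
        counts = [0] + [1] * V                  # one pseudo-count per word (index 0 unused)
        total = V                               # V extra pseudo-counts in the denominator
        for x, y in data:
            if y == label:
                total += len(x)
                for k in x:
                    counts[k] += 1
        f[label] = [c / total for c in counts]
    return pi_1, f

data = [([3, 4, 3], 1), ([5, 6, 1, 5], 0)]
pi_1_hat, f_tilde = fit_naive_bayes_smoothed(data, V=6)
print(f_tilde[1][2])  # 1/9: word 2 never appears in spam, yet its estimate is not zero
```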

Posted on September 21, 2022 from Ann Arbor, MI