A Neural Implementation of NBSVM in Keras

Arun Maiya
Jan 30, 2019


NBSVM is an approach to text classification proposed by Wang and Manning¹ that takes a linear model such as an SVM (or logistic regression) and infuses it with Bayesian probabilities by replacing word-count features with Naive Bayes log-count ratios. Despite their simplicity, NBSVM models have been shown to be both fast and powerful across a wide range of text classification datasets. In this article, we cover the following:

  • We implement an NBSVM model as a neural network using the Keras deep learning framework.
  • Using the well-studied IMDb movie review dataset, we demonstrate that this Keras implementation achieves a test accuracy of 92.5% with only a few seconds of training. This is competitive with deeper and more sophisticated neural network architectures that take much longer to train. It is 2.1 percentage points away from the current state-of-the-art.
  • Source code and results are available in the form of a Jupyter notebook on GitHub here.
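Before turning to Keras, here is a rough, non-neural sketch of the core idea using scikit-learn on a toy corpus (the documents, labels, and variable names below are illustrative and not part of the article's code): binarized count features are scaled by the Naive Bayes log-count ratios before fitting an ordinary linear classifier.

# Hedged sketch of the NBSVM idea on a toy corpus (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["great movie", "terrible movie", "great acting", "terrible plot"]
labels = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative

veczr = CountVectorizer(binary=True)
x = veczr.fit_transform(docs)              # binarized word counts

# Naive Bayes log-count ratios with add-one smoothing
p = x[labels == 1].sum(0) + 1              # counts in positive documents
q = x[labels == 0].sum(0) + 1              # counts in negative documents
r = np.log((p / p.sum()) / (q / q.sum()))  # log-count ratios

# "Infuse" the linear model: scale the features by r, then fit as usual
clf = LogisticRegression()
clf.fit(x.multiply(r), labels)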

Let’s begin by importing some necessary modules.

import numpy as np
from keras import backend as K
from keras.models import Model
from keras.layers.core import Activation
from keras.layers import Input, Embedding, Flatten, dot
from keras.optimizers import Adam
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import load_files

Loading the IMDb Dataset

The IMDb training set consists of 25,000 movie reviews labeled as either positive or negative. The test set consists of another 25,000 labeled movie reviews. We will use the first set of 25,000 reviews to train a model to classify movie reviews as positive or negative and evaluate the model on the second set of 25,000 reviews. The dataset is first loaded as a document-term matrix (DTM), where each row represents a review and each column represents a word spanning the entire vocabulary of the corpus. Each “word” here is a string of one, two, or three consecutive words in a review; that is, the features consist of unigrams, bigrams, and trigrams. Entries in the matrix are binarized word counts (i.e., 1 means the word appears at least once in the review and 0 means it does not). The IMDb dataset is available for download here. The PATH_TO_IMDB variable should be set to the full path of the extracted aclImdb folder. We compute and load this document-term matrix for both the training and test sets.
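As a quick, hedged illustration (a toy sentence, not part of the pipeline below), this is what CountVectorizer produces with ngram_range=(1, 3) and binary=True:

from sklearn.feature_extraction.text import CountVectorizer

veczr = CountVectorizer(ngram_range=(1, 3), binary=True, token_pattern=r'\w+')
dtm = veczr.fit_transform(["not a good movie"])
print(sorted(veczr.vocabulary_))  # unigrams, bigrams, and trigrams
print(dtm.toarray())              # a row of binarized counts (all 1s here)

With that picture in mind, here is the actual loading code for IMDb.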

PATH_TO_IMDB = r'./data/aclImdb'

def load_imdb_data(datadir):
    # read in training and test corpora
    categories = ['pos', 'neg']
    train_b = load_files(datadir + '/train', shuffle=True,
                         categories=categories)
    test_b = load_files(datadir + '/test', shuffle=True,
                        categories=categories)
    train_b.data = [x.decode('utf-8') for x in train_b.data]
    test_b.data = [x.decode('utf-8') for x in test_b.data]

    # build a binarized document-term matrix of unigrams, bigrams, and trigrams
    veczr = CountVectorizer(ngram_range=(1, 3), binary=True,
                            token_pattern=r'\w+',
                            max_features=800000)
    dtm_train = veczr.fit_transform(train_b.data)
    dtm_test = veczr.transform(test_b.data)
    y_train = train_b.target
    y_test = test_b.target
    print("DTM shape (training): (%s, %s)" % (dtm_train.shape))
    print("DTM shape (test): (%s, %s)" % (dtm_test.shape))

    # add 1 to the vocabulary size to reserve 0 as the padding ID
    num_words = len(veczr.vocabulary_) + 1
    print('vocab size: %s' % (num_words))
    return (dtm_train, dtm_test), (y_train, y_test), num_words

(dtm_train, dtm_test), (y_train, y_test), num_words = load_imdb_data(PATH_TO_IMDB)

Converting a Document-Term Matrix to Word ID Sequences

In a binarized document-term matrix, each document is represented as a long one-hot-encoded vector with most entries being zero. While our neural model could be implemented to accept rows from this matrix as input, we choose to represent each document as a sequence of word IDs with some fixed length, maxlen, for use with an embedding layer. An embedding layer in a neural network acts as a lookup mechanism that accepts a word ID as input and returns a vector (or scalar) representation of that word. These representations can either be learned or preset.


In our case, the embedding layer will return preset Naive Bayes log-count ratios for the words represented by the word IDs in a document. A model accepting documents represented as sequences of word IDs trains much faster than one accepting rows from a document-term matrix. While these two architectures technically have the same number of parameters, the lookup mechanism of an embedding layer reduces the number of features (i.e., words) and parameters under consideration at any iteration. That is, a document represented as a fixed-size sequence of word IDs is much more compact and efficient than the large one-hot-encoded vector from a document-term matrix with binarized counts.
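To make the lookup behavior concrete, here is a small, hedged sketch (with made-up values) of an embedding layer used as a frozen lookup table that maps word IDs to preset scalars:

import numpy as np
from keras.models import Model
from keras.layers import Input, Embedding

table = np.array([[0.0], [0.7], [-1.2]])  # row i holds the preset value for word ID i
inp = Input(shape=(4,))
emb = Embedding(3, 1, weights=[table], trainable=False)(inp)
m = Model(inp, emb)
print(m.predict(np.array([[1, 2, 0, 0]])))  # [[[0.7], [-1.2], [0.], [0.]]]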

Here, we convert the document-term matrix to a list of word ID sequences.

def dtm2wid(dtm, maxlen):
    x = []
    nwds = []
    for idx, row in enumerate(dtm):
        seq = []
        # non-zero column indices are the word IDs (shifted by 1 so that
        # 0 can be reserved for padding)
        indices = (row.indices + 1).astype(np.int64)
        data = (row.data).astype(np.int64)
        count_dict = dict(zip(indices, data))
        for k, v in count_dict.items():
            seq.extend([k] * v)
        num_words = len(seq)
        nwds.append(num_words)
        # pad up to maxlen with 0
        if num_words < maxlen:
            seq = np.pad(seq, (maxlen - num_words, 0),
                         mode='constant')
        # truncate down to maxlen
        else:
            seq = seq[-maxlen:]
        x.append(seq)
    nwds = np.array(nwds)
    print('sequence stats: avg:%s, max:%s, min:%s' % (nwds.mean(),
                                                      nwds.max(),
                                                      nwds.min()))
    return np.array(x)

maxlen = 2000
x_train = dtm2wid(dtm_train, maxlen)
x_test = dtm2wid(dtm_test, maxlen)

Computing the Naive Bayes Log-Count Ratios

The final data preparation step involves computing the Naive Bayes log-count ratios, which is most easily done using the original document-term matrix. These ratios capture how much more likely a word is to appear in documents of one class (positive) than in documents of the other (negative).

def pr(dtm, y, y_i):
    # fraction of documents in class y_i containing each word (add-one smoothed)
    p = dtm[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

nbratios = np.log(pr(dtm_train, y_train, 1) / pr(dtm_train, y_train, 0))
nbratios = np.squeeze(np.asarray(nbratios))
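As a quick sanity check on what pr computes (made-up numbers): if a word appears in 8 of 10 positive reviews but only 1 of 10 negative reviews, its log-count ratio is log(((8+1)/(10+1)) / ((1+1)/(10+1))) = log(4.5) ≈ 1.5. A large positive ratio marks a word as evidence for the positive class, a large negative ratio as evidence for the negative class, and values near zero as uninformative.

import numpy as np
print(np.log(((8 + 1) / (10 + 1)) / ((1 + 1) / (10 + 1))))  # ~1.504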

NBSVM in Keras

We are now ready to define our NBSVM model. Our model utilizes two embedding layers. The first, as mentioned above, stores the Naive Bayes log-count ratios. The second stores learned weights (or coefficients) for each feature (i.e., word) in this linear model. Our prediction, then, is simply a sigmoid applied to the dot product of these two vectors.
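In other words, for a single document the model computes something like the following (a hedged numpy sketch with toy values, shown only to illustrate the computation the Keras layers below perform):

import numpy as np

r_doc = np.array([1.5, -0.8, 0.3])  # NB log-count ratios looked up for the document's words (toy values)
w_doc = np.array([0.2, 0.5, -0.1])  # learned per-word coefficients (toy values)
score = r_doc.dot(w_doc)            # dot product over the document's word positions
print(1 / (1 + np.exp(-score)))     # sigmoid(score) = predicted probability of the positive class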

def get_model(num_words, maxlen, nbratios=None):
    # set up the embedding matrix for NB log-count ratios
    embedding_matrix = np.zeros((num_words, 1))
    for i in range(1, num_words):  # skip 0, the padding value
        if nbratios is not None:
            # if log-count ratios are supplied, then it's NBSVM
            embedding_matrix[i] = nbratios[i-1]
        else:
            # if log-count ratios are not supplied,
            # this reduces to a logistic regression
            embedding_matrix[i] = 1

    # set up the model
    inp = Input(shape=(maxlen,))
    r = Embedding(num_words, 1, input_length=maxlen,
                  weights=[embedding_matrix],
                  trainable=False)(inp)
    x = Embedding(num_words, 1, input_length=maxlen,
                  embeddings_initializer='glorot_normal')(inp)
    x = dot([r, x], axes=1)
    x = Flatten()(x)
    x = Activation('sigmoid')(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(lr=0.001),
                  metrics=['accuracy'])
    return model

This simple model achieves a 92.5% accuracy on the IMDb test set with only a few seconds of training on a Titan V GPU. In fact, the model trains within seconds even on a CPU. Interestingly, this accuracy is higher than the result reported in the original paper¹ (which was only 91.22% using bigram features).

model = get_model(num_words, maxlen, nbratios=nbratios)
model.fit(x_train, y_train,
          batch_size=32,
          epochs=3,
          validation_data=(x_test, y_test))

These results are competitive with more sophisticated (and deeper) neural network architectures. Moreover, this model outperforms a number of well-known approaches, including Facebook’s fastText architecture. Note that, when nbratios is set to None, our get_model function sets the embedding matrix, r, to all ones, which reduces the model to a logistic regression. Such a logistic regression model yields a lower (but surprisingly respectable) accuracy of 91.6% (versus 92.5% for NBSVM). Try it out yourself using our Jupyter notebook, available on GitHub here.
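For instance, the logistic regression variant can be trained on the same data simply by omitting the log-count ratios:

lr_model = get_model(num_words, maxlen, nbratios=None)  # all-ones embedding -> logistic regression
lr_model.fit(x_train, y_train,
             batch_size=32,
             epochs=3,
             validation_data=(x_test, y_test))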

This article was inspired by a tweet² from Jeremy Howard in September 2017.

References

¹ Sida Wang and Christopher D. Manning: Baselines and Bigrams: Simple, Good Sentiment and Topic Classification; ACL 2012.

² https://twitter.com/jeremyphoward/status/905841365241565184?lang=en
