TensorFlow — Text Classification

Illia Polosukhin
Nov 19, 2016


On Nov 9, it was officially one year since TensorFlow was released. Looking back, a lot of progress has been made towards making TensorFlow the most used machine learning framework.

And as this milestone passed, I realized that I still hadn't published the long-promised blog post about text classification. Even though examples have been in the TensorFlow repository, they didn't have a very good description.

Text classification is one of the most important parts of machine learning, as most of people's communication happens via text. We write blog articles, emails, tweets, notes and comments. All this information is there, but it is really hard to use compared to a form or data collected from a sensor.

There are classic NLP techniques for dealing with this, mostly treating words as symbols and running linear models over them. These techniques worked, but were very brittle. The recent adoption of embeddings and deep learning has opened up new ways of handling text.


The difference between words as symbols and words as embeddings is similar to the one described in Part 3 of this tutorial: among other things, embeddings compress similar categories (words) into a smaller space, allowing the next layers of the neural network to use this similarity to do a better job.

Now, the simplest model that everybody should start solving their problem with (the baseline, in ML terms) is a bag-of-words model: something that takes words independent of their order and uses them to predict your target.
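For intuition, here is a tiny toy snippet (mine, not from the TensorFlow example) showing that a bag-of-words representation only keeps word counts, so reordering a sentence does not change it:

from collections import Counter

# Two sentences with the same words in a different order have identical
# bag-of-words representations.
print(Counter('my work is cool'.split()) ==
      Counter('cool is my work'.split()))  # prints True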

As an example, we will take the DBPedia dataset described in this paper. The dataset contains the first paragraph of the Wikipedia page for ~0.5M entities, and the label is one of 15 categories (like People, Company, etc). This is usually called "topic classification" and can be used in a variety of cases, from analyzing comments on your website to sorting incoming emails.

Note that exactly the same techniques would work for sentiment analysis (categorizing whether a text has positive or negative sentiment) and even for question answering.

The full example can be found in the TensorFlow examples: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py (note that the code there will be updated with new APIs, so it's better to check it out there).

First, we need to retrieve and prepare data:

import numpy as np
import pandas
from sklearn import metrics
import tensorflow as tf
from tensorflow.contrib import learn

# Hyperparameters used throughout this post (values as in the linked example).
MAX_DOCUMENT_LENGTH = 10
EMBEDDING_SIZE = 50

# Prepare training and testing data.
dbpedia = learn.datasets.load_dataset('dbpedia', size='')
x_train = pandas.DataFrame(dbpedia.train.data)[1]
y_train = pandas.Series(dbpedia.train.target)
x_test = pandas.DataFrame(dbpedia.test.data)[1]
y_test = pandas.Series(dbpedia.test.target)

# Process vocabulary: map each word to an integer ID and pad/truncate
# every document to MAX_DOCUMENT_LENGTH IDs.
vocab_processor = learn.preprocessing.VocabularyProcessor(
    MAX_DOCUMENT_LENGTH)
x_train = np.array(list(vocab_processor.fit_transform(x_train)))
x_test = np.array(list(vocab_processor.transform(x_test)))
n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

TensorFlow has a handy learn.datasets module that contains a few example datasets, like DBPedia. Accessing it will download the dataset and load it into memory. Note that load_dataset has a size argument, which by default loads a small subset for DBPedia. To load the full dataset, pass an empty string.

Going from sentences (strings) to matrices (something TensorFlow, or any ML algorithm, can work with) requires finding all words in the text and remapping them to IDs: one number per unique word. This is exactly the same as for categorical variables in the previous section of this tutorial, but now instead of one value per example, we get a list of values, one per word in the sentence. For example, "my work is cool!" would map to [23, 500, 5, 1402, 17] (where 17 is "!").

We also want to make sure each sentence is the same length, so we provide MAX_DOCUMENT_LENGTH to specify how long each sentence will be (longer sentences will be truncated, and shorter ones padded).
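To make the mapping concrete, here is a small sketch using a toy corpus (the sentences and the resulting IDs are illustrative only; actual IDs depend on the order in which words are first seen in your data):

import numpy as np
from tensorflow.contrib import learn

docs = ['my work is cool', 'my cat is cool']
vocab = learn.preprocessing.VocabularyProcessor(max_document_length=6)
ids = np.array(list(vocab.fit_transform(docs)))
print(ids)
# Each row has exactly 6 IDs; padding positions are 0 and shared words
# ('my', 'is', 'cool') get the same ID in both rows, e.g. something like:
# [[1 2 3 4 0 0]
#  [1 5 3 4 0 0]]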

The resulting x_train and x_test are now just matrices that we can pass to our learning algorithm.

def bag_of_words_model(features, target):
  """A bag-of-words model. Note it disregards the word order in the text."""
  target = tf.one_hot(target, 15, 1, 0)
  features = tf.contrib.layers.bow_encoder(
      features, vocab_size=n_words, embed_dim=EMBEDDING_SIZE)
  logits = tf.contrib.layers.fully_connected(
      features, 15, activation_fn=None)
  loss = tf.contrib.losses.softmax_cross_entropy(logits, target)
  train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.01)
  return (
      {'class': tf.argmax(logits, 1),
       'prob': tf.nn.softmax(logits)},
      loss, train_op)

We create a simple TensorFlow model function that takes features (a list of word IDs) and a target (one of 15 classes). We use the simple bow_encoder, which combines creating the embedding matrix, looking up each ID in the input, and averaging the results. Then we just add a fully connected layer on top and use it to compute the loss and the classification results, tf.argmax(logits, 1). Add the training regime (Adam with a 0.01 learning rate), and that's our model function.
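If it helps to see the pieces that bow_encoder combines, here is a rough hand-rolled equivalent (a sketch of the idea, not the exact contrib implementation; the name manual_bow_encoder is mine):

def manual_bow_encoder(word_ids, vocab_size, embed_dim):
  """Roughly what bow_encoder does: embed each word ID, then average."""
  embeddings = tf.get_variable(
      'bow_embeddings', shape=[vocab_size, embed_dim],
      initializer=tf.random_uniform_initializer(-1.0, 1.0))
  # [batch_size, MAX_DOCUMENT_LENGTH, embed_dim]
  embedded = tf.nn.embedding_lookup(embeddings, word_ids)
  # Average over the word dimension -> [batch_size, embed_dim].
  return tf.reduce_mean(embedded, 1)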

Now, by simply training it on the data we prepared, we can see how well bag of words works for this problem:

classifier = learn.Estimator(model_fn=bag_of_words_model)

# Train and predict.
classifier.fit(x_train, y_train, steps=10000)
y_predicted = [p['class'] for p in
               classifier.predict(x_test, as_iterable=True)]
score = metrics.accuracy_score(y_test, y_predicted)
print('Accuracy: {0:f}'.format(score))

Note that you can play with the number of training steps and the training regime (a different learning rate and the other parameters that optimize_loss accepts).
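For example, inside the model function you could swap the train_op for something like this (the values here are purely illustrative, not tuned):

# Illustrative variation: SGD with a smaller learning rate and gradient
# clipping, using other arguments that optimize_loss accepts.
train_op = tf.contrib.layers.optimize_loss(
    loss, tf.contrib.framework.get_global_step(),
    optimizer='SGD', learning_rate=0.001, clip_gradients=5.0)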

But, as we all know, bag of words does not really model how language works: the order of words matters (even though less than you might think in practice), and we want to handle that as well.

There are a few ways to do this: add bi-grams, use convolutions to learn n-grams over the text, or use a Recurrent Neural Network to handle long-term dependencies in the text. Depending on the problem, any of these methods can work better. You can see examples of all of them implemented here (including character-level models): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/learn#text-classification
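As a rough illustration of the convolutional option, here is a sketch in the spirit of the linked examples (not the repository code itself; N_FILTERS and WINDOW_SIZE are hyperparameters I'm assuming for illustration):

N_FILTERS = 10    # assumed number of n-gram filters
WINDOW_SIZE = 3   # assumed n-gram width (tri-grams)

def cnn_model(features, target):
  """Sketch: learn n-gram filters over the embedded word sequence."""
  target = tf.one_hot(target, 15, 1, 0)
  # [batch_size, MAX_DOCUMENT_LENGTH, EMBEDDING_SIZE]
  word_vectors = tf.contrib.layers.embed_sequence(
      features, vocab_size=n_words, embed_dim=EMBEDDING_SIZE, scope='words')
  # Add a channel dimension so we can use 2D convolutions:
  # [batch_size, MAX_DOCUMENT_LENGTH, EMBEDDING_SIZE, 1].
  word_vectors = tf.expand_dims(word_vectors, 3)
  # Each filter looks at WINDOW_SIZE consecutive word embeddings.
  conv = tf.contrib.layers.convolution2d(
      word_vectors, N_FILTERS, [WINDOW_SIZE, EMBEDDING_SIZE],
      padding='VALID')
  # Max over time: keep the strongest response of each filter.
  encoding = tf.reduce_max(conv, [1, 2])
  logits = tf.contrib.layers.fully_connected(encoding, 15, activation_fn=None)
  loss = tf.contrib.losses.softmax_cross_entropy(logits, target)
  train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.01)
  return (
      {'class': tf.argmax(logits, 1),
       'prob': tf.nn.softmax(logits)},
      loss, train_op)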

In this post let’s review the Recurrent Neural Network implementation:

def rnn_model(features, target):
  """RNN model to predict from a sequence of words to a class."""
  # Convert indexes of words into embeddings.
  # This creates an embeddings matrix of [n_words, EMBEDDING_SIZE] and then
  # maps word indexes of the sequence into [batch_size, sequence_length,
  # EMBEDDING_SIZE].
  word_vectors = tf.contrib.layers.embed_sequence(
      features, vocab_size=n_words, embed_dim=EMBEDDING_SIZE, scope='words')
  # Split into a list of embeddings per word, while removing the doc length
  # dim. word_list results in a list of tensors [batch_size, EMBEDDING_SIZE].
  word_list = tf.unstack(word_vectors, axis=1)
  # Create a Gated Recurrent Unit cell with hidden size of EMBEDDING_SIZE.
  cell = tf.nn.rnn_cell.GRUCell(EMBEDDING_SIZE)
  # Create an unrolled Recurrent Neural Network of length MAX_DOCUMENT_LENGTH
  # and pass word_list as inputs for each unit.
  _, encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)
  # Given the encoding of the RNN, take the encoding of the last step (i.e.
  # the hidden state of the last step) and pass it as features to a fully
  # connected layer to output probabilities per class.
  target = tf.one_hot(target, 15, 1, 0)
  logits = tf.contrib.layers.fully_connected(
      encoding, 15, activation_fn=None)
  loss = tf.contrib.losses.softmax_cross_entropy(logits, target)
  # Create a training op.
  train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.01, clip_gradients=1.0)
  return (
      {'class': tf.argmax(logits, 1),
       'prob': tf.nn.softmax(logits)},
      loss, train_op)

Hopefully the comments inlined with the code give a good description of what is done at each step. As you can see, the code is not very different from the bag-of-words model; only the "encoding" part is replaced with the RNN call.

The same Estimator call with a different model function lets us run this model on the data and see the improvement from modeling the sequence in which words appear.
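Concretely, the calls mirror the bag-of-words run above; only the model_fn changes:

classifier = learn.Estimator(model_fn=rnn_model)
classifier.fit(x_train, y_train, steps=10000)
y_predicted = [p['class'] for p in
               classifier.predict(x_test, as_iterable=True)]
print('Accuracy: {0:f}'.format(metrics.accuracy_score(y_test, y_predicted)))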

You now know how to apply some of the basic architectures for text / document classification. Other things to consider are loading pre-trained embeddings (like GloVe) and doing semi-supervised training, which lets the model spend more time training for your problem instead of learning about language from scratch. I'll try to talk about this in future posts.
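Until then, here is one minimal, hedged sketch of the pre-trained embeddings idea (not from this post or the TensorFlow example): create the embedding matrix yourself, initialized from pre-trained vectors, and look words up directly instead of using embed_sequence. Here glove_matrix is assumed to be a numpy array of shape [n_words, EMBEDDING_SIZE] whose rows are aligned with the IDs from vocab_processor; building it from a GloVe file is left out, and the function name is mine.

def embed_with_pretrained(word_ids, glove_matrix, trainable=True):
  """Returns [batch_size, MAX_DOCUMENT_LENGTH, EMBEDDING_SIZE] word vectors."""
  embeddings = tf.get_variable(
      'pretrained_embeddings', shape=list(glove_matrix.shape),
      initializer=tf.constant_initializer(glove_matrix),
      trainable=trainable)
  # Same role as embed_sequence in the models above, but the matrix starts
  # from the pre-trained vectors instead of a random initialization.
  return tf.nn.embedding_lookup(embeddings, word_ids)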

Additionally, I'll talk more about how to make these models converge and perform better with some of the tricks implemented in optimize_loss and tf.layers.

Since writing this post, I founded NEAR Protocol. Read more about our journey.
