Deep Learning in Production: Sentiment Analysis with the Transformer Model

In 2017, the Google Brain team published Attention Is All You Need, a paper that studied the effectiveness of attention mechanisms in neural networks. At a time when recurrent and convolutional architectures dominated sequence transduction tasks, the paper introduced the Transformer, a neural network architecture composed solely of attention-based encoder-decoder networks. The authors presented state-of-the-art results in machine translation, and follow-up work has shown that the attention-based architecture generalizes to a variety of language understanding tasks, including sentiment analysis.

Model architecture diagram from Attention Is All You Need

Although substantial research has already been done on the architecture itself, there is a disconnect in the literature when it comes to applying the model in a production software engineering setting. In this guide, we will attempt to close this gap by showing how to integrate the authors’ implementation of the Transformer model into a scalable machine learning pipeline and apply it to the real-world problem of sentiment analysis.

Creating an Estimator

The TensorFlow Estimator API enables a separation of concerns between the model code and the input pipeline. Estimators are portable across datasets, and can be used for training, evaluation, and prediction.

Although it can be a fun learning exercise, implementing the Transformer model from scratch can also be daunting. Luckily for us, the official implementation is published in Tensor2Tensor (a library of deep learning models).

Importing any Tensor2Tensor model is just a matter of calling the tensor2tensor.utils.trainer_lib.create_estimator function with the model name, which returns the model implementation wrapped in a tf.estimator.Estimator. Tensor2Tensor encapsulates the parameters specific to a problem domain in its Problem classes. Using the SentimentIMDB problem parameters tells Tensor2Tensor what modality and shape the model inputs should have. Since we plan to train on a single GPU, we will lower model hyperparameters such as num_hidden_layers (the number of layers in the encoder/decoder networks) and hidden_size (the width of each attention component) to reduce the total number of parameters.
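To get a feel for how much those two hyperparameters matter, here is a back-of-the-envelope sketch that estimates the weight count of an encoder-decoder Transformer. It counts only the attention projections, the feed-forward layers, and one embedding table, and it ignores biases and layer normalization. The function name, the filter_size argument, and the reduced configuration (2 layers, hidden size 128) are illustrative assumptions rather than the values used in this guide. The larger configuration matches the base model reported in the paper (6 layers, hidden size 512, feed-forward size 2048).

```python
def approx_transformer_params(num_hidden_layers, hidden_size, filter_size, vocab_size):
    """Rough weight count for an encoder-decoder Transformer (biases and layer norm ignored)."""
    # Query, key, value, and output projections: four hidden_size x hidden_size matrices.
    attention = 4 * hidden_size * hidden_size
    # Position-wise feed-forward network: hidden -> filter -> hidden.
    feed_forward = 2 * hidden_size * filter_size
    encoder_layer = attention + feed_forward
    # A decoder layer has self-attention plus encoder-decoder attention.
    decoder_layer = 2 * attention + feed_forward
    embedding = vocab_size * hidden_size
    return num_hidden_layers * (encoder_layer + decoder_layer) + embedding


# Paper-sized base model versus a hypothetical reduced model, both with a 10,000-word vocabulary.
base = approx_transformer_params(6, 512, 2048, vocab_size=10_000)
small = approx_transformer_params(2, 128, 512, vocab_size=10_000)
print(f"base: {base:,} weights, small: {small:,} weights")
```

When filter_size scales with hidden_size, the attention and feed-forward terms grow with the square of hidden_size, so narrowing the model shrinks it faster than removing layers does.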

The standalone estimator by itself is not useful in a production environment. We could write ad-hoc scripts to prepare the data, train the model, and serve the model. However, it would be even better to build a production application that can automatically scale to handle more data, update when new data becomes available, and serve real-time predictions. This usually requires a lot of DevOps work, but we can do it with minimal effort using Cortex, an open source machine learning infrastructure platform.

We’ll only train a baseline model on a standard research dataset. But by the end of this guide, when our machine learning pipeline is in place, it will be trivial to try larger datasets or more complex models. When we decide to redeploy with new data, Cortex ensures that there is no downtime by routing traffic to the previous model until the new one is trained and deployed.

Setting up Data Ingestion

We’re going to train the Transformer model with the Large Movie Review Dataset. We’ve removed a few columns and saved the data into a CSV file on S3:

review                                                     | label |
"Once again Mr. Costner has dragged out a movie for far..."| "neg" |
"This is an example of why the majority of action films..."| "neg" |
"First of all I hate those moronic rappers, who could'n..."| "neg" |

Setting up the infrastructure to manage Spark clusters for ingesting raw data can be difficult. Cortex automates this process, spawning the necessary Spark workers to ingest the data at deployment time. A YAML configuration tells Cortex how to ingest our movie reviews by describing where the data is, which columns to expect in the raw data, the type of each column for validation, and any CSV metadata to consider when ingesting.
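As a plain-Python illustration of the kind of validation an ingestion step performs (this is not Cortex's implementation or configuration format), the sketch below parses CSV text, checks that the expected columns are present, and rejects rows whose values fail a per-column check. The SCHEMA dictionary, the ingest function, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical schema: column name -> validator. Both columns are strings;
# the label is additionally restricted to the two sentiment classes.
SCHEMA = {
    "review": lambda value: len(value) > 0,
    "label": lambda value: value in {"pos", "neg"},
}


def ingest(csv_text):
    """Parse CSV text; return the rows that match SCHEMA and the rejected row numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(SCHEMA) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    valid, rejected = [], []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        # Treat absent values as empty strings so every validator receives a string.
        values = {name: row[name] or "" for name in SCHEMA}
        if all(check(values[name]) for name, check in SCHEMA.items()):
            valid.append(values)
        else:
            rejected.append(row_number)
    return valid, rejected


# Made-up sample data: the third data row has no label and should be rejected.
sample = 'review,label\n"Great film, loved it",pos\n"Dull and far too long",neg\n"No label here",\n'
rows, bad_rows = ingest(sample)
```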

Defining Data Transformations

The model needs numerical input, so we collect all of the words into a vocabulary and use the index of each word as its numerical representation. Our vocabulary aggregation in PySpark uses a regular expression to match the words in each review, then selects the most frequently occurring words for the vocabulary. We’ll also reserve indices 0 and 1 to represent padding and unknown words.
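The aggregation can be sketched in plain Python, using collections.Counter in place of PySpark. The \w+ pattern, the lowercasing, and the "<PAD>" and "<UNKNOWN>" placeholder names are assumptions made for illustration; the original aggregator may tokenize differently.

```python
import re
from collections import Counter

PAD_INDEX, UNKNOWN_INDEX = 0, 1  # reserved indices, as described above
WORD_PATTERN = re.compile(r"\w+")  # assumed tokenizer: runs of word characters


def build_vocab(reviews, vocab_size):
    """Map the most frequent words to indices, starting after the reserved ones."""
    counts = Counter()
    for review in reviews:
        counts.update(WORD_PATTERN.findall(review.lower()))
    reserved = 2
    most_common = counts.most_common(vocab_size - reserved)
    vocab = {"<PAD>": PAD_INDEX, "<UNKNOWN>": UNKNOWN_INDEX}
    for index, (word, _count) in enumerate(most_common, start=reserved):
        vocab[word] = index
    return vocab


# Tiny made-up corpus: "a" appears three times, "great" twice, everything else once.
reviews = ["a great movie", "a great cast", "a dull plot"]
vocab = build_vocab(reviews, vocab_size=4)
```

With vocab_size=4, only the two most frequent words fit after the reserved entries, so every other word would later map to the unknown index.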

We can configure Cortex to create a vocabulary of the 10,000 most common words in the reviews corpus by aggregating over the review column we ingested from the raw dataset.

When training deep models, it’s common practice to batch inputs to reduce variance between weight updates. Since the inputs to our model are tensors, we need to ensure that our batched reviews are rectangular in shape by padding each review to the same length. The PySpark code to calculate the length of the longest review in our dataset is similar to that of building the vocabulary, and you can find it here.
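A minimal plain-Python sketch of that aggregation, again assuming a \w+ tokenizer rather than whatever pattern the original PySpark code uses:

```python
import re

WORD_PATTERN = re.compile(r"\w+")  # assumed tokenizer: runs of word characters


def max_review_length(reviews):
    """Return the number of tokens in the longest review (0 for an empty dataset)."""
    return max((len(WORD_PATTERN.findall(review)) for review in reviews), default=0)


# Made-up reviews with 3, 7, and 1 tokens respectively.
reviews = ["I love it", "Not my kind of movie at all", "Fine"]
max_length = max_review_length(reviews)
```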

With max_length and vocab, we have what we need to transform our data into fixed-length lists of numbers. We’ll implement a transformation that converts each word into its vocabulary index and pads each review to max_length.
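One way to write such a transformation in plain Python is sketched below. The function name, the tokenizer pattern, and the small vocabulary are hypothetical; the defensive truncation is an added assumption for inputs longer than max_length, which cannot occur when max_length is computed from the same dataset.

```python
import re

PAD_INDEX, UNKNOWN_INDEX = 0, 1  # reserved indices for padding and unknown words
WORD_PATTERN = re.compile(r"\w+")  # assumed tokenizer: runs of word characters


def tokenize_review(review, vocab, max_length):
    """Convert a review into a list of exactly max_length vocabulary indices."""
    words = WORD_PATTERN.findall(review.lower())
    indices = [vocab.get(word, UNKNOWN_INDEX) for word in words]
    indices = indices[:max_length]  # truncate defensively if a review is too long
    return indices + [PAD_INDEX] * (max_length - len(indices))


vocab = {"<PAD>": 0, "<UNKNOWN>": 1, "i": 2, "love": 3, "this": 4, "movie": 5}
tokens = tokenize_review("I love this film", vocab, max_length=6)
# tokens is [2, 3, 4, 1, 0, 0]: "film" is not in the vocabulary, and two
# padding entries bring the list up to max_length.
```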

We can configure Cortex to transform each raw review and save the output into a new column, tokenized_reviews.

Configuring Model Training

We can use Cortex to automatically train the Transformer model that we defined earlier. We’ll specify an 80/20 training/evaluation split, 250,000 training steps with a batch size of 16, and a single GPU to run on. We’ll also use tokenized_reviews as a training feature and pass in the vocab and max_length values defined earlier.
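To put those numbers in perspective, the sketch below performs a deterministic 80/20 split and computes how many passes over the training set the step count implies. It assumes that all 50,000 labeled reviews of the Large Movie Review Dataset are used, which this guide does not state; both helper functions are hypothetical and unrelated to how Cortex splits data internally.

```python
import random


def train_eval_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows deterministically and split it into train and eval lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


def epochs_covered(training_steps, batch_size, num_training_examples):
    """Number of passes over the training set implied by a step count and batch size."""
    return training_steps * batch_size / num_training_examples


# Assumption: the dataset holds 50,000 labeled reviews, represented here by their row ids.
train_rows, eval_rows = train_eval_split(range(50_000))
epochs = epochs_covered(250_000, 16, len(train_rows))
print(f"{len(train_rows)} training rows, {len(eval_rows)} eval rows, {epochs:.0f} epochs")
```

Under that assumption, 250,000 steps at a batch size of 16 process 4,000,000 examples in total.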

Configuring Prediction Serving

We can configure Cortex to automatically serve the model as a JSON API once it’s trained. Cortex ensures that our service is highly available, and performs rolling updates when we’re ready to update our model. We can adjust the number of replicas based on the throughput demands of our API.

Ship it!

Now that the pipeline is configured from data ingestion to model serving, we are ready to run cortex deploy to execute the pipeline on our cluster. The deployment starts with Spark workers executing the ingestion, aggregation, and transformation we defined above. After the transformations finish, the results are exported to TFRecord files and training is started. Training should take around a day for this model on a single AWS p2.xlarge instance. We expect a sequence accuracy of around 90% given the hyperparameters we’ve chosen.

Once our application is deployed, we can query our JSON API:

$ curl -k \
-H "Content-Type: application/json" \
-d '{ "samples": [ { "review": "I love TensorFlow!" } ] }' \
{"classification_predictions": [{"predicted_class_reversed":"pos"}]}

And voilà, our sentiment analysis microservice is up and ready to roll out!
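On the client side, the request body and the response shown above are ordinary JSON, so they can be built and parsed with Python's standard library. The sketch below covers only the serialization and parsing; it does not send a request, and both helper names are hypothetical.

```python
import json


def build_request_body(reviews):
    """Serialize reviews into the request shape used in the curl example."""
    return json.dumps({"samples": [{"review": review} for review in reviews]})


def parse_predictions(response_text):
    """Extract the predicted labels from a response shaped like the one above."""
    payload = json.loads(response_text)
    return [p["predicted_class_reversed"] for p in payload["classification_predictions"]]


body = build_request_body(["I love TensorFlow!"])
labels = parse_predictions('{"classification_predictions": [{"predicted_class_reversed":"pos"}]}')
```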

Running this Yourself

Cortex is open source and free to download. The full sentiment analysis example code can be found here.