Productionizing NLP Models

Pratik Bhavsar
Published in Modern NLP
12 min read · Sep 10, 2019


Photo by RYU Wongs on Unsplash

→ ML in production series

Come join Maxpool — A Data Science community to discuss real ML problems!

Problem statement 💰

Lately, I have been consolidating my experiences of working on different ML projects. I will tell this story through the lens of my recent NLP project to classify phrases into categories, a multiclass single-label problem.

Central embedder architecture for NLP

Team structure 👪

Building AI teams is quite tricky. If you don’t have the skillsets inside your company, you have to plan hiring. Since every project has a start and end time, it’s difficult to have the entire team from the start. Luckily, we had most of the people for the project, and our squad consisted of the following members.

  • Product owner (1): Sets the requirements of the project
  • Project manager (1): Takes care of the project planning and tech issues
  • Scrum master (1): Ensures Agile execution and resolves impediments
  • Data analysts (2): Transfer domain knowledge and assist the data scientists in gathering data from various data stores
  • Data scientists (2): Build the data pipeline and the ML POC, and handle software engineering and deployment planning
  • Devops/Python developer (1): Designs and builds deployment pipelines, and handles Python software engineering, server sizing, the serverless pipeline and retraining of models

Data 📊

This was an NLP project and the data was present in an RDBMS database. To be frank we were lucky and didn’t have to do much to get the training data. Just a few joins here and there. The query ownership belongs to the data team we work with while the data pipeline is created by the data scientists.

If you do not have the training data, you might have to follow one of the below routes.

Creating training data can take time, and it’s good to have an annotation tool. If you do not have one, you can try the tool by the spaCy team for text data.

There is also another tool Doccano which is open source.

History optimisation

As we were training the models, we realised we might not need all the data we had, which happened to be 5 years in our case. We tried modelling with different amounts of history and found 3 years to be enough.

Having the least possible history without sacrificing the metric allowed us to train models faster and learn recent patterns better.


After many iterations with classical and deep learning, we decided to go with a word-embedding-based (feature extractor + head) approach for our classification task.

We also had to deal with imbalance in the data and tried many techniques.

Metric 🙈

Since we were dealing with a multiclass (~190 classes) imbalanced dataset, we selected weighted F1 as the metric. It weights each class's F1 score by the number of samples in that class, and it is easy to understand.
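To make the metric concrete, here is a minimal pure-Python sketch of weighted F1 as it is commonly defined (per-class F1 averaged with class support as the weight). The function name and the toy labels are illustrative assumptions, not code from the project; in practice a library implementation would normally be used.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is > 0 because n >= 1
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Toy example (hypothetical labels): class "a" has 3 samples, class "b" has 1
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
score = weighted_f1(y_true, y_pred)
```

Here class "a" gets an F1 of 0.8 and class "b" an F1 of 2/3, and the final score leans towards class "a" because it has three times the support.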

Software engineering 👀

Project structure


│   ├───classifier-a
│   ├───classifier-b
│   └───classifier-c
│       ├───data (common sample training data)
│       ├───preparation
│       ├───modelling
│       ├───evaluation
│       └───final

├───tc (acronym of text_classifier; contains core modules)
│   ├───config
│   ├───nlp
│   └───utilities

│   ├───classifier-a
│   ├───classifier-b
│   └───classifier-c

│       base.yaml
│       cpu.yaml
│       gpu.yaml
│       build.yaml

│   └───terraform

It’s important to fix the project structure at the beginning so that the code evolves in a structured way. We took considerable time and had many discussions before we converged. Have a look at this to start with a basic scaffold.

This is how we train models on AWS EC2/local and back up code, data, models and reports on AWS S3. The directory structure is created automatically by the preparation and train classes.

└───region (we have models trained on many regions)
    ├───model-a (model for predicting a)
    ├───model-b (model for predicting b)
    └───model-c (model for predicting c)
        └───2019-08-01 (model version as per date)
            ├─── backup)
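A layout like this can be derived from three values: region, model name and training date. The sketch below shows one way to build such a prefix; the function name and the "emea" region are hypothetical and only illustrate the region/model/date convention described above.

```python
import posixpath
from datetime import date


def backup_prefix(region: str, model: str, version: date) -> str:
    """Build a 'region/model/YYYY-MM-DD' key prefix for one training run."""
    # posixpath always joins with forward slashes, which is what object-store keys use
    return posixpath.join(region, model, version.isoformat())


# Hypothetical values for illustration only
prefix = backup_prefix("emea", "model-a", date(2019, 8, 1))
```

Because the date is in ISO format, listing the versions of a model in lexical order also lists them in chronological order.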

While doing the POCs, we have no idea which modules will be part of the final solution, so we pay little attention to modularity or reusability. But as soon as we are done with the POC, we should consolidate the final code into notebooks and keep them in /notebooks/final. We had one notebook for preparation steps and another for modelling.


These notebooks also became our presentation material.

Inheritance/imports ⏬

We wrote the training classes so that they can be reused by the predict classes. So every time we change the preprocessing or encoding steps, we only change them in the training class.

Class imports
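As a rough illustration of this reuse, the sketch below has a predict class inherit the cleaning and label-encoding steps from a training class. All class and method names are assumptions made for the example, not the project's actual code.

```python
class TrainPipeline:
    """Owns preprocessing and label encoding (hypothetical names)."""

    def clean(self, text: str) -> str:
        # Lowercase and collapse whitespace; the single source of truth for cleaning
        return " ".join(text.lower().split())

    def fit_encoder(self, labels):
        self.classes_ = sorted(set(labels))
        return self

    def encode(self, labels):
        lookup = {label: i for i, label in enumerate(self.classes_)}
        return [lookup[label] for label in labels]


class PredictPipeline(TrainPipeline):
    """Reuses clean() from the training class and only adds decoding."""

    def __init__(self, classes):
        self.classes_ = list(classes)

    def decode(self, class_id: int) -> str:
        return self.classes_[class_id]


trainer = TrainPipeline().fit_encoder(["tools", "pipes", "tools"])
predictor = PredictPipeline(trainer.classes_)
```

A change to clean() in TrainPipeline is picked up by PredictPipeline automatically, so training and inference cannot drift apart.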

Inference class

Our inference modules use the predict class along with certain checks on the data for the failure cases such as empty strings. We also save the inferences to a central PostgreSQL inference database.

Our router is a simple flask router with methods for different models. All the important exceptions are caught and returned with appropriate messages.
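The checks and error messages can be sketched without any web framework. In the example below, predict_with_checks and the response fields are hypothetical; predict_fn stands in for the predict class and is assumed to return a (label, probability) pair.

```python
def predict_with_checks(phrase, predict_fn):
    """Validate the input, call the model and always return a structured response."""
    # Failure case: missing, non-text or whitespace-only input
    if not isinstance(phrase, str) or not phrase.strip():
        return {"status": "error", "message": "empty or non-text input"}
    try:
        label, probability = predict_fn(phrase.strip())
    except Exception as exc:  # report the failure to the caller instead of crashing
        return {"status": "error", "message": f"prediction failed: {exc}"}
    return {"status": "ok", "prediction": label, "probability": probability}


def dummy_model(phrase):
    # Stand-in for a real model: fixed answer, for illustration only
    return "hardware", 0.9


def broken_model(phrase):
    raise RuntimeError("model not loaded")
```

A router method would then only need to pass the request text to this function and serialise the returned dictionary.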

Inference database

To analyse the models in production, we save every inference along with details such as input values, predicted values, model version, model type and probability.
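A table for such records might look like the sketch below. It uses the standard library's sqlite3 with an in-memory database purely as a stand-in for PostgreSQL, and the table name, column names and sample values are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema covering the fields mentioned above
SCHEMA = """
CREATE TABLE IF NOT EXISTS inferences (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT NOT NULL,
    model_type TEXT NOT NULL,
    model_version TEXT NOT NULL,
    input_text TEXT NOT NULL,
    predicted_label TEXT NOT NULL,
    probability REAL NOT NULL
)
"""


def log_inference(conn, model_type, model_version, input_text, label, probability):
    """Insert one inference record using a parameterised query."""
    conn.execute(
        "INSERT INTO inferences "
        "(created_at, model_type, model_version, input_text, predicted_label, probability) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), model_type, model_version,
         input_text, label, probability),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
conn.execute(SCHEMA)
log_inference(conn, "model-a", "2019-08-01", "steel pipe", "hardware", 0.93)
```

Storing the model version with every row makes it possible to compare versions later when building performance reports.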

One of our next steps is to create APIs for creating reports on ML performance.

Design patterns 🐗

Singleton pattern to initialize the embeddings once and use the same object for different models. This reduces memory usage on EC2.

Factory pattern to initialize model training classes with different configs.
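The two patterns above can be combined as in the following sketch: a singleton embedder that is loaded once, and a factory that picks a training class from a config and hands it the shared embedder. The class names, the "head" config key and the fake embedding are all hypothetical.

```python
class Embedder:
    """Singleton: the (heavy) embedding model is loaded only once per process."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            instance = super().__new__(cls)
            instance.load_count = 0
            instance._load()
            cls._instance = instance
        return cls._instance

    def _load(self):
        # Stand-in for loading large model weights
        self.load_count += 1
        self.dim = 4

    def embed(self, phrase):
        # Fake embedding, for illustration only
        return [float(len(phrase))] * self.dim


class DenseHeadTrainer:
    def __init__(self, config, embedder):
        self.config = config
        self.embedder = embedder


class LinearHeadTrainer(DenseHeadTrainer):
    pass


# Factory: map a config value to the training class to instantiate
TRAINERS = {"dense": DenseHeadTrainer, "linear": LinearHeadTrainer}


def make_trainer(config):
    trainer_cls = TRAINERS[config["head"]]
    return trainer_cls(config, Embedder())


trainer_a = make_trainer({"head": "dense", "epochs": 5})
trainer_b = make_trainer({"head": "linear", "epochs": 10})
```

Both trainers hold a reference to the same Embedder object, so adding more models does not multiply the memory taken by the embeddings.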

Decorator pattern

  1. A decorator to time functions to understand which ones take more time.
  2. A decorator to retry DB queries if they fail. This makes data fetching resilient to transient errors so that a single failed query doesn’t fail the training pipeline.
  3. A decorator for Splunk logging of start-end of function execution. We save logs on Splunk as well as AWS Cloudwatch.
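The first two decorators can be sketched with the standard library as follows. The names timed, retry and fetch_rows are illustrative assumptions, and the standard logging module stands in for the Splunk/Cloudwatch logging mentioned above.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def timed(func):
    """Log how long each call to func takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            logger.info("%s took %.3fs", func.__name__, time.perf_counter() - start)
    return wrapper


def retry(times=3, delay=0.0, exceptions=(Exception,)):
    """Call func up to `times` times, re-raising the error after the last attempt."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"n": 0}


@timed
@retry(times=3, exceptions=(ConnectionError,))
def fetch_rows():
    # Hypothetical flaky query: fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient DB error")
    return ["row"]
```

Stacking the decorators this way times the whole operation including retries; swapping their order would time each attempt separately.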

Scalability 🌀

From the beginning, we wanted to develop the codebase so that it could be used with different data. So we parameterised everything through configs, both for input data and for model hyperparameters.

Refactoring 🐵

After we were done with the project, we had many common utilities that could be reused in other projects.


Numbers are in % of total project time. This can vary for projects.

Innersourcing allows an ecosystem of contributors to develop and use reusable components for everyone. We observed that good software engineering takes way more time than doing a POC. By creating libraries, the developers and data scientists can now focus on developing and deploying models faster.

Removing common utilities also made the project code-base lighter and easier to understand.

Deployment 🐙

AWS Infrastructure

  • S3
  • EC2
  • ECR
  • ECS
  • Cloudwatch

We use Conda, Docker, Terraform, Jenkins, ECR, ALB and ECS for our deployment pipeline.

Environments 🛠

After much experimentation and debate, we chose to take care of all python dependencies of pip/conda for cpu/gpu on windows/linux through 4 yml configs.

  • base.yml → All non-DL packages installed via pip and conda
  • cpu.yml → Tensorflow cpu install through pip (since pip will not install cuda toolkit and cuDNN. This keeps our env light)
  • gpu.yml → Tensorflow gpu install through conda (since conda takes care of cuda toolkit and cudnn)
  • build.yml → Extra packages required for serving the model, installed via pip and conda. We use gunicorn for serving the model (gunicorn is not available for Windows, so we install it only in our Linux Docker env for production)

Local env for testing code

conda env create -f env/base.yml
conda env update -f env/cpu.yml

Docker/EC2 env for training models

conda env create -f env/base.yml
conda env update -f env/gpu.yml

Docker/EC2 env for serving models on cpu instances (through Docker containers)

conda env create -f env/base.yml
conda env update -f env/cpu.yml
conda env update -f env/build.yml

Every time we start using a new package, we add it manually to the yml. We tried pipreqs and conda env export --no-builds for exporting packages automatically, but they also exported a lot of dependencies and package build info, which made our env files look dirty. By adding packages manually, we are sure which packages are in use, and we also removed some unused packages after the POC.

Initially, we were using AllenNLP for generating embeddings, and installing it added many packages to our env. Since we are using Keras for modelling, we decided to switch completely to the TensorFlow ecosystem and get the model from tensorflow-hub instead.

Load tests 💥

Initially, load testing was pretty straightforward. We optimised serving for these parameters by testing our load case using JMeter.

  • No. of tasks in ECS
  • No. of gunicorn workers in the task
  • No. of threads per worker

We gave a lot of thought to autoscaling, which can be triggered by mean/max RAM usage, mean/max CPU usage or the number of API calls. Nothing worked for us, as we didn’t want to waste EC2 resources by reserving headroom for autoscaling. Without reserved headroom, scaling out requires creating a new EC2 instance, which takes 1–5 minutes. During that time all the requests still go to the existing service and nothing is sent to the new task on the new EC2 instance.

We also considered AWS Fargate, but it was about 2x as costly as EC2.

The only thing which made sense at the end was to allocate full CPU to the task and half the RAM. RAM is needed for autoscaling and so we kept space for 1 more task to be deployed but made sure not to waste CPU as it was the bottleneck.

We selected AWS t3 instances instead of t2 for their default burstable behaviour which helps us use the accumulated credits.

Cost optimisations 🔥


As you may know, unlike word2vec and GloVe, which are fixed-vocabulary, non-contextual embeddings, language models like ELMo and BERT are contextual and do not have a fixed vocabulary. The downside is that the word embeddings need to be computed by the model every time. This became quite a problem for us, as we saw heavy CPU spikes due to model processing.

Since our text phrases had an average length of 5 and were repetitive in nature, we cached embeddings of the phrase to avoid re-computations. By just adding this small method to our code we got a 20x speedup 🏄
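A minimal sketch of this kind of phrase-level caching, assuming Python's functools.lru_cache, is shown below. embed_with_model is a hypothetical stand-in for the expensive language-model call, and the counter exists only to show how often the model is actually invoked.

```python
from functools import lru_cache

stats = {"model_calls": 0}


def embed_with_model(phrase: str) -> tuple:
    """Stand-in for the expensive language-model call (hypothetical)."""
    stats["model_calls"] += 1
    # Fake embedding: a tuple, so the cached value is immutable
    return tuple(float(ord(ch)) for ch in phrase[:3])


@lru_cache(maxsize=50000)
def embed(phrase: str) -> tuple:
    # Repeated phrases are served from the cache and never reach the model
    return embed_with_model(phrase)
```

Calling embed("steel pipe") twice runs the model once; embed.cache_info() reports the hits and misses, which is useful when checking how repetitive the real traffic is.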

Cache size optimisation

An LRU (least recently used) cache looks entries up in roughly constant time, but every cached embedding costs memory, so smaller is better. We also want to cache as many phrases as possible, so bigger is better. This meant we had to optimise the cache maxsize empirically. We found 50000 to be the sweet spot for us.
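One way to approach this empirically is to replay a sample of traffic against caches of different sizes and compare hit rates. The sketch below does this on synthetic, skewed traffic; the vocabulary, the weights and the candidate sizes are made-up assumptions, and real request logs would replace them.

```python
import random
from functools import lru_cache


def hit_rate(phrases, maxsize):
    """Replay phrases through an LRU cache of the given size and return its hit rate."""
    @lru_cache(maxsize=maxsize)
    def embed(phrase):
        return len(phrase)  # stand-in for the real embedding call

    for phrase in phrases:
        embed(phrase)
    info = embed.cache_info()
    return info.hits / (info.hits + info.misses)


# Synthetic traffic: 1000 distinct phrases, with a few of them very frequent
rng = random.Random(0)
vocabulary = [f"phrase-{i}" for i in range(1000)]
weights = [1 / (rank + 1) for rank in range(1000)]
traffic = rng.choices(vocabulary, weights=weights, k=20000)

rates = {size: hit_rate(traffic, size) for size in (10, 100, 1000)}
```

The hit rate never decreases as maxsize grows, so the practical question is where the curve flattens relative to the memory each extra entry costs.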

Revised load testing method

Once caching was in place, we could no longer load test with just a few samples, as the cache would make them compute-free. Hence we had to define variable test cases to simulate the real text samples. We did this with a Python script that creates request samples, and tested with JMeter.
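Such a script could look roughly like the sketch below, which generates varied phrases and writes them as one-column CSV that a load-testing tool can read as test data. The word list, phrase lengths and function names are assumptions; realistic samples would be drawn from production-like text.

```python
import csv
import io
import random


def make_samples(words, n_samples, min_len=3, max_len=7, seed=0):
    """Create varied phrases so that the cache cannot answer every request."""
    rng = random.Random(seed)
    return [
        " ".join(rng.choices(words, k=rng.randint(min_len, max_len)))
        for _ in range(n_samples)
    ]


def write_csv(file_obj, samples):
    """Write one phrase per row to an open text file object."""
    writer = csv.writer(file_obj)
    for sample in samples:
        writer.writerow([sample])


# Hypothetical word list for illustration
words = ["steel", "pipe", "copper", "wire", "valve", "pump", "filter", "gasket"]
samples = make_samples(words, n_samples=5)

buffer = io.StringIO()  # a real run would use open("samples.csv", "w", newline="")
write_csv(buffer, samples)
```

Fixing the seed keeps a load test reproducible, while changing it produces a fresh set of phrases that the cache has not seen.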

Central embedder architecture 💢

Finally, when we were scaling from 3 to 21 models, we had to think about how to make this robust yet cost-effective. The language model was turning out to be a heavy component, while the text cleaner and the feed-forward head model were light on compute.

Since the language model was common for all models, we decided to make a separate service to be used by all models. This led to a heavy cost reduction for us 🙌

Thanks to Han Xiao’s BERT as a service for the inspiration.

Central embedder architecture

Currently, we are also thinking about serving models with AWS Lambda to get rid of the infrastructure.

Learnings 😅

After the completion of a project, you really wish you had done some things better and hadn’t done some at all. A few suggestions that come to mind are:

  • Retraining of models: Avoid hand-tagging data for a project, otherwise you have to create it again every time you retrain models. Instead, include the creation of tagged data in your training workflows: understand how raw/collected data is saved and write scripts that create training data from new data when retraining. You can also leverage semi-supervised learning if you do not have the above options.
  • Model compression: If the latency of your neural network model exceeds your requirement, you can use pruning and quantisation to make it faster.
  • Check for biases using this framework by MIT:

Historical bias: because the data distribution can change with time

Representation bias: when certain parts of the data are under-represented

Measurement bias: when the available labels are only proxies for the real labels

Aggregation bias: when the same model is used for different datasets

Evaluation bias: when the test data doesn’t match the real-world data

  • Interpretability: Use libraries like eli5 to understand model predictions and biases
How eli5 can explain scikit pipeline predictions

Till next time… 🏃

All in all, I wanted to throw light on the other elements of data science which also play a critical role in the ML pipeline. If you want to know more about how others build their ML pipelines, you can check out my other story.

Let me know if you have any solutions, ideas or feedback :)


I have now created a checklist to keep me on the right track. It is easy to get lost in the hustle.

Modelling checklist 📘

  1. What is our model metric and business metric? Are they the same or different?
  2. Will more data improve the metrics? Can we get more data?
  3. Have we used fp16 and multi-GPU for reducing training time? Have we optimised batch size and tried one_cycle_fit for reducing training time? Are we using Adam, RAdam, Ranger or a new optimiser?
  4. If the problem was solved with deep learning, have we tried enough classical approaches? What’s the difference in metric and inference time between classical and DL?
  5. Are the model hyperparameters tuned manually or algorithmically? Will we need a hyperparameter tuning layer for retraining? Is the data changing rapidly with time, and will the current model parameters be enough down the line?
  6. What is the difference between train, validation and test metrics?
  7. Have the data scientists and domain experts done error analysis?
  8. Can we try interpretability? Have we tried interpretability on errors?
  9. Is there a pattern in the mistakes of the model? Can it be solved with a postprocessing layer or a new feature?
  10. Is manual intervention needed after prediction and client usage? How can it be reduced?

Deployment checklist 📗

  1. Have we backed up code, data and metrics with the model and encoders?
  2. Have we checked opportunities for caching?
  3. Have we defined a realistic load test as per traffic? If the peak load seems rare, can we design for mean traffic and let failed requests be retried?
  4. Are we serving with flask, WSGI, uWSGI or gunicorn?
  5. Have we done cache, worker and thread sizing? (Do not blindly go with gunicorn’s 2n+1 advice on workers. Test everything empirically)
  6. Have we tackled all edge cases for incoming text or numerical fields in request?
  7. Are the fields in inference DB kept in accordance with the incoming data? Can inference DB error cause response error?
  8. Are we sending appropriate messages/flags in response to debug errors coming from the client?
  9. Should we deploy on CPU or GPU? Which components require GPU?
  10. What are the commonalities or bottlenecks in the prediction pipeline? Can we isolate them out?
  11. Can we keep all models in a single docker image? (We have 22 working on 8 different ECS. We are wondering if we should bring them all to one ECS. It’s a tradeoff between flexibility, simplicity and cost reduction)
  12. Can we go serverless for deployment? (We have a variable load through day, week and year with moderate traffic. We are wondering how to exploit serverless.)
  13. Do we have a model rollback plan?
  14. Is data going to change or increase with time? Is retraining of models needed? Have we planned it in the total project planning?
  15. Is the code and deployment pipeline flexible enough to be used for retraining with minimal changes?
  16. How will you analyse the model performance in production? How frequently will the reports be generated?

Subscribe to Modern NLP for latest tricks in NLP!!! 😃


