Checklist for NLPOps — Bringing an NLP Model to Production

Larissa Haas
6 min read · Apr 27


This blog post is based on a talk that Jonathan Brandt and I gave at PyConDE and PyData Berlin 2023. The recording is currently being processed and will be available soon.

In past NLP projects, we experienced how differing and changing requirements can alter the options you have for deploying your model in the end. Throughout the machine learning process you face decisions, and depending on their outcomes, your deployment options change.

I don’t know whether the term NLPOps exists, but if it doesn’t, I am coining it as of today. In this article, I will show you an NLPOps checklist that helped us during the last projects and clarified many questions about deployment. The checklist also made the reasons for the deployment requirements precise for the customer.

Why is the topic of NLPOps so important?

With the emergence of transformer models in the NLP area, the size of state-of-the-art models began to increase quickly. Big machine learning models have always existed, in NLP and elsewhere, but they were isolated exceptions among models of manageable size.

When I started my career as a Data Scientist in 2019, BERT and transformer models were the hottest topics. Research and development were on fire from then on, and new models emerged quickly. Right now, we are talking about GPT-4, which exceeds the size of those first language models by an unbelievable factor.

I’m not planning to deploy ChatGPT or something similar by myself, and that is not the reason behind this article. But you can see the tendency for NLP models in general to become bigger and bigger over time. With this, the complexity and the size of the challenges in deployment will rise as well.

Emergence of large language models over the years (data source)

Journey to NLPOps

As I already said: During your journey of developing NLP models, you are making many decisions that can impact your choice of deployment later. Those decisions might change as your project and development evolve, but you still should be aware of the implications.

Starting with the use case in general: What are we talking about? A service? A testing frontend? A multi-language model? All together?

You might do further brainstorming: To solve this use case, what tools do I have? Which restrictions do I have? What are my possibilities in general?

Then you will collect more requirements: Do I need regular retraining? What does my model need to do? Do I need to fit in some standards/infrastructure?

Finally, you might get some data: What does my training data look like? Do I need data augmentation? Do I need to replicate specific preprocessing steps during inference as well?

And then you can train a model, but what kind of model do I need? How big do I expect it to be? Can I apply pruning etc.?
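A back-of-the-envelope way to answer "how big do I expect it to be?" is to multiply the parameter count by the bytes per parameter. The sketch below is only an estimate of the raw weights: it ignores activations, the tokenizer, and runtime overhead. The function name `weight_size_mb` is made up for this example, and the roughly 110 million parameters used for BERT-base is the commonly cited, rounded figure.

```python
# Rough size of the raw model weights: parameters x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_mb(num_params: int, dtype: str = "float32") -> float:
    """Return the approximate weight size in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Rounded parameter count, used here only for illustration.
bert_base_params = 110_000_000

for dtype in BYTES_PER_PARAM:
    size = weight_size_mb(bert_base_params, dtype)
    print(f"BERT-base weights as {dtype}: ~{size:.0f} MB")
```

Even this crude number helps early on: a few hundred megabytes of weights already rules out some very small deployment targets, while lower-precision weights may bring them back into reach.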

And then we start working on deployment, don’t we?

NLPOps Checklist

But this is not true. All the decisions you made before influence your deployment options. Take this checklist as a guideline for asking the right questions right at the beginning of your project. Unfortunately, the checklist will not result in a rule like “if this, then that.” Instead, it will make some options more or less likely. You will always have to balance choices and weigh pros and cons.

  • Model Type: This is probably the most crucial factor in determining how you deploy. With a small, “plain” NLP setup, such as TF-IDF and a simple classifier, you can deploy nearly anywhere. It gets more complicated with a more extensive setup, for instance, with spaCy and large language model dependencies. Once we are talking about transformers, you run into problems very quickly. Factors you should consider: model size, the need for GPU support, etc.
  • Languages: Several deployment options are possible depending on the amount of training data you have and the number of languages you need to support. For example, if you have enough training data to train a model for each language, it can make sense to duplicate micro-services, which means one service for each language. If you do not have enough training data, you might choose a multi-language approach. This leads to a single service, but that service can be bigger and cause deployment problems.
  • Size of Dependencies: Big Python dependencies, like PyTorch or TensorFlow, can lead to problems when you want a sleek, fast-starting micro-service. There are options to reduce the size of those dependencies, for example, using the smaller (CPU) versions of those packages, or getting rid of them by packaging your model with ONNX.
  • Retraining Frequency and Location: If your use case requires regular retraining, you need to ensure that retraining is also possible at your deployment location. Or you need to ensure you can exchange newly trained models quickly and in an automated process.
  • Request Frequency and Availability: Make sure you know the load your service needs to handle and the availability requirements. Some services are used in batch mode, which leads to high demand at specific points in time and little in between. Other services are requested constantly, so a load balancer and automated scaling would be a good idea. Some cloud providers and deployment locations already provide such functionality, so it is worth comparing those options with the effort of building it yourself.
  • Response Times: Strongly connected to the last point, you must also be clear about response times. When your service communicates directly with human beings, it might be a good idea to have the response times as low as possible. This leads to different requirements for the deployment than a nightly job that can take hours to run.
  • Running Perspective: Is your project perceived as a PoC, or do you plan to work on it for a longer time? If there is a longer-term perspective, consider that requirements may change and grow in number, and that models tend to grow bigger the more training data you have and the more cases you want to cover. So plan ahead and leave enough room for your model to grow and evolve.
  • Monitoring / Security: You should also clarify your requirements for monitoring and security. Is your data allowed to leave the company infrastructure? Do you need an alerting system and dashboards with uptimes and health checks? This might also change your options for deployment. Some cloud providers come with very handy prebuilt solutions here; for others, you might want to build something yourself. The company may also already have a monitoring setup that you can simply reuse.
  • Running Context / Existing Infrastructure: As mentioned in the last point, our services almost never run in isolation. They always have some context, and this context can have a significant impact on our decisions. For example, other services that are tightly coupled to your service may already run on a specific infrastructure, so your service should run in the same environment. Or you need to take care of data privacy issues, which only allow you to deploy within your company’s network or on providers that have an existing contract with your company.
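For the request frequency and response time points in the list above, it helps to replace gut feeling with numbers early on. The following is a minimal sketch, using only the Python standard library, that times repeated calls to a prediction function and reports median, 95th percentile, and maximum latency. `measure_latency_ms` and `dummy_predict` are hypothetical names for this example; in a real project you would pass in your own model call.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency_ms(
    predict: Callable[[str], object], texts: Sequence[str], warmup: int = 3
) -> Dict[str, float]:
    """Call predict once per text and return latency statistics in milliseconds."""
    for text in texts[:warmup]:
        predict(text)  # warm-up calls are not timed

    timings = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        timings.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(timings, n=20, method="inclusive")[18]
    return {"p50": statistics.median(timings), "p95": p95, "max": max(timings)}


def dummy_predict(text: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "long" if len(text.split()) > 3 else "short"


sample_texts = ["a short text", "a somewhat longer example text"] * 50
stats = measure_latency_ms(dummy_predict, sample_texts)
print({name: round(value, 4) for name, value in stats.items()})
```

Running the same measurement on a laptop, in a container, and at the candidate deployment location gives you comparable numbers to hold against the response times your users actually need.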

I hope you got a feeling for all the points on this checklist and how they can help you ask the right questions right at the start of the project. Please note that this checklist is incomplete, so please contact me if you have additions! And please note: Do not use this as a fixed set of rules; there is no “if this, then that” rule. You can return to your checklist whenever requirements change and re-evaluate your options.
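One way to make "returning to the checklist" practical is to keep the answers next to the code. The sketch below is just one possible representation, not a format we prescribe: a dataclass with one free-text answer per checklist item, plus two helpers that show which items are still open and which answers changed since an earlier snapshot. The class name, the field names, and the example answers are all made up for illustration. It deliberately contains no decision rules, in line with the point above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class NLPOpsChecklist:
    """One free-text answer per checklist item; None means not clarified yet."""

    model_type: Optional[str] = None
    languages: Optional[str] = None
    size_of_dependencies: Optional[str] = None
    retraining: Optional[str] = None
    request_frequency_and_availability: Optional[str] = None
    response_times: Optional[str] = None
    running_perspective: Optional[str] = None
    monitoring_and_security: Optional[str] = None
    running_context: Optional[str] = None

    def open_items(self) -> List[str]:
        """Names of checklist items that still need an answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def changed_items(self, earlier: "NLPOpsChecklist") -> List[str]:
        """Names of items whose answer differs from an earlier snapshot."""
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) != getattr(earlier, f.name)
        ]


# Example answers are invented for this sketch.
kickoff = NLPOpsChecklist(model_type="TF-IDF + linear classifier", languages="German only")
later = NLPOpsChecklist(model_type="multilingual transformer", languages="German, English")

print(kickoff.open_items())          # items nobody has answered yet
print(later.changed_items(kickoff))  # items to re-evaluate for deployment
```

Whenever `changed_items` is not empty, that is the signal to sit down with the customer again and walk through the affected deployment options.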

Read more

If you want to learn more about how we used this checklist in two of our use cases, look at our talk or the slides. There are also plenty of other resources linked for you below.



Larissa Haas

Data Scientist with focus on NLP and conversational AI @sovantaAG. Co-Creator of Likes to chat about AI, SciFi, bots gone rogue or the Art of Python