DS in the Real World

Serving GPT-2 in production on Google Cloud Platform

A CloudOps Journey

Lukas Batteau
Feb 14, 2020


Have You Tried Turning It Off And On Again?

Our mission at Deepdesk is to unburden contact centers by applying AI. We provide real-time response recommendations (think Smart Compose) and automation of repetitive dialogs.

We do this by training machine learning models on actual conversations. First, we extract frequently used answers by running a clustering algorithm over the dataset. Then we train a neural network to predict the best answer, given the conversation up to that point.
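The extraction step could look roughly like this. This is a minimal sketch, not our actual pipeline: the sample answers, the TF-IDF vectorizer, and the cluster count are all illustrative.

```python
# Sketch: extract "frequently used answers" by clustering agent responses.
# Data, vectorizer, and cluster count are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

answers = [
    "How can I help you today?",
    "How may I help you?",
    "Your order has been shipped.",
    "The order was shipped yesterday.",
    "Thank you for contacting us, goodbye!",
    "Thanks for reaching out, bye!",
]

# Turn each answer into a TF-IDF vector.
X = TfidfVectorizer().fit_transform(answers).toarray()

# One cluster per answer "template".
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Use the answer closest to each cluster center as the canonical response.
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
frequent_answers = [answers[i] for i in closest]
print(frequent_answers)
```

A recommender trained downstream then only has to rank this small canonical set, rather than generate free text.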

This has worked quite well for us so far. We see reductions in typed text (a.k.a. time, a.k.a. money) of 8% to 15% on average, and even up to 30% for some contact center agents.


Although these are great numbers, there is a limitation to our approach. Because we use a predetermined set of frequently used answers, most of our recommendations are used at the beginning and end of the conversation, where variability is relatively low. In the middle, where variability is highest, there is room for growth.

In a shocking finding…

This past month, we have been looking into whether Transformer models could add value on top of our current approach. It's hard not to be blown away by the GPT-2 unicorn story, right? And the promise of language-generation models seems like a perfect fit for us: offering text-completion suggestions anywhere in the conversation.

It comes at a cost, though. Transformer models are huge, and inference (getting model predictions) in production really requires GPUs.

After a month of experimenting, we had a fine-tuned GPT-2 model that was worth trying out in production. Hopefully we can share our data science learnings in a separate blog post. Here I want to share our journey of deploying the model to production.


Up until now, we have been happily running our platform in a Docker container on Heroku. I can recommend it to any team with low DevOps capacity. We serve 50 million requests per month, and it's been a breeze. Although Heroku gets pricey quickly, it's nowhere near the cost of extra DevOps engineers.

A challenge when serving ML models in production, though, is the sheer size of the image. Autoscaling behaves erratically, because it takes minutes to start a new instance. And without extra measures, requests will be routed to instances that are still loading. It's something we have struggled with a lot, and it basically requires more control over the autoscaling and health-check parameters, neither of which Heroku offers. Up until now we've been able to manage, but with our GPT-2 Docker image weighing in at a whopping 6 GB and more, it is now a real problem.

Another requirement for serving GPT-2 is the real blocker, though: we need GPUs. Early tests revealed that CPU-only inference could take seconds, whereas a GPU brought it down by a factor of 10. Heroku currently does not offer GPU dynos, so we had to look elsewhere. Although there are promising platforms like FloydHub, we were already using Google Cloud Platform (GCP) to train our existing models, so it made sense to explore the possibilities there.
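The CPU-versus-GPU gap comes down to where the model's tensors live. A minimal PyTorch sketch of the device fallback, using a small stand-in module rather than the actual fine-tuned GPT-2 (whose weights are not public):

```python
import time
import torch

# Use the GPU when one is available, fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the fine-tuned GPT-2; any torch module moves the same way.
model = torch.nn.Linear(768, 768).to(device).eval()
x = torch.randn(1, 768, device=device)

start = time.perf_counter()
with torch.no_grad():
    y = model(x)
elapsed = time.perf_counter() - start
print(f"inference on {device} took {elapsed * 1000:.2f} ms")
```

The same `.to(device)` pattern applies to a HuggingFace model and its input tensors; forgetting to move either one raises a device-mismatch error at inference time.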

Google Cloud Platform


After ruling out the non-GPU solutions, and preferring to work with Docker containers, we chose Kubernetes. Here is a summary of the stack:

  • GKE cluster of n1-highmem-2 machines with one Tesla T4 GPU each
  • DaemonSet to install the NVIDIA/CUDA drivers
  • NVIDIA PyTorch Docker base image, GPU-ready
  • Simple Flask microservice serving the model
  • Model based on HuggingFace's GPT-2, fine-tuned for our specific tasks
  • Cloud Build to build our production image
  • Cloud Endpoints as a proxy, handling SSL and authentication
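The Flask microservice in the stack above can be sketched roughly as follows. The endpoint name, payload shape, and the stubbed model call are assumptions for illustration, not Deepdesk's actual API:

```python
# Minimal sketch of the serving layer. In production the fine-tuned GPT-2
# would be loaded once at startup; a stub keeps the sketch self-contained.
from flask import Flask, jsonify, request

app = Flask(__name__)

def complete(prompt: str) -> str:
    # Hypothetical stand-in for model.generate() on the fine-tuned GPT-2.
    return prompt + " ..."

@app.route("/complete", methods=["POST"])
def complete_endpoint():
    prompt = request.get_json(force=True).get("prompt", "")
    return jsonify({"completion": complete(prompt)})
```

In the cluster this would typically run behind a production WSGI server such as gunicorn, with Cloud Endpoints in front handling SSL and authentication.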


Stepping into the world of Transformers is quite daunting. Thanks to the work of OpenAI and HuggingFace, though, there are a lot of models and example scripts available. The hardware requirements for serving Transformer models reminded me of the early gaming years, when you spent a lot of time installing the right NVIDIA drivers to get your GeForce to work. Here, too, there are many prebuilt solutions available, like the GCP Deep Learning VM or the NVIDIA Docker images. Kubernetes is not for the faint of heart, and you should expect at least one member of your team to spend a lot of time, if not full time, setting up the configuration and maintaining it.

Special thanks to Tom Hastjarjanto for the GCP overview.