DS in the Real World

Serving GPT-2 in production on Google Cloud Platform

A CloudOps Journey

Lukas Batteau
Feb 14, 2020 · 4 min read

Have You Tried Turning It Off And On Again?

We do this by training Machine Learning models with actual conversations. First we extract frequently used answers by running a clustering algorithm over the dataset. Then we train a neural network to predict the best answer, given the conversation up until that point.

This has worked quite well for us so far. We see reductions in typed text (aka time, aka money) of 8% to 15% on average, and even up to 30% for some contact center agents.


In a shocking finding…

It comes at a cost though. Transformer models are huge, and inference (getting model predictions) in production really requires the use of GPUs.

After a month of experimenting, we had a GPT-2 fine-tuned model that was worth trying out in production. Hopefully we can share our data science learnings in a separate blog. Here I want to share our journey of deploying the model to production.


A challenge when serving ML models in production, though, is the sheer size of the image. Autoscaling behaves erratically because it takes minutes to start a new instance, and without extra measures, requests are routed to instances that are still loading. We have struggled with this a lot, and it basically requires more control over the autoscaling and health check parameters, neither of which Heroku offers. Up until now we've been able to manage, but with our GPT-2 Docker image being a whopping 6GB or more, it is now a real problem.
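One common way to keep traffic away from instances that are still loading is a readiness endpoint that answers 503 until the model is in memory, so the platform's health check only admits the instance once it can serve. The following is a minimal sketch of that idea using only the Python standard library. The `/ready` path, the `ModelHolder` class, and the `slow_load` stand-in are hypothetical names chosen for illustration and are not taken from the actual service.

```python
import threading
import time


class ModelHolder:
    """Loads a model in a background thread and reports when it is ready."""

    def __init__(self, loader):
        self._loader = loader            # callable that returns the loaded model
        self._model = None
        self._ready = threading.Event()

    def start_loading(self):
        thread = threading.Thread(target=self._load, daemon=True)
        thread.start()
        return thread

    def _load(self):
        self._model = self._loader()
        self._ready.set()                # flip to ready only after loading finished

    def is_ready(self):
        return self._ready.is_set()


def make_app(holder):
    """Return a WSGI app whose /ready endpoint reflects the model state."""

    def app(environ, start_response):
        if environ.get("PATH_INFO") == "/ready":
            if holder.is_ready():
                status, body = "200 OK", b"ready"
            else:
                status, body = "503 Service Unavailable", b"loading"
        else:
            status, body = "404 Not Found", b"not found"
        start_response(status, [("Content-Type", "text/plain")])
        return [body]

    return app


def slow_load():
    # Hypothetical stand-in for loading several GB of model weights.
    time.sleep(0.2)
    return object()
```

To try it locally, the app can be served with `wsgiref.simple_server.make_server("", 8080, make_app(holder))` after calling `holder.start_loading()`. The health check is then pointed at `/ready`, which only works on a platform that lets you configure the check path and timing.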

Another requirement for serving GPT-2 is the real blocker, though: we need GPUs. Early tests revealed that CPU-only inference could take seconds, whereas using a GPU brought that down by a factor of 10. Heroku currently does not offer GPU dynos, so we had to look elsewhere. Although there are promising platforms like FloydHub, we were already using Google Cloud Platform (GCP) to train our existing models, so it made sense to explore the possibilities there.

Google Cloud Platform


  • GKE cluster of n1-highmem-2 machines with one Tesla T4 GPU
  • Daemon script to install NVIDIA/CUDA drivers
  • NVIDIA PyTorch Docker base image, GPU ready
  • Simple Flask microservice serving the model
  • Model based on HuggingFace’s GPT-2, fine-tuned for our specific tasks
  • Cloud Build to build our production image
  • Cloud Endpoints as proxy, handling SSL and authentication
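To make the Flask and HuggingFace items in the list above more concrete, here is a rough sketch of what such a microservice could look like. It is an illustration under assumptions, not our actual service: the `/predict` route, the `prompt` and `completion` JSON fields, the `create_app` and `load_gpt2_generator` helpers, and the default `"gpt2"` checkpoint name are all hypothetical. The model stack is imported lazily so the web layer can be exercised without a GPU or the model weights.

```python
from flask import Flask, jsonify, request


def create_app(generate):
    """Build a Flask app around any callable mapping a prompt string to text."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        prompt = payload.get("prompt") if isinstance(payload, dict) else None
        if not isinstance(prompt, str) or not prompt:
            return jsonify(error="'prompt' must be a non-empty string"), 400
        return jsonify(completion=generate(prompt))

    return app


def load_gpt2_generator(model_dir="gpt2", max_new_tokens=20):
    """Load a GPT-2 checkpoint and return a prompt -> completion function.

    model_dir is assumed to be a checkpoint name or a local directory holding
    fine-tuned weights; both it and max_new_tokens are illustrative defaults.
    """
    # Imported here so create_app can be used without torch/transformers.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Use the GPU when one is visible to the container, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir).to(device).eval()

    def generate(prompt):
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            output = model.generate(
                input_ids,
                max_new_tokens=max_new_tokens,
                pad_token_id=tokenizer.eos_token_id,
            )
        # Return only the newly generated tokens, not the echoed prompt.
        new_tokens = output[0, input_ids.shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)

    return generate
```

Starting the service would then be a matter of `create_app(load_gpt2_generator()).run(host="0.0.0.0", port=8080)`, or handing the app to a production WSGI server. Passing the generator in as an argument keeps the HTTP handling testable with a stub function in place of the model.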


Special thanks to Tom Hastjarjanto for the GCP overview.


Unburdening contact centers, one model at a time…
