Published in Eureka Engineering

How to make a GPU inference environment for image category classification production-ready with EKS/Kubernetes

This is the December 20 article for Eureka Advent Calendar 2021.



These days I work on a data lifecycle privacy project, a data platform migration, moderation ML systems, and so on (which makes me wonder whether I am really a Site Reliability Engineer).

Let’s move on to the topic of this article. I would like to share how we set up a production environment for image categorization in our moderation service with EKS. There are many points to cover when making an EKS cluster production-ready, including security, but this time let me focus on the points that are unique to a GPU inference environment.

Target readers

  • People who are interested in how to develop a production environment for GPU inference with EKS/Kubernetes.

How to release a production environment for GPU inference

1. Set up GPU environment with EKS Node/Container

Components involved

  • Nvidia Driver

A driver that provides an interface to use Nvidia GPUs

  • nvidia-container-runtime

A runtime that wraps runc and adds a prestart hook so that containers can access GPUs.

  • nvidia-device-plugin

Exposes the number of GPUs installed on each node of the cluster, enabling Kubernetes clusters to run GPU-enabled containers.

  • CUDA

"CUDA is a parallel computing platform and programming model that makes using a GPU for general purpose computing simple and elegant."

  • PyTorch

The open source machine learning framework we use.

1.1 Set up nodes with the Nvidia Driver and Nvidia container runtime using Amazon EKS optimized accelerated Amazon Linux AMIs and a managed node group (MNG)

  • Use Amazon EKS optimized accelerated Amazon Linux AMIs

"In addition to the standard Amazon EKS optimized AMI configuration, the accelerated AMI includes the following: NVIDIA drivers and nvidia-container-runtime (as the default runtime)."

Kubernetes v1.21 will be the last version with Docker container runtime support.

  • Create the MNG by specifying a GPU instance type and the optimized AMI

This is a Terraform code example.

1.2 Install the Nvidia device plugin on the k8s nodes
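
Once the device plugin is running, each GPU node should advertise the `nvidia.com/gpu` extended resource. As a minimal sketch (not the exact check used for this article), the following Python reads JSON shaped like the output of `kubectl get nodes -o json` and lists the allocatable GPUs per node. The function name and the sample node names are hypothetical.

```python
import json

# Extended resource name that nvidia-device-plugin registers on GPU nodes.
GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(nodes_json: str) -> dict:
    """Map node name -> allocatable GPU count from `kubectl get nodes -o json` output."""
    nodes = json.loads(nodes_json)
    result = {}
    for node in nodes.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are reported as strings; nodes without the plugin omit the key.
        result[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return result


# Hypothetical sample, trimmed to the fields used above.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-a"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-a"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
})

print(allocatable_gpus(SAMPLE))
```

A GPU node that reports 0 here usually means the device plugin pod is not running on it yet, so this is a cheap check to run before deploying GPU workloads.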

1.3 Write the Dockerfile for the CUDA/PyTorch container and the k8s deployment

We needed to build on the version of PyTorch that our Data Scientists use.

In the end, I decided to manage the Python and CUDA environments together in a single base image so that this version of PyTorch can run on it.
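
Because the PyTorch build, the CUDA version, and the node's driver all have to line up, a quick check from inside the container can save some debugging. This is a minimal sketch that assumes PyTorch is installed in the image; the function name `describe_gpu_runtime` is made up for illustration.

```python
def describe_gpu_runtime() -> dict:
    """Report which PyTorch/CUDA combination the current environment sees."""
    try:
        import torch
    except ImportError:
        # Lets the check report clearly in images that do not ship PyTorch.
        return {"torch_installed": False}

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        # True only when a driver and at least one GPU are visible to the process.
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


print(describe_gpu_runtime())
```

If `cuda_build` is set but `cuda_available` is false, the image is probably fine and the problem is more likely on the node side (driver, container runtime, or a missing GPU resource limit on the pod).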

2. Separate CPU inference and GPU inference workloads

We also needed to release this image classification model without impacting the existing text classification workload.

So, by using Node Affinity, CPU models can only be scheduled onto the MNG for CPU, and GPU models can only be scheduled onto the MNG for GPU, using required affinity rules together with GPU resource limits. As a result, we have complete workload isolation!
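
The isolation described above can be sketched as a pod spec fragment. This is an illustrative reconstruction, not our actual manifest: the node group name `gpu-inference`, the container name, and the image are hypothetical, and it assumes the nodes carry the `eks.amazonaws.com/nodegroup` label that EKS puts on managed node group nodes. The fragment is built as a Python dict and printed as JSON, which Kubernetes accepts as a manifest format.

```python
import json


def gpu_pod_spec(nodegroup: str, image: str, gpus: int = 1) -> dict:
    """Build a pod spec fragment pinned to one node group and requesting GPUs."""
    return {
        "affinity": {
            "nodeAffinity": {
                # "required" makes this a hard constraint: the pod stays Pending
                # rather than landing on a node outside the GPU node group.
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "eks.amazonaws.com/nodegroup",
                                    "operator": "In",
                                    "values": [nodegroup],
                                }
                            ]
                        }
                    ]
                }
            }
        },
        "containers": [
            {
                "name": "image-classifier",  # hypothetical container name
                "image": image,
                # GPUs are requested through limits on the extended resource.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }
        ],
    }


spec = gpu_pod_spec("gpu-inference", "example.invalid/image-classifier:latest")
print(json.dumps(spec, indent=2))
```

The CPU workloads would use the same affinity structure with the CPU node group's name and no GPU limit, so neither kind of pod can be scheduled onto the other group's nodes.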

The following picture is an overview.

3. Set up GPU monitoring with Datadog

I used Nvidia NVML to get and visualize GPU metrics with Datadog.

However, the official page alone is not enough to complete the setup and is not very reusable, so I wrote up more details with sample code. Please check it for details:

The official Datadog Helm chart does not support NVML, so you need to provision the agent with a DaemonSet.
Although it is not explained in the manual, you need to specify the environment variable `DD_CLUSTER_NAME` in the Datadog agent configuration for EKS.

The resulting dashboard is as follows.

4. Decide the pod/node scaling strategy (Load Test/Capacity Planning)

  • Use HPA (Horizontal Pod Autoscaler)
  • Decide whether to base the scaling policy on GPU utilization or CPU utilization by load testing the staging environment (load test environment: instance spec: g4dn.2xlarge and deployment count: 1)

I used vegeta and nappa as simple load test tools.

Because the results showed that CPU utilization saturates before GPU utilization does, we decided to configure HPA based on CPU utilization.
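
For context, the Kubernetes documentation describes the core HPA rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below implements only that formula; it leaves out the tolerance band, stabilization windows, and min/max replica bounds that the real controller applies. The utilization numbers are invented for illustration and are not our load test results.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """Core HPA rule: scale replicas in proportion to metric / target."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Hypothetical example: 2 pods averaging 90% CPU against a 60% target.
print(desired_replicas(2, 90.0, 60.0))  # 3
```

This is also why the choice of metric matters: if the HPA watched GPU utilization while the CPU saturated first, the ratio would stay near 1 and the deployment would not scale out even though requests were already slowing down.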

Capacity planning

  • Load test with the maximum request count we anticipated, and decide the initial server capacity to deploy
  • Here is the sample command and result
  • We can also visualize results with HTML like below.
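
Separately from the load test output itself, the capacity decision comes down to simple arithmetic once the load test has shown how many requests per second one pod can sustain. The following is a minimal sketch under assumptions: the function name, the headroom factor, and every number are hypothetical, not our measured values.

```python
import math


def initial_replicas(peak_rps: float,
                     sustainable_rps_per_pod: float,
                     headroom: float = 0.25) -> int:
    """Pods needed to serve the anticipated peak with spare capacity.

    peak_rps: maximum request rate we anticipate (assumed input).
    sustainable_rps_per_pod: rate one pod handled in the load test
        while staying within the latency target (assumed input).
    headroom: extra fraction of capacity kept in reserve (assumption).
    """
    if sustainable_rps_per_pod <= 0:
        raise ValueError("sustainable_rps_per_pod must be positive")
    required = peak_rps * (1.0 + headroom) / sustainable_rps_per_pod
    # Always deploy at least one pod, even for a tiny anticipated load.
    return max(1, math.ceil(required))


# Hypothetical example: 100 req/s peak, 20 req/s per pod, 25% headroom.
print(initial_replicas(100, 20, 0.25))  # 7
```

The result gives a starting replica count for the first deployment; HPA then adjusts it as real traffic arrives.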

Node scaling strategy

  • Beyond that, no special settings are needed for GPU node scaling

Cluster Autoscaler is an excellent tool and will scale nodes appropriately according to settings such as affinity/tolerations and GPU/CPU/memory requests/utilization.

The result of all these settings is as follows.

Future prospects

1. Modernize moderation ML pipeline

Today, a Data Scientist runs data collection, data cleaning, model training, and evaluation in a local notebook to produce a model, and then Data Scientists and SRE/MLOps engineers turn it into an API endpoint in the production environment.

This flow is not reproducible and is not really designed to scale. Therefore, we would like to build an automated pipeline using Kubeflow or other ML pipeline management tools to improve both reliability and the release cycle.

2. Adopt feature store


We are hiring SRE/MLOps Engineers!!