How to make a GPU inference environment for image classification production-ready with EKS/Kubernetes

TAKASHI NARIKAWA
Eureka Engineering
Dec 20, 2021 · 6 min read

This is the December 20 article for Eureka Advent Calendar 2021.

Japanese version here: https://fukubaka0825.hatenablog.com/entry/2021/12/20/020908

Introduction

Hi, I’m nari/wapper from the SRE team at eureka, Match Group. 👋

These days I work on a data lifecycle privacy project, data platform migration, moderation ML systems, and so on… (which sometimes makes me wonder whether I’m still a Site Reliability Engineer).

Let’s move on to the topic of this article. I would like to share the story of setting up a production environment for image classification in our moderation service with EKS. There are many aspects to making an EKS cluster production-ready, including security, but this time I will focus on the points that are unique to a GPU inference environment.

Target readers

  • People who have basic knowledge about Kubernetes (Cluster/Pod/Deployment etc.)
  • People who are interested in how to develop a production environment for GPU inference with EKS/Kubernetes.

How to release production environment of GPU inference

1. Set up the GPU environment on EKS nodes and containers

Here is an overview of the architecture for this section. (⚠︎ This is only my understanding and is not guaranteed to be correct.)

Components involved

  • Nvidia Driver

A driver that provides an interface to use Nvidia GPUs

  • nvidia-container-runtime

A container runtime that wraps runc and adds a prestart hook so containers can access GPUs.

https://github.com/NVIDIA/nvidia-container-runtime

  • nvidia-device-plugin

Exposes the number of GPUs installed on each node of the cluster, enabling Kubernetes clusters to run GPU-enabled containers.

https://github.com/NVIDIA/k8s-device-plugin

  • CUDA

CUDA is a parallel computing platform and programming model that makes using a GPU for general purpose computing simple and elegant. (quoted from https://blogs.nvidia.com/blog/2012/09/10/what-is-cuda-2/)

  • PyTorch

The open-source machine learning framework we adopted

https://pytorch.org/

1.1 Set up nodes with the Nvidia driver and nvidia-container-runtime using the Amazon EKS optimized accelerated Amazon Linux AMI and an MNG

In this case, I decided to lean on the managed features provided by AWS as much as possible.

  • Use Amazon EKS optimized accelerated Amazon Linux AMIs

In addition to the standard Amazon EKS optimized AMI configuration, the accelerated AMI includes the following: NVIDIA drivers and nvidia-container-runtime (as the default runtime) (quoted from https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html)

Note that Docker (dockershim) runtime support is deprecated in Kubernetes and is scheduled to be removed in v1.24, so relying on the preconfigured nvidia-container-runtime is the forward-looking choice.

  • Create the MNG (managed node group) by specifying a GPU instance type and the accelerated AMI

Here is a Terraform example.
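The following is only a minimal sketch, not our actual configuration: it assumes the cluster name, node IAM role, and subnet IDs are defined elsewhere, and the `workload-type` label is an illustrative value used later for node affinity.

```hcl
# Minimal sketch of a GPU managed node group (names, role, and subnets are placeholders)
resource "aws_eks_node_group" "gpu_inference" {
  cluster_name    = "moderation-cluster"          # hypothetical cluster name
  node_group_name = "gpu-inference"
  node_role_arn   = aws_iam_role.eks_node.arn     # assumed node IAM role defined elsewhere
  subnet_ids      = var.private_subnet_ids

  # EKS optimized accelerated AMI: Nvidia driver and nvidia-container-runtime preinstalled
  ami_type       = "AL2_x86_64_GPU"
  instance_types = ["g4dn.xlarge"]

  scaling_config {
    desired_size = 1
    min_size     = 1
    max_size     = 3
  }

  # Label used later by nodeAffinity to pin GPU workloads to this node group
  labels = {
    "workload-type" = "gpu-inference"
  }
}
```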

1.2 Install the Nvidia device plugin on the Kubernetes nodes

  • Install the Nvidia device plugin as a DaemonSet with kubectl, for example:
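The following is a sketch; the version tag is an example, so check the plugin repository’s README for the current manifest URL.

```sh
# Deploy the device plugin DaemonSet so nodes advertise nvidia.com/gpu as a resource
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.10.0/nvidia-device-plugin.yml

# Verify that the GPU shows up as an allocatable resource on a GPU node
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
```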

1.3 Write the Dockerfile for the CUDA/PyTorch container and the Kubernetes Deployment

One of the trickiest parts of working with machine learning libraries such as PyTorch is controlling which versions of PyTorch, CUDA, and Python to install so that they remain compatible with each other.

We needed to match the PyTorch version our data scientists use. (For details: https://pytorch.org/get-started/previous-versions/ )

Finally, I decided to manage the Python and CUDA environments together as a base image so that the required version of PyTorch can run on it. (ref: https://github.com/qts8n/cuda-python/blob/master/runtime/Dockerfile and https://github.com/docker-library/python/blob/master/3.9/bullseye/Dockerfile)
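Here is a minimal sketch of such an image, assuming CUDA 11.1 and PyTorch 1.9.1 as an example pairing (pick the actual versions from the compatibility table linked above; the entrypoint is hypothetical):

```dockerfile
# CUDA runtime base image + Python + a PyTorch build pinned to the same CUDA version
FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu20.04

# Install Python and pip (Ubuntu 20.04 ships Python 3.8)
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# PyTorch/torchvision wheels built against CUDA 11.1
RUN pip3 install --no-cache-dir \
    torch==1.9.1+cu111 torchvision==0.10.1+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html

WORKDIR /app
COPY . /app

# Hypothetical inference server entrypoint
CMD ["python3", "inference_server.py"]
```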

2. Separate the CPU inference and GPU inference workloads

We already have CPU inference models for text classification in the moderation service. (For details: https://medium.com/eureka-engineering/aws-solution-days-ai-machine-learning-day-tokyo-で登壇しました-45563b66d10 )

We needed to release this image classification model without impacting the existing text classification workload.

So, by using node affinity, the CPU models are scheduled only onto the CPU MNG, and the GPU models are scheduled only onto the GPU MNG together with an nvidia.com/gpu request/limit. As a result, we have complete workload isolation!
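As a sketch, the GPU Deployment spec looks roughly like the following (the label key, names, and image are illustrative; the CPU Deployment uses the mirror-image affinity and no GPU limit):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-classifier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: image-classifier
  template:
    metadata:
      labels:
        app: image-classifier
    spec:
      affinity:
        nodeAffinity:
          # Only schedule onto nodes in the GPU MNG
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: In
                    values: ["gpu-inference"]
      containers:
        - name: classifier
          image: <account>.dkr.ecr.<region>.amazonaws.com/image-classifier:latest  # example image
          resources:
            limits:
              nvidia.com/gpu: 1  # makes the device plugin allocate one GPU to this container
```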

The following picture is an overview.

3. Set up GPU monitoring with Datadog

I used Nvidia NVML to collect GPU metrics and visualize them in Datadog.

However, the official documentation is not enough to complete the setup and is hard to reuse, so I published a more detailed sample. Please check it for details: https://github.com/fukubaka0825/nvml-for-datadog-in-k8s.

The official Datadog Helm chart does not support NVML, so you need to provision the agent yourself as a DaemonSet.
Also, although it is not explained in the documentation, you need to set the environment variable `DD_CLUSTER_NAME` in the Datadog agent configuration on EKS.
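For illustration, the relevant part of the agent DaemonSet looks like this (the cluster name and Secret name are examples; the rest of the agent spec is omitted):

```yaml
# Excerpt from the Datadog agent DaemonSet container spec
containers:
  - name: datadog-agent
    image: gcr.io/datadoghq/agent:7
    env:
      - name: DD_CLUSTER_NAME
        value: "moderation-eks"     # must be set explicitly on EKS
      - name: DD_API_KEY
        valueFrom:
          secretKeyRef:
            name: datadog-secret    # assumed Secret holding the API key
            key: api-key
```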

The resulting dashboard is as follows.

4. Decide the pod/node scaling strategy (load testing / capacity planning)

Pod scaling strategy

  • Use HPA (Horizontal Pod Autoscaler)
  • Decide whether to base the scaling policy on GPU utilization or CPU utilization by load testing the staging environment (load test environment: instance type g4dn.2xlarge, deployment replica count 1)

I used vegeta and nappa as simple load-testing tools.

Since the results showed that CPU utilization saturates before GPU utilization does, we decided to base the HPA on CPU utilization.
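A sketch of the resulting HPA manifest (the CPU threshold and replica bounds are illustrative, not our production values):

```yaml
apiVersion: autoscaling/v2          # use autoscaling/v2beta2 on clusters older than 1.23
kind: HorizontalPodAutoscaler
metadata:
  name: image-classifier
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: image-classifier
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # scale out before CPU saturates
```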

Capacity planning

  • Load test with the maximum request rate we anticipated and decide the initial server capacity to deploy
  • A sample command and its result are sketched after this list
  • We can also visualize the results as HTML, as shown below
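For example, a vegeta run looks roughly like this (the endpoint, rate, and payload are placeholders, not our actual values):

```sh
# Attack the staging endpoint at a fixed rate and keep the raw results
echo "POST https://stg.example.com/v1/classify" | \
  vegeta attack -rate=100 -duration=60s -body=payload.json -header "Content-Type: application/json" | \
  tee results.bin | vegeta report

# Render an HTML latency plot from the same results
vegeta plot results.bin > plot.html
```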

Node scaling strategy

  • As for the rest, no special settings are needed for GPU node scaling

Cluster Autoscaler is an excellent tool and scales nodes appropriately according to settings such as affinity/tolerations and GPU/CPU/memory requests and utilization.
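For reference, a typical Cluster Autoscaler setup on EKS discovers the node groups’ Auto Scaling groups by tag; a sketch of the relevant container flags (with a cluster name placeholder) looks like this:

```yaml
# Excerpt from the Cluster Autoscaler container spec (auto-discovery by ASG tags)
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<CLUSTER_NAME>
  - --balance-similar-node-groups
  - --skip-nodes-with-system-pods=false
```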

The result of all these settings is as follows.

Future prospects

1. Modernize moderation ML pipeline

As for the moderation system, our team started with a manual workflow.

A data scientist runs data collection, data cleaning, model training, and evaluation in a local notebook to produce a model, and then the data scientist and SRE/MLOps engineers turn it into an API endpoint in the production environment.

This non-reproducible flow is not designed to scale. Therefore, we would like to build an automated pipeline using Kubeflow or another ML pipeline management tool to improve both reliability and the release cycle.

2. Adopt feature store

In addition, we will introduce a feature store so that common feature processing can be shared across multiple feature sets, both online and offline. This reduces data inconsistencies caused by different data sources between ML training and online serving. (A feature store is already adopted in our recommendation service: https://medium.com/eureka-engineering/vertex-ai-mlops-b74cdff19681)

(quoted from https://valohai.com/machine-learning-pipeline/)

We are hiring SRE/MLOps Engineers!!

If you are interested in what we do, our service, and our company (https://mtch.com/, https://eure.jp/), please feel free to contact me on Twitter.


TAKASHI NARIKAWA
Eureka Engineering

Site Reliability Engineer in Tokyo. SRE/DevOps/Go/AWS/Terraform/Kubernetes/Docker