Engineering Insights

Using Karpenter to manage GPU nodes with time-slicing

TarrantRo
Jina AI
4 min read · Jul 25, 2022

Introduction

In machine learning, GPUs power the compute-heavy workloads. These days, however, companies and users prefer the cloud to individual bare-metal machines. With cloud computing, you pay only for what you use and don’t need to buy expensive machines with graphics cards that you may rarely use.

That raises the question: how can we optimize the cost of GPU usage in the cloud? With virtual machines, you pay for the whole instance, GPUs included, even if you don’t need it 24x7. Compared with virtual machines, Kubernetes provides elastic node scaling and is more cloud native. I will use Karpenter as the node autoscaler since I’m using EKS. You can learn more about Karpenter in this doc.

NVIDIA’s Kubernetes device plugin is needed as well. The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:

  • Expose the number of GPUs on each node of your cluster
  • Keep track of the health of your GPUs
  • Run GPU enabled containers in your Kubernetes cluster.
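To make the last point concrete, here is a minimal sketch of a pod manifest that requests a GPU through the `nvidia.com/gpu` resource advertised by the device plugin. It is built as a plain Python dictionary and printed as JSON, which `kubectl apply -f -` accepts. The helper `gpu_pod_manifest`, the pod name, and the image are hypothetical placeholders, not something taken from this article.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a pod manifest that asks the scheduler for `gpus` GPUs.

    `name` and `image` are illustrative placeholders (assumptions).
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested through limits; the resource name
                    # is the one the NVIDIA device plugin exposes.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("gpu-demo", "example.com/my-gpu-image:latest")
print(json.dumps(manifest, indent=2))
```

A pod like this stays pending until a node with a free `nvidia.com/gpu` resource exists, which is the signal a node autoscaler such as Karpenter can react to.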

It also supports time-slicing, which allows users to share a GPU between pods, saving…
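As a rough illustration of time-slicing, the sketch below builds a sharing configuration in the shape the NVIDIA device plugin reads (`version`, `sharing.timeSlicing.resources`) and computes how many `nvidia.com/gpu` resources a node would then advertise. The replica count of 4, the helper names, and the input check are assumptions chosen for the example, not values from this article.

```python
import json


def time_slicing_config(replicas: int, resource: str = "nvidia.com/gpu") -> dict:
    """Build a time-slicing config for the NVIDIA device plugin.

    `replicas` is how many pods may share one physical GPU.
    """
    if replicas < 1:
        # Sanity check added for this sketch only.
        raise ValueError("replicas must be at least 1")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": resource, "replicas": replicas}]
            }
        },
    }


def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Each physical GPU is advertised `replicas` times to the scheduler."""
    return physical_gpus * replicas


config = time_slicing_config(replicas=4)
print(json.dumps(config, indent=2))
print(advertised_gpus(physical_gpus=1, replicas=4))
```

JSON is a subset of YAML, so the printed document can be placed in the ConfigMap that the plugin is pointed at. Note that time-slicing shares compute time only; pods sharing a GPU get no memory isolation from each other.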

TarrantRo
Jina AI

IT guy who loves movies and Japanese manga. Has some experience with Linux systems, containers/k8s, DevOps, cloud, etc.