Engineering Insights

Using Karpenter to manage GPU nodes with time-slicing

TarrantRo

Published in

Jina AI

4 min read · Jul 25, 2022

Introduction

Machine learning workloads rely on GPUs for their computation. Increasingly, though, companies and individual users prefer the cloud to owning bare-metal machines. In the cloud you pay only for what you use, and you don’t need to buy expensive machines with graphics cards that you may rarely use.

That raises the question: how can we optimize the cost of GPU usage in the cloud? With plain virtual machines, you pay for the whole instance, GPUs included, even if you don’t need it 24x7. Kubernetes, by contrast, provides elastic node scaling and is more cloud native. I will use Karpenter as the node autoscaler since I’m on EKS. You can learn more about Karpenter in this doc.

NVIDIA’s Kubernetes plugin is needed as well. The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:

Expose the number of GPUs on each node of your cluster
Keep track of the health of your GPUs
Run GPU-enabled containers in your Kubernetes cluster

It also supports time-slicing, which lets users share a GPU between pods, saving…
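To make the sharing idea concrete, here is a minimal Python sketch of the arithmetic behind time-slicing. It assumes the plugin's usual behaviour, where a `replicas` setting makes each physical GPU show up as that many schedulable `nvidia.com/gpu` resources. The class and function names, and the replica count of 4, are hypothetical and chosen only for illustration; this is not the plugin's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeSlicingConfig:
    """Hypothetical stand-in for a time-slicing setting of the device plugin."""
    resource_name: str = "nvidia.com/gpu"
    replicas: int = 4  # assumed value: how many pods may share one physical GPU


def advertised_gpus(physical_gpus: int, config: TimeSlicingConfig) -> int:
    """Number of GPU resources a node would advertise under time-slicing."""
    if physical_gpus < 0:
        raise ValueError("physical_gpus must be non-negative")
    if config.replicas < 1:
        raise ValueError("replicas must be at least 1")
    # Each physical GPU is exposed as `replicas` schedulable resources.
    return physical_gpus * config.replicas


if __name__ == "__main__":
    config = TimeSlicingConfig(replicas=4)
    for gpus in (1, 4):
        print(f"{gpus} physical GPU(s) -> "
              f"{advertised_gpus(gpus, config)} x {config.resource_name}")
```

Under these assumptions, a single-GPU node can host as many one-GPU pods as the replica count allows. Time-slicing shares compute time only: the pods on one GPU still compete for the same memory and are not isolated from each other.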
