Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Mastering Complex Workloads with Kubernetes JobSet and GKE metrics

--

Look, we’ve all been there. Kubernetes Jobs are solid for simple batch tasks, but when things get complicated, you start running into walls. You’ve got complex workflows, dependencies, and resource requirements that basic Jobs just can’t handle.

The Limitations of Indexed Jobs

The Kubernetes 1.21 release introduced Indexed Jobs, a step up from basic Jobs, allowing for indexed parallel execution. Indexed Jobs let you run a batch job where each pod needs to know its own index within the job. Think of it like a numbered list of tasks. Each pod gets assigned a unique index, which can be used to process a specific slice of data or perform a distinct part of the overall job.

Here’s a simple example: imagine we have a large file that we want to split and process in parallel. We can use an Indexed Job to assign each pod a different chunk of the file.

apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-file-processor
spec:
  completions: 10    # Total number of pods that must complete successfully
  parallelism: 5     # Number of pods to run in parallel
  completionMode: Indexed
  template:
    spec:
      containers:
      - name: processor
        image: your-processor-image:latest
        command: ["python", "process_chunk.py", "--index=$(JOB_COMPLETION_INDEX)"]
        env:
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
      restartPolicy: Never

In this example:

  • completions: 10 means we want 10 pods to run and complete.
  • parallelism: 5 means we want 5 pods running at the same time.
  • completionMode: Indexed is the key part – it tells Kubernetes to create an Indexed Job.
  • $(JOB_COMPLETION_INDEX) is expanded from the JOB_COMPLETION_INDEX environment variable, which we populate from the batch.kubernetes.io/job-completion-index pod annotation so the index reaches the python script.

Each pod will run the process_chunk.py script with a different --index argument, allowing it to process a unique chunk of the file.

However, even with Indexed Jobs, managing complex, distributed workloads like those in machine learning or HPC can be a headache. You’re still dealing with individual Jobs, and coordinating dependencies, networking, and resource placement becomes increasingly difficult.

JobSet to the Rescue

That’s where JobSet comes in, and trust me, it’s a game-changer. Think of JobSet as your workload’s conductor. It takes a bunch of related Kubernetes Jobs and treats them as a single, orchestrated unit. Forget wrangling individual Jobs; JobSet lets you manage them all together, like a well-oiled machine. This is huge for heavy-duty stuff like high-performance computing (HPC) and machine learning, where you’re dealing with intricate, interconnected tasks.

Let’s say you’re training a massive machine learning model. You’ve got parameter servers, worker nodes, the whole shebang. With a regular Job, you’d be managing each component separately, a headache waiting to happen. JobSet lets you define these different pieces as “ReplicatedJobs” within a single JobSet. You set the rules, define the pod templates, and JobSet makes sure everything launches and runs in sync. Plus, it handles the network setup, so you don’t have to sweat the small stuff.
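
To make that concrete, here's a minimal sketch of what that could look like. The names, images, and replica counts are made up for illustration; a full, runnable PyTorch example follows later in this post.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run              # hypothetical name
spec:
  replicatedJobs:
  - name: parameter-server        # one group of Jobs for the parameter servers
    replicas: 1
    template:
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: ps
              image: your-training-image:latest   # placeholder image
  - name: worker                  # a second group of Jobs for the workers
    replicas: 1
    template:
      spec:
        parallelism: 8
        completions: 8
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: your-training-image:latest   # placeholder image

Both groups launch together, share the JobSet's network setup, and are managed as a single unit.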

And if something goes wrong? JobSet’s got your back. You can set up custom failure and success policies, telling it exactly how to react. It’s like having a safety net for your complex workloads.
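
For example, the spec can carry a failurePolicy and a successPolicy side by side. Here's a minimal sketch (the values and the "workers" name are illustrative): restart the whole JobSet up to three times when a child Job fails, and declare success once the workers group finishes.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: resilient-training        # hypothetical name
spec:
  failurePolicy:
    maxRestarts: 3                # recreate all child Jobs up to 3 times before failing the JobSet
  successPolicy:
    operator: All                 # every targeted replicated Job must complete successfully
    targetReplicatedJobs:
    - workers                     # only the "workers" group counts toward overall success
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        # ... Job template as in the examples below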

Another thing that often trips people up is placement. In HPC and distributed training, where network speed is king, you want your tasks running close together, ideally on the same rack or zone. JobSet lets you specify these topology constraints, so your workloads end up where they need to be (there's an exclusive placement example later in this post).

JobSet Metrics on GKE: Deeper Insights, Less Effort

Recently, Google Cloud made JobSet metrics automatically available on new GKE Standard and Autopilot clusters running version 1.32.1-gke.1357001 or later. These clusters export the following rollup metrics by default, at no additional charge:

  • kubernetes.io/jobset/times_between_interruptions: Distribution of times between the end of last interruption and beginning of current interruption for a JobSet. Each sample indicates a single duration between last and current interruption. The data is sampled within 60s after the current interruption starts, and emitted within 24h. The metric does not include a sample for duration between interruptions longer than 7 days. This metric is only applicable for JobSets running on nodes with GPU/TPU and having a single replicated job.
  • kubernetes.io/jobset/times_to_recover: Distribution of recovery period durations. Each sample indicates a single recovery operation for the JobSet to recover from a downtime period. The data is sampled within 60s after the completion of JobSet recovery, and emitted within 24h. This metric does not include samples for downtime periods longer than 7 days. This metric is only applicable for JobSets running on nodes with GPU/TPU and having a single replicated job.
  • kubernetes.io/jobset/uptime: Total time the JobSet has been available. The data is sampled every 60s and emitted within 24h after sampling. This metric is only applicable for JobSets running on nodes with GPU/TPU and having a single replicated job.
  • kube_jobset_specified_replicas: The number of specified replicas per replicated Job in a JobSet. Sampled every 30 seconds.
  • kube_jobset_ready_replicas: The number of replicas in a ‘READY’ state per replicated Job in a JobSet. Sampled every 30 seconds.
  • kube_jobset_succeeded_replicas: The number of replicas in a ‘SUCCEEDED’ state per replicated Job in a JobSet. Sampled every 30 seconds.
  • kube_jobset_failed_replicas: The number of replicas in a ‘FAILED’ state per replicated Job in a JobSet. Sampled every 30 seconds.
  • kube_jobset_active_replicas: The number of replicas in an ‘ACTIVE’ state per replicated Job in a JobSet. Sampled every 30 seconds.
  • kube_jobset_suspended_replicas: The number of replicas in a ‘SUSPENDED’ state per replicated Job in a JobSet. Sampled every 30 seconds.
  • kube_jobset_status_condition: The current status conditions of a JobSet. Sampled every 30 seconds.

Here is an example JobSet that runs a distributed PyTorch training workload (from the JobSet docs):

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: 4
        completions: 4
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: pytorch
              image: gcr.io/k8s-staging-jobset/pytorch-resnet:latest
              ports:
              - containerPort: 3389
              env:
              - name: MASTER_ADDR
                value: "pytorch-workers-0-0.pytorch"
              - name: MASTER_PORT
                value: "3389"
              command:
              - bash
              - -xc
              - |
                torchrun --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT resnet.py --backend=gloo

If you are looking for a real-world example, I recommend checking this GCP doc, which shows how to orchestrate multiple Multislice workloads on GKE for improved resource utilization: it deploys a JAX workload as an example, runs it on TPU Multislice, and implements Job queueing with JobSet and Kueue. Kueue determines when Jobs should run based on available resources, quotas, and a hierarchy for fair sharing among teams.
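
As a rough sketch of how that queueing piece fits together (assuming Kueue is installed and a LocalQueue with the name used here exists; both names are hypothetical), you mostly just label the JobSet with the queue it should be admitted to and create it suspended:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-training               # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: tpu-queue  # LocalQueue Kueue should admit this JobSet from
spec:
  suspend: true                           # Kueue unsuspends the JobSet once quota is available
  replicatedJobs:
  - name: workers
    replicas: 2
    template:
      spec:
        # ... Job template for the slice workers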

There is also this KubeCon talk from 2023 by Abedullah and Lawrence that shows some really nice use cases.

Some of the amazing features of JobSet are:

  • Exclusive Job to topology placement: The JobSet annotation alpha.jobset.sigs.k8s.io/exclusive-topology defines 1:1 job to topology placement. For example, consider the case where the nodes are assigned a tpu label.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: exclusive-placement
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: tpu # 1:1 job replica to node pool assignment
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 3 # set to number of node pools
    template:
      spec:
        parallelism: 3
        completions: 3
        backoffLimit: 10
        template:
          spec:
            containers:
            - name: sleep
              image: busybox
              command:
              - sleep
              args:
              - 1000s
  • spec.failurePolicy.maxRestarts defines how many times to automatically restart the JobSet. A restart is done by recreating all child Jobs. A JobSet is terminally failed when the number of failures reaches maxRestarts.
  • spec.coordinator: If defined, a jobset.sigs.k8s.io/coordinator annotation and label containing the stable network endpoint of the coordinator Pod will be added to all Jobs and Pods in the JobSet. Other Pods can then read this label, for example:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  coordinator:
    replicatedJob: leader
    jobIndex: 0
    podIndex: 0
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        ...
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 8
        completions: 8
        template:
          spec:
            containers:
            - name: worker
              env:
              - name: LEADER_ADDRESS
                valueFrom:
                  fieldRef:
                    fieldPath: "metadata.labels['jobset.sigs.k8s.io/coordinator']"
              ...
  • DNS for Pods: By default, JobSet configures DNS for Pods by creating a headless service whose name is spec.network.subdomain, which defaults to .metadata.name if not set. The hostname for a pod will have the following format: <jobSetName>-<spec.replicatedJobs[*].name>-<spec.replicatedJobs[*].replicas[*]>-<pod-index>. The FQDN for a pod will have the following format: <jobSetName>-<spec.replicatedJobs[*].name>-<spec.replicatedJobs[*].replicas[*]>-<pod-index>.<subdomain>. (See the sketch after this list.)
  • Groups of Jobs of different templates: The list .spec.replicatedJobs allows the user to define groups of Jobs of different templates. Each entry of .spec.replicatedJobs defines a Job template in spec.replicatedJobs[*].template and the number of replicas that should be created in spec.replicatedJobs[*].replicas; when unset, it defaults to 1. Each Job in each spec.replicatedJobs entry gets a different job-index in the range 0 to .spec.replicatedJobs[*].replicas-1. The Job name will have the following format: <jobSetName>-<replicatedJobName>-<jobIndex>.
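
As a small illustration of the DNS point above, setting a custom subdomain is a one-field change. With the (made-up) names below, the first pod of the workers group gets the hostname my-jobset-workers-0-0 and the FQDN my-jobset-workers-0-0.my-network, following the formats listed above:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: my-jobset                  # hypothetical name
spec:
  network:
    subdomain: my-network          # headless Service name; defaults to the JobSet name if omitted
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 2
        completions: 2
        # ... pod template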

Basically, JobSet takes the headache out of managing complex batch workloads on Kubernetes. If you’re dealing with distributed training, HPC, or anything that involves coordinating multiple Jobs, give JobSet a try. You’ll wonder how you ever managed without it, especially with the added bonus of automatic metrics on newer GKE clusters.

--

Felipe Martinez