Notes on Kubeflow Summit 2022 (1)

Jian Wu
Nov 6, 2022


Aurora’s ML Infrastructure with AWS EKS, Kubeflow & SageMaker for Autonomous Driving

Last year, at AWS re:Invent 2021, Aurora and AWS presented hyperscale ML infrastructure built on top of AWS services, including EKS, Kubeflow, and SageMaker, for autonomous driving:

AWS re:Invent 2021 – Chris Urmson of Aurora on using AWS to enable self-driving technology

AWS re:Invent 2021 — AWS for autonomous driving with Aurora Innovation

To process hyperscale datasets collected from autonomous vehicles through its “infinity” workflow (data management, processing, labeling, ML model development and training, evaluation, simulation, MLOps, and orchestration), Aurora built its cloud development platform on AWS using EC2, S3, DynamoDB, RDS, EKS, Kubeflow, SageMaker, and other AWS services:

To help ML engineers and data engineers concentrate on their challenging work with minimal overhead, Aurora's development processes and workflows can be started automatically from code changes/commits in GitHub repos. First, the CI/CD system with Bazel is triggered to build the required artifacts, including Docker container images; then the different jobs and workloads, including data processing, training, and simulation, are orchestrated with Kubeflow Pipelines to run on multi-tenant EKS clusters across multiple availability zones:
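At its core, the orchestration described above (jobs triggered by commits, then run in dependency order) boils down to executing a DAG of steps. Below is a minimal, pure-Python sketch with hypothetical stage names; Kubeflow Pipelines does the same thing at cluster scale, with each step running as a container on EKS:

```python
from collections import deque

def run_pipeline(steps, deps):
    """Run steps in dependency order (Kahn's topological sort).

    steps: dict of step name -> zero-arg callable
    deps:  dict of step name -> list of upstream step names
    """
    indegree = {name: len(deps.get(name, [])) for name in steps}
    downstream = {name: [] for name in steps}
    for name, ups in deps.items():
        for up in ups:
            downstream[up].append(name)
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        name = ready.popleft()
        steps[name]()  # here: append to a log; in reality: launch a job
        order.append(name)
        for nxt in downstream[name]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

# Hypothetical stages of a commit-triggered workflow:
log = []
steps = {
    "bazel_build_images": lambda: log.append("build"),
    "data_processing":    lambda: log.append("process"),
    "training":           lambda: log.append("train"),
    "simulation":         lambda: log.append("simulate"),
}
deps = {
    "data_processing": ["bazel_build_images"],
    "training":        ["data_processing"],
    "simulation":      ["bazel_build_images"],
}
order = run_pipeline(steps, deps)
```

The Bazel image build runs first because every other stage depends on its artifacts, which matches the flow described above.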

Recently, at Kubeflow Summit 2022 in San Francisco, we had an opportunity to learn more about Aurora’s ML workflow orchestration powered by Kubeflow Pipelines & Components:

How Aurora Uses Kubeflow Pipelines to Accelerate ML Model Development for Autonomous Vehicles (Kubeflow Summit 2022)

At the system architecture level, Aurora's ML workflows are orchestrated with Kubeflow Pipelines and Components, including dataset generation with BatchAPI, ML model training with SageMaker, ML model evaluation, and conversion of ML models to TensorRT for deployment:
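The four stages above can be sketched as plain functions chained into a workflow; every function name, URI, and metric below is a made-up stand-in for the corresponding Kubeflow component:

```python
# Each stage stands in for a Kubeflow component in the talk's architecture.
# All URIs and names here are illustrative placeholders.

def extract_dataset(batch_api_job: str) -> str:
    """Stage 1: dataset generation via a batch job (stand-in for BatchAPI)."""
    return f"s3://datasets/{batch_api_job}/shards"

def train_model(dataset_uri: str) -> str:
    """Stage 2: launch training (stand-in for a SageMaker training job)."""
    return dataset_uri.replace("datasets", "models") + "/model.pt"

def evaluate_model(model_uri: str) -> dict:
    """Stage 3: model evaluation (stand-in for a BatchAPI evaluation job)."""
    return {"model": model_uri, "mAP": 0.0}  # placeholder metric

def convert_to_tensorrt(model_uri: str) -> str:
    """Stage 4: convert the trained model to a TensorRT engine for deployment."""
    return model_uri.replace(".pt", ".trt")

def ml_workflow(batch_api_job: str):
    """Chain the four stages, passing each stage's output to the next."""
    dataset = extract_dataset(batch_api_job)
    model = train_model(dataset)
    metrics = evaluate_model(model)
    engine = convert_to_tensorrt(model)
    return metrics, engine
```

In the real pipeline, each return value would be an artifact reference (an S3 URI or metrics file) handed between components by Kubeflow.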

Since Kubeflow is Aurora's ML workflow orchestration engine, all of Aurora's computing services and resources are integrated through Kubeflow Components, including:

  1. BatchAPI for data extraction and model evaluation
  2. ML Training Jobs using Amazon SageMaker
  3. Job Notification with Slack
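The Slack notification component in item 3 can be as simple as a POST to a Slack incoming webhook; the sketch below (hypothetical function names, standard library only) builds the message payload and sends it:

```python
import json
from urllib import request

def build_slack_payload(job_name: str, status: str, details_url: str) -> dict:
    """Build the message body for a Slack incoming webhook."""
    emoji = ":white_check_mark:" if status == "Succeeded" else ":x:"
    return {"text": f"{emoji} Pipeline job `{job_name}` finished: {status}\n{details_url}"}

def notify_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload to the webhook (network call, not exercised here)."""
    req = request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)
```

Wrapped in a container image, a script like this becomes a reusable Kubeflow component that any pipeline can append as its final step.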

To make it easy for ML engineers and data engineers to run ML training jobs, the pipelines can be triggered by a GitHub pull request: first, all the required Docker container images are built with Bazel; then the images are published to AWS ECR repositories; finally, the ML training jobs are started:
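A common convention with this kind of PR-triggered build (an assumption here, not something the talk specified) is to tag each published image with the commit SHA that triggered the build, so every training run is traceable back to its pull request:

```python
def ecr_image_uri(account_id: str, region: str, repo: str, commit_sha: str) -> str:
    """Compose an ECR image URI tagged with a shortened commit SHA,
    following the standard <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
    naming scheme."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{commit_sha[:12]}"
```

The pipeline can then pass this URI straight into the training-job component as the container image to run.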

Aurora’s ML training jobs are launched through Kubeflow using the open-source Amazon SageMaker Components for Kubeflow Pipelines. The training jobs run on SageMaker multi-GPU training instances with optimized networking; trained ML models are saved to configured S3 buckets with proper access control, and training logs and metrics are sent to AWS CloudWatch.
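Under the hood, the SageMaker training component ultimately issues a `CreateTrainingJob` API call. A request body with the shape below (the helper function and all values are hypothetical; the field names follow the public SageMaker API) shows where the training image, S3 output bucket, multi-GPU instance type, and hyperparameters fit:

```python
def make_training_job_request(job_name, image_uri, role_arn, output_s3,
                              hyperparameters,
                              instance_type="ml.p3.16xlarge", instance_count=1):
    """Assemble a SageMaker CreateTrainingJob request body.
    Field names follow the public API; the values are placeholders."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # ECR image built by CI/CD
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3},  # trained model artifacts
        "ResourceConfig": {
            "InstanceType": instance_type,   # multi-GPU training instance
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 200,
        },
        # SageMaker requires hyperparameter values to be strings.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
        "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 3600},
    }
```

The Kubeflow component takes roughly these same fields as inputs and handles submitting the job and polling its status.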

Training configuration parameters (50+) are passed to an internal “Training Main Wrapper” as SageMaker hyperparameters; the wrapper then forwards these parameters to the PyTorch training script to train specific ML models.
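SageMaker exposes the submitted hyperparameters inside the training container at `/opt/ml/input/config/hyperparameters.json`. A wrapper along the lines below (a sketch only; Aurora's actual “Training Main Wrapper” is internal, and `train.py` is a hypothetical script name) could read them and forward them to the PyTorch training script as command-line flags:

```python
import json
import subprocess
import sys

# Path where SageMaker mounts the hyperparameters inside the container.
HYPERPARAMS_PATH = "/opt/ml/input/config/hyperparameters.json"

def hyperparams_to_argv(params: dict) -> list:
    """Turn {"epochs": "10", "lr": "0.001"} into
    ["--epochs", "10", "--lr", "0.001"] (sorted for determinism)."""
    argv = []
    for key, value in sorted(params.items()):
        argv += [f"--{key}", str(value)]
    return argv

def main(path=HYPERPARAMS_PATH):
    with open(path) as f:
        params = json.load(f)
    # Forward everything to the actual PyTorch training script.
    cmd = [sys.executable, "train.py"] + hyperparams_to_argv(params)
    subprocess.run(cmd, check=True)
```

This indirection keeps the training script a plain argparse-style program that runs the same way locally and on SageMaker.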

Aurora’s development platform with Kubeflow Pipelines went to production around mid-2021. Take the S2T pipeline as an example: S2T (Sensor to Tensor) is Aurora’s core pipeline and also the first ML pipeline built at Aurora:

Before Kubeflow Pipelines went to production, it took ML engineers weeks to work through the S2T pipeline. After the pipeline was captured and automated with Kubeflow Pipelines, it takes only one week to complete with zero touch from the ML engineers; as a result, a trained S2T model now lands at least bi-monthly.

If you want to know more about Kubeflow Summit 2022, please continue reading:


Jian Wu

Developer/DevOps Engineer building React web UIs, REST API servers, ML-related tools, and Kubernetes clusters. linkedin.com/in/hellojianwu