Mobileye’s journey towards scaling Amazon EKS to thousands of nodes, leveraging Intel® Xeon® Scalable Processors and Habana’s Gaudi AI accelerators
Authors: Diego Bailon Humpert, AWS EMEA and Global Automotive GTM Lead & David Peer, Mobileye AI Engineering DevOps specialist & team leader.
Mobileye is a company that develops autonomous driving technologies and advanced driver-assistance systems (ADAS) including cameras, computer chips, and software.
In this blog post, we review how Mobileye enables its AI Engineering group to run more than 250 workflows daily on Amazon EKS, and how Amazon EC2 instances based on Habana Gaudi accelerators and Intel Xeon Scalable processors help reduce development time and time-to-market.
All production workloads run on Amazon EKS. Due to the variety of workflows, we have adapted different compute configurations per EKS cluster.
More precisely, to meet our general compute needs we use Amazon EC2 R5 instances powered by 2nd Generation Intel Xeon Scalable processors. For our AI/ML requirements, we shifted some of our training workflows to Amazon EC2 DL1 instances.
The diagram below describes the configuration of our Amazon EKS cluster. For simplicity, it shows how a single Availability Zone is configured. The same configuration applies to all other Availability Zones in which our workloads run.
This configuration enabled us to scale our clusters to ~3,200 nodes used by ~40,000 pods, and more than 100,000 vCPUs in a single cluster, while more than 95% of the cluster data-plane uses Spot instances.
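As a rough back-of-the-envelope reading of those figures, the snippet below derives per-node averages. The inputs are the rounded numbers quoted above, so the results are only indicative and not measured values.

```python
# Approximate cluster figures quoted in the text (rounded, not exact).
nodes = 3_200
pods = 40_000
vcpus = 100_000  # lower bound: the text says "more than 100,000 vCPUs"

# Simple averages across the whole cluster.
pods_per_node = pods / nodes
vcpus_per_node = vcpus / nodes

print(f"average pods per node:  {pods_per_node:.1f}")
print(f"average vCPUs per node: at least {vcpus_per_node:.2f}")
```

Real nodes will deviate from these averages, since the cluster mixes instance sizes and workflow types.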
Argo workflows — exposing Kubernetes to the crowds
Given the above EKS architecture, our developers are not, and should not be, aware of the node group setup, Availability Zones (AZs), or any other infrastructure details. To achieve that abstraction, we use Argo Workflows.
The below diagram demonstrates a complete workflow generated by a developer. We are currently running more than 250 workflows daily.
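To make the abstraction more concrete, here is a minimal sketch of how a sequential workflow could be generated as an Argo Workflow manifest. This is not our actual tooling: the helper `build_workflow`, the step names, the image, and the commands are all hypothetical. Only the manifest structure (`argoproj.io/v1alpha1`, `Workflow`, `templates`, `dag`) follows the Argo Workflows format.

```python
import json


def build_workflow(name_prefix, image, steps):
    """Return an Argo Workflow manifest (as a dict) that runs `steps` in order.

    `steps` is a list of (step_name, shell_command) tuples. All names here
    are illustrative placeholders.
    """
    templates = []
    tasks = []
    previous = None
    for step_name, command in steps:
        # One container template per step.
        templates.append({
            "name": step_name,
            "container": {
                "image": image,
                "command": ["sh", "-c"],
                "args": [command],
            },
        })
        # One DAG task per step, chained to the previous step.
        task = {"name": step_name, "template": step_name}
        if previous is not None:
            task["dependencies"] = [previous]
        tasks.append(task)
        previous = step_name

    # The entrypoint template ties the steps together as a DAG.
    templates.append({"name": "main", "dag": {"tasks": tasks}})
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": name_prefix},
        "spec": {"entrypoint": "main", "templates": templates},
    }


workflow = build_workflow(
    "example-pipeline-",          # hypothetical name prefix
    "example.com/tools:latest",   # hypothetical image
    [("preprocess", "echo preprocessing"), ("train", "echo training")],
)
print(json.dumps(workflow, indent=2))
```

The developer only describes steps and their order; node groups, AZs, and instance types stay out of the picture.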
Our workflows are mainly CPU powered. For our Deep Learning Training environment, we have started deploying Habana Gaudi-based Amazon EC2 DL1 Instances for (2D and 3D) models.
In the next paragraphs, we will share how to properly configure and set up DL1 for EKS workflows.
Launching Habana Gaudi-based Amazon EC2 DL1 instances
Getting started with Gaudi accelerators is as simple as adding two lines of code, as you can see in the sample code below.
You load the Habana module and then simply run your script using the software container.
from TensorFlow.common.library_loader import load_habana_module
load_habana_module()
More information on using the DL1 instance type can also be found here.
When running the script, make sure to use the appropriate Python executable. This depends on your setup of choice, as documented here.
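If the same training script also has to run on machines without Gaudi (a developer laptop, for instance), one option is to guard the load call. The sketch below is an assumption on our part, not something the Habana documentation prescribes: the helper `try_load_habana` is hypothetical, and the import path is the one used in the sample above.

```python
import importlib


def try_load_habana():
    """Load the Habana module if it is available.

    Returns True when the module was found and loaded, and False when the
    import path from the sample above is not installed on this machine.
    """
    try:
        loader = importlib.import_module("TensorFlow.common.library_loader")
    except ImportError:
        # Not running inside a Habana software container: stay on CPU.
        return False
    loader.load_habana_module()
    return True


habana_loaded = try_load_habana()
print("Habana module loaded:", habana_loaded)
```

On a DL1 instance inside the Habana container this is expected to report that the module was loaded; elsewhere the script simply continues without it.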
Labeling DL1 node-groups to provide scheduling flexibility to our users
We are running several types of workflows that require different instance configurations. This is where node labels, taints and tolerations, and node affinity come in handy. They allow us to ensure that each workload lands on the right infrastructure and the right instance type.
Below are examples of node group configuration in eksctl.
Managed node group example:
- name: spot-workflows-gaudi-a3
- key: "habana.ai/gaudi"
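For readers who want to see how the pieces of such a node group entry fit together, here is a hedged sketch that assembles one as a Python dict. Only the node group name and the taint key come from the example above. The Spot setting is inferred from the name, the `accel` label mirrors the node selector used later in this post, and the taint value and effect are assumptions.

```python
import json


def gaudi_node_group(name):
    """Build a managed node group entry in the shape eksctl expects.

    The taint value and effect below are illustrative assumptions.
    """
    return {
        "name": name,
        "instanceTypes": ["dl1.24xlarge"],
        "spot": True,
        # Label that pods can target with a nodeSelector.
        "labels": {"accel": "dl1.24xlarge"},
        # Taint that keeps pods without a matching toleration off Gaudi nodes.
        "taints": [
            {"key": "habana.ai/gaudi", "value": "true", "effect": "NoSchedule"}
        ],
    }


node_group = gaudi_node_group("spot-workflows-gaudi-a3")
print(json.dumps({"managedNodeGroups": [node_group]}, indent=2))
```

The label attracts the workloads that ask for Gaudi, while the taint repels everything else, so the two are typically used together.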
To request a specific instance type, a pod can use the "nodeSelector" field. For example, when a pod needs a Habana Gaudi-based instance, the following pod manifest achieves just that:
accel: "dl1.24xlarge"
- key: "habana.ai/gaudi"
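The interplay between node selectors, taints, and tolerations can be sketched in a few lines. The code below is a simplified model of the Kubernetes scheduling rules, not the scheduler's actual implementation. The node and pod definitions are hypothetical and reuse the label and taint key from the manifests above, and the taint value and effect are assumptions.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # With Exists, an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(pod, node):
    """Simplified check: nodeSelector must match and hard taints be tolerated."""
    labels = node.get("labels", {})
    selector_ok = all(labels.get(k) == v
                      for k, v in pod.get("nodeSelector", {}).items())
    hard_taints = [t for t in node.get("taints", [])
                   if t["effect"] in ("NoSchedule", "NoExecute")]
    taints_ok = all(any(tolerates(tol, taint)
                        for tol in pod.get("tolerations", []))
                    for taint in hard_taints)
    return selector_ok and taints_ok


gaudi_node = {
    "labels": {"accel": "dl1.24xlarge"},
    "taints": [{"key": "habana.ai/gaudi", "value": "true",
                "effect": "NoSchedule"}],
}
cpu_node = {"labels": {}, "taints": []}

training_pod = {
    "nodeSelector": {"accel": "dl1.24xlarge"},
    "tolerations": [{"key": "habana.ai/gaudi", "operator": "Exists",
                     "effect": "NoSchedule"}],
}
cpu_pod = {}

print(can_schedule(training_pod, gaudi_node))  # selector and toleration match
print(can_schedule(cpu_pod, gaudi_node))       # blocked by the taint
```

The selector steers training pods to Gaudi nodes, and the taint keeps CPU workflows from occupying them.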
DL1 workflows — reducing time to market
Setting up our Deep Learning training batch workflows through DL1 has allowed us to train more and spend less. The DL1 instances feature up to 8 Gaudi processors, and deliver up to 40% better price performance than the latest GPU-based instances as stated on the AWS DL1 instance site.
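It helps to spell out what a price-performance gain means for a training bill. The sketch below assumes that "40% better price performance" means 40% more training work per dollar, and that 40% is the best case ("up to"). The function name is ours and purely illustrative.

```python
def relative_training_cost(price_performance_gain):
    """Cost of the same training work relative to the baseline instance.

    A gain of 0.40 means 40% more work per dollar, so the same work
    costs 1 / 1.40 of the baseline.
    """
    return 1.0 / (1.0 + price_performance_gain)


relative_cost = relative_training_cost(0.40)
savings = 1.0 - relative_cost

print(f"cost relative to baseline: {relative_cost:.3f}")
print(f"savings for the same work: {savings:.1%}")
```

Under that reading, a 40% price-performance gain corresponds to roughly 29% lower cost for the same training job, not 40%.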
More information on the product details can be found here.
DL1 Use cases
Primarily, we are leveraging DL1 workflows to train 2D and 3D models. We are consistently seeing cost savings compared to existing GPU-based instances across model types, enabling us to achieve much better time-to-market for existing models or to train much larger and more complex models.
The following tables, published by AWS here, are examples of mixed-precision/FP32 training results comparing DL1 with the GPU instances commonly used for ML training.
Framework: TensorFlow 2
Model: BERT Large (pretraining)
Framework: PyTorch 1.9
Our decision to start testing DL1 for our Deep Learning training workflows has allowed us to see consistent cost savings and has accelerated our pace of innovation.
Notices & Disclaimers
Performance varies by use, configuration and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.