Introducing MindX DL: Efficient Deep Learning Cluster Scheduling for Ascend Devices

Hüseyin Çayırlı
Published in Huawei Developers · Jul 14, 2023
MindX DL

Introduction

Deep learning has revolutionized the field of artificial intelligence, enabling breakthroughs in domains such as computer vision, natural language processing, and robotics. However, training and deploying deep learning models at scale in data centers is a complex task. That’s where MindX DL comes into play. MindX DL is a set of deep learning components designed to support data center training and inference hardware powered by Ascend AI Processors. You can find detailed information about Ascend AI Processors in the article “World of Huawei Ascend: Future with NPUs”. In this article, we will explore the features and benefits of MindX DL and how it simplifies the development and deployment of deep learning platforms.

Let’s begin.

MindX DL Components

MindX DL offers a range of components that provide crucial functionalities for efficient deep learning operations in data centers. Let’s take a closer look at some of these components:

  • Ascend Docker Runtime: This component enables containers to utilize Ascend NPUs, providing a runtime environment for deep learning applications.
  • Ascend Device Plugin: The device plugin supports NPU device management, allowing efficient utilization and resource allocation.
  • Volcano: Volcano’s integration with MindX DL optimizes NPU scheduling, allowing for resumable training and rescheduling in case of inference card faults, ultimately improving the reliability and availability of deep learning workloads.
  • HCCL-Controller: The HCCL-Controller generates the ranktable file (hccl.json) required for NPU training jobs.
  • NodeD: NodeD supports resumable training upon node faults, ensuring training jobs automatically resume on healthy nodes in case of hardware or network issues.
  • NPU-Exporter: NPU-Exporter exposes the status of NPU devices for monitoring, providing insights into the health and performance of Ascend NPUs.
  • Resilience-Controller: The Resilience-Controller component supports the minimum service system, ensuring the essential services required for deep learning operations are maintained even in the presence of faults.
  • Elastic-Agent: Elastic-Agent provides the dying gasp function for resumable training, allowing training jobs to gracefully handle hardware or network failures.

Structure of MindX DL and Other Components
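
To make the HCCL-Controller's role more concrete, here is a minimal sketch of the idea behind a rank table: every NPU allocated to a training job gets a unique global rank so the devices can find each other. The `Node` class, the `build_rank_table` function, and the JSON layout are all hypothetical and deliberately simplified. This is not the actual hccl.json schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Node:
    """A training node and the NPU device IDs allocated on it (hypothetical model)."""
    name: str
    device_ids: list[int]


def build_rank_table(nodes: list[Node]) -> dict:
    """Assign a unique global rank to every allocated NPU, node by node."""
    servers = []
    rank = 0
    for node in nodes:
        devices = []
        for device_id in node.device_ids:
            # Each device keeps its local ID and receives the next global rank.
            devices.append({"device_id": device_id, "rank_id": rank})
            rank += 1
        servers.append({"server": node.name, "devices": devices})
    # Simplified, illustrative layout; the real ranktable format may differ.
    return {"server_count": len(servers), "rank_count": rank, "servers": servers}


nodes = [Node("node-a", [0, 1]), Node("node-b", [0, 1])]
rank_table = build_rank_table(nodes)
print(json.dumps(rank_table, indent=2))
```

In this sketch, the two devices on node-a receive ranks 0 and 1, and the two on node-b receive ranks 2 and 3.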

Together, these components give MindX DL the following capabilities:

  1. Cluster Scheduling: MindX DL enhances NPU (Neural Processing Unit) scheduling based on Kubernetes and provides advanced capabilities for checking the NPU and node status. This ensures optimal resource allocation and performance for training and inference tasks.
  2. Encryption and Decryption: MindX DL includes encryption and decryption functions throughout the model lifecycle, enabling secure deployment and protection against unauthorized access or theft. This is crucial for safeguarding the valuable AI models developed by enterprises.
  3. Toolbox: The Toolbox component offers a suite of useful functions such as bandwidth testing, computing power testing, and power consumption testing. It supports standard PCIe cards, board cards, and modules of Atlas products. These tests help evaluate and optimize the performance of Ascend AI Processors and gather essential insights for system optimization.
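
The cluster scheduling idea in item 1 can be sketched as a small placement function that skips unhealthy nodes and picks one with enough free NPUs. The `NodeStatus` class, the `pick_node` function, and the best-fit policy are assumptions made for illustration. They do not reflect how Volcano or the MindX DL scheduler actually chooses a node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeStatus:
    """Hypothetical snapshot of a node as seen by the scheduler."""
    name: str
    healthy: bool
    free_npus: int


def pick_node(nodes: list[NodeStatus], requested_npus: int) -> Optional[str]:
    """Return the healthy node that fits the request with the fewest NPUs left over."""
    candidates = [n for n in nodes if n.healthy and n.free_npus >= requested_npus]
    if not candidates:
        return None  # no healthy node can host the job right now
    # Best fit: minimize leftover NPUs, break ties by node name.
    best = min(candidates, key=lambda n: (n.free_npus - requested_npus, n.name))
    return best.name


cluster = [
    NodeStatus("node-a", healthy=True, free_npus=8),
    NodeStatus("node-b", healthy=False, free_npus=8),
    NodeStatus("node-c", healthy=True, free_npus=4),
]
print(pick_node(cluster, requested_npus=4))   # node-c
print(pick_node(cluster, requested_npus=16))  # None
```

A best-fit policy is used here only because it is easy to follow: it keeps larger nodes free for larger jobs. A production scheduler would also weigh factors such as NPU topology and job priority.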

Scenarios and Use Cases

MindX DL components can be utilized in various scenarios to build powerful deep learning platforms. Here are a few examples:

  1. Training and Inference Jobs: Users can leverage MindX DL components to quickly build training and inference jobs based on Ascend AI Processors. The components provide essential functionalities for job scheduling, resource management, and performance optimization.
  2. Model Protection: The model protection component enables developers to integrate and develop training and inference services based on model encryption and decryption. This ensures the security and integrity of AI models throughout their lifecycle.
  3. Performance Testing and Optimization: The MindX Toolbox component allows users to test the computing power, bandwidth, and power consumption of Ascend processors. This information is crucial for identifying performance bottlenecks, optimizing system configurations, and maximizing the efficiency of deep learning workloads.

Advanced Features

MindX DL offers advanced features that enhance the efficiency and robustness of deep learning operations. Let’s explore three key features:

  1. Resumable Training: In the event of a hardware or network fault, resumable training ensures that the training job automatically resumes on a healthy NPU device or node. This feature reduces training disruptions and improves overall job reliability.
  2. Minimum Service System: MindX DL introduces the Minimum Service System feature, which provides fault tolerance for training nodes managed by the cluster scheduling component. In the event of a faulty training node, the cluster scheduling component isolates the node and reschedules the job based on the preset job scale and available nodes.
  3. Inference Fault Tolerance: If an inference processor resource becomes faulty, the cluster scheduling component automatically isolates the faulty resource and triggers rescheduling to ensure uninterrupted inference operations. This feature enhances the reliability and availability of inference services.
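
The Minimum Service System behavior described in item 2 can be illustrated with a short sketch: faulty nodes are isolated, and the job is rescheduled at the largest preset scale that the remaining healthy nodes can satisfy. The `Job` class, its `preset_scales` field, and the `reschedule` function are hypothetical names chosen for this example, not part of the MindX DL API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Hypothetical training job with the node counts it is allowed to run at."""
    name: str
    preset_scales: list[int]


def reschedule(job: Job, nodes: dict[str, bool]) -> Optional[list[str]]:
    """Isolate faulty nodes, then pick the largest preset scale that still fits."""
    # Nodes mapped to False are treated as faulty and excluded from placement.
    healthy = sorted(name for name, ok in nodes.items() if ok)
    for scale in sorted(job.preset_scales, reverse=True):
        if scale <= len(healthy):
            return healthy[:scale]
    return None  # not even the smallest preset scale can be satisfied


job = Job("resnet-train", preset_scales=[8, 4, 2])
nodes = {f"node-{i}": True for i in range(8)}
nodes["node-3"] = False  # simulate a fault on one training node
placement = reschedule(job, nodes)
print(placement)  # ['node-0', 'node-1', 'node-2', 'node-4']
```

With one of eight nodes faulty, the 8-node scale no longer fits, so the sketch falls back to the 4-node scale and places the job on healthy nodes only.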

Conclusion

MindX DL offers a powerful solution for data center training and inference using Ascend AI Processors. Its comprehensive set of deep learning components simplifies the development and deployment of deep learning platforms, ensuring optimal performance, security, and reliability. The cluster scheduling component optimizes NPU scheduling, maximizing the utilization of Ascend AI Processors. The model protection component safeguards AI models through encryption and decryption, protecting them from unauthorized access or theft. The Toolbox component provides valuable insights for system optimization, enabling organizations to fine-tune hardware configurations. MindX DL’s advanced features, such as resumable training and inference fault tolerance, enhance the efficiency and reliability of deep learning operations. Overall, MindX DL empowers users to harness the full potential of deep learning and drive innovation in AI applications. For detailed information about MindX DL, you can visit this link.
