Develop hardware-efficient AI without being a hardware expert

Huizi Mao
OmniML
Nov 28, 2022 · 6 min read

A common pattern we have observed is that building and deploying a production ML model are handled by different teams, typically an ML algorithm team and a device operations team. The ML team handles model training and evaluation, while the device team is responsible for migrating the model to the production environment.

Such a separation exists partly because training and inference have diverged, in both hardware platforms and software stacks. In the past, we used Caffe on a GPU server for both training and serving. Nowadays, people use powerful tools and servers for training, then deploy models to highly optimized runtimes and diverse devices. Deployment issues arise frequently due to model complexity and hardware limitations, and the ML team usually has to rely on the device operations team’s feedback to solve them.

As a result, machine learning engineers (MLEs) often lack even basic insight into the deployability of their own models. Today’s model deployment is often an in-house-built pipeline composed of multiple Bash/Python scripts spanning different machines and environments. It also involves multiple open-source libraries or vendor-specific toolkits for model conversion, quantization, performance tuning, and accuracy verification. It is not a pleasant experience compared with model development in a cloud-native Python environment.

Many choices and tools for PyTorch model deployment.

In addition to the tooling complexity, the lack of performance interpretability is another issue. Profiling reports from downstream toolchains often require domain knowledge to understand and convert into actionable model insights, as in the TensorRT example below. Combined with the long model conversion pipeline, this makes it difficult for ML developers to identify the actual performance bottlenecks of their own models and make the right changes.

Example profiling report from the TensorRT documentation, which requires domain knowledge to understand.

Despite these drawbacks, separating model design from model deployment is still the norm in the industry, as the two usually require completely different skill sets. “It’s already difficult to hire an ML expert or a device expert, let alone an expert in both,” is what we keep hearing from our customers. But that doesn’t mean we should keep tolerating the current ML workflow.

In Software 1.0, it is hard to imagine a program being written by one engineer and compiled by another. Programmers can compile the code themselves, with little or no knowledge of the underlying stages such as assembly and linking, while still obtaining meaningful insights to fix their code. Without such insights, the debugging process could become an endless back-and-forth between two engineers who don’t speak each other’s language.

So far, the most common problems we have seen that delay the deployment of ML models are:

  1. Intolerable latency/throughput
  2. Unsupported operators
  3. Accuracy mismatch

These problems are much easier to solve when a single engineer has the full context of both model design and deployment. Today, however, they can easily take weeks or even months due to the communication overhead between multiple stakeholders. The abundance of tools in the industry partially alleviates these problems, but fails to fully resolve the ecosystem fragmentation and the knowledge gap between model design and deployment.

A simple and comprehensible workflow for model deployment and diagnosis is the solution. An interface that ML engineers can use and understand by themselves will greatly improve productivity.

Building a super-powerful ML compiler is only part of the solution, because there are fundamental differences between Software 2.0 and Software 1.0: first, new operators constantly emerge from academia, and many of them cannot be composed from existing ones; second, ML models can tolerate operator replacements that do not preserve the exact function and still retain similar accuracy, which gives ML developers more flexibility for customization. We will not delve into the details here, as model customization for deployment is probably worth a separate blog post.

At OmniML, we started by building an internal tool for our engineers to deploy and profile their ML models without the hassle of studying the hardware and its ML toolchain. We soon realized the productivity boost such a tool brings. In addition, this unified, information-rich interface enables both humans and algorithms to unlock great model optimization opportunities. Therefore, those features are now formally available in OmniML’s products: Omnimizer and Omnimizer Enterprise.

Omnimizer — a unified model optimization and deployment platform

Omnimizer is primarily designed for ML engineers who design and train PyTorch models. It helps them identify design flaws and cut down the time to production.

Omnimizer provides a PyTorch-native and cloud-native interface for quickly testing a model on target hardware. Users only need to specify a high-level deployment configuration and send the request to the OmniML-hosted device cloud. Key deployment information, including overall latency, layerwise latency, and deployment errors (if any), is reported back in a form that a non-hardware expert can understand.
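As a rough sketch of what this flow looks like in spirit (the names below, such as `DeploymentConfig` and `deploy`, are hypothetical placeholders rather than the actual Omnimizer API):

```python
import torch
import torchvision

# Hypothetical interface: names are illustrative only, not the actual Omnimizer API.
from omnimizer import DeploymentConfig, deploy

model = torchvision.models.resnet50(weights="DEFAULT").eval()

# High-level deployment configuration: target device, input shape, precision.
# Conversion, quantization, and runtime details are handled by the service.
config = DeploymentConfig(
    target="qualcomm-snapdragon",
    input_shape=(1, 3, 224, 224),
    precision="int8",
)

# Send the model to the hosted device cloud and wait for the report.
report = deploy(model, config)

print(report.overall_latency_ms)   # end-to-end latency on the target device
print(report.layerwise_latency())  # per-layer breakdown, mapped back to PyTorch modules
print(report.errors)               # deployment errors, if any
```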

Omnimizer also lets users easily compare on-device accuracy with the original model. It removes the hassles of transferring the model and data to a different device, getting familiar with different OSes and toolchains, and maintaining a pile of disorganized, bug-ridden scripts. In the following example, users get the model output through a PyTorch-like interface, while the actual inference happens on a remote device, e.g., a server-class GPU or a smartphone.
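A minimal sketch of that idea, again using hypothetical names (the `remote` wrapper and its `device` argument are illustrative, not the real API):

```python
import torch
import torchvision

# Hypothetical names for illustration only; not the actual Omnimizer API.
from omnimizer import remote

model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()
x = torch.randn(1, 3, 224, 224)

# Reference output computed locally with PyTorch.
with torch.no_grad():
    ref = model(x)

# Same call signature as the local model, but inference runs on a remote
# device (e.g., a server-class GPU or a smartphone) behind the scenes.
remote_model = remote(model, device="android-smartphone")
out = remote_model(x)

# Compare the on-device output against the original model.
print(torch.max(torch.abs(ref - out)))
```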

Omnimizer is not only a software library, but also an MLOps platform that provides a user-friendly interface to navigate the performance details, share model insights, and reproduce deployment environments. Users are able to view the deployment steps, obtain the benchmarked latency, and get a better understanding of their model on the hardware.

Omnimizer Enterprise — Unleash the full potential of AI hardware

Compared with the community version, which assists with model deployment and optimization, the enterprise version adds automated model optimization based on years of research on Neural Architecture Search (NAS), along with extensive customization for enterprise needs.

NAS has long been regarded as a costly process that requires deep expertise in search space and proxy task design. With Omnimizer, every user can apply NAS to customize their models for target hardware. The process requires just a few lines of code change and low training costs, and, most importantly, no expertise in model design or hardware performance.
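To give a feel for what “a few lines of code change” could look like, here is a sketch under assumed, hypothetical names (`omnimize`, `search`, and their arguments are stand-ins, not the actual Omnimizer interface):

```python
import torch.nn as nn

# Hypothetical API for illustration; not the actual Omnimizer interface.
from omnimizer.nas import omnimize, search

base_model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

# Wrap the model so that layer widths/depths become searchable, then search
# for a variant that meets a latency budget on the chosen hardware while
# the rest of the training pipeline stays unchanged.
elastic_model = omnimize(base_model)
best_model = search(
    elastic_model,
    target="nvidia-jetson",
    latency_budget_ms=5.0,
)
```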

Omnimizer can easily integrate with open-source repositories and accelerate off-the-shelf models with little manual optimization. So far, OmniML and its customers have demonstrated 1.2–6.8x speed-ups on NVIDIA and Qualcomm platforms. These repositories will be made available to enterprise users as examples.

In addition, enterprise users will have access to more hardware platforms, such as NVIDIA Jetson and Qualcomm Robotics platforms, along with the option of on-premises installation.

Try Omnimizer!

Omnimizer Beta has just been released. Sign up now at OmniML’s website and start optimizing your workflow!

Di Wu, Song Han, Lucas Liebenwein, Riyad Islam, Keval Morabia, Asma K.T. Beevi, Henry Guo, and Shengliang Xu also contributed to this article.
