Portability of AI stack

Jaideep Ray · Published in Better ML · Jan 22, 2024

Context:

AI authoring stacks have rapidly evolved over the last few years. There have been massive changes in model architectures, GPU capabilities, distributed training methods, model optimization techniques, serialization formats, serving stacks, and more.

The popularity and adoption of AI stacks have fluctuated over time, as different frameworks have emerged and evolved. The current leaders in deep learning are Pytorch and its ecosystem, Tensorflow, and JAX.

Source: https://www.assemblyai.com/blog/pytorch-vs-tensorflow-in-2023/

Many recent advancements have been achieved by scaling up the training and inference of large models using GPUs. Gradually, other accelerators such as TPUs, Inferentia, Trainium, and Habana are also making progress in both training and inference. Each framework has specialized kernels implemented within it to support a subset of these accelerators.

As both software and hardware continue to evolve, each with their own strengths and weaknesses, it is quite common for enterprises to use multiple frameworks and accelerators internally. For instance, Pytorch with its GPU support has been widely used for Large Language Models (LLMs), with ~85% of Hugging Face LLMs being exclusive to Pytorch, while Tensorflow and its ecosystem (TFX, TFServing) have been widely used for traditional recommendation models.

As machine learning usage and operations expand within an enterprise, supporting multiple frameworks becomes a significant drag on developer productivity. Every developer, whether working on modeling, ML operations, platforms, or infrastructure, must invest considerable effort to gain expertise in each framework in order to build high-quality models and efficient ML systems.

Why is portability a growing concern?

Choosing one framework for most production use cases might seem like a smart move for improving developer productivity. It would save the hassle of working with different tools, frameworks, and services that depend on the framework. But AI/ML is still evolving rapidly.

We will continue to see rapid innovation in model architectures, accelerators, and tooling ecosystems that boosts efficiency and accessibility while lowering costs. The ML platform needs to be modular enough to adapt to these changes. However, this is not an easy choice, as it involves significant technical and organizational challenges that we explore next.

Technical challenges:

Function portability: Porting functions across frameworks is not always straightforward. It can surface type mismatches (for instance, missing support for specific types like bfloat16), performance discrepancies, and unimplemented kernels for certain accelerators. As per the paper [1], nearly 20% of Tensorflow benchmark functions fail on GPUs and nearly 40% of Pytorch functions fail on TPUs.

Figure: failure rates for the top 20 most-used functions vs. overall failure rates [1]
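To make this gap concrete, here is a minimal sketch, assuming Pytorch is installed, that probes whether a few ops run for a given device and dtype combination. The op list and the probe_op helper are illustrative choices, not taken from the paper.

```python
# Minimal sketch: probe whether a few ops run for a given (device, dtype) pair.
# Assumes Pytorch is installed; the op list and helper are illustrative only.
import torch

def probe_op(fn, device: str, dtype: torch.dtype) -> bool:
    """Return True if fn executes on `device` with `dtype`, False otherwise."""
    try:
        x = torch.randn(8, 8, device=device, dtype=dtype)
        fn(x)
        return True
    except (RuntimeError, TypeError, NotImplementedError):
        return False

ops = {
    "matmul": lambda x: x @ x,
    "softmax": lambda x: torch.softmax(x, dim=-1),
    "erfinv": lambda x: torch.erfinv(x.clamp(-0.9, 0.9)),
}

devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for device in devices:
    for dtype in (torch.float32, torch.bfloat16):
        for name, fn in ops.items():
            ok = probe_op(fn, device, dtype)
            print(f"{device:>4} {str(dtype):>15} {name:>8}: {'ok' if ok else 'FAIL'}")
```

Running a sweep like this over the ops a model actually uses gives an early, cheap signal of which kernels are missing on a target backend before committing to a port.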

Performance: Deep learning frameworks exhibit a “first-class citizen” effect with particular hardware lines. For instance, Pytorch is optimized for GPUs, while Jax performs best on TPUs. Consequently, porting Pytorch models to TPUs can result in performance degradation, if not outright portability failures.
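A rough way to see this effect is to time the same workload on each backend you have available. Below is a minimal timing sketch, assuming Pytorch; the matrix size and iteration count are arbitrary, and a real comparison would need proper warm-up and framework-specific profilers.

```python
# Minimal timing sketch, assuming Pytorch. Real benchmarks need careful
# warm-up, synchronization, and framework-specific profilers.
import time
import torch

def time_matmul(device: str, n: int = 2048, iters: int = 20) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    # Warm-up so lazy initialization does not skew the measurement.
    for _ in range(3):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for device in ["cpu"] + (["cuda"] if torch.cuda.is_available() else []):
    print(f"{device}: {time_matmul(device) * 1e3:.2f} ms per matmul")
```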

Ecosystem: The scope of machine learning operations extends beyond just training and serving. A plethora of tools and services are maintained internally to assist in the machine learning lifecycle, including data ingestion, validation, benchmarking, evaluation, and more. These tools are typically not framework-agnostic. Therefore, migrating from one framework to another necessitates the migration of all associated tooling as well.
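One mitigation is to keep platform tooling behind a thin, framework-agnostic interface so that only the adapters change during a migration. The sketch below is hypothetical: the Predictor protocol and adapter classes are illustrative names, not an existing library.

```python
# Hypothetical sketch of a framework-agnostic inference interface.
# Only the adapters need to change when the underlying framework does.
from typing import Protocol, Sequence

class Predictor(Protocol):
    def predict(self, batch: Sequence[list]) -> list: ...

class TorchPredictor:
    """Adapter wrapping a Pytorch module (assumed to be a torch.nn.Module)."""
    def __init__(self, module):
        import torch
        self._torch = torch
        self._module = module.eval()

    def predict(self, batch):
        with self._torch.no_grad():
            out = self._module(self._torch.tensor(batch, dtype=self._torch.float32))
        return out.tolist()

class TFPredictor:
    """Adapter wrapping a Tensorflow model (assumed to be a tf.keras.Model)."""
    def __init__(self, model):
        import tensorflow as tf
        self._tf = tf
        self._model = model

    def predict(self, batch):
        out = self._model(self._tf.constant(batch, dtype=self._tf.float32))
        return out.numpy().tolist()

def run_validation(predictor: Predictor, batch):
    """Platform tooling depends only on the Predictor protocol, not on a framework."""
    return predictor.predict(batch)
```

The design choice here is simply dependency inversion: validation, benchmarking, and evaluation tools code against the narrow interface, so a framework migration touches the adapters rather than every tool.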

Organizational challenges:

In addition to the technical hurdles, transitioning machine learning operations from one framework to another often presents numerous organizational obstacles. This shift can significantly affect the productivity of machine learning developers and consequently delay the delivery of product outcomes.

  1. Re-skilling: Mastering the intricacies of a new deep learning framework takes substantial time and effort. Building high-performance ML systems involves many non-standard optimizations that need to be relearned.
  2. Impact on product deliverables: Ideally, offline validation of models would be enough, but online metrics often differ from offline ones. This is especially true when a slight change in a model shifts its output distribution and leads to different product outcomes. Strategic ML models need to be rigorously tested before replacing existing ones. The experimentation pipeline (including human-feedback-based evaluation, A/B testing to monitor engagement changes, etc.) can take several weeks or more to fully verify the results. Experimenting on these models slows product delivery, as developers become occupied with running experiments and understanding their impact.

Porting from one stack to another is not an easy task. The best way to ease the transition is to plan ahead: establish benchmarks for both model quality and performance, identify the key optimizations applied in the current stack, and build tools that give faster feedback, such as label-distribution drift detection and layer-wise debugging.
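As one example of such a faster-feedback tool, a small check that compares the output label distribution of the old and new stacks can catch gross porting errors before a full A/B test. A minimal sketch using a KL-divergence threshold follows; the threshold value is an arbitrary placeholder, not a recommendation.

```python
# Minimal sketch of a label-distribution drift check between two stacks.
# The KL-divergence threshold is an arbitrary placeholder, not a recommendation.
from collections import Counter
import math

def label_distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    keys = set(p) | set(q)
    return sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps)) for k in keys)

def check_drift(old_labels, new_labels, threshold=0.05):
    kl = kl_divergence(label_distribution(old_labels), label_distribution(new_labels))
    return kl, kl <= threshold

# Example: predictions from the existing stack vs. the ported model.
old = ["cat", "cat", "dog", "dog", "dog", "bird"]
new = ["cat", "dog", "dog", "dog", "dog", "bird"]
kl, ok = check_drift(old, new)
print(f"KL divergence = {kl:.4f}, within threshold: {ok}")
```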

To conclude with some success stories, Pinterest improved their internal ML platform net promoter score by 43% by unifying frameworks and offering a standardized environment for ML development [2].

Meta was able to scale their model training on GPUs by switching from Caffe2 to Pytorch and integrating the backends. This was a multi-year project that boosted developer productivity considerably [3].
