LazyTensor in Action at Facebook & Google

Gayan Samuditha
Published in Expo-MAS · 4 min read · Apr 11, 2021

*******************************************************************

Facebook & Google’s LazyTensor Enables Expressive Domain-Specific Compilers

A team from Facebook and Google has proposed LazyTensor — a technique for targeting domain-specific compilers without sacrificing define-by-run ergonomics.

Facebook and Google researchers have united to introduce a novel technique that combines eager execution and domain-specific compilers (DSCs) to exploit the benefits of both. The proposed “LazyTensor” enables the full use of all host programming language features throughout the Tensor portion of users’ programs.


Eager execution is an imperative, define-by-run interface that is expressive and easy to debug, and it forms the basis of the most widely adopted machine learning frameworks. Optimizing domain-specific compilers (DSCs), meanwhile, are a proven way to improve the performance of machine learning (ML) models, but they suffer from a “language subset problem” that limits expressivity. The two are made to work together in the new paper LazyTensor: Combining Eager Execution with Domain-Specific Compilers.
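To make the “language subset problem” concrete, here is a minimal illustration (not from the paper; plain PyTorch eager code). In define-by-run execution, ordinary Python control flow can depend on a tensor’s runtime value, which is exactly the kind of host-language feature that graph-capturing front ends typically restrict or reject:

```python
import torch

def clipped_scale(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent Python `if`: fine under eager execution, because the
    # comparison is evaluated immediately on a concrete value. A tracer that
    # only accepts a restricted language subset would either reject this or
    # silently freeze the branch taken for the example input.
    if x.abs().max() > 1.0:
        x = x / x.abs().max()
    return x * 2.0

print(clipped_scale(torch.tensor([0.5, 3.0])))  # runs and prints immediately
```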

The researchers summarize their study’s main contributions as:

  • A technique for combining an eager programming model of Tensor programs with domain-specific compilers, without restricting the expressivity of the user’s programming language. The approach is general enough to be applied to any define-by-run machine learning framework.
  • An implementation of the LazyTensor design across two different machine learning frameworks in two different programming languages: PyTorch and Swift for TensorFlow.
  • An evaluation of the design across multiple languages, Tensor types, and accelerators (GPUs and TPUs).

A Tensor is a generalization of vectors and matrices that can be understood as a multidimensional array; in deep learning frameworks, Tensors are the core multidimensional array abstraction used during training and inference. DSCs operate on this Tensor abstraction to target domain-specific hardware (e.g. TPUs) and improve performance for a given hardware footprint. DSCs consume source programs in a compiler-specific intermediate representation (IR), e.g. XLA HLO IR, whose syntax is extremely verbose and whose memory allocation is inflexible. This contrasts with eager execution or “define-by-run” libraries, which give users the full power and expressivity of a general-purpose programming language and are easier to debug and more flexible.

To combine both strengths, the researchers first defined a Tensor API, whose advantages include the ability to use the complete host language for function abstraction, control flow, and data structures.
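A hedged sketch of what that looks like in practice (plain PyTorch eager code used for illustration; the layer sizes and the `mlp_layer` helper are made up for this example): ordinary Python functions, loops, and native data structures drive the Tensor computation directly.

```python
import torch

def mlp_layer(x, weight, bias):          # plain host-language function abstraction
    return torch.relu(x @ weight + bias)

# Native Python data structure holding the parameters of several layers.
layers = [
    {"weight": torch.randn(8, 16), "bias": torch.zeros(16)},
    {"weight": torch.randn(16, 4), "bias": torch.zeros(4)},
]

x = torch.randn(2, 8)
for layer in layers:                      # plain Python control flow over the list
    x = mlp_layer(x, layer["weight"], layer["bias"])
print(x.shape)                            # torch.Size([2, 4])
```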

How LazyTensor Works

Behind this Tensor API, the LazyTensor system builds on an underlying eager runtime and the XLA domain-specific compiler. LazyTensor has three main components: a custom Tensor type with an API identical to an existing Tensor type; a mapping from high-level Tensor operations to XLA HLO sequences implementing the semantics of the requested operation; and a runtime that lowers sequences of Tensor operations into XLA HLO IR and orchestrates compilation and execution of the resulting program.
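The following is a deliberately simplified sketch of that idea, not the actual PyTorch/XLA or Swift for TensorFlow implementation (the class and method names here are hypothetical, and NumPy stands in for the compiler back end): a tensor type with the usual operator API records operations instead of executing them, and a `materialize` step plays the role of lowering the recorded sequence to XLA HLO, compiling, and running it.

```python
import numpy as np

class LazyTensor:
    """Records ops into a trace; evaluation is deferred until materialize()."""
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    @staticmethod
    def constant(array):
        return LazyTensor("constant", value=np.asarray(array))

    def __add__(self, other):            # same surface API as an eager tensor...
        return LazyTensor("add", (self, other))

    def __matmul__(self, other):         # ...but each op just extends the trace
        return LazyTensor("matmul", (self, other))

    def materialize(self):
        # Stand-in for "lower the op sequence to XLA HLO, compile, and execute".
        if self.op == "constant":
            return self.value
        args = [t.materialize() for t in self.inputs]
        return {"add": np.add, "matmul": np.matmul}[self.op](*args)

a = LazyTensor.constant([[1.0, 2.0], [3.0, 4.0]])
b = LazyTensor.constant([[1.0, 0.0], [0.0, 1.0]])
c = a @ b + a            # builds a trace; nothing is executed yet
print(c.materialize())   # "compilation" and execution happen here
```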

To test LazyTensor system performance, the researchers conducted experiments across a number of dimensions and applications, including Code Reuse, Training Transformers on Cloud TPUs, and Scaling ResNet-50 on TPUs.

[Figure: ResNet-50 with TensorFlow 2.0]

[Figure: Code reuse results. Time required to train HuggingFace’s RoBERTa-base (125M parameters) on the raw WikiText-103 dataset for three epochs using half precision.]
[Figure: Time spent in the respective operations of the WordSeg algorithm under the different Swift for TensorFlow Tensor approaches when run on a GPU.]

The experiments validated LazyTensor’s reusability across different programming languages. PyTorch LazyTensor enables the popular HuggingFace Transformers library to run on Cloud TPUs using XLA and demonstrates significant performance improvements on TPUs compared to GPU hardware. LazyTensor is also able to scale to large TPU supercomputers.
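As a rough idea of how that reuse looks from user code, here is a hedged sketch of running a HuggingFace masked-language model on an XLA device via PyTorch’s lazy tensor path. It assumes a Cloud TPU environment with the torch_xla and transformers packages installed, and uses the legacy torch_xla API (exact entry points vary across versions); it is not the paper’s benchmark script.

```python
import torch
import torch_xla.core.xla_model as xm          # PyTorch/XLA lazy-tensor backend
from transformers import AutoModelForMaskedLM, AutoTokenizer

device = xm.xla_device()                        # lazy XLA device (TPU when available)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").to(device)

batch = tokenizer(["LazyTensor traces ops and compiles them with XLA."],
                  return_tensors="pt").to(device)

# Operations on XLA-device tensors are recorded lazily rather than run eagerly.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()

xm.mark_step()   # cut the trace: lower to XLA HLO, compile, and execute on the TPU
```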


The paper LazyTensor: Combining Eager Execution with Domain-Specific Compilers is on arXiv.

*******************************************************************


Gayan Samuditha
Expo-MAS

Software Engineer, Biologist, Techie; trying to save human lives by combining medical informatics and AI.