The Architectures Powering Machine Learning at Google, Facebook, Uber, LinkedIn

Despite the hype surrounding machine learning (ML) and artificial intelligence (AI), most efforts in the enterprise remain in a pilot stage. Part of the reason for this phenomenon is the natural experimentation associated with machine learning projects, but a significant component is also the lack of maturity of machine learning architectures. This problem is particularly visible in enterprise environments, where the application lifecycle management practices of modern machine learning solutions conflict with corporate practices and regulatory requirements. What are the key architecture building blocks that organizations should put in place when adopting machine learning solutions? The answer is far from trivial, but recently we have seen efforts from research labs laying down the path for what could become reference architectures for large-scale machine learning solutions.

Goutam Biswas · 4 min read · Nov 23, 2021

The challenge of establishing reference architectures for large-scale machine learning solutions is accentuated by two main factors:

  1. Machine learning frameworks and infrastructure have evolved considerably faster than the adoption of those technologies in mainstream environments.
  2. The lifecycle of machine learning solutions is fundamentally different from other software disciplines.

One thing we can do to mitigate those challenges is to draw inspiration from some of the biggest companies in the world that are deploying machine learning at scale. Today, we would like to discuss some of the reference architectures used by AI powerhouses like Google, Facebook, LinkedIn, and Uber to enable their machine learning pipelines. Let’s dive in.

Uber’s Michelangelo

One of the best-known efforts in this area, Uber’s Michelangelo is the runtime powering hundreds of machine learning workflows at Uber. From experimentation to model serving, Michelangelo combines mainstream technologies to automate the lifecycle of machine learning applications. The architecture behind Michelangelo uses a modern but complex stack based on technologies such as HDFS, Spark, Samza, Cassandra, MLlib, XGBoost, and TensorFlow.

Image source: Uber blog
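
To make that stack concrete, here is a minimal, purely illustrative sketch of the kind of training step such a platform automates, pairing Spark-prepared features with an XGBoost learner. This is not Uber’s actual code; the table, column names, and parameters are hypothetical.

```python
# Illustrative sketch only: Spark-prepared features feeding an XGBoost
# learner, mirroring the Michelangelo-style stack described above.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession
import xgboost as xgb

spark = SparkSession.builder.appName("eta-training-sketch").getOrCreate()

# Pull a training snapshot from the offline feature store (e.g., Hive over HDFS).
features = spark.sql(
    "SELECT trip_distance_km, hour_of_day, num_segments, actual_eta_min "
    "FROM eta_training_features"
).toPandas()

X = features.drop(columns=["actual_eta_min"])
y = features["actual_eta_min"]

# Fit a gradient-boosted regressor on the snapshot and persist it.
model = xgb.XGBRegressor(n_estimators=200, max_depth=6)
model.fit(X, y)
model.save_model("eta_model.json")
```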

Michelangelo powers hundreds of machine learning scenarios across different divisions at Uber. For instance, Uber Eats uses machine learning models running on Michelangelo to rank restaurant recommendations. Similarly, the remarkably accurate estimated times of arrival (ETAs) in the Uber app are calculated by sophisticated machine learning models running on Michelangelo that estimate ETAs segment by segment.

Facebook’s FBLearner Flow

FBLearner Flow is the backbone of machine learning applications at Facebook. The platform automates different elements of the machine learning workflow, such as feature extraction, training, model evaluation, and inference. FBLearner Flow integrates with several machine learning frameworks and tools, including Facebook’s own Caffe2, PyTorch, and ONNX.

Image credit: Facebook Blog
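
One concrete piece of that interoperability story is ONNX export. Below is a minimal sketch, assuming a toy PyTorch model, of exporting a trained network to ONNX so a different runtime can serve it; the model architecture and file names are hypothetical, not Facebook’s code.

```python
# Illustrative sketch: exporting a PyTorch model to ONNX, the kind of
# framework interoperability FBLearner Flow relies on.
# The model and file names here are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# An example input fixes the tensor shapes used to trace the graph.
dummy_input = torch.randn(1, 32)
torch.onnx.export(
    model,
    dummy_input,
    "classifier.onnx",
    input_names=["features"],
    output_names=["scores"],
)
```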

Google’s TFX

Google has also created its own runtime for executing machine learning workflows. TFX is based on a published research paper that proposes an architecture for streamlining the operationalization of TensorFlow programs. TFX includes several key components of TensorFlow architectures, such as a learner for generating models based on training data, modules for analyzing and validating both data and models, and finally, infrastructure for serving models in production.

Image credit: SIGKDD

The ideas behind TFX were incorporated into the TensorFlow framework in the form of an automation pipeline known as TensorFlow Extended (also TFX 😉). Conceptually, TensorFlow Extended is a collection of components that automate the end-to-end lifecycle of a machine learning pipeline. The architecture is outlined in the following figure and includes components from all aspects of a machine learning pipeline, from data ingestion to model serving.

Image credit: GitHub
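
As a flavor of what those components look like in code, here is a minimal pipeline sketch using the public TFX Python API, wiring data ingestion (CsvExampleGen), training (Trainer), and model export (Pusher). The paths and the trainer module file are hypothetical placeholders.

```python
# Illustrative sketch of a minimal TensorFlow Extended (TFX) pipeline;
# paths and the trainer module file are hypothetical placeholders.
from tfx import v1 as tfx

# Ingest training data from CSV files.
example_gen = tfx.components.CsvExampleGen(input_base="data/")

# Train a model using user code that defines run_fn().
trainer = tfx.components.Trainer(
    module_file="trainer_module.py",
    examples=example_gen.outputs["examples"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
)

# Export the trained model to a serving directory.
pusher = tfx.components.Pusher(
    model=trainer.outputs["model"],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory="serving_model/"
        )
    ),
)

pipeline = tfx.dsl.Pipeline(
    pipeline_name="tfx_sketch",
    pipeline_root="pipeline_root/",
    components=[example_gen, trainer, pusher],
)

tfx.orchestration.LocalDagRunner().run(pipeline)
```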

LinkedIn’s Pro-ML

The core of LinkedIn’s machine learning infrastructure is a proprietary system called Pro-ML. Conceptually, Pro-ML controls the entire lifecycle of machine learning models, from training to monitoring. In order to scale Pro-ML, LinkedIn has built an architecture that combines some of its open-source technologies, such as Kafka and Samza, with infrastructure building blocks like Spark and Hadoop YARN.

While most of the technologies used as part of LinkedIn’s machine learning stack are well known, there are a few newer contributions that deserve further exploration:

  • Ambry: LinkedIn’s Ambry is a distributed, immutable blob storage system that is highly available and easy to scale. It is optimized to serve immutable objects ranging from a few KBs to multiple GBs in size with high throughput and low latency, and it enables end-to-end streaming between clients and the storage tiers.
  • TonY: TensorFlow on YARN (TonY) is a framework for running TensorFlow natively on Apache Hadoop. TonY can run either single-node or distributed TensorFlow training as a Hadoop application (see the sketch after this list).
  • Photon ML: Photon ML is a machine learning library based on Apache Spark. Currently, Photon ML supports training different types of Generalized Linear Models (GLMs) and Generalized Linear Mixed Models (GLMMs/GLMix models): logistic, linear, and Poisson.
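
To illustrate the TonY bullet above: TonY launches TensorFlow workers as YARN containers and populates the TF_CONFIG cluster spec for each of them, so the training script itself can remain standard distributed TensorFlow. Here is a minimal, hypothetical sketch of such a script (not LinkedIn’s code; the model and data are placeholders).

```python
# Illustrative sketch of the kind of distributed TensorFlow script a
# launcher like TonY can submit to YARN; the launcher populates TF_CONFIG
# for each worker container. Model and data below are hypothetical.
import tensorflow as tf

# Reads the cluster spec from the TF_CONFIG environment variable.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Synthetic data stands in for features that would be read from HDFS.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(64)

model.fit(dataset, epochs=3)
```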

As machine learning evolves, we should see more and more of these reference architectures become an integral part of the software stack in enterprises around the world.
