Walmart e-commerce relies heavily on Machine Learning to support every conceivable aspect of business, from automation of catalog curation and item attribution to enhancing end-user experience through personalization and item discovery via search. Walmart Data Scientists build and deploy hundreds of ML models every week across business groups, trained over catalog, seller and user data. These models are built using frameworks such as Scikit, Pytorch and TensorFlow, and include custom dependencies.
In production settings, the ML models have to serve predictions in real-time and batch-modes, with varying SLAs depending on the business context. Therefore, it was necessary to design and implement an ML Inferencing platform, which could let Data Scientists easily custom deploy their trained ML models and create service endpoints. Furthermore, customization of deployment must encompass the specification of compute targets and other settings to achieve desired SLAs for their use-cases.
From an engineering perspective, the inferencing platform was required to address the following functional/non-functional criteria to support the diversity in use cases at Walmart Labs:
- Simplicity: To easily onboard new model and deploy it in production while having the support for diverse ML model configurations including those built using Scikit, Pytorch and Tensorflow and any other future framework.
- Scalability: To scale instances of single model and to support increasing number of new model deployments.
- Availability: To be deployable in HA configuration to support real-time traffic in fault tolerant manner.
- Predictive Latency: To provide model prediction in predictable response time.
- Resource Optimization: To run optimized model execution for real-time/batch/streaming use cases.
We explored the existing opensource/cloud platforms to understand if they can address our needs.
We started with Tensorflow-serving, which can load model binaries trained via Tensorflow and serve at scale. However, this would restrict Data Scientists into using TF to train and deploy models.
While Google Cloud, Azure and AWS Sagemaker provide inferencing solutions, they don’t provide sufficient customization capability to deal with varying SLAs and deployment scenarios.
Hence, we came up with the in-house inferencing platform, Ultron, that is now used by many groups within Walmart to host their models. The Ultron-Inferencing platform incorporates best practices in containerization and container orchestration over a cluster, with back-end optimizations to reduce latency and increase throughput.
Machine Learning Models can be thought of as stateless compute intensive idempotent workloads that takes an input and output a response based on a set of learned parameters.
This abstract representation allows us to build a resilient infrastructure by having the retries in place as it won’t affect the result by dispatching the request to different backend in case the previous backend is not available.
There is one interesting tradeoff that all the existing ML serving runtime deals with as they either support model implementation for real-time predictions or for batch predictions, which requires data scientists to write two different implementations for each use case. So, we aspired to have support for both real-time and batch prediction use case via a single implementation, that would substantially simplify the efforts to onboard new model and can unify the infrastructure for both the use cases.
To solve the unification challenge, our idea was to have a single method API to be implemented by the model adapter, with the following outline and characteristics:
def predict_all(self, inputs: List[str]) -> List[str]
The model to be onboarded is required to implement a single method adapter for prediction, taking a list of input strings and return back the list of output string, providing full control over the input/output format by the model adapter implementation. Each item in the inputs is a request for that model serialized to string and similarly each item in the outputs is a prediction response serialized to string. The order of output list predictions should be consistent with the input list. As model input/output just needs to be serializable to a string, allowing infrastructure to be completely agnostic to the individual model input/output format.
Since we are providing a bunch of input strings to the model, it gives a chance to the model adapter implementation to optimize the execution by computing the predictions for all the inputs in a single call through a single Linear Algebra computation that will ensure an optimized execution compared to the approach of looping over all the inputs for predictions.
This abstraction also provides the right kind of flexibility in terms of delegating the choice of latency/throughput optimization to the infrastructure level by having tunable real-time micro-batching in-place, which collates a bunch of synchronous real-time requests that arrive concurrently and dispatch it to a single backend. Providing following options to configure the behavior:
micro-batch size, limits the maximum number of collated request to be dispatched to the backend.
max wait time, maximum wait time for new requests before dispatching the currently collated requests to the backend.
Tunable real-time micro-batching can significantly increase the effective processing rate, especially if the compute target is a GPU. This in turn can reduce average latency and allow workloads to be processed faster. We kept the micro-batch size to 4 and wait time to 20ms by default, providing quasi-real-time behavior. However, it can be tuned to bigger/smaller batch size/wait time to support different throughput/latency characteristics.
If the model adapter has already implemented the optimized prediction by having single computation for a bunch of inputs then we automatically get the benefit of optimized execution. So, in practice real-time micro-batching is able to decrease mean requests latency by reducing the queuing delays with lesser number of model instances. This reduction becomes more pronounced if payload arrivals are bursty and stochastic.
One interesting angle of this abstraction is that the model implementation now can write a fairly complicated pre-processing and post-processing steps within it’s implementation. For example, input normalization, output formatting, download image URL at adapter, is much quicker as image caching infrastructure is in place within the datacenter.
The infrastructure can provide both HTTP and GRPC based consistent interfaces to uniformly access any model deployed on the infrastructure.
So we came up with following Platform Architecture built on Kubernetes to make it fully portable to run on any on-premise/cloud providers.
- Ultron API: Core interface to support various model implementation with associated framework to onboard any kind of model adapter implementation with corresponding dependencies.
- Model Service: Load and Serve the model adhering to Ultron API, wrapped in a dockerized packaging to provide immutable runtime with dependencies, installed during the docker image build phase. It serves the request over a single thread, so that multiple requests won’t bump on each other for compute resources providing a deterministic computation time for a model request. Also we make sure that we have the right environment variable/configuration set for the libraries being used to limit the number of threads as by default these computation libraries use all the available cores in the system.
- Model Proxy: Proxy the incoming request to the backend model instances, supporting Realtime Micro-batching to offload throughput/latency optimization at infrastructure level by tuning micro-batch size/wait time, which helps in supporting both real-time and batch predictions with the desired SLAs.
- Model Gateway: Proxy the incoming GRPC request to the backend model-proxy instance with in-cluster retries (providing in cluster fault tolerance/resiliency). Using the deterministic algorithm to select the downstream model-proxy instance to have effective realtime microbatching in place as we can collate more requests. It provides the GRPC interface to interact with the model via upstream registration on K8s Ingress for GRPC routing.
- Model HTTP Gateway: Proxy the incoming HTTP request to the backend model gateway. It provides the HTTP interface to interact with the model via upstream registration on K8s Ingress for HTTP routing. Optional to deploy, required only in case we need HTTP access to the model.
- K8s Ingress: Providing Path based routing for HTTP and Host based routing for GRPC requests and ensuring Single Port access to the entire cluster with all the deployed models. It’s the routing layer to route both HTTP/GRPC traffic to appropriate model destination.
- Ultron Client: Java Library for clients to access all the models on the platform on a single persistent TCP connection in realtime, performing transparent multiplexing/demultiplexing on the fly while keeping the simple model prediction interface for both Sync/Async usage.
Single method API to be implemented by model adapter which is free to choose it’s corresponding dependencies as everything gets baked into an immutable docker image during the build phase. Also, there is a simplified way to access the model via HTTP endpoint or via Java Client that the platform provides. We were able to onboard 60 different kind of models within a single quarter, thus showcasing the intrinsic simplicity of the design.
Infrastructure supports both the ways to scale.
- Scale number of pods for a single model, which gets updated in real-time in the upstream model-proxy layer as load balancing pool. Able to achieve 1000+ request per second for a deep learning model with just 16 model instance with 2 CPU and 2 GB memory each at P95 latency of 50ms. It also runs a few models at 90+ model instances to support the workload.
- Scale to deploy new model, which creates new infrastructure/model pods within the Kubernetes, with just a single routing entry at the K8s Ingress for that model instance.
Currently we have a Kubernetes cluster hosting the architecture with the following specifications that supports the scale we desired and has the intrinsic design in place to support deployment of the high number of new models per quarter:
K8s Pods: ~600+
None of the components has a single point of failure if deployed in HA configuration by having at least 2 instances of each. Infrastructure will automatically start utilizing the extra deployed components as every component keeps a watch on the available downstream instances. Infrastructure ensures the requests get successful prediction despite component failures as upstream layer does the necessary retries to dispatch the request to healthy instance.
Since we are maintaining the persistent connection throughout the infrastructure, we are able to achieve predictable latency as a function of model response time + constant overhead. By dispatching the request to the already established connections to the corresponding downstream, we ensure the only overhead is data transfer over the TCP that’s in practice is pretty fast, considering 4ms latency per hop within the data center.
With real-time micro-batching in place we can tune the latency/throughput characteristics by choosing the corresponding batch size/wait time parameter at the infrastructure level. This can be tuned for real-time/ batch / quasi-realtime mode to support different use cases.
The Ultron-Inferencing platform has been used by the Catalog Data Science team and other groups to deploy over 60 Machine Learning models in FY20Q4 for both real-time and batch processing needs. Deployed models range from ML models built using Scikit Learn to Deep Learning models such as BERT based models that are memory and compute intensive. These models process over 200 million items every day to enrich the Walmart catalog and drive item discovery and placement. Adoption of this platform has been increasing across Walmart since it allows users to customize their deployment configurations to address their scalability and availability needs.
We hope this blog provides a decent overview of how we tackle the challenge of taking AI to production at Walmart Labs and some of the design/implementation considerations that were helpful in overcoming those challenges. We are fortunate to have a great team of engineers and data scientists to collaborate with on this interesting project as part of Catalog Data Science Team.
Special thanks to @Karthik Deivasigamani for encouraging me to write this blog.