Taking Big ML Systems to Production — Part 1

Ritesh Agrawal
Walmart Global Tech Blog
6 min read · Jun 8, 2020

All the data preparation and Feature Engineering (henceforth, FE) has been done and thoroughly tested; the algorithms have been designed, coded and optimized. Results on historical data look promising, but the solution is still far from production ready. A Data Science (henceforth, DS) solution developed in a research/lab environment is rarely in a shape or form that can be taken directly into production. And even after the solution is morphed into that shape, a lot of moving pieces need to be considered to take the entire DS solution to production.

One must bear in mind that DS solutions are generally part of a bigger engineering effort. There are upstream systems which provide data to these DS solutions, and there are downstream systems which consume the output of the DS solutions. All the orchestration and the handshake (information transfer) between these systems happens based on certain contracts, e.g. the DS system will receive data in a particular format and will generate output CSVs with a given set of columns.
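As a minimal sketch of such a contract check (the column names and path here are hypothetical, not the actual Smart Forecasting contract), the DS side can validate its own output before handing it downstream:

```python
import pandas as pd

# Hypothetical contract: the downstream system expects exactly these columns.
EXPECTED_COLUMNS = ["store_id", "item_id", "week", "forecast_qty"]

def validate_output_contract(csv_path: str) -> pd.DataFrame:
    """Fail fast if the generated CSV does not match the agreed contract."""
    df = pd.read_csv(csv_path)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    extra = set(df.columns) - set(EXPECTED_COLUMNS)
    if missing or extra:
        raise ValueError(
            f"Output contract violated: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    # Enforce column order as part of the handshake.
    return df[EXPECTED_COLUMNS]
```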


Some considerations are ingrained in a Data Scientist’s thought process, e.g. availability of features at real-time run, training time, model complexity, and data volatility, and there are statistical ways to test and handle these. However, not every DS problem (or competitive DS problem) forces one to think about processing billions of rows of data within a fixed time window, with failsafes for errors from upstream processes. For example, there could be missing values coming from upstream; the DS solution should either handle them algorithmically or fail gracefully, and notify the users in both cases.
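A sketch of that failsafe, assuming the data arrives as a pandas DataFrame (the threshold and imputation strategy below are illustrative assumptions, not the production logic):

```python
import logging
import pandas as pd

logger = logging.getLogger("ds_pipeline")

def check_upstream_data(df: pd.DataFrame, critical_cols, max_missing_ratio=0.05):
    """Either handle missing values algorithmically or fail gracefully and notify."""
    for col in critical_cols:
        missing_ratio = df[col].isna().mean()
        if missing_ratio == 0:
            continue
        if missing_ratio <= max_missing_ratio:
            # Handle algorithmically: a simple forward-fill as a placeholder strategy.
            df[col] = df[col].ffill()
            logger.warning("Imputed %.2f%% missing values in %s", 100 * missing_ratio, col)
        else:
            # Fail gracefully: stop the run and surface the problem to the users.
            logger.error("Too many missing values in %s (%.2f%%); aborting run",
                         col, 100 * missing_ratio)
            raise RuntimeError(f"Upstream data check failed for column '{col}'")
    return df
```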

Smart Forecasting at Walmart Labs is a good example of such Data Science work, involving multiple complex models, multiple points of failure and tight integration with Data Engineering. Smart Forecasting produces forecasts at store-item level for ~500MM such combinations, at weekly granularity for the coming 52 weeks, and these forecasts are refreshed weekly. The entire set of items is divided into multiple groups based on topological clustering; these groups will be referred to as categories henceforth.

There are six major components which need to be fine-tuned to keep a DS solution in good health in production:

  1. Solution scalability
  2. Infrastructure scalability and compatibility
  3. Solution durability
  4. Production feedback
  5. Solution adaptability
  6. Cost-effectiveness of the solution

The first two, solution scalability and infrastructure scalability and compatibility, are discussed in this post; durability, production feedback, adaptability, and cost-effectiveness are discussed in the subsequent post, Part 2.

Solution Scalability

Vertical scaling of systems is generally difficult. That is why Data Scientists do take care of some of the solution’s constraints, e.g. memory requirements, GPU requirements (if any), storage requirements, and the time taken to train and score the algorithms, and try to match them to production requirements. More often than not, though, the scalability of a solution is tested by future requirements: an existing solution fails to scale, for example an algorithm that fails outright or breaches its service level agreement (SLA) when given larger data in the future.

To deal with such challenges, the developer/Data Scientist must figure out the bottlenecks or rate-limiting steps in the process and determine the maximum threshold for each such step. If the input data is larger than this threshold, there must be a step that divides the data into smaller chunks, e.g. some kind of clustering or segmentation, as sketched below.
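A rough sketch of that guard, with a hypothetical row-count threshold and a simple hash-based segmentation standing in for real clustering:

```python
import pandas as pd

MAX_ROWS_PER_CHUNK = 5_000_000  # hypothetical threshold for the rate-limiting step

def segment_if_needed(df: pd.DataFrame, key_col: str):
    """Yield the data whole if it fits, otherwise in key-based segments."""
    if len(df) <= MAX_ROWS_PER_CHUNK:
        yield df
        return
    n_segments = -(-len(df) // MAX_ROWS_PER_CHUNK)  # ceiling division
    # Hash the segmentation key so that related rows stay in the same chunk.
    segment_id = df[key_col].astype(str).apply(hash) % n_segments
    for _, chunk in df.groupby(segment_id):
        yield chunk
```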

In Smart Forecasting, there were such issues with Singular Value Decomposition: an in-memory process could not handle a matrix larger than a certain size, so the data (historical sales files at category level) had to be broken into smaller chunks using the item hierarchy.
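As an illustration of the same idea (not the actual Smart Forecasting code), SVD can be run per item-hierarchy group instead of on the full sales matrix, so each decomposition stays within memory:

```python
import numpy as np

def svd_by_group(sales_matrix: np.ndarray, group_labels: np.ndarray, k: int = 10):
    """Decompose each item-hierarchy group separately to keep matrices small enough for memory."""
    results = {}
    for group in np.unique(group_labels):
        block = sales_matrix[group_labels == group]        # rows belonging to one group
        u, s, vt = np.linalg.svd(block, full_matrices=False)
        results[group] = (u[:, :k], s[:k], vt[:k])         # keep only the top-k components
    return results
```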

Generally, memory turns out to be the rate-limiting resource. But there are processes where storage can be rate limiting as well, specifically when writes go over a network, e.g. to a network file system (NFS) or a blob store on the cloud. In Smart Forecasting, for one of the algorithms, the model written using Python’s default serialization was tens of GBs in size; an in-house writer was developed that stores the entire model in hundreds of MBs, as a lot of meta information is stripped out.
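A minimal sketch of that idea, keeping only the attributes that scoring actually needs before serializing (the attribute names below are hypothetical, not the in-house Smart Forecasting writer):

```python
import pickle

# Hypothetical: only these attributes are needed to score the model.
ESSENTIAL_ATTRS = ["coef_", "intercept_", "feature_names_"]

def write_slim_model(model, path: str):
    """Persist only what scoring needs, dropping bulky training metadata."""
    slim = {attr: getattr(model, attr) for attr in ESSENTIAL_ATTRS if hasattr(model, attr)}
    with open(path, "wb") as f:
        pickle.dump(slim, f, protocol=pickle.HIGHEST_PROTOCOL)

def read_slim_model(path: str) -> dict:
    """Load the slimmed-down representation for scoring."""
    with open(path, "rb") as f:
        return pickle.load(f)
```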

When there is a time-bound service level agreement (SLA), not just memory and storage but processing also becomes a limiting component. A lot of the time, the implemented solution may not be the globally optimal one, but the optimal one within runtime constraints. This may even mean not going with off-the-shelf packages and instead writing in-house solutions tailored to specific needs. For Smart Forecasting, a new version of Random Forest was written and tested in which ideas from the C4.5 decision tree were used in the regression trees.
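One way to make that trade-off explicit is to time candidate configurations and keep the most accurate one that fits the budget. A sketch with scikit-learn, where the budget and candidate grid are purely illustrative assumptions:

```python
import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

TRAIN_BUDGET_SECONDS = 60  # hypothetical slice of the overall SLA

def pick_model_within_sla(X_train, y_train, X_val, y_val, candidates):
    """Return the most accurate candidate whose training time fits the budget."""
    best = None
    for params in candidates:
        model = RandomForestRegressor(**params)
        start = time.perf_counter()
        model.fit(X_train, y_train)
        elapsed = time.perf_counter() - start
        if elapsed > TRAIN_BUDGET_SECONDS:
            continue  # too slow to be viable in production, however accurate
        score = mean_absolute_error(y_val, model.predict(X_val))
        if best is None or score < best[0]:
            best = (score, params, model)
    return best

# Example candidate grid, from cheapest to most expensive.
candidates = [
    {"n_estimators": 50, "max_depth": 8, "n_jobs": -1},
    {"n_estimators": 200, "max_depth": 12, "n_jobs": -1},
    {"n_estimators": 500, "max_depth": None, "n_jobs": -1},
]
```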

Infrastructure Scalability and Compatibility


The production infrastructure must be designed so that it can be scaled horizontally in the future; the need for vertical scalability is handled by the code/solution itself through clustering/segmentation. A DS solution can be expanded to a larger base only by expanding the infrastructure. This part requires a lot of collaboration with the engineering team and architects, as they provide the tools to manage the cluster and orchestrate the entire process.

Infrastructure and solution must also remain compatible as they scale. For example, a new GPU card that supports only a higher CUDA version may render solutions compiled against an older CUDA version useless. This is especially important when open source tools with large contributor communities are being used, as the requirements of one package may conflict with those of another. In Smart Forecasting, there have been issues due to upgrades of Python libraries that the code did not support.
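A lightweight defence is to pin library versions and assert them at startup. A sketch using Python's importlib.metadata, where the package names and pins are hypothetical:

```python
from importlib.metadata import version, PackageNotFoundError

# Hypothetical pins that the solution was tested against.
PINNED = {"numpy": "1.24.4", "pandas": "2.0.3", "scikit-learn": "1.3.2"}

def assert_environment(pins: dict = PINNED):
    """Fail at startup if the runtime drifts from the versions the code was tested with."""
    problems = []
    for pkg, expected in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg} not installed (expected {expected})")
            continue
        if installed != expected:
            problems.append(f"{pkg}=={installed}, expected {expected}")
    if problems:
        raise RuntimeError("Incompatible environment: " + "; ".join(problems))
```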

This problem is further amplified when there are two or more dependent processes, e.g. Training and Scoring in Smart Forecasting. These two processes run independently, but Scoring uses the models generated by Training. Training uses GPUs, but the models need to be created in a way that allows them to run on CPUs only (as GPUs always carry a high cost), and this requires the packages in Training and Scoring to stay in sync.
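The pattern, sketched here with PyTorch as a stand-in (the article does not specify Smart Forecasting's actual stack): train on whatever device is available, then move the model to CPU before saving so that the scoring side never needs a GPU:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical model; the training loop on `device` is omitted.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)

# Move weights to CPU before persisting so scoring can run on CPU-only machines.
torch.save(model.to("cpu").state_dict(), "model_cpu.pt")

# Scoring side: load onto CPU explicitly, independent of how the model was trained.
scoring_model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
scoring_model.load_state_dict(torch.load("model_cpu.pt", map_location="cpu"))
scoring_model.eval()
```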

Infrastructure compatibility also encompasses choosing the best tool for the job. Data Scientists are sometimes not the best judges of the tools to be used for data preparation and Data Engineering, and a lot of the time their code needs to be translated to the tool best suited for the purpose. In Smart Forecasting, a lot of feature generation was done on Apache Spark; it required a lot of shuffles and was slow because of that, so it was later moved to in-memory compute in C++ for faster processing.
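For context, the kind of shuffle-heavy pattern in question looks roughly like this in PySpark (the column names and path are hypothetical, and this is only an illustration of the bottleneck, not the production pipeline):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature_gen_sketch").getOrCreate()
sales = spark.read.parquet("sales_history.parquet")  # hypothetical input

# Window functions partitioned by store/item force a shuffle of the full sales
# history before any feature can be computed; chains of these dominate runtime.
ma_window = Window.partitionBy("store_id", "item_id").orderBy("week").rowsBetween(-3, -1)
lag_window = Window.partitionBy("store_id", "item_id").orderBy("week")

features = (
    sales
    .withColumn("sales_ma_3wk", F.avg("qty").over(ma_window))
    .withColumn("sales_lag_1wk", F.lag("qty", 1).over(lag_window))
)
```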

Summary

A production-ready data science solution should work with reasonable resources and must be horizontally scalable. This may require restructuring the entire problem. The solution must also be modular, so that it can easily scale along with the infrastructure.

For a discussion of durability, production feedback, adaptability, and cost-effectiveness, please follow the second part of this article.
