Taking Big ML Systems to Production — Part 2

Ritesh Agrawal
Walmart Global Tech Blog
6 min read · Jun 8, 2020

In Part 1 of this discussion, solution scalability and compatibility were covered. This part covers solution durability, production feedback, solution adaptability, and cost-effectiveness.

Solution Durability


Any algorithm will have a certain bias towards the data on which it has been trained. Data Scientists try to remove this bias with rigorous testing on various holdout and blind samples. However, all this testing is on historical data. Even a holdout set carries a bias towards the past, i.e. if the future behaves completely differently, the model will perform poorly.

Model performance can be addressed by making the model more generalizable and a bit loose (avoiding overfitting). The system, however, must be fault tolerant to any kind of data and must be able to notify users of such issues, e.g. the system must handle new fields, too many missing values in a field, or too few values. There may also be checks in place to compare against past data or production runs.

In Smart Forecasting, algorithms exit gracefully when provided with too little data (<5 observations) and record this in the logs. There is also a default algorithm which runs for all cases and generates a forecast, which in the worst case is sent to the downstream system.
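This graceful-exit-with-fallback pattern can be sketched as follows. This is a minimal illustration only: the function names, the naive fallback logic, and the minimum-observation threshold are assumptions, not the actual Smart Forecasting code.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("forecast")

MIN_OBSERVATIONS = 5  # below this, the main algorithm exits gracefully


def default_forecast(history, horizon):
    # Simplest possible fallback: repeat the mean of whatever history exists.
    mean = sum(history) / len(history) if history else 0.0
    return [mean] * horizon


def main_algorithm(history, horizon):
    # Placeholder for the real model; here, a naive last-value forecast.
    return [history[-1]] * horizon


def forecast_item(item_id, history, horizon=52):
    if len(history) < MIN_OBSERVATIONS:
        # Exit gracefully and record the reason in the log, rather than crash.
        logger.info("item %s: only %d observations, using default algorithm",
                    item_id, len(history))
        return default_forecast(history, horizon)
    return main_algorithm(history, horizon)
```

The point of the sketch is that the main model never sees an input it cannot handle, and every fallback decision leaves a trace in the logs.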

Though well-coded algorithms will work on any type of data, it is important to create channels to let Developers/Data Scientists know about changes in the state of the system — e.g. changes in data (metrics such as moments of numeric fields), freshness of data, data quality (fill rate and distributions), and model performance (on both development and validation sets). These metrics are different from the ones used by the business. Some of these can also be presented in the form of drill-down dashboards to keep track of important aspects of the entire system.
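As a sketch, the kind of monitoring metrics mentioned above (fill rate plus the first moments of a numeric field) could be computed like this; the metric set and the drift threshold are illustrative assumptions, not the actual monitoring stack:

```python
def field_metrics(values):
    """Simple monitoring metrics for one numeric field: fill rate plus the
    first two moments (mean, variance) of the non-missing values."""
    non_missing = [v for v in values if v is not None]
    fill_rate = len(non_missing) / len(values) if values else 0.0
    if not non_missing:
        return {"fill_rate": fill_rate, "mean": None, "variance": None}
    mean = sum(non_missing) / len(non_missing)
    variance = sum((v - mean) ** 2 for v in non_missing) / len(non_missing)
    return {"fill_rate": fill_rate, "mean": mean, "variance": variance}


def drift_alert(current, baseline, rel_tol=0.2):
    """Flag a field whose mean has shifted by more than rel_tol relative
    to a baseline snapshot (e.g. the previous production run)."""
    if baseline["mean"] in (None, 0) or current["mean"] is None:
        return False
    return abs(current["mean"] - baseline["mean"]) / abs(baseline["mean"]) > rel_tol
```

Comparing each run's metrics against the previous run's snapshot is one simple way to implement the "checks against past production runs" mentioned earlier.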

Solution durability also stems from how corner cases and worst-case scenarios are treated, e.g. Engineering and infrastructure teams create Disaster Recovery (DR) strategies and Business Continuity Plans (BCP) respectively. For Smart Forecasting, multiple stages of fallback have been created:

  • Default Algorithm — A simple state space model, which runs fast and has a very small failure rate, serves as the default fallback. If any other algorithm fails, the forecast defaults to this model
  • New Item Model — A simple new-item model, originally built for forecasting new items, is also used to forecast missing items (those missed by the Data Science approaches due to lack of data)
  • Forecast Rotation — Since a forecast for the next 52 weeks is generated every week, if all other strategies fail, last week’s forecast is rotated by one week
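The last fallback, forecast rotation, can be sketched as a one-week shift of the previous forecast; padding the final week by repeating the last value is a simplifying assumption, not necessarily what production does:

```python
def rotate_forecast(last_week_forecast):
    """Fallback of last resort: shift last week's weekly forecast by one week.
    Week 2 of the old forecast becomes week 1 of the new one; the final week
    is padded by repeating the last available value."""
    if not last_week_forecast:
        return []
    rotated = last_week_forecast[1:]        # drop the week that has elapsed
    rotated.append(last_week_forecast[-1])  # pad the horizon back to full length
    return rotated
```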

Production Feedback

There is a limit to what can be tested using back-testing on historical data, as all of it carries a bias towards the past, and a completely new future trend may be missed. There must be a way to quickly get feedback from the live production system to the developers/Data Scientists, and this feedback can only come from the end users.

In production systems, user feedback in the form of adjustments to results does not just change the downstream feeds; it also works as feedback for the Data Scientists. During the onset of COVID-19, since there was no signal in the data, Smart Forecasting could not systematically gauge the increased demand for grocery items, but Demand Managers overwrote the forecasts, resulting in increased sales — which in turn was feedback for the system. For Smart Forecasting, every change made by Demand Managers results in a change in sales and thus provides feedback.

Other than this indirect feedback, there should be a mechanism to get direct feedback on system performance from users, who may be able to spot a deteriorating pattern much more quickly. Generally, any fraud-related system relies heavily on user feedback, as these systems chase ever-changing fraudsters. In Smart Forecasting, Data Scientists get feedback from end users in the form of Jira tickets and identify trends/themes from these tickets, which helps improve the solution. In one such case, an issue with longer-horizon (horizon 16–17) forecasts was identified from a few tickets raised by Demand Managers.

Solution Adaptability


The Data Science team keeps working on improving the solution through experimentation on the backend. The production system must be flexible enough to implement all these changes. Some may be small code changes, while others might be a complete change of algorithms; the system must be designed to accommodate both.

The system should be able to incorporate changes not only in data and structure, but also in algorithms and packages. In Smart Forecasting, a Model Propagation pipeline has been built by the Machine Learning Engineers. It gives Data Scientists the flexibility to develop algorithms in their choice of programming language and tools, and it handles package changes easily. It also provides the ability to scale the solution horizontally.

Also, for incorporating multiple models, there is a script which can take any number of models along with their preference order and generate the final forecast. It can also handle various ensembling techniques, such as linear stacking.
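A preference-ordered model combiner of this kind might look like the following sketch. The function name, the data layout (item → list of weekly values), and the weighted linear stacking are illustrative assumptions, not the actual script:

```python
def combine_forecasts(model_outputs, preference, weights=None):
    """Combine per-model forecasts, each a dict of item -> list of values.

    If weights is None, pick each item's forecast from the first model in
    `preference` that produced one. Otherwise, linearly stack the available
    forecasts using per-model weights, normalised per item."""
    items = set().union(*(out.keys() for out in model_outputs.values()))
    final = {}
    for item in items:
        available = [m for m in preference if item in model_outputs[m]]
        if not available:
            continue  # no model produced a forecast for this item
        if weights is None:
            final[item] = model_outputs[available[0]][item]
        else:
            total = sum(weights[m] for m in available)
            horizon = len(model_outputs[available[0]][item])
            final[item] = [
                sum(weights[m] * model_outputs[m][item][h] for m in available) / total
                for h in range(horizon)
            ]
    return final
```

Normalising the weights per item means an item covered by only one model still gets a sensible forecast instead of a scaled-down one.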

The solution must also be flexible enough to deploy across a larger geographical base, and where there are locally important features, the solution must easily be able to include those local variables. Smart Forecasting models are written in such a modular fashion that a core set of features is generated for all geographies, and then any number of country-specific features can be added through new scripts and various configuration files.
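The core-plus-country-specific feature pattern can be sketched with a registry driven by configuration. All feature names and the config shape here are illustrative, not the actual Smart Forecasting feature set:

```python
def core_features(sales_history):
    """Features generated for every geography."""
    return {
        "mean_sales": sum(sales_history) / len(sales_history),
        "last_sales": sales_history[-1],
    }


# Registry of optional, locally relevant feature functions; a country's
# configuration file simply lists which entries to activate.
EXTRA_FEATURES = {
    "local_holiday_flag": lambda row: 1 if row.get("is_local_holiday") else 0,
}


def build_features(row, config):
    """Core features plus whatever extras the country config asks for."""
    features = core_features(row["sales_history"])
    for name in config.get("extra_features", []):
        features[name] = EXTRA_FEATURES[name](row)
    return features
```

Adding a new country then means adding a config file (and, if needed, new registry entries) rather than touching the core feature code.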

Cost Effectiveness of Solution

This aspect of a solution is often undermined, and is usually taken up by product managers only after pushback from the business. If a solution is not saving more than some `X` times its cost (this `X` should often be greater than 5, and must always be greater than 1), then the solution is not economically viable in a business scenario. Research projects may overlook this cost-effectiveness aspect, but that is a separate discussion, and in those cases production readiness is also not in play. A solution sometimes needs to be modified to reduce costs, even at some loss of optimality.
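The viability rule above reduces to a one-line check; the threshold value is the illustrative `X > 5` from the text, not a universal constant:

```python
def is_viable(annual_savings, annual_cost, x_threshold=5.0):
    """Back-of-envelope viability check: savings must exceed X times cost."""
    return annual_savings > x_threshold * annual_cost
```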

In Smart Forecasting, a module was written to perform Singular Value Decomposition (SVD) on multiple large matrices in parallel on GPUs, and it saved some time in the production environment. However, it was determined that this particular SVD was not a rate-limiting step. Moving this portion to GPUs would have required additional GPUs and increased the cost of the solution, so the SVDs were performed on CPUs only.
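Since the step was not rate-limiting, a plain CPU implementation suffices; a minimal sketch using NumPy (looping over the batch rather than true parallelism is a simplifying assumption):

```python
import numpy as np


def batch_svd(matrices):
    """Decompose a list of matrices on CPU; returns (U, s, Vt) per matrix.

    For a step that is not on the critical path, plain numpy on CPU is
    sufficient and avoids paying for extra GPUs."""
    return [np.linalg.svd(m, full_matrices=False) for m in matrices]
```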

Closing Discussion

To make a solution truly robust, there are many other things to consider, such as integration with upstream and downstream systems and orchestration of the whole pipeline, which require strong engineering and architecture.

Whatever system is created will require constant improvement, which in turn requires significant Data Science experimentation. Bear in mind that experimentation requires much more computing power than the production run, so a very good development environment and infrastructure is also one of the needs of a successful production implementation.

Regression runs of Data Science systems only verify that all the integrations are in place — which is of utmost importance to keep the system live. The real test of performance is either real-time feedback or back-testing on historical data, and these must be monitored continuously for issues.

Like any other system, these DS solutions also require production support, not just in engineering terms, but also for issues with the algorithms themselves.

Continuous development by multiple Data Scientists will lead to significant tech debt in Data Science streams, and this should be well tracked using productivity-management tools. E.g. in Smart Forecasting, pricing calculations were implemented using complicated and sub-optimal logic due to a time-bound implementation; this was tracked as tech debt and later implemented in a simplified manner.

Lastly, these solutions are generally built from many small components which can be reusable across projects and teams, and these components must be built to promote reusability, e.g. the Model Propagation pipeline is a generic module and can be used in any Machine Learning pipeline.
