When we started our journey to build a demand forecasting product (a.k.a. Smart Forecasting), we had the unique opportunity to build a system that influences how our Business manages demand across 500 million Store-Item forecasts (for US stores alone). Add to this the 11,000+ stores Walmart operates worldwide, across multiple markets & channels.
Generating forecasts is just the start of the journey. Getting to forecasts that need no manual intervention is the Nirvana state of our platform (we hope to reach that state soon 🙂).
Until then, we need an application that allows users to adjust the forecasts.
Not that the forecasts are bad, but the models may not have considered data points that are not in the system yet, such as:
- Local weather fluctuations
- Customer demographics
- Events near a store
- Any promotions planned by the business
This is as demanding (pun intended 🙂) an Engineering problem, as it is a Machine Learning one.
The success of the platform depends on Business, Product, Engineering & Data Science talent coming together and working in tandem.
Before delving into the technical aspects of the solution (in subsequent posts), I want to highlight how we organize ourselves into different teams & focus on excelling in each of them individually.
In any Data Science product, it is not just the Data Science algorithms that make a product successful; the functions below are equally important:
- Data Engineering
- ML Engineering
- Application Development
- User Experience
- Product Management
There is no denying that Data Science Key Performance Indicators (KPIs), like accuracy in our case, are the most important needle movers, but we cannot underestimate the importance of the other areas.
Ex: An awesome algorithm that runs only on a laptop and does not scale, or an algorithm that scales but does not meet the needed SLAs, is good only in theory and can never make it into production.
Data Science
- The Brain behind the demand we forecast
- The freedom to develop an algorithm using any library or language is very important.
- It is important to note that no one algorithm fits all needs.
Ex: No single time series algorithm predicts every time horizon accurately.
- How well the underlying infrastructure enables Data Scientists to experiment more frequently (and at scale) is key to agility.
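As a concrete illustration of the "no one algorithm fits all" point, one common pattern is to route each forecast horizon to a different model family. This is a minimal, hypothetical sketch; the model names and horizon cut-offs are illustrative, not Smart Forecasting's actual configuration:

```python
# Hypothetical sketch: route forecast horizons to different model families.
# Model names and horizon cut-offs are illustrative only.

def pick_model(horizon_weeks: int) -> str:
    """Select a model family based on how far ahead we are forecasting."""
    if horizon_weeks <= 2:
        return "short_term_ts"       # e.g. classical time series for the near term
    elif horizon_weeks <= 13:
        return "ml_regression"       # e.g. gradient-boosted trees with features
    else:
        return "long_term_seasonal"  # e.g. seasonal/trend decomposition
```

An ensemble that blends these families per horizon is another common variant of the same idea.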
Data Engineering
This is mandatory, especially when we have large-scale, diverse data to deal with.
We have close to 100 TB of data & tens of data sources to tap from day one (and growing every day).
- Interact with Walmart's ocean of data spread across various systems, and provision it for better collaboration
- Foundational guarantees on the robustness of pipelines
- Workflows are always time-bound with strict SLAs to downstream systems
- Guaranteed freshness & quality of data
This indeed is the Backbone of the system and is pivotal in ensuring stability & accuracy of forecasts.
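To make the freshness & quality guarantees concrete, here is a minimal sketch of the kind of gate a pipeline might run before publishing a partition downstream. The thresholds and function names are assumptions for illustration, not our actual checks:

```python
# Hypothetical sketch of a pipeline freshness/quality gate; thresholds are illustrative.
from datetime import datetime, timedelta, timezone

def check_partition(last_loaded: datetime, row_count: int,
                    max_staleness: timedelta = timedelta(hours=6),
                    min_rows: int = 1_000) -> list:
    """Return a list of violations; an empty list means the partition passes."""
    violations = []
    if datetime.now(timezone.utc) - last_loaded > max_staleness:
        violations.append("stale: data older than SLA window")
    if row_count < min_rows:
        violations.append("quality: row count below expected minimum")
    return violations
```

A scheduler can block downstream workflows whenever the returned list is non-empty, so strict SLAs to downstream systems are enforced automatically rather than by hand.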
Demand Workbench (REST APIs & UI)
This is the Face of Smart Forecasting. A forecast without visibility for its end users will not add much value.
This function becomes even more critical when we need to provide visibility across history & the future for 500 million Store-Items.
- Get visibility into various metrics influencing the demand like historical sales, waste, inventory, etc.
- Metrics space comprising a minimum of 75 billion historical & 25 billion future data points.
- Personalized to each user’s experience showing what really matters at a given time.
- Allow users to manage demand by adjusting the forecasts.
- These APIs also power some of the critical on-demand descriptive analytics applications in Supply Chain.
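To illustrate what "adjusting the forecasts" might look like behind such an API, here is a hypothetical sketch of the adjustment logic; the field names, audit record, and shape of the forecast store are assumptions for illustration, not the Workbench's actual contract:

```python
# Hypothetical sketch of the logic behind a forecast-adjustment endpoint.
# Field names and the audit record are illustrative only.
from dataclasses import dataclass

@dataclass
class Adjustment:
    store_id: str
    item_id: str
    week: str
    original_units: float
    adjusted_units: float
    reason: str        # e.g. "local event", "planned promotion"
    adjusted_by: str

def apply_adjustment(forecasts: dict, adj: Adjustment) -> dict:
    """Overlay a user adjustment on a system forecast, keeping the original for audit."""
    key = (adj.store_id, adj.item_id, adj.week)
    forecasts[key] = {
        "system_units": adj.original_units,
        "final_units": adj.adjusted_units,
        "audit": {"reason": adj.reason, "by": adj.adjusted_by},
    }
    return forecasts
```

Keeping the system forecast alongside the user's override is what lets us later measure whether manual adjustments actually improved accuracy.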
MLOps / ML Engineering
This is the Central Nervous System that controls the Brain (ML Algorithms) & Backbone (Data) of our product, taking care of many production aspects of our models.
- Sits at the confluence of the two pillars, i.e., Data Science and Data Engineering.
- Automate systems that propagate machine learning models from development to production environments
- Optimal scheduling of modeling workloads.
- High availability & Failure handling in every part of the pipeline run
- Build the ability to perform A/B testing of new models
- Visibility into how any new feature in ML/Data affects the overall KPIs of the project (Accuracy, Bias, etc.)
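A/B testing of models is often implemented with a deterministic traffic split, so that a given Store-Item always sees the same model variant across runs. A hypothetical sketch follows; the hash-based split is a common pattern, not necessarily our exact mechanism:

```python
# Hypothetical sketch of assigning store-items to champion/challenger models
# for A/B testing; hash-based splitting is a common pattern, shown for illustration.
import hashlib

def assign_variant(store_id: str, item_id: str, challenger_pct: int = 10) -> str:
    """Deterministically bucket a store-item so it always gets the same model variant."""
    digest = hashlib.sha256(f"{store_id}:{item_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "challenger" if bucket < challenger_pct else "champion"
```

Because the assignment is a pure function of the keys, KPIs like accuracy and bias can be compared per variant without storing any assignment table.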
DevOps / SRE
This team instills a blend of Service Engineering & Release Engineering best practices into our engineering teams.
- Ensure the Infrastructure is up, running & monitored 24x7
- High-priority alerts are routed to the appropriate teams
- Ensure we have proper quality checks in our build automation
- Every part of the product is continuously tested out at different levels so that every commit meets the needed quality
- Not just ensure that we can develop & deploy features fast, but also keep Mean Time To Recover (MTTR) low during failures
- Look out for constant cost optimizations, especially on the public cloud.
- Build infrastructure around agreed-upon measures for Recovery Time Objective (RTO) & Recovery Point Objective (RPO).
Product management is where it all comes together
- Always ensure that we keep our Business in the center of everything we do in the product
- Work with the Business to identify the need of the hour
- Ensure every feature added to the product provides incremental value
- Voice of the Business & Champions of the product
- Seek constant feedback, both from Business & other pillars
I hope you now have a better understanding of the teams that constitute the basic building blocks of a successful ML product.
Through this post & the series I intend to write later, I am trying to highlight the key technical concepts that went a long way in making the product successful.