
Demand Forecasting Tech Stack @ Walmart

Choosing the right foundational elements for the biggest forecasting platform at Walmart

Ramu Malur
Sep 5, 2019

Continuing the discussion that started in Pillars of Walmart’s Demand Forecasting: the technology choices you make in any product go a long way toward ensuring a solid foundation for the features built on it.

In this post, we will look at the most important aspects that should never be ignored when choosing technology components.

Overall Product Needs

  • Non-negotiable run-time SLAs to generate the forecast (less than 4 hours every day)

Guiding Principles in selecting the Tech Stack

  • Be clear on what you are looking for. Write down the non-functional needs (SLAs, latency, throughput, etc.)

Data Engineering

What we need

  • Scalable & Distributed storage, supporting a variety of datasets, not just as SQL tables

Batch Datastore

  • Hadoop Distributed File System (HDFS) compatible storage was the strongest contender & probably the easiest decision.

Compute

Map/Reduce is the oldest & among the most popular of the engines supporting BigData workloads. It has served its purpose quite well for many years, but it is time to look ahead.

Hive is probably the most commonly used compute engine for BigData workloads. With Hive, we always have to express jobs as SQL queries. While it is not impossible to build complicated SQL, it has its own challenges:

  1. Not all operations can be easily expressed in SQL, e.g. recursion

Spark needs no introduction & has revolutionized how Data Engineering is realized in products dealing with BigData. It also supports SQL & is backed by a strong, ever-growing community.

We also chose Java as our mainstream programming language. This is an interesting choice, given how popular Scala is for writing Spark workloads.

Some of the considerations (debatable 😃) that favored Java in our use case are:

  1. While functional programming is awesome, Scala is not a purely functional language. You can easily end up writing Java-style code in Scala.

Now, after 2+ years of development & production runs, looking at the codebase & adding more features, we don’t regret our decision.
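To give a flavor of what such a job looks like, here is a minimal sketch of a rollup expressed both as SQL & as DataFrame operations. It is shown in PySpark for brevity (our production pipelines use the equivalent Java API), & the dataset path & column names are illustrative.

```python
# A minimal, illustrative rollup job; path & columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-rollup").getOrCreate()
sales = spark.read.parquet("hdfs:///data/sales")  # hypothetical input

# The same aggregation, expressed two ways:
# 1. As SQL, Hive-style
sales.createOrReplaceTempView("sales")
rollup_sql = spark.sql(
    "SELECT store_id, item_id, SUM(qty) AS total_qty "
    "FROM sales GROUP BY store_id, item_id"
)

# 2. As DataFrame operations, which compose more naturally in code
rollup_df = (
    sales.groupBy("store_id", "item_id")
    .agg(F.sum("qty").alias("total_qty"))
)

rollup_df.write.mode("overwrite").parquet("hdfs:///data/sales_rollup")
```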

Scheduling & Orchestration

Most of the early Data Engineering workloads followed a design of configuring pipelines rather than authoring them as code. Some went a little further, offering a user interface that supported development via drag & drop. While this approach served the Data Engineering community for some time, & some enterprise tools are still in heavy use today, the community is shifting towards more programmer-friendly tools.

Oozie is probably the most widely used tool for both scheduling & orchestration. But the problem with Oozie is its inability to create DAGs & dependencies dynamically & programmatically.

Apache Airflow to the rescue. It was a tough decision to make, & we tried out the tool for close to a month before finalizing it.
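Part of the appeal is that an Airflow DAG is plain Python, so tasks & dependencies can be generated programmatically, which is exactly what Oozie struggles with. A hypothetical sketch (the task names & category fan-out are illustrative, not our actual pipeline):

```python
# Hypothetical DAG showing programmatic task generation (Airflow 1.10-era API).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="forecast_pipeline",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

prepare = BashOperator(task_id="prepare_features", bash_command="echo prepare", dag=dag)
publish = BashOperator(task_id="publish_forecasts", bash_command="echo publish", dag=dag)

# Fan out one training task per category, purely in code; with Oozie this
# structure would have to be spelled out statically in XML.
for category in ["grocery", "apparel", "electronics"]:
    train = BashOperator(
        task_id="train_{}".format(category),
        bash_command="echo training {}".format(category),
        dag=dag,
    )
    prepare >> train >> publish
```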

Some of the positives first:

  1. Very easy to monitor the workflows & manage their runs. You do not need to leave the Web UI to debug issues
[Airflow Web UI screenshot. Courtesy: Airflow official documentation]

Some of the drawbacks now:

  1. More than one scheduler will create a mess, so to achieve HA you need an active-passive design (implemented by teamclairvoyant)

Tech Stack

[Data Engineering tech stack diagram]

ML Engineering

What we need

  • Development workspace allowing Scientists to test their code in isolation

Scaling our production workloads can be achieved in 2 ways:

  1. Distributed. Frameworks like TensorFlow allow this
  2. Embarrassingly parallel. Partition the workload & run many independent, smaller jobs

The approach of dividing the store-items into partitions (via a suitable clustering algorithm) & then taking up training or scoring for each partition independently worked well for our needs.
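A minimal sketch of this partition-then-parallelize idea, assuming scikit-learn; the random features, cluster count & per-partition model are illustrative stand-ins for our actual algorithms:

```python
# Illustrative only: cluster store-items, then fit one model per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
features = rng.random((10_000, 8))   # one row of attributes per store-item
demand = rng.random(10_000)          # historical demand signal

# 1. Group similar store-items so each partition is a tractable job
clusters = KMeans(n_clusters=100, random_state=42).fit_predict(features)

# 2. Train one model per cluster; every iteration is independent, so each
#    can run as its own containerized job in parallel
models = {}
for c in np.unique(clusters):
    mask = clusters == c
    models[c] = LinearRegression().fit(features[mask], demand[mask])
```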

Runtime

Containers (Docker) allowed us the flexibility to isolate each algorithm’s environment independently.

At a very high level, we ended up having different base Docker images for each type of runtime & extending them with our algorithm code:

  1. GPU-based (including NVIDIA libraries & custom GPU algorithms)

Orchestration

K8S (Kubernetes) allows us to schedule & orchestrate the Docker containers as below:

  1. Use kubectl commands to create Job Specifications (including what code to run, resourcing needs, etc.)

We are running close to 30,000 jobs in this manner, orchestrating the run to finish within 2–3 hours & generate all forecasts, every single day.
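As an illustration of the Job-creation step, here is a hedged sketch using the official Kubernetes Python client (an alternative to shelling out to kubectl); the image name, namespace & resource values are hypothetical:

```python
# Hypothetical Job submission via the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="score-partition-42"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="scorer",
                        image="registry.example.com/forecast/scorer:latest",  # placeholder
                        args=["--partition-id", "42"],
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "2", "memory": "4Gi"}
                        ),
                    )
                ],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="forecasting", body=job)
```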

Model & Meta Store

  • Models, which are the result of Training, have to be stored. These are serialized versions of calculations that will be used during scoring.
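For illustration, a minimal sketch of persisting one trained model next to a small metadata record; joblib, the file layout & the metadata fields are assumptions, not our actual store:

```python
# Illustrative model + metadata persistence; layout & fields are assumed.
import json
import os
import time

import joblib
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

os.makedirs("models", exist_ok=True)

# Serialized model: the artifact loaded later at scoring time
joblib.dump(model, "models/partition_42.joblib")

# Meta store entry: enough to trace which run produced which model
meta = {
    "model_id": "partition_42",
    "algorithm": "LinearRegression",
    "trained_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
with open("models/partition_42.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```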

Tech Stack

[ML Engineering tech stack diagram]

Demand Workbench

What we need

  • APIs powering some of the descriptive analytics applications, including our own Demand Workbench

Workloads

Transactional

  • Most of the transactional data is also relational in nature & needs ACID compliance

Analytical

  • Size of the dataset (500 million+ store-item metrics & 100 billion+ data points, amounting to 50–60 TB of raw data)

These unique needs led us to evaluate systems like Cassandra, Solr & Elasticsearch.

Finally, after a thorough POC simulating the real workloads, we selected Elasticsearch for our needs. It allowed us to build indices where we can search on a multitude of item/store attributes & also aggregate the results, typically within 100ms.
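To make that concrete, here is a hedged sketch of the kind of filtered search plus aggregation the workbench issues, assuming the elasticsearch-py client; the endpoint, index & field names are illustrative:

```python
# Illustrative search + aggregation against a store-item metrics index.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

response = es.search(
    index="store-item-metrics",  # hypothetical index
    body={
        "size": 0,  # we only want aggregates, not raw hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"category": "dairy"}},
                    {"term": {"region": "southeast"}},
                ]
            }
        },
        "aggs": {
            "by_store": {
                "terms": {"field": "store_id", "size": 50},
                "aggs": {"total_forecast": {"sum": {"field": "forecast_qty"}}},
            }
        },
    },
)

for bucket in response["aggregations"]["by_store"]["buckets"]:
    print(bucket["key"], bucket["total_forecast"]["value"])
```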

Tech Stack

[Demand Workbench tech stack diagram]

What Next

In this post, we saw the factors that went into selecting the tech stack for the most demanding forecasting application in Walmart.

In subsequent posts, I’ll cover some of the best practices that we learned over the course of building & operating this application.

Ramu Malur

Written by

Principal Software Engineer, Walmart Labs India, https://www.linkedin.com/in/ramumalur/

Walmart Global Tech Blog

We’re powering the next great retail disruption. Learn more about us: https://www.linkedin.com/company/walmartglobaltech/
