Machine Learning Platform at Walmart

Thomas Vengal
Walmart Global Tech Blog
19 min read · Sep 6, 2023

Authors: Thomas Vengal, Pamidi Pradeep, Bagavath Subramaniam, Hema Rajesh, Girish Ramachandran Pillai, Ravishankar K S, Anirban Chatterjee, Kunal Banerjee, Rahul Rawat, Anil Madan

Walmart Global Technology Center

Abstract

Walmart is the world’s largest retailer, and it handles a huge volume of products, distribution, and transactions through its physical and online stores. Walmart has a highly optimized supply chain that runs at scale to offer its customers shopping at the lowest prices. In the process, Walmart accumulates a huge amount of valuable information from its everyday operations. This data is used to build Artificial Intelligence (AI) solutions that optimize operations and improve the customer experience at Walmart. In this paper, we provide an overview of the guiding principles, technology architecture, and integration of various tools, both from within Walmart and from the open-source community, in building the Machine Learning (ML) Platform. We present multiple ML use cases at Walmart and show how their solutions leverage this ML Platform. We then discuss the business impact of having a scalable ML platform and infrastructure, reflect on lessons learned building and operating an ML platform, and outline future work for it at Walmart.

Introduction

Walmart Inc. (Walmart 2023a) is a people-led, tech-powered omni-channel retailer helping people save money and live better — anytime and anywhere — in stores, online, and through their mobile devices. Each week, approximately 240 million customers and members visit more than 10,500 stores and numerous eCommerce websites in 20 countries. With fiscal year 2023 revenue of $611 billion, Walmart employs approximately 2.1 million associates worldwide. Walmart continues to be a leader in sustainability, corporate philanthropy, and employment opportunity. The staggering size and scope of Walmart’s operations, and their influence on the retail industry and the global economy, cannot be overstated. Walmart’s supply chain is extensive and complex: it works with over 100,000 suppliers worldwide, operates 150 distribution centers in the United States and the largest private fleet of trucks in the world, with over 10,000 tractors and 80,000 trailers, and also makes extensive use of ocean freighters and air cargo to move goods across the globe. Walmart sells a wide variety of products, from groceries, apparel, and home goods to electronics and more. This massive scale of operations brings the challenge of keeping operations smooth, efficient, and timely.
Walmart leverages its size and scale, investing in efficient supply chain management and the latest technologies, such as advanced AI capabilities, to deliver lower prices and cost savings for its customers.

Fig 1: Footprint of AI Use cases in Walmart

AI is increasingly gaining strategic importance at Walmart: to build better products, improve service to customers, plan resources efficiently, and hold a differentiated competitive advantage to stay ahead in the industry. As shown in Figure 1, Walmart uses AI in a wide range of areas, such as supply chain management (predicting product demand and optimizing inventory at stores and warehouses), personalization of customer preferences, recommendations for promotion and advertising, fraud prevention (such as credit card fraud at stores and online), and improved customer service using AI chatbots and conversational AI. But implementing AI/ML solutions comes with key challenges: the accuracy of the solutions, long time to market, and the cost involved in developing and running them; for many enterprises, a positive return on investment from implementing ML solutions is still elusive. A way to address this is to develop high-quality ML solutions faster, in greater numbers, and at a lower cost. This leads to four key challenges for organizations to handle:

  • Faster Innovation: The fierce competition in the retail domain forces us to innovate for newer business cases on a constant basis, which demands a faster time to market. It also drives the need to adopt the latest technologies and products in the market faster.
  • Higher Scale: The need to solve bigger and more complex problems grows day by day with the increasing volume of data. There is a need to scale resources on demand across multiple clouds and regions with higher system availability.
  • Reduced Cost: Overall cost control requires efficient use of cloud resources by implementing design patterns that prevent wastage and improve utilization, such as idle time reduction, auto-scaling, effective resource sharing, and adopting high-performance compute options like GPUs and newer distributed frameworks. Reducing cost should not degrade functionality or performance.
  • Stronger Governance: ML models must remain efficient and performant to be considered high quality, and ML model governance is increasingly enforced to protect the company from ethical and legal risks. This adds the demand, and the complexity, of constantly monitoring model efficiency and model decay.

This paper is organized as follows: In the sections “Related Work” and “Platform Vision and Guiding Principles”, we discuss related work and our vision for the ML platform, respectively. In the section “Element ML Platform”, we present the architecture and the integration of various tools within Walmart in detail, along with some open-source technologies that we adopted for each component of the architecture.

In the section “Use-cases”, we analyze multiple ML use cases at Walmart and show how their solutions are shaped differently by unique design requirements. We then discuss the business impact of having a scalable ML platform and infrastructure in the section “Business Impact”. We share the lessons we learned building and operating ML infrastructure at Walmart in the section “Lessons Learned”. We conclude the paper with our plans for future work in the section “Conclusion and Future Work”.

Platform Vision and Guiding Principles

A scalable machine learning platform helps data science teams at Walmart solve most of the problems mentioned above. With the vision to ‘provide competitive advantage by developing transformational platform capabilities that simplify the adoption of AI/ML at scale’, this platform serves all the personas connected with the AI solution lifecycle, such as data scientists, data engineers, ML engineers, and application developers consuming AI results. The guiding principles for the ML Platform are:

  • Best of breed: Leverage the best tools, services, and libraries available in the market through a combination of proprietary, open-source, and inner-sourced methods. Teams get access to the latest and greatest offerings, helping them build high-quality AI solutions faster.
  • Speed and Scale: Automation, reusability, standardization, and fast access to the required resources speed up development, reduce time to market, and help data science teams build more solutions in the available time.
  • Cost: The platform brings together multiple operational efficiencies, cost management tools, and economies of scale that assist in better negotiated prices, all of which help reduce the costs incurred in developing and running AI solutions.
  • Governance: Compliance and responsible use of AI are enabled by standardized processes and mechanisms put in place on the platform.

Element ML Platform

‘Element’ is Walmart’s ML Platform, developed as an end-to-end platform with tools and services spanning the typical data science lifecycle for data scientists, data engineers, ML engineers, and application developers.
Technology Choices for ML Platform

The Walmart ML platform’s strategy is to leverage open-source technologies, build strong industry partnerships, facilitate inner-sourcing, and integrate with existing enterprise services to improve productivity, speed, and innovation at scale. Element is built from the ground up to address requirements at various levels, building capabilities where no existing solutions were available and reusing services/frameworks/products wherever possible. In this section, we highlight the various ML development stages, the lifecycle, the platform’s integration with various ML tools/frameworks, and the customization built to meet our guiding principles.

Fig 2: Element ML platform built with the latest technologies and aligned with various stages of the Data Science project lifecycle.
  • Data Ingestion and Preparation
    Walmart has data spread across multiple source systems and across clouds. To simplify and standardize access to data irrespective of the source, we built dataset Application Programming Interfaces (APIs) that source from different data systems. We support over twenty data sources, including databases, NoSQL databases, applications, and file formats. We developed APIs for different runtimes ((Python 2023a), (R 2023), (Wendel 2023)), abstracting the complexities of individual connectors from developers. Credentials are encrypted and stored in our database. (A minimal sketch of such a dataset API appears at the end of this section.)
  • Feature Engineering and Model Training
    Element was built with a best of breed principle. Hence, we build and package all popular libraries in the default environment or let users build their own custom environment. To ensure portability of development environment into production, we went with a decoupled environment approach to remove any dependencies that might create inconsistencies between development and production. For Python runtime, we initially experimented with virtual environments (Python 2023b), however, for certain ML libraries it was not possible to instrument and bundle their corresponding OS dependencies. Hence, we adopted a more comprehensive, portable environment management system for data scientists. We leverage the open-source cloud IDE images and add customizations like proxy configurations, custom plugins for resource instrumentation, user workspace setup etc. We continue to enhance the platform with options to optimize resource usage, especially for high compute resources like Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) (Google 2023). We also integrate with performance optimized container images for GPU workloads. Also, to keep the environment and runtime the same between an interactive development and a production deployment we decided to support Notebooks as a deployment option. Users can develop code in a notebook/IDE on cloud, can build workflows using these notebooks which they have used for development and deploy them in production. To keep up with the latest innovation in this space, we partnered with top cloud vendors to bring in the latest tools for ML development within Walmart. Our integration with major cloud ML tools enabled users to have complete flexibility to leverage them for development and seamlessly deploy these models through Element’s native multi-cloud deployment framework.
  • Model Experimentation
We leveraged an open-source experimentation repository and added multi-tenancy capability to it, as we wanted to host a single instance with a combined view of all experimentation and models across the organization. We brought in authentication and authorization controls to ensure controlled sharing across multiple teams in the organization.
  • Model Evaluation
Model evaluation happens during training as well as pre-deployment, as part of our deployment pipelines. The metrics are captured in the experimentation repository.
  • Model Deployment
Our model deployment framework is built on ML Operations (MLOps) principles. We decided to build the deployment platform on top of open-source frameworks for real-time deployments, with features like multi-model serving, batching, payload logging, A/B testing, and monitoring libraries. These are deployed to our Kubernetes clusters on private and public clouds. Batch deployments (MLOps pipelines) can be based on either a git codebase or notebooks, orchestrated through our batch deployment platform.
  • Model Monitoring, Feedback and Retrain
We have built a framework for model monitoring using open-source libraries, as no existing integrated framework is available in the open-source world. Once a model is deployed through our MLOps pipelines, monitoring is enabled by default for application and model performance. The logs and metrics are integrated with our enterprise logging and monitoring platforms for automated alerting as well as user interfaces.
Feedback from model monitoring is captured, evaluated, and used to decide whether a model needs to be retrained due to model decay over time. (A generic sketch of one such decay signal appears at the end of this section.)
  • Model Governance
    We help the AI Governance process by capturing all ML Model metadata and store in a common AI catalogue for discovery, reuse, and audit. We have built gates in our ML Continuous Integration and Continuous Deployment (CI/CD) process where users can configure metrics threshold for models going into production. This provides automated quality checks on models that are deployed through MLOps process.
  • Integrated Services
These are services that already exist as shared services in the organization; Element, as a platform, integrates with them instead of replicating their functionality.

i) Authentication & Authorization: The platform integrates with the enterprise SSO (Single Sign On) to provide seamless authentication. In addition, the platform lets users create flexible authorization policies based on users and groups. This helps implement the principle of least privilege, from data access to model deployments. All roles and privileges are based on AD (Active Directory) groups so that access provisioning can be tracked easily.

ii) Secret Management: The platform integrates with AppVault (an internal REST-based service for managing sensitive data) for user secret management. This ensures no sensitive information is stored in the user repository or baked into the container images built by the training and serving CI/CD process.

iii) Code Versioning: The platform is integrated with Enterprise GitHub (GitHub/Microsoft 2023). It lets users organize code in their repositories and access it from the platform for both development and deployment. This gives us proper version control for all our artifacts.

iv) Artifact Management: All user artifacts (model training images, model serving images, training pipelines, and inference pipelines) are stored in the enterprise repository manager. The platform can store and consume artifacts from the repository, and the right checks are performed during the artifact publishing process so that all Walmart standards are met.

v) Build & Deploy: The MLOps functionality of the platform has CI/CD pipelines built based on Concord (WalmartLabs 2023) and Looper (Walmart’s customized Jenkins tool) which are offered as Enterprise Managed Services. This facilitates the rapid iteration of model training and model serving artifact development and deployment processes.

vi) Integrated Logging: Managed Log Service (MLS) is the enterprise logging system based on Splunk (Splunk 2023) for all applications. This is leveraged by the platform for integrated logging requirements for users and platform-related logs. Logs are made available in a scalable and secure manner for all user workloads executed on the platform.

vii) Integrated Monitoring: Integrated Monitoring is one of the key requirements for a mature AI platform. The platform ensures a seamless integration with the enterprise standard Managed Metrics Service (MMS) for both systems and user application-related metrics monitoring. Canned metrics are automatically available for all AI workloads deployed on the platform.

viii) Alerting: xMatters, along with Slack (Stewart Butterfield and Mourachov 2023), Microsoft Teams (Teams 2023), and email, are the standard channels used to alert users.
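To make the “Data Ingestion and Preparation” bullet above concrete, here is a minimal sketch of what a dataset API that hides per-source connector complexity could look like. The module, class, and field names are hypothetical illustrations, not Element’s actual API; the real platform supports twenty-plus connectors and keeps credentials encrypted rather than inline.

```python
# Hypothetical sketch of a dataset API; not Element's actual interface.
from dataclasses import dataclass

import pandas as pd


@dataclass
class DatasetConfig:
    """Connection metadata registered once per dataset."""
    name: str          # logical dataset name, e.g. "store_sales_daily"
    source_type: str   # e.g. "parquet", "bigquery", "cassandra"
    secret_ref: str    # pointer to encrypted credentials, never inlined


class DatasetClient:
    """Uniform read interface across heterogeneous data sources."""

    def __init__(self, registry: dict[str, DatasetConfig]):
        self._registry = registry

    def read(self, name: str) -> pd.DataFrame:
        cfg = self._registry[name]
        # Dispatch to the right connector behind a single API surface;
        # only the local-file path is shown here, others are elided.
        if cfg.source_type == "parquet":
            return pd.read_parquet(f"/datasets/{cfg.name}.parquet")
        raise NotImplementedError(f"connector for {cfg.source_type} not shown")


client = DatasetClient({
    "store_sales_daily": DatasetConfig("store_sales_daily", "parquet", "appvault://sales-ro"),
})
# df = client.read("store_sales_daily")  # same call regardless of backing store
```

The point of the pattern is that a consumer's code does not change when the backing store does; only the registered configuration differs.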
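The “Model Monitoring, Feedback and Retrain” bullet mentions detecting model decay. One widely used, library-agnostic decay signal is the Population Stability Index (PSI), which compares the distribution of a feature (or of model scores) at training time against live traffic. The snippet below is a generic sketch of that statistic, not Element’s monitoring code; the thresholds in the docstring are common rules of thumb.

```python
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a reference sample (training data) and live data.

    Rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants a look,
    and > 0.25 is often treated as drift worth a retrain review.
    """
    # Bucket edges come from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full real line
    expected_frac = np.histogram(expected, edges)[0] / len(expected)
    actual_frac = np.histogram(actual, edges)[0] / len(actual)
    eps = 1e-6  # guard against log(0) for empty buckets
    expected_frac, actual_frac = expected_frac + eps, actual_frac + eps
    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))


rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)
live_scores = rng.normal(0.8, 1.0, 10_000)   # shifted distribution: simulated decay
print(population_stability_index(train_scores, live_scores))  # flags significant drift
```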
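And for the “Model Governance” bullet, the sketch below shows the general shape of a metric-threshold gate: a small CI/CD step that fails the pipeline when a candidate model misses the thresholds a team configured. The file name and metrics schema are hypothetical; the real gates are wired into Element’s MLOps pipelines.

```python
# Hypothetical CI/CD gate; file name and metric schema are illustrative.
import json
import sys

THRESHOLDS = {"auc": 0.80, "precision": 0.75}  # user-configured minimums


def main() -> int:
    # The metrics file is assumed to be emitted by the preceding evaluation step.
    with open("candidate_metrics.json") as f:
        metrics = json.load(f)
    failures = [
        f"{name}={metrics.get(name, 0.0):.3f} (required >= {floor:.2f})"
        for name, floor in THRESHOLDS.items()
        if metrics.get(name, 0.0) < floor
    ]
    if failures:
        print("Deployment gate FAILED:", "; ".join(failures))
        return 1  # non-zero exit code blocks promotion to production
    print("All metric thresholds met; model may be promoted.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```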

Element Architecture

Figure 3: Layered view of Element ML Platform architecture

The Element ML platform (S 2023) tech stack is built on best-of-breed open-source tools, with Kubernetes at its core for container orchestration. As depicted in Fig 3, the Web UI acts as the standard interface for the platform irrespective of the cloud where the workloads are executed. It also hosts a DAG (Directed Acyclic Graph) Designer interface which lets users build DAGs visually. All these UI functions are supported by backend microservices deployed in the same Kubernetes cluster as the front-end pods. Standard Python/R workloads are executed as Kubernetes workloads on the same cluster, in team-specific namespaces (see the sketch below). Distributed workloads using Hive and Spark are sent to the Hadoop clusters. Network File System (NFS) solutions are used for shared workspaces. Datasets and models are stored on object stores.
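To illustrate how a standard Python workload can land on the cluster as described above, the sketch below submits a one-shot Kubernetes Job to a team namespace using the official `kubernetes` Python client. The namespace, image, and job name are placeholders; this shows the pattern, not Element’s internal scheduler.

```python
from kubernetes import client, config


def submit_training_job(team_ns: str, image: str, command: list[str]) -> None:
    """Create a one-shot Kubernetes Job in the team's namespace."""
    config.load_kube_config()  # in-cluster services would use load_incluster_config()
    container = client.V1Container(name="train", image=image, command=command)
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(restart_policy="Never", containers=[container])
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="demo-train", namespace=team_ns),
        spec=client.V1JobSpec(template=template, backoff_limit=1),
    )
    client.BatchV1Api().create_namespaced_job(namespace=team_ns, body=job)


# Requires access to a real cluster, hence commented out:
# submit_training_job("team-search", "python:3.11-slim", ["python", "-c", "print('train')"])
```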

Fig 4: Hybrid multi-cloud triplet (Walmart 2023b): overall architecture of the Element ML platform

In Figure 4, the User Interface (UI) layer, built on React.js, lets users interact with all our backend services and acts as the launchpad for the supported cloud IDEs (Integrated Development Environments). The underlying services are responsible for distinct functions (project management, notebooks, workflows, datasets, cluster management, model repository, etc.). All these services, along with the user workloads (interactive notebooks / batch jobs), run on a shared Kubernetes instance. Kubernetes clusters are set up on top of Google/Azure/OneOps (OneOps 2023) (Walmart private cloud) Virtual Machines (VMs) to provide redundancy and data locality. Different types of compute (CPUs/GPUs/TPUs) are provisioned and deprovisioned dynamically for optimum utilization. The platform provides options for library management to ensure seamless reproduction of runtimes across different environments. Object storage is used for storing datasets as well as model artifacts.
The platform is optimized to execute ML workloads at the lowest cost by running them on shared, cost-efficient infrastructure and charging users back according to their usage.
Element Kubernetes clusters are deployed across different clouds within the Walmart ecosystem. This lets us shield users from the nuances of working on different cloud platforms while providing the freedom to place workloads as per their requirements, as shown in Figure 4. This architecture also lets users switch from one cloud to another without having to spend hours reconfiguring and tuning for the new cloud environment. (A hypothetical sketch of this developer experience follows.)
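A hypothetical sketch of that developer experience: the same pipeline definition is sent to a different cloud by changing only the placement arguments. `Pipeline` and `deploy` are illustrative names, not the real Element SDK.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    name: str
    notebook: str   # the development notebook doubles as the production artifact
    schedule: str   # cron expression


def deploy(pipeline: Pipeline, cloud: str, region: str) -> None:
    # A real deployment service would translate this intent into Kubernetes
    # resources on the chosen cloud's Element cluster; here we only print it.
    print(f"deploy {pipeline.name} ({pipeline.notebook}) -> {cloud}/{region}")


churn = Pipeline("churn-retrain", "notebooks/churn.ipynb", "0 4 * * *")
deploy(churn, cloud="azure", region="eastus2")
deploy(churn, cloud="gcp", region="us-central1")  # switching clouds is a one-argument change
```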

Use-cases

In this section, we present several ML use cases across four broad categories in production at Walmart and show how they use the ML architecture described above to achieve their business goals.

  • Channel performance: Walmart provides its channel partners a tool with details on sales data, promotions, and control over shelf assortments, along with actionable insights and recommendations, empowering them to make informed decisions. This tool leverages AI to analyse data on individual items sold at each store location. Considering the massive scale of store operations, with hundreds of thousands of items sold at each location, this presents a significant challenge in terms of data analysis and feature engineering. A data scientist brings a subset of data for an individual item into an Element notebook and performs feature engineering and ML modelling. Element provides seamless connectivity to various data systems and distributed runtimes, enabling scalable and efficient data analysis and modelling for large-scale retail operations, and lets teams deploy the solution as a production-ready pipeline. With Element, training time improved, which reduced the overall cost per supplier.
  • Search: The Search Data Science team of Walmart.com provides the capability for fast and efficient retrieval of information from Walmart.com product catalogues. Search capability drives high-impact customer-facing metrics such as Gross Merchandise Value (GMV), Click Through Rate (CTR), Search Abandonment Rate (SAR), etc. One of the key challenges for the Search team is to provide high training throughput, considering that over 90% of models degrade over a brief period (Vela 2023). Search is also temporal by nature. For example, ‘hats’ in winter means ‘winter hats’ vis-a-vis ‘summer hats,’ which means models need to evolve over time. As we move towards being the best discovery and acquisition system for our users, we benchmark against similar systems by other vendors.
    Given the nature of these challenges, it is particularly important that we keep training ever more complicated hypotheses, and do so very efficiently, with the goal of removing the cognitive dissonance between the search results we provide and what the customer expects to see. This requires data scientists to work on large nonlinear models and ever-growing feature sets, which in turn requires them to understand, operate, and manage large clusters, big data pipelines, GPU-based infrastructure, plumbing platforms, and experiment frameworks, along with understanding data science algorithms. This slows down overall experimentation velocity and increases iteration costs.
    Element’s notebook IDE allows the data scientists of the search team to rapidly experiment on a new hypothesis. Element’s built-in workflow engine, which helps provide improved hyperparameter tuning, expedites experiments by parallelizing multiple iterations. We integrated the latest version of the model registry with metric emitters that can easily be invoked in experiment code and the workflow engine. Data scientists can then visually compare various values and rapidly identify the best parameters (see the sketch after this list).
  • Market Intelligence: Market Intelligence is a business intelligence solution that helps Walmart make better decisions by providing insights into competitors’ pricing, assortment, and marketing strategies. Its core is a product matcher which uses a variety of methods, algorithms, and machine learning techniques for competitive price determination. GPU-enabled notebooks and an inference service from Element offer a short time-to-market for building the efficient machine-learned models required for market intelligence. The MLOps pipeline supports deploying the model as an auto-scalable, highly available solution, making it cost-effective while remaining always available for the downstream application to consume. It leverages multi-cloud technology: the model can be developed on one cloud provider using GPU-enabled notebooks and deployed on another cloud provider’s infrastructure for inferencing. MLOps on Element manages a peak load of multiple million requests per day per region.
  • Last Mile Delivery: The intelligent driver dispatch system built by the last mile delivery team helps reduce the cost and lead time of delivering customer orders while keeping on-time delivery rates high. The system helps time the driver search, matching the best driver and trip to deliver customer orders on time. Built on the Element platform, the system uses a combination of machine learning, optimization, and heuristic models.
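As referenced in the Search use case above, the sketch below shows how a small hyperparameter sweep might log runs to the experimentation repository and how the best parameters are read back. The post does not name the tracker, so assuming an MLflow-compatible API is our assumption here, and `train_and_eval` is a stand-in for real model training.

```python
import mlflow


def train_and_eval(lr: float, depth: int) -> float:
    """Stand-in for real model training; returns a fake relevance score."""
    return 0.5 + 0.1 * lr + 0.01 * depth


# Each trial becomes a tracked run; a workflow engine could fan these out
# in parallel, we run them serially for brevity.
for lr in (0.01, 0.1):
    for depth in (6, 8):
        with mlflow.start_run():
            mlflow.log_params({"lr": lr, "depth": depth})
            mlflow.log_metric("ndcg_at_10", train_and_eval(lr, depth))

# Rank all runs by the target metric and read off the winning parameters.
best = mlflow.search_runs(order_by=["metrics.ndcg_at_10 DESC"]).iloc[0]
print(best["params.lr"], best["params.depth"], best["metrics.ndcg_at_10"])
```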

Business Impact

The Element platform team’s adoption of best-of-breed tools and technologies has saved individual teams a lot of time that would otherwise be spent evaluating multiple commercial or open-source tools and engaging with external vendors. With Element, the overall startup time for data science teams has been significantly reduced. Teams now get onboarded quickly to any of the multi-cloud resources they need, with ready-to-use development and deployment tools available in their development environment immediately. The effort and time involved in operationalizing models for deployment into multiple cloud environments has also dropped significantly: with standardized MLOps processes and their integration for deployments on different clouds, teams deploy faster, and the time taken to operationalize models has fallen from a couple of weeks to under an hour. We negotiate and offer better infrastructure resources (even the ones that are scarce) from third-party vendors on behalf of the data science teams. Every team benefits from process improvements on the platform by subscribing to the evolving best practices and standardization. While building common monitoring and utilization management into the platform, we were able to add tools that reduce wastage, resulting in overall cost reduction. Typically, resource usage is extremely low for notebook users; as part of resource allocation optimization, overall infrastructure utilization was increased with auto-scaling and multi-tenancy capabilities.

Lessons Learned

  • Open-source Adoption
    The pace of innovation in open source is remarkably high. The decision to adopt open source has reaped the benefits of innovation from the best technology communities across different companies, rather than from a single vendor, as would have been the case had we gone with a proprietary closed-source product.
  • Inner-Sourcing
    As our platform adoption grew, so did the feature requests. We could not always cater to them in a timeframe acceptable to users, which led us to adopt the inner-source model, where users contribute certain features that are critical to them. This enriched the platform and simultaneously brought commitment from the contributing customers, increasing platform adoption.
  • Developer Productivity
    The platform started out open and un-opinionated, but after a couple of years we realized we needed a more prescriptive approach for certain use cases like MLOps and AI Governance. This required customers to adhere to certain standards, but it let the platform automate a lot of things, such as the DevOps (GitOps) model, security configurations, and the AI model catalog.
  • Cloud Agnostic Platform
    As an organization, we have evolved over the last few years from private datacentres to more than one public cloud partner. During this journey, there have been instances where we had to migrate huge sets of workloads from one cloud to another. Our decision to build a cloud-agnostic platform on top of Kubernetes helped us greatly, as it made the development environment and workload placement transparent for users. Apart from choosing the target cloud, the rest of the developer experience is the same irrespective of cloud and region.
  • Resource Utilization and Cost Management
    Efficient resource utilization has been one of the major challenges with siloed tool adoption across the enterprise. We have realized major gains in resource utilization by converging on a common platform. Antipatterns (like high idle time, missing TTLs, over-provisioning, and underutilization) are easier to identify and fix with a common platform, and we can share resources across tenants efficiently to bring down the overall spend for AI-related workloads. (A small sketch of one such idle-time check follows.)
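For instance, one of the simplest idle-time fixes is a culler that reclaims notebook servers inactive past a TTL. The sketch below shows only the policy; the server records and the control-plane call are hypothetical placeholders for whatever API the platform exposes.

```python
from datetime import datetime, timedelta, timezone

IDLE_TTL = timedelta(hours=2)  # reclaim anything idle longer than this


def find_idle_servers(servers: list[dict]) -> list[str]:
    """Return ids of notebook servers whose last activity exceeds the TTL."""
    now = datetime.now(timezone.utc)
    return [s["id"] for s in servers if now - s["last_activity"] > IDLE_TTL]


servers = [
    {"id": "nb-alice", "last_activity": datetime.now(timezone.utc) - timedelta(hours=3)},
    {"id": "nb-bob", "last_activity": datetime.now(timezone.utc) - timedelta(minutes=5)},
]
for server_id in find_idle_servers(servers):
    print(f"stopping {server_id}")  # a real culler would call the platform API here
```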

Conclusion and Future Work

As a scalable ML platform, we continue to identify more avenues of cost savings, bring in best of class tools, and offer a flexible platform to improve the overall efficiency and reduce time to market of AI solutions without compromising on the governance aspects.
We have identified the following areas that we will strategically evaluate and invest in soon to improve the platform’s capabilities:

  • Annotation and Labelling: Extending the platform to offer the best tools for data annotation and labeling, streamlining and automating data generation while improving the speed and quality of input data.
  • Feature Store: A scalable, flexible, and cost-effective feature store to be made available across multi-cloud.
  • Resource Utilization and Estimation: Automatic resource estimation based on the data and methods used in model training. Preemptible and serverless training will improve resource utilization and enable dynamic scaling to meet workload demands.
  • Distributed Compute: Methods and systems to distribute the computing infrastructure and to migrate from high-cost infrastructure to many low-cost infrastructures to improve speed and cost efficiency of AI solutions.
  • Edge Deployments: Today there are many challenges in deploying to a wide range of edge devices. We intend to standardize and manage edge deployments for Walmart use cases and enable federated learning to distribute training and inferencing across edge devices.
  • AR/VR: Platform capabilities to enable Augmented Reality (AR) and Virtual Reality (VR) use cases and workloads.

Related Work

Many cloud providers offer a wide range of services for building and deploying AI and ML models. They bring cloud-based IaaS (Infrastructure as a Service) and PaaS (Platform as a Service) services for AI development, and each of these players has AI platforms, tools, and services that can be leveraged for developing AI solutions. While there are benefits in some areas, there are drawbacks when developing large-scale ML solutions. The primary one is vendor lock-in: enterprises get locked into the technology stack of the cloud provider, and over time the switching cost increases as companies invest time, effort, and money in building solutions on these platforms, which weakens their bargaining power. There is also the dependency on a single cloud provider, which can limit availability and reliability in certain regions or during service disruptions.
Third-party AI platform providers and others offer cloud-agnostic solutions where the platforms can be deployed and used on any cloud or on-prem instance. While they help address the issue of being tied to a single cloud vendor, they bring their own challenges. License cost becomes a bigger chunk of the total cost as ML adoption scales. Tight integration with the platform’s capabilities also locks users into the tool, making it expensive to switch between tools. Customizations are difficult, as they depend on the roadmaps and consultation provided by these makers. While there are tools in the market that assist data scientists at various stages of the lifecycle, no one tool fits the multi-cloud, best-of-breed, cost-curve-bending strategy effectively. Given the need to avoid vendor lock-in, stay flexible, and retain bargaining power at Walmart’s scale of usage, no single tool/platform was found to be suitable.
Hence Element, an ML platform, was built to leverage the best of the available solutions, and to do so at a lower cost.

References
