How we enabled product and pricing-availability feeds as APIs for external partners

swetha kasireddy
Walmart Global Tech Blog
7 min read · Sep 2, 2020
Photo credit: ReaxionLab

Walmart, being the country's largest grocer, wants customers to be able to shop at their local store digitally and plan their trip even before they choose curbside pickup or delivery.

We are from the Affiliates and OpenAPI engineering team. Our team enables third-party integrations that drive traffic and conversion for Walmart. One of our use cases needed a smart, efficient infrastructure to handle the variability of products across thousands of stores, making it possible to integrate with many third parties in the future who can help Walmart customers better integrate their lives with Walmart as a retailer.

In this post, I will give an overview of how we enabled product and pricing-availability information as downloadable feeds for external partners. With these APIs, a prospective Walmart customer can discover products for their local store on a third-party publisher's app or website and add their chosen items directly to a Walmart cart. Walmart handles the fulfillment, be it curbside pickup or delivery.

Overview

As part of the feeds APIs, we provide a daily snapshot of product metadata and product pricing-availability information across all active stores. In addition to snapshots, we also provide deltas every 15 minutes. The typical usage pattern for partners is to bootstrap the product metadata and pricing-availability information from the snapshots and then incrementally apply deltas as they arrive every 15 minutes.

Feeds-based integration works well for bulk data exchange. For point lookups on individual products, we provide a real-time pricing-availability API that can be used to fetch the most up-to-date data for an individual item.

All the APIs are filterable by category path; the list of category paths is exposed through our taxonomy API.
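
To make the usage pattern concrete, here is a minimal partner-side consumer sketch. The endpoint paths, parameter names, and response fields below are illustrative assumptions, not the actual API contract:

```python
import requests

BASE_URL = "https://developer.api.walmart.com/feeds"   # illustrative base URL
HEADERS = {"Authorization": "Bearer <partner-token>"}   # partner auth (assumed scheme)

def download(url, dest):
    """Stream a (signed) feed URL to a local file."""
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)

# 1. Discover category paths via the taxonomy API (endpoint and response shape assumed).
categories = requests.get(f"{BASE_URL}/taxonomy", headers=HEADERS).json()
category_id = categories[0]["categoryId"]

# 2. Bootstrap: fetch the daily snapshot for the chosen category.
snapshot = requests.get(
    f"{BASE_URL}/snapshot", params={"categoryId": category_id}, headers=HEADERS
).json()
download(snapshot["downloadUrl"], "snapshot.json.gz")   # signed GCS URL

# 3. Incrementally apply deltas as they arrive every 15 minutes.
deltas = requests.get(
    f"{BASE_URL}/deltas",
    params={"categoryId": category_id, "fromDateHour": "2020-09-01T10"},
    headers=HEADERS,
).json()
for d in deltas["files"]:
    download(d["downloadUrl"], d["name"])

# 4. For an individual item, use the real-time pricing-availability API instead.
item = requests.get(f"{BASE_URL}/price-availability/ITEM_ID", headers=HEADERS).json()
```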

Key design considerations

Following are the key design considerations:

  • To address the bandwidth concerns seen with some of the existing APIs, which serve downloadable feeds from an on-premises cluster.
  • To ensure the security of the APIs we expose to external partners.
  • To ensure that the data we expose is of proper quality, as any data discrepancy can cause an improper state resulting in add-to-cart errors for customers.
  • To make the feeds APIs multi-tenant. Multi-tenancy is needed to ensure that only the inventory available to a given partner is included in the respective feed or API. Think of it as category- or item-level access control for inventory information.
  • To monitor and alert on the pipelines to ensure that they are running without any issues.
  • To make the feeds queryable by category.
  • To make the on-boarding process easier for partners.

Snapshot feed API architecture

Snapshot feeds provide a snapshot of product catalog metadata and product pricing-availability information for a given day. The high-level architecture is as follows:

The product catalog and pricing-availability data is collected from a number of upstream systems where the inventory information is managed. Several Airflow-orchestrated Spark jobs collect the upstream data and transform it into a consistent, normalized representation in Hive tables.

Previous versions of this system (with partial functionality) resided on-premises, in an internally hosted Hadoop deployment. Moving the pipelines to GCP and optimizing the Spark jobs brought a significant performance improvement: processing time dropped by 75%.

Google Cloud Storage is used as the storage layer for the feeds exposed to external partners. Data is stored in a hierarchical structure and partitioned by categoryId, the identifier of the category a given product belongs to. Category-based partitioning allows us to use GCS access control to implement per-partner inventory filtering: only the locations of categories that are enabled for a given partner are exposed to that partner.
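
As a rough illustration of the category-partitioned layout and per-partner filtering, the following sketch lists only the feed objects for the categories a partner is enabled for. The bucket name, object layout, and partner-configuration format are assumptions for this example:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("walmart-feeds")          # illustrative bucket name

# Example partner configuration (hypothetical): categories enabled for this partner.
partner_categories = ["1255027787111", "976759"]

# Feed objects are assumed to be laid out by feed type, date, and categoryId, e.g.
#   snapshot/2020-09-02/<categoryId>/part-00000.json.gz
# so per-partner filtering reduces to listing only the prefixes the partner is enabled for.
for category_id in partner_categories:
    prefix = f"snapshot/2020-09-02/{category_id}/"
    for blob in client.list_blobs(bucket, prefix=prefix):
        print(blob.name)
```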

When a partner makes an API call for a feed, a downloadable Google Cloud Storage signed URL for the feed data is generated based on the partner configuration and returned in the response. Partners download the feed data from that signed URL. This addressed the bandwidth concerns we had seen when downloading feeds from our on-premises cluster. Security of the feeds APIs is handled by enabling the necessary authentication and authorization mechanisms at the API gateway, and by issuing signed URLs for data feeds that expire after a pre-configured time.
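
A minimal sketch of the signed-URL generation, using the google-cloud-storage client library. The bucket name, object path, and expiration below are illustrative, not our production values:

```python
from datetime import timedelta
from google.cloud import storage

def signed_feed_url(bucket_name: str, object_path: str, ttl_minutes: int = 30) -> str:
    """Generate a time-limited (V4) signed URL for a feed object.

    The bucket and object path would be resolved from the partner configuration;
    the TTL here is a placeholder, not the value we actually use.
    """
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_path)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=ttl_minutes),
        method="GET",
    )

# Example: URL returned to a partner enabled for this category (path assumed).
url = signed_feed_url("walmart-feeds", "snapshot/2020-09-02/976759/part-00000.json.gz")
```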

For details, see the reference documentation for Google Cloud Storage signed URLs.

A monitoring and alerting pipeline that runs in Airflow sends Slack/email alerts if any of the DAGs fail.
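
A simplified sketch of how such alerting can be wired into an Airflow DAG, assuming Airflow 2-style imports and a Slack incoming webhook; our actual monitoring framework is more involved:

```python
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder webhook

def notify_slack_on_failure(context):
    """Failure callback: post the failed DAG/task and run date to a Slack channel."""
    ti = context["task_instance"]
    message = f":red_circle: DAG {ti.dag_id}, task {ti.task_id} failed for {context['ds']}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": message})

default_args = {
    "owner": "feeds-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
    "email": ["feeds-oncall@example.com"],            # placeholder address
    "email_on_failure": True,                         # email alert on failure
    "on_failure_callback": notify_slack_on_failure,   # Slack alert on failure
}

with DAG(
    dag_id="snapshot_feed_pipeline",
    start_date=datetime(2020, 9, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    generate_snapshot = BashOperator(
        task_id="generate_snapshot",
        bash_command="echo 'submit the Spark snapshot job here'",  # placeholder step
    )
```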

As part of the data quality check for the snapshots, we do the following:

  • An alert is raised for anomalies in the generated data when the percentage difference exceeds a pre-configured threshold.
  • We also verify, as part of the data quality check, that the feed file has the correct format and schema and that all required values are present (a simplified sketch of these checks follows this list). We are looking to build additional functionality on top of this data quality check pipeline as we observe trends over time.
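
Here is a simplified PySpark sketch of these checks. The schema, the anomaly baseline (the previous day's snapshot), the threshold, and the paths are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("snapshot-feed-dq-check").getOrCreate()

REQUIRED_COLUMNS = ["itemId", "categoryId", "price", "availability"]   # illustrative schema
ANOMALY_THRESHOLD_PCT = 10.0                                           # illustrative threshold

feed = spark.read.json("gs://walmart-feeds/snapshot/2020-09-02/*/")    # path layout assumed

# 1. Schema/format check: every required column must exist in the feed.
missing_columns = [c for c in REQUIRED_COLUMNS if c not in feed.columns]
assert not missing_columns, f"Feed is missing required columns: {missing_columns}"

# 2. Required-value check: count rows with nulls in any required column.
null_counts = feed.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in REQUIRED_COLUMNS]
).first().asDict()

# 3. Anomaly check: compare today's row count against the previous snapshot (baseline assumed).
today_count = feed.count()
yesterday_count = spark.read.json("gs://walmart-feeds/snapshot/2020-09-01/*/").count()
pct_diff = abs(today_count - yesterday_count) / max(yesterday_count, 1) * 100

if any(null_counts.values()) or pct_diff > ANOMALY_THRESHOLD_PCT:
    # In the real pipeline this raises a Slack/email alert; here we simply fail the job.
    raise ValueError(f"Data quality check failed: nulls={null_counts}, pct_diff={pct_diff:.1f}%")
```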

Delta feed API architecture

Delta feeds APIs provide deltas in product pricing-availability information. The downloadable feed file contains hourly dumps, and the dumps are updated every 15 minutes. The high-level architecture looks like the following:

The delta feeds pipeline runs entirely in GCP. We listen to a set of upstream Kafka topics to determine product pricing-availability. Spark Structured Streaming jobs running on a Dataproc cluster consume the Kafka topics and store the transformed data in our base tables in BigQuery. An Airflow DAG containing the deltas pipeline runs every 15 minutes; deltas are derived by running SQL queries on our base tables and snapshot table in BigQuery and written back to our deltas table in BigQuery.
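
A condensed sketch of the streaming leg, assuming the spark-bigquery connector is available on the cluster; the topic, schema, and table names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("price-availability-stream").getOrCreate()

# Illustrative message schema; the real upstream topics carry richer payloads.
schema = (
    StructType()
    .add("itemId", StringType())
    .add("storeId", StringType())
    .add("price", DoubleType())
    .add("availability", StringType())
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")   # placeholder brokers
    .option("subscribe", "price-availability-updates")    # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

events = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")

def write_to_bigquery(batch_df, batch_id):
    """Append each micro-batch to the BigQuery base table via the spark-bigquery connector."""
    (batch_df.write.format("bigquery")
        .option("table", "feeds_dataset.price_availability_base")   # placeholder table
        .option("temporaryGcsBucket", "feeds-tmp-bucket")           # placeholder bucket
        .mode("append")
        .save())

query = (
    events.writeStream
    .foreachBatch(write_to_bigquery)
    .option("checkpointLocation", "gs://feeds-checkpoints/price-availability")
    .start()
)
query.awaitTermination()
```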

Deltas are queried from BigQuery and written to per-partner downloadable feeds every 15 minutes. Deltas are also stored in a hierarchical structure based on categoryId and are queryable per categoryId.
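
A simplified example of how such a delta-derivation step could be expressed against BigQuery; the table and column names, and the change condition, are illustrative assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative query: rows in the base table whose price or availability changed
# relative to the daily snapshot. Tables and columns are placeholders.
DELTA_QUERY = """
INSERT INTO `feeds_dataset.price_availability_deltas` (itemId, storeId, price, availability, updatedAt)
SELECT b.itemId, b.storeId, b.price, b.availability, b.updatedAt
FROM `feeds_dataset.price_availability_base` AS b
LEFT JOIN `feeds_dataset.price_availability_snapshot` AS s
  ON b.itemId = s.itemId AND b.storeId = s.storeId
WHERE b.updatedAt >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 MINUTE)
  AND (s.itemId IS NULL OR b.price != s.price OR b.availability != s.availability)
"""

client.query(DELTA_QUERY).result()   # would run as a step of the 15-minute Airflow DAG
```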

When partners query for deltas, a downloadable Google Cloud Storage signed URL is provided, similar to what was described in the snapshot feed architecture. Partners can also get the deltas at a daily/hourly frequency from a particular date-hour.

Monitoring and alerting ensure that the streaming jobs and the deltas pipeline in Airflow run continuously and that deltas are generated without delays. Slack/email alerts are sent for any failures or delays. A scheduler that monitors the streaming jobs automatically restarts them when they go down.
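
A rough sketch of such a restart monitor using the Dataproc jobs API; the project, region, cluster, labeling scheme, and job file are assumptions for illustration:

```python
from google.cloud import dataproc_v1

PROJECT, REGION, CLUSTER = "my-project", "us-central1", "feeds-dataproc"   # placeholders

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

def streaming_job_is_running(label_value: str) -> bool:
    """Check whether an active Dataproc job carries the given label (labeling scheme assumed)."""
    active_jobs = job_client.list_jobs(
        request={"project_id": PROJECT, "region": REGION, "job_state_matcher": "ACTIVE"}
    )
    return any(job.labels.get("stream") == label_value for job in active_jobs)

def restart_streaming_job(label_value: str):
    """Resubmit the PySpark streaming job; the main file URI is a placeholder."""
    job = {
        "placement": {"cluster_name": CLUSTER},
        "labels": {"stream": label_value},
        "pyspark_job": {"main_python_file_uri": "gs://feeds-jobs/price_availability_stream.py"},
    }
    job_client.submit_job(request={"project_id": PROJECT, "region": REGION, "job": job})

# Scheduler body (run periodically, e.g. from cron or a recurring Airflow task):
if not streaming_job_is_running("price-availability"):
    restart_streaming_job("price-availability")
```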

As part of the data quality check for the deltas, we do the following:

  • A data quality check that runs every 15 minutes compares the deltas we generate against the real-time price-availability API and alerts on any discrepancies (see the sketch after this list). This helps us proactively troubleshoot data discrepancies and apply fixes if needed.
  • We generate a snapshot of the deltas table from BigQuery once a day and compare it with an upstream table, alerting on any discrepancies. The purpose of this is to troubleshoot the discrepancies and apply the mismatches back to the BigQuery deltas table if needed.
  • We also verify, as part of the data quality check, that the deltas feed file has the correct format and schema and that all required values are present. We are looking to build additional functionality on top of this data quality check pipeline as we observe trends over time.
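
A simplified sketch of the 15-minute comparison against the real-time API (the first item above); the endpoint, tables, and field names are assumptions:

```python
import requests
from google.cloud import bigquery

REALTIME_API = "https://developer.api.walmart.com/price-availability"   # illustrative endpoint
bq = bigquery.Client()

# Sample recently generated delta rows (table and columns are placeholders).
rows = bq.query("""
    SELECT itemId, storeId, price, availability
    FROM `feeds_dataset.price_availability_deltas`
    WHERE updatedAt >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 MINUTE)
    LIMIT 100
""").result()

discrepancies = []
for row in rows:
    # Field names in the real-time API response are assumptions for this sketch.
    live = requests.get(f"{REALTIME_API}/{row.itemId}", params={"storeId": row.storeId}).json()
    if live.get("price") != row.price or live.get("availability") != row.availability:
        discrepancies.append((row.itemId, row.storeId))

if discrepancies:
    # In the real pipeline this triggers a Slack/email alert for troubleshooting.
    print(f"{len(discrepancies)} delta rows disagree with the real-time API: {discrepancies[:10]}")
```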

Conclusion

  • A very important part of the feed pipelines is making sure that the feed data provided is as accurate as possible. Having a proper design in place, arrived at by talking to the various upstream teams, helped us achieve this.
  • A reusable framework for monitoring and alerting on the pipelines helped us ensure that the pipelines run without issues.
  • The data quality check layer helped us identify discrepancies in the data we generate as part of the feeds and take the necessary actions.
  • Leveraging GCP for running our data pipelines, along with optimized Spark jobs, also sped up our final snapshot generation, with the whole process completing in under 2 hours.
  • Leveraging GCP for providing downloadable feeds (with Google Cloud Storage signed URLs) helped us address some of the bandwidth concerns seen while downloading feeds from on-premises infrastructure.
  • Careful consideration has been given to the security of the APIs exposed to external partners, by enabling the necessary authentication and authorization mechanisms at the API gateway and by using Google Cloud Storage signed URLs for data feeds that expire after a pre-configured time.
  • Onboarding new partners is made easier by taking an API-based approach to providing feeds.
