Deploying Apache Pinot at a Large Retail Chain

Our real-time analytics journey to production

Unbiased Modeler
Apache Pinot Developer Blog
6 min read · Apr 27, 2021


This story was anonymously contributed by a member of the Apache Pinot community. Certain specifics, such as architecture components, have been obscured or made generic due to corporate legal requirements. Many thanks to the author and his development team for writing this user story about deploying Pinot at one of the world’s largest retail chains.
— Pinot Editorial Team

I work for a large retail chain that operates tens of thousands of retail stores worldwide, with thousands in North America. Across our retail stores, we experience a few million unique store visits from consumers every week. We also experience a large influx of activity through our digital platforms, including online and mobile. We have always strived to be at the forefront of innovation in the retail industry. Today, a key part of that innovation comes through an increased digital presence, from contactless payments to automatic coupon delivery through mobile applications and more. The ability to crunch this digital data is vital to get timely insights into inventory optimization, promo optimization, customer-base growth, and customer satisfaction.

I am part of the Data Science and Machine Learning Team. A key function of my team is to help the various business groups make informed decisions based on real and timely data. In this blog post, I will give an overview of our use of Apache Pinot to solve some of our biggest challenges around Data Analytics.

Data analytics challenge

We reach our consumers via a number of channels; millions of people visit our stores every week. A growing number of customers are now using our loyalty program as well as our delivery app. Every transaction and update that is generated by our customers from across these channels is funneled into our data pipeline. There is extremely valuable information in this data that we want to analyze and utilize to improve the experience for our consumers and our franchise owners.

Producing timely data insights for franchise owners and consumers at a large retail chain

Some of the key stakeholders that need timely access to this data are data analysts and scientists, product and marketing managers, operations teams, and store owners — in order to improve campaigns, programs, systems, designs, customer satisfaction, and more.

We were looking for a unified data analytics system that could help us build all of these use cases and more. The system needed to have the following capabilities in order to meet our current and future needs:

  1. Ingest billions of transactions/events from both real-time and offline data sources.
  2. Serve speed-of-thought analytics to hundreds or even thousands of simultaneous users on high-dimensionality data.
  3. Simplify our data pipelines by reducing or eliminating costly pre-computation and pre-aggregation operations, with the eventual goal of enabling real-time analytics for our stakeholders.

Our Chosen Solution: Apache Pinot

With these requirements in mind, we picked Apache Pinot as the unified analytics platform for building such mission-critical applications. At its core, Apache Pinot is a production-ready, distributed analytical database. It has already proven its ability to serve hundreds of millions of users at LinkedIn, and it also powers global operational and financial intelligence for UberEats.

Program metrics and trends

Getting timely and accurate business metrics is crucial for the successful operation of our ongoing campaigns and promotional programs across stores, mobile orders, and delivery channels. Examples of such metrics include sales and user growth numbers categorized across different geographical regions, times of the year, retail channels, and so on. Business analysts and executives use these metrics to determine the success of programs and derive insights for future campaigns. These metrics are computed from the raw transaction data collected across all of our retail channels, which amounts to tens of billions of rows of data.

Issues with our legacy data pipeline

The legacy data pipeline setup for this application had a significant amount of operational overhead. Since most existing BI tools cannot handle arbitrary slicing and dicing on the aggregate multi-billion row table in a timely manner, we had to create custom views for each business metric use case. The data flow is broadly as shown:

Legacy Data Pipeline (Before Pinot)

Our ETL (Extract, Transform and Load) job ran every 24 hours and ingested all of the raw data into the data lake. In the next stage, custom Spark jobs, one per use case, filtered the raw data by the required product category and restricted the output to manageable datasets containing only a few weeks of data, which were then imported into our BI tool. Business users would then build charts and graphs on top of these custom views for further analysis.
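To make the per-use-case filtering stage concrete, here is a minimal sketch of what each custom view amounted to. This is plain Python for illustration (our actual jobs ran on Spark), and the column names and time window are placeholders:

```python
from datetime import date, timedelta

def build_custom_view(raw_rows, category, weeks=4, today=None):
    """Filter raw transactions down to one product category and a recent
    time window -- the 'custom view' that was fed into the BI tool."""
    today = today or date.today()
    cutoff = today - timedelta(weeks=weeks)
    return [
        row for row in raw_rows
        if row["category"] == category and row["txn_date"] >= cutoff
    ]
```

Every new business question meant writing and scheduling another job like this, which is exactly the overhead we wanted to eliminate.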

This pipeline had a lot of limitations:

  1. The views had to be predefined, so business users were limited in the analysis they could perform and the questions they could get answered.
  2. Maintaining SLAs for these custom views was a huge operational overhead which required constant attention to the data pipelines.
  3. Any new question or use case required the business users to rely on my team to produce a new view. This could take anywhere from hours to days and impacted the timeliness of their analysis.

Pinot to the rescue

We built a new pipeline that removed a lot of these moving parts using Apache Pinot as shown below.

Data Pipeline with Apache Pinot (after one-time historical load)

We set up a Pinot table for ingesting all of the transaction data. This was done in 2 steps:

  1. A one-time ingestion of the historical transaction dataset.
  2. A scheduled job that bulk-loads new transaction data into Pinot daily.
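The scheduled daily load in step 2 can be expressed as a Pinot batch ingestion job spec. The sketch below shows the general shape of such a spec; the directory URIs, file format, table name, and controller address are placeholders, not our actual environment:

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data/transactions/daily/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: '/data/pinot-segments/transactions/'
overwriteOutput: true
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'transactions'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```

A job like this builds Pinot segments from each day's new files and pushes them to the cluster, so the table stays current without any per-use-case processing.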

In addition, we introduced Streamlit to build dashboards to allow business users to interact with Pinot. Various business users can now send their analytical queries directly to Pinot using SQL, which in turn performs aggregations on the fly at a very low latency. This helped solve all of our outstanding issues as described below:

  1. Business users now have access to the entire dataset (including historical data) and hence, are not limited to any static pre-cubed views. They can arbitrarily slice and dice this data as desired.
  2. The engineering team does not have to create any more daily views for the individual lines of business. No filtering or pre-aggregation is necessary in this case, thus drastically reducing our operational overhead.
  3. New use cases can be onboarded instantly since the computations are performed on the fly.

This new data pipeline has been fully deployed to production and we’re already seeing dramatic usability improvements from all our end users.

Single source of truth

In addition to resolving the operational overhead, the introduction of Streamlit on top of Apache Pinot enabled a great interactive experience for our users. Business analysts can now write arbitrary scripts and rapidly slice and dice the underlying data — which wasn’t possible in our legacy architecture.

Adopting Pinot also improved the accuracy of metrics. Since Pinot is now the single source of truth across all lines of business (as opposed to precomputed views), there is a significantly reduced scope for any discrepancy in metrics computed and presented across the organization. All the lines of business now see the same data and metrics.

In a single stroke, we solved several pain points in our data pipeline with the introduction of Apache Pinot as the unified storage and query engine.

The road ahead

As we look to onboard more applications into the Apache Pinot platform, I realize that we have just scratched the surface of Pinot’s capabilities. One use case that I am looking forward to building is an application that will allow us to do “Cohort Analysis” to break down user purchasing behavior and the ROI of the various programs being deployed. Given the performance we see from our existing set of Pinot applications, I’m eagerly looking forward to seeing business users adopt these new applications as we onboard the new use cases.

Acknowledgements

I’d like to thank the developers, product managers, data managers, and the Apache Pinot community that have made this project possible.

References

https://docs.pinot.apache.org
https://github.com/apache/incubator-pinot
https://eng.uber.com/operating-apache-pinot/

Apache Pinot Editorial Team
This blog post was edited by Kenny Bastani, Chinmay Soman, Kishore Gopalakrishna, and Uday Vallamsetty. You can reach out to us here with questions or connect with us by joining our Apache Pinot Slack community.
