Starbucks Offer Analytics using AzureML & Databricks

May 20, 2023

Big Data Analytics with Machine Learning

Introduction

Starbucks provides regular offers to customers who use their mobile app, with promotions ranging from basic drink advertisements to substantial deals such as buy-one-get-one-free or discounts. However, some users may not receive any offers during specific weeks, and not every customer receives the same offer.

Every offer has a specified duration during which it remains valid before expiring. For instance, a BOGO deal may only be available for five days. Although informational offers only provide information about a product, the dataset reveals that they also have a validity period. If an informational offer is valid for seven days, it can be inferred that the customer will be influenced by the offer for seven days after viewing the advertisement.

We will be using a dataset that simulates consumer behavior on the Starbucks Rewards mobile app. This dataset closely resembles the actual app’s activity, but it is a simplified version since the simulator only features one product, while Starbucks offers dozens of products in reality.

The dataset comprises transactional information that shows user purchases made on the app, including the purchase timestamp and the amount spent. This transactional data also contains a record for every offer that a user receives, a record for when a user views the offer, and a record for when a user completes an offer. It is also possible for a user to make a purchase on the app without receiving or viewing any offer.

The task is to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type. In this article we focus only on the BOGO offer: we assess its effectiveness and identify the customers who respond best to it, so that they can be targeted effectively and customer churn can be reduced.

Azure Architecture

The system architecture involves gathering data from various sources such as Starbucks Stores POS, mobile app, databases, and customer surveys. This data is then stored in Azure Data Lake Storage using a data orchestration tool like Azure Data Factory. To handle big data, Hive tables can be created in the Hadoop ecosystem from the stored data. Databricks is then connected to Hive/ADLS for Spark Analytics & Data Transformation, allowing for faster querying of complex data and the discovery of insights using PySpark. The preprocessed and transformed data is fed into Azure Machine Learning for further analysis and answers. Finally, the data insights are visualized using Tableau or Power BI.

Data Orchestration

Data Source
Data Type & Attributes
Ingestion
Storage
Batch Processing

Data Source

Starbucks is one of the world’s largest coffeehouse chains, with over 32,000 stores in more than 80 countries. As a major player, Starbucks generates a substantial amount of data through its point of sales systems, database, and mobile app. This data includes sales and transaction information, offer-related data, and customer demographic information.

Starbucks POS systems process customer orders, track inventory, and manage sales. Every transaction that occurs at a Starbucks store is recorded in the POS system, which captures data such as the time and date of the transaction, the items purchased, the payment method used, and the location of the store. This data is then aggregated and stored in the company’s database, where it can be analyzed and used to inform business decisions.

Starbucks also collects data from its mobile application, which allows customers to place orders, make payments, and earn rewards through the company’s loyalty program. The app collects a wealth of data about customer behavior, such as the items they order, the frequency of their visits, and their payment preferences. The app also collects location data, enabling Starbucks to tailor its marketing efforts to specific regions and understand the geographic distribution of its customer base.

The data also includes customer demographic information, collected through customer surveys that ask about age, gender, income, and other demographic factors, along with location data collected through the mobile app.

By analyzing this data, Starbucks gains insights into customer preferences and behavior and can track the geographic distribution of its customers. This enables the company to make data-driven decisions about product development and to tailor marketing strategies to specific regions.

Starbucks’ data collection and analysis allow the company to better understand its customers and their needs. This enables the company to create personalized experiences, improve products, and increase customer loyalty, ultimately driving sales and growth for the business.

Data Type & Attributes

In our project, we have four files -

portfolio.json — contains characteristics of each offer type, including its offer type, difficulty, and duration. Currently it has 3 offer types — bogo, informational, discount. Count:10 Size:2KB

id (str)                    # offer id
offer_type (str)            # type of offer ie BOGO, discount, informational
difficulty (int)            # minimum required spend to complete an offer
reward (int)                # reward given for completing an offer
duration (int)              # time for offer to be open, in days
channels (list)             # channels used to promote the offer (e.g. email, mobile, web)

profile.json — contains all customer demographic data including gender, age, income, start date of membership. Count:1700 Size:2MB

age (int)                   # age of the customer
became_member_on (int)      # date when customer created an app account
gender (str)                # gender of the customer (note some entries contain 'O' for other rather than M or F)
id (str)                    # customer id
income (float)              # customer income
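
The became_member_on field is stored as an integer, while the later analysis works with membership days. A minimal sketch of that conversion follows. It assumes the integer encodes the date as YYYYMMDD and that the reference date is chosen by the caller; neither is confirmed by the dataset description above, and the function name is hypothetical.

```python
from datetime import date, datetime


def membership_days(became_member_on: int, as_of: date) -> int:
    """Days between the sign-up date and a reference date.

    Assumption: became_member_on is an integer in YYYYMMDD form.
    """
    joined = datetime.strptime(str(became_member_on), "%Y%m%d").date()
    return (as_of - joined).days


# Hypothetical example values, not taken from the dataset
print(membership_days(20200101, date(2020, 1, 31)))
```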

transcript.json —records all customer transactions, offers received, offers viewed, and offers completed, amount and time of transaction. If a customer completed an offer but never actually viewed the offer, then this does not count as a successful offer as the offer could not have changed the outcome. Count:306648 Size:40.2MB

event (str)                 # record description (ie transaction, offer received, offer viewed, etc.)
person (str)                # customer id
time (int)                  # time in hours since start of test. The data begins at time t=0
value (dict)                # either an offer id or transaction amount depending on the record
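
The rule that a completed offer only counts as successful when it was actually viewed can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: the event label "offer completed" is inferred from the description above, the events are assumed to belong to one customer and one offer, and the validity window is assumed to start when the offer is received.

```python
def offer_succeeded(events, duration_days):
    """Return True if the offer was viewed before it was completed, inside its validity window.

    events: list of (event, time_in_hours) tuples for ONE customer and ONE offer.
    duration_days: offer duration from portfolio.json (in days).
    """
    earliest = {}
    for event, hour in events:
        # Keep the earliest occurrence of each event type
        earliest[event] = min(hour, earliest.get(event, hour))

    received = earliest.get("offer received")
    viewed = earliest.get("offer viewed")
    completed = earliest.get("offer completed")
    if received is None or viewed is None or completed is None:
        return False

    # duration is given in days, transcript time is given in hours
    expires = received + duration_days * 24
    return received <= viewed <= completed <= expires


# Hypothetical event sequence: received at t=0, viewed at t=6, completed at t=30
print(offer_succeeded(
    [("offer received", 0), ("offer viewed", 6), ("offer completed", 30)],
    duration_days=5,
))
```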

survey.csv — saved customer survey responses Count:122 Size:27KB

Download data files — Kaggle

Ingestion

Azure Data Factory is a cloud-based data integration service that allows users to create and schedule data pipelines. It supports various data sources and destinations, including Data Lake Storage, making it an efficient and reliable solution for ingesting large amounts of data.

Storage

Azure Data Lake Storage is a cloud-based storage service provided by Microsoft Azure. It is designed to store and manage large amounts of data in a cost-effective and efficient manner. ADLS can handle various types of data such as structured, semi-structured, and unstructured data, making it a versatile tool for big data management.

Batch Processing

Batch processing is a common technique in big data processing: data is collected over a period of time, stored, and then processed in batches or chunks. This approach handles large volumes of data more efficiently because batches can be processed in parallel across multiple machines.

Batch processing also enables organizations to perform complex analytics and machine learning operations on large datasets, while optimizing resource usage and reducing processing costs.
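
The idea of processing data in chunks can be illustrated with a small, single-machine sketch. It is only an analogy: in Spark the partitions of a dataframe play the role of the batches and are processed in parallel, whereas this example walks through them sequentially. The helper names and the "amount" field are assumptions made for the illustration.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def total_spend(records, batch_size=1000):
    """Sum the 'amount' field batch by batch instead of materializing all records at once."""
    total = 0.0
    for batch in batches(records, batch_size):
        total += sum(r["amount"] for r in batch)
    return total


# Hypothetical transaction records
print(total_spend([{"amount": 1.5}, {"amount": 2.5}, {"amount": 1.0}], batch_size=2))
```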

Data Analytics

Databricks & PySpark

Databricks

Databricks is a cloud-based data analytics platform. It is built on top of Apache Spark, an open-source big data processing framework, and provides a range of tools and features for data processing, analytics, and machine learning.

The platform includes tools for data ingestion, ETL (extract, transform, load) operations, and data exploration. It also includes a range of machine learning tools and libraries that enable users to build and train machine learning models on large datasets.

Databricks is highly scalable and provides a distributed computing environment that allows users to process and analyze large amounts of data in parallel across multiple machines, ideal for big data processing tasks.

PySpark

PySpark is the Python API for Apache Spark, an open-source big data processing framework. PySpark provides an interface for Python programmers to use Spark’s distributed computing engine to process and analyze large datasets efficiently and at scale.

Notebooks

There are three notebooks used -

Starbucks_Preprocessing: data transformation and preprocessing, to make the data suitable for finding insights and for machine learning.

Starbucks_Analytics: all data analytics performed on the dataset.

Starbucks_MachineLearning: Azure Machine Learning and model training.

Preprocessing

Mount data from Azure Data Lake Storage

The data files stored in Azure Data Lake Storage can be mounted to the Databricks File System (DBFS) by following the Azure Databricks documentation.

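# OAuth configuration for the service principal.
# appId, clientSecret, tenant, container_name and storage_account_name
# must be defined earlier in the notebook (ideally read from a secret scope).
configs = {"fs.azure.account.auth.type": "OAuth",
       "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
       "fs.azure.account.oauth2.client.id": appId,
       "fs.azure.account.oauth2.client.secret": clientSecret,
       "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant}/oauth2/token",
       "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

# Mount the container only if it is not mounted already,
# instead of silently swallowing every error
mount_point = "/mnt/starbucksdata"
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(source = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net",
                    mount_point = mount_point,
                    extra_configs = configs)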

Read data into Spark dataframe

# Read data into Spark dataframe
sdf_portfolio = spark.read.json("/mnt/starbucksdata/portfolio.json")
sdf_profile = spark.read.json("/mnt/starbucksdata/profile.json")
sdf_transaction = spark.read.json("/mnt/starbucksdata/transcript.json")

Data Transformation

Next, the raw data ingested from the Starbucks data sources has to be transformed so that it can yield useful insights and support further processing and analysis. Spark provides a wide range of built-in functions and libraries for performing such transformations on large datasets.

portfolio

This provides an overview of the offer portfolio, including information such as the unique offer ID, type of offer, reward upon completion, offer validity, minimum spending requirement to complete the offer, and the various channels through which the offer is promoted.

profile

This dataset contains customer profile data, including a unique customer ID and demographic information. It also includes details such as the days of membership, the customer’s income range, the total number of transactions made, the total amount spent at Starbucks to date, the amount of rewards earned, and the time taken in hours to complete an offer after viewing it. Additionally, there are some transformed range columns added for convenience of further analytics and visualizations.

transaction

This dataset includes transaction details for customers as well as information about their offer events. It contains data such as customer ID, offer ID, event type (whether the offer was received, viewed, or completed), details of the current transaction (transaction amount), time elapsed since the start of the campaign, and the rewards awarded.
Note:
- The transaction amount is provided only for transaction events.
- The reward amount is provided only for offer completion events.

transaction_offer_time

This dataframe is designed to combine transaction and offer details into a single dataset.

trans_cust_offer

This dataframe merges transaction, offer, and customer details into a single dataset, allowing for comprehensive analysis and insights across all three aspects.
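
Conceptually, this merge is a left join of each event with its offer and its customer. The sketch below shows the idea on plain Python dictionaries; the notebooks presumably do this with Spark dataframe joins, and the function name and the selected columns are assumptions made for the example.

```python
def join_events(events, offers_by_id, customers_by_id):
    """Left-join each event row with selected offer and customer attributes.

    Events without a matching offer or customer keep None in the added columns.
    """
    joined = []
    for event in events:
        offer = offers_by_id.get(event.get("offer_id"), {})
        customer = customers_by_id.get(event["person"], {})
        row = dict(event)
        row["offer_type"] = offer.get("offer_type")
        row["duration"] = offer.get("duration")
        row["age"] = customer.get("age")
        row["gender"] = customer.get("gender")
        row["income"] = customer.get("income")
        joined.append(row)
    return joined


# Hypothetical lookup tables and events
offers = {"o1": {"id": "o1", "offer_type": "bogo", "duration": 5}}
customers = {"p1": {"id": "p1", "age": 52, "gender": "F", "income": 60000.0}}
events = [{"person": "p1", "event": "offer viewed", "offer_id": "o1"}]
print(join_events(events, offers, customers))
```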

Transformed-Transaction-Customer-Offer Dataframe I

Transformed-Transaction-Customer-Offer Dataframe II

Transformed-Transaction-Customer-Offer Dataframe III

trans_each_cust_offer

Grouped by each customer, the data is transformed to gain specific insights for each individual.
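
A per-customer aggregation of this kind might look like the following plain-Python sketch. It assumes flat event rows with person, event, amount, and reward fields, and it computes only three of the per-customer measures mentioned earlier (number of transactions, total amount spent, rewards earned); the exact columns of the notebook's dataframe are not shown in the article.

```python
from collections import defaultdict


def summarize_by_customer(events):
    """Aggregate flat event rows into one summary row per customer."""
    summary = defaultdict(
        lambda: {"transactions": 0, "amount_spent": 0.0, "rewards_earned": 0}
    )
    for e in events:
        row = summary[e["person"]]
        if e["event"] == "transaction":
            row["transactions"] += 1
            row["amount_spent"] += e["amount"]
        elif e["event"] == "offer completed":
            row["rewards_earned"] += e["reward"]
    return dict(summary)


# Hypothetical event rows
events = [
    {"person": "p1", "event": "transaction", "amount": 10.0},
    {"person": "p1", "event": "offer completed", "reward": 5},
]
print(summarize_by_customer(events))
```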

Transformed-Transaction-Each-Customer-Offer Dataframe

trans_cust_offer_succ

This displays the information about offer events, such as the duration between the start of the campaign and the event. The variable time_lapsed_succ represents the time difference between when the offer was viewed and when it was completed. If the offer was neither viewed nor completed, this variable will be null.
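
The definition of time_lapsed_succ can be written down as a small function. This is a sketch of the definition only. Returning None when the completion precedes the view is an added assumption, since the description above covers only the missing-event case.

```python
def time_lapsed_succ(viewed_hour, completed_hour):
    """Hours between viewing and completing an offer.

    Returns None when the offer was not viewed or not completed.
    Assumption: a completion that happens before the view also yields None.
    """
    if viewed_hour is None or completed_hour is None:
        return None
    if completed_hour < viewed_hour:
        return None
    return completed_hour - viewed_hour


# Hypothetical values: viewed at hour 6, completed at hour 66
print(time_lapsed_succ(6, 66))
```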

Transformed-Customer-Offer-Success Dataframe

bogo_offer_succ_rate

To demonstrate the success rate of buy-one-get-one (BOGO) offers, the dataframe is filtered to include only the BOGO offers mentioned above.
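
One way to compute such a success rate is to filter for BOGO rows and divide the successful ones by the total. The sketch below assumes one row per received offer with an offer_type field and a boolean successful flag; the article does not state the exact formula, so this definition is an assumption.

```python
def bogo_success_rate(rows):
    """Share of received BOGO offers that were both viewed and completed.

    rows: one dict per received offer, with 'offer_type' and a boolean 'successful'.
    Returns None when there are no BOGO rows.
    """
    bogo = [r for r in rows if r["offer_type"] == "bogo"]
    if not bogo:
        return None
    return sum(1 for r in bogo if r["successful"]) / len(bogo)


# Hypothetical rows
rows = [
    {"offer_type": "bogo", "successful": True},
    {"offer_type": "bogo", "successful": False},
    {"offer_type": "discount", "successful": True},
]
print(bogo_success_rate(rows))
```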

Transformed Starbucks-BOGO-Offer-Success-Rate Dataframe

Analytical Insights

Offer portfolio

offer validity histogram

Most of the offers have validity greater than 5 days

offer/channel count

BOGO and informational offers are promoted more through email/mobile
The discount offer is promoted more through web/email
Starbucks mostly promotes BOGO offers (as per this dataset)

offer/event count

First, the customer receives an offer.
Second, the customer views that offer, or does not view it.
Third, the customer makes a transaction of a certain amount, with or without being aware of the offer.
Fourth, the system marks the offer as completed and awards the reward if the transaction amount meets the offer's difficulty (minimum spend).
Because customers can also purchase without any offer, there are more transactions than offers received, viewed, or completed, which is as expected.
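
The counts behind this kind of chart can be obtained by tallying the event column. A minimal sketch, assuming flat event rows with an event field; the sample rows are made up for the illustration.

```python
from collections import Counter


def event_counts(events):
    """Count how often each event type occurs."""
    return Counter(e["event"] for e in events)


# Hypothetical rows: one offer lifecycle plus an extra purchase without an offer
events = [
    {"event": "offer received"},
    {"event": "offer viewed"},
    {"event": "transaction"},
    {"event": "offer completed"},
    {"event": "transaction"},
]
for name, count in event_counts(events).most_common():
    print(name, count)
```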

offer/event/gender count

Female — offer success rate 37.25%
Male — offer success rate 50.56%
Others — offer success rate 15%

offer/customer_profile

The number of offers received, viewed, and completed generally increases with age up to a certain point and then starts to decline. The age ranges of (50, 55] and (55, 60] have the highest numbers across all three categories, indicating a higher level of engagement with offers.
The conversion rate from offer viewed to offer completed varies across age ranges. While the number of offers viewed generally increases with age, the completion rate shows some variation. Age ranges (45, 50] and (50, 55] have relatively higher completion rates compared to other age ranges, indicating better conversion of viewed offers into completed ones.
Age ranges (40, 45] and (45, 50] stand out with high numbers in all three categories. These age groups might be particularly receptive to offers, presenting potential targeting opportunities for marketing campaigns.
Youth Engagement — the age range (15, 20] has a relatively high number of offers received and viewed, but a lower number of offers completed compared to other age ranges. This suggests that while the youth show interest in offers, their conversion rate into completed offers is relatively lower.
Older Age Engagement — the graph shows a decline in offer engagement as age increases beyond 60. Age ranges (60, 65], (65, 70], and (70, 75] have lower numbers in all three categories, indicating potentially decreased interest or engagement with offers among older age groups.
The number of offers received, viewed, and completed generally decreases as membership ranges increase. The highest numbers are seen in the lowest membership range of (1700, 2200], indicating higher offer engagement for members in that range.
The number of offers received, viewed, and completed generally varies across income ranges. The highest numbers are observed in the income range of (50000, 70000], indicating higher offer engagement for individuals within that income bracket.
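
The range labels used above, such as (50, 55] or (1700, 2200], are right-closed intervals. A small helper that assigns a value to such a bin could look like this. The bin widths and starting points in the example are inferred from the labels in the text, and the function name is hypothetical.

```python
import math


def range_label(value, width, start=0):
    """Return the right-closed interval (low, high] of the given width that contains value.

    Bins are aligned so that `start` is a bin edge.
    """
    high = start + math.ceil((value - start) / width) * width
    low = high - width
    return f"({low}, {high}]"


print(range_label(55, 5))                  # age bin
print(range_label(2000, 500, start=1700))  # membership-days bin
```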

Customer profile

income/membership_days histogram

Income and membership days histogram, explained later in this article

age count

The primary age demographic of customers at Starbucks typically falls between 49 and 67 years old

income_range count

Starbucks is primarily favored by customers whose income falls within the range of 50,000 to 70,000 dollars

membership_days_range count

The majority of customers at Starbucks have membership durations ranging from 1700 to 2700 days (approximately 4.7 to 7.4 years) with more male customers

gender count

The customer database at Starbucks is primarily dominated by male customers.

Transactions

transactions/offer histogram

age_range/transaction_amount

The age group of 50 to 60 years tends to be the primary demographic of customers who spend more on coffee at Starbucks.

membership_days_range/transaction_amount

The majority of customers at Starbucks have membership durations ranging from 1700 to 2700 days (approximately 4.7 to 7.4 years)

income_range/transaction_amount

Starbucks tends to attract customers with incomes ranging from $50,000 to $90,000, who tend to spend more at the coffee chain

gender/transactions

Male and female customers spend almost equally on coffee
Females spend more on each transaction compared to males
Others are consistent and balanced

Machine Learning

Azure Machine Learning

Azure Machine Learning is a cloud-based platform for building, training, and deploying machine learning models. It provides a range of tools and services for data scientists and developers to collaborate on building and managing machine learning workflows. The platform includes features such as automated machine learning, model interpretation, and monitoring, and it supports a range of popular machine learning frameworks and languages. Azure Machine Learning also integrates with other Azure services such as Azure Data Factory and Azure Databricks for end-to-end data analytics and machine learning workflows.

Objective

As stated in the introduction, the objective is to combine transaction, demographic, and offer data, focusing on the BOGO offer: we assess its effectiveness and identify the customers who respond best to it, so that they can be targeted effectively and customer churn can be reduced.

Machine Learning Model (unsupervised)

In this project, I used the K-Means clustering unsupervised ML model to segregate customers into two groups: a) customers who successfully viewed and completed the offer upon receipt, and b) customers who received and viewed the offer but did not complete it successfully and need further attention.

K-Means Clustering

K-Means is an unsupervised machine learning algorithm used for clustering and pattern recognition. It partitions a dataset into K clusters based on the similarity between data points, where K is the number of clusters. The algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, until the centroids no longer move significantly. K-Means is commonly used in data mining, image processing, and natural language processing.
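
The iterative procedure can be sketched in a few lines of pure Python. This is an illustration of the algorithm (Lloyd's iteration with random initial centroids), not the model used in the notebooks, which would more likely rely on a library implementation; the sample points are made up.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric tuples into k groups with Lloyd's algorithm.

    Returns (labels, centroids).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-feature points forming two obvious groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

With K = 2, as in this project, the two resulting labels would correspond to the two customer groups described above, although which label maps to which group has to be determined by inspecting the clusters.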

Once the customers have been clustered into groups based on their offer success, we can proceed to train different supervised classification predictive machine learning models. This will allow us to extract further insights from the data and make predictions based on the models’ outputs.

Data Visualizations

Tableau & Reports

This Tableau dashboard provides basic visualizations for quick analysis. In the future, more advanced and interactive dashboards can be created to further enhance the insights.

BOGO Insights

Customers who viewed and successfully completed the BOGO offer:

Age range: 48–70
Offers received: 1–3
Total transactions made: 5–12
Total amount spent: 80–200 dollars
Rewards earned: 10–21
Time lapsed to success: 6–100 hours, mostly around 60 hours
Income: 55,000–90,000 dollars
Membership: 2000–2500 days (about 5.5–7 years)

Based on the information obtained, we now have an understanding of which customers are successfully completing the BOGO offers and who require further engagement. For customers who do not fall within the mentioned ranges of attributes, it is necessary to consider sending them more BOGO offers or devising improved business strategies to encourage their participation.

Bogo offer success rate
Currently, we have around 50% BOGO offer success rate as per data collected

Full Code at GitHub

You can get the full code in my GitHub repository.

GitHub - shuv50/Starbucks-Data-Analytics

Conclusion

Using a simulated dataset that closely resembles the Starbucks Rewards mobile app, we were able to analyze consumer behavior and identify which demographic groups respond best to the BOGO offer.

By combining transaction, demographic, and offer data, we have successfully assessed the effectiveness of this offer and identified the customers who are most likely to respond to it. This information can be used to develop targeted marketing strategies, reduce customer churn rate, and increase sales.

However, our analysis covers only the BOGO offer. Further research is needed to determine which offer types are most effective for different demographic groups. As a next step, supervised classification models can be trained on the clustered customers to extract further insights from the data and make predictions based on the models’ outputs.