Starbucks Offer Analytics using AzureML & Databricks
Big Data Analytics with Machine Learning
Introduction
Starbucks provides regular offers to customers who use their mobile app, with promotions ranging from basic drink advertisements to substantial deals such as buy-one-get-one-free or discounts. However, some users may not receive any offers during specific weeks, and not every customer receives the same offer.
Every offer has a specified duration during which it remains valid before expiring. For instance, a BOGO deal may only be available for five days. Although informational offers only provide information about a product, the dataset reveals that they also have a validity period. If an informational offer is valid for seven days, it can be inferred that the customer will be influenced by the offer for seven days after viewing the advertisement.
We will be using a dataset that simulates consumer behavior on the Starbucks Rewards mobile app. This dataset closely resembles the actual app’s activity, but it is a simplified version since the simulator only features one product, while Starbucks offers dozens of products in reality.
The dataset comprises transactional information that shows user purchases made on the app, including the purchase timestamp and amount spent. This transactional data also contains a record for every offer that a user receives, as well as a record for when a user views the offer. Additionally, records are present for when a user completes an offer. It is also possible for a user to make a purchase on the app without receiving or viewing any offer.
The task is to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type. Here, we are just concerned with the bogo offer, and we will assess the effectiveness of this offer and identify the customers who respond best to it, enabling us to target them effectively and reduce customer churn rate.
Azure Architecture
The system architecture involves gathering data from various sources such as Starbucks Stores POS, mobile app, databases, and customer surveys. This data is then stored in Azure Data Lake Storage using a data orchestration tool like Azure Data Factory. To handle big data, Hive tables can be created in the Hadoop ecosystem from the stored data. Databricks is then connected to Hive/ADLS for Spark Analytics & Data Transformation, allowing for faster querying of complex data and the discovery of insights using PySpark. The preprocessed and transformed data is fed into Azure Machine Learning for further analysis and answers. Finally, the data insights are visualized using Tableau or Power BI.
Data Orchestration
Data Source
Data Type & Attributes
Ingestion
Storage
Batch Processing
Data Source
Starbucks is one of the world’s largest coffeehouse chains, with over 32,000 stores in more than 80 countries. As a major player, Starbucks generates a substantial amount of data through its point of sales systems, database, and mobile app. This data includes sales and transaction information, offer-related data, and customer demographic information.
Starbucks POS systems process customer orders, track inventory, and manage sales. Every transaction that occurs at a Starbucks store is recorded in the POS system, which captures data such as the time and date of the transaction, the items purchased, the payment method used, and the location of the store. This data is then aggregated and stored in the company’s database, where it can be analyzed and used to inform business decisions.
Starbucks also collects data from its mobile application, which allows customers to place orders, make payments, and earn rewards through the company’s loyalty program. The app collects a wealth of data about customer behavior, such as the items they order, the frequency of their visits, and their payment preferences. The app also collects location data, enabling Starbucks to tailor its marketing efforts to specific regions and understand the geographic distribution of its customer base.
The data also includes `customer demographic data`, collected through customer surveys that ask about age, gender, income, and other demographic factors, along with location data from the mobile app.
By analyzing this data, Starbucks gains insights into customer preferences and behavior and tracks the geographic distribution of its customers. This enables the company to make data-driven decisions about product development and to tailor marketing strategies to specific regions.
Starbucks’ data collection and analysis allow the company to better understand its customers and their needs. This enables the company to create personalized experiences, improve products, and increase customer loyalty, ultimately driving sales and growth for the business.
Data Type & Attributes
In our project, we have four files:
portfolio.json: contains characteristics of each offer, including its offer type, difficulty, and duration. Currently it has 3 offer types: bogo, informational, discount. Count: 10, Size: 2KB
id (str) # offer id
offer_type (str) # type of offer ie BOGO, discount, informational
difficulty (int) # minimum required spend to complete an offer
reward (int) # reward given for completing an offer
duration (int) # time for offer to be open, in days
channels (list) # channels through which the offer is promoted (e.g. web, email, mobile)
profile.json: contains all customer demographic data including gender, age, income, start date of membership. Count: 1700, Size: 2MB
age (int) # age of the customer
became_member_on (int) # date when customer created an app account
gender (str) # gender of the customer (note some entries contain 'O' for other rather than M or F)
id (str) # customer id
income (float) # customer income
transcript.json: records all customer transactions, offers received, offers viewed, and offers completed, with the amount and time of each transaction. If a customer completed an offer but never actually viewed it, this does not count as a successful offer, because the offer could not have influenced the outcome. Count: 306648, Size: 40.2MB
event (str) # record description (i.e. transaction, offer received, offer viewed, etc.)
person (str) # customer id
time (int) # time in hours since start of test. The data begins at time t=0
value (dict) # either an offer id or transaction amount depending on the record
survey.csv: saved customer survey responses. Count: 122, Size: 27KB
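Because the value field of transcript.json holds either an offer id or a transaction amount, it usually needs to be flattened before analysis. Below is a minimal plain-Python sketch of that step, not the notebook's actual PySpark code. The sample records are made up, and the key names inside value ("offer id", "offer_id", "amount", "reward") are assumptions about the file layout.

```python
import json

# Hypothetical sample lines, one JSON object per line.
# The keys inside "value" are assumptions, not confirmed by the article.
SAMPLE = """\
{"person": "c1", "event": "offer received", "time": 0, "value": {"offer id": "o1"}}
{"person": "c1", "event": "offer viewed", "time": 6, "value": {"offer id": "o1"}}
{"person": "c1", "event": "transaction", "time": 12, "value": {"amount": 10.5}}
{"person": "c1", "event": "offer completed", "time": 12, "value": {"offer_id": "o1", "reward": 5}}
"""


def flatten_record(record):
    """Turn the nested `value` dict into flat offer_id / amount / reward columns."""
    value = record.get("value", {})
    return {
        "person": record["person"],
        "event": record["event"],
        "time": record["time"],
        # Accept both spellings of the offer id key (assumed variants).
        "offer_id": value.get("offer id", value.get("offer_id")),
        "amount": value.get("amount"),   # present only for transactions
        "reward": value.get("reward"),   # present only for completed offers
    }


rows = [flatten_record(json.loads(line)) for line in SAMPLE.splitlines() if line.strip()]
for row in rows:
    print(row)
```

After flattening, every record has the same columns, with None where a field does not apply to that event type.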
Download data files — Kaggle
Ingestion
Azure Data Factory is a cloud-based data integration service that allows users to create and schedule data pipelines. Data Factory supports various data sources and destinations, including Data Lake Storage, making it an efficient and reliable solution for ingesting large amounts of data.
Storage
Azure Data Lake Storage is a cloud-based storage service provided by Microsoft Azure. It is designed to store and manage large amounts of data in a cost-effective and efficient manner. ADLS can handle various types of data such as structured, semi-structured, and unstructured data, making it a versatile tool for big data management.
Batch Processing
Batch processing is a common technique used in big data processing to process large volumes of data in batches or chunks. Batch processing involves collecting a large amount of data over a period of time, storing it, and then processing it in batches. This approach processes large volumes of data more efficiently, as it allows data to be processed in parallel across multiple machines.
Batch processing also enables organizations to perform complex analytics and machine learning operations on large datasets. It helps optimize resource usage and reduces the cost of processing data.
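The idea of working through data in chunks can be illustrated with a small plain-Python sketch. It is only an illustration of the concept: it runs sequentially on one machine, whereas Spark distributes the batches across a cluster. The helper names batched and total_amount are hypothetical.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most `batch_size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def total_amount(transactions, batch_size=3):
    """Sum transaction amounts one batch at a time instead of all at once."""
    total = 0.0
    for batch in batched(transactions, batch_size):
        total += sum(batch)
    return total


print(list(batched(range(7), 3)))
print(total_amount([1.0, 2.0, 3.0, 4.0]))
```

Only one batch is held in memory at a time, which is what makes this pattern workable for data that does not fit on a single machine.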
Data Analytics
Databricks & PySpark
Databricks
Databricks is a cloud-based data analytics platform. It is built on top of Apache Spark, an open-source big data processing framework, and provides a range of tools and features for data processing, analytics, and machine learning.
The platform includes tools for data ingestion, ETL (extract, transform, load) operations, and data exploration. It also includes a range of machine learning tools and libraries that enable users to build and train machine learning models on large datasets.
Databricks is highly scalable and provides a distributed computing environment that allows users to process and analyze large amounts of data in parallel across multiple machines, ideal for big data processing tasks.
PySpark
PySpark is the Python API for Apache Spark, an open-source big data processing framework. PySpark provides an interface for Python programmers to use Spark’s distributed computing engine to process and analyze large datasets efficiently and at scale.
Notebooks
There are three notebooks used:
Starbucks_Preprocessing: includes data transformation and preprocessing, to make the data suitable for finding insights and for machine learning.
Starbucks_Analytics: covers all data analytics performed on the dataset.
Starbucks_MachineLearning: used for Azure Machine Learning and model training.
Preprocessing
Mount data from Azure Data Lake Storage
You can mount all data files present in Azure Data Lake Storage to Databricks file system following Microsoft Azure Databricks documentation.
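# Assumes appId, clientSecret, tenant, container_name and storage_account_name
# are already defined (for example, read from a Databricks secret scope).
mount_point = "/mnt/starbucksdata"

# OAuth configuration for the service principal
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": appId,
    "fs.azure.account.oauth2.client.secret": clientSecret,
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + tenant + "/oauth2/token",
    "fs.azure.createRemoteFileSystemDuringInitialization": "true",
}

# Mount the container only if it is not mounted yet, instead of hiding all errors
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source="abfss://" + container_name + "@" + storage_account_name + ".dfs.core.windows.net",
        mount_point=mount_point,
        extra_configs=configs,
    )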
Read data into Spark dataframe
# Read data into Spark dataframe
sdf_portfolio = spark.read.json("/mnt/starbucksdata/portfolio.json")
sdf_profile = spark.read.json("/mnt/starbucksdata/profile.json")
sdf_transaction = spark.read.json("/mnt/starbucksdata/transcript.json")
Data Transformation
Next, we transform the raw data ingested from the Starbucks data sources so that it is ready for analysis and further processing. Spark provides a wide range of built-in functions and libraries for transforming large datasets.
portfolio
This provides an overview of the offer portfolio, including information such as the unique offer ID, type of offer, reward upon completion, offer validity, minimum spending requirement to complete the offer, and the various channels through which the offer is promoted.
profile
This dataset contains customer profile data, including a unique customer ID and demographic information. It also includes details such as the days of membership, the customer’s income range, the total number of transactions made, the total amount spent at Starbucks to date, the amount of rewards earned, and the time taken in hours to complete an offer after viewing it. Additionally, there are some transformed range columns added for convenience of further analytics and visualizations.
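The derived range columns and the days of membership can be sketched in plain Python as follows. This is an illustrative sketch rather than the notebook's PySpark code: the bin widths and offsets are assumptions chosen to reproduce labels such as (50, 55] and (50000, 70000] that appear in the charts, and became_member_on is assumed to be an integer in YYYYMMDD form.

```python
import math
from datetime import date, datetime


def range_label(value, width, start=0):
    """Right-closed bin label; bins are (start + k*width, start + (k+1)*width]."""
    steps = math.ceil((value - start) / width)
    upper = start + steps * width
    return f"({upper - width}, {upper}]"


def membership_days(became_member_on, as_of):
    """Days between the join date (assumed YYYYMMDD integer) and `as_of`."""
    joined = datetime.strptime(str(became_member_on), "%Y%m%d").date()
    return (as_of - joined).days


print(range_label(52, 5))                      # age range
print(range_label(64000, 20000, start=30000))  # income range
print(membership_days(20170101, date(2017, 1, 31)))
```

The bins are right-closed, so an age of exactly 55 falls in (50, 55] rather than (55, 60], matching the interval notation used in the insights.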
transaction
This dataset includes transaction details for customers as well as information about their offer events. It contains data such as customer ID, offer ID, event type (whether the offer was received, viewed, or completed), details of the current transaction (transaction amount), time elapsed since the start of the campaign, and the rewards awarded.
Note:
- The transaction amount is provided only for transaction events.
- The reward amount is provided only for offer completion events.
transaction_offer_time
This dataframe is designed to combine transaction and offer details into a single dataset.
trans_cust_offer
This dataframe is created to merge transaction, offer, and customer details together into a single dataset, allowing for comprehensive analysis and insights across all three aspects.
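Conceptually, this merge is a pair of left joins: each event is enriched with the matching offer attributes and the matching customer attributes. Here is a small plain-Python sketch of that idea using dictionary lookups. The sample rows are made up for illustration, and in the notebook the same result would be produced with Spark joins.

```python
# Hypothetical lookup tables keyed by offer id and customer id.
offers = {"o1": {"offer_type": "bogo", "difficulty": 10, "duration": 5}}
customers = {"c1": {"age": 52, "gender": "F", "income": 64000.0}}

# Hypothetical flattened events (offer_id is None for plain transactions).
events = [
    {"person": "c1", "event": "offer received", "time": 0, "offer_id": "o1"},
    {"person": "c1", "event": "transaction", "time": 12, "offer_id": None},
]


def enrich(event):
    """Left-join one event with its offer and customer attributes."""
    merged = dict(event)
    merged.update(offers.get(event["offer_id"], {}))   # no match -> nothing added
    merged.update(customers.get(event["person"], {}))
    return merged


trans_cust_offer = [enrich(e) for e in events]
for row in trans_cust_offer:
    print(row)
```

Events without an offer id, such as plain transactions, keep their customer details but gain no offer columns, which is the behavior of a left join.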
trans_each_cust_offer
Grouped by each customer, the data is transformed to gain specific insights for each individual.
trans_cust_offer_succ
This displays the information about offer events, such as the duration between the start of the campaign and the event. The variable time_lapsed_succ represents the time difference between when the offer was viewed and when it was completed. If the offer was neither viewed nor completed, this variable will be null.
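A minimal plain-Python sketch of how time_lapsed_succ could be derived from flattened events is shown below. It makes two simplifying assumptions that are not taken from the notebook: each customer receives a given offer at most once, and the value is null whenever either the view or the completion is missing. The sample events are hypothetical.

```python
from collections import defaultdict


def time_lapsed_succ(events):
    """Hours between first view and first completion per (person, offer_id)."""
    first_seen = defaultdict(dict)
    for e in events:
        key = (e["person"], e["offer_id"])
        previous = first_seen[key].get(e["event"], e["time"])
        first_seen[key][e["event"]] = min(previous, e["time"])  # keep earliest time

    lapsed = {}
    for key, times in first_seen.items():
        viewed = times.get("offer viewed")
        completed = times.get("offer completed")
        if viewed is None or completed is None:
            lapsed[key] = None  # null when the offer was not viewed or not completed
        else:
            lapsed[key] = completed - viewed
    return lapsed


# Hypothetical events: c1 views and completes the offer, c2 only views it.
sample_events = [
    {"person": "c1", "offer_id": "o1", "event": "offer received", "time": 0},
    {"person": "c1", "offer_id": "o1", "event": "offer viewed", "time": 6},
    {"person": "c1", "offer_id": "o1", "event": "offer completed", "time": 66},
    {"person": "c2", "offer_id": "o1", "event": "offer received", "time": 0},
    {"person": "c2", "offer_id": "o1", "event": "offer viewed", "time": 12},
]
print(time_lapsed_succ(sample_events))
```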
bogo_offer_succ_rate
To demonstrate the success rate of buy-one-get-one (BOGO) offers, the dataframe is filtered to include only the BOGO offers mentioned above.
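One way to compute such a success rate is sketched below in plain Python. The definition used here (successful offers divided by offers received, per group) is an assumption, since the exact formula is not spelled out in this article. The sample rows and the successful flag are hypothetical.

```python
from collections import Counter


def success_rate_by(rows, key, offer_type="bogo"):
    """Percentage of received offers of `offer_type` that succeeded, per group."""
    received = Counter()
    succeeded = Counter()
    for row in rows:
        if row["offer_type"] != offer_type:
            continue  # keep only the offer type of interest
        received[row[key]] += 1
        if row["successful"]:
            succeeded[row[key]] += 1
    return {group: 100.0 * succeeded[group] / received[group] for group in received}


# Hypothetical rows: one row per received offer, with a precomputed success flag.
sample_rows = [
    {"offer_type": "bogo", "gender": "M", "successful": True},
    {"offer_type": "bogo", "gender": "M", "successful": False},
    {"offer_type": "bogo", "gender": "F", "successful": True},
    {"offer_type": "discount", "gender": "F", "successful": True},
]
print(success_rate_by(sample_rows, "gender"))
```

The same function can be grouped by any profile column, for example an age or income range, by changing the key argument.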
Analytical Insights
Offer portfolio
offer validity histogram
Most of the offers have validity greater than 5 days
offer/channel count
bogo and informational offers have more promotions through email/mobile
discount offer has more web/email promotions
Starbucks mostly promotes bogo offers (as per this dataset)
offer/event count
First: the customer receives an offer
Second: the customer views that offer, or doesn’t view it
Third: the customer makes a transaction of a certain amount (with or without being aware of the offer)
Fourth: the offer is marked completed by the system if the transaction amount meets the difficulty (minimum spend) of that offer, and the reward is awarded
So we have more transactions as compared to offers received, viewed and completed, which is as expected
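The four steps above, combined with the rule that a completed offer only counts if it was actually viewed, can be expressed as a small check. This is a hedged sketch rather than the notebook's logic: it assumes times are in hours since the start of the test and the duration is in days, as in the data files, and the function name is_successful is hypothetical.

```python
def is_successful(received, viewed, completed, duration_days):
    """True when the offer was viewed before completion, inside its validity window.

    `received`, `viewed` and `completed` are hours since the start of the test
    (None when the event never happened); `duration_days` is the offer validity.
    """
    if viewed is None or completed is None:
        return False
    expires = received + duration_days * 24  # convert validity from days to hours
    return received <= viewed <= completed <= expires


print(is_successful(0, 6, 12, 5))     # viewed, then completed in time
print(is_successful(0, None, 12, 5))  # completed without ever viewing
print(is_successful(0, 20, 12, 5))    # viewed only after completing
print(is_successful(0, 6, 130, 5))    # completed after the offer expired
```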
offer/event/gender count
Female — offer success rate 37.25%
Male — offer success rate 50.56%
Others — offer success rate 15%
offer/customer_profile
The number of offers received, viewed, and completed generally increases with age up to a certain point and then starts to decline. The age ranges of (50, 55] and (55, 60] have the highest numbers across all three categories, indicating a higher level of engagement with offers.
The conversion rate from offer viewed to offer completed varies across age ranges. While the number of offers viewed generally increases with age, the completion rate shows some variation. Age ranges (45, 50] and (50, 55] have relatively higher completion rates compared to other age ranges, indicating better conversion of viewed offers into completed ones.
Age ranges (40, 45] and (45, 50] stand out with high numbers in all three categories. These age groups might be particularly receptive to offers, presenting potential targeting opportunities for marketing campaigns.
Youth Engagement — the age range (15, 20] has a relatively high number of offers received and viewed, but a lower number of offers completed compared to other age ranges. This suggests that while the youth show interest in offers, their conversion rate into completed offers is relatively lower.
Older Age Engagement — the graph shows a decline in offer engagement as age increases beyond 60. Age ranges (60, 65], (65, 70], and (70, 75] have lower numbers in all three categories, indicating potentially decreased interest or engagement with offers among older age groups.
The number of offers received, viewed, and completed generally decreases as membership ranges increase. The highest numbers are seen in the lowest membership range of (1700, 2200], indicating higher offer engagement for members in that range.
The number of offers received, viewed, and completed generally varies across income ranges. The highest numbers are observed in the income range of (50000, 70000], indicating higher offer engagement for individuals within that income bracket.
Customer profile
income/membership_days histogram
Income and membership days histogram, explained later in this article
age count
The primary age demographic of customers at Starbucks typically falls between 49 and 67 years old
income_range count
Starbucks is primarily favored by customers whose income falls within the range of 50,000 to 70,000 dollars
membership_days_range count
The majority of customers at Starbucks have membership durations ranging from 1700 to 2700 days (approximately 4.7 to 7.4 years) with more male customers
gender count
The customer database at Starbucks is primarily dominated by male customers.
Transactions
transactions/offer histogram
age_range/transaction_amount
The age group of 50 to 60 years tends to be the primary demographic of customers who spend more on coffee at Starbucks.
membership_days_range/transaction_amount
The majority of customers at Starbucks have membership durations ranging from 1700 to 2700 days (approximately 4.7 to 7.4 years)
income_range/transaction_amount
Starbucks tends to attract customers with incomes ranging from $50,000 to $90,000, who tend to spend more at the coffee chain
gender/transactions
Male/Female spend almost equally on coffee
Females spend more on each transaction as compared to Males
Others are consistent and balanced
Machine Learning
Azure Machine Learning
Azure Machine Learning is a cloud-based platform for building, training, and deploying machine learning models. It provides a range of tools and services for data scientists and developers to collaborate on building and managing machine learning workflows. The platform includes features such as automated machine learning, model interpretation, and monitoring, and it supports a range of popular machine learning frameworks and languages. Azure Machine Learning also integrates with other Azure services such as Azure Data Factory and Azure Databricks for end-to-end data analytics and machine learning workflows.
Objective
The task is to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type. Here, we are just concerned with the bogo offer, and we will assess the effectiveness of this offer and identify the customers who respond best to it, enabling us to target them effectively and reduce customer churn rate.
Machine Learning Model (unsupervised)
In this project, I used K-Means clustering, an unsupervised ML model, to segregate customers into two groups: a) customers who successfully viewed and completed the offer after receiving it, and b) customers who received and viewed the offer but did not complete it, and who need further attention.
K-Means Clustering
K Means is an unsupervised machine learning algorithm used for clustering and pattern recognition. It partitions a dataset into K clusters based on the similarity between data points, where K refers to the number of clusters. The algorithm iteratively updates cluster centroids until they no longer move significantly, effectively partitioning the dataset into K clusters. K Means is commonly used in data mining, image processing, and natural language processing.
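To make the assign-and-update loop concrete, here is a compact plain-Python sketch of K-Means with K=2. It is an illustration only, not the model trained in Azure Machine Learning: the points are made-up two-dimensional values, and the centroids are initialized from the first K points for reproducibility, whereas library implementations typically use random or k-means++ initialization.

```python
import math


def kmeans(points, k, max_iterations=100, tolerance=1e-6):
    """Return (centroids, assignments) after iterating until centroids stop moving."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic init: first k points
    assignments = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        shift = max(math.dist(old, new) for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tolerance:  # centroids no longer move significantly
            break
    return centroids, assignments


# Hypothetical, already-scaled customer features forming two obvious groups.
sample_points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
final_centroids, labels = kmeans(sample_points, k=2)
print(labels)
print(final_centroids)
```

With real customer data, features such as income and total amount spent sit on very different scales, so they would normally be standardized before clustering.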
Once the customers have been clustered into groups based on their offer success, we can proceed to train different supervised classification predictive machine learning models. This will allow us to extract further insights from the data and make predictions based on the models’ outputs.
Data Visualizations
Tableau & Reports
This Tableau dashboard provides basic visualizations for quick analysis. For future reference, more advanced and interactive dashboards can be created to further enhance the insights.
BOGO Insights
Customers who actually viewed and completed the bogo offer successfully:
Age range 48–70
Offer received 1–3
Total transactions made 5–12
Total amount spent 80–200 dollars
Rewards earned 10–21
Time lapsed success 6–100 hours, mostly around 60 hours
Income 55,000–90,000 dollars
Membership 2000–2500 days or 5.5–7 years
Based on the information obtained, we now have an understanding of which customers are successfully completing the BOGO offers and who require further engagement. For customers who do not fall within the mentioned ranges of attributes, it is necessary to consider sending them more BOGO offers or devising improved business strategies to encourage their participation.
Bogo offer success rate
Currently, we have around 50% BOGO offer success rate as per data collected
Full Code at GitHub
You can get the full code in my GitHub repository.
Conclusion
Using a simulated dataset that closely resembles the Starbucks Rewards mobile app, we were able to analyze consumer behavior and identify which demographic groups respond best to the BOGO offer.
By combining transaction, demographic, and offer data, we have successfully assessed the effectiveness of this offer and identified the customers who are most likely to respond to it. This information can be used to develop targeted marketing strategies, reduce customer churn rate, and increase sales.
However, it is important to note that our analysis covers only the BOGO offer, and further research may be needed to determine which offer types are most effective for different demographic groups. As a next step, we can also train different supervised classification models, which will allow us to extract further insights from the data and make predictions based on the models’ outputs.