Starbucks Capstone — the best offer for every customer


Once every few days, Starbucks sends out an offer to users of its mobile app. An offer can be a mere advertisement for a drink or an actual offer such as a discount or BOGO (buy one, get one free). I’m working with data sets that contain simulated data mimicking customer behavior on the Starbucks rewards mobile app. These data sets are a simplified version of the real Starbucks app, because the underlying simulator has only one product, whereas Starbucks actually sells dozens of products. Every offer has a validity period, after which it expires.

The three data sets are: the portfolio data set, which contains the offer types; the profile data set, which contains demographic data on the customers; and the transcript data set, which records transactions and offer events.

My goal is to develop a heuristic that determines which type of offer is most successful for a customer group, where the group can be narrowed down arbitrarily. In addition, the heuristic should report how much the group spends per day with and without the influence of an offer. This information can then be used to decide which offer is the most suitable for the customer.

Data Exploration

In the first step, I read in the three provided data sets to examine them in detail.


df_portfolio contains the 10 different offer types. A distinction is made between BOGO, discount and informational offers. To complete an offer and receive the reward, the customer must spend the amount noted under difficulty, and has only as many days to do so as noted under duration. For a BOGO (buy one, get one free) offer, the customer must spend the amount within a single order. For a discount, the customer can place several orders within the validity period, all of which count towards the amount. For an informational offer, the customer is only informed about the product without receiving a discount. There are 4 BOGO offers, 4 discount offers and 2 informational offers.


df_profile contains information on all customers who participated in the experiment. Data on gender, age, income and the date of entry are stored.

2175 customers did not specify gender.

Regarding age, many people are listed as 118 years old. Presumably the year of birth 1900 was entered automatically for everyone who did not want to specify an age, which results in an age of 118. This age is not taken into account.
The average age of the customers is 54 years and the average income is 65,000. Income is likewise missing for 2175 customers.

It can be shown that the missing data on gender, age and income always occur together.
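The co-occurrence of the missing values can be checked with a few lines. The following is a minimal sketch in plain Python rather than the original notebook code; the sample rows are made up, and the field names and the placeholder age of 118 are assumptions based on the description above.

```python
# Hypothetical sample rows shaped like df_profile; the values are made up.
profiles = [
    {"gender": "F", "age": 55, "income": 112000.0},
    {"gender": None, "age": 118, "income": None},
    {"gender": "M", "age": 68, "income": 70000.0},
    {"gender": None, "age": 118, "income": None},
]

PLACEHOLDER_AGE = 118  # assumed marker for "age not specified"


def missing_flags(row):
    """Return, per field, whether gender, age and income are missing."""
    return (
        row["gender"] is None,
        row["age"] == PLACEHOLDER_AGE,
        row["income"] is None,
    )


# A row is consistent if its three fields are all present or all missing.
consistent = all(len(set(missing_flags(row))) == 1 for row in profiles)

# Number of rows in which all three fields are missing.
n_missing = sum(all(missing_flags(row)) for row in profiles)

print(consistent, n_missing)
```

If `consistent` is true for the full data set, the rows with missing demographics can be treated as one group instead of handling each column separately.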


df_transcript is the largest data set. It contains the transactions of all customers as well as all offer events: offer received, offer viewed and offer completed. The value column contains either the amount of the transaction or the id of the respective offer type. The time column contains the time of the event or transaction.

There are records for 17,000 people, as many as there are in df_profile.

The data set contains four different event types. A transaction is a purchase made by the customer. Each new offer that a customer receives is recorded as offer received. When a customer looks at the offer, an offer viewed event is generated, and if they fulfill it, an offer completed event is generated. As expected, there are more offer received events than offer viewed or offer completed events.

The time column records the time of the event in hours. The record starts at 0 and ends at 714.

The value column contains the amount for a transaction. For an offer event it contains the offer id. All 10 offer types appear, except for offer completed events, where there are only 8 different entries. This is expected, because the two informational offer types cannot be completed.

Data Preprocessing


In df_portfolio the offer type ids were reassigned numerically to improve readability. In addition, the offers were re-sorted by offer type.


In df_profile the entry date has been reformatted and the person IDs have been replaced by numbers as well. If the age or income was not specified, the field now contains a 0.


In df_transcript the information from the old value column was split into two new columns, amount and offer type id.
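Splitting the value column could look roughly like the sketch below. It assumes that each raw value is a dictionary holding either an amount or an offer key; the key names `"offer id"` and `"offer_id"`, the raw keys and the mapping `offer_type_ids` are assumptions for illustration, not taken from the original code.

```python
def split_value(value):
    """Split one raw value entry into (amount, offer_key)."""
    amount = value.get("amount")
    # Assumption: the offer key may be stored under either spelling.
    offer_key = value.get("offer id", value.get("offer_id"))
    return amount, offer_key


# Hypothetical mapping from raw offer keys to numeric offer type ids.
offer_type_ids = {"raw-key-a": 0, "raw-key-b": 4}


def to_columns(value):
    """Turn a raw value entry into the two new columns."""
    amount, offer_key = split_value(value)
    offer_type_id = offer_type_ids[offer_key] if offer_key is not None else None
    return {"amount": amount, "offer_type_id": offer_type_id}


raw_values = (
    {"amount": 3.5},                        # transaction
    {"offer id": "raw-key-b"},              # offer received or viewed
    {"offer_id": "raw-key-a", "reward": 5}, # offer completed
)
rows = [to_columns(v) for v in raw_values]
print(rows)
```

Every row ends up with exactly one of the two columns filled, which makes later filtering by event type straightforward.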

Some orders are very large. They are not considered implausible, however, because one person may have paid for a larger group.

Data Visualization

Most customers are male, with very few indicating Other as their gender.

Customers earn between 30,000 and 120,000 per year. The distribution is slightly right skewed.

The age of the customers looks approximately normally distributed.

The amount spent by customers during an order is not normally distributed. Rather, it looks like most customers spend less than 5 per order.

The graph shows when customers registered in the app. There is a jump in mid-2015 and a second jump in mid-2017.


I implemented a function calculate_offers, which builds the offers as objects and then stores them in a DataFrame. For this purpose, the offers and transactions in df_transcript are processed individually for each customer. Every offer starts with an offer received event, at which point a new offer object is created. Since the transcript has no entry for when an offer expires, the expiration time must be calculated manually. Each offer object stores how much money the customer spent in the active period after the offer was viewed. One difficulty is assigning offer events to the correct offer, since the transcript contains no id for an individual offer. It only contains the offer type id, which stands for one of the 10 different types, such as BOGO or discount. In some cases a customer receives several offers with the same offer type id, so an offer viewed event cannot be uniquely assigned to an offer. In these ambiguous cases, the older offer is assumed.
Informational offers have no difficulty and no expiration date. It is assumed, depending on the type, that the advertisement influences the customer for three or four days after viewing. Unlike for the other offer types, this period counts from the time of viewing and not from the time of delivery.
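The expiration calculation and the "older offer wins" rule for BOGO and discount offers can be sketched as follows. The class `Offer` and the function `assign_view` are hypothetical names, not the original implementation. The sketch also assumes that only unviewed offers that have not yet expired are candidates for a view event; informational offers, whose period starts at viewing, are left out.

```python
from dataclasses import dataclass
from typing import Optional

HOURS_PER_DAY = 24


@dataclass
class Offer:
    offer_type_id: int
    received_at: int                 # hour the offer was received
    duration_days: int               # validity period from df_portfolio
    viewed_at: Optional[int] = None  # hour the offer was viewed, if at all

    @property
    def expires_at(self) -> int:
        """Expiration hour, derived from receipt time and duration."""
        return self.received_at + self.duration_days * HOURS_PER_DAY


def assign_view(offers, offer_type_id, time):
    """Attach a view event to the oldest matching offer, or return None."""
    candidates = [
        o for o in offers
        if o.offer_type_id == offer_type_id
        and o.viewed_at is None
        and o.received_at <= time <= o.expires_at
    ]
    if not candidates:
        return None
    oldest = min(candidates, key=lambda o: o.received_at)
    oldest.viewed_at = time
    return oldest


# Two offers of the same type overlap; the view at hour 130 is ambiguous.
offers = [Offer(4, received_at=0, duration_days=7),
          Offer(4, received_at=120, duration_days=7)]
assigned = assign_view(offers, offer_type_id=4, time=130)
print(assigned.received_at, assigned.expires_at)
```

In this example both offers are valid at hour 130, so the view is credited to the offer received at hour 0.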

Every offer is in a certain state at any given time. There are 12 possible states:

B= BOGO, D = Discount, I = Informational

The money spent column indicates how much money was spent when the customer was influenced by the offer, i.e. after the time of viewing until the time of fulfillment or expiration.


The DataFrame calculated with the calculate_offers function is stored in df_offers.

At the end of the calculation, the two states open and active are converted to still open after end of experiment and still active after end of experiment, respectively, so that 10 of the 12 states finally appear in df_offers.

At the end of the experiment, most offers are in the states successful, expired or ended. To evaluate how much money a customer spends under the influence of an offer, only these three states are of interest. The two states still open after end of experiment and still active after end of experiment are ignored: these offers are still running at the end, so it cannot be evaluated whether the customer spends more money during the promotion period. The other states are also ignored, because the customer did not view the offer before it closed and was therefore not influenced by it.
Overall, there are more fulfilled than expired offers.

The calculate_offers function not only calculated the offers, but also recorded how many offers influenced the customer at each transaction:

In most cases, a customer is influenced by no offer or by exactly one offer while making a transaction. In exceptional cases, 2 or 3 offers influence the customer at the same time. In these cases, the amount of the transaction is credited to all of these offers.
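Crediting a transaction to every offer that influences the customer at that moment can be sketched like this. The dictionary layout and the function name `credit_transaction` are assumptions for illustration; an offer is taken to influence the customer from the time of viewing until it is fulfilled or expires (`ends_at`).

```python
def credit_transaction(offers, time, amount):
    """Add `amount` to every offer active at `time`; return how many there were."""
    active = [
        o for o in offers
        if o["viewed_at"] is not None and o["viewed_at"] <= time <= o["ends_at"]
    ]
    for offer in active:
        offer["money_spent"] += amount
    return len(active)


# Hypothetical offers of one customer; the third one was never viewed.
offers = [
    {"viewed_at": 10, "ends_at": 100, "money_spent": 0.0},
    {"viewed_at": 50, "ends_at": 150, "money_spent": 0.0},
    {"viewed_at": None, "ends_at": 200, "money_spent": 0.0},
]

n_first = credit_transaction(offers, time=60, amount=8.0)    # two active offers
n_second = credit_transaction(offers, time=120, amount=4.0)  # one active offer
print(n_first, n_second, [o["money_spent"] for o in offers])
```

The return value corresponds to the number of active offers recorded per transaction. Note that the same amount is counted once per offer, so the sums over all offers can exceed the customer's total spending.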

The function df_transcript_person returns the transcript for a customer with the respective person id in a DataFrame:


In the case of an offer event, details such as difficulty and duration are displayed. For a transaction, the active offers column shows the number of offers influencing the customer.

Finally, the df_offers_person function outputs all offers made to a specific customer:


The next step is to calculate how many hours the customer was not influenced by any offer and how much money he spent during this period. This data is appended to df_profile:
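One way to obtain the uninfluenced hours is to merge the influence periods of all of a customer's offers and subtract their total length from the length of the experiment. The sketch below illustrates this under the assumption that the experiment covers hours 0 to 714; the function names are hypothetical and the numbers are made up.

```python
EXPERIMENT_END = 714  # last recorded hour in df_transcript


def influenced_hours(intervals, end=EXPERIMENT_END):
    """Length of the union of (start, stop) hour intervals, clipped to [0, end]."""
    total = 0
    current_start = current_stop = None
    for start, stop in sorted(intervals):
        start, stop = max(start, 0), min(stop, end)
        if start >= stop:
            continue  # interval lies outside the experiment or is empty
        if current_stop is None or start > current_stop:
            # A gap: close the previous merged interval and open a new one.
            if current_stop is not None:
                total += current_stop - current_start
            current_start, current_stop = start, stop
        else:
            # Overlap: extend the current merged interval.
            current_stop = max(current_stop, stop)
    if current_stop is not None:
        total += current_stop - current_start
    return total


def spend_per_day(money, hours):
    """Average spending per day over a period given in hours."""
    return money / (hours / 24) if hours > 0 else 0.0


# Made-up influence periods of one customer; the last one runs past the end.
periods = [(10, 100), (50, 150), (600, 800)]
uninfluenced = EXPERIMENT_END - influenced_hours(periods)
print(uninfluenced, spend_per_day(46.0, uninfluenced))
```

Merging the intervals first avoids counting overlapping offers twice, which would otherwise understate the uninfluenced time.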


In order to make a statement about a group of people, a filter function is required that selects only those customers who meet the desired criteria. This was implemented in the function find_persons, which returns an extract from df_profile with exactly those customers that match the search criteria:
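A filter of this kind might look like the following sketch. The parameter names are assumptions and the original find_persons may have a different signature; it works on a list of dictionaries instead of a DataFrame to stay dependency-free. Values of 0, which mark unspecified age or income after preprocessing, never match a range criterion.

```python
def find_persons(profiles, gender=None, min_age=None, max_age=None,
                 min_income=None, max_income=None):
    """Return the profiles that match every criterion that is not None."""

    def in_range(value, low, high):
        if low is None and high is None:
            return True   # no criterion given for this field
        if value == 0:
            return False  # 0 marks "not specified"
        return (low is None or value >= low) and (high is None or value <= high)

    return [
        p for p in profiles
        if (gender is None or p["gender"] == gender)
        and in_range(p["age"], min_age, max_age)
        and in_range(p["income"], min_income, max_income)
    ]


# Made-up profiles in the preprocessed format.
profiles = [
    {"person_id": 0, "gender": "M", "age": 34, "income": 42000},
    {"person_id": 1, "gender": "F", "age": 61, "income": 95000},
    {"person_id": 2, "gender": None, "age": 0, "income": 0},
    {"person_id": 3, "gender": "M", "age": 58, "income": 38000},
]

low_income_men = find_persons(profiles, gender="M", max_income=50000)
print([p["person_id"] for p in low_income_men])
```

Because every criterion is optional, the same function serves both broad groups (all women) and narrow ones (men under 40 with low income).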

Finally, among other functions, the print_recommendations function was developed to indicate, based on heuristics, which offer is best suited for a particular customer. It outputs how much the customer is likely to spend per day if he is not influenced by any offer and how much he spends if he is influenced by informational advertising. It also indicates how likely a BOGO or discount offer is to be completed successfully and outputs the most successful offer type:


Most functions have assert statements to ensure that they work properly.

If recommendations are requested for a customer group with fewer than 20 people, an error is raised because the reference group is too small.
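The core of such a recommendation, including the minimum group size, can be sketched as below. This is not the original print_recommendations; the function names are hypothetical, and the success rate is assumed to be the share of successful offers among successful and expired ones.

```python
from collections import Counter

MIN_GROUP_SIZE = 20


def success_rates(offers):
    """Map offer type id to successful / (successful + expired)."""
    counts = {}
    for offer_type_id, state in offers:
        if state in ("successful", "expired"):
            counts.setdefault(offer_type_id, Counter())[state] += 1
    return {
        type_id: c["successful"] / (c["successful"] + c["expired"])
        for type_id, c in counts.items()
    }


def best_offer_type(group_size, offers):
    """Return the offer type id with the highest success rate for the group."""
    if group_size < MIN_GROUP_SIZE:
        raise ValueError("reference group is too small for a recommendation")
    rates = success_rates(offers)
    return max(rates, key=rates.get) if rates else None


# Made-up (offer type id, state) pairs for a group of 25 customers.
sample = [
    (0, "successful"), (0, "expired"),
    (4, "successful"), (4, "successful"), (4, "expired"),
    (4, "still open after end of experiment"),  # ignored
]
print(success_rates(sample), best_offer_type(25, sample))
```

Raising an error for small groups is stricter than returning a noisy estimate, but it prevents recommendations that rest on only a handful of offers.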

The offer type id 0 corresponds to a BOGO offer with difficulty 5. To fulfill it, more than 5 must be spent in a single transaction. If the offer was fulfilled on purpose, the recorded sum must therefore be greater than 5 in any case. So there must be no fulfilled offer of this type with a sum smaller than 5. Conversely, there may well be expired offers for which a total of more than 5 was spent, because a BOGO offer requires spending 5 in one transaction rather than across several.

The offer type id 4 corresponds to a discount offer with difficulty 7. To fulfill it, more than 7 must be spent in total across all transactions. There are actually fulfilled offers where less than 7 was spent! This happens when the customer has already spent money after receiving the offer but before looking at it. This money counts towards fulfillment, but it is not added to the offer as money spent because the customer was not yet influenced by the offer. Conversely, there must not be any expired offers where more than 7 was purposely spent.

For all offers that had not expired by the end of the experiment, the expiration date must be after hour 714.

Finally, all these assumptions for all offer type ids are put into assert statements to validate the calculation process of the offers.
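The three plausibility rules described above can be expressed as assert statements roughly as follows. The offer layout and the function name `check_offer` are assumptions for illustration; money_spent is the amount spent after viewing the offer.

```python
EXPERIMENT_END = 714
RUNNING_STATES = (
    "still open after end of experiment",
    "still active after end of experiment",
)


def check_offer(offer):
    """Raise AssertionError if an offer violates one of the plausibility rules."""
    kind, state = offer["kind"], offer["state"]
    if kind == "bogo" and state == "successful":
        # A fulfilled BOGO needs one transaction of at least the difficulty.
        assert offer["money_spent"] >= offer["difficulty"], "BOGO sum too small"
    if kind == "discount" and state == "expired":
        # An expired discount cannot have more than the difficulty credited.
        assert offer["money_spent"] <= offer["difficulty"], "discount sum too large"
    if state in RUNNING_STATES:
        # Offers still running must expire after the last recorded hour.
        assert offer["expires_at"] > EXPERIMENT_END, "offer should have expired"


# Made-up offers that satisfy all rules.
valid_offers = [
    {"kind": "bogo", "state": "successful", "difficulty": 5,
     "money_spent": 9.5, "expires_at": 168},
    {"kind": "discount", "state": "successful", "difficulty": 7,
     "money_spent": 4.0, "expires_at": 336},  # partly spent before viewing
    {"kind": "discount", "state": "expired", "difficulty": 7,
     "money_spent": 3.0, "expires_at": 240},
    {"kind": "bogo", "state": "still active after end of experiment",
     "difficulty": 10, "money_spent": 0.0, "expires_at": 744},
]
for offer in valid_offers:
    check_offer(offer)
print("all checks passed")
```

Note that a fulfilled discount with less than the difficulty credited passes the check, since money spent before viewing counts towards fulfillment but not towards money_spent.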


On average, customers already spend 3 per day without advertising. The most successful offers are the discount offers with duration 7 and difficulty 7. About 70% of them are fulfilled.
Informational advertising is very effective as it increases sales by an average of 44% and has no cost.

In general, the more a customer earns, the more he is willing to spend. This applies to both men and women. In general, women spend more than men, but this trend reverses at very high incomes. Informational advertising is very effective for all and generates an increase in sales of approx. 40–60%.

Men spend a little more the older they get. This trend cannot be observed among women, for whom spending shows no correlation with age. It is also evident that informational advertising has less influence on older than on younger customers.


As noted above, there are some very high transactions. The following checks how the recommendations change if all transactions above 50 are counted as 50:

Of course, the success rates do not change, but the money spent becomes less.

Spending with and without the influence of informational advertising falls on average by around 0.4 per day. Nevertheless, all main findings remain the same and capping does not shift the results. Therefore, the capping is reverted.
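The capping itself is a one-line transformation. The sketch below uses made-up amounts and a hypothetical helper `cap_amounts`; it only illustrates why the sums drop while the number of transactions, and with it any count-based rate, stays the same.

```python
CAP = 50.0


def cap_amounts(amounts, cap=CAP):
    """Count every transaction above `cap` as `cap`."""
    return [min(amount, cap) for amount in amounts]


# Made-up transaction amounts with one very large order.
amounts = [2.5, 4.0, 12.0, 980.0]
capped = cap_amounts(amounts)

print(sum(amounts), sum(capped), len(amounts) == len(capped))
```

Working on a capped copy rather than overwriting the amounts makes it easy to go back to the uncapped data afterwards.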


On average, customers already spend 3 per day without advertising. Therefore, from a commercial point of view, discounts are not profitable for most customers, because they would spend the amount of the challenge anyway. BOGO offers with difficulty 5 or, better, 10 would be a way to increase sales and profits. These are completed about 40–60% of the time.
Informational advertising is very effective as it increases sales by an average of 44% and has no cost.
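The argument against discounts can be made concrete with a small calculation: at 3 per day, a customer spends about 21 during a 7-day offer, well above a difficulty of 7. The helper names below are hypothetical, and the second example uses a made-up daily spend.

```python
def baseline_spend(daily_spend, duration_days):
    """Expected spending over the offer period without any offer."""
    return daily_spend * duration_days


def discount_is_redundant(daily_spend, duration_days, difficulty):
    """True if the customer would reach the difficulty without the offer."""
    return baseline_spend(daily_spend, duration_days) >= difficulty


# Average customer: 3 per day, discount with duration 7 and difficulty 7.
print(baseline_spend(3.0, 7), discount_is_redundant(3.0, 7, 7))

# Made-up low-spending group: 0.5 per day, same discount.
print(baseline_spend(0.5, 7), discount_is_redundant(0.5, 7, 7))
```

For the average customer the discount rewards spending that would have happened anyway; for a group whose baseline stays below the difficulty, the offer can actually change behavior.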

Nevertheless, there are customer groups for whom discount offers also make commercial sense: namely those who, on average, do not spend significantly more over the offer period than the challenge amount and who have a high success rate. Men with low incomes appear to be one such group:


The heuristic could be extended as follows: it could be explicitly calculated for an individual customer how much he spends uninfluenced and a customer-specific recommendation could be derived from this. However, this only works if the customer is already known and enough data is available about him.

To take a closer look at my evaluations, you can check them on GitHub.


