Assessing Starbucks Data.

Diego Zúñiga Ortiz
Nov 4 · 3 min read

This article presents the main findings from processing the Starbucks data set.

The data set contains three tables: one describing customers, one describing the offers sent to customers to encourage consumption, and one recording the interactions between users and offers.

The goal of this project is to explore the data, wrangle it, and produce a clean data set from which useful information can be extracted. More specifically, the main goal is to train a machine learning model that predicts whether a given type of user receiving a given offer will complete it by consuming the products.


Cleaning Data

To accomplish this goal, the first step is to explore and clean the data. The process can be followed in the Jupyter notebook associated with the project.

Cleaning the data involved converting some column types, splitting some columns into several, and dropping users with null values. The last decision was made because the model is meant to predict outcomes from the users' demographic data and the offers' details.
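A minimal sketch of these cleaning steps in pandas, using a hypothetical miniature of the profile table (the real column names and values live in the notebook):

```python
import pandas as pd
import numpy as np

# Hypothetical miniature of the user profile table; the real table
# in the project has more rows and columns.
profile = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "became_member_on": [20170715, 20180101, 20160320],  # dates stored as ints
    "gender": ["F", None, "M"],
    "income": [72000.0, np.nan, 55000.0],
})

# Type conversion: integer-coded dates -> proper datetimes
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"], format="%Y%m%d"
)

# Drop users with missing demographic data, since the model relies
# on demographics to make its predictions
clean = profile.dropna(subset=["gender", "income"]).reset_index(drop=True)
```

Here `u2` is dropped because its gender and income are missing, leaving two fully described users.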

Once this was done, some interesting features were selected as the input of the model.

The next step was to build a tidy data set combining user and offer data, so that each row represents a user-offer pair to be classified according to whether the offer was completed or not.

An example of the resulting table can be seen in the project notebook.

This data will be used as the input for a classification problem solved with Logistic Regression.
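The combination step can be sketched as a pair of merges in pandas. The tables and column names below are illustrative stand-ins, not the project's exact schema:

```python
import pandas as pd

# Toy versions of the three tables: users, offers, and the
# interaction events linking them. Column names are assumptions.
users = pd.DataFrame({"person": ["u1", "u2"], "age": [35, 50]})
offers = pd.DataFrame({"offer_id": ["o1", "o2"], "difficulty": [10, 5]})
events = pd.DataFrame({
    "person":    ["u1", "u2", "u1"],
    "offer_id":  ["o1", "o2", "o2"],
    "completed": [1, 0, 1],   # target: was the offer completed?
})

# One row per user-offer interaction, with features from both sides
dataset = events.merge(users, on="person").merge(offers, on="offer_id")

X = dataset[["age", "difficulty"]]   # model features
y = dataset["completed"]             # classification target
```

Each resulting row carries both the user's attributes and the offer's attributes, which is exactly the unit the classifier needs.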

Logistic Regression Model

Once the features were extracted from the table, recursive feature elimination was applied to check whether removing some features would improve the model's performance.

Based on this information, one feature from the users and one from the offers were dropped: 'income' and 'reward'.
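Recursive feature elimination is available in scikit-learn as `RFE`; a sketch with synthetic data standing in for the real user-offer table:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the cleaned user-offer features
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Recursively drop the weakest features until 6 remain; the dropped
# ones play the role of 'income' and 'reward' in the project.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=6)
selector.fit(X, y)

kept = selector.support_      # boolean mask of retained features
ranks = selector.ranking_     # 1 = kept; higher = eliminated earlier
```

Features flagged `False` in `support_` are candidates for removal before retraining the model.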

This modification improved the model's performance by about 6%.

With the model and features selected, the model is trained and tested as shown in the notebook. The result is a well-trained model that predicts, for a given user-offer combination, whether the user will complete the offer by consuming the products.
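The train/test procedure follows the standard scikit-learn pattern; here is a minimal sketch with synthetic data in place of the real table:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned user-offer features and target
X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a test set for unbiased evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
```

The same two metrics, accuracy and the confusion matrix, are what the project reports below.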

The accuracy of the logistic regression classifier on the test set is 0.77, and the confusion matrix is as follows:

[[3157 2522]
 [ 470 6599]]

This shows that more than 9,500 user-offer combinations were correctly predicted in a population of about 12,700.
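The counts above follow directly from the matrix (in scikit-learn's convention, rows are true labels and columns are predicted labels):

```python
import numpy as np

# The reported confusion matrix: rows = true labels, cols = predictions
cm = np.array([[3157, 2522],
               [470, 6599]])

correct = cm.trace()         # true negatives + true positives
total = cm.sum()             # all user-offer pairs in the test set
accuracy = correct / total
false_negatives = cm[1, 0]   # predicted 0 while the reality was 1

print(correct, total, round(accuracy, 2), false_negatives)
# → 9756 12748 0.77 470
```

The diagonal sums to 9,756 correct predictions out of 12,748 pairs, reproducing the reported 0.77 accuracy, and the false negatives number only 470.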

This puts in our hands a model that can be used to predict the completion of offers sent to users. It is true that the accuracy is not the best; nevertheless, in this context the real loss occurs when the prediction is 0 and the reality is 1, i.e. a false negative. These are proportionally rare, so statistically the loss should not be significant.

You can see the details of this project in the following link: https://github.com/diego-rzo/DataScientistNanodegree/tree/master/Project_6_Capstone_Project
