Stories by Minh Khang Tran on Medium

Exploring Starbucks’s Customer Behaviors for Deeper Insights

Minh Khang Tran — Sat, 03 Jun 2023 18:13:42 GMT

Unveiling Deeper Insights into Starbucks Customer Behaviors

I. Introduction

Overview

Starbucks’ remarkable growth since its inception has propelled it to become a sought-after brand for travelers and investors alike. In collaboration with Udacity and Starbucks, this project ventures into analyzing simulated data derived from the Starbucks rewards app.

The data simulation program captures individuals’ purchasing decisions and the impact of promotional offers on those decisions. Each individual in the simulation possesses hidden traits that affect their purchasing behavior and are linked to observable characteristics. The simulation generates diverse events including receiving offers, opening offers, and making purchases.

Dataset

There are three types of offers: buy-one-get-one (BOGO), discount, and informational. BOGO offers require users to spend a specific amount to receive a reward equal to that threshold. Discount offers provide users with a reward that is a fraction of the amount spent. Informational offers do not provide a reward and do not require a specific spending amount. Offers can be delivered through various channels.

The data encompasses three JSON files:

portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
profile.json — demographic data for each customer.
transcript.json — records for transactions, offers received, offers viewed, and offers completed.

Here is the schema and explanation of each variable in the files:

Project Goal

The objective of this project is to identify factors and develop a machine learning (ML) model that can predict customer behavior when it comes to completed offers. My primary focus is on demographics because I believe they have a greater influence than the specific offer types.

Strategy

To begin the analysis, I will conduct an Exploratory Data Analysis (EDA) to examine and understand the data’s representations and characteristics. This step will involve gaining insights into the data and performing data cleaning as needed.

Next, I will utilize the provided data to address the aforementioned objectives. I will employ charts and various ML models to analyze the merged dataset, which consists of the portfolio, profile, and transcript data. These models will be trained on the data to generate answers and insights.

For the binary classification task, I plan to employ eight different models: Logistic Regression, AdaBoost Classifier, Random Forest Classifier, K Neighbors Classifier, Decision Tree Classifier, Gradient Boosting Classifier, XGB Classifier, and LGBM Classifier. Afterwards, I will compare their F1-scores and select the most suitable model for further tuning and determining feature importance.

Metrics

Given the dataset’s imbalance, the F1-Score is chosen as the performance metric instead of accuracy. Accuracy can look high on imbalanced data even when a model rarely identifies completed offers, whereas the F1-Score balances precision and recall for the binary outcome (0 or 1).

II. Data Exploration

In order to gain a comprehensive understanding of the datasets, our initial step involves exploring them. This includes tasks such as checking for duplicated rows, visualizing data distribution, and more.

Data Preprocessing

Let’s start by accessing each dataset and determining the necessary steps to clean the data.

1. The portfolio dataset showed 10 rows and 6 columns.

Create four separate columns: ‘web’, ‘email’, ‘mobile’, and ‘social’ from the existing ‘channels’ column.
Convert the ‘channels’ column from a list to a numerical type.

2. The profile dataset showed 17000 rows and 5 columns.

Convert became_member_on to datetime format and extract the year.
Replace incorrect age data (118) with null values.

3. The transcript dataset showed 306534 rows and 4 columns.

Extract offer_id and amount from ‘value’ column.
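The cleaning steps listed above can be sketched in pandas as follows. This is a minimal illustration on toy data, not the project's actual code: the column names, the YYYYMMDD integer format of became_member_on, the two key spellings inside ‘value’ ("offer id" and "offer_id"), and the way channel combinations are turned into numeric codes are all assumptions.

```python
import pandas as pd

# Toy stand-ins for the three files (assumed column names and formats).
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b", "offer_c"],
    "channels": [
        ["web", "email", "mobile", "social"],
        ["web", "email"],
        ["web", "email", "mobile", "social"],
    ],
})
profile = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "age": [55, 118, 40],
    "became_member_on": [20170715, 20170212, 20180426],
})
transcript = pd.DataFrame({
    "person": ["c1", "c1", "c3"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [
        {"offer id": "offer_a"},
        {"amount": 9.5},
        {"offer_id": "offer_b", "reward": 5},
    ],
})

# 1. Portfolio: one 0/1 indicator column per channel, then a numeric
#    code for each distinct channel combination (assumed encoding).
for channel in ["web", "email", "mobile", "social"]:
    portfolio[channel] = portfolio["channels"].apply(
        lambda chs, c=channel: int(c in chs)
    )
combos = portfolio["channels"].apply(",".join)
portfolio["channels"] = pd.factorize(combos)[0].astype(float)

# 2. Profile: parse the join date, keep the year, blank out age 118.
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d"
)
profile["member_year"] = profile["became_member_on"].dt.year
profile["age"] = profile["age"].mask(profile["age"] == 118)

# 3. Transcript: pull offer_id and amount out of the 'value' dicts.
transcript["offer_id"] = transcript["value"].apply(
    lambda v: v.get("offer id", v.get("offer_id"))
)
transcript["amount"] = transcript["value"].apply(lambda v: v.get("amount"))
transcript = transcript.drop(columns="value")
```

Rows without an offer (plain transactions) end up with a missing offer_id, and rows without a purchase end up with a missing amount.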

After cleaning up all three datasets, they are merged based on ‘customer_id’ and ‘offer_id’. Down below, you can see the merged dataset:
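One way such a merge could look in pandas is sketched here on toy frames. The left-join choice and every column other than ‘customer_id’ and ‘offer_id’ are assumptions made for illustration.

```python
import pandas as pd

# Toy frames that already share the key names used for merging.
transcript = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "event": ["offer received", "transaction", "offer completed"],
    "offer_id": ["offer_a", None, "offer_b"],
})
profile = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "gender": ["F", "M"],
    "income": [72000.0, 50000.0],
})
portfolio = pd.DataFrame({
    "offer_id": ["offer_a", "offer_b"],
    "offer_type": ["bogo", "discount"],
    "duration": [7, 10],
})

# Left joins keep every event row, including transactions that
# carry no offer_id (their offer columns stay empty).
merged = (
    transcript
    .merge(profile, on="customer_id", how="left")
    .merge(portfolio, on="offer_id", how="left")
)
```

A left join is used so that no event is dropped; an inner join on ‘offer_id’ would silently discard the transaction rows.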

Data Visualization

Having explored and cleaned the dataset, our focus now shifts to visualizations, particularly those related to the demographic information in the profile dataset. Let’s begin by examining the gender distribution through a bar graph:

The bar graph illustrates that the majority of Starbucks customers identify as male, with approximately 50% of the collected Starbucks profiles reflecting male customers, while around 35% represent female customers. A small percentage of customers identify as ‘Others’. Now, let’s analyze the income distribution using a histogram:

The income distribution appears slightly right-skewed, with the highest count observed around $75,000 and a majority of customers earning less than that. To gain further insights, I have segregated the income distribution by gender using violin plots:

Interestingly, the income distribution appears to be evenly spread across all genders except for males, where the majority earn less than $80,000. Moving on, let’s visualize the age distribution:

After removing incorrect age data (e.g., 118), we observe a relatively normal distribution of ages. Towards the higher end, the distribution tapers off, as it is rare to encounter individuals older than 90. Now, let’s examine the age distribution by gender:

The male age distribution displays a bimodal pattern with a small peak around age 22 and a larger peak at age 60. However, one could argue that the distribution also exhibits a right-skew. Similarly, the female age distribution is bimodal, with peaks around ages 40 and 60. The age distribution for ‘Others’ appears to follow a normal distribution.

Next, let’s visualize the number of memberships initiated each year:

The graph indicates a significant surge in Starbucks members in 2017, with over 6,000 new members. However, the number of new members joining in 2018 is relatively low, which is somewhat unexpected. Overall, membership has been steadily growing since 2013.

Now, let’s proceed to study different offer types based on offer actions (received, viewed, and completed) and examine the corresponding visuals:

It is evident that members have received an equal number of discounts and BOGO (buy one, get one) offers. Additionally, there is a higher number of views for the BOGO offer, indicating its potential attractiveness. However, the majority of people have completed the discount offer. Now, let’s explore the offer action distribution by gender:

From the graph above, it is evident that male customers make more transactions, and they also receive a higher number of offers. However, the completion rate for offers is almost the same for male and female customers.

Next, let’s check the total counts of different events for each channel:

0.0: [‘web’, ‘email’, ‘mobile’, ‘social’]

1.0: [‘web’, ‘email’, ‘mobile’]

2.0: [‘email’, ‘mobile’, ‘social’]

3.0: [‘web’, ‘email’]

From the first three bar plots, it is apparent that utilizing social channels for advertisements is significantly effective. Although the ‘1.0’ channel has more received offers compared to the ‘2.0’ channel, it has a lower number of viewed offers than the ‘2.0’ channel.

III. Methodology

Data Preparation

Now, we need to prepare the data for modeling. I performed the following cleaning steps in this order:

Drop all rows with null values in the ‘income’ column.
Create dummy variables for the ‘event’, ‘gender’, and ‘offer_type’ columns.
Fill all null values with 0.

To address null values, I dropped all rows where income is null, and I replaced the remaining null values with 0. This is justified because if the offer was only viewed or received, no transactions were made, resulting in $0 spent.
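The three preparation steps can be sketched as follows. The toy frame and its values are made up; only the column names ‘income’, ‘event’, ‘gender’, and ‘offer_type’ come from the steps above, and ‘amount’ is an assumed name for the amount spent.

```python
import pandas as pd

merged = pd.DataFrame({
    "event": ["offer received", "transaction", "offer completed", "offer viewed"],
    "gender": ["F", "M", "M", None],
    "offer_type": ["bogo", None, "discount", "bogo"],
    "income": [72000.0, 50000.0, 50000.0, None],
    "amount": [None, 9.5, None, None],
})

# 1. Drop rows with no income.
clean = merged.dropna(subset=["income"])

# 2. One-hot encode the categorical columns (missing categories
#    simply get 0 in every dummy column).
clean = pd.get_dummies(clean, columns=["event", "gender", "offer_type"], dtype=int)

# 3. Remaining nulls (here: amount on non-transaction rows) become 0.
clean = clean.fillna(0)
```

The order matters: dropping on ‘income’ first means the final fillna(0) never turns a missing income into a misleading $0 income.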

Next, I split the data into X and y. X represents the independent variables that serve as input for the model, while y is the variable to predict. Below, you can see how I separated the data and the resulting data frame for X.

Before proceeding with modeling the data, I need to normalize the range of independent variables. To accomplish this, I used MinMaxScaler, as shown below:
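For reference, a minimal sketch of MinMaxScaler on a toy feature frame. The feature names and values are illustrative assumptions; the scaler maps each column to the range [0, 1] using (x - min) / (max - min).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({
    "age": [20, 40, 60, 100],
    "income": [30000.0, 60000.0, 90000.0, 120000.0],
    "duration": [3, 5, 7, 10],
})

scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so wrap it to keep the column names.
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
```

For example, an age of 40 becomes (40 - 20) / (100 - 20) = 0.25.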

Finally, we need to split the data into training and test datasets.

Implementation

Before delving into modeling the data, I first need to select a performance metric. While accuracy is commonly used, it may not be the most suitable choice in this case. To check whether the dataset is imbalanced, I executed the following code:

Based on the observed imbalance, I have opted to utilize the F1-Score as a more appropriate metric. This metric takes into consideration both precision and recall, making it well-suited for imbalanced datasets.
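To make the difference between the two metrics concrete, here is a small worked example in plain Python. The labels are invented (10 completed offers out of 100 records) and do not reflect the project's real class ratio; F1 is computed as 2 * precision * recall / (precision + recall).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced target: 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90
positive_share = Counter(y_true)[1] / len(y_true)  # 0.10

# A "model" that always predicts the majority class.
y_majority = [0] * 100
# A model that finds 6 of the 10 positives and raises 4 false alarms.
y_model = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

majority_scores = (accuracy(y_true, y_majority), f1(y_true, y_majority))  # 0.90 accuracy, 0.0 F1
model_scores = (accuracy(y_true, y_model), f1(y_true, y_model))          # 0.92 accuracy, about 0.6 F1
```

The always-negative predictor reaches 90% accuracy without finding a single completed offer, which is exactly the failure that F1 exposes.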

Moving forward, I proceeded to train and test eight different models to calculate the F1-Score. These models include Logistic Regression, AdaBoost Classifier, Random Forest Classifier, K Neighbors Classifier, Decision Tree Classifier, Gradient Boosting Classifier, XGB Classifier, and LGBM Classifier. Initially, all models were used without any parameter tuning. The objective was to identify the model with the highest F1-Score, which would then be subjected to further tuning and potential improvement.

Firstly, I will provide an example of training and testing the LGBM classifier:

Then, I will create a loop to train and test 8 classification models, storing the results in a dataframe for direct comparison:
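A sketch of what such a comparison loop might look like. To keep it runnable with scikit-learn alone, it uses synthetic data and only the six scikit-learn models; the XGB and LGBM classifiers from the list would be added to the dictionary in the same way. The dataset, split ratio, and random seeds are assumptions.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the merged Starbucks data.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Untuned models (only random_state is set, for repeatability).
models = {
    "LogisticRegression": LogisticRegression(),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "model": name,
        "f1": f1_score(y_test, pred),
        "accuracy": accuracy_score(y_test, pred),
        "seconds": time.perf_counter() - start,
    })

# One row per model, best F1 first.
results = pd.DataFrame(rows).sort_values("f1", ascending=False).reset_index(drop=True)
```

Scores on this synthetic data say nothing about the Starbucks results; the point is only the structure of the loop and the results table.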

Based on these results, the top 3 models with the highest F1-score are the Gradient Boosting Classifier, XGB Classifier, and LGBM Classifier. Among them, the LGBM Classifier stands out as the most accurate with an F1-Score of 0.59 and an accuracy rate of 91%. Consequently, this model will be selected for further tuning and feature evaluation.

Refinement

Tuning an LGBM Classifier for an imbalanced dataset involves considering techniques specifically designed to handle class imbalance. Here are some improvements in the code:

Stratified Sampling: The train_test_split function now includes the stratify parameter, which ensures that the class distribution is preserved in both the training and test sets.
Addressing Class Imbalance: The code now uses the SMOTE oversampling technique from the imblearn library to address class imbalance. It resamples the training set by generating synthetic samples of the minority class.
Class weight balancing: In imbalanced datasets, the minority class is underrepresented. The class_weight parameter of the LGBM Classifier is set to 'balanced', which automatically adjusts the weights inversely to the class frequencies so that the minority class receives a higher weight.
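Two of these three ideas can be illustrated with scikit-learn alone. The sketch below shows a stratified split and the weights that the 'balanced' setting corresponds to, namely n_samples / (n_classes * class_count). The toy labels are assumptions, and SMOTE is left out because it needs the separate imblearn package; it would be applied to the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced target: 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(200).reshape(100, 2)

# stratify=y keeps the 80/20 ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Weights implied by class_weight='balanced' on the training labels.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
```

With 60 negatives and 15 positives in the training split, the weights come out as 75 / (2 * 60) = 0.625 for class 0 and 75 / (2 * 15) = 2.5 for class 1, so each positive example counts four times as much as a negative one.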

IV. Results

Model Evaluation

After modifying the code, I obtained improved results as follows:

Compared to the previous model and dataset, the accuracy decreased slightly from 91% to 89%. However, since this is an imbalanced dataset, our focus should be on the F1-Score rather than accuracy. Surprisingly, by implementing Stratified Sampling, SMOTE oversampling, and Class weight balancing, there was a significant enhancement in the LGBMClassifier model, with the F1-Score increasing from 0.6 to 0.65.

While the improvement is substantial, we can further enhance our model through hyperparameter tuning using a param grid in GridSearchCV. It’s important to note that an exhaustive grid search can become time-consuming when we increase the number of parameters and cross-validation.

I then used the LGBMClassifier model above to compute the importance of each feature. Feature importance refers to techniques that assign a score to every input feature of a given model; the scores simply represent the “importance” of each feature. Down below, you can see my results.
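For readers who want to produce a similar ranking, here is a rough sketch that reads the feature_importances_ attribute of a fitted tree ensemble. A RandomForestClassifier stands in for the LGBM model so that only scikit-learn is required; the feature names are hypothetical labels attached to synthetic columns, so the resulting order carries no meaning for the Starbucks data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names for five synthetic columns.
feature_names = ["time", "income", "age", "duration", "reward"]
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importance = pd.Series(model.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)
```

Sorting the series puts the most influential features first, which is the form usually passed to a bar chart.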

The three most influential factors that impact a customer’s completion of an offer are the duration of response time, the customer’s income, and their age.

Justification

Based on these results, the LGBM Classifier outperforms the other models in terms of both F1-Score and accuracy. It achieves an F1-Score of 0.595501 and an accuracy of 0.909688, indicating its ability to make accurate predictions on the dataset.

Notably, the LGBM Classifier also exhibits a remarkably short modeling time of 2.206612, making it significantly faster compared to most of the other models. This computational efficiency is particularly advantageous, especially when dealing with large datasets, as it allows for quicker model training and prediction processes.

Considering these factors, we confidently conclude that the LGBM Classifier is the most suitable and superior model for our dataset. Its ability to deliver accurate results in a shorter amount of time makes it an excellent choice for handling large datasets effectively and efficiently.

V. Conclusion

Reflection

The objective of this project was to build a model that predicts whether a customer will complete an offer. The process involved preprocessing the portfolio, profile, and transcript datasets, as well as visualizing the merged datasets to gain a better understanding of the data. Subsequently, the merged dataset was used to train the model, with the data split into train and test datasets. Various models were evaluated, and the LGBM Classifier was found to be the most accurate choice. Techniques such as Stratified Sampling, SMOTE oversampling, and class weight balancing were employed to address class imbalance.

Thanks to these techniques, there was an improvement in the F1-Score, which increased from 0.6 to 0.65. This indicates the effectiveness of the implemented methods in handling class imbalance and enhancing the model’s predictive performance.

During the analysis, the three most important factors that determine whether a customer might complete an offer were identified. These factors are the time taken to act on the offer, the customer’s income, and age. By recognizing these influential variables, a deeper understanding of customer behavior in relation to offer completion was gained.

Improvement

However, despite the achieved improvement, there is still potential for further enhancement in the model’s performance.

To achieve better results, it is recommended to focus on fine-tuning the model’s hyperparameters. By experimenting with different settings for hyperparameters such as learning rate, maximum depth, and number of estimators, the model’s performance can be further optimized. This process of hyperparameter tuning has the potential to improve the overall accuracy and increase the F1-Score of the model.
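A hedged sketch of such a search with scikit-learn's GridSearchCV. GradientBoostingClassifier stands in for the LGBM Classifier so the example only needs scikit-learn; the grid values and the synthetic data are illustrative assumptions, not settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# The three hyperparameters named above, two candidate values each.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # optimize the same metric used throughout the project
    cv=3,
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])  # 2 * 2 * 2 combinations
best_params = search.best_params_
best_f1 = search.best_score_
```

Even this small grid trains 8 candidates on 3 folds each, which shows how quickly an exhaustive search grows as parameters and folds are added.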

I would like to thank Udacity and Starbucks for providing the datasets used in this project. Visit the project on GitHub to explore more!

Uncovering the 2022 NYC Airbnb Market: Insights into Popularity, Pricing, and Tourist Behavior

Minh Khang Tran — Wed, 15 Feb 2023 12:59:38 GMT

A deeper understanding of 2022 NYC Airbnb market

Introduction

Airbnb’s exponential growth since its establishment in 2008 has made it a popular choice for both travelers and investors.

Photo by Andreas Kruck on Unsplash

To provide insight into the latest Airbnb market trends, this post examines the 2022 and 2023 analytics for New York City, utilizing data sourced directly from Inside Airbnb.

The data includes three files, namely listings, reviews, and calendar, which offer a wealth of information on host and listing details, pricing, and availability.

Upon analyzing the data, I formulated several questions relating to three areas of interest — popularity, pricing, and valuable information for tourists.

Which hosts have the highest number of listings? What are the most popular neighborhoods in NYC for Airbnb listings?

What are the most expensive and cheapest neighborhoods in NYC to rent an Airbnb? How does the distribution of listing prices vary between boroughs? And what are the seasonal trends in Airbnb pricing and availability in NYC?

What are the top 10 reviewed listings in NYC? What are the top amenities that guests look for?

Question 1.1: Which hosts have the highest number of listings?

The top host on the list is Blueground, with 487 listings in New York City. Eugene comes in second with 351 listings, and Michael is third with 338 listings. Blueground has a significant lead over the second-ranked host, with 136 more listings, a difference of roughly 39%. The five hosts at the bottom of the list have similar numbers of listings.
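The ranking and the size of the lead can be reproduced with a few lines of plain Python. The short host_names list is a made-up stand-in for the host name column of the listings file, while the 487 and 351 figures are the counts quoted above.

```python
from collections import Counter

# Hypothetical sample of host names, one entry per listing.
host_names = ["Blueground", "Eugene", "Blueground", "Michael", "Blueground", "Eugene"]
top_hosts = Counter(host_names).most_common(3)

def lead_over(first, second):
    """Absolute and relative (percent of the runner-up) lead."""
    diff = first - second
    return diff, 100 * diff / second

# Counts quoted in the text: Blueground 487, Eugene 351.
diff, pct = lead_over(487, 351)  # 136 listings, about 38.7%
```

Note that the percentage is measured relative to the second-ranked host's count; relative to Blueground's own count the gap would be about 28%.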

Blueground is an apartment rental agency, which may explain its high number of listings in New York City. The agency could have partnerships with building owners and property managers, allowing it to offer many apartments for rent on the Airbnb platform.

Question 1.2: What are the most popular neighborhoods in NYC for Airbnb listings?

The top 10 neighborhoods with the most Airbnb listings are split between Brooklyn and Manhattan. Interestingly, Brooklyn’s Bedford-Stuyvesant and Williamsburg rank first and second with 2936 and 2570 listings, respectively.

This is surprising as Manhattan is typically thought of as the busiest and most popular borough in New York City. However, six of the neighborhoods in the top 10 are located in Manhattan.

This suggests that while Brooklyn has some highly popular neighborhoods, Manhattan still remains a major center of activity and commerce in New York City.

Additionally, an interactive sunburst plot created with Plotly is available below:

https://medium.com/media/d3bdec9c8f7a6d272ec1b798d90932ef/href

The chart is divided into two main layers, with Brooklyn and Manhattan as the top-level categories. It shows that Brooklyn contains a significantly larger proportion of the total listings than Manhattan, indicating its popularity among Airbnb hosts.

Question 2.1: What are the most expensive and cheapest neighborhoods in NYC to rent an Airbnb?

Coney Island and West Brighton are the most expensive, with average prices of $3,694.62 and $2,607.00 per night, respectively. In contrast, Hollis Hills is the least expensive with an average price of $500 per night, roughly one-seventh of Coney Island’s average.

The top 10 most expensive neighborhoods show a wide range of prices, while there is no notable difference among the top 10 cheapest neighborhoods.
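A minimal sketch of how average prices per neighborhood can be computed with pandas. It assumes the listings file stores prices as strings such as "$1,200.00" and the neighborhood in a column called neighbourhood_cleansed; both the column names and the toy rows are assumptions.

```python
import pandas as pd

listings = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "B", "B", "C"],
    "price": ["$1,200.00", "$800.00", "$95.00", "$105.00", "$60.00"],
})

# Strip the dollar sign and thousands separator before converting.
listings["price_usd"] = (
    listings["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

avg_price = (
    listings.groupby("neighbourhood_cleansed")["price_usd"]
    .mean()
    .sort_values(ascending=False)
)
most_expensive = avg_price.head(10)
cheapest = avg_price.tail(10)
```

Because the mean is sensitive to a handful of extreme listings, a neighborhood with few listings and one very high price can easily top such a ranking.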

Question 2.2: How does the distribution of listing prices vary between boroughs?

The price range for listings in Manhattan is the highest among the boroughs, with an average of $150 per night. Brooklyn follows closely behind with an average of $110 per night. Meanwhile, Staten Island, Bronx, and Queens have similar price distributions.

This distribution and density of prices were expected, given that Manhattan is known to be one of the most expensive places in the world to live, while the Bronx generally has lower living standards.

Question 2.3: What are the seasonal trends in Airbnb pricing and availability in NYC?

https://medium.com/media/edb9ea8dfde811195a3f8899a2a0c5f0/href

Although the average price of Airbnb rentals in NYC remains stable for the most part, two noteworthy fluctuations stand out. On December 4, 2022, the average price dropped significantly to $176.32. Conversely, on December 31, 2022, the average price experienced a sharp increase, reaching $255.64. The surge in price on New Year’s Eve could be attributed to the high demand for accommodations during this popular holiday.
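A daily average like the one plotted above can be derived from the calendar file roughly as follows. The column names (listing_id, date, price) and the string price format are assumptions, and the four toy rows are invented.

```python
import pandas as pd

calendar = pd.DataFrame({
    "listing_id": [1, 2, 1, 2],
    "date": ["2022-12-04", "2022-12-04", "2022-12-31", "2022-12-31"],
    "price": ["$150.00", "$200.00", "$250.00", "$1,000.00"],
})

calendar["date"] = pd.to_datetime(calendar["date"])
calendar["price_usd"] = (
    calendar["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Average nightly price across all listings for each calendar day.
daily_avg = calendar.groupby("date")["price_usd"].mean()

lowest_day = daily_avg.idxmin()
highest_day = daily_avg.idxmax()
```

Using idxmin and idxmax on the daily series is a quick way to locate dips and spikes such as the two December dates discussed above.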

https://medium.com/media/8a2b0b085310432b0d4d9ef526c8fc62/href

In 2023, all Airbnb hosts in NYC reported that their homes were available throughout the year, whereas in 2022, the majority of listings were unavailable. Specifically, 14,353 listings were marked as 0% availability, compared to only 2,096 listings that were available all year.

https://medium.com/media/9f11b388af32c48ea409b3df3c1cb3fd/href

Question 3.1: What are the top 10 reviewed listings in NYC?

The top reviewed listings in NYC are all located in Manhattan, except for Entire New Apartment in Park Slope/Gowanus in Brooklyn. The majority of them are in the Theater District and Financial District neighborhoods.

The features most commonly highlighted in the names of these listings are their views, proximity to attractions such as Central Park, and their modern amenities, such as communal rooftops and new apartments.

Question 3.2: What are the top amenities that guests look for?

The top 25 most frequently used words in Airbnb listing names reveal that hosts emphasize providing comfortable, visually appealing accommodations. “Bedroom,” “room,” “private,” and “apartment” are among the most commonly used words, along with “cozy,” “spacious,” “beautiful,” and “modern.”

Place names such as “Brooklyn,” “Manhattan,” “Williamsburg,” and “East Village” also appear frequently, suggesting that these areas are popular destinations for travelers. Additionally, the words “park” and “heart” imply that some listings are located near tourist attractions or central areas.
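Counting the most frequent words in listing names needs nothing beyond the standard library. The sketch below uses four invented listing names and a tiny, hand-picked stop-word set; both are assumptions for illustration.

```python
import re
from collections import Counter

# Small hand-picked stop-word list (an assumption, not a standard list).
STOPWORDS = {"in", "the", "of", "with", "and", "a", "to", "near"}

def top_words(names, n=25):
    counts = Counter()
    for name in names:
        # Lowercase and keep alphabetic tokens only.
        words = re.findall(r"[a-z]+", name.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

names = [
    "Cozy private room in Williamsburg",
    "Spacious modern apartment near park",
    "Private bedroom in the heart of Manhattan",
    "Cozy apartment with private rooftop",
]

top3 = top_words(names, 3)
# [('private', 3), ('cozy', 2), ('apartment', 2)]
```

Words with equal counts keep the order in which they were first seen, which is why “cozy” is listed before “apartment” here.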

The top 25 used words for amenities suggest that guests prioritize basic necessities and convenience. Wifi, long-term stays allowed, smoke and carbon monoxide alarms, kitchen, essentials, heating, and air conditioning are some of the most common amenities listed.

It also seems that hosts prioritize cleanliness and safety as evidenced by the presence of essentials such as hangers, bed linens, and towels, as well as safety items such as fire extinguishers and first aid kits.

Additionally, many amenities on this list focus on making guests’ stays as easy and convenient as possible, such as self-check-in, dedicated workspace, and free street parking.

Conclusion

This article analyzed Airbnb data from NYC to uncover insights into popularity, pricing, and tourist information. Some high-level takeaways include:

Brooklyn’s Bedford-Stuyvesant is the most popular neighborhood, despite Manhattan being the most well-known borough.
The average listing price in 2022 was $174 per night, with the priciest neighborhoods in Manhattan and Brooklyn. Prices tend to be highest around New Year’s Eve.
In 2022, most listings were unavailable, with 14,353 listings marked as 0% availability compared to only 2,096 available all year.
Visitors tend to stay long-term when visiting NYC. Therefore, they often prioritize basic necessities and convenience, such as WiFi, smoke and carbon monoxide alarms, kitchen, and heating and air conditioning, as evidenced by the most common amenities listed by hosts.

Visit the project on GitHub to explore more!