Analysis Of Instacart From Kaggle Competition

dipak tiwari
Mar 19

Nowadays, thanks to the worldwide reach of the internet, a large number of retail businesses have introduced online shopping, where customers can order everyday products by browsing shopping websites that offer the service. During the online purchasing process, customers' transactions are stored as data for analysis, to understand their purchasing behavior.

Lots of research has been done, and is still in progress, to establish algorithms that enable retail and e-commerce businesses to build a good relationship with customers by giving them outstanding service. The whole purpose of transaction data is to understand the purchasing patterns of customers and of the products they buy. Therefore, I would like to analyze the transaction data provided by the Instacart competition, which was held 3 years ago.

Let's dive into the Instacart problem

Source:- https://www.kaggle.com/c/instacart-market-basket-analysis

Business Problem:

Instacart is an e-commerce platform where customers can purchase products online from nearby grocery stores, and an Instacart personal shopper picks up and delivers the order to the customer's location. In the Instacart competition, each user's purchase history, which is complete temporal data per customer, has been provided, and the problem statement is to predict which previously purchased products will be in a user's next order.

Customers' prior product purchases, with reorders as labels, have been given in the dataset; by analyzing each customer's prior ordering pattern, we have to solve the problem statement. Given this data, there is no cold-start problem where we would have to predict products for new users, as the details of all users are already available in the data.

About the Instacart Market Basket Data Set

Six different datasets about the users have been given. In the section below we will discuss them in detail.

ML formulation :

The given problem statement can be approached in a supervised way, as the label for each query point has already been provided. It is a binary classification problem where 1 stands for reordered and 0 stands for not reordered. The final model we build as the case study progresses will predict the class label for a given query point, where the query point includes all the features for the user and product pair. We have to find the optimized weights as per the performance metric.

Performance metric:

As per the business requirement, the performance of the model is validated by the mean F1 score. The F1 score deals with both precision and recall: it is defined as their harmonic mean, so it is high only when both the false positives and the false negatives are low. An F1 score of 1 means all predicted labels match the true labels, and 0 means the model is a total failure. So, apart from accuracy, it also captures how well the labels are classified.

Formula:

F1_score = 2*(precision * recall)/(precision + recall)
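As a quick sanity check, here is a tiny worked example with made-up labels, showing that sklearn's f1_score matches the formula above:

from sklearn.metrics import f1_score, precision_score, recall_score
# made-up labels, purely to illustrate the metric
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
p = precision_score(y_true, y_pred)  # TP=3, FP=1 -> 0.75
r = recall_score(y_true, y_pred)     # TP=3, FN=1 -> 0.75
print(f1_score(y_true, y_pred))      # 0.75
print(2 * (p * r) / (p + r))         # 0.75, same as the formula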

Summary of the above dataset

The dataset gives a detailed explanation of each user's ordering pattern: everything from the first day a user signed up for the Instacart app to their most recent purchase is given. The prior file tells us whether users have repurchased a previously bought item or are buying it for the first time. The user history file tells us on which day of the week, hour of the day, and day of the month a user purchased a particular product, as well as the relationships between products, departments, and aisles.

This is sufficient to analyze the user ordering pattern, which is the sole purpose of this competition.

How do we start the problem?

Since our aim is to predict which of the previously purchased items will be in the user's next order, the problem can be divided into two parts: one is to establish the relation between the user and the product purchased, and the other is to establish the relation between the user and the reorder. I am going to explain the former, which creates a connection between user and product, as it tells us a lot about both of them.

Let's perform EDA on the dataset

In this section, we will discuss in detail the analysis of the dataset provided by Instacart. The analysis I have done is explained below:

Information related to reorder:

From the above plot, it looks like the data is fairly balanced, with the number of reorders somewhat higher than the number of non-reorders.

Univariate Analysis of the dataset

Information related to Aisles :

% of products reorder aisle wise

From the graph, it is clear that there are some aisle sections from which most of the users' daily or regular products are purchased. The top 2 aisles, fresh fruits and fresh vegetables, account for more user purchases than the others.

Information related to Departments:

% of products reorder department wise

From the graph, it is clear that there are some department sections from which most of the users' daily or regular products are purchased. The top 2 departments, produce and dairy eggs, account for more user purchases than the others.

Information related to products:

% of products reorder product wise

In this plot, we look at the products users order on a daily basis. As we can see from the graph, these are the top 20 products that users purchase most of the time.

Information related to order_dow:

% of products reorder DOW wise

As we can see from the graph, users shop online virtually every day. Orders are highest on the first two days of the week, but it is clear that online shopping continues every day regardless of which day it is. Below, we will further discuss which hours of the day see the biggest rush of orders.

Information related to order the hour of days:

% of products reorder order_hour_of_day wise

Most of the shopping is done in the afternoon. The rise starts at 9 AM, peaks at 1 PM, and stays steady until 6 PM. It means users like to purchase their basic items within this time interval, and the volume slowly decreases as the day passes.

Information related to Days since prior order:

% of products reorder days_since_prior_order wise

As we can see from the graph, most of the orders made by users fall within the first week of the month or on the last day of the month. Shopping gradually decreases from day 8 through day 29. It also tells us that the chances of reordering are higher on the days when shopping is more frequent. It is further noted that some users wish to buy products for a whole week at a time: on days 7, 14, 21, and 28, the reorder ratio is high; we call these active users.

Information related to add_to_cart_order:

For most users, the maximum number of products added to the cart is approximately 30. The majority of users add an average of 15 products to their cart.

Bivariate Analysis of the dataset

Bivariate analysis of order_dow over hour of day:

From the above heat map, it is clear that almost every day the market starts picking up at 7 AM and continues until 6 PM. On the first 2 days of the week, purchasing is busier between 9 AM and 4 PM.

Bivariate analysis of order_dow over days since prior order:

From the graph, the first week of the month is busier than the rest of the days, with a sudden rise on the last day of the month. If we look closely at days 7, 14, 21, and 28, there is a faint vertical line, which indicates that some users do their shopping weekly.

Bivariate analysis of hour of day over days since prior order:

Between 7 AM and 6 PM, ordering is steady over the month, though the intensity of buying slowly decreases after the first week. The 4 faint vertical lines on every last day of the week mark a certain type of user who does weekly shopping within these shopping hours.

Bivariate analysis of products added to the cart:

From the graph, it is clear that products are added to the cart and reordered mostly at cart positions 1 to 45. Only very rarely are more products added and reordered.

Analysis of the weekly buyers over products, departments, and aisles:

These are the products which weekly buyers prefer to buy for the whole week. The top 10 products are found to be reordered the most, day after day.

These are the top 10 departments from which the weekly buyers choose to order products for their whole week's supplies. The top 2 departments account for most of the ordered products.

These are the top 10 aisles from which the weekly buyers choose to order products for their whole week's supplies. The top 2 aisles account for most of the ordered products.

Analysis of the number of products reordered from each department:

The above graph shows what proportion of products from each department has been reordered by users on a regular basis. The top departments by number of reordered products are dairy eggs, produce, and snacks, as well as beverages.

Feature Engineering

Here is some of the feature engineering I have done to boost my model.

reorder_pattern and name_of_day_time:

We have seen from the graphs how user purchase patterns change within a day as well as over the period of a month. So I have binned the hour and the day of the month into named buckets: name_of_day_time (e.g., morning) and reorder_pattern (e.g., weekly).
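A minimal sketch of this bucketing; the bin edges and labels here are illustrative assumptions, not necessarily the exact ones used in the notebook:

import pandas as pd

# assumed bin edges for the named buckets
def name_of_day_time(hour):
    if 5 <= hour < 12:
        return "morning"
    elif 12 <= hour < 17:
        return "afternoon"
    elif 17 <= hour < 21:
        return "evening"
    return "night"

def reorder_pattern(days_since_prior):
    if days_since_prior <= 7:
        return "weekly"
    elif days_since_prior <= 14:
        return "biweekly"
    return "monthly"

orders = pd.DataFrame({"order_hour_of_day": [9, 13, 22],
                       "days_since_prior_order": [7, 14, 30]})
orders["name_of_day_time"] = orders["order_hour_of_day"].apply(name_of_day_time)
orders["reorder_pattern"] = orders["days_since_prior_order"].apply(reorder_pattern)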

Data augmentation:

Since I am planning to use the history of the last 5 purchased items as a feature to boost the model, I am avoiding a separate train/test eval split.

reordered_count:

We count the number of times the particular product has been reordered by the user in the past.

product_reordered_count:

We count the number of times the particular product has been reordered by any user in the past.
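A minimal sketch of both counts, assuming a prior DataFrame shaped like order_products__prior.csv merged with orders (one row per purchased item, carrying user_id, product_id, and reordered):

import pandas as pd
prior = pd.DataFrame({
    "user_id":    [1, 1, 1, 2],
    "product_id": [10, 10, 20, 10],
    "reordered":  [0, 1, 1, 1],
})
# reordered_count: reorders of a product by a specific user
reordered_count = (prior.groupby(["user_id", "product_id"])["reordered"]
                        .sum().rename("reordered_count").reset_index())
# product_reordered_count: reorders of a product across all users
product_reordered_count = (prior.groupby("product_id")["reordered"]
                                .sum().rename("product_reordered_count").reset_index())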

product_count_dict, aisles_count_dict and department_count_dict:

Some products have been ordered many times by users, and likewise there are some departments and aisles from which products are frequently taken. So I decided to use these counts as features in my model.
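These lookups can be built as plain dictionaries; a sketch, assuming the prior history has already been merged with the products, aisles, and departments files:

import pandas as pd
# hypothetical merged frame; ids are illustrative
prior = pd.DataFrame({
    "product_id":    [10, 10, 20],
    "aisle_id":      [3, 3, 7],
    "department_id": [1, 1, 2],
})
product_count_dict = prior["product_id"].value_counts().to_dict()        # {10: 2, 20: 1}
aisles_count_dict = prior["aisle_id"].value_counts().to_dict()           # {3: 2, 7: 1}
department_count_dict = prior["department_id"].value_counts().to_dict()  # {1: 2, 2: 1}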

dow_dict, hour_of_day, days_since_prior, new_name_of_day_time and order_reorder_pattern:

I am basically talking about order_dow, order_hour_of_day, days_since_prior_order, name_of_day_time, and reorder_pattern. Users purchase products in proportion to the above attributes, so I am counting the number of times users purchase items in a given time period, as well as giving weightage to the feature.

dow_by_user:

It indicates how many times a particular user does shopping on a given day of the week.

order_hour_of_day_by_user:

It indicates how many times a particular user does shopping in a given hour of the day.

new_days_since_prior_order_by_user:

It indicates how many times a particular user does shopping in a given period of a month.
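The three *_by_user features above are plain frequency counts over the prior history. A sketch for dow_by_user under assumed column names (the other two follow the same groupby pattern):

import pandas as pd
# one row per prior order, with the day of week it was placed on
prior_orders = pd.DataFrame({"user_id": [1, 1, 1, 2], "order_dow": [0, 0, 6, 3]})
dow_by_user = (prior_orders.groupby(["user_id", "order_dow"])
                           .size().rename("dow_by_user").reset_index())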

new_prouse_order_dow:

It indicates how many times a particular user purchases a particular item on a given day of the week.

new_order_hour_of_day:

It indicates how many times a particular user purchases a particular item in a given hour of the day.

new_days_since_prior_order:

It indicates how many times a particular user purchases a particular item in a given period of a month.

new_product_count_y:

It indicates how many times a particular user purchases a particular item in a given period of the day and a given day of the week.

new_product_count_z:

It indicates how many times a particular user purchases a particular item in a given period of the day, a given day of the week, and a given period of a month.

rank_by_week function:

I am creating a function that returns a data frame in which the products purchased by the user on a given day of the week are ranked. Products with a lower rank are purchased less often by the user, whereas a higher rank indicates the product is purchased more often. In this way, we can give weightage to products on the basis of the day of the week, because users are found to shop mostly on a weekly cycle.
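A minimal sketch of the ranking idea, using the Instacart column names; the actual rank_by_week function may differ in its details:

import pandas as pd
prior = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 2, 2],
    "order_dow":  [0, 0, 0, 1, 0, 0],
    "product_id": [10, 10, 20, 10, 10, 30],
})
counts = (prior.groupby(["user_id", "order_dow", "product_id"])
               .size().rename("purchase_count").reset_index())
# higher rank = purchased more often by that user on that day of the week
counts["rank_by_week"] = (counts.groupby(["user_id", "order_dow"])["purchase_count"]
                                .rank(method="dense"))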

time_in_sec:

I am capturing the time, in seconds since day 1 of the month, at which the particular product is purchased.

Exit:

In the training as well as the test dataset, products are associated with users, and on these pairs we have to train and test the model. I am capturing a feature that tells us whether the particular product had been bought by the user in the same time frame before.

Ex: if in the train dataset the product P is associated with user A and the given time frame is dow C, hour D, and days_since_prior_order E, then I want to know whether user A had purchased the same product P in the same C, D, or E time frame in their previous purchase history. If the answer is yes, 'exit' is 1; if no, it is 0.
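A sketch of this indicator, under assumptions about the frames: history holds the user's prior purchases with their time attributes, and candidates holds the (user, product, time frame) rows to score:

import pandas as pd
history = pd.DataFrame({
    "user_id": [1, 1], "product_id": [10, 10],
    "order_dow": [2, 5], "order_hour_of_day": [9, 14],
    "days_since_prior_order": [7.0, 30.0],
})
candidates = pd.DataFrame({
    "user_id": [1, 1], "product_id": [10, 20],
    "order_dow": [2, 2], "order_hour_of_day": [20, 9],
    "days_since_prior_order": [3.0, 7.0],
})

def exit_flag(row):
    # all prior purchases of this product by this user
    past = history[(history.user_id == row.user_id) &
                   (history.product_id == row.product_id)]
    # 1 if any prior purchase shares the dow, hour, or days-since-prior frame
    match = ((past.order_dow == row.order_dow) |
             (past.order_hour_of_day == row.order_hour_of_day) |
             (past.days_since_prior_order == row.days_since_prior_order))
    return int(match.any())

candidates["exit"] = candidates.apply(exit_flag, axis=1)  # [1, 0]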

Average W2V:

I am using average W2V instead of bag-of-words or TF-IDF because W2V can help us capture similar products, aisles, and department names.
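A minimal sketch of the averaging step, using the gensim 4.x API; the corpus, vector size, and training settings here are illustrative only:

import numpy as np
from gensim.models import Word2Vec

# toy corpus of tokenized product names
product_names = [["organic", "banana"], ["organic", "strawberries"], ["whole", "milk"]]
w2v = Word2Vec(sentences=product_names, vector_size=50, min_count=1, seed=0)

def avg_w2v(tokens, model):
    # average the vectors of the tokens the model knows about
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

vec = avg_w2v(["organic", "banana"], w2v)  # shape (50,)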

per:

I am creating a feature for the percentage of a particular product's orders that are reorders on a particular day of the week, hour of the day, or day of the month. It is carried out for active customers, which means, from the graph, on every weekly mark, i.e., days 7, 14, 21, and 28, there are some users who wish to buy their products for a whole week.
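One plausible construction (a sketch, not necessarily the exact code): the mean of the 0/1 reordered flag per product and day of week gives the reorder percentage, and analogous groupbys give the hour and day-of-month variants:

import pandas as pd
prior = pd.DataFrame({
    "product_id": [10, 10, 10, 20],
    "order_dow":  [0, 0, 1, 0],
    "reordered":  [1, 0, 1, 1],
})
per = (prior.groupby(["product_id", "order_dow"])["reordered"]
            .mean().rename("per").reset_index())
# product 10 on dow 0 -> 0.5, on dow 1 -> 1.0; product 20 on dow 0 -> 1.0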

reorder_status_5:

Creating a feature that tells us how many times the particular product has been reordered in the user's last 5 orders.
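A sketch under assumed column names: keep each user's five most recent orders by order_number, then count reorders per product:

import pandas as pd
prior = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1, 1, 1],
    "order_number": [1, 2, 3, 4, 5, 6, 7],
    "product_id":   [10, 10, 10, 20, 10, 10, 20],
    "reordered":    [0, 1, 1, 0, 1, 1, 1],
})
# keep only each user's last 5 orders
last5 = prior[prior.groupby("user_id")["order_number"]
                   .transform(lambda s: s > s.max() - 5)]
reorder_status_5 = (last5.groupby(["user_id", "product_id"])["reordered"]
                         .sum().rename("reorder_status_5").reset_index())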

Creating the train data on which we have to train:

# order ids on which we have to perform training
train_data = pd.read_csv("order_products__train.csv", usecols=["order_id"]).drop_duplicates()
# merging training order ids with the orders data
train_data = pd.merge(train_data, orders, how="inner", on="order_id")
train_data = pd.merge(train_data, use_prod1, how="inner", on="user_id")

Creating labels for train data:

train = pd.read_csv("order_products__train.csv", usecols=["order_id", "product_id", "reordered"])
train = pd.merge(train, orders, how="inner", on="order_id")
train = train[["user_id", "product_id", "reordered"]]
train_data = pd.merge(train_data, train, how="left", on=["user_id", "product_id"])
# products absent from the train order are not reorders
train_data["reordered"] = train_data["reordered"].fillna(0)

Creating the test data on which we have to test our model:

# order ids for which we have to predict the products that could be in the user's next order
test_data = pd.read_csv("sample_submission.csv", usecols=["order_id"])
# merging test order ids with the orders data
test_data = pd.merge(test_data, orders, how="inner", on="order_id")
test_data = pd.merge(test_data, use_prod1, how="inner", on="user_id")

Models:

I have used altogether 6 models, over which I applied randomized search with the scoring metric set to the F1 score; since the data is huge, the system could not bear a grid search. In addition, I have also tried one MLP architecture just to see how DL would perform.

The models which I have used are given below:

  1. Logistic Regression
  2. CatBoost Classifier
  3. Random Forest
  4. XGBoost Classifier
  5. LGBMClassifier
  6. AdaBoost Classifier
  7. MLP architecture

Since, from the business problem point of view, I have to validate my model on the basis of the mean F1 score, I have used the F1 performance metric for all the models and came to the conclusion that LGBMClassifier gave me the best result of all.

How I have used LGBMClassifier:

Searching for the best parameters:

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

lgbm = lgb.LGBMClassifier(n_jobs=-1)
params = {
    'learning_rate': [0.001, 0.01, 0.1],
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [5, 10, 15, 20],
    'num_leaves': [25, 50, 75],
    'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 4}, {0: 1, 1: 6}],
}
lgbm_cfl1 = RandomizedSearchCV(lgbm, param_distributions=params, scoring='f1',
                               return_train_score=True, cv=3)
lgbm_cfl1.fit(X_train, y_train)

lgbm_cfl1.best_params_
{'class_weight': {0: 1, 1: 4},
'learning_rate': 0.1,
'max_depth': 15,
'n_estimators': 1000,
'num_leaves': 75}

Training my model with the best parameters:

lgbm = lgb.LGBMClassifier(class_weight={0: 1, 1: 4}, learning_rate=0.1, max_depth=15,
                          n_estimators=1000, num_leaves=75, random_state=0)
lgbm.fit(X_train, y_train)

F1 scores from the model on train and test data:

from sklearn.metrics import f1_score
print('Train f1 score',f1_score(y_train,lgbm.predict(X_train)))
print('Test f1 score',f1_score(y_test,lgbm.predict(X_test)))
Train f1 score 0.4665766693810759
Test f1 score 0.44187727674179583

Confusion matrix on train and test data:
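The plotted matrices are omitted here; a sketch of how they can be computed from the fitted model above:

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_train, lgbm.predict(X_train)))
print(confusion_matrix(y_test, lgbm.predict(X_test)))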

Feature importances:

Feature Importances
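One way to reproduce such a chart from the fitted model (a sketch, not necessarily the original plotting code):

import lightgbm as lgb
import matplotlib.pyplot as plt
lgb.plot_importance(lgbm, max_num_features=20)
plt.tight_layout()
plt.show()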

Summary of all the models, shown in a pretty table:

My submission to Kaggle:

The Kaggle score of the best model is 0.38205.

The public score is 0.38205.

The private score is 0.38066.

This score put me within the top 20% of the leaderboard.

Deployment:

I have used Flask for the deployment of this model.
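A minimal Flask sketch of the kind of prediction endpoint this implies; the route, payload format, and saved-model filename are assumptions, not the actual deployment code:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("lgbm_model.pkl")  # hypothetical saved model file

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # assumed: list of feature values
    pred = model.predict([features])[0]
    return jsonify({"reordered": int(pred)})

if __name__ == "__main__":
    app.run()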

Work I would like to do in the future:

Since I have only used the first approach, which models the relationship between the user and the product, in the future I would like to use the relation of the user with the reorder probability, merge it with the above, and find the maximum mean F1 score. I still think I could have done better, so I would read more research papers and come up with more potential feature engineering to help my model score higher. The highest F1 score on the Kaggle leaderboard is 0.407. I would love to dive in and find more solutions in the future.

References:

  1. https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
  2. https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/38126
  3. https://github.com/pratikparija93/Instacart-Market-Basket-Analysis
  4. https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/38161
  5. https://vishalmendekarhere.medium.com/instacart-market-basket-analysis-challenged-e39d3c550bbd
  6. https://www.appliedaicourse.com/

You can check out all the details about this case study from my GitHub link, which is given below:

My LinkedIn:
