Kaggle: Instacart Market Basket Analysis

Srinidhi Karjol · Published in Geek Culture · Aug 23, 2021 · 11 min read

Table of Contents

  • Introduction
  • Business Problem
  • Machine Learning Problem
  • Data Explanation
  • Exploratory Data Analysis
  • Feature Engineering
  • Splitting Data into Train and Test
  • Machine Learning Models
  • Future Work
  • References

Introduction :

Instacart is an American company that operates a grocery delivery and pick-up service in the United States and Canada.

The Instacart app aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples whenever you need them. After you select products through the app, personal shoppers review your order and do the in-store shopping and delivery for you. The company offers its services via a website and a mobile app.

Business Problem :

A proper understanding of the business problem is the most crucial step for any machine learning or deep learning task. Let’s try to understand the business problem first.

Market Basket Analysis is a data mining technique that helps increase sales by giving a better understanding of customer purchasing patterns. Using a customer's purchase history over a period of time, we try to suggest items that are likely to be reordered.

The major benefits are increased sales and customer satisfaction. Using the data, we try to determine which products are bought together, so retailers can optimize product placement and offer deals that encourage customers to purchase items they had not planned to buy, thereby increasing sales. Market Basket Analysis also lets companies identify important, frequently purchased products whose unavailability could hurt their business. For example, if we predict that a particular product will be part of many users' next orders, the company can make sure that product stays in stock.

Machine Learning Problem :

The data science team of Instacart wants us to use their open-sourced data on customer orders over time to predict which previously purchased products will be in a user's next order. The dataset consists of 6 files, each containing different information: the products, the aisles where products are placed, which products were reordered, how many days passed before the user shopped again, and so on.

The dataset is divided into 3 parts: prior, train, and test. Prior orders contain information about users and their past orders and will be used for feature engineering, while the train and test orders will be used for training and testing the model. There are almost 50K products and about 3M orders. A user may or may not reorder products that were part of previous orders, so 'None' can also be the answer to a user's next purchase; therefore, we treat 'None' as a separate product.

Data Explanation :

We are provided with 6 tables, namely -

  • Orders: This table includes all orders, namely prior, train, and test.
  • order_products_train: This table includes training orders and indicates whether a product in an order is a reorder or not (through the reordered variable).
  • order_products_prior: This table includes prior orders. It indicates whether a product in an order is a reorder or not (through the reordered variable).
  • products: This table includes all products and their related information.
  • aisles: This table includes all aisles and their related information.
  • departments: This table includes all departments and their related information.

These tables are related to each other as follows –
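As a sketch of how the tables join on their keys (file names follow the Kaggle competition data; the pandas code is illustrative):

```python
import pandas as pd

# Load the six tables (file names as shipped by the Kaggle competition).
orders = pd.read_csv("orders.csv")                               # order_id, user_id, eval_set, ...
order_products_prior = pd.read_csv("order_products__prior.csv")  # order_id, product_id, reordered, ...
products = pd.read_csv("products.csv")                           # product_id, aisle_id, department_id
aisles = pd.read_csv("aisles.csv")                               # aisle_id, aisle
departments = pd.read_csv("departments.csv")                     # department_id, department

# Orders join products on order_id; products join aisles and departments on their ids.
prior = (order_products_prior
         .merge(orders, on="order_id")
         .merge(products, on="product_id")
         .merge(aisles, on="aisle_id")
         .merge(departments, on="department_id"))
```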

The data contains around 50K products and the purchase history of 206,209 users.
As mentioned above, the dataset is divided into 3 parts: 3,214,874 prior orders, which will be used to create features, 131,209 train orders, and 75,000 test orders.

The number of orders placed per user ranges from 4 to 100.

Exploratory Data Analysis :

Data visualization is the graphical representation of information and data. Our eyes are drawn to colors and patterns, so when we see a chart or plot we can quickly grasp trends and spot outliers.

The original notebook market_basket_analysis_EDA.ipynb has a copious number of plots, and including all of them here would make this blog monotonous, so I include only a few of the important ones.
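As an illustration of how such count plots can be produced (a minimal seaborn sketch, assuming the orders table loaded in the earlier snippet):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Orders per day of week; order_dow runs from 0 to 6 in the orders table.
plt.figure(figsize=(8, 4))
sns.countplot(x="order_dow", data=orders, color="steelblue")
plt.xlabel("Day of week")
plt.ylabel("Number of orders")
plt.title("Orders placed per day of the week")
plt.show()
```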

[Plot: eval_set distribution - counts of prior, train, and test orders]

On which day of the week were most orders placed?


Observations -

From the above plot, we can see that the maximum number of orders were placed during the weekend, i.e., on Saturday and Sunday. This suggests that customers tend to shop more on weekends.

During what time of the day were most orders placed?

Observations -

Most of the orders were placed between 9 a.m. and 5 p.m. During these hours, almost 25,000 orders are placed on average.

How many days after the prior order is the current order placed?

Observations -

  • The maximum number of orders are placed after a gap of 7 days or 30 days, which suggests customers shop on a weekly or monthly basis.
  • A few orders are placed within a gap of just 1 day.
  • Users tend to order at the start and at the end of the week.
  • The maximum number of orders are placed at month-end.

Best Selling Department

Observations -

  • The top 5 best-selling departments are produce, dairy eggs, snacks, beverages, and frozen.
  • We do not have any data on revenue generated per department, so we cannot tell which department is most profitable; we can only speak about the number of sales.

But what is the “produce” department? Let’s break it down by aisle.

Observations -

  • The produce department covers fresh fruits, fresh vegetables, packaged fruits and vegetables, and fresh herbs.
  • Within the produce department, the most orders are placed for fresh fruits and the fewest for packaged produce.
  • The fresh vegetables aisle is not far behind. This hints that most customers ordering here are diet conscious and prefer healthy products over beverages, packaged products, snacks, etc.

Best Selling Aisle Overall

Observations -

  • The most orders are placed for fresh fruits, fresh vegetables, and yogurt.
  • This seems great! The majority of the orders are placed for fresh fruits and vegetables.

Let’s look at the products that were reordered the most

Observations -

  • This is quite similar to the above finding, but this plot shows the exact number of times each of the top 20 products was reordered. Bananas take the top spot with around 400,000 reorders.

Reorder v/s Day Of The Week

Observations -

  • Maximum reorders were done on a Saturday.
  • Least reorders were done on a Wednesday.

Reorder v/s Days Since Prior Order

Observations -

  • We can clearly see that the maximum number of reorders happens after a gap of 7 days.
  • Most reorders are made after a gap of 5 to 7 days or after 30 days.
  • The fewest reorders are made in the 15-to-29-day gap range.
  • Thus, the majority of reorders are made either on a weekly basis or at month-end.

Feature Engineering :

Feature Engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of machine learning algorithms.

I was able to come up with 4 types of features namely,

  1. User-related Features: What is the user behavior like?
  2. Product-related Features: What is the product like?
  3. User x Product-related Features: How does the user feel about the product?
  4. Datetime related features: Day and time of item purchased by the user.

User-related Features -

  • Max, min, and mean number of orders placed by each user
  • Day of the week the user orders the most
  • Hour of the day the user orders the most
  • User Reordered Ratio
  • Total number of orders placed by the user
  • User Average Basket (average basket size)
  • Number of distinct products bought by the user

Product-related Features -

  • Number of times each product was purchased
  • Total number of times the product was reordered — indicating how much the product was liked
  • Product Reorder Ratio
  • Product’s aisle and department reordered ratio

User x Product-related Features -

  • Number of times user ‘A’ bought product ‘B’
  • Number of times user ‘A’ reordered product ‘B’
  • When the user bought the product for the first time
  • Average cart position of a product for each user
  • When the user bought the product for the last time
  • User-Product Order Rate - how frequently the user buys the product
  • Sum across orders of the position in user A’s cart that item B falls into
  • User Product Reorder Ratio
  • Order Streak

In all, I had around 48 features for each ordered product.
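To give a flavor of how such features can be computed (a hedged pandas sketch over the merged prior table from the earlier snippet; the exact definitions in the notebook may differ):

```python
# User reorder ratio: fraction of a user's prior purchases that were reorders.
user_feats = (prior.groupby("user_id")["reordered"]
                   .mean().rename("user_reorder_ratio").reset_index())

# Product reorder ratio: fraction of a product's purchases that were reorders.
prod_feats = (prior.groupby("product_id")["reordered"]
                   .mean().rename("product_reorder_ratio").reset_index())

# User x Product: how many times user A bought product B.
uxp_feats = (prior.groupby(["user_id", "product_id"])
                  .size().rename("uxp_total_bought").reset_index())

# Combine into one row per (user, product) candidate pair.
data = (uxp_feats.merge(user_feats, on="user_id")
                 .merge(prod_feats, on="product_id"))
```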

Feature Importance (Top 30) -

To explain some of the top features -

  • order_streak - the sign indicates the type of streak (ordered vs. not ordered).
  • uxp_order_rate - how frequently a product is bought by the user.
  • user_average_basket - the user’s average basket size.
  • reodered_ratio - the product’s reorder ratio.
  • user_order_starts_at - the order number at which the user’s orders start; this tells us whether the user is an old or a new Instacart customer.

Splitting Data into Train and Test :

Our first step here is to get all the orders whose ‘eval_set’ is train or test. We then merge these orders with the feature data we created from the prior orders. Once we have the train and test orders, we merge the train orders with the order_products_train dataset and replace the NaN values in the reordered column with 0.
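A rough sketch of this step, assuming data is the (user, product) feature table built above:

```python
import pandas as pd

# Attach each user's future order (eval_set is 'train' or 'test' for the last order).
future = orders.loc[orders["eval_set"].isin(["train", "test"]),
                    ["user_id", "order_id", "eval_set"]]
data = data.merge(future, on="user_id")

# Label the train rows: products missing from the train order get NaN, then 0.
opt = pd.read_csv("order_products__train.csv")
train = data[data["eval_set"] == "train"].merge(
    opt[["order_id", "product_id", "reordered"]],
    on=["order_id", "product_id"], how="left")
train["reordered"] = train["reordered"].fillna(0)

test = data[data["eval_set"] == "test"]
```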

Machine Learning Models :

The performance metric we will use to evaluate our models is the F1-score. We will train 6 different models. The first step is to find the best hyperparameters by training and testing each model on a cross-validation set using Random Search or Grid Search; note that the probability threshold should also be treated as a hyperparameter to be tuned. We then train the models with their best hyperparameters, calculate the F1-score on the test data, compare the results, and select the model with the highest F1-score.

The generic sklearn-style models that were trained:

  1. Logistic Regression
  2. Decision Tree Classifier
  3. Random-Forest (ensemble)
  4. LightGBM
  5. XGBoost
  6. CatBoost

Each model will have a different threshold at which it attains its maximum F1-score.

The code snippet below helps us identify the best threshold.
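A minimal sketch of such a sweep, assuming y_proba holds predict_proba outputs for the validation set:

```python
import numpy as np
from sklearn.metrics import f1_score

def find_best_threshold(y_true, y_proba, thresholds=np.arange(0.10, 0.60, 0.01)):
    """Sweep candidate thresholds and return the one maximizing the F1-score."""
    scores = [f1_score(y_true, (y_proba >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Example usage against a fitted model:
# y_proba = model.predict_proba(X_cv)[:, 1]
# best_t, best_f1 = find_best_threshold(y_cv, y_proba)
```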

[Plot: thresholds vs F1-score, with the best threshold marked]

Custom Ensemble Classifier (Stacking):-

Along with the generic sklearn models, let's build a custom ensemble model. The steps followed to implement the custom ensemble are as follows (a code sketch follows the steps) -

1) First, we split the whole data into train and test (80-20).

2) We then split the 80% train set into D1 and D2 (50-50).

3) We sample with replacement from D1 to create k samples d1, d2, d3, …, dk.

4) We create k models and train each of them on one of these k samples.

5) Once the k models are trained, we pass the D2 set to each of them and obtain k predictions for D2, one from each model.

6) Using these k predictions we create a new dataset; since we already know the target values for D2, we train a meta-model on these k predictions.

7) For model evaluation, we use the 20% of the data kept aside as the test set: we pass the test set to each of the base models, get k predictions, create a new dataset from them, and pass it to the meta-model. Once we have the final predictions, we use the test-set targets to calculate the model's performance score.

Here we tune both the number of base models and the combination of base models, since different models can detect different patterns in the data.
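A minimal sketch of this stacking procedure, assuming X and y are NumPy arrays and using LGBMClassifier as a stand-in base model (the actual solution mixes CatBoost, LightGBM, and XGBoost base models):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from lightgbm import LGBMClassifier

def train_custom_ensemble(X, y, k=5, seed=42):
    # Steps 1-2: 80/20 train-test split, then split the train set into D1 and D2 (50/50).
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=seed)

    rng = np.random.default_rng(seed)
    base_models, meta_cols = [], []
    for _ in range(k):
        # Step 3: bootstrap sample (with replacement) from D1.
        idx = rng.integers(0, len(X_d1), len(X_d1))
        # Step 4: train one base model per bootstrap sample.
        model = LGBMClassifier().fit(X_d1[idx], y_d1[idx])
        base_models.append(model)
        # Step 5: predict D2 with this base model.
        meta_cols.append(model.predict_proba(X_d2)[:, 1])

    # Step 6: train a meta-model on the k base-model predictions for D2.
    meta_model = LogisticRegression().fit(np.column_stack(meta_cols), y_d2)

    # Step 7: evaluate on the held-out 20% via the stacked base-model predictions.
    test_meta = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base_models])
    return base_models, meta_model, f1_score(y_te, meta_model.predict(test_meta))
```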

To get more insights about this approach, please refer to the paper below:

https://pdfs.semanticscholar.org/449e/7116d7e2cff37b4d3b1357a23953231b4709.pdf

After comparing the results, we decided to go with the combination below.

  • Combination of base models - CatBoostClassifier, LGBMClassifier, XGBClassifier
  • Number of base models - 5

After choosing the best model, I saved it and ran it on the test data to get the results. For the Kaggle submission, I created functions to merge all of an order's predicted products into a list, along the lines of the sketch below.
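A minimal sketch of the submission formatting, assuming test_preds has order_id, product_id, and a prediction probability column, with the tuned threshold passed in:

```python
def make_submission(test_preds, threshold, path="submission.csv"):
    """Group predicted reorders per order into Kaggle's space-separated format."""
    df = test_preds.copy()
    df["buy"] = (df["prediction"] >= threshold).astype(int)

    def products_as_string(group):
        items = group.loc[group["buy"] == 1, "product_id"].astype(str).tolist()
        return " ".join(items) if items else "None"  # 'None' when nothing is reordered

    sub = (df.groupby("order_id").apply(products_as_string)
             .rename("products").reset_index())
    sub.to_csv(path, index=False)
    return sub
```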

The table below summarizes the F1-scores for all the generic sklearn models and the custom ensemble.

[Image: Kaggle score for the CatBoost model]

Deployment :

To create the web application, Flask is used as the backend framework. The application is then deployed on the cloud to make it available to everyone; an AWS EC2 instance is used for the deployment.

A function in app.py processes the HTML request.
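The original app.py is not reproduced here; a hedged sketch of such a route follows (the endpoint, form field, model path, and the feature-building helper are all assumptions):

```python
from flask import Flask, request, render_template
import joblib

app = Flask(__name__)
model = joblib.load("best_model.pkl")  # assumed path to the saved model
THRESHOLD = 0.2                        # placeholder; use the tuned threshold

@app.route("/predict", methods=["POST"])
def predict():
    # 'user_id' form field and build_features_for_user() are hypothetical.
    user_id = int(request.form["user_id"])
    feats = build_features_for_user(user_id)  # one row per candidate product
    proba = model.predict_proba(feats.drop(columns=["product_id"]))[:, 1]
    products = feats.loc[proba >= THRESHOLD, "product_id"].tolist()
    return render_template("result.html", products=products or ["None"])
```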

Future Work :

As a further extension of my solution, I would like to try out some deep learning models. I also look forward to trying association rules and the Apriori algorithm.

Improvements To Current Approach:

As an improvement to the existing approach, I would like to try the F1-Score Expectation-Maximization implementation by Faron, which should improve the Kaggle score.

As for the custom ensemble, this implementation would likely have achieved a better score had there been no computational resource limitations.

To check out my entire work, kindly visit my GitHub repository: https://github.com/srinidhikarjol/Instakart-Market-Basket-Analysis

My LinkedIn Profile: https://www.linkedin.com/in/srinidhi-karjol-aba072103/

Srinidhi Karjol
Senior Product Developer, Machine Learning Enthusiast