Instacart Market Basket Analysis… an end-to-end solution

Business problem:-

Abhinav sharma
The Startup
8 min read · Feb 18, 2021


Suggest relevant products based on a user’s order history; i.e., predict which of the products a user has bought before are most likely to be purchased again.

Source of data :-

Kaggle Instacart Market Basket Analysis: https://www.kaggle.com/c/instacart-market-basket-analysis/overview

Existing approaches to the problem :-

It is a Kaggle competition held in 2017, with about 3,000 entries.

All the solutions can be found here

https://www.kaggle.com/c/instacart-market-basket-analysis/notebooks

My Approach:-

Run the full machine-learning pipeline: EDA, trying out various models, and tuning the best one; then feed the resulting probability values into an expected-F1-maximization algorithm.

Different thresholds for different orders to maximize the F1 score:-

Suppose we have products A and B with probabilities 0.9 and 0.3, respectively.

So, P[only A] = 0.9*(1 - 0.3) = 0.63, P[only B] = (1 - 0.9)*0.3 = 0.03, and P[A and B] = 0.9*0.3 = 0.27.

P[none] = (1 - 0.9)*(1 - 0.3) = 0.07

Now, E(F1 | predict {A}) = 0.81, E(F1 | predict {B}) = 0.21, and E(F1 | predict {A, B}) = 0.71.

Since predicting {A} gives the highest expected F1, we will choose {A} as our prediction.

Furthermore, we only need to check on the order of 2n combinations of products instead of all 2**n subsets, because the optimal prediction is always the k products with the highest posteriors, optionally together with [none].
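As a sanity check, the worked example above can be reproduced by brute-force enumeration of all possible true baskets. This is a minimal sketch, fine for two products; the O(n²) script linked in the references computes the same expectation without enumerating subsets:

```python
from itertools import combinations

def f1(pred, truth):
    # F1 between a predicted set and a true set of products
    if not pred and not truth:
        return 1.0
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def expected_f1(pred, probs):
    # Brute-force E[F1]: enumerate every possible true basket,
    # weight each by its probability under independent products.
    items = list(probs)
    total = 0.0
    for r in range(len(items) + 1):
        for truth in combinations(items, r):
            truth = set(truth)
            p = 1.0
            for it in items:
                p *= probs[it] if it in truth else 1 - probs[it]
            total += p * f1(pred, truth)
    return total

probs = {"A": 0.9, "B": 0.3}
for pred in [{"A"}, {"B"}, {"A", "B"}]:
    print(sorted(pred), round(expected_f1(pred, probs), 2))
# prints 0.81, 0.21 and 0.71 for the three candidate predictions
```

Predicting {A} wins, exactly as in the hand calculation above.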

Data Analysis :-

Dataset distribution

There are three sets: prior, train, and test. The train data contains one order per user, and prior contains the rest of their 3 to 99 order IDs.

Here we can see the distribution of reordered vs. not reordered is 60:40. So there is not much imbalance, which is excellent; we can use simple metrics (such as accuracy).

This is the spice of this competition and why it is hard to climb the ranks.

For a user, we try to find some hidden patterns, for example, their favorite product.

In the dictionary, cola is their favorite product, which they have ordered in each of the last six orders, and the value corresponds to the order in which they didn’t order cola. But why?

We can see they just ordered Fridge Pack Cola instead, which is still cola.

This plot shows the number of orders per user and how often each count occurs.

Kaggle said they would provide 4–100 orders for every user, but here we see 3–100!

Number of products in an order

Interesting to see that some people order more than 100 products!

Observation:- It looks like customers order once every week (check the peak at seven days) or once a month (peak at 30 days). We could also see smaller peaks at 14, 21 and 28 days (weekly intervals).

Let’s look at some user-specific features

reorder ratio = number of products reordered / total number of products ordered
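The per-user version of this ratio is a one-line groupby. A minimal sketch with pandas, where the toy frame is illustrative but the column names (user_id, product_id, reordered) follow the Instacart CSVs:

```python
import pandas as pd

# Toy prior-orders frame; in the real pipeline this is order_products__prior
# joined with orders.
prior = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "product_id": [10, 11, 10, 20, 21],
    "reordered":  [0, 0, 1, 0, 1],
})

# reorder ratio = number of reordered products / total products ordered;
# since "reordered" is 0/1, the per-user mean is exactly this ratio
reorder_ratio = prior.groupby("user_id")["reordered"].mean()
print(float(reorder_ratio.loc[1]))  # user 1 reordered 1 of their 3 products
print(float(reorder_ratio.loc[2]))  # user 2 reordered 1 of their 2 products
```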

Observation:- the reorder ratio is highest on days 0–2, and then it starts declining.

PS:- we don’t know which day of the week these numbers correspond to; an educated guess is that 0 is Sunday.

Observation:- the ratio increases over a short range, 0–35 days, and then decreases sharply.

Inference:- when the days since the last order grow large, we can conclude that the user got bored of that product and probably moved on.

Observation:- the comfortable time for people reordering products is hours 10–15, which is 10 am till 3 pm in a 12-hour cycle.

“User unique products” is the count of distinct products ordered by the user.

Observation:- bizarre-looking plot

Inference:- the day of the week, the hour of the day, and the days since the prior order all map nicely to the reorder ratio.

Product Specific Features:-

Observation:- fruits are the number one category ordered in the entire dataset and have a very high percentage of reorders.

Observation:-

  1. Department-wise, personal care is the most ordered department, followed by beverages and snacks.
  2. It is a different story when observing the reorder stats.
  3. The reorder probability of dairy eggs is the highest, followed by beverages and snacks.
  4. The reorder ratio of snacks dominates, meaning reordered/total orders is very high.
  5. Reorder times mean:- the average number of times a product from that department is reordered; highest for beverages.
Relation between reorder ratio, hour of day, and day of week

Observations

● Hours 5, 6, 7, 8, and 9 are the hottest

● Day 5 seems to be the hottest, followed by 0 and 1

Let’s see which products are popular on a specific day, throughout its 24 hours.

A plot to show which product has the highest reorder ratio on each day and hour: these are seven heatmaps, one per day of the week, with the x-axis being the hour of the day, the y-axis being the most reordered product at that time, and the color bar representing the reorder ratio.

How often do people rebuy?

Inference:- looking at the above three graphs, it is clear that the clusters up to about 40 days are where most reorders happen, and almost none happen after 30 days. If a user takes more than 30 days, they are most likely not reordering that product (or got bored of it), and such cases fall into the 4th or 5th cluster.

Features we will use :-

Baseline model :-

Our baseline model will be a neural network that takes the originally given values as features and tries to map them to the target variable, reordered.

Let’s take a look at the gradients of the last layer.

Dead activations for the last 3 neurons! (ReLU sometimes has this behaviour.)

Why do the neurons die?

relu(x) = max(0, x), so its gradient is 0 for x < 0 and 1 for x > 0.

So those neurons got stuck on the negative side and could not recover: with a zero gradient, their weights stop updating.
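The dying-ReLU effect can be seen directly from the gradient; a minimal NumPy sketch (the helper names are illustrative):

```python
import numpy as np

def relu(x):
    # ReLU activation: passes positive inputs through, zeroes out negatives
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient of ReLU w.r.t. its input: 0 for negative inputs, 1 for positive
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))       # negatives are clipped to 0
print(relu_grad(x))  # gradient is 0 wherever the input was negative
```

A neuron whose pre-activation stays negative for every input gets a gradient of exactly 0 on every batch, so no update can ever pull it back.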

Let’s start modeling

Model 1:- a logistic regression model using the best-found hyperparameters.

Since logistic regression does not necessarily give well-calibrated probability values, which are very important for predicting on the Kaggle test data (I will show you why later), we will pass the logistic regression object into a calibrated classifier to get proper probability values.

Model 2:- here we will see how a neural net performs with feature engineering.

Nothing fancy here, a simple neural net with four dense layers (I tried a DenseNet-type architecture and it didn’t help at all, plus it was a lot slower because of the number of parameters).

The 3rd model will be the queen of machine learning, XGBoost:-

This is slightly better than even our neural net!

The great thing about XGBoost is that we can see which features are the most helpful in predicting the target variable.
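A hedged sketch of inspecting feature importances after training a boosted-tree model. scikit-learn’s GradientBoostingClassifier is used here as a self-contained stand-in for XGBoost (both expose the same feature_importances_ attribute), and the feature names are illustrative, not the post’s exact feature set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, random_state=0)
# Hypothetical feature names in the spirit of this pipeline
names = ["user_reorder_ratio", "product_reorder_ratio",
         "days_since_prior", "order_hour", "order_dow"]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by how much they contributed to the tree splits
for name, imp in sorted(zip(names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

With the real XGBoost model, the same loop over `model.feature_importances_` (or `xgboost.plot_importance`) produces the importance chart.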

Let’s Try Ensembles:- United we stand, divided we fall.

1) Split the whole data into train and test (80–20).

2) In the 80% train set, split the train set into D1 and D2 (50–50).

From D1, we do sampling with replacement to create d1, d2, d3, …, dk (k samples).

We then create k models and train each of them on one of these k samples.

3) Pass the D2 set to each of these k models; we get k predictions for D2, one from each model.

4) Using these k predictions, create a new dataset. Since we already know D2’s corresponding target values, we can train a meta-model on these k predictions.

5) For model evaluation, we use the 20% of the data that we kept as the test set. Pass that test set to each of the base models to get k predictions, create a new dataset from these k predictions, and pass it to the meta-model to get the final prediction. Using this final prediction and the targets for the test set, we calculate the model’s performance score.

Here f1, f2, f3, and so on are the models!
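The five steps above can be sketched with scikit-learn. The base and meta model choices here (small random forests stacked into a logistic regression) and the synthetic data are illustrative, not the post’s exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# 1) 80-20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2) split the train set into D1 (base-model data) and D2 (meta data)
X_d1, X_d2, y_d1, y_d2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0)

k = 5
rng = np.random.default_rng(0)
base_models = []
for i in range(k):
    # bootstrap sample d_i: sample D1 with replacement
    idx = rng.integers(0, len(X_d1), size=len(X_d1))
    m = RandomForestClassifier(n_estimators=50, random_state=i)
    m.fit(X_d1[idx], y_d1[idx])
    base_models.append(m)

# 3-4) base-model predictions on D2 become the meta-model's features
meta_train = np.column_stack(
    [m.predict_proba(X_d2)[:, 1] for m in base_models])
meta_model = LogisticRegression().fit(meta_train, y_d2)

# 5) evaluate on the held-out 20% through the same two-stage path
meta_test = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in base_models])
pred = meta_model.predict(meta_test)
score = f1_score(y_test, pred)
print(round(score, 3))
```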

Kaggle score :-

Method 1 :- np.argmax() on the output probabilities.

Method 2 :- use a global threshold (found using the deep-learning F1 evaluation).

f1_ is the F1 score with 0.21 as the threshold.
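Method 2 is a one-liner once the probabilities exist; a minimal sketch with the 0.21 threshold (the probability values are illustrative):

```python
import numpy as np

# Predicted P(reordered) for five candidate (user, product) pairs
probs = np.array([0.05, 0.22, 0.50, 0.18, 0.90])

# One global threshold applied to every order, unlike the per-order
# thresholds of the E(F1) method
preds = (probs >= 0.21).astype(int)
print(preds.tolist())  # [0, 1, 1, 0, 1]
```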

Method 3 :- the E(F1) optimization script.

Our strongest model is XGBoost, so we will use it for method 3, since the E(F1) script takes time to run on a large dataset (it is O(n**2)).

Future work:-

● Use the output of the last dense layer of the baseline deep learning model as features.

● Use different models, like logistic regression, SVM, neural nets, etc., as base models for the ensemble.

● Predict using a set of thresholds, e.g., .21, .11, …, .51, .45, …, to create k features; pass them into one meta-model and use that model to predict on the test data.

Try this RNN approach: https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/38159

References :-

https://www.kaggle.com/mmueller/f1-score-expectation-maximization-in-o-n

https://www.kaggle.com/mmueller/order-streaks-feature

Github :-

https://github.com/abhiss4/instakart-market-basket-allowance

Linkedin :-

https://www.linkedin.com/in/abhinav-sharma-797a50169/
