Predicting Store Sales — Random Forest Regression

Algorithm 4 Fun · 9 min read · Feb 13, 2022

In any business, sales are one of the most important drivers of revenue and profit, which is why companies are constantly looking for ways to increase total sales.

This is especially true for web- and app-based companies, where data is generated every day. These companies want to turn that data into actionable insights that increase sales.

In the rest of this article, I will outline some basic methods a company can use to build its own predictive model.

The dataset can be downloaded from this link. It contains many shop_id and item_id combinations, each with the number of items sold per day and the item price. The main goal is to determine how many items will be sold in the upcoming month.

Future Sales Prediction Model

Overview : Develop an ML model to forecast the total number of items sold across multiple stores and products.

Objectives :
1. Analyze and generate insights from the given dataset
2. Determine Models for Forecasting
3. Measure Model Accuracy
4. Generate Prediction Results

File Description :
1. sales_train.csv — the training set. Daily historical data from January 2013 to October 2015.
2. test.csv — the test set. You need to forecast the sales for these shops and products for November 2015.
3. items.csv — supplemental information about the items/products.
4. item_categories.csv — supplemental information about the items categories.
5. shops.csv — supplemental information about the shops.

Data Fields :
1. ID — an Id that represents a (Shop, Item) tuple within the test set
2. shop_id — unique identifier of a shop
3. item_id — unique identifier of a product
4. item_category_id — unique identifier of item category
5. item_cnt_day — number of products sold. You are predicting a monthly amount of this measure
6. item_price — current price of an item
7. date — date in format dd/mm/yyyy
8. date_block_num — a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,…, October 2015 is 33
9. item_name — name of item
10. shop_name — name of shop
11. item_category_name — name of item category

Guidelines

In this article, the following steps make up our model development :
1. Data Preparation, Cleaning, and Manipulation
2. Exploratory Data Analysis
3. Feature Engineering, Model Development and Evaluation
4. Generate Prediction Results

1. Data Preparation, Cleaning, and Manipulation

The first step is loading the files and checking whether there are any null values in the dataset.
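The original loading code is not reproduced here, but a minimal sketch of that step could look like the following ( file names taken from the File Description above; paths are assumed ) :

import pandas as pd
import numpy as np

#Load the competition files ( assumed to be in the working directory )
sales_train = pd.read_csv('sales_train.csv')
test = pd.read_csv('test.csv')
items = pd.read_csv('items.csv')
item_categories = pd.read_csv('item_categories.csv')
shops = pd.read_csv('shops.csv')

#Check for null values in each file
for name, df in [('sales_train', sales_train), ('test', test), ('items', items), ('item_categories', item_categories), ('shops', shops)] :
    print(name, 'null values :', df.isnull().sum().sum())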

After loading the dataset, we merge the tables on their unique IDs and remove duplicates, if any.

#Creating final df for training model and submission
train_final = sales_train.merge(items, on = 'item_id', how = 'left').merge(item_categories, on = 'item_category_id', how = 'left').merge(shops, on = 'shop_id', how = 'left')
test_final = test.merge(items, on = 'item_id', how = 'left').merge(item_categories, on = 'item_category_id', how = 'left').merge(shops, on = 'shop_id', how = 'left')

#Check duplicates in dataset and remove if any
print('Total Rows Before Removing Duplicate : ', train_final.shape[0])
print('Total Duplicate Rows : ', train_final.duplicated().sum())
train_final = train_final[~train_final.duplicated()]
print('Total Rows After Removing Duplicate : ', train_final.shape[0])

Here are the printed results from the above code :

  • Total Rows Before Removing Duplicate : 2935849
  • Total Duplicate Rows : 6
  • Total Rows After Removing Duplicate : 2935843

There are only 6 duplicated rows in the merged dataset, so keeping or dropping them makes virtually no difference to model training.

#Changing date format and add additional columns
#dates are given in dd/mm/yyyy format, so parse them day-first
train_final['date'] = pd.to_datetime(train_final['date'], dayfirst = True).dt.date
train_final['sales'] = train_final['item_price']*train_final['item_cnt_day']
train_final['year'] = pd.DatetimeIndex(train_final['date']).year
train_final['month'] = pd.DatetimeIndex(train_final['date']).month

Since we want to train our model based on year and month, we create 'year' and 'month' columns to group on later.

Next, to reduce noise in the dataset, we remove outliers from the data.

#Printing Values for each percentile
for a in range(0,101,10) :
    print(f'{a}th percentile value for item_cnt_day is {np.percentile(train_final["item_cnt_day"],a)}')
for a in range(0,101,10) :
    print(f'{a}th percentile value for item_price is {np.percentile(train_final["item_price"],a)}')

Results for each loop above :

  1. item_cnt_day :
0th percentile value for item_cnt_day is -22.0
10th percentile value for item_cnt_day is 1.0
20th percentile value for item_cnt_day is 1.0
30th percentile value for item_cnt_day is 1.0
40th percentile value for item_cnt_day is 1.0
50th percentile value for item_cnt_day is 1.0
60th percentile value for item_cnt_day is 1.0
70th percentile value for item_cnt_day is 1.0
80th percentile value for item_cnt_day is 1.0
90th percentile value for item_cnt_day is 2.0
100th percentile value for item_cnt_day is 2169.0

2. item_price :

0th percentile value for item_price is -1.0
10th percentile value for item_price is 149.0
20th percentile value for item_price is 199.0
30th percentile value for item_price is 299.0
40th percentile value for item_price is 349.0
50th percentile value for item_price is 399.0
60th percentile value for item_price is 573.96
70th percentile value for item_price is 799.0
80th percentile value for item_price is 1190.0
90th percentile value for item_price is 1999.0
100th percentile value for item_price is 307980.0

From the results above, almost all the values sit far below the extreme maximums, so we keep only rows where item_cnt_day is greater than 0 and below a high upper quantile, and apply the same filter to item_price ( in the code below, the 96th and 92nd percentiles respectively ).

#Removing Outliers
final = train_final[(train_final['item_cnt_day'] > 0) & (train_final['item_cnt_day'] < train_final['item_cnt_day'].quantile(0.96))]
final = final[(final['item_price'] > 0) & (final['item_price'] < final['item_price'].quantile(0.92))]

After removing duplicates and outliers and checking for null values, the data is ready for exploration and training.

2. Exploratory Data Analysis

Daily sales move sideways over the three years, with some sharp spikes at the end of 2013 and 2014. Since we removed outliers in the previous step, the series is more consistent ( less noise and volatility ).
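The trend chart itself is not reproduced here; a minimal sketch for recreating it, assuming matplotlib and the cleaned dataframe final from the previous step, would be :

import matplotlib.pyplot as plt

#Total sales per day, plotted as a line chart ( sketch, assumes 'final' and its 'sales' column )
daily_sales = final.groupby('date')['sales'].sum()
daily_sales.plot(figsize=(14,5), title='Total Daily Sales')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()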

Distribution Plot

The distribution plots show a right skew in both item_price and sales ( the majority of values are under 500 ), while item_cnt_day is concentrated at 1, as shown in the plot.
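A rough sketch of how such distribution plots could be produced with seaborn ( column names taken from the merged dataframe; seaborn 0.11+ assumed ) :

import seaborn as sns
import matplotlib.pyplot as plt

#Distributions of item_price, sales, and item_cnt_day ( sketch )
fig, axes = plt.subplots(1, 3, figsize=(18,4))
sns.histplot(final['item_price'], ax=axes[0])
sns.histplot(final['sales'], ax=axes[1])
sns.histplot(final['item_cnt_day'], ax=axes[2])
plt.show()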

Let’s list out the Top 10 shops and items in the dataset.
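One way to compute these rankings ( a sketch using the cleaned dataframe final ) :

#Top 10 shops and items by total items sold ( sketch )
top_shops = final.groupby('shop_name')['item_cnt_day'].sum().nlargest(10)
top_items = final.groupby('item_name')['item_cnt_day'].sum().nlargest(10)
print(top_shops)
print(top_items)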

Looking at the Top 10 shops and items, even the top performers sell relatively few items overall, and there is a large gap between the #1 and #10 entries, roughly a 60% difference in total items sold.

Correlation Heat Map

The features above show no strong direct correlation with the number of items sold per day. Other factors likely drive sales, but since the data provided is limited, it would be best to bring in additional data sources to discover what actually drives sales in these shops.
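For reference, a heatmap like the one above can be reproduced along these lines ( a sketch over the numeric columns of final, assuming seaborn is available ) :

import seaborn as sns
import matplotlib.pyplot as plt

#Correlation heatmap over the numeric columns ( sketch )
numeric_cols = ['shop_id', 'item_id', 'item_category_id', 'item_price', 'item_cnt_day', 'sales']
sns.heatmap(final[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.show()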

3. Feature Engineering, Model Development and Evaluation

Since the dataset shows low ( and negative ) correlations between the features and the target variable, we are going to use Random Forest regression to train our model.

Here are the reasons why I decided to use Random Forest :

  • Good for avoiding overfitting
  • Great for experimentation ( flexible parameter adjustment )
  • Handles linear and non-linear relationships well
  • Less influenced ( or not influenced ) by outliers

But there are also several drawbacks to using this algorithm :

  • Computationally costly ( especially if the dataset is large )
  • Hard to interpret the model
  • Little control over what the model does behind the scenes ( the "random" refers to the bootstrap samples and random feature subsets each tree is built on )

So, let's start by splitting our dataset into features and a label before training the model.
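The code below references a monthly_item_cnt dataframe that is not constructed on screen; presumably it comes from grouping the cleaned daily data by year, month, shop and item, roughly like this ( a sketch, not necessarily the exact original code ) :

#Aggregate daily records into monthly totals per ( shop, item ) pair ( assumed construction )
monthly_item_cnt = final.groupby(['year','month','shop_id','item_id'], as_index=False)['item_cnt_day'].sum()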

#Selecting shop_id and item_id that exist in test dataset
test_shop_ids = test['shop_id'].unique()
test_item_ids = test['item_id'].unique()
monthly_item_cnt = monthly_item_cnt[monthly_item_cnt['shop_id'].isin(test_shop_ids)]
monthly_item_cnt = monthly_item_cnt[monthly_item_cnt['item_id'].isin(test_item_ids)]

#Import Model and Metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from numpy import mean, std
from pprint import pprint

X, y = monthly_item_cnt[['year','month','shop_id','item_id']], monthly_item_cnt['item_cnt_day']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
random_rf = RandomForestRegressor(random_state = 42)
pprint(random_rf.get_params())
Random Forest Parameters

Next, we can start adjusting the parameters for our model.

bootstrap = [True, False]
n_estimators = [int(x) for x in np.linspace(start=200,stop=2000, num = 10)]
max_depth = [int(x) for x in np.linspace(start=10, stop = 110, num = 11)]
max_depth.append(None)
max_features = ['auto','sqrt']
min_samples_split = [2,5,10]
min_samples_leaf = [1,2,4]
random_grid = {
    'bootstrap' : bootstrap,
    'n_estimators' : n_estimators,
    'max_depth' : max_depth,
    'max_features' : max_features,
    'min_samples_split' : min_samples_split,
    'min_samples_leaf' : min_samples_leaf
}
pprint(random_grid)
Parameters for Training

Now that everything is set up, we are ready to train our model.

rf_random = RandomizedSearchCV(estimator = random_rf, param_distributions = random_grid,
                               cv = 3, n_jobs = -1, random_state = 42, n_iter = 3,
                               scoring = 'neg_mean_absolute_error', verbose = 2,
                               return_train_score = True)
rf_random.fit(X_train, y_train)

As you can see above, n_iter is set to 3 and cv is set to 3, which means the model will be fitted 9 times ( 3 parameter combinations × 3 folds ), each with a random combination drawn from the parameters above.

Ideally, more search iterations should be run, but my laptop shut down before finishing whenever I set n_iter above 10, so I had to scale it down to 3 ( and even that almost failed ).

After training has finished, we can evaluate the model on the test set we split off earlier.

def evaluate(model, X_test, y_test) :
    #Mean absolute error and a MAPE-based accuracy on the hold-out set
    prediction = model.predict(X_test)
    error = abs(prediction - y_test)
    mape = 100 * np.mean(error/y_test)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} items.'.format(np.mean(error)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))

    return accuracy

random_model = rf_random.best_estimator_
evaluate(random_model, X_test, y_test)

The model showed about 60% accuracy, which honestly is not great, but since training already took over an hour, I decided to stop at this point.

If you have more resources available ( say, more than 32GB of RAM ), I encourage you to experiment with the model by adjusting the parameters and running more search iterations. One advantage of Random Forest is that accuracy can often be improved by finding the right parameters, and that only comes with trial and error; see the sketch below.
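If compute allows, one possible next step is a narrower GridSearchCV around the best parameters found by RandomizedSearchCV; a hedged sketch ( the grid values here are illustrative, not the author's ) :

from sklearn.model_selection import GridSearchCV

#Narrow grid around the region RandomizedSearchCV found promising ( illustrative values )
param_grid = {
    'n_estimators' : [400, 600, 800],
    'max_depth' : [20, 40, 60],
    'min_samples_split' : [2, 5],
    'min_samples_leaf' : [1, 2]
}
grid_search = GridSearchCV(estimator = RandomForestRegressor(random_state = 42),
                           param_grid = param_grid, cv = 3, n_jobs = -1,
                           scoring = 'neg_mean_absolute_error', verbose = 2)
grid_search.fit(X_train, y_train)
evaluate(grid_search.best_estimator_, X_test, y_test)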

4. Generate Prediction Results

In this final section, we use the test data to predict the outcome with our trained model.

#Generate item_cnt_month result using above model
test_df = test.copy()
#use numeric year/month so the features match the training data
test_df['year'] = 2015
test_df['month'] = 11
result_rf = random_model.predict(test_df[['year','month','shop_id','item_id']])
result_rf_df = pd.DataFrame(result_rf)
final_result = pd.merge(test_df, result_rf_df, left_index=True, right_index=True)
final_result = final_result.rename(columns={0:'item_cnt_month'})
final_result = final_result[['ID','item_cnt_month']]
final_result.shape

That's the end of it. If you check the submission tab on the Kaggle page, you will see that the expected submission has 214200 rows and two columns ( ID and item_cnt_month ).
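To produce that file, the final dataframe just needs to be written out; a minimal sketch ( the file name is an assumption ) :

#Write the submission file expected by Kaggle ( file name assumed; 214200 rows, columns ID and item_cnt_month )
final_result.to_csv('submission.csv', index=False)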

I hope you find this article helpful for using Random Forest in your own predictive modeling.

Thank you for reading, and please leave a comment below or reach out via my social media profiles ( or e-mail )!
