Machine Learning Pipeline for Beginners - Retail Returns Dataset Part-II

Wahab Aftab
Geek Culture


This article will help you understand the modeling stage of the machine learning pipeline. We have already completed the pre-modeling phase, which you can read here:

https://wahabaftab.medium.com/machine-learning-pipeline-for-beginners-retail-returns-dataset-part-i-2132cfcc9e6a

The complete code for Parts 1 and 2 is also linked at the end of this article. Now let’s jump straight into modeling our dataset!

Modeling:

A good practice before modeling is to check whether some features are identical or carry the same information; such redundant features add nothing for our model. We visualize a correlation matrix to see how the various features correlate with each other:

Correlation Matrix
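As a rough sketch, a heatmap like the one above can be produced with seaborn (assuming df is the cleaned DataFrame from Part I; the exact styling of the figure may differ):

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()                   # pairwise correlation of all numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='viridis')  # lighter cells indicate stronger correlation
plt.title('Correlation Matrix')
plt.show()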

Lighter colors indicate higher correlation and vice versa. We observe that the columns show little correlation with each other, so each one contributes different information that is useful for our model. Let’s start the modeling now. First, we split our data into a training set and a test set. Since we have 100k rows, holding out 15% of the data as a test set seems sufficient.

import numpy as np
from sklearn.model_selection import train_test_split

x = df.drop(['return'], axis=1)  # training features
y = df['return']                 # target variable
X_train, X_test1, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=42)

# converting to numpy arrays
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test1 = np.array(X_test1)
y_test = np.array(y_test)

The next step is feature scaling, which improves the efficiency and performance of many models. Our dataset has columns whose values span very different ranges, which makes it harder for models to learn. We can fix this by standardizing each feature, i.e. rescaling it to zero mean and unit variance.

# standardizing features
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test1)

StandardScaler fits on the training set, learning each feature’s mean and standard deviation, and then transforms both the training and test sets using those values. Many people make the mistake of also fitting on the test set, but we only need to transform the test set, not fit on it.

For modeling, there are several algorithms to choose from, such as decision trees, logistic regression, SVM, Random Forest, and XGBoost. A deep, detailed explanation of each of these algorithms demands a blog post of its own; here I only outline the reasoning behind the choice. Our problem is binary classification, i.e. we have to predict whether a person will return an item or not (2 values ~ binary), so we use classifiers, and all of the algorithms mentioned work well for binary classification. Tree-based classifiers generally perform well on tabular data like ours and are more robust to outliers, which makes them a suitable choice. Random Forest and XGBoost are both decision-tree-based ensemble algorithms; they are more advanced versions of a single decision tree and close to the current state of the art on tabular data. XGBoost is a bit more complicated to understand, whereas Random Forest simply trains multiple decision trees, pools their outcomes, and predicts the class with the most votes. Let’s use Random Forest for our problem. scikit-learn provides a built-in implementation which we can call as shown:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1,
                             n_estimators=200,
                             class_weight='balanced',
                             min_samples_leaf=5,
                             min_samples_split=10)
# Train the Random Forest classifier
clf = clf.fit(X_train, y_train)

The code above trains our classifier. The values passed to the constructor, such as n_estimators and class_weight, are hyperparameters which we tune manually to get the best performance. Our model is now trained. Let’s do some testing on it:

y_pred = clf.predict(X_test)  # use the model to predict on the test data

Here we use the test set to see what our model predicts. We can then compare the predicted values against the original values to see how well the model performed.

We can see that the model achieved an accuracy of 68% on the test set, which is not bad. We can also use other metrics like F1-score, precision, recall, or ROC AUC to measure performance.
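As a minimal sketch (assuming the clf, X_test, y_test, and y_pred variables from above), these metrics can be computed with scikit-learn:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# compare predictions against the true labels
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))

# ROC AUC uses predicted probabilities for the positive class
y_prob = clf.predict_proba(X_test)[:, 1]
print("ROC AUC  :", roc_auc_score(y_test, y_prob))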

Future Work:

I have applied these metrics in the final code and have additionally made predictions on completely unseen data with no target feature. Feel free to skip the cost part in the final code. With that, we have completed a basic, beginner-level machine learning pipeline. Going further, we could try things like:

  • Using mean instead of median
  • Using one-hot encoding instead of label encoding in some places
  • Removing outliers
  • Using different algorithms
  • Hyperparameter tuning (see the grid-search sketch after this list)
  • Using different performance metrics
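As an example of the hyperparameter tuning mentioned above, here is a minimal sketch using scikit-learn’s GridSearchCV (the parameter grid below is only an illustrative guess, not a set of tuned values):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 400],
    'min_samples_leaf': [1, 5, 10],
    'min_samples_split': [2, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=1, class_weight='balanced'),
                      param_grid, cv=3, scoring='f1')
search.fit(X_train, y_train)       # tries every combination with 3-fold cross-validation
print(search.best_params_, search.best_score_)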

Final Notes:

This article along with Part-I is for beginners who are passionate about machine learning and want to learn. I hope you have learned something after reading this. Please feel free to add any comments. Any feedback is truly appreciated. Don’t hesitate to share this! Thank you!

Final Code:

https://github.com/wahabaftab/Machine-Learning-Pipeline-for-Beginners
