You’ve probably heard of machine learning, given how famous and impactful it has become over the past decade. Many people who lack a working knowledge of how it operates still consider it ‘magic’ that accurately predicts what will happen using data. As the field advances with countless applications, newcomers are increasingly curious about what is actually happening behind the scenes. In this article, we will explore this magic and learn to build a Machine Learning (ML) model from a given dataset. It will help novice learners understand the basic ML pipeline.
To start off, let’s talk about our dataset. It comes from an online retail store that wants to understand which of its customers will return an item they have purchased. When an item is returned, the store incurs losses on shipping charges, since most stores offer free shipping. To mitigate this loss, the store aims to estimate the likelihood that a customer will return an item so it can take counter-measures. You can download the dataset and the complete code from the GitHub link at the bottom of the Part 2 article. Let’s look inside the dataset we are working with:
#Reading the dataset
import pandas as pd

df = pd.read_csv("training_set.csv")
#finding total rows and columns
print("Columns in the dataset: ", df.shape[1])
print("Rows in the dataset   : ", df.shape[0])
df.head()
We observe that we have a total of 14 columns. Out of these 14, the return column is the target variable, where a value of 1 means that the customer will return an item and 0 means that they will not. We have a total of 100,000 rows in the dataset. Let’s see the ratio of values in our target feature:
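A quick way to inspect the class balance is `value_counts`. A minimal sketch, using a tiny stand-in frame in place of the real dataset (only the `return` column name comes from the article):

```python
import pandas as pd

# hypothetical miniature frame standing in for the real dataset
df = pd.DataFrame({"return": [1, 0, 1, 0, 1, 0, 0, 1]})

# normalize=True gives the fraction of each class instead of raw counts
ratio = df["return"].value_counts(normalize=True)
print(ratio)
```

On the real data, this prints the proportion of returned (1) versus kept (0) orders.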
We notice that there is an almost equal number of 1s and 0s in our dataset. This rules out class imbalance and gives us a balanced dataset.
As we aim to discuss the whole ML Pipeline, we cannot jump straight into modelling. We first have to make our dataset suitable for modelling. This includes data cleaning, pre-processing, feature engineering and feature selection. Let’s start with the data cleaning step.
Data Cleaning and Pre-processing:
Our first order of business is to look for null or missing values, as our model cannot make use of them. We can count them in each column as follows:
#checking the count of null values in each column
print(df.isnull().sum())
Now, most of the columns have 0 null values; however, 2 columns contain a substantial number of them. There are several ways to properly deal with such values. I am going to explain 3 basic methods:
- Remove the 2 columns entirely from the dataset. However, since only around 10% of their values are missing, this does not look like a favourable option. Additionally, these columns seem useful later on.
- Delete the rows with missing values. This method can work when there is a negligible number of null values, but in our case we would lose almost 18,000 data records, which is utterly unacceptable.
- Replace each missing value with the mean or median of the respective column. This step is also called data imputation: we calculate the mean or median of the known values and substitute it in place of the nulls. This is good practice as it does not involve losing data. Unfortunately, in our case we cannot calculate a mean date and substitute it directly, so we will have to perform feature engineering first.
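For a plain numerical column, option 3 is a one-liner in pandas. A minimal sketch with toy values (not the real dataset):

```python
import pandas as pd
import numpy as np

# toy column with two missing values (illustrative only)
s = pd.Series([10.0, np.nan, 30.0, 20.0, np.nan])

# replace NaNs with the median of the known values
s_imputed = s.fillna(s.median())
print(s_imputed.tolist())  # → [10.0, 20.0, 30.0, 20.0, 20.0]
```

Our date columns need the extra feature-engineering step first, precisely because a "median date" cannot be computed this way.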
We proceed with option 3 above and progress with the stated feature engineering step as follows:
In this section, we will reformat a couple of features (columns) to make them more useful. We can see that several features are in date/time format. To extract meaningful information from them, let’s start with user_dob. To make this feature simple and easy to understand, we derive the customer’s age from it by subtracting the date of birth from the current date. Since we know this column has null values, we later substitute them with the median of the calculated ages. The relevant code will look something like this:
from datetime import datetime

#function to convert a date string to a datetime object
#for null values set dob as current date to identify them, as their age will be 0
def convert_date(date_str):
    return datetime.now() if pd.isnull(date_str) else datetime.strptime(date_str, '%Y-%m-%d')

#subtracting dob from the current date and converting to years
df['Age'] = datetime.now() - df.user_dob.apply(convert_date)
df['Age'] = df.Age.apply(lambda x: int(str(x).split(" ")[0]) // 365)
The above code computes the age of all consumers. For imputation, a good practice is to find the mean or median of each class and replace every null value with that newly calculated value. I generally find the median more robust than the mean and have used it throughout this article.
#function to impute 0 values with the per-class median
def impute_zeros(x, y):
    x0 = x[x['return'] == 1]  #subset comprising of returns
    x1 = x[x['return'] == 0]  #subset comprising of non-returns
    x0[y] = x0[y].map(lambda v: x0[y].median() if v == 0 else v)  #replacing returns 0s with the returns median
    x1[y] = x1[y].map(lambda v: x1[y].median() if v == 0 else v)  #replacing non-returns 0s with the non-returns median
    return pd.concat([x0, x1])

#imputing 0 values with the median for each class
df = impute_zeros(df, 'Age')
We can see that we have successfully converted date/time data into numerical data.
We can do similar steps for the delivery and order dates. We can find the time an order takes to be delivered by subtracting the order date from the delivery date. This can be a useful feature, as consumers might return items whose delivery took a substantial amount of time. We can also extract the delivery month or weekday from the date: it is possible that items delivered on Mondays have a higher chance of return, or that items ordered around Easter or Christmas are more likely to be kept. Please find the complete code of this step on the GitHub link mentioned at the bottom of the Part 2 article.
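A minimal sketch of these derivations, using a toy frame; the column names `order_date` and `delivery_date` are assumptions based on this article’s description:

```python
import pandas as pd

# illustrative frame; the real column names may differ
df = pd.DataFrame({
    "order_date": ["2013-01-01", "2013-03-20"],
    "delivery_date": ["2013-01-05", "2013-03-25"],
})
df["order_date"] = pd.to_datetime(df["order_date"])
df["delivery_date"] = pd.to_datetime(df["delivery_date"])

# days the order took to arrive
df["delivery_time"] = (df["delivery_date"] - df["order_date"]).dt.days
# calendar features that may correlate with returns
df["month"] = df["delivery_date"].dt.month
df["weekday"] = df["delivery_date"].dt.weekday  # Monday=0 ... Sunday=6
print(df[["delivery_time", "month", "weekday"]])
```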
After computing these features, we can apply the same class-wise imputation to delivery_time, month, and weekday as before, and repeat the Age-style calculation for user_reg. After extracting the useful features, we can remove the original columns as follows:
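Dropping the raw date columns could look like this; the column names here are assumptions based on this article, shown on a toy frame:

```python
import pandas as pd

# stand-in frame with hypothetical column names
df = pd.DataFrame({
    "user_dob": ["1990-01-01"], "order_date": ["2013-01-01"],
    "delivery_date": ["2013-01-05"], "user_reg": ["2012-06-01"],
    "Age": [23], "delivery_time": [4],
})
# drop the original date columns once their engineered features exist
df = df.drop(columns=["user_dob", "order_date", "delivery_date", "user_reg"])
print(list(df.columns))  # → ['Age', 'delivery_time']
```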
Machine Learning models are based on mathematical inputs. They only understand numerical values and perform calculations on them to produce results. Since our dataset contains non-numerical data, also called categorical data, we have to process it in order to make it usable. Let’s see our categorical data columns:
cols = df.columns  #getting total columns
print("Total Columns in dataset are: ", list(cols))
numerical_cols = df.select_dtypes(include='number').columns  #getting only numerical columns
print("Numerical Columns in dataset are: ", list(numerical_cols))
#getting categorical columns by subtracting numerical columns from total columns
categorical_cols = set(cols) - set(numerical_cols)
print("Categorical Columns in dataset are: ", list(categorical_cols))
We observe that there are multiple categorical columns. month and weekday can easily be converted to a numerical format: 1–12 for months and 0–6 for weekdays. For the other columns we have multiple methods, e.g. one-hot encoding or simply assigning a number label to each value. For this beginner’s tutorial I am inclined to use label encoding via the built-in scikit-learn library. Let’s perform label encoding on user_title:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
print("Before Encoding: ", df['user_title'].unique())
#encoding user_title to numerical values
df['user_title'] = le.fit_transform(df.user_title.values)
print("After Encoding: ", df['user_title'].unique())
We can see that before encoding we had 5 different categorical values, and after encoding we have numerical data. We do not need to use a library for month/weekday, since we can assign our own encoding. item_size is the next categorical variable that seems important. Let’s look at its values:
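Inspecting the distinct sizes is a one-liner; the values shown here are illustrative stand-ins, not the real data:

```python
import pandas as pd

# illustrative mixture of numeric and non-numeric sizes
df = pd.DataFrame({"item_size": ["38", "xl", "40", "l", "xxl", "38+"]})
print(df["item_size"].unique())
print(df["item_size"].value_counts())
```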
We can see that item_size has a mixture of non-numerical and numerical values. This is interesting, since we only have to convert a few values. One problem is that we have no mapping of what xl, l, and xxl mean in numerical terms, so we cannot simply replace them. Based on real-world sizing, we can safely assume that xxl > xl > l and so on. Hence, we can divide our numerical data into 5–6 groups, depending on the number of categorical values, and then assign each group to one category of values. We can easily do this with quantiles, which segment data into groups based on values.
#removing + sign from sizes
df['item_size'] = df.item_size.map(lambda s: str(s).rstrip('+'))
#if size is categorical replace it with nan so only numbers remain
df['item_size_numerical'] = pd.to_numeric(df['item_size'], errors='coerce')
df.item_size_numerical.quantile([0.02, 0.08, .12, .3, .6, .8, 0.98])
Here the 0.98 quantile is the value below which 98% of the data lies, while 0.02 marks the value at the bottom 2%. Hence, if 48 is larger than 98% of the values, we can assign xxxl to it, and proceed similarly for the other quantiles.
#map each non-numeric size onto its quantile boundary (boundary values here are illustrative)
size_map = {'s': 36, 'm': 38, 'l': 40, 'xl': 42, 'xxl': 44, 'xxxl': 48}
def size_to_numeric(size):
    if str(size).isdigit():
        return int(size)  #numeric sizes stay as they are
    elif size in size_map:
        return size_map[size]
    else: #for non-reported size we choose the mid value
        return 40
df['item_size'] = df.item_size.map(size_to_numeric)
Now this will convert all our non-numeric sizes to numerical values as follows:
Now that we have fixed the item sizes, the next two categorical variables are color and state. We can reasonably assume that color and state have little to no impact on the rate of returns, hence they are not useful for the model. Furthermore, we can also remove other redundant columns, such as IDs. Our final cleaned dataset is now ready for modelling and looks like this:
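Dropping these remaining columns might look like the sketch below; the column names are hypothetical stand-ins, shown on a toy frame:

```python
import pandas as pd

# stand-in frame; the real column names may differ
df = pd.DataFrame({
    "item_color": ["red"], "user_state": ["Bavaria"],
    "order_item_id": [1], "user_id": [7],
    "return": [0], "Age": [30],
})
# remove color, state, and ID-like columns deemed irrelevant to returns
df = df.drop(columns=["item_color", "user_state", "order_item_id", "user_id"])
print(list(df.columns))  # → ['return', 'Age']
```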
This completes our pre-modelling phase. We have covered all the required steps in detail and can be confident in our dataset for modelling. You can check how to make predictions on the above pre-processed dataset here:
If you think something is wrong or needs changing, please feel free to add a comment. Feedback is appreciated.