On the Importance of Data in Training Machine Learning Algorithms — Part One
Data plays a vital role in building machine learning models. Practitioners often overlook the quality of their data and instead search for “better” machine learning algorithms. In this series, I will walk through various characteristics of data that impact the quality of the resulting machine learning model.
We will look at the following characteristics of data that will impact a machine learning model:
- Number of records available to train
- Quality of the training data and its impact
- Augmenting data
In all of the above cases, we will keep the machine learning algorithm fixed. To make this concrete, we will use the newly released TensorFlow Decision Forests library and train a simple RandomForestModel.
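As a sketch of this fixed modeling setup, training a RandomForestModel with TensorFlow Decision Forests looks roughly like the following. The file name `bestbuy_train.csv` and the choice of `level1_category` as the label are assumptions for illustration, not details from the article:

```python
import pandas as pd
import tensorflow_decision_forests as tfdf

# Hypothetical file name; assumed to hold the training records.
train_df = pd.read_csv("bestbuy_train.csv")

# Convert the DataFrame into a TensorFlow dataset that TF-DF can consume.
# level1_category is used as the label purely for illustration.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="level1_category")

# A RandomForestModel with default hyperparameters; the algorithm stays
# the same across all experiments in this series.
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

model.summary()
```

Keeping the model and its hyperparameters constant means any change in test scores can be attributed to the data, not the algorithm.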
Let’s look at the data that we will be using for this exercise. We will use product master data from BestBuy — released under a Creative Commons Zero v1.0 Universal license.
If we have a look at the first few rows of the data, we can see the following:
From the table above, we can see that there are three input fields (description, manufacturer, price) and three output fields (level1_category, level2_category, level3_category). We can gather even more information about the fields using pandas:
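The pandas inspection above can be sketched as follows. The rows here are a toy stand-in mirroring the dataset's six-field schema; the values are illustrative, not taken from the actual BestBuy data:

```python
import pandas as pd

# Toy rows mirroring the schema: three input fields and three output fields.
# Values are illustrative, not from the real dataset.
df = pd.DataFrame(
    {
        "description": ["4K OLED TV", "Stainless steel fridge"],
        "manufacturer": ["LG", "Whirlpool"],
        "price": [1299.99, 899.00],
        "level1_category": ["TV & Home Theater", "Appliances"],
        "level2_category": ["TVs", "Refrigerators"],
        "level3_category": ["OLED TVs", None],
    }
)

print(df.head())  # first few rows of the data
df.info()         # dtype and non-null count for each field
```

`df.info()` is where the field types and missing-value counts described below come from.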
In total, we have 48,087 records with 6 fields: 4 categorical, 1 numerical, and 1 string. To control our experiments, we will set aside 8,087 records as a held-out set and experiment with the remaining 40,000 records. But before we get into the various experimental setups, let us take a look at the distribution of the fields in the dataset we will be working with.
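One way to carve out a fixed-size held-out set is a seeded random sample; a minimal sketch (shown here on a 48-row toy frame standing in for the 48,087 records):

```python
import pandas as pd

def split_holdout(df: pd.DataFrame, holdout_size: int, seed: int = 42):
    """Set aside a fixed-size held-out set; the rest is for experiments."""
    holdout = df.sample(n=holdout_size, random_state=seed)
    train = df.drop(holdout.index)
    return train, holdout

# Toy frame: 48 rows, 8 held out, 40 left to experiment with
toy = pd.DataFrame({"price": range(48)})
train, holdout = split_holdout(toy, holdout_size=8)
print(len(train), len(holdout))  # 40 8
```

Fixing the random seed keeps the held-out set identical across every experiment in the series, so results stay comparable.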
We employ Facets, an open source data visualisation tool that “aids in understanding and analyzing machine learning datasets”. We will first look at the numeric field price, which looks as follows:
One point to note here is that most prices lie below 2.5K, but some records reach values of ~28K; quite a few values in this dataset are outliers. This is also evident from the max value shown in the table summary of this field. Important characteristics to note for numeric fields are the mean and standard deviation of the data. These are especially crucial when the task at hand is to predict the value of the numeric field, i.e., a regression task.
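The same long-tail picture can be recovered without Facets using summary statistics and a standard interquartile-range rule. The price values below are a toy series mimicking the skew described above, not the real data:

```python
import pandas as pd

# Toy price column with a long right tail: mostly modest values,
# one extreme outlier. Values are illustrative only.
prices = pd.Series(
    [19.99, 25.0, 49.5, 75.0, 120.0, 250.0, 399.0, 899.0, 1200.0, 2300.0, 27999.0]
)

print(prices.describe())  # count, mean, std, min, quartiles, max

# Flag values above Q3 + 1.5 * IQR, a common rule of thumb for outliers
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
outliers = prices[prices > upper]
print(outliers)
```

Note how a single extreme value inflates both the mean and the standard deviation, which is exactly why these statistics matter for a regression target.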
If we look at the categorical fields, manufacturer, level1_category, level2_category, and level3_category, we see the following data characteristics:
An interesting point in the summary above is that the categorical field level3_category is missing for ~14% of the records. This indicates that a machine learning algorithm trained to predict the value of this field would learn to predict NaN as one of the categories unless the missing values are handled.
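The missing-value fraction is a one-liner in pandas; a small sketch with a toy column built to have the same ~14% gap:

```python
import pandas as pd

# Toy column where 14 of 100 values are missing, as observed
# for level3_category in the dataset.
level3 = pd.Series(["OLED TVs"] * 86 + [None] * 14)

missing_fraction = level3.isna().mean()
print(f"{missing_fraction:.0%} missing")  # 14% missing

# One common mitigation: map missing values to an explicit category
level3_filled = level3.fillna("Unknown")
```

Whether to drop these records, impute a placeholder like "Unknown", or let the model treat NaN as its own class is a design decision we will revisit when training.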
The summary also shows the most frequently appearing value for each field: Appliances for level1_category and Cell Phones for level2_category.
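These most frequent values can be read directly from a value_counts call; a minimal sketch with illustrative values (not the real distribution):

```python
import pandas as pd

# Illustrative values only; the real frequencies come from the dataset.
level1 = pd.Series(
    ["Appliances", "Appliances", "Appliances", "TV & Home Theater", "Cell Phones"]
)

print(level1.value_counts())           # counts per category, most frequent first
print(level1.value_counts().idxmax())  # the single most frequent value
```

Highly imbalanced category frequencies like these also matter later: a model can score deceptively well by over-predicting the dominant class.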
Summary of Data Exploration
We have explored the dataset we will be working with and seen some basic characteristics of its fields. In the next part of this series, we will set up an experiment to understand the impact of the number of training records on the test scores.
Drop a comment if you think there are other aspects of this data that should be explored.
If you’d like to know what the results look like for various algorithms, head over to part two of this series.