On the Importance of Data in Training Machine Learning Algorithms — Part Three

Karthik M Swamy · Published in Analytics Vidhya · Jul 20, 2021

In parts one and two of this blog series, we saw how data played a role in improving the performance of a machine learning algorithm: the first 50% of the data produced a large jump in test performance, while each subsequent batch of data added progressively less information to what the model learns. In this article, we will look at how we built the script behind those experiments and how it can be extended to new datasets.

TL;DR: You can find the complete code used in this blog series here: bit.ly/TFDFData.

We will first prepare the dataset by splitting it into train and test sets. Remember that the test data always remains constant across experiments. The conversion to TensorFlow datasets can be accomplished with TensorFlow Decision Forests’ data loading utility tfdf.keras.pd_dataframe_to_tf_dataset as follows:

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label=label)     
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label=label)
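For reference, the split itself can be produced with pandas before the conversion above. Below is a minimal sketch, assuming the full data lives in a pandas DataFrame df; the split ratio, seed, and helper name split_dataset are illustrative, not taken from the original script:

import numpy as np

# Hypothetical 70/30 split of a pandas DataFrame `df`.
def split_dataset(df, test_ratio=0.30, seed=42):
    rng = np.random.default_rng(seed)
    test_mask = rng.random(len(df)) < test_ratio
    return df[~test_mask], df[test_mask]

train_df, test_df = split_dataset(df)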

With the data split, we can build the classifiers that we would like to compare. The following classifiers will be used for this comparison (a short instantiation sketch follows the list):

  • RandomForest Classifier
  • Gradient Boosted Decision Trees
  • CART
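All three are exposed as Keras-style model classes in TFDF, so they can be swapped in and out with a one-line change. A minimal instantiation sketch, using constructor defaults rather than tuned hyperparameters:

import tensorflow_decision_forests as tfdf

# Each classifier follows the same Keras-style API in TFDF.
rf_model = tfdf.keras.RandomForestModel()
gbt_model = tfdf.keras.GradientBoostedTreesModel()
cart_model = tfdf.keras.CartModel()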

All these models and their API definitions can be viewed on this page. We will now use the train and test datasets to train and evaluate each classifier. A complete method definition for the Random Forest classifier is shown below:

from time import time

import tensorflow_decision_forests as tfdf
from wurlitzer import sys_pipes

def train_rf_model_with_dataframes(train_df, test_df, label):
    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label=label)
    test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label=label)

    # Specify the model.
    model_1 = tfdf.keras.RandomForestModel(num_trees=30)

    # Optionally, add evaluation metrics.
    model_1.compile(metrics=["accuracy"])

    t1 = time()
    # Train the model, forwarding the training logs to the notebook.
    with sys_pipes():
        model_1.fit(x=train_ds)
    train_time = time() - t1

    # Evaluate on the held-out test data.
    evaluation = model_1.evaluate(test_ds)
    return evaluation, train_time

Similarly, we build a method for each of the classifiers under evaluation. For every run, we record a performance metric (test accuracy) and a training metric (training time) so that we can compare the algorithms and their performance on this data.
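Putting this together, the comparison loop might look like the sketch below. The name-to-constructor mapping and the shape of the results dictionary are illustrative assumptions, not the exact structure of the original script:

from time import time

import tensorflow_decision_forests as tfdf

# Illustrative mapping from algorithm name to TFDF model constructor.
MODELS = {
    "Random Forest": tfdf.keras.RandomForestModel,
    "Gradient Boosted Trees": tfdf.keras.GradientBoostedTreesModel,
    "CART": tfdf.keras.CartModel,
}

def compare_models(train_df, test_df, label):
    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label=label)
    test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label=label)

    results = {}
    for name, constructor in MODELS.items():
        model = constructor()
        model.compile(metrics=["accuracy"])

        t1 = time()
        model.fit(x=train_ds)
        train_time = time() - t1

        # evaluate() returns [loss, accuracy] with the metrics compiled above.
        _, accuracy = model.evaluate(test_ds)
        results[name] = {"accuracy": accuracy, "train_time": train_time}
    return results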

New Metric Proposal

From the accuracy numbers, it is quite clear that Random Forests performs slightly better than Gradient Boosted Decision Trees when provided with more data. However, in most practical scenarios, an accuracy improvement of ~3% alone cannot serve as grounds for choosing an algorithm. Another important metric that affects the end user is the training time required to obtain a given model.

When we plot the training time required by each algorithm against the number of records, we see that CART trains in almost constant time across dataset sizes, while the training time of Random Forests and Gradient Boosted Decision Trees grows with the data.

Plot of the train time requirements for each algorithm as a function of number of records used for training

When we look at the plot of the test accuracy obtained for each of these algorithms, we see a completely different picture.

Plot of the test accuracies for each algorithm as a function of number of records used for training
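Plots like the two above can be reproduced with matplotlib. Below is a minimal sketch; the results_by_size dictionary, its shape, and the helper name plot_metric are assumptions for illustration:

import matplotlib.pyplot as plt

# Assumed shape (values are placeholders, not real results):
# results_by_size = {1000: {"CART": {"accuracy": 0.74, "train_time": 0.5}, ...}, ...}
def plot_metric(results_by_size, metric, ylabel):
    sizes = sorted(results_by_size)
    for name in results_by_size[sizes[0]]:
        values = [results_by_size[s][name][metric] for s in sizes]
        plt.plot(sizes, values, marker="o", label=name)
    plt.xlabel("Number of training records")
    plt.ylabel(ylabel)
    plt.legend()
    plt.show()

plot_metric(results_by_size, "train_time", "Training time (s)")
plot_metric(results_by_size, "accuracy", "Test accuracy")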

We see that Random Forests performs better than Gradient Boosted Trees by a marginal percentage point as the number of records used for training grows. A natural next step for any machine learning practitioner is to bring these two metrics together.

One such derived metric is (test accuracy / train second). It collapses the test accuracy and the training time required to achieve that accuracy into a single value.
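In code, this reduces to a one-liner on top of the per-algorithm results collected earlier (reusing the illustrative results dictionary from the sketch above):

# Derived metric: test accuracy per second of training time.
for name, metrics in results.items():
    score = metrics["accuracy"] / metrics["train_time"]
    print(f"{name}: {score:.3f} (test accuracy / train second)")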

For the dataset and the set of algorithms that we evaluated, the final metric works out as follows:

Comparison of the (Test Accuracy / Train Second) metric for the different algorithms

From the table above, we can deduce that CART provides the best test accuracy per second of training. A conventional comparison of algorithm performance would argue that Random Forests performs ~6% better than CART and is therefore the obvious choice. However, on larger datasets, the training time of Random Forests would scale up linearly, and this is exactly why the new metric helps: it captures how much training time an algorithm needs to achieve its test performance. With the derived metric, CART would be the pick among the algorithms compared, provided we are happy with a test accuracy of ~76%.

Summary of Impact of Training Data

In this series, we took a simple tabular dataset and evaluated how the number of training records affects performance on a fixed test dataset. On this particular dataset, we compared the test accuracy and training time of different algorithms. We then used a derived metric to conclude that CART might be the better algorithm, offering the best bang for one’s buck if its test accuracy is acceptable. CART might become the algorithm of choice if we see a similar derived metric on a few more datasets.

In our next blog, we will explore a few more datasets and compare the derived metric at the maximum number of trainable records for each dataset before concluding whether CART is the go-to algorithm of choice for datasets of this type.

Do you think this derived metric makes sense? Can we improve this metric?

Karthik M Swamy is a Sr. Data Scientist at SAP and a Google Developer Expert in Machine Learning.