The Motivation for Train-Test Split

Ahmed
4 min read · Oct 18, 2022

In supervised Machine Learning (ML) workflows, we often split our dataset into a training dataset and a testing dataset. We use the training dataset to train the model and the testing dataset to evaluate its performance. A common split is 80% for training and 20% for testing. In this article, we’ll explain the rationale and nuances behind the train-test split.

Powered by Data

The core of ML models is the data. The model by itself is just an algorithm for forming rules and decisions based on data. For the model to be useful, it needs data from which it can generate those rules by learning a set of parameters / weights. Ideally, we want to provide the model with as much data as possible so that it can learn effectively. Note: not all data is equal, so what is really meant by “as much data as possible” is “as much high-quality, representative data as possible.” We want the data to capture the signal with low noise (i.e., it’s high quality) and to reflect the data we’d most likely encounter in deployment / application (i.e., it’s representative).

We can compare a model to a student. Both are capable of learning but need exposure in order to actually learn. For students to become proficient, they need readings, assignments, and practice problems, much like a model needs data to become more effective. Furthermore, if the tasks given to students are well-formed and relevant, they will perform better in the real world, just as models do when they receive high-quality, representative data.

Thus, the key takeaway is that we should try to provide as much good data as we can to the model.

Evaluating Performance

After training our model with data, we want to assess its performance. As with students, we administer a test to see whether the model does well. If we give students the same questions they practiced on, it’s hard to distinguish whether they actually learned or merely memorized the solutions. The same can happen with models when they overfit, or “memorize,” the patterns in the training data. Thus, it’s important to use unseen data for testing to validate whether the model actually learned.

Typically, in a supervised machine learning context, we split the original dataset for training and testing. A common partition reserves 80% for training and 20% for testing. We can compare it to a teacher with a question bank: most of the questions are released for practice, with some left over for tests.

As for the train-test ratio, the choice of an 80 / 20 split is loosely inspired by the Pareto principle (also called the 80–20 rule), which states that roughly 80% of effects come from 20% of causes. The Pareto principle isn’t a mathematically guaranteed property, but many observed phenomena roughly follow it: wealth distribution (80% of wealth is held by 20% of people), social media marketing (80% of shares are generated by 20% of posts), agricultural production (20% of farmers produce 80% of crops), and so on. Given how ubiquitous the 80–20 pattern is, 80 / 20 has become a common train-test split ratio within the ML community. Note: other split ratios, such as 70 / 30 or 90 / 10, are also used but are less common.

The key takeaway is that for evaluating ML models, we should use unseen data. Typically, this is done by splitting the original dataset into training and testing, often using an 80 / 20 split.
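
As a minimal sketch of this workflow (assuming scikit-learn is available; the Iris dataset and logistic regression model are illustrative placeholders, not part of the original article), the 80 / 20 split and the evaluation on unseen data might look like this:

```python
# Minimal sketch: 80 / 20 train-test split with scikit-learn.
# The dataset and model choice here are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Reserve 20% of the data for testing; the remaining 80% is used for training.
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # learn parameters from the training set only

# Evaluate on data the model has never seen; a large gap between the two
# scores is one sign of overfitting ("memorization").
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```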

The Challenge of New Data

Given that ML models perform better with more data and need to be evaluated on unseen data, why wouldn’t we just collect new data for testing? Then we could use the full original dataset for training while still evaluating the model on unseen data.

There are several challenges with this approach:

  1. Infeasibility: collecting new data can be time-intensive and / or expensive. In practice, data collection is often one of the biggest bottlenecks in data, AI, and ML workflows.
  2. Heterogeneity: even if collecting data is feasible, the data collected might not align with the original dataset. Perhaps the new data measures a different sample / population or is missing certain features.
  3. Scalability: relying on new data can slow down experimentation. Collection might be slow, so we’d have to wait for enough data before we could discard or green-light a model. Furthermore, our ability to cross-validate, i.e., to evaluate the model on several different splits of the same dataset, becomes limited (see the sketch after this list).
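
As a small illustration of the cross-validation point in item 3 (a minimal sketch assuming scikit-learn; the Iris dataset and logistic regression model are placeholders), reusing a single dataset lets us evaluate the model on several different splits without collecting anything new:

```python
# Minimal sketch: 5-fold cross-validation on a single dataset.
# The dataset and model are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds takes a turn as the held-out test set,
# so every observation is used for both training and evaluation.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```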

The key takeaway is that using the original dataset (assuming it’s high quality and representative) for training and testing removes the hassle of data collection, ensures the data is homogeneous (represents the same population and has the same feature set), and enables faster experimentation.

Conclusion

In supervised ML, we use a single dataset for both training and testing, typically reserving 80% for training and 20% for testing. We rely on the train-test split because collecting new data is difficult, the new data may differ from the original dataset, and waiting for it slows down experimentation.
