How To Split Train and Test Data

Anya
5 min read · Dec 28, 2018


Concepts explained for beginners!

Photo by Stephen Dawson on Unsplash

Machine learning is one of the hottest fields in data science. Machine learning algorithms can be applied to many problems, ranging from sentiment analysis to stock price prediction to disease classification using brain imaging. The data that is fed to the algorithm is essential to the accuracy of the outcome. To get models that make good predictions, it is crucial to separate the data properly into features and labels, and into training and testing sets.

In this tutorial I will explain the concepts of train and test data and give a mini demo of how to split data using Scikit-Learn in Python.

First, let’s look at what I mean by features, labels, train data and test data, so that we understand how to break down the information we have. I have created a toy dataset to give us a visual way of thinking about these concepts in a data frame. The toy dataset has 4 columns that give information about dogs: weight in kilograms, height in centimeters, activity level on a scale of 1–10, and ear length, split into two categories, short and long. Given the weight, height and activity level of a dog, we want to find out whether the dog will have short or long ears. (Just a friendly reminder: this dataset is made up to illustrate the method and is not based on any real data.)

The first 4 rows of our imaginary toy dataset (Again, please note: the toy dataset is not a real dataset, and it is too small for real machine learning. It is only here to explain the concepts.)

Let’s start with labels. Labels are what we want to predict or classify. In other words, labels are the outcomes we want from our algorithm. In our toy dataset, we want to predict if a dog has long or short ears, so our label is the column with the header ‘short/long’. Since there are only two options in the data, ‘long’ and ‘short’, they are represented by the numbers 0 for short and 1 for long. This is a binary classification problem, since we want to classify each dog as either a 1 or a 0, long or short. The other columns, weight, height and activity level, give us information about the length of the dog’s ears. These are the features.
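Here is a minimal sketch of how text labels could be turned into the 0/1 encoding described above. The `ears` series is made up for illustration, and the variable names are assumptions, not part of the original dataset.

```python
import pandas as pd

# Hypothetical raw labels, before they are turned into numbers.
ears = pd.Series(["long", "short", "long", "short", "short"])

# Map each category to the integer used in the toy dataset:
# 0 for short ears, 1 for long ears.
label = ears.map({"short": 0, "long": 1})

print(label.tolist())
```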

Features and Label shown

When splitting the data, X is conventionally the features and y is the label.

We would specify the X and y as shown below before splitting the data.

In some real-world datasets, some columns might be unrelated to the label. In that case, I recommend either excluding them from the features or combining them with other features in the dataset to create a new, more informative column.
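As a rough sketch of both options, assume a hypothetical data frame with a `name` column that has nothing to do with ear length and a `food_g` column (daily food in grams) that only becomes meaningful relative to the dog's weight. Neither column exists in the toy dataset; they are invented here purely for illustration.

```python
import pandas as pd

# Made-up rows; "name" and "food_g" are hypothetical columns.
dogs = pd.DataFrame({
    "name": ["Rex", "Bella", "Max"],
    "food_g": [300, 360, 100],
    "weight": [10, 12, 5],
    "height": [20, 13, 10],
    "ears": [1, 1, 0],
})

# Option 1: leave the unrelated column (and the label) out of the features.
features = dogs.drop(columns=["name", "ears"])

# Option 2: combine a column with another feature to build a new column,
# then drop the raw column it was built from.
features["food_per_kg"] = features["food_g"] / features["weight"]
features = features.drop(columns=["food_g"])

print(features)
```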

First, we will import our dependencies: NumPy, pandas, and scikit-learn.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

We will display our data frame to get a better sense of what we would like to predict or classify. In this case I have created our toy dataset from scratch and converted it to a pandas data frame.

a = [[10, 20, 3, 1],
     [10, 10, 2, 0],
     [12, 13, 4, 1],
     [5, 10, 8, 0],
     [10, 17, 10, 0],
     [10, 20, 3, 1],
     [10, 10, 2, 0],
     [12, 13, 4, 1],
     [5, 10, 8, 0],
     [10, 17, 10, 0],
     [10, 20, 3, 1],
     [10, 10, 2, 0],
     [12, 13, 4, 1],
     [5, 10, 8, 0],
     [10, 17, 10, 0],
     [10, 20, 3, 1],
     [10, 10, 2, 0],
     [12, 13, 4, 1],
     [5, 10, 8, 0]]
df = pd.DataFrame(a, columns=["weight", "height", "activity level", "short(0)or long(1) ears"])
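To actually display a data frame, something like the sketch below should work. It rebuilds only the first four rows so that it runs on its own; the name `preview` is an assumption chosen to avoid overwriting `df`.

```python
import pandas as pd

# First four rows of the toy dataset, repeated here so the block is self-contained.
rows = [[10, 20, 3, 1],
        [10, 10, 2, 0],
        [12, 13, 4, 1],
        [5, 10, 8, 0]]
preview = pd.DataFrame(
    rows,
    columns=["weight", "height", "activity level", "short(0)or long(1) ears"],
)

print(preview.head())   # shows the first rows (up to five by default)
print(preview.shape)    # (number of rows, number of columns)
```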

Let’s define X by converting the feature columns into a numpy array, excluding the ear length column. For y, we will convert only the label column into a numpy array. Please note that scaling and normalizing the features is also important (the scaler is usually fit on the training data only, so that no information leaks from the test set), but it will not be covered in this post.

# Features: every column except the label
X = np.array(df.drop(columns=['short(0)or long(1) ears']))
# Label: only the ear length column
y = np.array(df['short(0)or long(1) ears'])

It is a good habit to check the shapes of X and y to make sure they are in line with each other: X needs one row per sample and y needs one label per row. Otherwise, mismatched shapes will cause errors when splitting or training. It can also be helpful to print out X and y to see if the information we want is in place. Let’s display the shapes.

X.shape

This can be done simply with the .shape attribute. For the toy dataset above, X.shape is (19, 3) and y.shape is (19,).

y.shape
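The check can also be written as an explicit assertion, as in the sketch below. The arrays `X_demo` and `y_demo` are small stand-ins with the same layout as X and y, so the block runs on its own.

```python
import numpy as np

# Stand-in arrays with the same layout as X and y in this post.
X_demo = np.array([[10, 20, 3], [10, 10, 2], [12, 13, 4], [5, 10, 8]])
y_demo = np.array([1, 0, 1, 0])

print(X_demo.shape)  # (number of samples, number of features)
print(y_demo.shape)  # (number of samples,)

# The first dimension must match: one label for every row of features.
assert X_demo.shape[0] == y_demo.shape[0], "X and y need one label per row"
```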

Now that we have X and y, we can split the data using Scikit-Learn’s train_test_split function. It divides the data into four parts: the training features from X, often named X_train; the matching training labels from y, named y_train; and the data held back to evaluate, or in other words test, how well the machine learning model performs, named X_test and y_test.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
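To see what the split produces, the sketch below runs the same call on stand-in arrays with 19 samples, the same number as the toy dataset. The names `X_demo`, `y_demo`, `X_tr` and so on are assumptions used only here. With 19 samples and test_size=0.2, scikit-learn rounds the test set up to 4 samples and leaves 15 for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 19 samples with 3 features each, plus one label per sample.
X_demo = np.arange(19 * 3).reshape(19, 3)
y_demo = np.arange(19) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Features keep their 3 columns; only the number of rows changes.
print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```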

The test_size refers to how much of the data will be set aside as test data. In this case 0.2 refers to 20% of the data. Given as a fraction, this number should be between 0 and 1. The bigger the test_size, the less data is left for training, which might result in our model not having enough data to learn from. Although there is no hard rule for the test_size, it is usually around 20% of the data. The random_state is a number that seeds the random number generator used to shuffle the data before it is split. Using the same number gives the same split every time, which makes the results reproducible. The particular number we choose should not carry much weight on the model or the predictions.
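A quick way to convince yourself that random_state makes the split reproducible is to run the split twice with the same value and compare the results. This is a minimal sketch on made-up arrays; `data`, `target`, `test_a` and `test_b` are names chosen for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
data = np.arange(20).reshape(10, 2)
target = np.arange(10)

# Two separate calls with the same random_state.
_, test_a, _, _ = train_test_split(data, target, test_size=0.2, random_state=0)
_, test_b, _, _ = train_test_split(data, target, test_size=0.2, random_state=0)

# The same rows end up in the test set both times.
print(np.array_equal(test_a, test_b))
```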

There you have it! Your data is split into train and test sets, ready for training and testing your machine learning algorithm.
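As one possible next step, the sketch below fits a model on the training part and scores it on the test part. The choice of DecisionTreeClassifier is an assumption made for illustration, not something this post prescribes, and the variable names are invented for this example. Because the toy rows repeat, the same dogs appear in both train and test, so the score says nothing about real performance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# The five distinct toy rows, repeated to get 20 samples.
rows = np.array([[10, 20, 3, 1],
                 [10, 10, 2, 0],
                 [12, 13, 4, 1],
                 [5, 10, 8, 0],
                 [10, 17, 10, 0]] * 4)
features, labels = rows[:, :3], rows[:, 3]

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Train only on the training split, evaluate only on the test split.
model = DecisionTreeClassifier(random_state=0)
model.fit(f_train, l_train)
accuracy = model.score(f_test, l_test)

print(accuracy)
```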
