In this post, I am going to walk you through a simple exercise to understand two common ways of splitting the data into the training set and the test set in scikit-learn. The Jupyter Notebook is available here.
Let’s get started and create a Python list of numbers from 0 to 9 using range():
X = list(range(10))
print(X)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Then, we create another list which contains the square values of numbers in X using list comprehension:
y = [x*x for x in X]
print(y)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Next, we will import model_selection from scikit-learn, and use the function train_test_split( ) to split our data into two sets:
import sklearn.model_selection as model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=101)
print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)
X_train: [4, 9, 3, 5, 7, 6, 1]
y_train: [16, 81, 9, 25, 49, 36, 1]
X_test: [8, 2, 0]
y_test: [64, 4, 0]
By specifying train_size as 0.75, we aim to put 75% of the data into our training set and the rest into the test set. Because we only have ten data points, the program split them 7:3. It’s okay to omit the test_size parameter if you have already specified train_size, as long as you don’t mind the warning message :)
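To check the 75/25 rounding yourself, here is a minimal self-contained sketch using the same toy data:

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [x * x for x in X]

# With train_size=0.75 and test_size omitted, the remaining 25% of the
# data goes to the test set: 7 training points and 3 test points.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=101)
print(len(X_train), len(X_test))  # 7 3
```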
Another observation is that the numbers in the lists after splitting do not follow the same ascending order as before. In other words, by default, the program ignores the original order of the data. It randomly picks data points to form the training and test sets, which is usually a desirable feature in real-world applications to avoid possible artifacts from the data preparation process. To disable this behavior, simply set the shuffle parameter to False (default = True).
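For instance, a quick sketch of the unshuffled split on the same toy lists:

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [x * x for x in X]

# shuffle=False keeps the original order: the first 75% of the data
# becomes the training set and the last 25% becomes the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, shuffle=False)
print(X_train)  # [0, 1, 2, 3, 4, 5, 6]
print(X_test)   # [7, 8, 9]
```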
You’ve probably seen people using cross_validation.train_test_split( ) for similar tasks. In fact, it’s just an old way of doing the same thing (the cross_validation module has since been deprecated and removed in newer scikit-learn releases, so this only works in older versions). The following
import sklearn.cross_validation as cross_validation

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.75, random_state=101)
will generate exactly the same outputs as above, given that we assigned the same number to random_state. If you want your results to be stochastic each time, simply leave it as the default value None.
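To see what random_state buys you, here is a small sketch (the seed 101 matches the one used above, but any fixed integer works):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [x * x for x in X]

# Calling the split twice with the same random_state gives identical
# results; with random_state=None, each call would reshuffle the data.
split1 = train_test_split(X, y, train_size=0.75, random_state=101)
split2 = train_test_split(X, y, train_size=0.75, random_state=101)
print(split1 == split2)  # True
```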
Another common way to split the data is K-Folds cross-validation, provided by the KFold class:

from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=5)
X = np.array(X)
y = np.array(y)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("X_test: ", X_test)
X_test: [0 1]
X_test: [2 3]
X_test: [4 5]
X_test: [6 7]
X_test: [8 9]
By specifying the n_splits parameter as 5, both the X and y sets were divided into five folds (the y folds are not shown here). You probably noticed that, this time, the program always picked two neighboring numbers from the original data sets, which means the data points were not shuffled. (Why is the default setting of the shuffle parameter here different from that in train_test_split?) Nevertheless, using
kf = KFold(n_splits=5, shuffle=True)
will give you the same mixing effect on the original data sets as what we’ve seen before.
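A short sketch of the shuffled K-Fold, adding a random_state so the shuffled folds are reproducible (the seed value is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(range(10))

# shuffle=True mixes the indices before splitting them into folds;
# each of the 5 folds still holds exactly 2 of the 10 data points.
kf = KFold(n_splits=5, shuffle=True, random_state=101)
for train_index, test_index in kf.split(X):
    print(X[test_index])
```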
In addition, scikit-learn provides useful built-in functions, such as cross_val_score( ), to calculate the error metrics across multiple folds of test sets when evaluating machine learning models. For example, with a simple linear regression model as a placeholder:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X.reshape(-1, 1), y, cv=kf, scoring='neg_mean_absolute_error')
print(scores.mean())

will report one score of mean absolute error, averaged over the five folds (note that scikit-learn reports it negated, following its higher-is-better scoring convention).
Feel free to create your own Jupyter Notebook and play around with these functions! Have fun!