Splitting CSV Into Train And Test Data

Hola everyone!

When working with datasets, a machine learning workflow has two stages: training and testing. A common choice is to use 80% of the data for training and the remaining 20% for testing.

To implement an ML algorithm successfully, you need to be clear about how to split the data into training and testing sets, and this short post covers exactly that.

We will start by installing the packages we need.

Installing Packages For Split

We will use pandas to load the dataset and scikit-learn (imported as sklearn) for its train_test_split() function, which splits the data into the two parts. Both can be installed with pip: pip install pandas scikit-learn

Next, we will start our program by importing the packages needed for the process.

Importing Required Packages

As explained above, we import pandas to load the dataset and train_test_split() from sklearn.model_selection to split it.
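The two imports described above would look roughly like this. The pd alias is the usual convention rather than something the post specifies.

```python
# pandas reads the CSV file into a DataFrame
import pandas as pd

# train_test_split lives in scikit-learn's model_selection module
from sklearn.model_selection import train_test_split
```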

The next step is to import the dataset. We will use the Forest Fires dataset from the UC Irvine Machine Learning Repository.

Importing Forest Fire Data
Forest Fire Data Output
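A minimal sketch of the loading step follows. The file name forestfires.csv is an assumption about how the download was saved, and the three rows below are made-up placeholders that only reuse the dataset's column names, so the snippet runs without the file.

```python
import io

import pandas as pd

# Stand-in for the real file: made-up rows with the Forest Fires column names.
# With the actual download you would pass its path instead, e.g.
#   df = pd.read_csv("forestfires.csv")   # assumed file name
sample_csv = io.StringIO(
    "X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area\n"
    "1,2,mar,fri,85.0,25.0,90.0,5.0,8.0,50,6.0,0.0,0.0\n"
    "3,4,oct,tue,90.0,35.0,650.0,6.0,18.0,33,1.0,0.0,0.0\n"
    "5,6,aug,sat,91.0,40.0,680.0,7.0,14.5,40,1.5,0.0,0.0\n"
)

# read_csv accepts a path or any file-like object
df = pd.read_csv(sample_csv)

# head() shows the first five rows (here, all three placeholder rows)
print(df.head())
```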

The head() function displays the first five rows of the CSV file. Next, we will split this data into labels and features. Labels are the values we want to predict, and features are the values used to predict them.

Dividing Into Labels And Features

Here, we use temp as the label, stored in y, since we want to predict temperature. Every column other than temp is taken as a feature by calling drop() and stored in X.
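The label/feature division can be sketched as below. The small DataFrame is a made-up stand-in using a few of the dataset's column names. The drop(columns=...) form is one of the ways to call drop(); the original may have used drop("temp", axis=1), which does the same thing.

```python
import pandas as pd

# Tiny made-up frame with a subset of the Forest Fires columns (placeholder values)
df = pd.DataFrame({
    "temp": [8.0, 18.0, 14.5],
    "RH": [50, 33, 40],
    "wind": [6.0, 1.0, 1.5],
    "rain": [0.0, 0.0, 0.0],
})

y = df["temp"]                 # label: the column we want to predict
X = df.drop(columns=["temp"])  # features: every column except the label

print(X.columns.tolist())
print(y.name)
```

Note that drop() returns a new DataFrame, so df itself still contains the temp column afterwards.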

Our last step is to split the data into training and testing sets using the train_test_split() function.

Splitting The Data
Training And Testing Data

In the train_test_split() function, we passed the variables X and y that we obtained previously, along with test_size=0.20, which indicates that the test data should be 20% of the total and the remaining 80% should be training data. We used the head() function to print the first five rows of each set. The shape attribute (a property, not a function, so it takes no parentheses) gives the number of rows and columns in each. Notice that the training data has 413 rows whereas the test data has 104 rows. The dataset has 517 rows in total, and 20% of 517 is 103.4, which scikit-learn rounds up to 104, exactly the result we wanted!
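To see the row counts for yourself without the CSV, here is a sketch that runs the same split on a synthetic stand-in with 517 rows. The column values are made up, and random_state=42 is an added assumption so the shuffle is repeatable; the post itself only passes test_size.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 517 rows, the same row count as the train and test sets combined
n_rows = 517
df = pd.DataFrame({
    "temp": [float(i % 30) for i in range(n_rows)],  # placeholder values
    "RH": [20 + (i % 70) for i in range(n_rows)],    # placeholder values
    "wind": [0.5 * (i % 10) for i in range(n_rows)], # placeholder values
})

y = df["temp"]
X = df.drop(columns=["temp"])

# test_size=0.20 keeps 20% of the rows for testing;
# random_state (an assumption here) fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print(X_train.shape, X_test.shape)  # (413, 2) (104, 2)
print(y_train.shape, y_test.shape)  # (413,) (104,)
```

The function returns four objects in a fixed order: training features, test features, training labels, test labels.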

I hope this small skill will come in handy in every machine learning program you write in the future that involves working with CSV files. I also hope you enjoyed the post.

Adios!