Splitting CSV Into Train And Test Data
When working with datasets, a machine learning algorithm works in two stages: training and testing. The data is commonly split 80% for training and 20% for testing.
To implement an ML algorithm successfully, you need to be clear about how to split the data into training and testing sets, and this short post covers exactly that.
We will start by installing the packages needed. We will be using pandas to import the dataset we will be working on and sklearn (installed under the package name scikit-learn) for the train_test_split() function, which will be used for splitting the data into the two parts.
Next, we will start our program by importing the packages needed for the process. As explained above, pandas is for importing the dataset and sklearn provides the train_test_split() function.
The next step is importing the dataset. We will use the Forest Fires dataset from the UC Irvine Machine Learning Repository.
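As a minimal sketch of the import and loading steps: the snippet below reads CSV text with pandas. To keep it runnable without a download, it parses a small in-memory stand-in; the column header is assumed to match the Forest Fires file and the two rows are illustrative only. With the real file you would pass its path instead, for example pd.read_csv("forestfires.csv"), where the filename is hypothetical.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split  # used later for the split

# Illustrative stand-in for the downloaded CSV (assumed column layout, sample rows only).
csv_text = """X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# head() shows the first five rows (here only two exist).
print(df.head())
```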
The CSV file contains the following data, displayed using the head() function.
Now, we will split this data into labels and features. Labels are the values we want to predict, and features are the values used to predict the labels.
Here, we have used temp as the label, stored in y, for predicting temperatures. All columns other than temp are taken as features using the drop() function.
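The label and feature separation could look roughly like this. The tiny DataFrame is a made-up stand-in with three columns, and the name X for the features is an assumption; only y and temp come from the post.

```python
import pandas as pd

# Made-up stand-in rows; the real dataset has more columns and rows.
df = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6],
    "RH": [51, 33, 33],
    "wind": [6.7, 0.9, 1.3],
})

# Label: the column we want to predict.
y = df["temp"]

# Features: every column except the label. drop() returns a new
# DataFrame and leaves df itself unchanged.
X = df.drop(columns=["temp"])  # X is a hypothetical variable name

print(list(X.columns))  # ['RH', 'wind']
```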
Our last step is splitting the data into train and test sets using the train_test_split() function. We pass it the features and the labels y that we obtained previously, along with test_size=0.20, which indicates that the test data should be 20% of the total and the remaining 80% should be train data. We used the head() function to print the first five rows of both sets. The shape attribute was used to check the number of rows and columns in each. Notice that the train data has 413 rows whereas the test data has 104 rows, which is about 20% of the original 517 rows, exactly the result we wanted!
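To check the row counts, here is a sketch that runs the same split on a synthetic DataFrame with 517 rows (413 + 104). The values are random placeholders, the names X, X_train and so on are assumptions, and random_state is added only to make the shuffle reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 517 rows; values are random, not real measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temp": rng.uniform(2, 33, size=517),
    "RH": rng.integers(15, 100, size=517),
    "wind": rng.uniform(0.4, 9.4, size=517),
})

X = df.drop(columns=["temp"])  # features
y = df["temp"]                 # label

# test_size=0.20 reserves 20% of the rows (rounded up) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# shape is an attribute, so it is read without parentheses.
print(X_train.shape, X_test.shape)  # (413, 2) (104, 2)
print(y_train.shape, y_test.shape)  # (413,) (104,)
```

The test set comes out at 104 rows rather than 103 because 0.20 × 517 = 103.4 and the test count is rounded up; the train set gets the remaining 413.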
I hope this small skill comes in handy in every machine learning program you write in the future that involves working with CSV files. I also hope you enjoyed the post.