CSV Data Processing Pipeline For ML/DL Projects Using Python
Readily available CSV datasets are often not usable as-is and usually need some cleanup first. This pre-processing is a common step in Machine Learning (ML) and Deep Learning (DL) projects.
In this article, I’ll cover basic techniques for pre-processing and preparing your CSV dataset for ML/DL applications in Python. Depending on the application, you may need to execute some or all of these steps…
Pre-processing Tasks :
- Importing the dataset.
- Separating the dependent and independent variables.
- Handling missing data.
- Encoding categorical data.
- Splitting the dataset (Train Set and Test set).
- Feature Scaling.
Importing the dataset :
Different libraries can be used for importing the dataset; I prefer Pandas, as it’s the easiest to use and provides a lot of data operations in one place. Also, pandas has built-in functions to read CSV files, making the data import a breeze.
import pandas as pd
dataset = pd.read_csv('path_to_dataset/file_name.csv')
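Before moving on, it’s worth taking a quick look at what was loaded. A minimal sketch (using the dataset variable from above) that previews the data and counts missing values per column:
# Preview the first few rows and the column types
print(dataset.head())
dataset.info()
# Count missing values per column (useful before the imputation step later on)
print(dataset.isna().sum())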
Separating dependent and independent variables :
I’ve written a bit about this in my previous article, and I’ll quote the same here.
In simple terms, dependent variables are those that you need to predict, and independent variables are the ones using which we predict the dependent variables. These need to be separated, as only the independent variables are passed to the network, while the dependent ones are required to calculate the difference (loss) between them and the values predicted by the network.
By convention, the independent variable(s) are named ‘X’ and the dependent variable(s) are named ‘y’. Now, considering that the dependent variable is in the last n columns… (1 in this case), the code is as simple as :
n = 1  # number of dependent-variable columns (just the last one here)
X = dataset.iloc[:, :-n].values  # all columns except the last n
y = dataset.iloc[:, -n].values   # the last column; use iloc[:, -n:] if there are several target columns
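A quick way to confirm the separation did what you expect is to check the shapes (a small sketch using the variables above):
print(X.shape)  # (number of rows, number of independent columns)
print(y.shape)  # (number of rows,)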
Extracting rows is simple in Python even without pandas, but extracting columns is a bit tedious. Pandas’ ‘iloc’ indexer is far more convenient and easy to use.
Handling missing data :
Quite often, the dataset will have some missing values. If not dealt with, these missing values will lead to errors while fitting/training our models.
Missing values can be dealt with by either
- Discarding rows that have missing values.
- Filling in (or imputing) the missing values.
Discarding Rows :
In most cases, discarding rows is not advisable: it reduces the amount of training data, and if the dataset was balanced across its attributes (e.g. it had an equal number of entries for each class), removing rows can introduce an imbalance and bias the model you train later.
In the rare case that you do need to remove rows, it can be done in pandas with the ‘dropna’ function :
dataset.dropna(inplace=True)
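If you only want to drop rows that are missing values in particular columns rather than in any column, dropna also accepts a subset parameter. A small sketch, assuming hypothetical column names ‘Age’ and ‘Salary’:
# Drop only the rows where 'Age' or 'Salary' is missing (column names are hypothetical)
dataset.dropna(subset=['Age', 'Salary'], inplace=True)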
Filling in (or imputing) missing values :
Filling in the missing values is a better alternative to removing rows. These values can be filled in using various methods, the most common being replacement with the mean of that column.
Method 1 : Using Pandas
dataset.fillna(dataset.mean(numeric_only=True), inplace=True)  # numeric_only=True skips the non-numeric (categorical) columns
Method 2 : Using sklearn
I prefer using sklearn because here you can specify which value is to be replaced (be it NaN or something else) as well as the technique used for replacement. As mentioned, the most common is replacing with the mean, but other techniques like the mode or median are also used and can be implemented in sklearn. (Specify this via the strategy parameter.)
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
start = 1
end = 3
imputer.fit(X[:, start:end])                          # learn the column means
X[:, start:end] = imputer.transform(X[:, start:end])  # replace the NaNs
Here, I wanted to fill in missing values for columns 2 to 3. Indexing starts from 0, so start is given as 1, and since the end index is exclusive, end is given as 3.
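As mentioned above, the strategy parameter also accepts other options. A short sketch showing the median and the most frequent value (the mode), reusing the same column range as above:
from sklearn.impute import SimpleImputer
import numpy as np

# Replace missing numeric values with the column median instead of the mean
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X[:, 1:3] = median_imputer.fit_transform(X[:, 1:3])

# 'most_frequent' (the mode) also works on string/categorical columns
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')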
Encoding Categorical Data :
Now that we have properly dealt with missing data, it’s time to deal with categorical data.
We can see that the first column in the dataset is categorical (it takes a value from among several categories) and is not numeric in nature. Such attributes need to be converted to numeric form so that they can be easily interpreted by our model in the next stages of the project.
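If you’re not sure which columns are categorical, pandas can list the non-numeric ones for you (a small sketch using the dataset loaded earlier):
# Columns with object (string) dtype are usually the categorical ones
categorical_columns = dataset.select_dtypes(include='object').columns
print(categorical_columns)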
One Hot Encoding :
One Hot Encoder takes all the discrete categories of an attribute and adds that many columns to the data, one per category. The presence of a category is indicated by a value of ‘1’ in its column. (E.g. for the current data, there are 3 different countries in the first column, so One Hot Encoder will add 3 columns to the data: one for France, one for Spain and one for Germany.)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
If you are using Tensorflow for your project you can also use tf.one_hot() function to achieve the same result.
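tf.one_hot works on integer category indices, so the column has to be label-encoded first. A minimal sketch, assuming TensorFlow is installed and using hypothetical category values:
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

countries = ['France', 'Spain', 'Germany', 'Spain']  # hypothetical values
indices = LabelEncoder().fit_transform(countries)    # integer index per category, e.g. [0, 2, 1, 2]
one_hot = tf.one_hot(indices, depth=3)               # one column per category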
Label Encoder :
If the categorical variable is binary, or for some reason the number of columns cannot be increased, a label encoder can be used. The label encoder essentially converts a nominal attribute to numeric by taking the discrete categories and assigning each of them a number.
E.g. here the dependent variable is binary, so if the label encoder is used, ‘No’ will be converted to 0 and ‘Yes’ to 1. (For more categories, the label encoder will assign the next numbers, i.e. 2, 3, 4 and so on.)
This intuitively makes more sense to us, as adding columns in one hot encoding seems redundant, but in practice One Hot Encoding is preferred over Label Encoding. This is because ML algorithms treat numerical attributes by value: the category labelled 4 will be considered ‘greater’ than the one labelled 3 (as 4 > 3), even though in reality both are equally significant and have no inherent order.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
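To make the ordering issue described above concrete, here is what the label encoder does to a hypothetical three-category column; the resulting numbers suggest an order that doesn’t actually exist:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['France', 'Spain', 'Germany', 'Spain']))
# -> [0 2 1 2]  (categories are sorted alphabetically before being numbered)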
In the case of binary attributes, the label encoder is preferred, as the labels will only be 0s and 1s and adding columns for this is unnecessary.
Splitting the data (train set and test set) :
Again, I’ve written a bit about this in my previous article, and I’ll quote the same here.
We will train our network on a certain section of the data. Testing determines how well our model works in the real world. Consider an example where you have been taught a new word and your teacher gives you an example by using it in a sentence. Now, to show that you have understood the meaning of that word, you must use it in another sentence; using it in the same sentence is useless. Similarly, testing our model with the same data on which it was trained is useless. Thus we split the dataset and reserve some part of it for testing, which gives us a realistic estimate of how well the model works on previously unseen data.
That said, let’s move on to the code…
sklearn provides the simplest way to split data into train and test sets, and it’s just a matter of 2 lines. One argument it requires is the test set size, which signifies the fraction of the data you want to use as the test set. (E.g. test_size=0.2 will keep 20% of the data in the test set.)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
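As noted earlier, class balance matters; if your classes are imbalanced, train_test_split can also keep the class proportions the same in both sets via its stratify parameter. A small sketch under that assumption:
# Keep the same class proportions in the train and test sets (useful for imbalanced data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)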
Feature Scaling :
We’re almost done… The last section is feature scaling. As you can see, the last 2 attributes in the training set are numerical. Again, being numerical, ML algorithms will give more significance to attributes with higher values: here the salary attribute has values in the thousands, while ages are less than 100. To remove this bias and give all attributes equal significance, feature scaling is essential.
Feature scaling brings the numerical attributes onto a common scale; StandardScaler, used below, standardizes each column to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
start = 3
X_train[:, start:] = sc.fit_transform(X_train[:, start:])
X_test[:, start:] = sc.transform(X_test[:, start:])
After one hot encoding, the numerical attributes (age and salary) sit at column indices 3 and 4, so the start index is set to 3; giving no end index means all remaining columns through the last one are scaled.
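If you specifically want values bounded to a fixed range (such as -1 to +1) rather than standardized, sklearn’s MinMaxScaler is an alternative. A small sketch, using the same column range as above in place of the StandardScaler calls:
from sklearn.preprocessing import MinMaxScaler

# Scale the numeric columns to the range -1 to +1 instead of standardizing them
mm = MinMaxScaler(feature_range=(-1, 1))
X_train[:, 3:] = mm.fit_transform(X_train[:, 3:])
X_test[:, 3:] = mm.transform(X_test[:, 3:])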
And that’s it… We’ve pre-processed our data and are now ready to use it in our ML/DL application.
Summary :
- Import Dataset : Load dataset. Use Pandas(read_csv).
- Separate Variables : Dependent and independent. Use Pandas(iloc).
- Missing Values : Delete or fill. Use sklearn(SimpleImputer).
- Encode Categorical Data : One hot encoding or label encoder. Use sklearn.
- Split Data : Train set and test set. Use sklearn(train_test_split).
- Feature Scaling : Normalize attributes. Use sklearn(StandardScaler)
Thank you for reading my article. Give a clap if you found it useful.
Resources :
Dataset and code : Check out my GitHub Repository
More About Me At :
LinkedIn : www.linkedin.com/in/rohan-hirekerur