Data Preprocessing in Machine Learning Models
Data Preprocessing
Data preprocessing converts raw data into a clean, understandable format. Raw data often contains missing values, noise, and errors, and preprocessing techniques address these problems. The full dataset is then split into a training set and a test set, and the training set is used to train the machine learning model.
Raw Data
Raw data is unprocessed computer data. It may be stored in a file, or it may simply be a collection of numbers and characters stored somewhere on the computer’s hard disk. For example, information entered into a database is often called raw data.
Why Data Preprocessing?
Raw data contains missing values, noise, and errors, so it cannot be fed directly into a machine learning model. Only preprocessed data should be used for training.
Better preprocessing generally improves the accuracy of the model, which makes it one of the most important steps when building machine learning and deep learning models.
Data Preprocessing Techniques
1. Getting the dataset.
2. Importing libraries.
3. Importing dataset.
4. Encoding Categorical Data.
5. Finding Missing Data.
6. Splitting dataset into training and test set.
7. Feature scaling.
1. Getting the Dataset
First, we obtain the dataset from a website or another source. Datasets usually come in CSV or Excel format: CSV stands for comma-separated values, and Excel is the standard Microsoft Excel spreadsheet format. The Titanic dataset is used to demonstrate the preprocessing techniques here.
2. Importing Libraries
All the packages required for data preprocessing are imported first. pandas is used for handling datasets, NumPy is used for array operations on the data, and matplotlib is used for visualizing the data.
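As a minimal sketch, the imports for these three packages could look like the following. The aliases `np`, `pd`, and `plt` are common conventions and are an assumption here, not something the original text specifies.

```python
import numpy as np               # array operations
import pandas as pd              # loading and handling tabular data
import matplotlib.pyplot as plt  # plotting and visualization

# Printing the versions is a quick check that the packages are installed.
print("pandas", pd.__version__)
print("numpy", np.__version__)
```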
3. Importing Dataset
The dataset is loaded into the Python script using pandas.
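A rough sketch of this step is shown below. To keep it runnable without a file on disk, it reads a few made-up rows from an in-memory string. The rows, the column names, and the file name `titanic.csv` in the comment are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real Titanic CSV file.
csv_text = """survived,pclass,sex,age,fare,embarked
0,3,male,22.0,7.25,S
1,1,female,38.0,71.2833,C
1,3,female,,7.925,S
0,3,male,35.0,,S
"""

# With a real file on disk this would be something like:
#   df = pd.read_csv("titanic.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows of the dataset
```

Empty fields in the CSV are read as missing values (NaN), which is what the later missing-data step works on.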
4. Encoding Categorical Data
The dataset has some values in string format, but machine learning models only accept numerical values. The strings are therefore converted into numbers using categorical encoding techniques.
Two techniques:
1. Label encoder
2. One hot encoder
Label Encoder
scikit-learn provides an efficient tool for encoding the levels of a categorical feature as numeric values. LabelEncoder encodes labels with values between 0 and n_classes-1, where n_classes is the number of distinct labels.
The “sex” column is not numerical, so it is converted into numerical values.
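The following is a small sketch of label encoding applied to a “sex” column. The four sample rows and the new column name `sex_encoded` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the "sex" column.
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

encoder = LabelEncoder()
# Classes are sorted alphabetically, so "female" becomes 0 and "male" becomes 1.
df["sex_encoded"] = encoder.fit_transform(df["sex"])

print(list(encoder.classes_))
print(df)
```

Note that the scikit-learn documentation describes LabelEncoder as intended for target labels. For input features, OrdinalEncoder from the same module is the usual alternative and gives the same kind of integer codes.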
One Hot Encoder
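One-hot encoding creates one binary column per category, so no artificial ordering is implied between the categories. The sketch below uses scikit-learn's OneHotEncoder on an “embarked” column. That column name and its values are assumptions for illustration and are not taken from the original text.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three distinct values.
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(df[["embarked"]]).toarray()

# Build readable column names from the categories the encoder learned.
columns = [f"embarked_{c}" for c in encoder.categories_[0]]
one_hot = pd.DataFrame(encoded, columns=columns, index=df.index)

print(one_hot)
```

Each row has exactly one 1 across the new columns, marking the category that row belongs to.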
5. Finding Missing Data
Missing data can be filled in using various methods. The following methods are used here:
1. mean/median
2. zero method
3. most frequent values
Counting the missing values per column shows which columns are affected most. Here we fill the missing values in the “age” and “fare” columns.
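One way to find the columns with the most missing values is to count them per column and sort the result. The sketch below assumes a small made-up DataFrame rather than the real Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some missing entries.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
})

# Count missing values per column, largest count first.
missing = df.isnull().sum().sort_values(ascending=False)

# Show the (up to) five columns with the most missing values.
print(missing.head(5))
```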
Mean/Median
The mean (or median) of the column is computed from its non-missing values and is then used to fill the missing entries.
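A minimal sketch of this method with pandas is shown below. The sample values are made up, and the choice of the mean for “age” and the median for “fare” is only an illustration of both options.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in each column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "fare": [10.0, 20.0, np.nan, 60.0],
})

# mean() and median() skip missing values by default.
df["age"] = df["age"].fillna(df["age"].mean())       # mean of 20, 30, 40
df["fare"] = df["fare"].fillna(df["fare"].median())  # median of 10, 20, 60

print(df)
```

The median is less affected by extreme values than the mean, which can matter for skewed columns.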
Zero Method
The missing values are filled with zeros.
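As a short sketch with made-up sample values, zero filling can be done with `fillna(0)`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing ages.
df = pd.DataFrame({"age": [22.0, np.nan, 26.0, np.nan]})

# Replace every missing value in the column with 0.
df["age"] = df["age"].fillna(0)

print(df)
```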
Most Frequent Values
The missing values are filled with the most frequent value in that column.
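The sketch below shows one way to do this with pandas. The “embarked” column and its values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry.
df = pd.DataFrame({"embarked": ["S", "C", None, "S"]})

# mode() ignores missing values and returns the most frequent value(s);
# [0] takes the first one if there is a tie.
most_frequent = df["embarked"].mode()[0]
df["embarked"] = df["embarked"].fillna(most_frequent)

print(df)
```

This method also works for string columns, where a mean or median cannot be computed.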
6. Splitting Dataset Into Training and Test Set
The dataset is split into a training set and a test set. Only the important features are chosen for model training, because selecting the most relevant features in the dataset can improve accuracy.
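A rough sketch of this step using scikit-learn's `train_test_split` follows. The ten sample rows, the selected feature list, the “survived” target column, and the 80/20 split ratio are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, already-encoded sample rows.
df = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3, 2, 3, 1, 2, 3],
    "sex": [1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0, 14.0, 4.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.07, 11.13, 30.07, 16.70],
    "survived": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
})

# Keep only the features chosen for training (an illustrative selection).
features = ["pclass", "sex", "age", "fare"]
X = df[features]
y = df["survived"]

# Hold out 20% of the rows as the test set; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```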
7. Feature Scaling
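Feature scaling brings the independent features in the data onto a comparable scale, either by standardizing them (zero mean, unit variance) or by rescaling them into a fixed range. It can improve the accuracy of regression algorithms, particularly those that are sensitive to the magnitude of the features.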
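The following sketch standardizes two features with scikit-learn's StandardScaler. The sample arrays are made up, and StandardScaler is one possible choice. MinMaxScaler from the same module is the usual option when a fixed range such as 0 to 1 is wanted.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (columns: age, fare).
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test = np.array([[30.0, 8.05]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse them to scale the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Fitting the scaler on the training set alone keeps information from the test set out of the training process.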
The data is now ready for building machine learning models.
If you want the full code, visit my GitHub page.