Impute Missing Values With SciKit’s Imputer — Python
Missing values can lead to inconsistent results
This article is focused on substituting the missing values in the dataset using the SciKit Imputer.
Missing values can lead to inconsistent results. We can either ignore the rows with missing data columns or substitute the values with some calculated output.
When the dataset is too small, we can’t afford to lose the row data even if it contains missing columns. In those cases, we will look at substituting the column data with some values. This article is focused on this issue.
We will talk about the code and concept in detail.
Let's consider the data available to us.
The data contains some missing values in the age column. Missing values are marked as NaN. We need to look for ways of handling these missing data points.
The missing data can be handled in the following ways:
- Removing rows with missing data.
- Replacing NaN with data.
Removing Rows With Missing Data
As stated earlier, ignoring the rows with the missing data can lead to inconsistent results as the data that is removed can be crucial for further calculations and might contain some important observations.
Although it is not a recommended way, we can still remove rows if the dataset is large. We can consider that, in a large dataset, the rows with the missing values may have a very small impact.
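As a minimal sketch of removing such rows with pandas, assume a small, made-up data frame (the names and ages below are hypothetical, not the article's data) in which NaN marks a missing age:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; np.nan marks a missing age.
data_frame = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dev"],
    "age": [25.0, np.nan, 31.0, np.nan],
})

# Drop every row that contains at least one NA value.
cleaned = data_frame.dropna()

print(cleaned)
```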
The above code will remove all the rows that have NA values in a column, hence excluding these rows from further evaluation.
These are the possible scenarios for dropping rows:
- Drop all rows that have any NaN (missing) values.
- Drop only if the entire row has NaN (missing) values.
- Drop only if a row has more than two NaN (missing) values.
- Drop rows that have NaN (missing) in a specific column.
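The scenarios above could be expressed with the options of pandas' dropna, roughly as follows. The data frame and its column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with NaN in different patterns.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, np.nan, np.nan],
    "age": [30.0, np.nan, 41.0, np.nan],
})

# Drop rows that have any NaN value.
any_nan = df.dropna()

# Drop rows only if every value in the row is NaN.
all_nan = df.dropna(how="all")

# thresh is the minimum number of non-NaN values a row needs to be kept,
# so this drops rows with more than two NaN values.
more_than_two = df.dropna(thresh=df.shape[1] - 2)

# Drop rows that have NaN in the "age" column.
age_nan = df.dropna(subset=["age"])
```

Note that thresh counts the values that are present, not the missing ones, which is why it is derived from the number of columns here.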
For more on these scenarios, refer to this article.
Despite it being the easier option, we should avoid this method where possible.
Let's look for an alternative to make the dataset more consistent.
Replacing NaN With Data
Imputation is another approach to resolve the problem of missing data.
The missing column values are substituted by another computed value. There might be scenarios where the dataset is small or where each row of the dataset represents a critical value.
In those cases, we cannot remove the row from the dataset. The missing values can be imputed.
There are different strategies to define the substitute for the missing value.
The value can be substituted by these values:
- The mean value of the other column values available in the training dataset.
- The median value of the other values available in the training dataset.
- Substitute with the most frequent value in the training dataset.
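To make the three strategies concrete, here is a small sketch that computes each substitute by hand with Python's standard library. The age values are made up, and this only illustrates the arithmetic, not how the imputer is implemented internally:

```python
import math
import statistics

# Hypothetical age column; float("nan") marks a missing value.
ages = [25.0, float("nan"), 31.0, 25.0, float("nan"), 40.0]

# Only the observed (non-missing) values take part in the calculation.
observed = [a for a in ages if not math.isnan(a)]

substitutes = {
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "most_frequent": statistics.mode(observed),
}

# Replace each NaN with the mean of the observed values.
filled_with_mean = [substitutes["mean"] if math.isnan(a) else a for a in ages]

print(substitutes)
print(filled_with_mean)
```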
To achieve the required substitution, we can use Scikit’s Imputer class.
SciKit’s Imputer Class
This class constructor takes the following parameters as input:
missing_values: This is the actual value that needs to be replaced. The developer can specify the value that needs to be considered for replacement. Let’s say we need to replace all NaN occurrences in the dataset. We can even replace integer or string values in the dataset.
strategy: This is the strategy used to calculate the substitute for the missing values. The strategies that can be used are mean, median, and most_frequent.
axis: This parameter takes either 0 or 1 as input value. It decides if the strategy needs to be applied to a row or a column. In this case, 0 represents the column strategy and 1 represents the row strategy. For the value 0, it looks at all the values in the column, and for 1, it looks at all the row data to calculate the value according to the strategy.
verbose: This defines the verbosity of the imputer. The default value is 0.
copy: This decides if a copy of the original object needs to be made or whether the original data needs to be transformed in place. By default, it is set to True, which signifies that a copy of the original object is created.
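The Imputer class described here lived in sklearn.preprocessing and was removed in scikit-learn 0.22. Its replacement, sklearn.impute.SimpleImputer, keeps the missing_values, strategy, and copy parameters but has no axis parameter and always works column-wise. The sketch below therefore assumes a recent scikit-learn release and uses SimpleImputer on a made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical 2-column array; np.nan marks the missing entries.
values = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

imputer = SimpleImputer(
    missing_values=np.nan,  # the placeholder to replace
    strategy="median",      # e.g. "mean", "median", or "most_frequent"
    copy=True,              # work on a copy and leave `values` untouched
)

# Learn the per-column median and fill the gaps in one step.
filled = imputer.fit_transform(values)

print(filled)
```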
In the code below, we will look at how to replace the missing values with the mean value of the row data that belongs to the same column.
Let’s understand the above code in detail.
Line 4 to Line 6
In line 4, we use an external library, pandas, to read the data from the CSV file. The data read from the file is referred to as the data frame. The data frame holds the entire set of data present in the data file.
In line 5, iloc is used to retrieve the rows and columns that need to be evaluated.
It can be used to derive a subset of the available data frame by specifying the row and column index.
We can specify the range of the rows and columns that need to be selected. Specifying only : in the row or column parameter states that all the values need to be selected.
trainingData = dataFrame.iloc[:, :].values
The above code states that we need to select all the rows and columns from the dataFrame into the variable trainingData.
trainingData = dataFrame.iloc[0:2, :].values
Similarly, the code above states that we are interested in the rows ranging from 0 to 2, and all the columns available in the dataFrame.
Note: The upper value “2” is excluded, so only the rows at index 0 and 1 are selected.
trainingData = dataFrame.iloc[:, 1:4].values
The above code states that we are interested in copying all the rows, and the columns ranging from 1 to 4, in the dataFrame.
Note: The upper value “4” is excluded, so only the columns at index 1, 2, and 3 are selected.
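The three iloc selections can be tried out on a small, hypothetical data frame (the columns and values below are invented) to see which rows and columns each slice returns:

```python
import pandas as pd

# Hypothetical data frame with 4 rows and 5 columns.
data_frame = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25, 31, 47, 52],
    "height": [160, 172, 181, 168],
    "weight": [55, 70, 82, 64],
    "score": [7, 9, 6, 8],
})

all_values = data_frame.iloc[:, :].values      # every row, every column
first_rows = data_frame.iloc[0:2, :].values    # rows 0 and 1; row 2 is excluded
some_columns = data_frame.iloc[:, 1:4].values  # columns 1, 2, 3; column 4 is excluded

print(all_values.shape, first_rows.shape, some_columns.shape)
```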
See this article for more detail.
The above code extracts two variables, trainingData and dataset, out of the main dataFrame. Shortly, we will look at the reason for this.
Line 7 to Line 10
We are aware of the fact that the data is missing in the CSV file for certain columns.
Now, the next task is to fill these empty columns with some calculated values. To calculate the values that need to be substituted, we need some data to base the calculation on. We can extract this data from the dataFrame. As the dataFrame available is not too large, we include the data from all rows and columns in the calculation of the value to be substituted in the missing columns.
The entire dataFrame is selected as a part of the training data, by specifying : for both row and column indexes.
The imputer is how the missing values are replaced by certain values.
The value to be substituted is calculated on the basis of some sample data which may or may not represent the entire set of data available.
It reduces the variability in the data sample and helps attain some consistent results for the provided dataset.
The following lines of code fill the missing values in the data available.
We need to import the imputer from scikit-learn to process the data. Let's look at these lines of code one by one.
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
Initially, we create an imputer and define the required parameters.
In the code above, we create an imputer with the following properties:
- The value that needs to be replaced is NaN.
- Substitution needs to be done with a mean value.
- As the value of axis is defined as 0, imputation needs to be done in the column.
- As we haven’t defined the verbose parameter, it will default to 0.
- We create a copy of the data by not providing the copy value.
Now that the imputer is created, it can be used to substitute the values with the specified strategies and parameters in the entire dataset.
In the data shown above, we can see that the age column has some missing data. We need to fill the missing entries of this column with a mean value attained from the data provided as training data.
Training Data for Machine Learning
It is very important to understand the meaning of training data. It refers to the set of data that is used to calculate the values required to substitute the missing columns.
In the above case, we plan to substitute mean data in the missing columns, but that mean value needs to be calculated. To calculate the value, we can specify the rows and the columns that are used to calculate the value of mean for substitution.
imputer = imputer.fit(trainingData[:, 1:2])
In the code above, we provide all the rows and all the values of the age column for calculation of the mean value that would be substituted.
As the age values in all the rows are provided as training data, all the available age values are used to find the mean value for substitution.
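To see what fit actually learns, here is a sketch on made-up training data. It assumes a current scikit-learn release, where SimpleImputer from sklearn.impute takes the place of the removed Imputer class. The learned substitute is exposed through the statistics_ attribute:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data: column 0 is an id, column 1 is the age.
training_data = np.array([[1.0, 20.0],
                          [2.0, np.nan],
                          [3.0, 30.0],
                          [4.0, 40.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# The slice 1:2 keeps the age column two-dimensional, which fit() requires.
imputer = imputer.fit(training_data[:, 1:2])

# statistics_ holds the learned substitute for each fitted column.
learned_mean = imputer.statistics_[0]

print(learned_mean)
```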
We can also provide a subset of rows for calculation. Imagine that we have a million rows; providing all the dataset’s rows for the mean calculation would be too costly. So, in those cases, we can provide a subset of data.
imputer = imputer.fit(trainingData[10:20, 1:2])
In the above code, we specify that the age values from the rows indexed from 10 to 20 (the upper value 20 is excluded) are involved in the calculation of the mean value that is substituted for the missing column values. The more rows involved in the mean calculation, the more consistent the result. Therefore, we need to select the training data wisely.
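The effect of the chosen training rows can be seen by fitting one imputer on every row and another on rows 10 to 19 only. This is a sketch with synthetic ages, again assuming SimpleImputer from a current scikit-learn release rather than the removed Imputer class:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age column with 30 rows (ages 20.0 to 49.0); a few are missing.
ages = np.arange(20.0, 50.0).reshape(-1, 1)
ages[[3, 12, 25], 0] = np.nan

# Fit on every row versus only rows 10..19 (the upper bound 20 is excluded).
full_imputer = SimpleImputer(strategy="mean").fit(ages[:, 0:1])
subset_imputer = SimpleImputer(strategy="mean").fit(ages[10:20, 0:1])

full_mean = full_imputer.statistics_[0]
subset_mean = subset_imputer.statistics_[0]

print(full_mean, subset_mean)
```

The two means differ because each imputer only sees the rows it was fitted on, which is why the choice of training rows matters.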
Substituting the Dataset With Calculated Values
Once the value has been calculated from the training dataset provided, we can substitute that value in the missing columns of the actual dataset.
dataset[:, 1:2] = imputer.transform(dataset[:, 1:2])
The code above substitutes the value of the missing column with the mean values calculated by the imputer, after operating on the training data set.
On transformation, all the missing data in the age column is substituted with the mean value calculated, and the missing values now have some relevant data to provide consistent results.
Once the data has been transformed, each missing value has been substituted with the value calculated from the age values in the training dataset provided to the imputer.
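Putting the steps together, the whole flow might look like the sketch below. It is not the article's original listing: the CSV contents are invented and read from an in-memory string, and SimpleImputer is used because the Imputer class is no longer available in current scikit-learn releases.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical CSV contents standing in for the data file.
csv_text = """name,age,salary
Ann,25,50000
Bob,,62000
Cara,35,58000
Dev,,71000
"""

data_frame = pd.read_csv(io.StringIO(csv_text))

# Copies of the underlying array, so that writing into them is safe.
training_data = data_frame.iloc[:, :].values.copy()
dataset = data_frame.iloc[:, :].values.copy()

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# The mixed-type array has dtype object, so cast the age column to float.
imputer = imputer.fit(training_data[:, 1:2].astype(float))
dataset[:, 1:2] = imputer.transform(dataset[:, 1:2].astype(float))

print(dataset)
```

Keeping trainingData and dataset as separate variables makes it possible to compute the mean from one selection of rows and apply it to another.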
Use this technique wherever you have missing data, before applying any machine learning algorithm to it.
I hope you liked this piece. Thanks for reading!