Missing values can lead to inconsistent results

Mayank Gupta
Aug 8 · 8 min read

This article focuses on substituting the missing values in a dataset using scikit-learn's Imputer.

Missing values can lead to inconsistent results. We can either ignore the rows with missing data columns or substitute the values with some calculated output.

When the dataset is small, we can't afford to lose rows, even if some of their columns are missing. In those cases, we substitute the missing column values with computed ones, which is the subject of this article.

We will talk about the code and concept in detail.


Data

Let's consider the data available to us.

The data contains some missing values for the age column. Missing values are marked as NaN. We need to look for ways of handling these missing data points.

The missing data can be handled in the following ways:

  1. Removing rows with missing data.
  2. Replacing NaN with data.

Removing Rows With Missing Data

As stated earlier, ignoring the rows with the missing data can lead to inconsistent results as the data that is removed can be crucial for further calculations and might contain some important observations.

Although it is not a recommended way, we can still remove rows if the dataset is large. We can consider that, in a large dataset, the rows with the missing values may have a very small impact.

Calling dropna() on a pandas data frame removes all the rows that contain NaN or NA values, excluding these rows from further evaluation.

Rows can be dropped under several scenarios:

  • Drop all rows that have any NaN (missing) values.
  • Drop only if the entire row has NaN (missing) values.
  • Drop only if a row has more than two NaN (missing) values.
  • Drop NaN (missing) in a specific column.
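The four scenarios above map fairly directly onto the arguments of pandas' dropna. The sketch below is only an illustration: the sample data and column names are made up, and thresh is expressed as the minimum number of non-missing values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee", "Eve"],
    "age": [25.0, np.nan, np.nan, 31.0, np.nan],
    "salary": [50000.0, 42000.0, np.nan, np.nan, np.nan],
    "city": ["Pune", "Delhi", None, "Agra", None],
})

# 1. Drop all rows that have any NaN value.
any_missing = df.dropna()

# 2. Drop a row only if every value in it is NaN.
all_missing = df.dropna(how="all")

# 3. Drop a row only if it has more than two NaN values,
#    i.e. keep rows with at least (columns - 2) non-missing values.
more_than_two = df.dropna(thresh=df.shape[1] - 2)

# 4. Drop rows that have NaN in one specific column.
age_missing = df.dropna(subset=["age"])
```

Each call returns a new data frame and leaves df itself unchanged.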

For more on these scenarios, refer to this article.

Although it is the easier option, this method should be avoided when the dataset is small or when the rows carry important observations.

Let's look for an alternative to make the dataset more consistent.


Replacing NaN With Data

Imputation is another approach to resolve the problem of missing data.

The missing column values are substituted by another computed value. There might be scenarios where the dataset is small or where each row of the dataset represents a critical value.

In those cases, we cannot remove the row from the dataset. The missing values can be imputed.

There are different strategies to define the substitute for the missing value.

A missing value can be replaced with one of the following:

  1. The mean of the other values of that column available in the training dataset.
  2. The median of the other values of that column available in the training dataset.
  3. The most frequent value of that column in the training dataset.
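To make the three strategies concrete, here is a minimal sketch that computes each substitute by hand with pandas, before any imputer class is involved. The age values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical age values with two missing entries.
age = pd.Series([20.0, 30.0, 30.0, np.nan, 60.0, np.nan])

# pandas skips NaN entries by default when computing these statistics.
mean_value = age.mean()
median_value = age.median()
most_frequent = age.mode().iloc[0]  # first mode if there are ties

# Each strategy fills the same gaps with a different substitute.
filled_with_mean = age.fillna(mean_value)
filled_with_median = age.fillna(median_value)
filled_with_mode = age.fillna(most_frequent)
```

The mean is sensitive to outliers such as the 60 above, which is one reason the median or the most frequent value is sometimes preferred.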

To achieve the required substitution, we can use scikit-learn's Imputer class.


scikit-learn's Imputer Class

This class constructor takes the following parameters as input:

  1. missing_values: This is the placeholder value that needs to be replaced. The developer can specify which value counts as missing, for example all NaN occurrences in the dataset. An integer placeholder can be specified instead of NaN.
  2. strategy: We have different strategies to calculate the substitute for the missing values. The strategies that can be used are mean, median, and most_frequent.
  3. axis: This parameter takes either 0 or 1 as its input value. It decides whether the strategy is applied along a column or a row. Here, 0 represents the column strategy and 1 represents the row strategy: for 0, the imputer uses all the values in the same column to calculate the substitute, and for 1, it uses all the values in the same row.
  4. verbose: This defines the verbosity of the imputer. The default value is 0.
  5. copy: This decides whether a copy of the original object is made or whether the original data is transformed in place. By default, it is set to True, which means that a copy of the original object is created.
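Note that the Imputer class was deprecated in scikit-learn 0.20 and removed in 0.22; its replacement, SimpleImputer in sklearn.impute, has no axis parameter and always works column by column. The sketch below assumes such a newer scikit-learn version and shows how the missing_values, strategy, and copy parameters carry over. The input array is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Sketch using SimpleImputer, assuming a scikit-learn version where
# the older Imputer class is no longer available.
imputer = SimpleImputer(
    missing_values=np.nan,  # the placeholder value to replace
    strategy="median",      # "mean", "median", or "most_frequent"
    copy=True,              # work on a copy, leave the input untouched
)

# Hypothetical two-column data; statistics are computed per column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])

# fit_transform learns the per-column median and fills the gaps in one step.
X_filled = imputer.fit_transform(X)
```

Because copy is True, X still contains its NaN entries afterwards and only X_filled holds the substituted values.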

In the code below, we will look at how to replace the missing values with the mean of the other values in the same column.

Let’s understand the above code in detail.


Line 4 to Line 6

In line 4, we use an external library, pandas, to read the data from the CSV file. The data read from the file can be referred to as the data frame. The data frame defines the entire set of data present in the data file.

In Line 5, iloc is used to retrieve the rows and columns that need to be evaluated.

It can be used to derive a subset of the available data frame by specifying the row and column index.

We can specify the range of the rows and columns that need to be selected. Specifying only : for the row or column parameter means that all the values are selected.

trainingData = dataFrame.iloc[:, :].values

The above code states that we need to select all the rows and columns from the dataFrame into the variable trainingData.

trainingData = dataFrame.iloc[0:2, :].values

Similarly, the code above selects the rows from index 0 up to, but not including, index 2, along with all the columns available in the dataFrame.

Note: The upper bound 2 is exclusive, so only rows 0 and 1 are selected.

trainingData = dataFrame.iloc[:, 1:4].values

The above code selects all the rows, and the columns from index 1 up to, but not including, index 4.

Note: The upper bound 4 is exclusive, so only columns 1, 2, and 3 are selected.
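The three iloc selections discussed here can be checked on a small data frame. This is a sketch with made-up data (4 rows, 5 columns) that only serves to show the resulting shapes.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with 4 rows and 5 columns.
dataFrame = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    columns=["a", "b", "c", "d", "e"],
)

everything = dataFrame.iloc[:, :].values      # all rows, all columns
first_rows = dataFrame.iloc[0:2, :].values    # rows 0 and 1; row 2 excluded
some_columns = dataFrame.iloc[:, 1:4].values  # columns 1, 2, 3; column 4 excluded

print(everything.shape, first_rows.shape, some_columns.shape)
```

The slices follow the usual Python convention: the start index is included and the stop index is not.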

See this article for more detail.

The above code extracts two variables, trainingData and dataset out of the main dataFrame.

Shortly, we will look at the reason for this.


Line 7 to Line 10

We are aware of the fact that the data is missing in the CSV files for certain columns.

The next task is to fill these empty cells with calculated values. To calculate a substitute value, we need some data to compute it from, and we can extract this data from the dataFrame.

As the dataFrame is not too large, we use all of its rows and columns to calculate the value to be substituted for the missing entries.

The entire dataFrame is therefore selected as the training data by specifying : for both the row and column indexes.

The imputer is the object that replaces the missing values with calculated ones.

The value to be substituted is calculated on the basis of some sample data which may or may not represent the entire set of data available.

It reduces the variability in the data sample and helps attain some consistent results for the provided dataset.

The following lines of code fill the missing values in the available data.

We need to import Imputer from scikit-learn to process the data. Let's look at these lines of code one by one.

imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)

First, we create an imputer and define the required parameters.

The imputer created in the code above has the following properties:

  1. The value that needs to be replaced is NaN.
  2. Substitution is done with the mean strategy.
  3. As the value of axis is defined as 0, imputation is done along the column.
  4. As we haven't defined the verbose parameter, it defaults to 0.
  5. As we haven't provided the copy parameter, it defaults to True, so a copy of the data is created.

Now that the imputer is created, it can be used to substitute the values with the specified strategies and parameters in the entire dataset.

In the data shown above, we can see that the age column has some missing data. We need to fill these missing entries with the mean value computed from the data provided as training data.


Training Data for Machine Learning

It is important to understand what training data means here. It refers to the set of data used to calculate the values that will replace the missing entries.

In the above case, we plan to substitute the mean for the missing entries, but that mean first needs to be calculated. To do so, we specify the rows and the column that are used to calculate the mean for substitution.

imputer = imputer.fit(trainingData[:, 1:2])

In the code above, we provide all the rows of the age column for the calculation of the mean value to be substituted.

Here, the age values in all the rows serve as the training data, so every available age value is used to find the mean for substitution.
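The fitting step can be inspected directly, because the fitted imputer stores the value it learned. The sketch below assumes a newer scikit-learn release, where SimpleImputer from sklearn.impute replaces the removed Imputer class, and uses a hypothetical array in which column 1 plays the role of the age column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data: column 0 is an id, column 1 is age.
trainingData = np.array([[1.0, 20.0],
                         [2.0, np.nan],
                         [3.0, 40.0],
                         [4.0, np.nan],
                         [5.0, 60.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# 1:2 keeps the age column two-dimensional, which fit() expects.
imputer = imputer.fit(trainingData[:, 1:2])

# statistics_ holds one learned value per fitted column.
learned_mean = imputer.statistics_[0]
print(learned_mean)
```

Writing the slice as 1:2 rather than the plain index 1 is what keeps the selection a single-column 2D array instead of a 1D array.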

We can also provide a subset of rows for the calculation. If we have a million rows, using all of them to calculate the mean may be more costly than necessary, so in those cases we can provide a subset of the data.

imputer = imputer.fit(trainingData[10:20, 1:2])

In the above code, the age values from the rows at index 10 through 19 are used to calculate the mean (the upper bound 20 is excluded).

This calculated mean is then substituted for the missing values. The more rows involved in the mean calculation, the more representative the result tends to be, so the training data should be selected wisely.
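To see how the choice of training rows affects the substitute value, the sketch below fits one imputer on all rows and another on rows 10 through 19 only. It assumes a newer scikit-learn release with SimpleImputer, and the age values are randomly generated for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age column: 30 rows, with every fifth value missing.
rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=30).astype(float)
age[::5] = np.nan
trainingData = np.column_stack([np.arange(30, dtype=float), age])

# Fit once on every row and once on a ten-row subset (rows 10..19).
full = SimpleImputer(strategy="mean").fit(trainingData[:, 1:2])
subset = SimpleImputer(strategy="mean").fit(trainingData[10:20, 1:2])

print("mean from all rows:   ", full.statistics_[0])
print("mean from rows 10-19: ", subset.statistics_[0])
```

The two means generally differ, which illustrates why a subset should be representative of the whole column before it is used as training data.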


Substituting the Dataset With Calculated Values

Once the value has been calculated from the training dataset provided, we can substitute that value in the missing columns of the actual dataset.

dataset[:, 1:2] = imputer.transform(dataset[:, 1:2])

The code above replaces the missing values in the column with the mean calculated by the imputer from the training dataset.

On transformation, all the missing data in the age column is substituted with the calculated mean, so the missing entries now hold relevant data that helps provide consistent results.

Once the data has been transformed, every missing value holds the mean age derived from the training dataset that was provided to the imputer.
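Putting the steps together, here is a minimal end-to-end sketch. It is not the original listing from this article: it assumes a newer scikit-learn release (SimpleImputer instead of the removed Imputer class) and uses a small hypothetical data frame in place of the CSV file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for the CSV file used in this article.
dataFrame = pd.DataFrame({
    "name": ["Ann", "Bob", "Cal", "Dee"],
    "age": [20.0, np.nan, 40.0, np.nan],
})

trainingData = dataFrame.iloc[:, :].values
# copy() guarantees an independent, writable array for in-place updates.
dataset = dataFrame.iloc[:, :].values.copy()

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(trainingData[:, 1:2])           # learn the mean age
dataset[:, 1:2] = imputer.transform(dataset[:, 1:2])  # fill the gaps

print(dataset)
```

Only the age column is passed to the imputer, so the non-numeric name column is left untouched.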

Use this technique wherever you have missing data, before applying any machine learning algorithm to it.


Conclusion

I hope you liked this piece. Thanks for reading!

Better Programming

Advice for programmers.
