The Imperative of Data Cleansing — part 2

Deal with Missing Data and Outliers

Wael Samuel
The Startup
7 min read · May 25, 2020


Data is messy and full of errors

As promised in the first part, The Imperative of Data Cleansing — part 1, we will continue with more data problems and how to solve them.

When you are collecting or working with data, you might find that some fields have missing values, or that your data contains outliers that really don’t make much sense.

Missing Data

Data Problems

When you are working with records that have missing values in the data you use to train your model, there are two approaches you can follow to deal with them:

1- Deletion

2- Imputation

Deletion, A.K.A. Listwise Deletion
Delete entire records (rows) if a single value (column) is missing.

This is a simple, hassle-free technique to deal with missing values, but it can lead to bias because you might delete highly relevant data.

Listwise deletion is the most common approach in practice because it is easy, but it can lead to several problems. It can significantly reduce the sample size you are working with: if you don’t have many records and you apply this approach, you might end up with insufficient data to train your ML model.

Another thing to worry about: if values are not missing at random, removing them can introduce significant bias. For example, if you are collecting data from sensors and one sensor has missing values in a particular field, dropping all records from that sensor can lead to significant bias.

So, it is pretty clear that simply dropping entire records that have a few fields missing is not a great option, which is why we move on to Imputation.

Imputation
Fill in missing column values rather than deleting records with missing values. Missing values are inferred from known data.

Once you have decided to use imputation to fill in missing values, there are a number of methods you can choose from, ranging from very simple to very complex.

The simplest possible method is using the column average: you assume that the missing values are essentially equal to the mean value of that column, or feature. Very similar options are to use the median or the mode of that column.
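Here is a minimal sketch of this using scikit-learn’s SimpleImputer (the column values are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# a small feature matrix with missing values (made-up data)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# strategy can also be "median" or "most_frequent" (the mode)
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # NaNs replaced by each column's mean
```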

Another way to impute missing values is to interpolate from other nearby values. This technique is useful when your records are arranged in some kind of inherent order.
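As a minimal sketch with pandas, assuming some made-up ordered readings:

```python
import pandas as pd

# ordered readings with gaps (made-up values)
s = pd.Series([10.0, None, None, 16.0, 18.0])

# linear interpolation fills each gap from its nearest known neighbours
print(s.interpolate(method="linear"))  # fills in 12.0 and 14.0
```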

Imputation to fill in missing values can be very complex; in fact, you can build an entire ML model to predict the missing values in your data.

Now you might want to perform imputation in a variety of ways:

Multivariate Imputation

Univariate Imputation: rely on known values in the same feature or column.

Multivariate Imputation: use all known data to infer missing values.

For example, for multivariate imputation, you might construct regression models from other columns to predict a particular column, and repeat this for every column that contains missing values.
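scikit-learn’s IterativeImputer implements this idea: it models each column with missing values as a function of the other columns and iterates. A minimal sketch on made-up data:

```python
import numpy as np
# IterativeImputer is still marked experimental, so this import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0]])

# each column with missing values is regressed on the other columns,
# and the process repeats until the filled-in values stabilise
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```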

There are other techniques that you can apply to fill missing values as well.

Hot-deck Imputation

Sort all the records you have based on some criteria, and for each missing value, use the immediately prior available value.

Filling missing values using the previous value, once your records have been ordered, is especially useful for time series data, where progression in time has a useful meaning.

For time series, this is equivalent to assuming that nothing has changed since the last measurement.
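With pandas this is a one-line forward fill; the dates and values below are made up:

```python
import pandas as pd

# daily measurements with gaps (made-up values)
ts = pd.Series([3.2, None, 3.5, None, None, 4.0],
               index=pd.date_range("2020-01-01", periods=6))

# carry the last observed value forward: "no change since the last measurement"
print(ts.ffill())
```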

A commonly used technique is Mean Substitution.

Mean Substitution

For each missing value, substitute the mean of all available values. Mean substitution has the effect of weakening the correlations between columns that exist in your data: by essentially saying “this is an average data point, there is nothing special about it,” you weaken correlations. This can be problematic when you are performing bivariate analysis, which is analysis to determine the relationship between two columns.
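You can see this weakening effect on synthetic data; a minimal sketch where about 40% of one correlated column is knocked out and filled with its mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)  # strongly correlated with x

y_missing = y.copy()
y_missing[rng.random(1000) < 0.4] = np.nan    # knock out ~40% at random

# mean substitution
y_filled = np.where(np.isnan(y_missing), np.nanmean(y_missing), y_missing)

print(np.corrcoef(x, y)[0, 1])         # strong correlation
print(np.corrcoef(x, y_filled)[0, 1])  # noticeably weaker after substitution
```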

If you want to intelligently predict the missing values in your data, you might want to use machine learning.

Regression

Fit a model to predict a missing column based on other column values. Applying this technique tends to strengthen the correlations that exist in your data, because you are essentially saying that this column is dependent on the other columns.

Regression and mean substitution have complementary strengths when filling in missing values: regression strengthens correlations while mean substitution weakens them.
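A minimal sketch of regression imputation, predicting a column with missing values from another column (the column names and values here are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, 32, 47, 51, 38],
                   "income": [30.0, 42.0, np.nan, 80.0, np.nan]})

# fit on the rows where the target column is known
known = df[df["income"].notna()]
model = LinearRegression().fit(known[["age"]], known["income"])

# predict the missing entries from the other column
missing = df["income"].isna()
df.loc[missing, "income"] = model.predict(df.loc[missing, ["age"]])
print(df)
```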

Outliers

Data Problems

Outlier: a data point that differs significantly from other data points in the same dataset.

If you visualize your data, you might find something like this:

Outlier Visualization

It might be that an entire record is an outlier in some manner, or that certain fields have outlier values.

Dealing with outlier data is a two-step process: the first step is identifying the outliers that exist in your data, and the second step is using techniques to cope with them.

There are specific ML algorithms built for outlier detection, but at a very basic level you can identify outliers by looking at the distance of a data point from the mean of your data, or its distance from a line that you fit to your data.
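One example of such an algorithm is scikit-learn’s IsolationForest; a minimal sketch on synthetic data with a few planted outliers:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:3] += 8  # plant three obvious outliers (made-up data)

detector = IsolationForest(contamination=0.02, random_state=0)
labels = detector.fit_predict(X)  # -1 = outlier, 1 = inlier
print(np.where(labels == -1)[0])  # indices of the flagged points
```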

Once you identify outliers, you can cope with them using three techniques: drop records with outlier data, cap/floor the outliers, or set the outliers to the mean value of that feature.

Outliers

Identifying Outliers: Distance from the mean

If you have a data point with a value far from the mean, it can be considered an outlier. Alternatively, you can perform some regression analysis and find a line or curve that follows the pattern in your data; if a point is far from this fitted line, it is an outlier.

When you want to quickly summarize any set of data you are working with, the first measure you will reach for is the mean of that data.

The mean or average is the one number that best represents all of these data points.

The mean, or average, of any set of data is the sum of all of the numbers divided by the count of the numbers.

Mean: x̄ = (x₁ + x₂ + … + xₙ) / n

However, along with the mean, the variation that exists in your data is also important.

Variation: a measure of how much the numbers in your dataset jump around.

One important measure of the variation in your dataset is the range.
The range completely ignores the mean, though, and is swayed by any outliers present in your dataset; that’s where variance comes in.

Variance is the second most important number to summarize any set of points you are working with.

Variance: σ² = Σ(xᵢ − x̄)² / n, the average squared distance from the mean

Mean and variance succinctly summarize any set of numbers.

Along with variance, another term you might encounter is the standard deviation.

Standard Deviation: the square root of the variance, and a measure of the variation in your data.

The standard deviation helps you express how far a particular data point lies from the mean.
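Putting the last few definitions together, here is a minimal sketch that flags points lying more than three standard deviations from the mean (synthetic data with one planted outlier):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=0.5, size=50)
data[0] = 25.0  # plant one obvious outlier

# a z-score expresses distance from the mean in standard deviations
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])  # flags the planted point
```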

Identifying Outliers: Distance from the fitted line

Outliers might also be data points that do not fit into the same relationship as the rest of the data.
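A minimal sketch of this idea: fit a straight line, then flag the points whose residuals (distances from the line) are unusually large. The data are synthetic, with one planted outlier:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(scale=1.0, size=100)
y[0] += 30  # one point far from the underlying line

# fit a straight line and measure each point's distance from it
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print(np.where(np.abs(residuals) > 3 * residuals.std())[0])  # flags index 0
```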

Coping with Outliers

Once you have identified the outliers, you need to figure out how to deal with them.

Always start by scrutinizing the outliers that exist in your data.

If an outlier is an erroneous observation:

  • Drop the entire record if all attributes of that point are erroneous.
  • If a row or record has one attribute that you believe is erroneous, set it to the mean.

It is quite possible that your outlier is not an incorrect observation but a genuine, legitimate outlier. You have two approaches to deal with this kind of outlier:

  • Leave it as is if the model is not distorted.
  • Cap/floor it if the model is distorted, but first you need to standardize your data (see the sketch below).
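A minimal sketch of capping and flooring: standardize by computing the mean and standard deviation, then clip everything to within three standard deviations (the data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=50.0, scale=5.0, size=100)
data[:2] = [120.0, -40.0]  # plant extreme values

# cap/floor at three standard deviations from the mean
mean, std = data.mean(), data.std()
capped = np.clip(data, mean - 3 * std, mean + 3 * std)
print(capped[:2])  # the planted extremes are pulled to the caps
```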

Final Words…

Since you need to use data for making critical business decisions, it is safe to say that cleansing and enriching your data periodically is a must.

Thanks for reading.
I hope you enjoyed my series about Data Cleaning.

Bye
