#03 Data Cleaning: Feed It Better Data, and ML Gives You Better Answers

Preprocessing in Machine Learning

Akira Takezawa
Coldstart.ml
4 min read · Feb 7, 2019

Hola! Welcome to the “Short-Cut Machine Learning Series”.

This post is for readers who want to know:

  • Reason: The impact of data cleaning
  • Big Picture: A comprehensive overview of preprocessing ideas
  • Code: The simplest Python code for each preprocessing step

— — —

Why should you read this?

“Better Data > Fancier Algorithms” by EliteDataScience.com

When I started learning machine learning, I underestimated the data cleaning part of the data science workflow compared to fancier parts like modeling or feature engineering.

However, I’m now sure that the more you understand about ML, the more you realize how important and challenging the data cleaning part is.

In real data science work, you’ll often spend around 70% of your time on data cleaning, because unlike Kaggle datasets, real-world data is rarely tidy.

We need to clean up the data, mostly from a statistical point of view, so that our ML model works properly in a real implementation. Don’t worry: it’s just like cleaning your room, and it can be fun. Let’s get started!

— — —

Menu

  1. Missing Value Handling
  2. Exclude Outliers
  3. Convert Data Type
  4. Dummy Treatment
  5. Regularization (Penalty for parameters)
  6. Normalization (Scale Adjustment)

1. Missing Value Handling

First, the right way to handle missing values depends on the nature of the data. I will cover the two main data types: numerical values and categorical values.

Import dataset:
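A minimal sketch of the import step, assuming the data lives in a CSV file. To keep the example self-contained, a small inline CSV stands in for the real file. The rows are made up, and the "Rooms" and "Price" columns are hypothetical additions next to the four columns discussed in this post.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file. With actual data you would
# call pd.read_csv("path/to/your_file.csv") instead.
csv_text = """Rooms,Price,Car,BuildingArea,YearBuilt,CouncilArea
2,1480000,1,,,Yarra
2,1035000,0,79,1900,Yarra
3,1465000,0,150,1900,Yarra
3,850000,1,,,
4,1600000,,142,2014,
"""

# Empty fields are parsed as NaN by default.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.head())
```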

First, visualize where the null values are with three lines of code:
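One common way to draw such a null map is a seaborn heatmap over `df.isnull()`. The sketch below assumes seaborn and matplotlib are installed and uses a toy frame with made-up values. The colors you see depend on the colormap, so they may differ from the figure described here.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the four columns discussed in this post.
df = pd.DataFrame({
    "Car": [1.0, 0.0, np.nan, 2.0],
    "BuildingArea": [np.nan, 79.0, 150.0, np.nan],
    "YearBuilt": [np.nan, 1900.0, 1900.0, 2014.0],
    "CouncilArea": ["Yarra", "Yarra", None, None],
})


def plot_null_map(frame):
    """Draw a heatmap in which each highlighted cell is a null value."""
    # Plotting libraries are imported here so the rest runs without them.
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(10, 6))
    sns.heatmap(frame.isnull(), cbar=False, yticklabels=False)
    plt.show()


# The same information as numbers: null count per column.
null_counts = df.isnull().sum()
print(null_counts)
```

Calling `plot_null_map(df)` opens the plot. The per-column counts are a quick check when you only need to know which columns are affected.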

The dark red areas indicate null values. We have four columns that contain null values: [“Car”, “BuildingArea”, “YearBuilt”, “CouncilArea”]. The first three columns are numerical, and only “CouncilArea” is categorical.

I will start with missing values in numerical data.

Missing Values in Numerical Data

1. Replace with the Mean or Mode [ sklearn.impute.SimpleImputer() ]
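A minimal sketch of mean and mode imputation with `SimpleImputer`, on a toy frame with made-up values. Using the mean for "BuildingArea" and the mode for "Car" is only an illustrative choice, not a rule.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made-up values) with one missing entry per column.
df = pd.DataFrame({
    "Car": [1.0, 2.0, np.nan, 2.0],
    "BuildingArea": [100.0, np.nan, 150.0, 200.0],
})

# Fill a continuous column with its mean.
mean_imputer = SimpleImputer(strategy="mean")
df[["BuildingArea"]] = mean_imputer.fit_transform(df[["BuildingArea"]])

# Fill a count-like column with its mode ("most_frequent" in scikit-learn).
mode_imputer = SimpleImputer(strategy="most_frequent")
df[["Car"]] = mode_imputer.fit_transform(df[["Car"]])

print(df)
```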

Missing Values in Categorical Data

1. Drop Rows [ pandas.DataFrame.dropna() ]

As you can see, the “CouncilArea” column is missing data mostly in the later records. We can guess that the data provider had trouble storing data during a particular period. Therefore, this time I will drop those rows.
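Dropping only the rows where “CouncilArea” is missing could look like the sketch below. The toy values and the "Price" column are hypothetical.

```python
import pandas as pd

# Toy data (made-up values): the last two records lack a CouncilArea.
df = pd.DataFrame({
    "Price": [100, 200, 300, 400],
    "CouncilArea": ["Yarra", "Moreland", None, None],
})

# Keep only rows where CouncilArea is present. Nulls in other columns
# are not checked because of the subset argument.
df_dropped = df.dropna(subset=["CouncilArea"]).reset_index(drop=True)

print(df_dropped)
```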

2. Assign a Unique Category [ pandas.DataFrame.fillna() ]

NOTE: A “missing value” doesn’t always mean the data is simply absent.

I got a good insight from “How to Handle Missing Data” by Alvira Swalin. She explains that data is sometimes not missing at random. In that case, we can treat the missing entries as meaningful and give them their own new category.
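A minimal sketch of giving missing entries their own category with `fillna()`. The label "Unknown" is an arbitrary choice and the toy values are made up.

```python
import pandas as pd

# Toy data (made-up values) with two missing categories.
df = pd.DataFrame({
    "CouncilArea": ["Yarra", None, "Moreland", None],
})

# Treat "missing" as a category of its own instead of dropping the rows.
df["CouncilArea"] = df["CouncilArea"].fillna("Unknown")

print(df["CouncilArea"].value_counts())
```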

3. Apply ML models (KNN or Regression model)

I also found an efficient solution for missing categorical values: use a machine learning classification model to predict the category of the records that have a missing value. Interesting!
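As a rough sketch of this idea, the example below trains a KNN classifier on the rows where the category is known and predicts it for the rows where it is missing. The "lat" and "lon" features, the toy values, and `n_neighbors=3` are all assumptions for illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made-up values): two clusters of points, two unlabeled rows.
df = pd.DataFrame({
    "lat": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 0.05, 5.05],
    "lon": [0.0, 0.1, 0.0, 5.0, 5.1, 5.0, 0.05, 5.05],
    "CouncilArea": ["A", "A", "A", "B", "B", "B", None, None],
})
features = ["lat", "lon"]

# Split into rows with a known label and rows that need a prediction.
known = df[df["CouncilArea"].notnull()]
unknown = df[df["CouncilArea"].isnull()]

# Fit on the known rows, then fill the missing labels with predictions.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(known[features], known["CouncilArea"])
df.loc[unknown.index, "CouncilArea"] = model.predict(unknown[features])

print(df)
```

This only works well when the other features actually carry information about the missing category, so check the predictions on rows where the true label is known before relying on them.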

2. Exclude Outliers

But why do outliers matter?

I won’t explain the whole theory, but you should keep this in mind:

In machine learning and statistical analysis, many methods assume that variables (features) are roughly normally distributed, and outliers can badly distort the mean, the variance, and the parameters fitted from them.

Once you understand this premise, you’ll ask:

But how can I detect outliers?

OK, basically there are three ways to detect outliers:

  1. Smirnov-Grubbs test
  2. Boxplot with the interquartile range (IQR)
  3. Z-score

2. Boxplot

Image: iris data set (cr. 2019 Akira T.)
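A boxplot usually flags points that lie more than 1.5 × IQR beyond the quartiles. A minimal sketch of that rule in pandas, on made-up values (the 1.5 multiplier is the common convention, not something specific to this post):

```python
import pandas as pd

# Toy data (made-up values) with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

# Quartiles and the interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Boxplot whisker limits: 1.5 * IQR beyond each quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the limits and keep the rest.
is_outlier = (values < lower) | (values > upper)
cleaned = values[~is_outlier]

print(cleaned.tolist())
```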

3. Z-score
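The Z-score measures how many standard deviations a value lies from the mean, z = (x − mean) / std. A minimal sketch with NumPy on made-up values, using the common (but arbitrary) cutoff of |z| > 3:

```python
import numpy as np

# Toy data (made-up values): 20 ordinary points and one extreme value.
values = np.array([10, 11, 12, 13] * 5 + [95], dtype=float)

# Z-score of every point, using the population standard deviation.
z_scores = (values - values.mean()) / values.std()

# Keep only points within 3 standard deviations of the mean.
threshold = 3.0
cleaned = values[np.abs(z_scores) <= threshold]

print(len(values), len(cleaned))
```

Keep in mind that on very small samples a single extreme value inflates the standard deviation so much that its own Z-score may never reach the cutoff.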

3. Convert Data Type
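A few typical conversions in pandas, as a minimal sketch. The column names and values are hypothetical: numbers stored as text, dates stored as strings, and a numeric code that is really a category.

```python
import pandas as pd

# Toy data (made-up values) where every column has an unhelpful type.
df = pd.DataFrame({
    "Price": ["1480000", "1035000", "n/a"],
    "Date": ["2016-12-03", "2016-02-04", "2017-03-04"],
    "Postcode": [3067, 3067, 3068],
})

# Text to numbers; entries that cannot be parsed become NaN.
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

# Text to datetime, with an explicit format.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

# A numeric code that should not be treated as a quantity.
df["Postcode"] = df["Postcode"].astype(str)

print(df.dtypes)
```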

4. Dummy Treatment
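Dummy treatment (one-hot encoding) turns a categorical column into 0/1 indicator columns. A minimal sketch with `pandas.get_dummies()`, where the "Type" and "Rooms" columns and their values are hypothetical:

```python
import pandas as pd

# Toy data (made-up values) with one categorical column.
df = pd.DataFrame({
    "Type": ["h", "u", "t", "h"],
    "Rooms": [2, 3, 3, 4],
})

# Replace "Type" with one 0/1 column per category.
dummies = pd.get_dummies(df, columns=["Type"], dtype=int)

print(dummies)
```

For linear models you may want to pass `drop_first=True` so that the indicator columns are not perfectly collinear.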

5. Regularization (Penalty for parameters)
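Regularization adds a penalty on the size of the model parameters. As a rough sketch, the example below compares plain linear regression with scikit-learn's `Ridge` (an L2 penalty) on synthetic data. The data, the true coefficients, and `alpha=10.0` are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: 50 samples, 3 features, known coefficients plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Unpenalized fit versus a fit with an L2 penalty on the coefficients.
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The penalty shrinks the coefficient vector toward zero.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print(ols_norm, ridge_norm)
```

A larger `alpha` means a stronger penalty and smaller coefficients. In practice it is usually chosen by cross-validation.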

6. Normalization (Scale Adjustment)

  • MinMaxScaler: Rescales all values into the range 0 to 1.
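A minimal sketch of `MinMaxScaler` on a made-up two-column array. Each column is rescaled independently as (x − min) / (max − min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made-up values): two features on very different scales.
X = np.array([
    [100.0, 1.0],
    [150.0, 2.0],
    [200.0, 5.0],
])

# Fit the per-column min and max, then rescale each column to [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

When you have a train/test split, fit the scaler on the training data only and reuse it with `scaler.transform()` on the test data.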

— — —

Conclusion

— — —

References

Akira Takezawa
Coldstart.ml

Data Scientist, Rakuten / a discipline of statistical causal inference and time-series modeling / using Python and Stan, R / MLOps is my current concern