#03 Data Cleaning: Feed Better Data, and ML Gives You Better Answers

Preprocessing in Machine Learning

Akira Takezawa

Published in Coldstart.ml, Feb 7, 2019

Hola! Welcome to the “Short-Cut Machine Learning Series”.

This post is for readers who want to know:

Reason: The impact of data cleaning
Big Picture: Comprehensive preprocessing ideas
Code: The simplest Python code for each preprocessing step

— — —

Why should you read this?

“Better Data > Fancier Algorithms” by EliteDataScience.com

When I started learning machine learning, I underestimated the data cleaning part of the data science workflow compared to fancier parts like modeling or feature engineering.

However, I’m now sure that the more you understand about ML, the more you realize how important and challenging data cleaning is.

In real data science work, data cleaning can easily take around 70% of your time, because unlike Kaggle datasets, real-world data is rarely tidy.

Especially from a statistical point of view, we need to clean up the data so that our ML model works properly in a real implementation. Don’t worry, it’s just like cleaning your room, and it can be fun. Let’s get started!

— — —

Menu

.interpolate
sklearn.impute.SimpleImputer
sklearn.preprocessing.StandardScaler
preprocessing.MinMaxScaler()
.lognormal(size=(3, 3))
preprocessing.normalize(X, norm='l2')
preprocessing.OrdinalEncoder()
preprocessing.OneHotEncoder()
PolynomialFeatures
Binarizer
FunctionTransformer
label_binarize

1. Missing Value Handling

First, the right solution for missing values depends on the characteristics of each column. I will cover the two main data types: numerical values and categorical values.

Import dataset:

import pandas as pd
df = pd.read_csv("melb_data.csv")
df = df.drop(labels="Price", axis=1)  # drop the target column
df.shape
>>> (13580, 20)  # 13580 rows, 20 columns

First, visualize where the null values are with three lines of code:

# df.isnull().sum() is the easiest way, though
import seaborn as sns
nan = df.isnull()
sns.heatmap(nan, cmap="Oranges")

The dark cells indicate null values. We have four columns with null values: [“Car”, “BuildingArea”, “YearBuilt”, “CouncilArea”]. The first three columns are numerical, and only “CouncilArea” is categorical.

null_nums = ["Car", "BuildingArea", "YearBuilt"]
null_cat = ["CouncilArea"]
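The two lists above were typed in by hand. As a minimal sketch, assuming a small made-up DataFrame (toy_df and its values are hypothetical, not melb_data.csv), the same split could be derived from the null counts and the column dtypes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy_df = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 1.0],
    "YearBuilt": [1990.0, 2005.0, np.nan, np.nan],
    "CouncilArea": ["Yarra", None, "Moreland", "Yarra"],
})

# Count nulls per column and keep only the columns that have any
null_counts = toy_df.isnull().sum()
null_cols = null_counts[null_counts > 0].index.tolist()

# Split those columns by dtype: numeric vs. everything else
toy_null_nums = toy_df[null_cols].select_dtypes(include="number").columns.tolist()
toy_null_cat = [c for c in null_cols if c not in toy_null_nums]

print(null_counts.to_dict())
print(toy_null_nums, toy_null_cat)
```

This only looks at dtypes, so a numeric column that really encodes a category would still need to be moved by hand.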

I will start with missing values in numerical data.

Missing Value in Numerical Data

1. Replace With Mean, Median, or Mode [ sklearn.impute.SimpleImputer() ]

# strategy can be 'mean', 'median', or 'most_frequent'
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_array = imputer.fit_transform(df[null_nums])
new_df = pd.DataFrame(imp_array, columns=null_nums)  # keep the column names
new_df.isnull().sum()
>>> 0 # "Car"
>>> 0 # "BuildingArea"
>>> 0 # "YearBuilt"

Missing Value in Categorical Data

1. Drop Rows [ pandas.DataFrame.dropna() ]

As you can see, the “CouncilArea” column is missing data mostly in the later records. We can guess that the data provider had trouble storing data during a particular period. So this time I will drop those rows.

df.dropna(subset=["CouncilArea"], inplace=True)
df["CouncilArea"].isnull().sum()
>>> 0 # nulls in "CouncilArea"

2. Assign a Unique Category [ pandas.DataFrame.fillna() ]

NOTE: A “missing value” doesn’t always mean the data is simply absent.

I got a good insight from “How to Handle Missing Data” by Alvira Swalin. She explains that data is sometimes not missing at random. In that case, we can treat the missingness itself as meaningful and give those records a new category.

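# this is just a coding example
# assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df["CouncilArea"] = df["CouncilArea"].fillna("New Category")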

3. Apply ML models (KNN or Regression model)

I also found another effective solution for categorical missing values: a classification model is trained to predict the category of the records that have a missing value. Interesting!
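Here is a minimal sketch of that idea, not the exact method from the source I read. It assumes a made-up table (the column names lat, lon, and area are hypothetical) and uses scikit-learn's KNeighborsClassifier, trained on the complete rows, to predict the missing categories:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two numeric features and a category with gaps
toy = pd.DataFrame({
    "lat": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 0.05, 5.05],
    "lon": [0.0, 0.2, 0.1, 5.0, 5.2, 5.1, 0.10, 5.10],
    "area": ["A", "A", "A", "B", "B", "B", None, None],
})

features = ["lat", "lon"]
known = toy[toy["area"].notna()]    # complete rows, used for training
unknown = toy[toy["area"].isna()]   # rows whose category we predict

# Fit on the complete rows, then fill in the predicted categories
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(known[features], known["area"])
toy.loc[unknown.index, "area"] = knn.predict(unknown[features])

print(toy["area"].tolist())
```

This only works when the other features are complete and actually carry information about the missing category, so it is worth checking the classifier's accuracy on the known rows first.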

2. Exclude Outliers

But why do outliers matter?

I won’t explain the whole theory, but keep this in mind:

Many statistical methods and some machine learning models assume that the variables (features) are roughly normally distributed, and a few extreme values can badly distort the means and variances those methods rely on.
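To see the effect in numbers, here is a minimal sketch with made-up values (the array is hypothetical): a single extreme value drags the mean far away, while the median stays put.

```python
import numpy as np

# Hypothetical measurements clustered around 10
values = np.array([10.0, 11.0, 9.0, 10.0, 10.0])
with_outlier = np.append(values, 100.0)  # add one extreme value

print(values.mean(), np.median(values))              # 10.0 10.0
print(with_outlier.mean(), np.median(with_outlier))  # 25.0 10.0
```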

Once you understand this premise, you’ll think:

But how can I detect outliers?

OK, here are the basic ways to detect outliers:

Smirnov-Grubbs test
Boxplot with the interquartile range (IQR)
Z-score
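For the first item, here is a rough sketch, assuming that the Smirnov-Grubbs test refers to what is commonly called Grubbs' test (two-sided version). The function name grubbs_outlier and the sample values are hypothetical; the critical value comes from the t-distribution in SciPy.

```python
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    """Return (index, is_outlier) for the most extreme value in x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    deviations = np.abs(x - x.mean())
    idx = int(deviations.argmax())
    # Test statistic: largest deviation in units of the sample std
    g = deviations[idx] / x.std(ddof=1)
    # Two-sided critical value derived from the t-distribution
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return idx, bool(g > g_crit)

sample = [10, 11, 9, 10, 10, 12, 9, 11, 50]
print(grubbs_outlier(sample))       # flags the value 50 at index 8
print(grubbs_outlier(sample[:-1]))  # no outlier among the remaining values
```

The test checks one value at a time and assumes the rest of the data is roughly normal, so it is usually applied repeatedly, removing one outlier per round.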

2. Boxplot

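# I'm using the iris dataset (numeric feature columns only)
from sklearn.datasets import load_iris
iris_df = load_iris(as_frame=True).data
sns.boxplot(x="variable", y="value", data=pd.melt(iris_df))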
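The boxplot draws points beyond its whiskers as outliers. As a rough sketch, assuming seaborn's default whisker length of 1.5 × IQR and some made-up values (the series is hypothetical), the same rule can be applied directly to get the outlier rows:

```python
import pandas as pd

# Hypothetical values with one obvious outlier
s = pd.Series([10, 11, 9, 10, 10, 12, 9, 11, 50])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles counts as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

print(lower, upper)       # 8.5 12.5
print(outliers.tolist())  # [50]
```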

3. Z-score
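A minimal sketch of the Z-score approach, assuming made-up values and the conventional cutoff of 3 standard deviations (both the data and the threshold are my assumptions, so tune the cutoff for your own data):

```python
import numpy as np

# Hypothetical values: 20 points near 10 plus one extreme value
x = np.array([10, 11, 9, 10, 12, 9, 11, 10, 10, 11] * 2 + [50], dtype=float)

# Z-score: distance from the mean in units of standard deviation
z = (x - x.mean()) / x.std()

threshold = 3  # conventional cutoff, an assumption here
outliers = x[np.abs(z) > threshold]

print(outliers.tolist())  # [50.0]
```

With very small samples the largest possible Z-score is limited, so a cutoff of 3 may never trigger. That is one reason to also look at the IQR rule.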

4. Convert Data Type
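As a small sketch of what data type conversion can look like in pandas, assuming a made-up table where everything was read in as text (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw data where every column arrived as strings
raw = pd.DataFrame({
    "Rooms": ["2", "3", "4"],
    "Price": ["1480000", "n/a", "1035000"],
    "Date": ["2016-12-03", "2016-02-04", "2017-03-04"],
})

raw["Rooms"] = raw["Rooms"].astype(int)                       # clean integers
raw["Price"] = pd.to_numeric(raw["Price"], errors="coerce")   # bad values become NaN
raw["Date"] = pd.to_datetime(raw["Date"], format="%Y-%m-%d")  # parse dates

print(raw.dtypes)
```

Using errors="coerce" turns unparseable entries into NaN, which can then be handled with the missing value techniques from section 1.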

5. Dummy Treatment
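A minimal sketch, assuming “dummy treatment” here means dummy-variable (one-hot) encoding of a categorical column. The toy column and its values are hypothetical:

```python
import pandas as pd

# Hypothetical categorical column
toy = pd.DataFrame({"Type": ["h", "u", "t", "h"]})

# One indicator column per category
dummies = pd.get_dummies(toy, columns=["Type"])

print(dummies.columns.tolist())  # ['Type_h', 'Type_t', 'Type_u']
```

Inside a scikit-learn pipeline, preprocessing.OneHotEncoder() plays the same role.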

6. Regularization (Penalty for parameters)

7. Normalization (Scale Adjustment)

MinMaxScaler: rescales each feature (column) to the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler().fit(data)
print(scaler.transform(data))
>>> [[0.   0.  ],[0.25 0.25],[0.5  0.5 ],[1.   1.  ]]
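To show what happens under the hood, here is a sketch that computes the same scaling by hand with NumPy, assuming the default feature range of 0 to 1. Each column is shifted by its minimum and divided by its range:

```python
import numpy as np

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])

# Per-column minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# (x - min) / (max - min), applied column by column
scaled = (data - col_min) / (col_max - col_min)

print(scaled)
```

The result matches the MinMaxScaler output above. Note that a constant column would make the denominator zero in this hand-written version.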

— — —

Conclusion

— — —
