By — Jen Neng Ng
This story is a continuation of a series:
Part 2: Data Analysis
This tutorial uses the R language because it requires little code and makes plotting quick. The drawbacks are that its terse syntax can be hard to read and that it has less support for deep learning models.
There are 39 columns in the DrivenData.org dataset. The main objective is to predict the ordinal variable “damage_grade”, which represents the level of damage the earthquake caused to a building. There are three grades of damage: 1: low damage; 2: medium amount of damage; 3: almost complete destruction.
This data structure plot is useful when the dataset is spread over multiple files, which is common in Kaggle competitions. It gives you a quick view of how to combine the files into a single data frame.
Running the code above generates a quick EDA report.
Figure 2 shows the overall data statistics: 32 continuous columns (binary and numeric) and 8 discrete columns (categorical). Because there are only 8 discrete columns, one-hot encoding will be fast and the resulting data frame will consume less RAM.
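The same kind of count can be reproduced in base R. The sketch below uses a tiny made-up data frame (the values are not from the real dataset) and treats every numeric column as continuous, which is an assumption about how the report groups binary and numeric columns.

```r
# Toy data frame standing in for the earthquake data (values are made up).
df <- data.frame(
  age = c(5, 10, 30, 995),
  has_secondary_use = c(0L, 1L, 0L, 0L),
  roof_type = factor(c("n", "q", "n", "x")),
  damage_grade = c(1L, 2L, 3L, 2L)
)

# TRUE for numeric/integer columns, FALSE for factor or character columns.
is_continuous <- vapply(df, is.numeric, logical(1))

n_continuous <- sum(is_continuous)
n_discrete <- sum(!is_continuous)
cat("continuous:", n_continuous, " discrete:", n_discrete, "\n")
```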
Another thing to look at is missing observations and columns, because some of the provided data may have missing values in particular observations (rows). This check determines whether we need a data cleaning or imputation step.
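A minimal base R sketch of this check, on a made-up data frame with a few NA values injected for illustration:

```r
# Toy data with missing values injected on purpose (not the real dataset).
df <- data.frame(
  age = c(5, NA, 30, 20),
  area_percentage = c(6, 8, NA, NA),
  roof_type = c("n", "q", "n", NA),
  stringsAsFactors = FALSE
)

# Number and share of missing values per column.
missing_per_column <- colSums(is.na(df))
missing_share <- missing_per_column / nrow(df)

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(df))

print(missing_per_column)
print(missing_share)
print(complete_rows)
```

If every count is zero, the imputation step can be skipped.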
The histogram above shows the frequency distribution, which gives a quick view of how each numeric continuous variable is distributed. The interesting thing to notice is the small group of outliers on the right side of the age and height_percentage plots.
The QQ plot is commonly used in data science to check whether a variable follows a normal distribution. The thing to look for is whether the head and tail of the curve stay close to the normality line. The plot above shows that building_id, geo_level_1_id, geo_level_2_id, and geo_level_3_id are “fat-tailed”, meaning they have more extreme values than a normal distribution and fewer values at the center. (plot reference)
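To make the idea concrete, here is a small base R sketch on simulated data (not the earthquake data). It compares a normal sample with a right-skewed one and uses the correlation between theoretical and sample quantiles as a rough, assumed summary of how closely the points follow a straight line.

```r
set.seed(42)
x_normal <- rnorm(500)   # simulated, approximately normal
x_skewed <- rexp(500)    # simulated, right-skewed

# qqnorm() returns the theoretical (x) and sample (y) quantiles.
qq_normal <- qqnorm(x_normal, plot.it = FALSE)
qq_skewed <- qqnorm(x_skewed, plot.it = FALSE)

# Correlation close to 1 means the points lie close to a straight line.
r_normal <- cor(qq_normal$x, qq_normal$y)
r_skewed <- cor(qq_skewed$x, qq_skewed$y)
cat("normal sample:", r_normal, " skewed sample:", r_skewed, "\n")

# To draw the plot itself: qqnorm(x_skewed); qqline(x_skewed)
```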
The QQ plot can also be drawn per damage_grade to show how the range of each variable differs between grades. From there, you can see which colored lines overlap and which do not. Lines that do not overlap suggest that the variable helps distinguish which damage_grade an observation belongs to. The plot shows that all of the numeric continuous variables are useful.
Optionally, you can also apply a log transformation to reduce the skew of the distribution. (reference)
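As a rough illustration, the sketch below applies log1p to a simulated right-skewed variable and compares the skewness before and after. The variable name, the simulated values, and the small skewness helper are all assumptions made for the example.

```r
# Simple sample skewness (hypothetical helper, not from a package).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
age <- rlnorm(1000, meanlog = 3, sdlog = 1)  # simulated right-skewed values

# log1p(x) = log(1 + x), which is safe when x contains zeros.
age_log <- log1p(age)

cat("skewness before:", skewness(age), "\n")
cat("skewness after: ", skewness(age_log), "\n")
```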
A correlation heat map is a useful plot for analyzing multivariate relationships. It shows the pairwise correlation between variables. The plot shows that count_floors_pre_eq is highly correlated with height_percentage. has_secondary_use and has_secondary_use_agriculture are also highly correlated.
When two variables are highly correlated, they carry almost the same information, so keeping both adds little useful variance. We can therefore decide later whether to remove the highly correlated variables or combine them with PCA.
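Highly correlated pairs can also be listed programmatically. The sketch below uses simulated columns that borrow the dataset's column names, and the 0.8 cut-off is an arbitrary assumption, not a value taken from the article.

```r
set.seed(7)
n <- 200
# Simulated columns: height is built from the floor count plus noise.
count_floors_pre_eq <- sample(1:5, n, replace = TRUE)
height_percentage <- 2 * count_floors_pre_eq + rnorm(n, sd = 0.5)
age <- runif(n, 0, 100)
num <- data.frame(count_floors_pre_eq, height_percentage, age)

cor_mat <- cor(num)

# Keep only the upper triangle so each pair is reported once.
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Positions of pairs whose absolute correlation exceeds the cut-off.
high <- which(abs(cor_mat) > 0.8, arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = cor_mat[high]
)
print(pairs)
```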
The PCA plots above (Figures 8–10) show the PCA results for the numeric variables. They suggest combining these 8 variables into 6 components, which together explain 86% of the variance. In general, for each component we pick the variables with high scores, called “factor loadings”, and combine them into that component. However, the plots show that some variables, such as geo_level_3_id and count_families, load onto multiple components (cross-loading). We can use another library to perform VARIMAX rotation:
The VARIMAX rotation helps resolve the cross-loading issue and gives a clearer view of which variables to combine. The best candidate is rotated component 1, which loads height_percentage and count_floors_pre_eq. The other components each contain only one variable, so combining them is not worthwhile.
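For readers who want to try the idea without an extra package, here is a minimal sketch using base R's prcomp and stats::varimax. This is not necessarily the library used for the figures above. The data is simulated, and the choice of three components is an assumption for the example.

```r
set.seed(3)
n <- 300
floors <- rnorm(n)  # shared signal behind two of the simulated columns
num <- data.frame(
  count_floors_pre_eq = floors + rnorm(n, sd = 0.3),
  height_percentage   = floors + rnorm(n, sd = 0.3),
  age                 = rnorm(n),
  area_percentage     = rnorm(n)
)

# PCA on standardized variables.
pca <- prcomp(num, center = TRUE, scale. = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(explained), 2))

# Loadings of the first k components (eigenvectors scaled by their sdev).
k <- 3
raw_loadings <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])

# Orthogonal VARIMAX rotation of those loadings.
rotated <- varimax(raw_loadings)
rot <- unclass(rotated$loadings)
print(round(rot, 2))
```

In the rotated matrix, variables that belong together should show a large loading on the same component and small loadings elsewhere.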
You may also check the chi-square statistic and the p-value. A p-value below 0.05 means the result is statistically significant (useful).
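One common way to obtain a chi-square statistic and a p-value in R is a test of independence between a categorical feature and damage_grade. The article does not say which test it refers to, so treat the sketch below as one possible reading. The data is simulated so that the damage grade depends on the roof type.

```r
set.seed(11)
n <- 600
roof_type <- sample(c("n", "q", "x"), n, replace = TRUE)

# Simulated dependence: roof type "x" is less likely to get grade 3.
p_high <- ifelse(roof_type == "x", 0.2, 0.6)
damage_grade <- ifelse(runif(n) < p_high, 3, 1)

# Contingency table and chi-square test of independence.
tab <- table(roof_type, damage_grade)
res <- chisq.test(tab)

print(tab)
cat("chi-square:", res$statistic, " p-value:", res$p.value, "\n")
```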
The box plots show that damage_grade tends to be higher when count_floors_pre_eq or height_percentage is higher. On the other hand, damage_grade tends to be lower when geo_level_1_id is higher.
Another advantage of box plots is outlier identification. In the age box plot, some black dots sit far to the right. These points can be removed in the data cleaning section, but we must verify the effect in the experiment section.
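The points a box plot draws as dots can be extracted with boxplot.stats, which applies the usual 1.5 × IQR rule. The values below are made up for illustration.

```r
# Made-up ages with one extreme value.
age <- c(0, 5, 10, 10, 15, 20, 25, 30, 40, 995)

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges.
stats <- boxplot.stats(age)
outliers <- stats$out

# Drop the flagged points (whether to do this should be tested later).
age_clean <- age[!age %in% outliers]

print(outliers)
print(age_clean)
```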
The data preprocessing steps are as follows:
1) Data Splitting (split before up-sampling so that interpolated data is not used for validation)
2) Data Cleaning (remove outliers and duplicate rows)
3) Data Sampling (up-sample to balance the dependent variable before any encoding)
4) Data Scaling (rescale the continuous variables)
5) Data Transformation (one-hot encoding and PCA transformation)
6) Feature Selection (Optional)
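The order of the steps above can be sketched in base R as follows. This is only an outline under several assumptions: the data is simulated, up-sampling is done by random duplication rather than interpolation, the 80/20 split ratio is arbitrary, and the PCA and feature selection steps are left out.

```r
set.seed(2024)
n <- 300
# Simulated data with an imbalanced target (not the real dataset).
df <- data.frame(
  age = round(runif(n, 0, 100)),
  height_percentage = round(runif(n, 2, 10)),
  roof_type = factor(sample(c("n", "q", "x"), n, replace = TRUE)),
  damage_grade = factor(sample(1:3, n, replace = TRUE, prob = c(0.1, 0.6, 0.3)))
)

# 1) Split first, so validation rows never take part in up-sampling.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- df[train_idx, ]
valid <- df[-train_idx, ]

# 2) Clean: drop duplicate rows from the training set.
train <- train[!duplicated(train), ]

# 3) Up-sample: keep every row and add random copies until each
#    damage_grade has as many rows as the largest class.
max_n <- max(table(train$damage_grade))
train_up <- do.call(rbind, lapply(split(train, train$damage_grade), function(g) {
  extra <- sample(seq_len(nrow(g)), max_n - nrow(g), replace = TRUE)
  g[c(seq_len(nrow(g)), extra), ]
}))

# 4) Scale continuous columns using training statistics only.
num_cols <- c("age", "height_percentage")
mu <- colMeans(train_up[num_cols])
sigma <- apply(train_up[num_cols], 2, sd)
train_up[num_cols] <- scale(train_up[num_cols], center = mu, scale = sigma)
valid[num_cols] <- scale(valid[num_cols], center = mu, scale = sigma)

# 5) One-hot encode the categorical features (target excluded).
x_train <- model.matrix(~ . - damage_grade - 1, data = train_up)
print(table(train_up$damage_grade))
print(colnames(x_train))
```

Scaling the validation set with the training mean and standard deviation keeps information from the validation rows out of the training pipeline, which is the same reason the split comes first.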