How to work with masked data in Machine Learning?

Akash Sambhangi · Published in Analytics Vidhya · Dec 20, 2019 · 8 min read

If you thought working with labeled data to derive meaningful relations and build a predictive model was difficult, what if the data you have is masked for some reason and you have no understanding of the categories, labels or field names?

Long story short: Exploratory Data Analysis and Feature Engineering. But since solutions are best understood through a problem being solved, I shall walk you through my solution to a Kaggle competition that uses masked data.

Kaggle Competition: Mercedes-Benz Greener Manufacturing.

Problem Statement: Can you cut the time a Mercedes-Benz spends on the test bench?

Performance Metric: R² value, also called the coefficient of determination.
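For reference, the coefficient of determination compares the model's squared error against that of a mean-only baseline:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

Here y_i is the observed test time, ŷ_i the prediction and ȳ the mean of the observed values; a value of 1 means perfect predictions, while 0 means the model is no better than predicting the mean.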

Let us now take a look at the data we have in hand.

In the above screenshot, ID and X0 to X385 are features of the dataset and y is the output variable (time spent on the test bench). As you can notice, the feature names are anonymous. Imagine an alternative scenario where the feature names are given; they might offer very useful insights. For example, a feature could be 4WD (1 or 0), and testing a car with 4WD might require more time on the test bench. Since that is not the case here, we have to rely on statistical measures to find useful features.

This dataset also suffers from the curse of dimensionality, as the number of features (377) is high relative to the number of rows (4209).

A lot of approaches focus more on the modelling part, but one should always remember: garbage in, garbage out. We will perform the tasks listed below to solve this problem.

  1. Perform EDA on the data and get as many insights as possible.
  2. Since we have a high number of dimensions, we shall try to eliminate unwanted/less important features.
  3. Come up with new features by leveraging the knowledge derived from EDA.
  4. Prepare models and evaluate the solution.

1. Exploratory Data Analysis

Removing Outliers from the output variable

Distribution of output variable ‘y’

From the above plots we can see that most of the points lie in the range 75–150 and there are very few points above 150, so we shall treat 150 as the upper limit for our datapoints.

Setting 150 as upper limit for output variable.
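The actual cell is only available as a screenshot; as a rough sketch of the idea (assuming the competition CSV is loaded into a DataFrame called train, and assuming rows above the limit are dropped rather than clipped), it might look like this:

```python
import pandas as pd

# Load the competition training data (file path assumed)
train = pd.read_csv("train.csv")

# Treat 150 as the upper limit for the target and drop the few rows above it
# (clipping the values to 150 instead would be the other obvious option)
print("Rows before:", len(train))
train = train[train["y"] <= 150].reset_index(drop=True)
print("Rows after :", len(train))
```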

Types of features

From the above cell we can see that the total number of features is 376, out of which 368 are numerical and 8 are categorical, plus one output variable.

Let us now see how many of these features are binary and how many are constant.

When there is a binary feature, there is always a chance that it is constant, so it is important to check for this. In the above cell we can see that there are 13 constant features; since these have no value in our supervised learning approach, we can simply drop them.
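A minimal sketch of such a check (column names and the 0/1 encoding are assumed from the competition data; this is not the author's exact cell):

```python
# Categorical columns are stored as strings (object dtype) in this dataset
cat_cols = [c for c in train.columns if train[c].dtype == "object"]
num_cols = [c for c in train.columns if c not in cat_cols + ["ID", "y"]]

# A numerical feature is binary if it only takes the values 0/1,
# and constant if it takes a single value for every row
binary_cols = [c for c in num_cols if set(train[c].unique()) <= {0, 1}]
constant_cols = [c for c in num_cols if train[c].nunique() == 1]

print(len(cat_cols), "categorical,", len(num_cols), "numerical")
print(len(binary_cols), "binary,", len(constant_cols), "constant")

# Constant features carry no information for supervised learning, so drop them
train = train.drop(columns=constant_cols)
```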

Univariate Analysis

Boxplot + Stripplot of categorical variables

Visualization of categorical variables against the output variable
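The figure itself comes from a screenshot; a sketch of how a similar boxplot-plus-stripplot panel could be produced with seaborn (categorical column names and the `train` DataFrame as assumed above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

cat_cols = ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"]

fig, axes = plt.subplots(len(cat_cols), 1, figsize=(14, 4 * len(cat_cols)))
for ax, col in zip(axes, cat_cols):
    order = sorted(train[col].unique())
    # The boxplot shows the spread of test time per category,
    # the stripplot overlays the individual datapoints on top of it
    sns.boxplot(x=col, y="y", data=train, order=order, ax=ax)
    sns.stripplot(x=col, y="y", data=train, order=order,
                  color="black", size=2, alpha=0.5, ax=ax)
plt.tight_layout()
plt.show()
```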

From the above plot we can observe the following:

  1. The feature ‘X0’ has good variation in the output variable range between different categories.
  2. A few categories, such as ‘y’ in feature ‘X1’ and ‘x’ in feature ‘X5’, are good indicators of low testing time.
  3. The feature ‘X4’ is not very useful as it does not give good differentiation.
  4. Features like ‘X3’, ‘X6’ and ‘X8’ show a considerable amount of overlap, hence they might not be very useful.

Check if the feature ‘ID’ is useful

Plot of feature ‘ID’ vs output variable
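A sketch of how such a plot might be drawn (again assuming the `train` DataFrame; `regplot` is used here simply to overlay a linear trend line):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter ID against test time and overlay a linear fit to make
# the slight downward trend visible
sns.regplot(x="ID", y="y", data=train,
            scatter_kws={"s": 8, "alpha": 0.4},
            line_kws={"color": "red"})
plt.title("ID vs time on test bench")
plt.show()
```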

We can observe a slight decreasing trend in the output variable as the value of ‘ID’ increases, so we shall keep this feature in our analysis.

2. Feature Removal (an extension to EDA)

Removing categorical features which show minimal variation in output variable.

From the box plots it is notable that the categorical features X8, X6, X4 and X3 are almost indistinguishable across their categories.

We know that the majority of features in our dataset are binary, so we can check the variance of each feature. For example, if a feature is ‘1’ for 99% of the datapoints and ‘0’ for the rest, it might not be very useful in our analysis; hence we shall remove all features with very low variance.

Dropping columns with low variance

From the above code we can see that a total of 135 features have very low variance.
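A minimal sketch of a variance filter along these lines (the 0.01 threshold below is an illustrative assumption, not necessarily the value used in the original notebook):

```python
# Binary (0/1) feature columns currently left in the DataFrame
binary_cols = [c for c in train.columns
               if train[c].dtype != "object"
               and c not in ("ID", "y")
               and set(train[c].unique()) <= {0, 1}]

# A feature that is almost always 0 (or almost always 1) has variance near zero
variances = train[binary_cols].var()

# Example threshold: a feature that is "on" for ~1% of rows has a variance of
# roughly 0.01 * 0.99, so ~0.01 is used here purely for illustration
low_var_cols = variances[variances < 0.01].index.tolist()
print("Dropping", len(low_var_cols), "low-variance features")
train = train.drop(columns=low_var_cols)
```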

Removing duplicate features and features which are very similar to others in the dataframe.

There might be cases where features in the dataset are either duplicated or very similar to each other. As an example of very similar features, suppose one feature is “Hybrid” (tells if the car is a hybrid, with both an IC engine and an electric motor) and another is “Battery pack” (tells us if the car has a battery pack or not). Since every hybrid car will have a battery pack, features like these can be very similar.

There are a number of similarity measures to choose from; with reference to the paper, the Rogers-Tanimoto distance is chosen. It is very similar to the Jaccard distance but uses bitwise operations.

Rogers-Tanimoto distance is calculated for each pair of binary features and plotted.
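A sketch of how these pairwise distances might be computed and plotted with SciPy (the `train` DataFrame and the recomputed list of binary columns are assumptions carried over from above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

# Binary feature columns currently left in the DataFrame
binary_cols = [c for c in train.columns
               if train[c].dtype != "object"
               and c not in ("ID", "y")
               and set(train[c].unique()) <= {0, 1}]

# Pairwise Rogers-Tanimoto distance between the feature columns
# (each column becomes one observation, hence the transpose)
dist_condensed = pdist(train[binary_cols].T.values.astype(bool),
                       metric="rogerstanimoto")

# Sort and plot the pairwise distances to look for a knee point
plt.plot(np.sort(dist_condensed))
plt.xlabel("feature pair (sorted)")
plt.ylabel("Rogers-Tanimoto distance")
plt.show()

# Pairs below the chosen threshold are treated as near-duplicates
dist_matrix = squareform(dist_condensed)
threshold = 0.006
similar_pairs = [(binary_cols[i], binary_cols[j])
                 for i in range(len(binary_cols))
                 for j in range(i + 1, len(binary_cols))
                 if dist_matrix[i, j] < threshold]
print(len(similar_pairs), "near-duplicate pairs found")
```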

From the above plot we can see a knee point at a very low value (less than 0.1); we shall choose 0.006 as the threshold.

Pairs of features which are very similar to each other.

After dropping one feature from each pair we are left with 192 features in the dataset, which is a considerable drop from the 376 we had in the beginning.

3. Feature Engineering

Feature ‘X0’ clustering labels.

We have seen that the feature ‘X0’ is very useful, so we shall use it to cluster the datapoints such that each cluster represents a certain range of the output variable. Using a KNN model, the datapoints are clustered on the feature ‘X0’, and a total of four clusters are formed. The ‘cluster_target_encoder()’ is responsible for fitting the data on a KNN model.

From the above cell we can see that the clusters are decently differentiated with a little overlap, hence the cluster labels can be used as a new feature.
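The original ‘cluster_target_encoder()’ is not shown here; purely as an illustration of the idea, the sketch below groups the ‘X0’ categories into four clusters by their mean target value using KMeans (a deliberate simplification; the original post fits a KNN model instead):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Mean test time for each category of 'X0'
x0_means = train.groupby("X0")["y"].mean()

# Group the X0 categories into four clusters by their mean target value
km = KMeans(n_clusters=4, n_init=10, random_state=42)
category_cluster = pd.Series(km.fit_predict(x0_means.values.reshape(-1, 1)),
                             index=x0_means.index)

# Map each row's X0 category to its cluster label and add it as a new feature
train["X0_cluster"] = train["X0"].map(category_cluster)
```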

Feature Learner.

Take all the binary numerical columns, form pairs or triplets of them, and take the sum of each pair/triplet to form potential new features. Each new feature is assessed based on its correlation with the output variable, and the features with the highest correlation are added to our dataset. Since we are comparing a continuous variable (the output) with a categorical/binary variable, we use the Point-Biserial correlation coefficient.

The above code snippet shows the formation of new features and the calculation of the Point-Biserial correlation coefficient of each new feature with the output variable. This information is then stored in a dictionary ‘feature_dict’, where the key is the name of the new feature and the value is its Point-Biserial correlation coefficient.
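A sketch of what such a feature learner could look like for pairs (triplets are analogous); the naming scheme and the skipping of constant candidates are illustrative assumptions, and `binary_cols` is the list of binary column names built earlier:

```python
from itertools import combinations
from scipy.stats import pointbiserialr

# Sum pairs of binary features to form candidate features and score each
# candidate by its correlation with the target
feature_dict = {}
for a, b in combinations(binary_cols, 2):
    candidate = train[a] + train[b]          # takes values 0, 1 or 2
    if candidate.nunique() < 2:
        continue                             # constant candidates are useless
    # scipy's point-biserial coefficient reduces to a Pearson correlation here
    corr, _ = pointbiserialr(candidate, train["y"])
    feature_dict[a + "_plus_" + b] = corr

# Keep the candidates with the strongest absolute correlation
best = sorted(feature_dict.items(), key=lambda kv: abs(kv[1]), reverse=True)[:10]
for name, corr in best:
    print(name, round(corr, 3))
```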

We then choose the features which have high correlation; the below screenshot shows the new features that were added.

Now, to test whether our newly added features are useful, we will fit a simple Random Forest regression model and plot the feature importances.
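A minimal sketch of such a check (the one-hot encoding of the remaining categoricals and the forest size are assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One-hot encode the remaining categorical columns so the forest can use them
X = pd.get_dummies(train.drop(columns=["y"]), dtype=int)
y = train["y"]

rf = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X, y)

# Rank the features by importance to see where the engineered ones land
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```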

Feature Importance

From the above plot it is evident that the engineered features, such as the triplets, the labels formed from clusters and even the ID feature, lie among the most useful features.

Trust me, the tough part is done.

All that is left now is to put our data into a model and do some basic hyper-parameter tuning.

4. Preparing models and evaluating the solution

A tuned XGBoost regressor is chosen as the model here; below is an example of how the tuning is done.
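A sketch of what a max_depth sweep of this kind might look like (the search range and the fixed values of the other parameters are assumptions for illustration, not the author's final settings):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X = pd.get_dummies(train.drop(columns=["y"]), dtype=int)
y = train["y"]

# Sweep max_depth and record the cross-validated R² for each setting;
# the other hyper-parameters are tuned the same way, one at a time
depths = list(range(1, 8))
scores = []
for d in depths:
    model = XGBRegressor(max_depth=d, n_estimators=100,
                         learning_rate=0.1, random_state=42)
    scores.append(cross_val_score(model, X, y, cv=5, scoring="r2").mean())

plt.plot(depths, scores, marker="o")
plt.xlabel("max_depth")
plt.ylabel("mean cross-validated R²")
plt.show()
```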

From the above plot we can see that lower depths (in the range 1–3) yield the best results; doing the same for the number of trees, colsample_bytree and gamma, we arrive at the final model.

Final XGBoost model
Scores at the end of model training.

Now, to evaluate how good the solution is, we can always use Kaggle's private and public leaderboards as reference (we focus more on the private leaderboard because we want our model to perform well on the majority of the unseen data).

The score attained by this solution lies in the top 4% of the leaderboard. Clearly this is not the best solution, but remember that we put only a little effort into the modelling part and no stacking techniques were used.

This shows that even when the data is masked, EDA and Feature Engineering are two key tools in data science that can be leveraged to attain good results. The complete code for this solution is available at this GitHub link.

References:

  1. https://www.appliedaicourse.com/
  2. https://www.kaggle.com/anokas/mercedes-eda-xgboost-starter-0-55
  3. https://www.kaggle.com/daniel89/mercedes-cars-clustering/
  4. https://www.kaggle.com/deadskull7/78th-place-solution-private-lb-0-55282-top-2

You can find my LinkedIn profile with this link.
