Mercedes Benz Greener Manufacturing With Machine Learning.

Harshwardhan Jadhav
Published in The Startup
23 min read · Oct 22, 2020

Car Testing Time Predictions using Machine Learning Model

In this blog, I am showcasing my work on the Kaggle problem statement ‘Mercedes-Benz Greener Manufacturing’.

Image Courtesy: https://i.pinimg.com/originals/5b/ac/4e/5bac4e30d414a7eda8d137af0b1b33d4.jpg

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Daimler’s Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

Before starting, let’s get an idea of the overall flow of this work.

Contents:

  1. Business Problem
  2. Mapping Business Problem into Machine Learning problem
  3. Performance Metric
  4. Data Loading and EDA(Exploratory Data Analysis)
  5. Feature Engineering
  6. Existing Approaches
  7. My First cut approach
  8. Machine Learning models
  9. Model Comparison
  10. Final ML Pipeline
  11. Future Extensions
  12. References

Now let’s start,

1. Business Problem:

No car we see on the road goes straight from the assembly line to the customer. Every car, bike, or other vehicle has to pass several testing procedures before it is allowed on the road for regular use. This testing ensures the safety and reliability of the vehicle in real-world scenarios. Testing involves many steps, because the tests have to cover all kinds of real-life situations, so it is a time-consuming process. More testing time means more testing cost, and as the testing time increases, the CO2 emissions from the vehicle increase with it. Still, testing is an essential step that no automobile manufacturer can skip: every vehicle configuration they manufacture has to go through all the tests to ensure the safety of the occupants and the reliability of the vehicle. As a popular premium automaker, Mercedes cannot compromise on the safety of the vehicle and its occupants. In fact, every vehicle manufacturer’s goal is to have a robust and efficient testing system, and nowadays all of them are moving towards automation.

Mercedes Benz and all other automakers are trying to automate their testing systems so as to make testing more efficient. Automated systems help to eliminate the errors caused by the inherent variability in human behavior, and auto-testing is also safer than putting a human in the driver’s seat. So the aim is to reduce testing time by analyzing the currently available data, which is collected from hundreds of tests on thousands of car configurations.

The basic problem statement is to create a machine learning model that accurately predicts the time a car spends on the test bench. A car configuration is nothing but the set of customization options and features selected for a particular car. Accurate models will help to reduce the total time spent on testing by allowing vehicles with the same configuration to be tested successively.

There are many features in a car configuration. For example, some class-D cars may have one or more additional features that other class-D cars don’t have, and in such cases their testing time will be different. For such cases, the machine learning model can help to predict the time spent on the test bench for cars of the same class but with some different features.

1.1 Business Objectives and Constraints

  • Predict accurately the time a car spends on the test bench
  • No strict latency constraints: a prediction time of a few seconds to a few minutes is okay, but not hours.

2. Mapping Business Problem into Machine Learning problem:

2.1 Data

2.1.1 Data Overview

This dataset contains an anonymized set of variables, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.

The ground truth is labeled ‘y’ and represents the time (in seconds) that the car took to pass testing for each variable.

We have two comma-separated files:

  • train.csv — Contains the training set with 4209 rows (data points) and 378 columns (features) with labels
  • test.csv — Contains the test set with 4209 rows (data points) and 377 columns (features) with no labels

Columns:

Link to the data set: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/data

2.2 Mapping the real-world problem to a Machine Learning Problem

2.2.1 Type of Machine Learning Problem

As our aim is to predict the testing time, which is a continuous variable, this is a regression problem. And as we have a labeled dataset here, it is a supervised machine learning problem. The idea is that Mercedes Benz can implement the best performing model in their testing procedure, which will result in efficient testing without lowering their standards and will also help greener manufacturing by reducing CO2 emissions.

3. Performance metric:

Now that we know this is a regression problem, we need an appropriate metric to evaluate our prediction model. The competition already specifies the R² metric for evaluation. R², also known as the coefficient of determination, gives the proportion of the variation in ‘y’ (testing time in this case) that is explained by the ‘x-variables’ (the combination of custom car features in this case). In simple words, R² tells us how much better the model’s predictions are than always predicting the mean of ‘y’. The higher the R² value, the more of the variation in testing time the model explains. E.g. an R² value of 0.66 indicates that the model explains 66% of the variance in ‘y’. Mathematically R² is denoted as follows:

There are 4 cases for the values of R²:
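To make the definition and its typical cases concrete, here is a minimal sketch that computes R² straight from the usual definition, R² = 1 − SS_res/SS_tot. The function name `r2_manual` and the toy testing times are my own illustrative assumptions, not values from the competition data.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed straight from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of our model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean-only model
    return 1.0 - ss_res / ss_tot

# Toy testing times in seconds (made-up values, for illustration only)
y = np.array([90.0, 100.0, 110.0, 120.0])

perfect = r2_manual(y, y)                             # predictions equal the truth
decent = r2_manual(y, [92.0, 99.0, 112.0, 117.0])     # close, but not exact
baseline = r2_manual(y, np.full_like(y, y.mean()))    # always predict the mean
worse = r2_manual(y, [120.0, 110.0, 100.0, 90.0])     # worse than predicting the mean

print(perfect, decent, baseline, worse)
```

A perfect model scores 1, a mean-only model scores 0, and a model that does worse than the mean goes negative, which is why the lower end of the range is unbounded.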

The R² metric is very sensitive to outliers. The algorithm that best explains the variation in testing times will be the optimal machine learning model for the task. This makes R² a suitable metric for evaluation in this problem, as Mercedes is really interested in how the different testing times for different configurations can be represented in a machine learning model.

As explained in the four cases above, the R² metric ranges from -∞ to 1, where 1 is the best possible score and lower is worse. For a reasonable model R² is rarely negative, so in practice we generally get an R² score between 0 and 1. RMSE and MAE, on the other hand, range from 0 to ∞ and their scale depends on the units of ‘y’, so it is harder to judge how good a score is relative to a baseline model. With R², a score of 0 corresponds to the mean-only baseline and 1 is the upper bound, so we can directly compare our model score with the baseline. Hence the R² metric is preferred over RMSE and MAE in this problem.

4. Data Loading and EDA(Exploratory Data Analysis):

4.1 Data Loading:

Using the pandas library, load the datasets, which are in CSV format:
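A minimal sketch of this loading step could look as follows. The helper names `load_datasets` and `summarize` are hypothetical, and the default file names simply assume the two Kaggle files sit in the working directory.

```python
import pandas as pd

def load_datasets(train_path="train.csv", test_path="test.csv"):
    """Read the two competition CSV files into DataFrames."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    return train, test

def summarize(df, name):
    """Print the shape and the number of columns per dtype."""
    print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
    print(df.dtypes.value_counts())
```

Calling `train, test = load_datasets()` followed by `summarize(train, "train")` gives roughly the same information as `train.info()`.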

Train dataset:

  • Here we can see there are 4209 data points, indexed from 0 to 4208, and 378 columns/features.

We have three types of data in the train dataset:

  • float64(1): Dependent feature, testing time in seconds
  • int64(369): Independent binary features (this count also includes the ‘ID’ column)
  • object(8): Independent categorical features

Test dataset:

  • Here we can see there are 4209 data points, indexed from 0 to 4208, and 377 columns/features.

We have two types of data in the test dataset:

  • int64(369): Independent binary features (this count also includes the ‘ID’ column)
  • object(8): Independent categorical features

We can see that we have the same number of data points in the train and test datasets.

Now Let’s start the actual EDA,

4.2 Statistical description of ID and Dependent variables:

First I will check for any missing values in the whole data set,

Let’s also check for duplicate rows,

Now let’s check if there are any duplicate values in the ‘ID’ column,
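The three checks above can be sketched with plain pandas calls. The helper `basic_quality_checks` and the tiny demo frame are assumptions made for illustration; on the real data you would pass the loaded train DataFrame.

```python
import pandas as pd

def basic_quality_checks(df, id_col="ID"):
    """Return counts of missing cells, duplicate rows and duplicate IDs."""
    return {
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
    }

# Tiny made-up frame, only to show the call
demo = pd.DataFrame({"ID": [0, 1, 2], "y": [88.5, 101.2, 95.0], "X10": [0, 1, 0]})
report = basic_quality_checks(demo)
print(report)
```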

4.3 Check the distribution of Dependent variable:

Histogram Plot of Testing Time(Dependent Variable)
Scatter plot of Testing Time(Dependent Variable)
  • From the above two plots, I can see that most of the points lie in the range of 75 to 150 seconds. We can also see one extreme point that takes more than 250 seconds for testing. Maybe this car configuration is bought very rarely, or is quite an expensive one, and thus we have only one such data point here. We can safely consider it an outlier because it is a single point far away from all the others.
  • Now that I have found one outlier, we should check for more. For that, let’s look at the percentiles of the testing times, and then I will decide a threshold for valid data points based on them. For now I will check only the 90th to 100th percentiles.
90.0th percentile:  115.25
91.0th percentile: 116.0484
92.0th percentile: 116.89160000000001
93.0th percentile: 118.0376
94.0th percentile: 119.056
95.0th percentile: 120.80600000000001
96.0th percentile: 122.4
97.0th percentile: 125.89319999999998
98.0th percentile: 129.2992
99.0th percentile: 137.4304
100th percentile: 265.32

I already declared the 100th percentile value, i.e. 265.32, as an outlier. Now let’s check the 99th to 99.9th percentiles so as to decide the threshold.

99.0th percentile:  137.4304
99.1th percentile: 139.09024
99.2th percentile: 140.1836
99.3th percentile: 140.81639999999993
99.4th percentile: 142.6480000000001
99.5th percentile: 146.23040000000006
99.6th percentile: 149.0374399999998
99.7th percentile: 151.4276800000003
99.8th percentile: 154.68695999999994
99.9th percentile: 160.38328000000087

Let’s use 155 seconds as the threshold and consider all values above 155 as outliers.
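Here is a small sketch of the percentile check and the threshold step. The helper names are hypothetical, the demo values are made up, and dropping the rows above the threshold is my assumption about how the outliers are handled (clipping the values at 155 would be the other option).

```python
import numpy as np
import pandas as pd

def print_percentiles(y, percentiles):
    """Print the requested percentiles of the testing time."""
    for p in percentiles:
        print(f"{p}th percentile: {np.percentile(y, p)}")

def drop_outliers(df, target="y", threshold=155):
    """Keep only rows whose target is at or below the threshold."""
    return df[df[target] <= threshold].reset_index(drop=True)

# Made-up testing times, one of them extreme
demo = pd.DataFrame({"ID": [0, 1, 2, 3], "y": [90.0, 110.0, 154.9, 265.32]})
print_percentiles(demo["y"], [90, 99, 100])
clean = drop_outliers(demo)
print(len(demo) - len(clean), "row(s) above the threshold removed")
```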

4.4 Analysis of Independent Features:

- We have a total of 8 categorical features.
- We have a total of 368 binary features.

4.4.1 Analysis of Categorical Features:

Categorical variables

Let’s plot Boxplot for each categorical feature,

Boxplots of X0-X1-X2 respectively
Boxplots of X3-X4-X5 respectively
Boxplots of X6-X8 respectively

From the above boxplots of the categorical features I can say:

  • After the removal of outliers (extreme ‘y’ values), the dataset looks much cleaner
  • Features X0, X1, X2, X3, X5, X6, X8 contain some important information, as their variance is quite high
  • Feature X4 seems to have very little variance, which means it carries very little information

Hence, I can safely remove X4 from the dataset because of its low variance (little information).

4.4.2 Analysis of Binary Features:

Binary variables

There are far more binary features (368) than categorical ones, so visualizing them one by one would be cumbersome. I think it is better to compute the variance of each binary feature and analyze those values instead. So I am finding the variances of all the binary features and will choose the important features based on them.

Binary Features and corresponding variances

Now I am plotting a scatter plot of the above data frame values so as to check their distribution.

The binary variable vs Variance Scatter plot
  • The above scatter plot shows that many features have 0 variance.
  • Roughly, I can also see some features sharing the same variance, but it is not clearly visible, so I will check their exact values.

Now let’s check for the features that have 0 variance and the ones that share the same variance,

The following 13 features have 0 variance:

['X11' 'X93' 'X107' 'X233' 'X235' 'X268' 'X289' 'X290' 'X293' 'X297'
'X330' 'X339' 'X347']
The following 53 features have the same variance as another feature:

['X35' 'X37' 'X39' 'X57' 'X76' 'X84' 'X94' 'X102' 'X113' 'X119' 'X120'
'X122' 'X130' 'X134' 'X136' 'X146' 'X147' 'X157' 'X172' 'X194' 'X199'
'X205' 'X213' 'X214' 'X216' 'X222' 'X226' 'X227' 'X232' 'X239' 'X242'
'X243' 'X244' 'X245' 'X247' 'X248' 'X253' 'X254' 'X262' 'X263' 'X266'
'X279' 'X296' 'X299' 'X302' 'X320' 'X324' 'X326' 'X360' 'X364' 'X365'
'X382' 'X385']
  • From the above analysis of categorical and binary features, I have concluded that I can drop the features with zero variance, the ones with the same variance, and the ones with very little variance, because they will not contribute much to the prediction of testing time while modeling. Now I will collect all the features to be dropped together.
  • In total, these are the 67 features I can drop:
['X4', 'X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X339', 'X347', 'X35', 'X37', 'X39', 'X57', 'X76', 'X84', 'X94', 'X102', 'X113', 'X119', 'X120', 'X122', 'X130', 'X134', 'X136', 'X146', 'X147', 'X157', 'X172', 'X194', 'X199', 'X205', 'X213', 'X214', 'X216', 'X222', 'X226', 'X227', 'X232', 'X239', 'X242', 'X243', 'X244', 'X245', 'X247', 'X248', 'X253', 'X254', 'X262', 'X263', 'X266', 'X279', 'X296', 'X299', 'X302', 'X320', 'X324', 'X326', 'X360', 'X364', 'X365', 'X382', 'X385']
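The variance analysis above can be sketched as follows. The helper `variance_report` is hypothetical, and reading ‘same variance’ as ‘a later feature whose variance equals that of an earlier feature’ is my assumption about how the list was built.

```python
import pandas as pd

def variance_report(df, binary_cols):
    """Return all variances, the zero-variance columns and the repeated-variance columns."""
    variances = df[binary_cols].var()
    zero_var = variances[variances == 0].index.tolist()
    nonzero = variances[variances > 0]
    # Columns whose variance repeats the variance of an earlier column
    same_var = nonzero[nonzero.duplicated(keep="first")].index.tolist()
    return variances, zero_var, same_var

# Made-up binary columns: X11 is constant, X12 is a copy of X10
demo = pd.DataFrame({
    "X10": [0, 1, 0, 1],
    "X11": [0, 0, 0, 0],
    "X12": [0, 1, 0, 1],
    "X13": [1, 0, 0, 0],
})
variances, zero_var, same_var = variance_report(demo, list(demo.columns))
to_drop = zero_var + same_var
print(to_drop)
```

Note that equal variance does not guarantee that two binary columns are identical, so comparing the columns themselves (for example with `df.T.duplicated()`) would be a stricter check.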

Now let’s find out some important binary features by using a random forest model, and perform visual EDA for those features only. I have trained a RandomForest model on the binary features and used the ‘feature_importances_’ attribute so as to get the important features.

These are the top 8 binary features:
['X314', 'X315', 'X118', 'X29', 'X54', 'X189', 'X46', 'X127']
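A minimal sketch of how such a ranking can be obtained is shown below. The helper `top_binary_features`, the hyperparameters and the synthetic data are assumptions for illustration, since the settings used for the ranking above are not given here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def top_binary_features(X_binary, y, k=8, random_state=42):
    """Fit a random forest and return the k most important column names."""
    rf = RandomForestRegressor(n_estimators=100, random_state=random_state)
    rf.fit(X_binary, y)
    importances = pd.Series(rf.feature_importances_, index=X_binary.columns)
    return importances.sort_values(ascending=False).head(k).index.tolist()

# Synthetic binary data in which only column B2 really drives the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(0, 2, size=(300, 5)),
                      columns=[f"B{i}" for i in range(5)])
y_demo = 100 + 20 * X_demo["B2"] + rng.normal(0, 1, size=300)

top = top_binary_features(X_demo, y_demo, k=2)
print(top)
```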
Boxplots for X29-X46-X54 respectively
Boxplots for X118-X127-X189 respectively
Boxplots for X314-X315 respectively

From the above plots, I can see that these important binary variables are pretty well distributed, and that the presence or absence of a feature changes the testing time.

  • For features ‘X314’, ‘X315’, ‘X118’, ‘X189’, most configurations take more time for testing when the feature is present.
  • For features ‘X29’, ‘X54’, ‘X127’, the configurations that do not have these features tend to take more testing time.
  • For ‘X46’, the configurations have almost the same testing time whether the feature is present or not.

4.4.3 EDA Summary:

  • There are no NaN values in the dataset
  • There are no duplicate rows in the dataset
  • Clipped dependent variable at 155 as threshold time and considered values all above 155 as outliers
  • Removed low variance categorical feature: [‘X4’]
  • Removed zero variance binary features: [‘X11’, ‘X93’, ‘X107’, ‘X233’, ‘X235’, ‘X268’, ‘X289’, ‘X290’, ‘X293’, ‘X297’, ‘X330’, ‘X339’, ‘X347’]
  • Removed same variance binary features: [‘X35’, ‘X37’, ‘X39’, ‘X57’, ‘X76’, ‘X84’, ‘X94’, ‘X102’, ‘X113’, ‘X119’, ‘X120’, ‘X122’, ‘X130’, ‘X134’, ‘X136’, ‘X146’, ‘X147’, ‘X157’, ‘X172’, ‘X194’, ‘X199’, ‘X205’, ‘X213’, ‘X214’, ‘X216’, ‘X222’, ‘X226’, ‘X227’, ‘X232’, ‘X239’, ‘X242’, ‘X243’, ‘X244’, ‘X245’, ‘X247’, ‘X248’, ‘X253’, ‘X254’, ‘X262’, ‘X263’, ‘X266’, ‘X279’, ‘X296’, ‘X299’, ‘X302’, ‘X320’, ‘X324’, ‘X326’, ‘X360’, ‘X364’, ‘X365’, ‘X382’, ‘X385’]

5. Feature Engineering:

5.1 Data Preprocessing

I am using LabelEncoder to encode the categorical features.

Code For Preprocessing the data

5.1.1 Encode the training and testing dataset:

First of all, let’s drop all the less informative columns I found above, and then encode the features.

Train Data Encoded
Test Data Encoded
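The encoding step can be sketched like this. The helper `encode_categoricals` is hypothetical, and fitting each encoder on the combined train and test categories is my assumption, made so that categories appearing only in the test set do not raise an error.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(train, test, cat_cols):
    """Label-encode the given columns consistently across train and test."""
    train, test = train.copy(), test.copy()
    for col in cat_cols:
        le = LabelEncoder()
        # Fit on the union of categories so unseen test values are handled
        le.fit(pd.concat([train[col], test[col]]).astype(str))
        train[col] = le.transform(train[col].astype(str))
        test[col] = le.transform(test[col].astype(str))
    return train, test

# Made-up categorical values in the style of the dataset
tr = pd.DataFrame({"X0": ["az", "k", "az"], "X10": [0, 1, 0]})
te = pd.DataFrame({"X0": ["k", "t", "az"], "X10": [1, 1, 0]})
tr_enc, te_enc = encode_categoricals(tr, te, ["X0"])
print(tr_enc["X0"].tolist(), te_enc["X0"].tolist())
```

Label encoding imposes an arbitrary order on the categories, which tree-based models usually tolerate better than distance-based ones such as KNN.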

Now let’s create some new features,

5.2 PCA(Principal Component Analysis):

PCA of binary features
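A minimal sketch of generating PCA features from the binary columns follows. The helper `add_pca_features` is hypothetical and the number of components is an arbitrary assumption, as the value used for the features above is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def add_pca_features(train_bin, test_bin, n_components=5, random_state=42):
    """Fit PCA on the train binary features and project both sets."""
    pca = PCA(n_components=n_components, random_state=random_state)
    train_pca = pca.fit_transform(train_bin)   # fit on train only
    test_pca = pca.transform(test_bin)         # reuse the same projection
    cols = [f"pca_{i}" for i in range(n_components)]
    return (pd.DataFrame(train_pca, columns=cols, index=train_bin.index),
            pd.DataFrame(test_pca, columns=cols, index=test_bin.index))

# Synthetic binary data standing in for the real binary columns
rng = np.random.default_rng(0)
cols = [f"B{i}" for i in range(10)]
train_bin = pd.DataFrame(rng.integers(0, 2, size=(40, 10)), columns=cols)
test_bin = pd.DataFrame(rng.integers(0, 2, size=(15, 10)), columns=cols)
train_pca, test_pca = add_pca_features(train_bin, test_bin, n_components=3)
print(train_pca.shape, test_pca.shape)
```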

5.3 SVD(Singular Value Decomposition):

Recent work by Gavish and Donoho provides an optimal truncation value, or hard threshold, under certain conditions, providing a principled approach to obtaining low-rank matrix approximations using the SVD.

It determines the optimal hard threshold ‘τ’ for singular value truncation under the assumption that a matrix has a low-rank structure contaminated with Gaussian white noise. This work builds on significant literature surrounding various techniques for hard thresholding of singular values.

If X ∈ R^(n×m) is rectangular and m < n, then the aspect ratio is β = m/n.

When noise is unknown there is no closed-form solution for ‘τ’, and it must be approximated numerically,

For unknown noise and a rectangular matrix X ∈ R^(n×m), the optimal hard threshold is given by:

τ = ω(β) · σmed

where σmed is the median singular value.

Here, ω(β) = λ(β)/√µβ, where µβ is the solution to the following problem:

The median µβ and hence the coefficient ω(β) are not available analytically;
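For reference, here are the quantities mentioned above written out as I understand them from Gavish and Donoho’s paper. This is my own reconstruction, so please verify it against the paper before relying on it. The first line is λ(β), the second is the equation that µβ solves (the median of the Marchenko-Pastur distribution), and the third is the approximation for ω(β) that the paper suggests for the unknown-noise case.

```latex
\lambda(\beta) = \sqrt{\,2(\beta+1) + \frac{8\beta}{(\beta+1) + \sqrt{\beta^{2} + 14\beta + 1}}\,}

\int_{(1-\sqrt{\beta})^{2}}^{\mu_\beta}
  \frac{\sqrt{\bigl((1+\sqrt{\beta})^{2} - t\bigr)\bigl(t - (1-\sqrt{\beta})^{2}\bigr)}}{2\pi\beta t}\,dt = \frac{1}{2}

\omega(\beta) \approx 0.56\,\beta^{3} - 0.95\,\beta^{2} + 1.82\,\beta + 1.43
```

As a quick sanity check, for a square matrix (β = 1) the first formula gives λ(1) = 4/√3 ≈ 2.309.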

Some values of coefficient ω(β) are tabulated in the Table below for convenience.

Now let’s find the hard threshold using the above method,
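The computation can be sketched with NumPy as below. Instead of solving for ω(β) numerically, this sketch uses the cubic approximation ω(β) ≈ 0.56β³ − 0.95β² + 1.82β + 1.43 from the Gavish and Donoho paper; the function names and the synthetic matrix are my own assumptions, not the code used for the features in this post.

```python
import numpy as np

def omega_approx(beta):
    """Approximate omega(beta) for the unknown-noise case."""
    return 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43

def hard_threshold_rank(X):
    """Return (tau, number of singular values above tau) for matrix X."""
    n_rows, n_cols = X.shape
    beta = min(n_rows, n_cols) / max(n_rows, n_cols)   # aspect ratio in (0, 1]
    sigma = np.linalg.svd(X, compute_uv=False)          # singular values, descending
    tau = omega_approx(beta) * np.median(sigma)
    return tau, int(np.sum(sigma > tau))

# Synthetic check: a rank-3 signal contaminated with Gaussian white noise
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
noisy = low_rank + rng.normal(size=(200, 50))
tau, rank = hard_threshold_rank(noisy)
print(f"tau = {tau:.2f}, components kept = {rank}")
```

The number of singular values above τ is what would then be passed as the number of SVD components.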

Now, using the above value as the number of components, add the SVD features,

SVD of binary features
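A hedged sketch of this step with scikit-learn’s `TruncatedSVD` is given below. The helper `add_svd_features` is hypothetical, and `n_components` stands for whatever rank the hard threshold suggested.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

def add_svd_features(train_bin, test_bin, n_components, random_state=42):
    """Fit a truncated SVD on the train binary features and project both sets."""
    svd = TruncatedSVD(n_components=n_components, random_state=random_state)
    train_svd = svd.fit_transform(train_bin)
    test_svd = svd.transform(test_bin)
    cols = [f"svd_{i}" for i in range(n_components)]
    return (pd.DataFrame(train_svd, columns=cols, index=train_bin.index),
            pd.DataFrame(test_svd, columns=cols, index=test_bin.index))

# Synthetic binary data standing in for the real binary columns
rng = np.random.default_rng(1)
cols = [f"B{i}" for i in range(8)]
train_bin = pd.DataFrame(rng.integers(0, 2, size=(30, 8)), columns=cols)
test_bin = pd.DataFrame(rng.integers(0, 2, size=(12, 8)), columns=cols)
train_svd, test_svd = add_svd_features(train_bin, test_bin, n_components=2)
print(train_svd.shape, test_svd.shape)
```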

5.4 Interactions:

  • Two-way interactions X314, X315
  • Three-way interactions X118, X314, X315
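How exactly the interactions are built is not spelled out above, so the following is only one plausible sketch: it multiplies the binary columns, which gives 1 only when all of the involved features are present. Summing the columns would be another common choice. The helper name and the new column names are my own.

```python
import pandas as pd

def add_interactions(df):
    """Add simple product interactions between the chosen binary features."""
    out = df.copy()
    # Two-way interaction: 1 only when both X314 and X315 are present
    out["X314_X315"] = out["X314"] * out["X315"]
    # Three-way interaction: 1 only when all three features are present
    out["X118_X314_X315"] = out["X118"] * out["X314"] * out["X315"]
    return out

# Made-up rows, only to show the result
demo = pd.DataFrame({"X118": [1, 0, 1], "X314": [1, 1, 0], "X315": [1, 1, 1]})
demo_inter = add_interactions(demo)
print(demo_inter)
```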

5.5 Gaussian Random Projections:
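Gaussian random projection reduces dimensionality by multiplying the data with a random matrix whose entries are drawn from a normal distribution. A minimal sketch with scikit-learn is shown below; the helper `add_grp_features` and the number of components are assumptions, since the settings used for the GRP features are not given here.

```python
import numpy as np
import pandas as pd
from sklearn.random_projection import GaussianRandomProjection

def add_grp_features(train_X, test_X, n_components=2, random_state=42):
    """Project train and test onto the same random Gaussian directions."""
    grp = GaussianRandomProjection(n_components=n_components,
                                   random_state=random_state)
    train_grp = grp.fit_transform(train_X)
    test_grp = grp.transform(test_X)
    cols = [f"grp_{i}" for i in range(n_components)]
    return (pd.DataFrame(train_grp, columns=cols, index=train_X.index),
            pd.DataFrame(test_grp, columns=cols, index=test_X.index))

# Synthetic binary data standing in for the real features
rng = np.random.default_rng(2)
cols = [f"B{i}" for i in range(6)]
train_X = pd.DataFrame(rng.integers(0, 2, size=(20, 6)), columns=cols)
test_X = pd.DataFrame(rng.integers(0, 2, size=(10, 6)), columns=cols)
train_grp, test_grp = add_grp_features(train_X, test_X)
print(train_grp.shape, test_grp.shape)
```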

6. Existing Approaches:

6.1 The 11th place solution:

This solution uses a clustering approach to check the target distribution. The author observed that there are broadly four groups (clusters).

One of the clusters is almost constant. He used only binary features for the final submission and dropped all the categorical features. He used 4 XGBoost estimators, tuned them independently, and then combined them, using subsets of features, for the final result. He got a score of 0.55263 on the private leaderboard. He kept a holdout set for validation, which helped him trust his local performance and achieve this score on the leaderboard. The interesting thing is that he got such good results even after dropping all the categorical features.

6.2 Stacking ML algorithm for Mercedes-Benz Greener Manufacturing Competition:

This blog is written by Amey Laddad. He explains how he approached the problem. At the start, he explains what stacking is, as his whole approach is stacking based.

Stacking Algorithms Basic Workflow

He trained three types of models. The first is XGBoost, where he found that TSVD, PCA & ICA generated features contribute effectively to the prediction. The second is a multi-layer perceptron, where he found that the loss & R² metric graphs converge after a few epochs and that the model predicts well without overfitting. The third is a stacking model consisting of LassoLarsCV and GradientBoostingRegressor estimators, which does not include the PCA, SVD & ICA features. Stacking gives better results than the first two models.

7. My First cut approach:

a. I have preprocessed the required variables in the feature engineering part. Now I will split the data into two sets: train and cross-validation.

b. So now the data is ready, and I can create a baseline regression model. I will use KNeighborsRegressor as the baseline model.

c. I will check its performance on the train and CV data using the R² metric.

d. While training the baseline model, I will first go with the original features, without including the newly created features. I will try different combinations of features in further models.

e. Based on the results of the baseline model, I will try different models for further improvement.
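Steps a to c can be sketched as follows. The helper `fit_baseline`, the 80/20 split and the synthetic data are illustrative assumptions; the actual split ratio is not stated above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def fit_baseline(X, y, test_size=0.2, random_state=42):
    """Split into train/CV, fit a default KNN regressor, return both R2 scores."""
    X_tr, X_cv, y_tr, y_cv = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model = KNeighborsRegressor().fit(X_tr, y_tr)
    r2_train = r2_score(y_tr, model.predict(X_tr))
    r2_cv = r2_score(y_cv, model.predict(X_cv))
    return model, r2_train, r2_cv

# Synthetic binary features with a target that depends on two of them
rng = np.random.default_rng(3)
X_demo = pd.DataFrame(rng.integers(0, 2, size=(300, 5)),
                      columns=[f"B{i}" for i in range(5)])
y_demo = 100 + 10 * X_demo["B0"] + 5 * X_demo["B1"] + rng.normal(0, 1, size=300)

model, r2_train, r2_cv = fit_baseline(X_demo, y_demo)
print(f"Train R2: {r2_train:.3f}, CV R2: {r2_cv:.3f}")
```

A large gap between the train and CV scores is the signal to look for when judging whether a model is overfitting.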

8. Machine Learning Models:

8.1 KNeighborsRegressor

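I will use only the original dataset for this model,

import time
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor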
start = time.time()
leaf_sizes = list(range(1,50))
neighbors = list(range(1,30))
norms=[1,2]
# create parameters dictionary
parameters = {'leaf_size':leaf_sizes, 'n_neighbors':neighbors, 'p':norms}
#Create a KNN Regressor model
knn = KNeighborsRegressor()
#Tune hyperparameters using RandomizedSearchCV
regressor = RandomizedSearchCV(knn, param_distributions=parameters, verbose=10, n_jobs=-1)
#Fit the model
best_regressor = regressor.fit(X_train, y_train)
# get the best parameters
best_K = best_regressor.best_estimator_.get_params()['n_neighbors']
best_leaf_size = best_regressor.best_estimator_.get_params()['leaf_size']
norm = best_regressor.best_estimator_.get_params()['p']
#Print The best parameters
print('Best K:', best_K)
print('Best leaf_size=', best_leaf_size)
print('Best norm:', norm)
elapsed = time.time() - start
print(f"Time elapsed: {elapsed}")
# Output:
# Best K: 13
# Best leaf_size= 41
# Best norm: 1
# Time elapsed: 12.097684621810913
  • Train R2 = 0.5817266349749117
  • Test Private R2 = 0.43537
  • Test Public R2 = 0.46360

Here I have used KNeighborsRegressor() with RandomizedSearchCV. This model is overfitting on the original dataset, because the R² value on both the cross-validation and test sets is noticeably lower than on the training set.

Let’s try another model,

8.2 Decision Tree Regressor:

With Original Dataset,

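import time
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor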
start = time.time()
depth = [1, 5, 10, 50, 100, 500, 1000]
# create parameters dictionary
parameters = {'max_depth' : depth}
#Create a Decision Tree Regressor model
dtr = DecisionTreeRegressor()
#Tune hyperparameters using RandomizedSearchCV
regressor = RandomizedSearchCV(dtr, param_distributions=parameters, verbose=10, n_jobs=-1)
#Fit the model
best_regressor = regressor.fit(X_train, y_train)
# get the best parameters
best_max_depth = best_regressor.best_estimator_.get_params()['max_depth']

#Print The best parameters
print('Best max_depth:', best_max_depth)
elapsed = time.time() - start
print(f"Time elapsed: {elapsed}")
# Output:
# Best max_depth: 5
# Time elapsed: 2.031101703643799
  • Train R2 = 0.6312890285381911
  • Test Private R2 = 0.53732
  • Test Public R2 = 0.55129

Here I tried DecisionTreeRegressor with RandomizedSearchCV; this model performs better than the previous KNeighborsRegressor model. Now let's try DecisionTreeRegressor with the newly engineered features.

8.3 Decision Tree Regressor:

With ‘Original Dataset+ PCA + SVD’,

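import time
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor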
start = time.time()
depth = [1, 3, 5, 10, 50, 100, 500, 1000]

# create parameters dictionary
parameters = {'max_depth' : depth}
#Create a Decision Tree Regressor model
dtr = DecisionTreeRegressor()
#Tune hyperparameters using RandomizedSearchCV
regressor = RandomizedSearchCV(dtr, param_distributions=parameters, verbose=10, n_jobs=-1)
#Fit the model
best_regressor = regressor.fit(train_svd_pca, y_train)
# get the best parameters
best_max_depth = best_regressor.best_estimator_.get_params()['max_depth']

#Print The best parameters
print('Best max_depth:', best_max_depth)
elapsed = time.time() - start
print(f"Time elapsed: {elapsed}")
# Output:
# Best max_depth: 3
# Time elapsed: 6.751556634902954
  • Train R2 = 0.6155964859539103
  • Test Private R2 = 0.53800
  • Test Public R2 = 0.55070

Here the Decision Tree Regressor with ‘Original Dataset+ PCA + SVD’ gives a slightly better private score than the previous one (0.53800 vs 0.53732), while its public score is marginally lower (0.55070 vs 0.55129).

8.4 Random Forest Regressor:

With Original Dataset,

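import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
start = time.time()

# Number of trees in random forest
n_estimators = [10, 25, 50, 100, 200, 300, 400, 500]
# Number of features to consider at every split
# Note: 'auto' (all features for regressors) was removed in scikit-learn 1.3; use 1.0 there
max_features = ['auto', 'sqrt']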
# Maximum number of levels in tree
max_depth = [3, 5, 10, 15, 20, 25, 30]
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

# create parameters dictionary
parameters = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf}
#Create a Random Forest Regressor model
rf = RandomForestRegressor()
#Tune hyperparameters using RandomizedSearchCV
regressor = RandomizedSearchCV(rf, param_distributions=parameters, verbose=10, n_jobs=-1)
#Fit the model
best_regressor = regressor.fit(X_train, y_train)
# get the best parameters
best_max_depth = best_regressor.best_estimator_.get_params()['max_depth']
best_n_estimators = best_regressor.best_estimator_.get_params()['n_estimators']
best_max_features = best_regressor.best_estimator_.get_params()['max_features']
best_min_samples_split = best_regressor.best_estimator_.get_params()['min_samples_split']
best_min_samples_leaf = best_regressor.best_estimator_.get_params()['min_samples_leaf']
#Print The best parameters
print('Best max_depth:', best_max_depth)
print('Best n_estimators:', best_n_estimators)
print('Best max_features:', best_max_features)
print('Best min_samples_split:', best_min_samples_split)
print('Best min_samples_leaf:', best_min_samples_leaf)

elapsed = time.time() - start
print(f"Time elapsed: {elapsed}")
# Output:
# Best max_depth: 5
# Best n_estimators: 300
# Best max_features: auto
# Best min_samples_split: 100
# Best min_samples_leaf: 2
# Time elapsed: 78.28886032104492
  • Train R2 = 0.6290333793346686
  • Test Private R2 = 0.54763
  • Test Public R2 = 0.55442

Cool, this Random Forest Regressor with original features is giving better results than the previous models. Let’s try this model on new features too.

8.5 Random Forest Regressor:

With ‘Original Dataset+ PCA + SVD’,

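import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
start = time.time()

# Number of trees in random forest
n_estimators = [10, 25, 50, 100, 200, 300, 400, 500]
# Number of features to consider at every split
# Note: 'auto' (all features for regressors) was removed in scikit-learn 1.3; use 1.0 there
max_features = ['auto', 'sqrt']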
# Maximum number of levels in tree
max_depth = [3, 5, 10, 15, 20, 25, 30]
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3, 5, 10]

# create parameters dictionary
parameters = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf}
#Create a Random Forest Regressor model
rf = RandomForestRegressor()
#Tune hyperparameters using RandomizedSearchCV
regressor = RandomizedSearchCV(rf, param_distributions=parameters, verbose=10, n_jobs=-1)
#Fit the model
best_regressor = regressor.fit(train_svd_pca, y_train)
# get the best parameters
best_max_depth = best_regressor.best_estimator_.get_params()['max_depth']
best_n_estimators = best_regressor.best_estimator_.get_params()['n_estimators']
best_max_features = best_regressor.best_estimator_.get_params()['max_features']
best_min_samples_split = best_regressor.best_estimator_.get_params()['min_samples_split']
best_min_samples_leaf = best_regressor.best_estimator_.get_params()['min_samples_leaf']
#Print The best parameters
print('Best max_depth:', best_max_depth)
print('Best n_estimators:', best_n_estimators)
print('Best max_features:', best_max_features)
print('Best min_samples_split:', best_min_samples_split)
print('Best min_samples_leaf:', best_min_samples_leaf)

elapsed = time.time() - start
print(f"Time elapsed: {elapsed}")
# Output:
# Best max_depth: 5
# Best n_estimators: 500
# Best max_features: auto
# Best min_samples_split: 100
# Best min_samples_leaf: 10
# Time elapsed: 290.3807260990143
  • Train R2 = 0.6364927215350347
  • Test Private R2 = 0.54145
  • Test Public R2 = 0.55076

This model with the new features does not perform as well as the model with the original features.

8.6 Random Forest Regressor:

With ‘Original Dataset+PCA+SVD+GRP+Interactions’

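import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
start = time.time()

# Number of trees in random forest
n_estimators = [10, 25, 50, 100, 200, 300, 400, 500]
# Number of features to consider at every split
# Note: 'auto' (all features for regressors) was removed in scikit-learn 1.3; use 1.0 there
max_features = ['auto', 'sqrt']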
# Maximum number of levels in tree
max_depth = [2, 3, 5, 10, 15, 20, 25]
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 5, 10, 15, 25]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 3, 5, 10]

# create parameters dictionary
parameters = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf}
#Create a Random Forest Regressor model
rf = RandomForestRegressor()
#Tune hyperparameters using RandomizedSearchCV
regressor = RandomizedSearchCV(rf, param_distributions=parameters, verbose=10, n_jobs=-1)
#Fit the model
best_regressor = regressor.fit(train_grp_pca_svd_inter, y_train)
# get the best parameters
best_max_depth = best_regressor.best_estimator_.get_params()['max_depth']
best_n_estimators = best_regressor.best_estimator_.get_params()['n_estimators']
best_max_features = best_regressor.best_estimator_.get_params()['max_features']
best_min_samples_split = best_regressor.best_estimator_.get_params()['min_samples_split']
best_min_samples_leaf = best_regressor.best_estimator_.get_params()['min_samples_leaf']
#Print The best parameters
print('Best max_depth:', best_max_depth)
print('Best n_estimators:', best_n_estimators)
print('Best max_features:', best_max_features)
print('Best min_samples_split:', best_min_samples_split)
print('Best min_samples_leaf:', best_min_samples_leaf)

elapsed = time.time() - start
print(f"Time elapsed: {elapsed}")
# Output:
# Best max_depth: 5
# Best n_estimators: 500
# Best max_features: auto
# Best min_samples_split: 5
# Best min_samples_leaf: 3
# Time elapsed: 390.1909713745117
  • Train R2 = 0.6527769059803106
  • Test Private R2 = 0.54220
  • Test Public R2 = 0.54920

This performs slightly better on the private leaderboard than the previous model (0.54220 vs 0.54145), but still not better than the model with the original features (0.54763).

8.7 XGBoost Regressor:

With ‘Original Dataset+PCA+SVD+GRP+Interactions’,

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

neigh = XGBRegressor(random_state=42, n_jobs=-1)
parameters = {'learning_rate':[0.001,0.01,0.05,0.1,1],
'n_estimators':[100,150,200,500],
'max_depth':[2,3,5,10],
'colsample_bytree':[0.1,0.5,0.7,1],
'subsample':[0.2,0.3,0.5,1],
'gamma':[1e-2,1e-3,0,0.1,0.01,0.5,1],
'reg_alpha':[1e-5,1e-3,1e-1,1,1e1]}

reg=RandomizedSearchCV(neigh,parameters,cv=5, scoring='r2', return_train_score=True, n_jobs=-1,
verbose=10)
reg.fit(train_grp_pca_svd_inter, y_train)
best_max_depth = reg.best_estimator_.get_params()['max_depth']
best_n_estimators = reg.best_estimator_.get_params()['n_estimators']
best_colsample_bytree = reg.best_estimator_.get_params()['colsample_bytree']
best_subsample = reg.best_estimator_.get_params()['subsample']
best_gamma = reg.best_estimator_.get_params()['gamma']
best_reg_alpha = reg.best_estimator_.get_params()['reg_alpha']
best_learning_rate = reg.best_estimator_.get_params()['learning_rate']
#Print The best parameters
print('Best max_depth:', best_max_depth)
print('Best n_estimators:', best_n_estimators)
print('Best colsample_bytree:', best_colsample_bytree)
print('Best subsample:', best_subsample)
print('Best reg_alpha:', best_reg_alpha)
print('Best gamma:', best_gamma)
print('Best learning_rate:', best_learning_rate)
# Output:
# Best max_depth: 2
# Best n_estimators: 100
# Best colsample_bytree: 1
# Best subsample: 1
# Best reg_alpha: 10.0
# Best gamma: 0
# Best learning_rate: 0.1
  • Train R2 = 0.6551805553802594
  • Test Private R2 = 0.54682
  • Test Public R2 = 0.54759

The XGBoost regressor performs about the same as the Random Forest models (private R² of 0.54682 vs 0.54763 for the best Random Forest).

8.8 Stacking Regressor:

With ‘Original Dataset+PCA+SVD+GRP+Interactions’,

Here I combined the three Random Forest models and the XGBoost model from above using a stacking regressor. I have used the Ridge Regression model as the meta regressor.

Load saved models,

import joblib

filename = 'RF_Orig_Feat_model.sav'
rf1 = joblib.load(filename)
print(f'Loaded {filename}')
# Output: Loaded RF_Orig_Feat_model.sav

filename = 'RF_PCA_SVD_inter_GPR_model.sav'
rf2 = joblib.load(filename)
print(f'Loaded {filename}')
# Output: Loaded RF_PCA_SVD_inter_GPR_model.sav

filename = 'RF_PCA_SVD_model.sav'
rf3 = joblib.load(filename)
print(f'Loaded {filename}')
# Output: Loaded RF_PCA_SVD_model.sav

filename = 'XGB_PCA_SVD_inter_GPR_model.sav'
xgb = joblib.load(filename)
print(f'Loaded {filename}')
# Output: Loaded XGB_PCA_SVD_inter_GPR_model.sav

Stack them together,

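from mlxtend.regressor import StackingCVRegressor
from sklearn.linear_model import Ridge

# Meta regressor: Ridge with alpha=0 and no intercept, i.e. a plain linear blend of the base predictions
ridge_reg = Ridge(random_state=42, fit_intercept=False, alpha=0)
stacked_model = StackingCVRegressor(regressors=(rf1, rf2, rf3, xgb),
                                    meta_regressor=ridge_reg,
                                    use_features_in_secondary=False,
                                    refit=True, cv=5)

stacked_model.fit(train_grp_pca_svd_inter, y_train)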
  • Train R2 = 0.6530002337544227
  • Test Private R2 = 0.55017
  • Test Public R2 = 0.55310

The stacking regressor scores slightly higher than the individual models on both leaderboards (a private R2 of 0.55017 versus 0.54682 for XGBoost alone). Stacking is popular among Kagglers because combining several different models often gives better results than any one of them. This score can still be improved, so let’s create a custom model and see if it works.
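To make the stacking idea concrete, here is a minimal hand-rolled sketch of roughly what a cross-validated stacking regressor does: each base model produces out-of-fold predictions, and a meta-regressor is trained on those predictions. The data is synthetic, and the two base models and the Ridge `alpha` are assumptions picked for speed, not the saved models used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Mercedes-Benz dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Illustrative base models; the case study stacks three RandomForests and one XGBoost.
base_models = [
    RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0),
    DecisionTreeRegressor(max_depth=3, random_state=0),
]

# Step 1: out-of-fold predictions from each base model become the meta-features.
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in base_models]
)

# Step 2: the meta-regressor learns how to weight the base predictions.
meta_model = Ridge(alpha=1.0)
meta_model.fit(meta_features, y)

# Step 3: refit every base model on all the data for use at prediction time.
fitted_models = [model.fit(X, y) for model in base_models]

def stacked_predict(X_new):
    """Predict with each refit base model, then blend with the meta-regressor."""
    base_preds = np.column_stack([m.predict(X_new) for m in fitted_models])
    return meta_model.predict(base_preds)

preds = stacked_predict(X[:5])
```

Training the meta-regressor on out-of-fold predictions, rather than on predictions for rows the base models have already seen, keeps it from simply trusting whichever base model overfits the most.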

8.9 Custom Model:

With ‘Original Dataset+PCA+SVD+GRP+Interactions’,

The basic flow of this custom model is shown in the below figure,

Custom Model Flow Diagram

Code for the model:

Code snippet of Custom model

I ran the model shown above for several values of K (the number of base models), as follows. The base models used here are DecisionTreeRegressor instances.

import time
start = time.time()
n_estimators = [3,5,10,20,50,75,100,225,500]
train_scores = []
cv_scores = []
n_models = []
for k in n_estimators:
    model, R2_train, R2_cv = CustomEnsembleRegressor(D_train, y_train.values, k)
    n_models.append(model)
    train_scores.append(R2_train)
    cv_scores.append(R2_cv)
    print(f"Train R2:{R2_train}, CV R2:{R2_cv}")

elapsed = time.time() - start
print(f"Time elapsed: {elapsed}")
# Output:
100%|██████████| 3/3 [00:00<00:00, 573.99it/s]
Train R2:0.08220279658407303, CV R2:-0.01337314473045259
100%|██████████| 5/5 [00:00<00:00, 985.04it/s]
Train R2:0.09105626276132484, CV R2:-0.01563382377423217
100%|██████████| 10/10 [00:00<00:00, 1151.36it/s]
Train R2:0.1297652945230663, CV R2:-0.03525505486195346
100%|██████████| 20/20 [00:00<00:00, 1238.28it/s]
Train R2:0.1456421992646002, CV R2:-0.024534879198908177
100%|██████████| 50/50 [00:00<00:00, 773.68it/s]
Train R2:0.16835058802053093, CV R2:-0.05085436003144661
100%|██████████| 75/75 [00:00<00:00, 792.55it/s]
Train R2:0.19105718148989337, CV R2:-0.036930656218675306
100%|██████████| 100/100 [00:00<00:00, 772.30it/s]
Train R2:0.20514241360525387, CV R2:-0.028996842506092246
100%|██████████| 225/225 [00:00<00:00, 798.80it/s]
Train R2:0.23149152275845875, CV R2:-0.037854168858211024
100%|██████████| 500/500 [00:00<00:00, 788.68it/s]
Train R2:0.27910345041117934, CV R2:-0.01996754386214894
Time elapsed: 267.02857732772827

And got these results,

Train vs Cv R2 score of the custom model

The graph makes it clear that this is the worst model I have developed: the training R2 stays below 0.28 and the cross-validation R2 is negative for every K. So there is no point in submitting this model to the Kaggle leaderboard.
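The negative CV R2 values in the output above have a concrete meaning: R2 compares a model's squared error with that of always predicting the mean of the target, so a score below zero means the model does worse than that constant baseline. The small worked example below uses made-up numbers, not data from the case study.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is the error of predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up testing times in seconds, purely for illustration.
y_true = [90.0, 100.0, 110.0]
mean_baseline = [100.0, 100.0, 100.0]  # always predict the mean
poor_model = [105.0, 100.0, 95.0]      # gets the trend backwards

print(r2_score_manual(y_true, mean_baseline))  # 0.0
print(r2_score_manual(y_true, poor_model))     # -1.25
```

Here the baseline has a squared error of 200 and the poor model has 450, which gives 1 - 450/200 = -1.25.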

8.10 Averaging Model:

With Original Data,

For this model I used only the original data with label-encoded categorical features. I did not include any of the advanced features.

Following is the logic for the model,

The flow of Averaging Model

As shown above,

a. I trained two XGBoost regression models on the whole training set.

b. I predicted the target values for the test data with both models.

c. Finally, I used the average of these predicted values as the final predictions.
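The three steps above can be sketched as follows. This is a minimal illustration, not the code used for the submission: `average_predictions` and `ConstantOffsetModel` are hypothetical names, and the stand-in class only mimics the `predict` method of a fitted regressor so that the example runs without XGBoost installed.

```python
import numpy as np

def average_predictions(models, X):
    """Return the element-wise mean of each fitted model's predictions on X."""
    preds = np.column_stack([model.predict(X) for model in models])
    return preds.mean(axis=1)

class ConstantOffsetModel:
    """Hypothetical stand-in for a fitted regressor such as an XGBoost model."""

    def __init__(self, offset):
        self.offset = offset

    def predict(self, X):
        # Toy rule: sum the features of each row and add a fixed offset.
        return np.asarray(X, dtype=float).sum(axis=1) + self.offset

X_test = np.array([[1.0, 2.0], [3.0, 4.0]])
models = [ConstantOffsetModel(0.0), ConstantOffsetModel(2.0)]

final_predictions = average_predictions(models, X_test)
print(final_predictions)  # [4. 8.]
```

Any objects with a `predict` method can be passed in, so two fitted XGBoost regressors would work in place of the stand-ins.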

Code for the same is as follows,

By using this averaging approach,

  • Train R2 = 0.6142291
  • Test Private R2 = 0.55179
  • Test Public R2 = 0.55631

This is the best score of all the models so far. It ranks 401st on the private leaderboard, which is within the top 10% of all contestants.

9. Model Comparisons:

The following table shows the comparison of all models I used,

Model Comparison Table
Submission File
Kaggle Private Leader Board

10. Final Pipeline:

Now I have the best model, which is the averaging model.

So, let’s create a final pipeline for this model that can be used in production. Following is the code for the final pipeline,

Final Pipeline

Now by using the above pipeline let’s predict the testing time for a set of 10 data points,

For the following 10 points,

Here I have used the pipeline and predicted the testing time for the 10 car configurations.

11. Future Extensions:

a. Other techniques, such as a deep MLP neural network, could be used to solve this problem.

b. We could also try the models with more interaction features than I used here, choosing them by their correlation with the target variable.
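One way the second idea could be explored is to build every pairwise product of features and rank the products by their absolute correlation with the target. The sketch below is only an assumption about how that selection might look: the function name, the feature names, and the synthetic binary data are all hypothetical.

```python
from itertools import combinations

import numpy as np

def rank_interactions(X, y, feature_names, top_k=3):
    """Rank pairwise product features by |Pearson correlation| with y."""
    scores = []
    for i, j in combinations(range(X.shape[1]), 2):
        product = X[:, i] * X[:, j]
        if product.std() == 0:  # constant column: correlation is undefined
            continue
        corr = np.corrcoef(product, y)[0, 1]
        scores.append((f"{feature_names[i]}*{feature_names[j]}", abs(corr)))
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]

# Made-up binary features; the target depends on the f0*f1 interaction.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
y = 5.0 * X[:, 0] * X[:, 1] + rng.normal(0, 0.1, size=200)
names = ["f0", "f1", "f2", "f3"]

top = rank_interactions(X, y, names)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

With many features the number of pairs grows quadratically, so in practice the candidates would probably need to be limited to the most important individual features first.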

12. References:

  1. https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/
  2. https://www.epa.gov/ghgemissions/global-greenhouse-gas-emissions-data
  3. https://www.geeksforgeeks.org/ml-r-squared-in-regression-analysis/
  4. https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36242
  5. https://arxiv.org/pdf/1305.5870.pdf
  6. https://www.youtube.com/watch?v=9vJDjkx825k
  7. https://blog.goodaudience.com/stacking-ml-algorithm-for-mercedes-benz-greener-manufacturing-competition-5600762186ae
  8. https://medium.com/@williamkoehrsen/capstone-project-mercedes-benz-greener-manufacturing-competition-4798153e2476
  9. https://www.kaggle.com/adityakumarsinha/stacked-then-averaged-models-private-lb-0-554?scriptVersionId=1337077
  10. https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/37700#411807
  11. https://statisticsbyjim.com/regression/interaction-effects/
  12. https://www.appliedaicourse.com/

I would like to thank the whole appliedaicourse team and especially my mentor Jithin sir who guided me throughout the case study.

The full work with all the code can be found on my GitHub profile:

Connect with me on LinkedIn:
