Mercedes-Benz Greener Manufacturing

Can you cut the time a Mercedes-Benz spends on the test bench?

Nutenki Vinay Kumar
The Startup
13 min read · Sep 12, 2020


Author’s Note: This is the completion report for my appliedaicourse[1] Capstone Project. All work is original; feel free to use, expand upon, or disseminate it. [Numbers in brackets are citations to the sources listed in the references section.]

Bird’s-eye view of the project:

  • Intuition for the given business problem, and real-world use cases of this solution.
  • Use of ML/DL to solve the problem, and downloading/scraping/extracting the data from its source.
  • Data description and improvements to the existing approaches.
  • Exploratory data analysis with observations and plots, and feature engineering, which consists of 10 steps as outlined below:

(a, b, c): Loading the data, conversion of categorical data into numerical, Missing value analysis.

(d, e): Data Visualization, analysis of data.

(f): Hugely reducing the dimensionality of the data by detecting multicollinearity using the Variance Inflation Factor (VIF), which reduces model complexity and computational cost.

(g): Implementing the Gavish-Donoho method to find the optimal value of ‘k’, and plotting the singular-value curves to visualize the concept in practice.

(h): Finding top important features by RFECV and RFE methods.

(i): Adding new features using Dimensionality Reduction techniques

(j): Generating new features using the Two-way and Three-way Feature Interaction, from the top features.

  • Tuning various models to find the best hyperparameters, fitting models with those hyperparameters, analyzing how well the feature engineering worked, and comparing the final results of all the models.

1. Explanation of Business Problem:

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Daimler’s Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of each and every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. But, optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines.

Daimler is challenging Kagglers to tackle the Curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

The motivation behind the problem is that an accurate model can reduce the total time spent testing vehicles by allowing cars with similar testing configurations to be run successively on different paths of the vehicle testing layout, as shown in the figure below.

Vehicle Testing Layout

Examples of Custom features: 4WD, added air suspension, a head-up display, etc

2. Use of ML/DL:

This problem is a Machine-Learning / Deep-Learning regression task: predicting a continuous target variable (the duration of the test).

3. Source of Data:

Data is downloaded from the Mercedes-Benz Greener Manufacturing Kaggle competition page[2] and unzipped.

Thankfully this is not a big dataset, so it is added to Google Drive and unzipped directly in Google Colab, as sketched below.

If the dataset were too big, it would be better to use CurlWget (a Chrome extension) to import the data.
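
A minimal sketch of the unzip step in Colab (the paths are hypothetical; adjust them to wherever the archive sits on your Drive):

```python
# Mount Google Drive and extract the competition archive inside Colab.
from google.colab import drive
import zipfile

drive.mount('/content/drive')  # authorize access to Google Drive

# Hypothetical location of the downloaded competition archive.
zip_path = '/content/drive/MyDrive/mercedes-benz-greener-manufacturing.zip'

with zipfile.ZipFile(zip_path, 'r') as zf:
    zf.extractall('/content/data')  # the train/test files land here
```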

4. Data Description

This dataset contains an anonymized set of variables, each representing a custom feature in a Mercedes car. For example, a variable could be 4WD, added air suspension, or a head-up display.

The ground truth is labeled ‘y’ and represents the time (in seconds) that each car configuration took to pass testing.

Notable improvements over the existing approaches:

  1. Detecting Multicollinearity using VIF(Variance Inflation Factor).
  2. Finding the optimal value of ‘k’ in TSVD, following the paper by Gavish and Donoho[3][4], “The Optimal Hard Threshold for Singular Values is $4/\sqrt{3}$”.

5. Exploratory Data Analysis(E.D.A) and Feature Engineering:

( a ). Loading the dataset:

  • The dataset is loaded into a pandas DataFrame, as sketched below.
  • The train dataset is of size (4209, 378); the test dataset is of size (4209, 377).
  • Out of these, 8 are categorical features, 1 is the ID, and 368 are binary.
  • The one extra column in the training dataset, named ‘y’, is the target variable.
Pandas Dataframe loaded with the training dataset
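
A minimal sketch of the loading step (file names and paths assumed from the unzipped Kaggle download):

```python
import pandas as pd

# Paths depend on how the archive was unzipped above.
train = pd.read_csv('/content/data/train.csv')
test = pd.read_csv('/content/data/test.csv')

print(train.shape, test.shape)      # expected: (4209, 378) (4209, 377)
print(train.dtypes.value_counts())  # 8 object (categorical) columns, the rest numeric
```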

( b ). Categorical to Numerical feature conversion and aligning the train and test data frames:

  • Categorical features of the train and test data are converted into numerical features using the pandas function get_dummies.
  • A column may have a different number of unique categories in the train and test data frames, which leads to a difference in the shapes of the resulting data frames.

output: {‘X0_aa’, ‘X0_ab’, ‘X0_ac’, ‘X0_q’, ‘X2_aa’, ‘X2_ar’, ‘X2_c’, ‘X2_l’, ‘X2_o’, ‘X5_u’, ‘y’}

  • In the code above, we identify the features that are not common to both data frames.

output: (4209, 554) (4209, 554)

  • We align the data frames by taking the inner join of their columns; a sketch of the encoding and alignment steps is given below.
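
A sketch of the encoding and alignment steps, using the train/test frames loaded above (the column names are those of this dataset; the full code is in the GitHub repository):

```python
# One-hot encode the 8 categorical columns, then keep only the columns present
# in both frames via an inner join on the column axis.
cat_cols = ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
train_enc = pd.get_dummies(train, columns=cat_cols)
test_enc = pd.get_dummies(test, columns=cat_cols)

# Columns present in one frame but not the other (includes 'y', which only train has).
print(set(train_enc.columns) ^ set(test_enc.columns))

y = train_enc['y']  # keep the target aside before aligning
train_enc, test_enc = train_enc.align(test_enc, join='inner', axis=1)
print(train_enc.shape, test_enc.shape)  # expected: (4209, 554) (4209, 554)
```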

( c ). Null/Missing value analysis:

  • If missing values are handled improperly, the results will differ from those obtained on data without missing values.
  • Rows with missing data can be deleted, or the gaps can be filled using data imputation techniques, as described in this link.[5]
  • In multivariate analysis, if there is a large number of missing values, it can be better to drop those cases rather than impute them.
  • On the other hand, in univariate analysis, imputation can decrease the amount of bias in the data if the values are missing at random.[6]
  • Our dataset, however, doesn’t have any missing values, as the quick check below confirms.
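
A quick sketch of the check, on the encoded frames from the previous step:

```python
# Total number of missing cells in each frame; both come out to zero.
print(train_enc.isnull().sum().sum())
print(test_enc.isnull().sum().sum())
```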

( d ). Data Visualization:

First, let’s take only the target variable, plot it on the y-axis, and plot its reset indices on the x-axis; a sketch of this plot follows the observations below.

  • We reset the index because the IDs are not contiguous.
  • Overplotting is one of the most common problems in data visualization: when the dataset is big, the dots of a scatterplot tend to overlap, so we reduce the dot size to fit more points into a unit area.
  • From the diagram, we can see that the target (y, the test time) looks like a line, apart from a small portion of points at the ends that do not lie on it.
  • There is also a single point whose time is above 250 seconds, which is an outlier.
  • Because not all target values lie on a line, the R² metric (R² = 1 - SSres/SStot) won’t reach large values; it is very sensitive to outliers, since they inflate SSres.
  • The best possible R² value is 1.0.
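
A minimal sketch of the scatter plot described above (small markers to limit overplotting):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.scatter(range(len(y)), y, s=2)  # small dots reduce overlap
plt.xlabel('row index (reset, since the IDs are not contiguous)')
plt.ylabel('y: test time (seconds)')
plt.title('Target variable vs. row index')
plt.show()
```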

Plotting the PDF, CDF, and BoxPlot of the Class Variable:

PDF
CDF

From the PDF and CDF, we can see that:

  • Almost all data points have a target value below 140 seconds,
  • so the points whose target is above 140 can be considered outliers.
BoxPlot
  • The box plot of the target variable neatly shows the distribution of the data through the five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”); values beyond the upper whisker can safely be considered outliers.
  • These outlier data points are dropped. A sketch of the three plots follows.
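
A sketch of the three plots (seaborn/matplotlib, on the target series y from above):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# PDF: kernel density estimate of the target.
sns.kdeplot(y)
plt.title('PDF of the target variable')
plt.show()

# CDF: empirical cumulative distribution of the target.
y_sorted = np.sort(y)
plt.plot(y_sorted, np.arange(1, len(y_sorted) + 1) / len(y_sorted))
plt.xlabel('y: test time (seconds)')
plt.ylabel('cumulative probability')
plt.title('CDF of the target variable')
plt.show()

# Box plot: five-number summary with outliers beyond the whiskers.
sns.boxplot(x=y)
plt.title('Box plot of the target variable')
plt.show()
```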

( e ). Dropping columns with a single unique value:

  • Features with a single unique value don’t contribute any information; they only increase the number of dimensions, so we drop them (a detection sketch is given below).
  • [‘X11’, ‘X93’, ‘X107’, ‘X233’, ‘X235’, ‘X268’, ‘X289’, ‘X290’, ‘X293’, ‘X297’, ‘X330’, ‘X347’] contain only zeros, hence they are dropped.
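
A sketch of how such constant columns can be detected and dropped:

```python
# A column with a single unique value (here: all zeros) carries no information.
constant_cols = [c for c in train_enc.columns if train_enc[c].nunique() == 1]
print(constant_cols)

train_enc = train_enc.drop(columns=constant_cols)
test_enc = test_enc.drop(columns=constant_cols)
```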

( f ). Detecting Multicollinearity using Variance Inflation Factor(VIF):

  • Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.[7][8]
  • This means that one independent variable can be predicted from another independent variable in a regression model.
  • This can be a problem in a regression model because we would not be able to distinguish between the individual effects of the independent variables on the dependent variable.
  • Multicollinearity may not affect the accuracy of the model much, but we lose reliability in determining the effects of individual features, and that is a problem for interpretability.
  • Unlike a simple pairwise correlation check, this is a multivariate analysis: each feature is regressed against all the others.
  • VIF measures the strength of the correlation between the independent variables: a variable is regressed against every other variable, so the VIF score of an independent variable represents how well that variable is explained by the remaining independent variables.
  • Rule of thumb for VIF: 1 = no multicollinearity, 4–5 = moderate, 10 or greater = severe.
  • Generally, a VIF value greater than 10 is considered severe, whereas in our dataset we even have features with an infinite VIF and many with three-digit values.
  • We drop all the features that have an infinite VIF score, excluding the top_20_features (the top_20_features are explained in step (h) below); a sketch of the computation follows.
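
A sketch of the VIF computation (statsmodels’ variance_inflation_factor; with a few hundred columns this loop is slow, so it is shown only in outline). It produces the reduced frame referred to as x_filtered later on; top_20_features is the feature set found in step (h):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = train_enc.astype(float)  # encoded features only; the target y is kept separately

# VIF of column i = 1 / (1 - R^2) from regressing column i on all other columns.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)

# Perfectly collinear features get an infinite VIF; drop them,
# but keep the top_20_features found in step (h).
inf_cols = vif[np.isinf(vif)].index
x_filtered = X.drop(columns=inf_cols.difference(top_20_features))
```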

( g ). Finding the optimal value of ‘k’ in Truncated SVD: the Gavish-Donoho method[3][4][9]

Why the Gavish-Donoho method? What does it explain? What are the conclusions proven in the paper?

  • Truncated SVD is a matrix factorization technique that factors a matrix W into three matrices U, S, and Vᵀ. Typically it is used to find the principal components of a matrix.
  • Truncated SVD differs from regular SVD: given an n×n matrix, SVD produces matrices with n columns, whereas Truncated SVD produces matrices with a specified number of columns.
  • We truncate the SVD because we want the number of columns ‘k’ that retains the maximum information. For example, if we take the rank k = n, the largest it can be, then all the information is preserved (noise included); accuracy may or may not improve, but the complexity of the model will be high. If we take the rank k very low, information may be lost and the model may be less accurate, though it will also be less complex.
  • So we need to find the sweet spot, the optimal ‘k’, where we capture most of the information in W without overfitting to noise or to small features we don’t care about.
  • This can be done in many ways by analyzing the singular values and looking for an elbow or knee, but those heuristics don’t work unless there is a sharp drop-off in the singular values. Hence the Gavish-Donoho method is the best way to find the optimal rank ‘k’, given some assumptions on the data.
  • Our data X can be written as the sum of the true low-rank signal X_true and a noise term, X = X_true + γ·X_noise, where the entries of X_noise are assumed to be normally distributed with zero mean and unit variance (Gaussian noise); the noise contribution can be large or small depending on the magnitude of γ (gamma).
  • In the figure, the orange curve corresponds to the Gaussian noise matrix, and the green curve corresponds to our actual high-dimensional data.
  • Gavish and Donoho observed that when the singular values of high-dimensional data are plotted, the curve (the green one) follows the curve of the singular values of the best-fit Gaussian noise matrix, and at some point it deviates from it, as shown in the figure above; that level is called the noise floor.
  • This noise floor separates the signal from the noise.
  • The first singular value that is larger than the largest singular value of the noise matrix defines the threshold; singular values below it are truncated.
  • The application of this method is explained below for the two possible cases.

Case 1: X is a square matrix and gamma is known.

  • Truncate all the singular values below the threshold τ (tau); the formula is given below.
  • Here n is the dimension of the square matrix X and γ (gamma) is the known amount of noise.
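
For this case the paper’s hard threshold is a constant multiple of the noise level:

$$\tau = \frac{4}{\sqrt{3}}\,\sqrt{n}\,\gamma$$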

Case 2: X is a rectangular matrix and gamma is unknown.

  • In this case, all we have are the measured singular values.
  • Based on the median singular value and the aspect ratio of the rectangular matrix, we can infer the best-fit noise distribution; the resulting rule is given below.
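
For the unknown-noise case the threshold is taken relative to the median singular value $y_{\mathrm{med}}$, rescaled by a factor that depends only on the aspect ratio $\beta = m/n$ (with $m \le n$); the paper gives the approximation

$$\tau = \omega(\beta)\, y_{\mathrm{med}}, \qquad \omega(\beta) \approx 0.56\,\beta^{3} - 0.95\,\beta^{2} + 1.82\,\beta + 1.43$$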

Conclusion:

  • We have data with structure plus noise; even if we don’t know how much noise was added, we can estimate it from the median singular value, infer the optimal threshold τ (tau), and truncate the singular values below τ to obtain the optimal rank ‘k’.

Code:

  • After applying all the preprocessing steps above, the data is stored in a pandas data frame named x_filtered.
  • From the code snippet we get the singular values of our data matrix and of a Gaussian noise matrix of the same shape.
  • On plotting, we get the curves above together with the horizontal line at y = τ.
  • Hence, we decided to take k = 2 as the number of components to keep; a sketch of the computation follows.
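
A minimal sketch of that computation, using the median-based rule from Case 2 with the ω(β) approximation above (x_filtered is the preprocessed frame mentioned earlier):

```python
import numpy as np

X = x_filtered.values.astype(float)
m, n = sorted(X.shape)           # m <= n
beta = m / n                     # aspect ratio

# Singular values of the data and of a same-shaped Gaussian noise matrix (for the plot).
sv_data = np.linalg.svd(X, compute_uv=False)
sv_noise = np.linalg.svd(np.random.randn(*X.shape), compute_uv=False)

# Gavish-Donoho hard threshold for an unknown noise level.
omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
tau = omega * np.median(sv_data)

k = int((sv_data > tau).sum())   # number of singular values above the threshold
print(k, tau)                    # here k came out to 2
```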

( h ). Finding the top_20_features:

  • Here we find the important features using Recursive Feature Elimination.
  • sklearn’s RFECV automatically selects an optimal number of important features, while RFE returns the top n features that we ask for.

output: Index([‘X314’], dtype=’object’)

  • RFECV using RandomForestRegressor with the best parameters which are obtained by tuning it on the dataset.

output: Index([‘X29’, ‘X314’, ‘X315’], dtype=’object’)

  • RFECV using the default XGBRegressor

output: Index([‘X314’], dtype=’object’)

  • RFECV using DecisionTreeRegressor with the best max_depth, which is found by tuning the model on the dataset.
  • From the outputs of the three cells above, we learn that X314, X315, and X29 are the most important features, and that X314 is more important than X315 and X29.
  • Using Recursive Feature Elimination we then find the top 20 important features and perform bivariate analysis on them.

output: Index([‘ID’, ‘X29’, ‘X48’, ‘X54’, ‘X64’, ‘X76’, ‘X118’, ‘X119’, ‘X127’, ‘X136’, ‘X189’, ‘X232’, ‘X263’, ‘X279’, ‘X311’, ‘X314’, ‘X315’, ‘X1_aa’, ‘X6_g’, ‘X6_j’], dtype=’object’)

  • RFE using RandomForestRegressor to output the top_20_features.
  • This set of top_20_features is a superset of the important features obtained by RFECV; a sketch of the selection step follows.
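
A sketch of the selection step (sklearn’s RFECV and RFE; the estimator settings below are placeholders rather than the tuned values, and both calls are computationally heavy on this many columns):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, RFECV

# y: target series with the outlier rows dropped, as in step (d).
# RFECV chooses the number of features by cross-validation.
rfecv = RFECV(RandomForestRegressor(n_estimators=100, random_state=0), cv=3)
rfecv.fit(x_filtered, y)
print(x_filtered.columns[rfecv.support_])

# RFE returns exactly the number of features requested: the top_20_features.
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=0),
          n_features_to_select=20)
rfe.fit(x_filtered, y)
top_20_features = x_filtered.columns[rfe.support_]
print(top_20_features)
```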

( i ). Adding new features using Dimensionality reduction techniques:

TSVD:

  • As found with the Gavish-Donoho method, we use 2 components for Truncated SVD.

output: (4194, 2)

  • We also generate 2 features each from PCA and ICA, other decomposition techniques available in sklearn.decomposition, and see whether they are useful.

PCA:

output: (4194, 2)

ICA:

output: (4194, 2)

  • All the new features generated by the dimensionality reduction techniques are added to the data frames (a sketch follows the output below).

output: (4194, 127) (4209, 127)
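
A sketch of this step (sklearn’s TruncatedSVD, PCA, and FastICA, each with 2 components, fit on the filtered training features; test_filtered is an assumed name for the equivalently preprocessed test frame):

```python
from sklearn.decomposition import FastICA, PCA, TruncatedSVD

decomposers = {
    'tsvd': TruncatedSVD(n_components=2, random_state=0),
    'pca': PCA(n_components=2, random_state=0),
    'ica': FastICA(n_components=2, random_state=0),
}

new_train, new_test = {}, {}
for name, dec in decomposers.items():
    comps_train = dec.fit_transform(x_filtered)   # shape (4194, 2)
    comps_test = dec.transform(test_filtered)     # shape (4209, 2)
    for i in range(2):
        new_train[f'{name}_{i}'] = comps_train[:, i]
        new_test[f'{name}_{i}'] = comps_test[:, i]

# Append the 6 new columns to both frames.
for col in new_train:
    x_filtered[col] = new_train[col]
    test_filtered[col] = new_test[col]

print(x_filtered.shape, test_filtered.shape)
```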

( j ). Generating new features using two-way and three-way feature interactions of the top_20_features
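
The original code is on GitHub; below is a sketch of the idea, assuming (as the feature name ‘X314+X315’ in the next section suggests) that each interaction is the sum of the corresponding binary columns:

```python
from itertools import combinations

# Two-way interactions: sums of pairs of top features.
for a, b in combinations(top_20_features, 2):
    x_filtered[f'{a}+{b}'] = x_filtered[a] + x_filtered[b]
    test_filtered[f'{a}+{b}'] = test_filtered[a] + test_filtered[b]

# Three-way interactions: sums of triples of top features.
for a, b, c in combinations(top_20_features, 3):
    x_filtered[f'{a}+{b}+{c}'] = x_filtered[a] + x_filtered[b] + x_filtered[c]
    test_filtered[f'{a}+{b}+{c}'] = test_filtered[a] + test_filtered[b] + test_filtered[c]
```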

6. Modeling:

( a ). RandomForestRegressor.

  • Let’s perform hyperparameter tuning.
  • Initialize the parameter grid.
  • Fit a RandomizedSearchCV model.

output: {‘bootstrap’: True, ‘max_depth’: 70, ‘max_features’: ‘auto’, ‘min_samples_leaf’: 40, ‘min_samples_split’: 110, ‘n_estimators’: 500}

  • Print the best parameters.
  • Initialize a model with the best hyperparameters and fit it to the data set.
  • Plot bar charts of the relative importance of this model’s features in predicting the target.
  • As we can see, the feature ‘X314+X315’, generated by the two-way feature interaction, plays an important role in predicting the target.
  • For the other models below, the same procedure as for RandomForestRegressor is carried out; check out the code on my GitHub. A sketch of the tuning procedure follows.
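
A sketch of the tuning-and-fitting procedure (sklearn’s RandomizedSearchCV; the parameter grid below is an illustrative example rather than the exact one used, though it contains the best values printed above):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': [100, 300, 500, 800],
    'max_depth': [10, 30, 50, 70, None],
    'min_samples_split': [2, 10, 50, 110],
    'min_samples_leaf': [1, 10, 40],
    'max_features': ['auto', 'sqrt'],  # 'auto' was valid in older scikit-learn; use 1.0 in newer versions
    'bootstrap': [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=50, cv=3, scoring='r2', n_jobs=-1, random_state=0,
)
search.fit(x_filtered, y)
print(search.best_params_)

# Refit on the full training data with the best hyperparameters.
best_rf = RandomForestRegressor(**search.best_params_, random_state=0)
best_rf.fit(x_filtered, y)

# Relative feature importances for the bar plot ('X314+X315' comes out on top).
importances = sorted(zip(best_rf.feature_importances_, x_filtered.columns), reverse=True)
print(importances[:10])
```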

( b ). XGBoostRegressor.

  • The code is similar for all the models, hence it’s not presented in this blog; check out my GitHub repository for the code.
  • Even in this model, the ‘X314+X315’ feature has the highest relative importance compared to the other features, but its relative importance is lower than in RandomForestRegressor.

( c ). DecisionTreeRegressor.

  • Even in this model, the ‘X314+X315’ feature has the highest relative importance compared to other features.

The results of all the models are summarized in the table below; for the code, refer to the GitHub repository.

7. Comparison of all the models:

  • Out of all the models RandomForestRegressor got the highest public score.
Screenshot from kaggle submission

8. Future Work

  • New important features should be generated to improve the performance of the model.
  • Deep Learning Models should be applied and tuned to improve results.

9. References

  1. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  2. https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/data
  3. https://arxiv.org/pdf/1305.5870.pdf
  4. https://ieeexplore.ieee.org/document/6846297
  5. https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
  6. https://web.stanford.edu/class/stats202/content/lec25-cond.pdf
  7. https://www.sigmamagic.com/blogs/what-is-variance-inflation-factor/
  8. https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
  9. http://www.pyrunner.com/weblog/2016/08/01/optimal-svht/

10. GitHub and LinkedIn
