Mercedes-Benz Greener Manufacturing: Kaggle Competition

Sumeet Sawant · Published in Analytics Vidhya · 4 min read · Jul 30, 2020

As part of my continuing data analysis learning journey, I thought of trying out past Kaggle competitions in order to test my skills and knowledge so far. While going through the datasets, I came across the Mercedes-Benz Greener Manufacturing competition, conducted on Kaggle in 2017.

Coming from the automotive domain, I thought this could be a good dataset to apply my data analysis skills to. On reading the competition description, I could relate to the problem even more closely. The competition asks: given a set of anonymized categorical and binary variables, can you predict the time a car will take to complete its testing?

As an engineer from this domain, I can completely see the importance of such a model. I know how time-consuming vehicle testing can be. The process consists of building a prototype car, instrumenting it, and then running the required tests. The major bottleneck in car testing occurs during the instrumentation phase, which requires disassembling the car, fitting the required recording instruments, and then reassembling the car.

Another bottleneck during testing is the availability of testing equipment, such as the drive cells required to run the tests.

All these factors result in wasted man-hours and increased development time in the vehicle development program, adding unplanned overhead cost for the company. Hence, a model that can predict how much time a car will take to complete its testing will help plan and manage cost and resources better.

Here is a link to the competition page. The evaluation metric for this competition is R-squared, a statistical measure that represents the proportion of the variance in a dependent variable that is explained by the independent variable(s) in a regression model. The winner of this competition three years back achieved an R-squared value of 0.555. My aim was to get as close as possible to that value within a week.
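As a quick illustration (with made-up numbers, not competition data), R-squared can be computed by hand or with scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical ground-truth test times (seconds) and model predictions
y_true = np.array([88.5, 101.2, 96.0, 110.3, 92.7])
y_pred = np.array([90.1, 99.8, 95.5, 108.0, 94.2])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)       # manual computation
print(r2_score(y_true, y_pred))  # same result via scikit-learn
```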

Description of the Dataset

As with most Kaggle competitions, this dataset has its column names anonymized, so there is not much to discuss about them. The shape (rows, columns) of the dataset was (4209, 377). There were 8 categorical columns and 1 dependent variable; the rest were binary columns with just 1 or 0 as values. Even the 8 categorical columns had a lot of unique values, so one-hot encoding the categorical variables was going to be tough with so little data provided for training.

Distribution of the dependent variable (Time required for testing)

There was one outlier value, which was dropped from the dataset.

As the evaluation metric was R-squared, I started with linear regression models, namely Ridge, Lasso, and Bayesian regression from the scikit-learn library, on just the binary features of the data. I also engineered some features from the binary variables, shown below. The dependent variable values were normalized using a min-max scaler, and care was taken to avoid data leakage between the train and validation sets by splitting the data first and only then applying the scaler.

Split and scale to avoid leakage
Features engineered row-wise from the binary variables
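Since the original code screenshots don't carry over here, below is a minimal sketch of both steps; df, y, and binary_cols are illustrative stand-ins, not the exact names from my notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Illustrative stand-ins for the competition data
binary_cols = ["X10", "X11", "X12"]
df = pd.DataFrame({"X10": [0, 1, 1, 0], "X11": [1, 1, 0, 0], "X12": [0, 0, 1, 1]})
y = pd.Series([88.5, 101.2, 96.0, 110.3])

# Simple row-wise features engineered from the binary columns
df["bin_sum"] = df[binary_cols].sum(axis=1)    # how many flags are set
df["bin_mean"] = df[binary_cols].mean(axis=1)  # fraction of flags set

# Split FIRST, then fit the scaler on the training target only,
# so no validation information leaks into the scaling parameters
X_train, X_val, y_train, y_val = train_test_split(df, y, test_size=0.25, random_state=42)

scaler = MinMaxScaler()
y_train_scaled = scaler.fit_transform(y_train.to_frame())
y_val_scaled = scaler.transform(y_val.to_frame())  # transform only, no refit
```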

To my surprise, they performed really well on the held-out validation set, giving an R-squared value of 0.59 (by well, I mean as compared to the competition winner). Using an average ensemble of the three models, I was able to get public and private LB scores of 0.535 and 0.531 respectively. Not a bad start.

Ensembling means combining the outputs of different models to get one prediction. The idea lies in using the wisdom of the crowd: by averaging across different models, we can improve our evaluation metric score.
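Continuing from the sketch above, the average ensemble is just the mean of the individual predictions (the alpha values here are placeholders, not my tuned ones):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, Lasso, Ridge

# Fit each linear model on the binary/engineered features
models = [Ridge(alpha=1.0), Lasso(alpha=0.001), BayesianRidge()]
for m in models:
    m.fit(X_train, y_train_scaled.ravel())

# Average ensemble: the final prediction is the mean of the three outputs
val_pred = np.mean([m.predict(X_val) for m in models], axis=0)
```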

After this, to increase my score further, it was necessary to include the categorical columns. I used label encoding to convert the string values to unique integers so they could be fed to my models. For this stage I chose Random Forest and XGBoost. The XGBoost model was tuned using random search with 5-fold cross-validation, which also gave me an R-squared value of ~0.60.
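Roughly, the encoding and tuning step looked like this (cat_cols and the parameter grid are illustrative, not my exact search space):

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

# Label-encode each categorical column in place
# (cat_cols is an illustrative name for the 8 categorical columns)
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# Random search over a small XGBoost grid, 5-fold CV, scored on R-squared
param_dist = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.9, 1.0],
}
search = RandomizedSearchCV(
    XGBRegressor(objective="reg:squarederror"),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring="r2",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```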

For my final submission, I tried many different combinations of models, such as ensembling and stacking models on just the binary features using Random Forest, Ridge, Lasso, and Bayesian regression (see the sketch below).

Ensembles of models using just the binary features, just the categorical features, and the entire dataset
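As one way to express the stacking idea, scikit-learn's StackingRegressor combines the base models through a learned meta-model (hyperparameters and variable names here are illustrative, not my exact setup):

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression, Ridge

# Stack the four binary-feature models; a linear meta-model learns how to
# combine their out-of-fold predictions
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("lasso", Lasso()),
        ("bayes", BayesianRidge()),
        ("rf", RandomForestRegressor(n_estimators=300, random_state=42)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train_binary, y_train)  # X_train_binary: just the binary features
```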

My best submission, with an R-squared of 0.54440, consisted of a weighted ensemble of:

0.4 * XGBoost (on the entire dataset) + 0.6 * stacking of Ridge, Lasso, Bayesian regression, and Random Forest (on just the binary features)
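In code, the blend is just a weighted sum of the two predictions (xgb_model, X_test, and X_test_binary are illustrative names):

```python
# Final submission: weighted blend of the two predictors (weights from above)
final_pred = 0.4 * xgb_model.predict(X_test) + 0.6 * stack.predict(X_test_binary)
```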

Please feel free to check out the various models I tried in my Kaggle notebook or on my GitHub profile.

I hope to keep this spree going, as I feel the only way to improve data analysis and model-building skills is by working hands-on.
