When you participate in a competition, you learn a lot, and most of that learning comes from seeing how others tackle the same problem as you and arrive at different, often better, solutions.
So here I present the best part of the competition.
My purpose in writing this article is not to provide the solution to the problem but to show how differently people approach it. There may be some errors in the code, because my aim was to teach the different parts of a machine learning hackathon, how different code can do the same work, and which feature engineering is common to most solutions and which is specific to individual users.
In this article I have used code snippets from the following users:
The Workflow of a Machine Learning Competition
- Importing the data
- Feature Engineering
- Outlier Treatment
- Missing Values Treatment
- Scaling the Data
- Splitting the Data
- Training the Model on Training data
- Predicting on the testing data and submitting the result.
Solutions worth reading:
- Ziron’s Team: It is the highest-ranked solution I could find, so there is something in it. It is a pleasure to read a topper’s solution.
- Sandeeppat: A very detailed, neat and clean kernel. He discusses his feature engineering in depth.
Code: Code File
Importing the Data: The best part about hackathons is that you don’t have to collect the data. Compare that with personal projects, where collecting the data takes a lot of time and effort. Data gathering can be fun, but not always. Here you only have to unzip the training data and import it into a pandas DataFrame.
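As a minimal sketch of that step, the snippet below reads a CSV straight out of a zip archive. The archive is built in memory so the example runs on its own; the file name `train.csv` and its columns are assumptions, not the competition's real files.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory archive standing in for the competition download.
# The file name and columns are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "id,amount,target\n1,50000,0\n2,42000,1\n")
buffer.seek(0)

# Open the archive and load the CSV into a pandas DataFrame.
with zipfile.ZipFile(buffer) as archive:
    with archive.open("train.csv") as handle:
        train = pd.read_csv(handle)

print(train.head())
```

With a real download you would pass the path of the zip file to `zipfile.ZipFile` instead of the in-memory buffer.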
Feature Engineering: The difference among the top 20 ranks is largely luck, but the difference between the top 20 and the top 100/200 is feature engineering. There is no single sure-shot trick for feature engineering; people learn it over time. Some people keep functions for the features they commonly use, while many build features on the fly depending on the competition. Both methods are equally good: one gives speed, the other gives customizability. Here I will show some features that almost everyone created, each in a different way, and some features that were unique to a user.
1. Age
In every solution I read, this was the most common feature.
Age was calculated as the difference between the Disbursed Date column and the Date of Birth column.
Most people noticed the glitch in the Date of Birth column, did you?
The glitch was that pandas read a dob such as “01–01–64” as “01–01–2064” when it should have been read as “01–01–1964”. This happens because two-digit years up to 68 are interpreted as 20xx, while 69 and above become 19xx. I will list the different approaches people used to tackle this glitch:
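One possible fix, sketched here under assumptions, is to push any date of birth that falls after the disbursal date back by 100 years. The column names `date_of_birth` and `disbursal_date` and the `%d-%m-%y` format are hypothetical placeholders, and this is not any particular participant's code.

```python
import pandas as pd

# Hypothetical column names and date format.
df = pd.DataFrame({
    "date_of_birth": ["01-01-64", "15-08-92"],
    "disbursal_date": ["03-08-18", "26-09-18"],
})

dob = pd.to_datetime(df["date_of_birth"], format="%d-%m-%y")
disbursed = pd.to_datetime(df["disbursal_date"], format="%d-%m-%y")

# Nobody is born after their loan is disbursed, so such dates
# were parsed into the wrong century: shift them back 100 years.
dob = dob.where(dob <= disbursed, dob - pd.DateOffset(years=100))

# Approximate age in whole years at disbursal.
df["age"] = (disbursed - dob).dt.days // 365
print(df)
```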
2. Disbursal Date and Date of Birth
Both are date-type columns, but they can’t be used in their raw form, so the most common method is to extract the day, month and year. Other features can also be extracted, such as whether the date falls at the start, middle or end of the month, the day of the week, the quarter of the year, the week number, and the absolute number of days since 1 Jan 1970.
Method 1 (Raw)
Method 2 (Using function)
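A minimal sketch of a function-based approach could look like the following. The helper `add_date_features`, the column name `disbursal_date` and the `%d-%m-%y` format are assumptions for illustration.

```python
import pandas as pd


def add_date_features(df, col, fmt="%d-%m-%y"):
    """Return a copy of df with calendar features derived from df[col]."""
    dates = pd.to_datetime(df[col], format=fmt)
    out = df.copy()
    out[f"{col}_day"] = dates.dt.day
    out[f"{col}_month"] = dates.dt.month
    out[f"{col}_year"] = dates.dt.year
    out[f"{col}_dayofweek"] = dates.dt.dayofweek  # Monday = 0
    out[f"{col}_quarter"] = dates.dt.quarter
    out[f"{col}_week"] = dates.dt.isocalendar().week.astype(int)
    out[f"{col}_is_month_start"] = dates.dt.is_month_start.astype(int)
    out[f"{col}_is_month_end"] = dates.dt.is_month_end.astype(int)
    # Absolute date: days elapsed since 1 Jan 1970.
    out[f"{col}_days_since_epoch"] = (dates - pd.Timestamp("1970-01-01")).dt.days
    return out


df = pd.DataFrame({"disbursal_date": ["03-08-18", "31-10-18"]})
features = add_date_features(df, "disbursal_date")
print(features.T)
```

The same function can be reused for the date of birth column, which is the main appeal of wrapping the logic in a function.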
3. Extracting Month and Year from “Average Account age” and “Credit History Length”
Most people extracted “Months” and “Years” from these columns, and some also calculated “Total Time in Months”. It was good to see that everyone did it in their own fashion: some used regular expressions, some used split, and some created their own helpers, so you are free to use whichever method you find easy.
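As a hedged sketch of the regular-expression route, the snippet below assumes the values look like `1yrs 11mon`; the exact string format and the helper name `tenure_to_months` are assumptions, so adjust the pattern to your data.

```python
import pandas as pd


def tenure_to_months(values):
    """Split strings like '1yrs 11mon' into years, months and total months."""
    parts = values.str.extract(r"(?P<years>\d+)\s*yrs\s+(?P<months>\d+)\s*mon")
    parts = parts.astype(int)
    parts["total_months"] = parts["years"] * 12 + parts["months"]
    return parts


ages = pd.Series(["1yrs 11mon", "0yrs 0mon", "12yrs 3mon"])
result = tenure_to_months(ages)
print(result)
```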
4. Perform CNS Score Description (Bureau Report)
Without a doubt this is an important feature. People have used different strategies to decrease the number of categories in it. I will try to list them; you can try all the combinations and keep whichever gives the best result:
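One generic strategy, sketched below, is to merge categories that occur rarely into a single bucket. The category labels, the 5% threshold and the helper `collapse_rare` are all hypothetical and are not taken from any of the kernels.

```python
import pandas as pd


def collapse_rare(values, min_share=0.05, other="Other"):
    """Replace categories whose relative frequency is below min_share."""
    share = values.value_counts(normalize=True)
    rare = share[share < min_share].index
    return values.where(~values.isin(rare), other)


# Hypothetical labels standing in for the bureau score descriptions.
desc = pd.Series(
    ["Low Risk"] * 50 + ["High Risk"] * 45
    + ["Not Scored: A"] * 3 + ["Not Scored: B"] * 2
)
collapsed = collapse_rare(desc)
print(collapsed.value_counts())
```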
5. Perform CNS Score
Many people kept the values of this column as they are, but some binned the values using “pd.cut” or “pd.qcut”, and some used their own definitions to bin the variable.
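To make the difference concrete, here is a small sketch of both functions. The score values, bin edges and labels are made up for illustration; `pd.cut` uses edges you choose, while `pd.qcut` picks edges so that each bin holds roughly the same number of rows.

```python
import pandas as pd

# Hypothetical scores; the bin edges and labels are assumptions.
scores = pd.Series([0, 15, 300, 520, 610, 705, 760, 890])

# Fixed-width style binning with hand-picked edges (right edge inclusive).
fixed = pd.cut(
    scores,
    bins=[-1, 0, 300, 600, 750, 900],
    labels=["no_score", "very_low", "low", "medium", "high"],
)

# Quantile binning: four bins with an equal share of rows each.
quantile = pd.qcut(scores, q=4, labels=False, duplicates="drop")

print(pd.DataFrame({"score": scores, "fixed": fixed, "quantile": quantile}))
```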
6. Features Based on “Primary Accts”, “Secondary Accts”, “ltv”, “disbursed amount”.
Features based on combinations of Primary Accounts, Secondary Accounts and Disbursed Amount were used by the majority of people. I have tried to gather some of the most common ones; you can try them and choose whichever works for you.
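A few combinations of this kind can be sketched as below. Every column name here is a hypothetical placeholder, and the last feature assumes `ltv` is the loan-to-value ratio expressed as a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and values.
df = pd.DataFrame({
    "pri_no_of_accts": [2, 0, 5],
    "sec_no_of_accts": [1, 0, 0],
    "pri_active_accts": [1, 0, 3],
    "sec_active_accts": [1, 0, 0],
    "disbursed_amount": [50000, 42000, 61000],
    "ltv": [80.0, 70.0, 85.0],
})

# Totals across primary and secondary accounts.
df["total_accts"] = df["pri_no_of_accts"] + df["sec_no_of_accts"]
df["total_active_accts"] = df["pri_active_accts"] + df["sec_active_accts"]

# Share of accounts that are active; customers with no accounts get 0.
df["active_share"] = (
    df["total_active_accts"] / df["total_accts"].replace(0, np.nan)
).fillna(0.0)

# Asset value implied by the loan amount and the loan-to-value percentage.
df["implied_asset_value"] = df["disbursed_amount"] / (df["ltv"] / 100)

print(df)
```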
7. Some Unique Features from all the kernels:
7.1 Feature Based on Anomalous Branch:
Anomalous Branch: keeps track of branches where certain loans were sanctioned and the purchase was then made at a showroom far from that branch, possibly even in a different city or state. This is tracked by looking at the showrooms where purchases usually take place when a loan is sanctioned from a given branch. Anomalies detected in this list are flagged by this feature.
7.2 Outstanding Balance accounts
The idea is that the more accounts with an outstanding balance a customer has, the less reliable that customer is expected to be.
7.3 Prime Def, Secondary Def, Total Def and Def in Last Six Months
Default in Last Six Months
8. Features created by Groupby Function:
These types of features are widely used in hackathons and they give good results, but at a cost: we don’t know which features are good or bad until we train a model and inspect the feature importance, so we must create many features and then do feature selection on them. Here I have used features created by Ziron’s Team.
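As a generic illustration of the idea (not Ziron’s Team’s actual features), the sketch below computes per-group statistics and attaches them back to each row with `transform`. The column names `branch_id` and `disbursed_amount` are assumptions.

```python
import pandas as pd

# Hypothetical columns: a grouping key and a numeric value.
df = pd.DataFrame({
    "branch_id": [1, 1, 2, 2, 2],
    "disbursed_amount": [100.0, 300.0, 50.0, 150.0, 100.0],
})

grouped = df.groupby("branch_id")["disbursed_amount"]

# transform() returns one value per original row, aligned on the index.
df["branch_amount_mean"] = grouped.transform("mean")
df["branch_amount_std"] = grouped.transform("std")
df["branch_loan_count"] = grouped.transform("count")

# How large is this loan relative to what the branch usually disburses?
df["amount_vs_branch_mean"] = df["disbursed_amount"] / df["branch_amount_mean"]

print(df)
```

Repeating this over several keys and several aggregations quickly produces dozens of columns, which is why feature selection afterwards matters.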
9. Outlier in Disbursed Amount
People often forget that accuracy can be hurt by just one or two outlier points; we then overfit our models to reduce that error, which makes them less robust on the testing data. Removing outliers is a necessity, but I saw only one person do it. Even I forgot to do it, so hats off to that person:
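One common way to remove such points, shown here as a sketch and not as that participant's code, is the interquartile range (IQR) rule. The column name, the sample amounts and the multiplier 1.5 are assumptions.

```python
import pandas as pd


def iqr_inlier_mask(values, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)


# Hypothetical amounts with one obvious outlier.
df = pd.DataFrame(
    {"disbursed_amount": [40000, 45000, 50000, 52000, 55000, 60000, 990000]}
)
mask = iqr_inlier_mask(df["disbursed_amount"])
cleaned = df[mask]
print(cleaned)
```

Rows should only be dropped from the training data; the test rows all need a prediction.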
10. Missing Values in Employment Type
Only one column, “Employment Type”, has missing values. Almost everybody created a new category for the missing values, because there were too many of them to fill in reliably.
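In pandas this is a one-liner. In the sketch below the column name `employment_type`, the sample values and the label `Missing` are assumptions.

```python
import pandas as pd

# Hypothetical column name and values.
df = pd.DataFrame(
    {"employment_type": ["Salaried", None, "Self employed", None]}
)

# Check how much is missing before deciding how to treat it.
missing_share = df["employment_type"].isna().mean()

# Treat "missing" as its own category instead of guessing a value.
df["employment_type"] = df["employment_type"].fillna("Missing")

print(missing_share)
print(df["employment_type"].value_counts())
```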
11. Scaling the Data
Scaling is done because features can vary widely in magnitude, and distance-based algorithms such as k-nearest neighbours use the Euclidean distance between data points in their computations, so all features need to be brought to the same level of magnitude. Tree-based models such as LightGBM and CatBoost are largely insensitive to feature scale, which may be why only a few kernels implemented scaling. Some commonly used scaling techniques are Standard Scaler, MinMax Scaler and Robust Scaler. To my surprise, one of the kernels used a Quantile Transform to scale the data, which was new for me too.
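For comparison, here is a small sketch of a standard scaler next to a quantile transform using scikit-learn. The synthetic skewed data and the parameter choices (`n_quantiles=100`, uniform output) are assumptions, not the settings from that kernel.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Synthetic, heavily skewed data standing in for amount-like features.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=10, sigma=1, size=(500, 2))

# Standard scaling: zero mean and unit variance, the skew remains.
standard = StandardScaler().fit_transform(X)

# Quantile transform: maps each feature onto a uniform [0, 1] range
# based on its ranks, which also squashes extreme values.
quantile = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
).fit_transform(X)

print(standard.mean(axis=0), standard.std(axis=0))
print(quantile.min(axis=0), quantile.max(axis=0))
```

In a real pipeline the scaler should be fitted on the training data only and then applied to the test data.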
12. Sampling
Sampling is done to tackle the problem of imbalanced classes. This is a scenario where the number of observations belonging to one class is significantly lower than those belonging to the other classes. In this situation, the predictive model developed using conventional machine learning algorithms could be biased and inaccurate.
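The simplest form of sampling is random oversampling of the minority class, sketched below with plain pandas. The helper `oversample_minority` and the column names are hypothetical; other techniques, such as undersampling the majority class, follow the same pattern.

```python
import pandas as pd


def oversample_minority(df, target, random_state=0):
    """Resample every class with replacement up to the largest class size."""
    counts = df[target].value_counts()
    largest = counts.max()
    parts = []
    for label, n in counts.items():
        part = df[df[target] == label]
        if n < largest:
            part = part.sample(n=largest, replace=True, random_state=random_state)
        parts.append(part)
    # Shuffle so the classes are not stored in blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Hypothetical imbalanced training data: 90 negatives, 10 positives.
train = pd.DataFrame({"feature": range(100), "target": [0] * 90 + [1] * 10})
balanced = oversample_minority(train, "target")
print(balanced["target"].value_counts())
```

Resampling should be applied to the training split only, otherwise duplicated rows leak into the validation data.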
13. Splitting the Data
The data is usually split into training and testing sets in a ratio of 75:25 or 80:20, depending on choice, so that we can train our model on one set and test it on the other.
Mostly the splitting is stratified on the class labels, which is achieved using the stratify argument of train_test_split:
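A minimal sketch with scikit-learn, using synthetic data with an 80:20 class balance (the arrays and the `random_state` are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```

Both splits keep 20% positives, which matters when the positive class is rare.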
Stratified Splitting: Instead of stratifying on the class labels, stratifying on the similarity between the train and test sets may give a better model, since the model then gets an idea of how similar or dissimilar the data points in the two sets are.
14. Training the model:
The most commonly used models in the kernels were CatBoost and LightGBM, trained with k-fold cross-validation. A model is considered good when both its training and testing scores are good and the difference between them is not significant. Here I have shown the training method used by Ziron’s team, but you can train your model according to your own preferences.
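The general k-fold pattern can be sketched as follows. This is not Ziron’s team’s code: to keep the example dependency-light, scikit-learn’s `GradientBoostingClassifier` is swapped in for LightGBM/CatBoost, and the data, fold count and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data; the last 50 rows play the unlabeled test set.
X_all, y_all = make_classification(
    n_samples=650, n_features=10, weights=[0.8, 0.2], random_state=0
)
X, y, X_test = X_all[:600], y_all[:600], X_all[600:]

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros(len(y))             # out-of-fold predictions on training rows
test_pred = np.zeros(len(X_test))  # test predictions averaged over folds

for train_idx, valid_idx in folds.split(X, y):
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    test_pred += model.predict_proba(X_test)[:, 1] / folds.get_n_splits()

cv_auc = roc_auc_score(y, oof)
print(f"Out-of-fold AUC: {cv_auc:.3f}")
```

The out-of-fold score is computed on rows each model never saw, so comparing it with the training score gives a rough check for overfitting.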
15. Submission files
Submission files contain the model’s outputs on the testing data. The model is evaluated on them and given a score based on the evaluation metric, e.g. area under the ROC curve, RMSE (Root Mean Squared Error) or RMSLE (Root Mean Squared Logarithmic Error). Submission files are mostly Excel or CSV files.
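Writing a CSV submission with pandas can be sketched like this. The column names `id` and `target`, the values and the file name `submission.csv` are assumptions; every competition specifies its own format.

```python
import pandas as pd

# Hypothetical test identifiers and predicted probabilities.
test_ids = [101, 102, 103]
test_pred = [0.12, 0.87, 0.40]

submission = pd.DataFrame({"id": test_ids, "target": test_pred})

# index=False keeps the row index out of the file.
submission.to_csv("submission.csv", index=False)

print(pd.read_csv("submission.csv"))
```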