How to Enter your First Machine Learning Competition

Rebecca Vickery
Published in vickdata · 6 min read · Dec 17, 2018


Machine learning competitions can be a really useful way to learn how to validate and improve models. I found that the data science practice problems on Analytics Vidhya were a great starting point. These give you access to simple datasets on which to practice your machine learning skills and benchmark yourself against others. I think they offer a great introduction to approaching these problems before perhaps moving on to something a bit more challenging such as Kaggle competitions.

I initially found it challenging to know how to build a model and make a submission. I decided to write this post to help anyone else who may be starting out and experiencing the same problem.

The competition I am going to cover in this post is the Loan Prediction Practice Problem. In this competition you are asked to create a model to predict whether or not a person would be approved for a home loan, given a number of fields from an online application form. In the following post I am going to walk through, step by step, how to get started with building a first model for this task using Python in JupyterLab. I am not going to go into great detail about each step, but I want to show the end-to-end process of how to approach a classification problem.

Getting started

For this problem, you are given two CSV files, train.csv and test.csv (they actually have longer file names when downloaded, but the first step I took was to rename them to make handling them much easier). The datasets can be downloaded here. We are going to use the training file to train and test the model, and then use the model to predict the unknown values in the test file. The first steps are to read in the files and perform some preliminary analysis to determine what pre-processing we will need before attempting to train a model.

In the below code I am reading in the renamed CSV files using the pandas read_csv function. I have also dropped the Loan_ID column from the train set as I won’t need this for training the model.

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train = train.drop('Loan_ID', axis=1)

Once the files have been imported I run the following function to get some more information about the dataset I am going to be working with.

def describe_data(df):
    print("Data Types:")
    print(df.dtypes)
    print("Rows and Columns:")
    print(df.shape)
    print("Column Names:")
    print(df.columns)
    print("Null Values:")
    print(df.apply(lambda x: sum(x.isnull()) / len(df)))

describe_data(train)

This gives me the following information about the dataset:

  • We have a number of columns that are non-numeric, which will need some pre-processing before they can be used to build a model.
  • We have a relatively small dataset consisting of 614 rows.
  • There are no null values present in the data.

Pre-processing

Most machine learning models are unable to handle non-numeric columns or missing values in the data. It is therefore necessary to perform a number of pre-processing steps.

To make the process simpler I am going to be using a scikit-learn pipeline. Pipelines allow you to apply a series of transformation steps and then fit and predict with an estimator in just one line of code. They are highly reusable, cut down on the need to apply pre-processing steps separately to the train and test sets, and make workflows reproducible and easy to follow.

Before creating the pipeline I need to specify which columns are numerical features and which are categorical features. I am doing this using the pandas select_dtypes function.

numeric_features = train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = train.select_dtypes(include=['object']).drop(['Loan_Status'], axis=1).columns

The next step is to create two transformer pipelines, one for the numerical features and one for the categorical features. Each pipeline contains a SimpleImputer to handle null values. We don’t actually need this for our dataset as we don’t have any null values, but it is good practice to always include one; in a real-world example you do not want your predictions to fail because of a null value. For the numeric transformer I am simply imputing the nulls with the median value for the column, and for the categorical transformer I am imputing them with the value “missing”.

In the numeric transformer I have also included a scaler. This ensures that all the values in each feature are on the same scale. For the categorical transformer I am using OneHotEncoder to transform each unique value in the categorical columns into a new column containing a 0 or 1, depending on whether or not the value is present.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
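
This is not part of the competition workflow, but if you want to see what the categorical transformer actually does, here is a minimal sketch that runs it on a made-up column (the column name and values are only illustrative):

import numpy as np
import pandas as pd

# Toy column containing a missing value
toy = pd.DataFrame({'Property_Area': ['Urban', 'Rural', np.nan, 'Urban']})

# The imputer fills the NaN with the string "missing", then the encoder
# turns each unique value (Rural, Urban, missing) into its own 0/1 column
encoded = categorical_transformer.fit_transform(toy)
print(encoded.toarray())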

I then use the ColumnTransformer to concatenate both the numeric and categorical transformers into an object called preprocessor.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Training the model

Once you have your transformation pipeline you can combine this with a classifier to create the full machine learning pipeline. This can be used to both train (fit) and predict on new data.

Before doing this we need to split the train data into a training and test set so that we can get an idea as to how well our model is predicting on new data. To do this I first specify the features (X) and the target (y). I then use the train_test_split function and have chosen to use 20% of the data for the test set.

from sklearn.model_selection import train_test_split
X = train.drop('Loan_Status', axis=1)
y = train['Loan_Status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

As this is a simple walkthrough, and I am not showing the full process of obtaining the best-performing model, I am going to use a simple LogisticRegression model. The below code creates a logistic regression model that performs the defined transformations before fitting or predicting.

from sklearn.linear_model import LogisticRegression

lr = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', LogisticRegression(solver='lbfgs'))])

I then fit the model on the training set and obtain an accuracy score using the built-in score method. This returns a score of 0.82. A glance at the leaderboard tells us that this is not a bad initial score, but there is still some work to do, as the top competitors are currently scoring in excess of 0.90.

lr.fit(X_train, y_train)
print("model score: %.3f" % lr.score(X_test, y_test))

To finish I will show you how to make your first submission. We now need to use the test.csv file that we read in earlier. Firstly we are going to create a new data frame with the Loan_ID column dropped.

test_no_id = test.drop('Loan_ID', axis=1)

We then use our model to predict on this new data.

test_predictions = lr.predict(test_no_id)

Finally, we need to build the submission file which, as we know from looking at the sample submissions available on the website, needs to be a CSV file containing the Loan_IDs and their corresponding predictions. In the below code I obtain the Loan_IDs from the original test data frame and then build a new data frame by combining them with the test_predictions.

Loan_ID = test['Loan_ID']
submission_df_1 = pd.DataFrame({
    "Loan_ID": Loan_ID,
    "Loan_Status": test_predictions})

Finally, I export this as a CSV file using the pandas to_csv function. You can now upload the file as a submission.

submission_df_1.to_csv('submission_1.csv', index=False)

Once you have made a submission you will get a score on the website. In my case the accuracy score for the submission was much lower than the score on the test set held out from the training data. This suggests that I have an overfit model, which means that the model has learned patterns in the training data that are specific to that dataset and do not necessarily generalise to the wider data.
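
A more reliable way to estimate how the model will perform on unseen data, rather than relying on a single train/test split, is cross-validation. The short sketch below assumes the lr pipeline and the X and y defined earlier:

from sklearn.model_selection import cross_val_score

# Evaluate the full pipeline on five different train/validation splits
scores = cross_val_score(lr, X, y, cv=5, scoring='accuracy')
print("cross-validated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

Because the pre-processing lives inside the pipeline, it is re-fit on each training fold, so the cross-validated score is not inflated by information leaking from the validation folds.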

In this post I have described how to prepare data, train a simple model and make your first submission to a machine learning competition. In another blog post I will talk through the steps you can take to improve on this score and move further up the leaderboard.
