My First Kaggle Competition — Pump it Up

Audrey Akwenye
4 min read · Feb 8, 2019


This week presented me with my first opportunity to join a Kaggle Competition as a newbie data scientist. It was a private project for my DS1 Lambda University cohort.

Below are my three takeaways from the experience.

But first, a picture of my son and his grandmother at my husband’s homestead in Ombombo, Namibia. It is a very dry area, and the residents are completely reliant on a diesel-powered water pump. I know how important the reliability of that pump is for the health and wellbeing of the residents and their livestock, so I was extremely excited that my first data science challenge dealt with predicting non-functioning pumps in Tanzania.

Take Away One — Formatting Data

In the competition, I was given training features, training labels, and test features. My first thought was: where are the test labels? I’m used to computing accuracy scores by comparing predictions to the actual test labels. However, in a competition you aren’t given test labels, because those are basically the answer key. So how do you test accuracy? By splitting the training data into training and validation sets.

from sklearn.model_selection import train_test_split

# Hold out 30% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

Now you can compare y_val to your y_pred with accuracy_score.

After going through the process with your validation data, you can assume the accuracy score for your actual test data will be similar. You then pass X_test to your model instead of X_val. However, before you do, remember that whatever modifications or feature engineering you did to your training data, you also have to do to your test data. This is my second takeaway.
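
Put together, the whole validate-then-predict flow looks roughly like this. It is only a sketch: X, y, and X_test stand for my already feature-engineered data, and the classifier is just a placeholder.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X, y = feature-engineered training features and labels; X_test = competition test features
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression()  # placeholder model
model.fit(X_train, y_train)

# Validation accuracy stands in for the test accuracy we can't compute ourselves
print(accuracy_score(y_val, model.predict(X_val)))

# Once happy, predict on the real test features for submission
y_pred = model.predict(X_test)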

Take Away Two — Concat Train and Test Data

Initially, I was manually copying and pasting all the feature engineering from my train data and then editing the dataset name to ensure the test data had the same structure as my train data. I then found this code, which combines the two datasets so you can edit the features once and then split them back apart when done. This saves time and ensures the category encoding is the same.

import copy
import pandas as pd

# to combine
train_objs_num = len(train)
dataset = pd.concat(objs=[train, test], axis=0)

# to separate (after feature engineering on `dataset`)
train = copy.copy(dataset[:train_objs_num])
test = copy.copy(dataset[train_objs_num:])
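
For example, one-hot encoding the combined frame guarantees the two halves come back with identical columns. Here is a tiny, self-contained illustration with made-up rows (not from my actual notebook):

import copy
import pandas as pd

# Toy train/test frames with a categorical column
train = pd.DataFrame({'source': ['spring', 'river', 'dam'], 'gps_height': [1390, 0, 250]})
test = pd.DataFrame({'source': ['river', 'spring'], 'gps_height': [300, 1200]})

train_objs_num = len(train)
dataset = pd.concat(objs=[train, test], axis=0)

# Encode once on the combined data so both halves share the same dummy columns
dataset = pd.get_dummies(dataset, columns=['source'])

train = copy.copy(dataset[:train_objs_num])
test = copy.copy(dataset[train_objs_num:])
print(train.columns.equals(test.columns))  # True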

Take Away Three — All Features Are Created Equal, but Some Are More Equal Than Others

I created this heat map, which shows the correlation between the various features. It helps you visualize which features are highly connected to each other. Leaving highly correlated features in the dataset can make that information look more important than it actually is. Removing one of each correlated pair will give you more accurate models.

import seaborn as sns
import matplotlib.pyplot as plt

def plot_corr(df, size=30):
    # Plot the correlation matrix for every numeric column in df
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)

plot_corr(train_and_test)
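
From there, one way to act on the heat map is to automatically drop one feature out of every highly correlated pair. This is a sketch rather than exactly what I ran, and the 0.9 threshold is just an assumed cutoff:

import numpy as np

# Keep only the upper triangle of the absolute correlation matrix
corr = train_and_test.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop any column that is correlated above the threshold with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
train_and_test = train_and_test.drop(columns=to_drop)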

Handling Longitude and Latitude

I also converted longitude and latitude into latbin and lonbin columns, which group nearby coordinates into bins:

import numpy as np

location = train_features[['gps_height', 'longitude', 'latitude', 'subvillage', 'region',
                           'region_code', 'district_code', 'lga', 'ward']]

# Round each coordinate down to the nearest 0.2 degrees
step = 0.2
to_bin = lambda x: np.floor(x / step) * step
location["latbin"] = location.latitude.map(to_bin)
location["lonbin"] = location.longitude.map(to_bin)

Joining Two Columns

I joined the region and district codes into one number. So an entry with a region_code of 20 and a district_code of 19 got a reg_dist code of 20.19. This way I could handle the two columns as one feature.

# combining region and district codes
location = location[['region_code', 'district_code', 'latbin', 'lonbin', 'gps_height']]
location.head(20)

# Convert both codes to strings, join them with a dot, then cast back to a number
location['district_code'] = location['district_code'].astype(str)
location['region_code'] = location['region_code'].astype(str)
location['reg_dist'] = location[['region_code', 'district_code']].apply(lambda x: '.'.join(x), axis=1)
location['reg_dist'] = location['reg_dist'].astype(float)
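
A quick spot check (not from my original notebook) confirms the joined values look like region.district:

# e.g. region_code 20 and district_code 19 should become 20.19
print(location[['region_code', 'district_code', 'reg_dist']].head())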

My Model

My highest accuracy score came from an Extra Trees Classifier:

from sklearn.ensemble import ExtraTreesClassifier

ETC = ExtraTreesClassifier(n_estimators=1000,
                           min_samples_split=10)
ETC.fit(X_train, y_train)

Which looks like an easy three lines of code, but it took four days to get there! My accuracy score was around 0.78 to 0.79 with this model.
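
That accuracy came from the same validation comparison as in Take Away One, roughly like this (a sketch, assuming the earlier X_val/y_val split):

from sklearn.metrics import accuracy_score

# Compare the Extra Trees predictions on the validation split to the true labels
y_pred = ETC.predict(X_val)
print(accuracy_score(y_val, y_pred))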

My second-highest accuracy score came from a pipeline of a one-hot encoder, StandardScaler, and Logistic Regression:

import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='liblinear')
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_val)
acc_score = accuracy_score(y_val, y_pred)

My accuracy score with this model was between 0.74 and 0.76.

Downloading CSV

Lastly, this may be a no-brainer to a lot of data scientists, but it took me a whole day to figure out how to simply submit my predictions. Here is the code I used:

from google.colab import files

# Pair the test-set ids with the predicted labels and download the CSV from Colab
predictions = pd.DataFrame(data={'id': test_features['id'], 'status_group': y_pred})
predictions.to_csv('kaggle.csv', index=False)
files.download('kaggle.csv')
