Tanzania is home to the Serengeti plains.

Predictive Modeling for Tanzanian Water Pumps

A Kaggle Competition Walkthrough

Matt Kirby
Published in Analytics Vidhya · Jul 2, 2019

This post will serve as a textual and visual walkthrough of my predictive modeling process, from initial exploration to hyperparameter tuning. We will explore a dataset riddled with missing records and duplicate information, and then attempt to build an accurate model to predict whether a given water pump is functioning, in need of repairs, or non-functional.

Phase 1: Import Data and Familiarize

Most data for Kaggle competitions can be found in the form of a .csv file linked in the competition information. Pandas provides an easy way to import a .csv file as a DataFrame object: pd.read_csv().
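Here is a minimal sketch of that import step, assuming the training features and labels arrive as two CSV files joined on a shared id column (the file names below are placeholders):

import pandas as pd

# File names are placeholders; use the paths from the competition's data page.
features = pd.read_csv('train_features.csv')
labels = pd.read_csv('train_labels.csv')

# A simple join on the shared id column. The notebook's merge was done slightly
# differently, which is why the .info() output below shows key_0/id_x/id_y columns.
df = features.merge(labels, on='id')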

I usually find it helpful to examine a dataset’s individual qualities before diving head first into cleaning, wrangling, and feature engineering. Not every dataset is the same and different data will require different plans of attack. Pandas provides us .info() which prints a DataFrame's shape (59,400 instances, and 43 features), feature names, feature types, a count of non-null values for each feature, and the DataFrame's memory usage (19.9MB); it's a real workhorse.

df.info()
____________________________________________________________________
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 43 columns):
key_0 59400 non-null int64
id_x 59400 non-null int64
amount_tsh 59400 non-null float64
date_recorded 59400 non-null object
funder 55765 non-null object
gps_height 59400 non-null int64
installer 55745 non-null object
longitude 59400 non-null float64
latitude 59400 non-null float64
wpt_name 59400 non-null object
num_private 59400 non-null int64
basin 59400 non-null object
subvillage 59029 non-null object
region 59400 non-null object
region_code 59400 non-null int64
district_code 59400 non-null int64
lga 59400 non-null object
ward 59400 non-null object
population 59400 non-null int64
public_meeting 56066 non-null object
recorded_by 59400 non-null object
scheme_management 55523 non-null object
scheme_name 31234 non-null object
permit 56344 non-null object
construction_year 59400 non-null int64
extraction_type 59400 non-null object
extraction_type_group 59400 non-null object
extraction_type_class 59400 non-null object
management 59400 non-null object
management_group 59400 non-null object
payment 59400 non-null object
payment_type 59400 non-null object
water_quality 59400 non-null object
quality_group 59400 non-null object
quantity 59400 non-null object
quantity_group 59400 non-null object
source 59400 non-null object
source_type 59400 non-null object
source_class 59400 non-null object
waterpoint_type 59400 non-null object
waterpoint_type_group 59400 non-null object
id_y 59400 non-null int64
status_group 59400 non-null object
dtypes: float64(3), int64(9), object(31)
memory usage: 19.9+ MB

Labels

There are three possible values:

  • functional : the waterpoint is operational and there are no repairs needed
  • functional needs repair : the waterpoint is operational, but needs repairs
  • non functional : the waterpoint is not operational

Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset.

  • amount_tsh : Total static head (amount water available to waterpoint)
  • date_recorded : The date the row was entered
  • funder : Who funded the well
  • gps_height : Altitude of the well
  • installer : Organization that installed the well
  • longitude : GPS coordinate
  • latitude : GPS coordinate
  • wpt_name : Name of the waterpoint if there is one
  • num_private : Private use or not
  • basin : Geographic water basin
  • subvillage : Geographic location
  • region : Geographic location
  • region_code : Geographic location (coded)
  • district_code : Geographic location (coded)
  • lga : Geographic location
  • ward : Geographic location
  • population : Population around the well
  • public_meeting : True/False
  • recorded_by : Group entering this row of data
  • scheme_management : Who operates the waterpoint
  • scheme_name : Who operates the waterpoint
  • permit : If the waterpoint is permitted
  • construction_year : Year the waterpoint was constructed
  • extraction_type : The kind of extraction the waterpoint uses
  • extraction_type_group : The kind of extraction the waterpoint uses
  • extraction_type_class : The kind of extraction the waterpoint uses
  • management : How the waterpoint is managed
  • management_group : How the waterpoint is managed
  • payment : What the water costs
  • payment_type : What the water costs
  • water_quality : The quality of the water
  • quality_group : The quality of the water
  • quantity : The quantity of water
  • quantity_group : The quantity of water
  • source : The source of the water
  • source_type : The source of the water
  • source_class : The source of the water
  • waterpoint_type : The kind of waterpoint
  • waterpoint_type_group : The kind of waterpoint

Phase 2: Clean, Wrangle, and Engineer Features

Finding a good place to start when attacking a dataset for the first time is not trivial, as there are usually several obvious places to begin cleaning, feature engineering, and exploratory analysis. I’ll offer some reassuring words for the reader here: pick a place to start, trust your intuition, and just start coding. If you hit a brick wall, don’t be afraid to retreat, regroup, and attack elsewhere. It, and you, will be fine. We will start by cleaning the construction_year feature with df.replace().

# Replace the placeholder value 0 with 1993, an arbitrarily chosen year
features['construction_year'] = features['construction_year'].replace({0: 1993})

Feature Engineering

With this cleaned construction year feature we can engineer a pump age and an average pop served per year feature.

# Assumes date_recorded has already been reduced to a year,
# e.g. pd.to_datetime(features['date_recorded']).dt.year
features['age'] = features['date_recorded'] - features['construction_year']
features['pop/year'] = features['population'] / features['age']

We can take our feature engineering even further by creating a feature that represents the theoretical amount of water each person served by a wellhead has access to.

features['water/person'] = features['amount_tsh'] / features['population']

Encoding

Encoding is useful for quantifying categorical or otherwise qualitative data. There is an ever-expanding library of encoding methods focused on different types of data. We will use one-hot encoding for low-cardinality features and ordinal encoding for high-cardinality features. We also impute null values and apply standard scaling to clean the data before feeding it to our model.
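A minimal sketch of that preprocessing chain, using scikit-learn together with the category_encoders library; the column lists below are illustrative assumptions rather than the exact split used in the notebook, and every column not listed is assumed to already be numeric:

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

low_cardinality = ['basin', 'quantity', 'payment_type']    # assumed example columns
high_cardinality = ['funder', 'installer', 'ward']         # assumed example columns

# One-hot encode the low-cardinality columns, ordinal-encode the high-cardinality
# ones, then impute remaining nulls and scale everything to zero mean, unit variance.
preprocessing = make_pipeline(
    ce.OneHotEncoder(cols=low_cardinality, use_cat_names=True),
    ce.OrdinalEncoder(cols=high_cardinality),
    SimpleImputer(strategy='median'),
    StandardScaler()
)

encoded = preprocessing.fit_transform(features)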

Exploratory Visualization

A smart way to explore data is to create visualizations. This is a very freeform process, and again, every dataset is different, so there is no one-size-fits-all strategy. Because this part is code intensive, I will just present the visualizations; if you are interested in the code, there is a link to a Colab notebook at the end of this post.

Since the water pumps in our dataset are stationary devices it could help to visualize their geographic locations.
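One quick way to do that (a sketch with seaborn; the notebook's plot may be styled differently) is a scatter of longitude against latitude colored by pump status:

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter of pump coordinates, colored by operating status
sns.scatterplot(x='longitude', y='latitude', hue='status_group',
                data=df, s=10, alpha=0.5)
plt.title('Water pump status by location')
plt.show()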

There seems to be no clear relationship between the latitude and longitude of a pump and its status. However, there are many distinct clusters of pumps that appear to share the same status.

Phase 3: Build a Baseline Model and an Overfitting Model

We will build a baseline model to set the lower bar of our prediction capabilities. I particularly like using a majority class prediction as a baseline because it is easy to wrap my head around.

X_matrix, y_vector = df.drop(columns='status_group'), df['status_group']
majority_class = y_vector.mode()[0]
y_vector.value_counts(normalize=True)
____________________________________________________________________
functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

Since functional is the status of 54% of our instances we expect our majority class prediction to be ~54% accurate.

from sklearn.metrics import accuracy_score

majority_prediction = [majority_class] * len(y_vector)
accuracy_score(y_vector, majority_prediction)
____________________________________________________________________
0.543080808080808

Awesome, we now have a bottom line for our predictions going forward.

Let's create a model that overfits our data to ensure we can make worthwhile predictions. My usual choice for this task is a decision tree.

from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(max_depth=50)
decision_tree.fit(X_matrix, y_vector)
____________________________________________________________________
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=50,
    max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
    min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, presort=False, random_state=None,
    splitter='best')

We can see how our overfitted model performs by using a confusion matrix.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

def con_matrix_analysis(model):
    # Predict on the training matrix and report per-class precision and recall
    predictions = model.predict(X_matrix)
    print(classification_report(y_vector, predictions,
          target_names=['Functional', 'Needs Repair', 'Not-Functional']))

    # Label the raw confusion matrix so rows are actual and columns are predicted
    con_matrix = pd.DataFrame(confusion_matrix(y_vector, predictions),
        columns=['Predicted Functional', 'Predicted Needs Repair', 'Predicted Not-Functional'],
        index=['Actual Functional', 'Actual Needs Repair', 'Actual Not-Functional'])

    sns.heatmap(data=con_matrix, cmap='cool')
    plt.show()
    return con_matrix

con_matrix_analysis(decision_tree)

Phase 4: Build an Accurate Model and Tune its Hyperparameters

We will use a random forest classifier for our final model. A random forest is exactly what it sounds like: a forest of decision trees. We will also use randomized-search cross-validation to select the best model from an assortment of hyperparameter settings.
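Here is a minimal randomized-search sketch; the hyperparameter ranges and n_iter below are illustrative assumptions, not the exact search space from the notebook:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumed search space: sample the number of trees and tree depth at random
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(10, 40),
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
    scoring='accuracy',
    random_state=42
)
search.fit(X_matrix, y_vector)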

Based on our cross-validation the random forest classifier with the most accurate results is a model with 92 estimators and a max depth of 26. When we pass the selected model to our confusion matrix analysis function we achieve the results below.
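Assuming the search object from the sketch above, the selected model can be pulled out of the search and passed straight to the analysis function:

con_matrix_analysis(search.best_estimator_)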

Phase 5: Predict and Export

The final phase in our process is to use our model to predict the status of the pumps in our X_test dataframe, convert that dataframe to a .csv file, and export the file from our environment. Pandas again comes to the rescue with the DataFrame.to_csv() method for writing dataframes to .csv files.

prediction = model.predict(X_test)
submission['status_group'] = prediction
submission.to_csv('submission_1.csv', index=False)

from google.colab import files
files.download('submission_1.csv')

I hope I have presented an easy-to-follow and informative introduction to Kaggle competitions, machine learning, and data science tasks. For those interested, the Colab notebook I used for this project is linked below.

Follow me on Twitter, GitHub, and LinkedIn

Here is the link to the Colab Notebook I used for this post.
