Tanzania is home to the Serengeti plains.

Predictive Modeling for Tanzanian Water Pumps

A Kaggle Competition Walkthrough

Matt Kirby
Published in Analytics Vidhya · Jul 2, 2019

This post will serve as a textual and visual walkthrough of my predictive modeling process, from initial exploration to hyperparameter tuning. We will explore a dataset riddled with missing records and duplicate information, and then attempt to build an accurate model to predict whether a given water pump is functioning, in need of repairs, or non-functional.

Phase 1: Import Data and Familiarize

Most data for Kaggle competitions can be found in the form of a .csv file linked in the competition information. Pandas provides an easy way to import a .csv file as a DataFrame object: pd.read_csv().
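Here is a minimal sketch of that import step, assuming the training features and labels arrive as two CSV files joined on a shared id column (the file names below are placeholders):

import pandas as pd

# File names are placeholders; use the paths from the competition's data page.
features = pd.read_csv('train_features.csv')
labels = pd.read_csv('train_labels.csv')

# A simple join on the shared id column. The notebook's merge was done slightly
# differently, which is why the .info() output below shows key_0/id_x/id_y columns.
df = features.merge(labels, on='id')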

I usually find it helpful to examine a dataset’s individual qualities before diving head first into cleaning, wrangling, and feature engineering. Not every dataset is the same and different data will require different plans of attack. Pandas provides us .info() which prints a DataFrame's shape (59,400 instances, and 43 features), feature names, feature types, a count of non-null values for each feature, and the DataFrame's memory usage (19.9MB); it's a real workhorse.

df.info()
____________________________________________________________________
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 0 to 59399
Data columns (total 43 columns):
key_0 59400 non-null int64
id_x 59400 non-null int64
amount_tsh 59400 non-null float64
date_recorded 59400 non-null object
funder 55765 non-null object
gps_height 59400 non-null int64
installer 55745 non-null object
longitude 59400 non-null float64
latitude 59400 non-null float64
wpt_name 59400 non-null object
num_private 59400 non-null int64
basin 59400 non-null object
subvillage 59029 non-null object
region 59400 non-null object
region_code 59400 non-null int64
district_code 59400 non-null int64
lga 59400 non-null object
ward 59400 non-null object
population 59400 non-null int64
public_meeting 56066 non-null object
recorded_by 59400 non-null object
scheme_management 55523 non-null object
scheme_name 31234 non-null object
permit 56344 non-null object
construction_year 59400 non-null int64
extraction_type 59400 non-null object
extraction_type_group 59400 non-null object
extraction_type_class 59400 non-null object
management 59400 non-null object
management_group 59400 non-null object
payment 59400 non-null object
payment_type 59400 non-null object
water_quality 59400 non-null object
quality_group 59400 non-null object
quantity 59400 non-null object
quantity_group 59400 non-null object
source 59400 non-null object
source_type 59400 non-null object
source_class 59400 non-null object
waterpoint_type 59400 non-null object
waterpoint_type_group 59400 non-null object
id_y 59400 non-null int64
status_group 59400 non-null object
dtypes: float64(3), int64(9), object(31)
memory usage: 19.9+ MB

Labels

There are three possible values:

  • functional : the waterpoint is operational and there are no repairs needed
  • functional needs repair : the waterpoint is operational, but needs repairs
  • non functional : the waterpoint is not operational

Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset.

  • amount_tsh : Total static head (amount water available to waterpoint)
  • date_recorded : The date the row was entered
  • funder : Who funded the well
  • gps_height : Altitude of the well
  • installer : Organization that installed the well
  • longitude : GPS coordinate
  • latitude : GPS coordinate
  • wpt_name : Name of the waterpoint if there is one
  • num_private : Private use or not
  • basin : Geographic water basin
  • subvillage : Geographic location
  • region : Geographic location
  • region_code : Geographic location (coded)
  • district_code : Geographic location (coded)
  • lga : Geographic location
  • ward : Geographic location
  • population : Population around the well
  • public_meeting : True/False
  • recorded_by : Group entering this row of data
  • scheme_management : Who operates the waterpoint
  • scheme_name : Who operates the waterpoint
  • permit : If the waterpoint is permitted
  • construction_year : Year the waterpoint was constructed
  • extraction_type : The kind of extraction the waterpoint uses
  • extraction_type_group : The kind of extraction the waterpoint uses
  • extraction_type_class : The kind of extraction the waterpoint uses
  • management : How the waterpoint is managed
  • management_group : How the waterpoint is managed
  • payment : What the water costs
  • payment_type : What the water costs
  • water_quality : The quality of the water
  • quality_group : The quality of the water
  • quantity : The quantity of water
  • quantity_group : The quantity of water
  • source : The source of the water
  • source_type : The source of the water
  • source_class : The source of the water
  • waterpoint_type : The kind of waterpoint
  • waterpoint_type_group : The kind of waterpoint

Phase 2: Clean, Wrangle, and Engineer Features

Finding a good place to start when attacking a dataset for the first time is not trivial, as there are usually several obvious places to begin cleaning, feature engineering, and exploratory analysis. I’ll offer some reassuring words for the reader here: pick a place to start, trust your intuition, and just start coding. If you hit a brick wall, don’t be afraid to retreat, regroup, and attack elsewhere. It, and you, will be fine. We will start by cleaning the construction_year feature with df.replace().

# Replace the placeholder value 0 with 1993, an arbitrarily chosen year
features['construction_year'] = features['construction_year'].replace({0: 1993})

Feature Engineering

With this cleaned construction year feature we can engineer a pump age and an average pop served per year feature.

# Assumes date_recorded has already been reduced to a year,
# e.g. pd.to_datetime(features['date_recorded']).dt.year
features['age'] = features['date_recorded'] - features['construction_year']
features['pop/year'] = features['population'] / features['age']

We can take our feature engineering even further by creating a feature that represents the theoretical amount of water each person served by a wellhead has access to.

features['water/person'] = features['amount_tsh'] / features['population']

Encoding

Encoding is useful for quantifying categorical or otherwise qualitative data. There is an ever-expanding library of encoding methods focused on different types of data. We will use one-hot encoding for low-cardinality features and ordinal encoding for high-cardinality features. We also impute null values and apply standard scaling to clean the data before feeding it to our model.
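A minimal sketch of that preprocessing chain, using scikit-learn together with the category_encoders library; the column lists below are illustrative assumptions rather than the exact split used in the notebook, and every column not listed is assumed to already be numeric:

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

low_cardinality = ['basin', 'quantity', 'payment_type']    # assumed example columns
high_cardinality = ['funder', 'installer', 'ward']         # assumed example columns

# One-hot encode the low-cardinality columns, ordinal-encode the high-cardinality
# ones, then impute remaining nulls and scale everything to zero mean, unit variance.
preprocessing = make_pipeline(
    ce.OneHotEncoder(cols=low_cardinality, use_cat_names=True),
    ce.OrdinalEncoder(cols=high_cardinality),
    SimpleImputer(strategy='median'),
    StandardScaler()
)

encoded = preprocessing.fit_transform(features)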

Exploratory Visualization

A smart way to explore data is to create visualizations. This is a very freeform process, and again, every dataset is different, so there is no one-size-fits-all strategy. Because this part is code intensive, I will just present the visualizations; if you are interested in the code, there is a link to a Colab notebook at the end of this post.

Since the water pumps in our dataset are stationary devices it could help to visualize their geographic locations.
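One quick way to do that (a sketch with seaborn; the notebook's plot may be styled differently) is a scatter of longitude against latitude colored by pump status:

import matplotlib.pyplot as plt
import seaborn as sns

# Scatter of pump coordinates, colored by operating status
sns.scatterplot(x='longitude', y='latitude', hue='status_group',
                data=df, s=10, alpha=0.5)
plt.title('Water pump status by location')
plt.show()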

There seems to be no clear relationship between the latitude and longitude of a pump and its status. However, there are many distinct clusters of pumps that appear to share the same status.

Phase 3: Build a Baseline Model and an Overfitting Model

We will build a baseline model to set the lower bar of our prediction capabilities. I particularly like using a majority class prediction as a baseline because it is easy to wrap my head around.

X_matrix, y_vector = df.drop(columns='status_group'), df['status_group']
majority_class = y_vector.mode()[0]
y_vector.value_counts(normalize=True)
____________________________________________________________________
functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

Since functional is the status of 54% of our instances we expect our majority class prediction to be ~54% accurate.

from sklearn.metrics import accuracy_score

majority_prediction = [majority_class] * len(y_vector)
accuracy_score(y_vector, majority_prediction)
____________________________________________________________________
0.543080808080808

Awesome, we now have a bottom line for our predictions going forward.

Let's create a model that overfits our data to ensure we can make worthwhile predictions. My usual choice for this task is a decision tree.

from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(max_depth=50)
decision_tree.fit(X_matrix, y_vector)
____________________________________________________________________
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=50,
    max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
    min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, presort=False, random_state=None,
    splitter='best')

We can see how our overfitted model performs by using a confusion matrix.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

def con_matrix_analysis(model):
    # Predict on the training matrix and report per-class precision and recall
    predictions = model.predict(X_matrix)
    print(classification_report(y_vector, predictions,
          target_names=['Functional', 'Needs Repair', 'Not-Functional']))

    # Label the raw confusion matrix so rows are actual and columns are predicted
    con_matrix = pd.DataFrame(confusion_matrix(y_vector, predictions),
        columns=['Predicted Functional', 'Predicted Needs Repair', 'Predicted Not-Functional'],
        index=['Actual Functional', 'Actual Needs Repair', 'Actual Not-Functional'])

    sns.heatmap(data=con_matrix, cmap='cool')
    plt.show()
    return con_matrix

con_matrix_analysis(decision_tree)

Phase 4: Build an Accurate Model and Tune its Hyperparameters

We will use a random forest classifier for our final model. A random forest is exactly what it sounds like: a forest of decision trees. We will also use randomized-search cross-validation to select the best model from an assortment of hyperparameter settings.
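Here is a minimal randomized-search sketch; the hyperparameter ranges and n_iter below are illustrative assumptions, not the exact search space from the notebook:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumed search space: sample the number of trees and tree depth at random
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(10, 40),
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=3,
    scoring='accuracy',
    random_state=42
)
search.fit(X_matrix, y_vector)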

Based on our cross-validation the random forest classifier with the most accurate results is a model with 92 estimators and a max depth of 26. When we pass the selected model to our confusion matrix analysis function we achieve the results below.
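Assuming the search object from the sketch above, the selected model can be pulled out of the search and passed straight to the analysis function:

con_matrix_analysis(search.best_estimator_)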

Phase 5: Predict and Export

The final phase in our process is to use our model to predict the status of the pumps in our X_test dataframe, convert that dataframe to a .csv file, and export the file from our environment. Pandas again comes to the rescue with the DataFrame.to_csv() method for writing dataframes to .csv files.

prediction = model.predict(X_test)
submission['status_group'] = prediction
submission.to_csv('submission_1.csv', index=False)

from google.colab import files
files.download('submission_1.csv')

I hope I have presented an easy-to-follow and informative introduction to Kaggle competitions, machine learning, and data science tasks. For those interested, the Colab notebook I used for this project is linked below.

Follow me on Twitter, GitHub, and LinkedIn

Here is the link to the Colab Notebook I used for this post.
