Water Pump Failure Prediction in Tanzania

Prediction of maintenance needs for water pumps in Tanzania

Attribution: www.kickstart.org


The African continent has a quarter of the world’s arable land, and the majority of the labour force is engaged in the agricultural sector. In sub-Saharan Africa, only 4 percent of arable land is irrigated, severely constraining agricultural productivity in a region where an estimated one third of the population is chronically undernourished. By comparison, 37 percent of arable land is irrigated in Asia, 24 percent in Northern Africa and 15 percent in Latin America. This structural imbalance contributes to widespread poverty and precarious food security, leading to poor nutrition, health and education outcomes for the populace.[1][2]

While large, centralized irrigation schemes built around big water storage dams boost food production and reduce famine risks for millions of people, they have often proven to be donor-dependent, beyond the maintenance capacity of central governments and environmentally destructive.[3]

By contrast, decentralized irrigation, where the local communities take ownership of the water resources, can be more cost-effective and avoid the environmental and social downsides of big dam-and-canal systems. According to the Food and Agriculture Organization (FAO), lower-cost, more water-efficient irrigation technologies have the potential to greatly expand small-scale irrigation in East and Southern Africa.[4]

Problem Statement


The Rural Water Supply Network estimates, from a sample of the 60,000 handpumps installed across sub-Saharan Africa every year, that up to 40% of those in the region are not functional over a 20-year period.[5] As with any new technology, the laudable focus on improving access to water infrastructure also comes with a crisis of failure. A recent study concluded that the aggregated costs to stakeholders of rural water supply failure in Africa represent a lost investment in excess of $1.2 billion.[6]

There are three primary reasons for this abysmal failure rate:

  1. Most handpumps are manufactured in India with poor quality controls and recycled parts.
  2. Inaccurate placement of groundwater extraction boreholes.
  3. The shortage of skilled labor for placement and maintenance. [7]

The aim of this project is to predict the maintenance requirements of water pumps in Tanzania. Accurately identifying the pumps prone to failure will help stakeholders, including the Tanzanian Ministry of Water and Irrigation (TMWI), donors and community organizations, to better allocate preventive and curative interventions.

The dataset

The dataset was obtained through a www.DrivenData.org data challenge competition and is attributed to the Taarifa waterpoints dashboard, which aggregates data from the TMWI.

The dataset is comprised of over 60,000 observations and 42 features. The data was very noisy and incomplete with a number of important features riddled with null values. The majority of the variables are categorical, although we have a handful of numerical features, as well as a couple of temporal features.

After removing a number of duplicates and columns with a high number of missing values, the final dataset was reduced to 33 features. In the interest of brevity, only those identified as important will be discussed in the inference section.

The geo-spatial information was clustered using the HDBSCAN algorithm, reducing the latitude and longitude coordinates to a single cluster label as a form of dimensionality reduction.

The dependent variable and prediction baseline

Model Selection and Evaluation

Model Evaluation Metrics: The project used accuracy as the primary evaluation metric, defined as the fraction of correct predictions out of all predictions. More generally, accuracy is the proximity of measurement results to the true value.
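As a minimal sketch of the accuracy definition above, the fraction of correct predictions can be computed directly. The labels below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical status labels for five pumps (illustrative only).
y_true = ["functional", "non functional", "functional",
          "functional needs repair", "functional"]
y_pred = ["functional", "functional", "functional",
          "functional needs repair", "non functional"]

score = accuracy(y_true, y_pred)  # 3 of the 5 predictions are correct
print(f"accuracy = {score:.2f}")
```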

By Pekaje at English Wikipedia — Transferred from en.wikipedia to Commons., GFDL, https://commons.wikimedia.org/w/index.php?curid=1862863

Guide to Visualization: The project used three visualizations to assess predictive accuracy: the Receiver Operating Characteristic curve, the Class Prediction Error chart and the Confusion Matrix.

The Receiver Operating Characteristic (ROC) curve describes: a) the trade-off between sensitivity and specificity, since an increase in one comes with a decrease in the other; b) test accuracy, which can be read from how closely the curve follows the top and left-hand borders of the plot, with the test being more accurate the further the curve lies from the diagonal; and c) the likelihood ratio, given by the slope of the curve at any particular cutpoint.
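To make the ROC description concrete, here is a small sketch that sweeps a threshold over predicted scores and records the false positive rate and true positive rate (sensitivity) at each cutpoint. It assumes a binary setting, for example one class against the rest, and the labels and scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per cutpoint."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Every distinct score is used as a threshold, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve; 0.5 corresponds to the diagonal."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical one-vs-rest labels (1 = non-functional) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = area_under_curve(points)
print(points)
print(f"AUC = {area:.3f}")
```

The closer the points hug the top-left corner, the larger the area; a curve on the diagonal would give an area of 0.5.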

The Class Prediction Error (CPE) chart provides a visual depiction of how accurately the model predicts each class.

Each row of the Confusion Matrix represents the instances in a predicted class while each column represents the instances in an actual class. The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).
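The layout described above can be sketched in a few lines. This example follows the convention in the text (rows are predicted classes, columns are actual classes); note that some libraries, such as scikit-learn, use the transposed layout. The labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are predicted classes, columns are actual classes."""
    counts = Counter(zip(y_pred, y_true))
    return [[counts[(pred, actual)] for actual in labels] for pred in labels]


# Hypothetical class names and predictions (illustrative only).
labels = ["functional", "needs repair", "non functional"]
y_true = ["functional", "functional", "needs repair",
          "non functional", "non functional", "needs repair"]
y_pred = ["functional", "functional", "functional",
          "non functional", "functional", "needs repair"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>15}: {row}")
```

Off-diagonal entries show which classes are confused: in this toy example, the first row shows one "needs repair" pump and one "non functional" pump both predicted as "functional".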


The project used three models to predict the maintenance requirements for each pump: Logistic Regression, XGBoost and a Neural Network.

Logistic Regression

Logistic regression represents the model as an equation in which input values are combined linearly using weights, or coefficients, to predict an output value, here one of several classes. Logistic regression also provides coefficients that indicate the effect of each feature on the classification of each class.
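The linear combination described above can be illustrated with a small sketch. It assumes the multinomial (softmax) form of logistic regression, which the post does not specify, and every coefficient, feature value and class name below is hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, weights, biases):
    """One linear score per class: bias + dot(class weights, features)."""
    scores = [
        b + sum(w * x for w, x in zip(class_weights, features))
        for class_weights, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical coefficients: one row of weights per class, one weight per feature.
classes = ["functional", "needs repair", "non functional"]
weights = [[0.5, -1.0], [0.0, 0.2], [-0.5, 1.0]]
biases = [0.1, -0.3, 0.0]
features = [1.0, 2.0]

probs = predict_proba(features, weights, biases)
predicted = classes[probs.index(max(probs))]
print(dict(zip(classes, (round(p, 3) for p in probs))), "->", predicted)
```

The sign and size of each weight indicate how strongly a feature pushes the prediction towards or away from that class, which is what makes the model interpretable.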

LR train score :

LR test score :


XGBoost

Classification and Regression Trees (CART) are simple and interpretable but have limited predictive power. Combining many such trees through gradient boosting yields a considerably more accurate model.

XGBoost is a gradient boosting library built on decision trees and designed for speed and performance. The Newton boosting used by XGBoost is likely to learn better tree structures, and the library includes an extra randomization parameter, column subsampling, which helps reduce the correlation between trees even further.
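The two ideas above can be sketched without the library itself. This is an illustration of the underlying arithmetic, not XGBoost's implementation: it assumes binary log loss (the project itself is multiclass), an L2 penalty `reg_lambda`, and hypothetical column names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def newton_leaf_weight(labels, raw_scores, reg_lambda=1.0):
    """Newton step for one leaf: -sum(gradients) / (sum(hessians) + lambda)."""
    grads = [sigmoid(s) - y for y, s in zip(labels, raw_scores)]
    hess = [sigmoid(s) * (1.0 - sigmoid(s)) for s in raw_scores]
    return -sum(grads) / (sum(hess) + reg_lambda)


def subsample_columns(columns, fraction, rng):
    """Pick a random subset of feature columns for a single tree."""
    k = max(1, int(len(columns) * fraction))
    return rng.sample(columns, k)


# Four hypothetical samples falling in one leaf, all starting at a raw score of 0.
labels = [1, 1, 0, 1]
raw_scores = [0.0, 0.0, 0.0, 0.0]
weight = newton_leaf_weight(labels, raw_scores)

# Hypothetical column names; each tree only sees a random half of them.
columns = ["population", "gps_height", "quantity",
           "source", "installer", "extraction_type"]
rng = random.Random(0)
chosen = subsample_columns(columns, 0.5, rng)
print(f"leaf weight = {weight:.2f}, columns for this tree = {chosen}")
```

Using second-order information (the Hessian) scales each step by the curvature of the loss, while giving each tree a different subset of columns makes the trees less alike.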

XGBoost train score: 0.78
XGBoost test score: 0.77

Neural Network

A neural network with L1-L2 regularization, a neuron dropout rate of 0.5 and early stopping after 6 epochs was used for prediction. A thorough discussion of tuning a neural net on a noisy dataset can be found in a previous blog post.

Epoch 21/50
- 5s - train_loss: 0.8854 - train_acc: 0.5427 - test_loss: 0.8853 - test_acc: 0.5427
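As a rough sketch of the early stopping rule, the loop below halts once the monitored loss has failed to improve for a set number of epochs. Reading the "6 epochs" as a patience value is an assumption, and the loss values are hypothetical rather than taken from the run above.

```python
def early_stopping_epoch(val_losses, patience=6):
    """Return (stopping epoch, best epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best loss: remember it and reset the waiting period.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop training.
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical losses: improvement for five epochs, then a plateau.
val_losses = [0.95, 0.92, 0.90, 0.89, 0.885] + [0.886] * 10
stopped_at, best_at = early_stopping_epoch(val_losses, patience=6)
print(f"best epoch = {best_at}, stopped at epoch {stopped_at}")
```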

Of the three models, XGBoost achieved the highest accuracy.


Model Inference

The XGBoost model provided a measure of feature importance.

XGBoost feature importance

Population and geographical height were the main predictors of pump failure rate, followed by the districts and quantity of water pumped. Source of the water, the installer and the extraction types were also identified as contributing to the pump failure rate.

As the next step, the failure rates for the identified features were examined further.
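A per-feature failure rate of this kind can be computed with a simple group-by. The sketch below uses hypothetical records and field names ("extraction", "status"); the real column names in the dataset may differ.

```python
from collections import defaultdict


def failure_rate_by(records, feature, status_key="status", failed="non functional"):
    """Share of failed pumps within each value of the given feature."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in records:
        totals[row[feature]] += 1
        if row[status_key] == failed:
            failures[row[feature]] += 1
    return {value: failures[value] / totals[value] for value in totals}


# Hypothetical records (illustrative only, not from the dataset).
records = [
    {"extraction": "handpump", "status": "functional"},
    {"extraction": "handpump", "status": "non functional"},
    {"extraction": "motorpump", "status": "non functional"},
    {"extraction": "motorpump", "status": "non functional"},
    {"extraction": "gravity", "status": "functional"},
]

rates = failure_rate_by(records, "extraction")
for value, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{value:>10}: {rate:.0%}")
```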

Geographic Features

Pump failures do not seem to be affected by geography, as the failure rates are randomly distributed across the latitude/longitude clusters.

Administrative Districts

Attribution: By TUBS [GFDL (http://www.gnu.org/copyleft/fdl.html)], from Wikimedia Commons

The pump failures do not show any administrative clustering either.

Basin and Extraction Types

Boreholes and shallow wells are among the top sources of pump failure, supporting earlier findings that inaccurate placement of groundwater extraction was a leading contributor to pump failure. The relatively complex extraction types, such as motor pumps and wind-powered pumps, were also more prone to failure than low-tech methods such as rope, hand and gravity pumps.

Extraction Group

As expected, dry water sources were represented heavily in the non-functional pumps.


The ‘0’ values made up 33% of the construction year values. I chose not to use this variable to compute the age of the pumps because of this high proportion of missing values. I would like to further investigate the relationship between age and pump failure.
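A quick check of this kind might look like the sketch below, which measures the share of placeholder zeros and shows how an age feature would end up with gaps. The years and the reference year are hypothetical.

```python
def zero_share(values):
    """Fraction of entries equal to 0, the placeholder for a missing year."""
    return sum(1 for v in values if v == 0) / len(values)


def pump_ages(years, reference_year):
    """Age per pump; None where the construction year is the 0 placeholder."""
    return [None if y == 0 else reference_year - y for y in years]


# Hypothetical construction years (illustrative only).
years = [0, 1998, 2005, 0, 2010, 1987]
share = zero_share(years)
ages = pump_ages(years, reference_year=2013)  # reference year is an assumption
print(f"share of zeros = {share:.0%}, ages = {ages}")
```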

Next Steps

I plan to conduct an unsupervised clustering analysis on both the functional and non-functional pump status groups to compare and contrast the salient features.

I would like to acknowledge @keerthykishore’s work as the inspiration for this project. The project notebooks are here on GitHub.

Written by Harsha Goonewardana

I am interested in the intersection of data science and international development. Better development outcomes through analysis.
