Pump it Up: Data Mining the Water Table
This is an online data mining contest hosted by drivendata.org, using data from Taarifa and the Tanzanian Ministry of Water. The task is to predict which pumps are functional, which need some repairs, and which don't work at all.
GitHub link to my solution- https://github.com/vaibhavska/Tanzania-s-water-problem-Pump-It-up
My approach-
The training data set consists of 41 features and 59,400 data points.
The features in this dataset-
status_group- functional, non-functional or needs repair
amount_tsh — Total static head (amount water available to waterpoint)
date_recorded — The date the row was entered
funder — Who funded the well
gps_height — Altitude of the well
installer — Organization that installed the well
longitude — GPS coordinate
latitude — GPS coordinate
wpt_name — Name of the waterpoint if there is one
num_private — No description
basin — Geographic water basin
subvillage — Geographic location
region — Geographic location
region_code — Geographic location (coded)
district_code — Geographic location (coded)
lga — Geographic location
ward — Geographic location
population — Population around the well
public_meeting — True/False
recorded_by — Group entering this row of data
scheme_management — Who operates the water point
scheme_name — Who operates the water point
permit — If the water point is permitted
construction_year — Year the water point was constructed
extraction_type — The kind of extraction the water point uses
extraction_type_group — The kind of extraction the water point uses
extraction_type_class — The kind of extraction the water point uses
management — How the water point is managed
management_group — How the water point is managed
payment — What the water costs
payment_type — What the water costs
water_quality — The quality of the water
quality_group — The quality of the water
quantity — The quantity of water
quantity_group — The quantity of water
source — The source of the water
source_type — The source of the water
source_class — The source of the water
waterpoint_type — The kind of waterpoint
waterpoint_type_group — The kind of waterpoint
Challenges-
- There were 41 features in the dataset, too many to handle and use accurately.
- Many features have missing values (recorded as 0s); some important features like gps_height have as many as 20,438 missing values out of 59,400 data points.
- Many categorical features, such as subvillage, had far too many distinct levels.
Approach-
My first approach was to find the features that would be important in predicting the result. Intuitively, the location of and the population around a water point should play a vital role in its status.
But there was a problem: features like gps_height, population, latitude and longitude had many missing data points, so I filled those with the mean or median (as appropriate) of the respective feature within the district the water point lies in.
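The district-wise imputation described above can be sketched with pandas. The frame below is a toy stand-in for the training data (the column names match the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; 0 marks a missing value,
# as in the real dataset.
df = pd.DataFrame({
    "district_code": [1, 1, 1, 2, 2],
    "gps_height":    [500, 0, 700, 0, 300],
    "population":    [100, 0, 200, 150, 0],
})

# Treat 0 as missing, then fill each gap with the median of the
# same feature within the same district.
for col in ["gps_height", "population"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df.groupby("district_code")[col].transform(
        lambda s: s.fillna(s.median())
    )
```

The same pattern with `s.mean()` gives the mean-based fill where that is preferred.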
Data Visualisation:
Water Quality vs Status Group-
It can be clearly seen that if the water is soft there is a very high probability of the water point being functional, while if it is salty, functional and non-functional are almost equally likely.
Region vs Status Group-
It could be clearly seen that in some regions there is a very high probability of a water point being functional against non-functional.
Population vs Status Group
It could be seen that as population grows, functional water points (the blue line) clearly dominate.
Construction year vs Status Group
Gps_height vs Status Group
Latitude vs Status Group
Longitude vs Status Group
Operational Year vs Status Group
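Each of the comparisons above is a feature-vs-status breakdown, which pandas can produce with `crosstab`. A minimal sketch on toy data (the real code would use the full training frame):

```python
import pandas as pd

# Toy stand-in for two columns of the training data.
df = pd.DataFrame({
    "water_quality": ["soft", "soft", "salty", "salty", "soft"],
    "status_group":  ["functional", "functional", "non functional",
                      "functional", "non functional"],
})

# Proportion of each status group within each water_quality level.
table = pd.crosstab(df["water_quality"], df["status_group"],
                    normalize="index")
# table.plot(kind="bar", stacked=True)  # renders the comparison chart
```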
Preprocessing-
Now it's time to reduce the number of features by eliminating the unnecessary ones:
It can be seen that some groups of three features (e.g. extraction_type, extraction_type_group, extraction_type_class) are almost identical, so only one of each group needs to be retained.
Many other features like public_meeting, permit etc. were also dropped because they didn't appear to have any impact on predicting the status group.
A new feature, operational_year, was also added, representing the number of years the water point has been operational.
In the end we are left with 22 features (down from 41), with no missing values in any column.
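The operational-year feature can be derived from the two date columns already in the dataset. A minimal sketch (toy values, real code would run on the full frame):

```python
import pandas as pd

# Toy stand-in rows; the column names match the dataset.
df = pd.DataFrame({
    "date_recorded": ["2013-02-04", "2011-03-15"],
    "construction_year": [1999, 2005],
})

# Years between construction and the date the row was recorded.
df["date_recorded"] = pd.to_datetime(df["date_recorded"])
df["operational_year"] = df["date_recorded"].dt.year - df["construction_year"]
```

In the real data, rows with construction_year recorded as 0 would need the same missing-value treatment as the other features before this subtraction.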
Evaluation-
Now it's time to evaluate different ML algorithms.
First we use pd.factorize to encode the labels as integer categorical codes.
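For example, applied to the three status_group labels:

```python
import pandas as pd

# The three target classes from the competition.
status = pd.Series(["functional", "non functional",
                    "functional needs repair", "functional"])

# codes: integer label per row; uniques: mapping back to the strings.
codes, uniques = pd.factorize(status)
```

`codes` comes out as `[0, 1, 2, 0]`, and `uniques` lets you map predictions back to the original class names for the submission file.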
From here I tried different algorithms, from XGBClassifier to Random Forest to ensembles, with different parameters, and I got the best results with
— RandomForestClassifier(n_estimators=1000)
with a public score of 0.8162 and a rank of 486 out of 5,300 contestants (at that time).
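The winning configuration is a standard scikit-learn fit. A self-contained sketch on synthetic data (the real run uses the preprocessed 22-feature matrix and n_estimators=1000; a smaller forest is used here only to keep the example fast):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed 22-feature, 3-class data.
X, y = make_classification(n_samples=500, n_features=22,
                           n_informative=10, n_classes=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators=1000 in the actual submission; 100 here for speed.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```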
Any suggestions, especially on how to improve the model, are most welcome 😄.