Pump it Up: Data Mining the Water Table

Vaibhav Shukla
5 min read · Jun 26, 2018


This is an online data mining contest hosted by drivendata.org. Using data from Taarifa and the Tanzanian Ministry of Water, one has to predict which pumps are functional, which need some repairs, and which don’t work at all.

GitHub link to my solution: https://github.com/vaibhavska/Tanzania-s-water-problem-Pump-It-up

My approach-

The training data set consists of 41 features and 59400 data points.

The features in this dataset-

status_group- functional, non-functional or needs repair

amount_tsh — Total static head (amount water available to waterpoint)

date_recorded — The date the row was entered

funder — Who funded the well

gps_height — Altitude of the well

installer — Organization that installed the well

longitude — GPS coordinate

latitude — GPS coordinate

wpt_name — Name of the waterpoint if there is one

num_private -No description

basin — Geographic water basin

subvillage — Geographic location

region — Geographic location

region_code — Geographic location (coded)

district_code — Geographic location (coded)

lga — Geographic location

ward — Geographic location

population — Population around the well

public_meeting — True/False

recorded_by — Group entering this row of data

scheme_management — Who operates the water point

scheme_name — Who operates the water point

permit — If the water point is permitted

construction_year — Year the water point was constructed

extraction_type — The kind of extraction the water point uses

extraction_type_group — The kind of extraction the water point uses

extraction_type_class — The kind of extraction the water point uses

management — How the water point is managed

management_group — How the water point is managed

payment — What the water costs

payment_type — What the water costs

water_quality — The quality of the water

quality_group — The quality of the water

quantity — The quantity of water

quantity_group — The quantity of water

source — The source of the water

source_type — The source of the water

source_class — The source of the water

waterpoint_type — The kind of waterpoint

waterpoint_type_group — The kind of waterpoint

Challenges-

  • There were 41 features in the dataset, too many to handle and use accurately.
  • Many of the features have many missing values (recorded as 0), with some important features like gps_height having as many as 20,438 missing values out of 59400 data points.
  • Many features, such as subvillage, had too many distinct values.
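A quick way to size up the last two challenges is to count the zero placeholders and the distinct values per column. The sketch below uses a tiny made-up frame in place of the real training data; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the training data; the values are made up.
df = pd.DataFrame({
    "gps_height": [0, 1390, 0, 263],
    "population": [0, 280, 250, 0],
    "subvillage": ["A", "B", "C", "D"],
    "basin": ["Pangani", "Lake Victoria", "Pangani", "Pangani"],
})

# Zeros act as missing-value placeholders in the numeric columns.
zero_counts = (df[["gps_height", "population"]] == 0).sum()

# Number of distinct values in each text column.
cardinality = df[["subvillage", "basin"]].nunique()

print(zero_counts)
print(cardinality)
```

On the real data, a column like subvillage would show a very large distinct count, which is what makes it hard to encode.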

Approach-

My first step was to find the features that would be important in predicting the result. Intuitively, we know that the location and the population around a water point play a vital role in its status.

But there was a problem: features like gps_height, population, latitude and longitude had many missing data points, so I filled those missing values with the mean or median (as required) of the respective feature within the district in which the water point lies.
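The district-level fill can be sketched with a pandas groupby. This is a minimal illustration, not the exact code from the repository: the helper `fill_by_district` is a hypothetical name, the toy values are made up, and grouping on `district_code` and the mean/median choice per column are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data; 0 marks a missing value, as in the competition data.
df = pd.DataFrame({
    "district_code": [1, 1, 1, 2, 2, 2],
    "gps_height": [0, 100, 300, 0, 50, 50],
    "population": [0, 10, 1000, 20, 0, 40],
})

def fill_by_district(frame, column, how):
    """Replace zeros in `column` with the district-level mean or median."""
    values = frame[column].replace(0, np.nan)
    # transform() returns one statistic per row, aligned with the original index.
    stat = values.groupby(frame["district_code"]).transform(how)
    return values.fillna(stat)

df["gps_height"] = fill_by_district(df, "gps_height", "mean")
df["population"] = fill_by_district(df, "population", "median")
print(df)
```

If every value in a district is missing, the result stays NaN for that district, so a fallback (for example a region-level or global statistic) would still be needed.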

Data Visualisation:

Water Quality vs Status Group-

We can clearly see that if the water is soft, there is a very high probability of the water point being functional, while if it is salty, the probabilities of functional and non-functional are almost equal.
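The same comparison can be read off as numbers with a row-normalised cross-tabulation. The rows below are made up for illustration, so the shares do not reflect the real dataset.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
df = pd.DataFrame({
    "water_quality": ["soft", "soft", "soft", "soft", "salty", "salty"],
    "status_group": ["functional", "functional", "functional",
                     "non functional", "functional", "non functional"],
})

# normalize="index" turns each row into shares that sum to 1,
# i.e. the share of each status within one water quality.
shares = pd.crosstab(df["water_quality"], df["status_group"], normalize="index")
print(shares)
```

Calling `shares.plot(kind="bar", stacked=True)` would give a chart of the same table, assuming matplotlib is installed.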

Region vs Status Group-

In some regions a water point is clearly far more likely to be functional than non-functional.

Population vs Status Group

It could be seen that as population grows, the blue line clearly dominates.

Construction year vs Status Group

Gps_height vs Status Group

Latitude vs Status Group

Longitude vs Status Group

Operational Year vs Status Group

Preprocessing-

Now it’s time to reduce the number of features by eliminating the unnecessary ones, like-

It can be seen that the three features are almost alike, so we could keep any two.

Many other features, like public_meeting and permit, were also dropped because they didn’t seem to have any impact on the prediction of the status group.

A new feature, operational year, was also added. It represents the number of years for which the water point had been operational.
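One plausible way to derive this feature is to subtract construction_year from the year in date_recorded. That definition is an assumption, as is the column name `operational_year`; the dates below are made up.

```python
import pandas as pd

# Made-up rows using the dataset's column names.
df = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-03-06", "2013-02-25"],
    "construction_year": [1999, 2010, 2009],
})

# Year in which the row was recorded.
recorded_year = pd.to_datetime(df["date_recorded"]).dt.year

# Assumed definition: years between construction and the recording date.
df["operational_year"] = recorded_year - df["construction_year"]
print(df)
```

Rows where construction_year is still 0 would give meaningless values here, so the missing years have to be filled before this step.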

In the end we are left with 22 features, down from the 41 we started with, and no missing values in any column.

Evaluation-

Now it’s time to evaluate different ML algorithms on our data.

First we use pd.factorize to encode the categorical variables as integer labels.
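As a small sketch of that step, the loop below factorizes each text column and keeps the mapping so the codes can be decoded later. The toy frame and the `categories` dictionary are illustrative, not taken from the repository.

```python
import pandas as pd

# Toy frame with two categorical columns from the dataset.
df = pd.DataFrame({
    "basin": ["Pangani", "Lake Victoria", "Pangani"],
    "quantity": ["enough", "dry", "enough"],
})

encoded = df.copy()
categories = {}  # column name -> Index of original values, in code order
for column in ["basin", "quantity"]:
    # factorize returns integer codes plus the unique values they refer to.
    codes, uniques = pd.factorize(encoded[column])
    encoded[column] = codes
    categories[column] = uniques

print(encoded)
```

The codes depend on the order in which values first appear, so the training and test sets should be encoded together or with the same stored mapping.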

From here I tried different algorithms, from XGBClassifier to Random Forest to ensembles, with different parameters, but the best results came from

— RandomForestClassifier(n_estimators=1000)

with a public score of 0.8162 and a ranking of 486 out of 5300 contestants (at that time).
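For readers who want to try the same kind of model, here is a minimal sketch of fitting a random forest and checking it on a held-out split. It uses synthetic data from scikit-learn with 22 features and 3 classes to mirror the shape of the problem, and fewer trees than the 1000 above so it runs quickly. The split size and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data: 22 features, 3 classes.
X, y = make_classification(
    n_samples=600, n_features=22, n_informative=8, n_classes=3, random_state=0
)

# Hold out a quarter of the rows to estimate accuracy before submitting.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 200 trees keeps the sketch fast; the post's best model used 1000.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_valid, y_valid)
print(f"validation accuracy: {accuracy:.3f}")
```

A local validation score like this gives a rough idea of how a change will do before spending a leaderboard submission on it.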

Suggestions of any kind, especially on how to improve my model, are most welcome😄.
