Pump it Up: Data Mining the Water Table
This is an online data mining contest hosted by drivendata.org, using data from Taarifa and the Tanzanian Ministry of Water. The task is to predict which pumps are functional, which need some repairs, and which don't work at all.
GitHub link to my solution- https://github.com/vaibhavska/Tanzania-s-water-problem-Pump-It-up
My approach-
The training data set consists of 41 features and 59,400 data points.
The features in this dataset-
status_group- functional, non-functional or needs repair
amount_tsh — Total static head (amount water available to waterpoint)
date_recorded — The date the row was entered
funder — Who funded the well
gps_height — Altitude of the well
installer — Organization that installed the well
longitude — GPS coordinate
latitude — GPS coordinate
wpt_name — Name of the waterpoint if there is one
num_private — No description
basin — Geographic water basin
subvillage — Geographic location
region — Geographic location
region_code — Geographic location (coded)
district_code — Geographic location (coded)
lga — Geographic location
ward — Geographic location
population — Population around the well
public_meeting — True/False
recorded_by — Group entering this row of data
scheme_management — Who operates the water point
scheme_name — Who operates the water point
permit — If the water point is permitted
construction_year — Year the water point was constructed
extraction_type — The kind of extraction the water point uses
extraction_type_group — The kind of extraction the water point uses
extraction_type_class — The kind of extraction the water point uses
management — How the water point is managed
management_group — How the water point is managed
payment — What the water costs
payment_type — What the water costs
water_quality — The quality of the water
quality_group — The quality of the water
quantity — The quantity of water
quantity_group — The quantity of water
source — The source of the water
source_type — The source of the water
source_class — The source of the water
waterpoint_type — The kind of waterpoint
waterpoint_type_group — The kind of waterpoint
Challenges-
- There were 41 features in the dataset, too many to handle and use accurately.
- Many features have missing values (recorded as 0s); some important features like gps_height have as many as 20,438 missing values out of 59,400 data points.
- Many categorical features, such as subvillage, had far too many distinct levels.
Approach-
My first approach was to find the features that would be important in predicting the result. Intuitively, the location of and the population around a water point should play a vital role in its status.
But there was a problem: features like gps_height, population, latitude and longitude had many missing data points, so I filled those with the mean or median (as appropriate) of the respective feature within the district the water point lies in.
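The district-wise imputation described above can be sketched with pandas. The frame below is a toy stand-in for the training data (the column names match the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; 0 marks a missing value,
# as in the real dataset.
df = pd.DataFrame({
    "district_code": [1, 1, 1, 2, 2],
    "gps_height":    [500, 0, 700, 0, 300],
    "population":    [100, 0, 200, 150, 0],
})

# Treat 0 as missing, then fill each gap with the median of the
# same feature within the same district.
for col in ["gps_height", "population"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df.groupby("district_code")[col].transform(
        lambda s: s.fillna(s.median())
    )
```

The same pattern with `s.mean()` gives the mean-based fill where that is preferred.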
Data Visualisation:
Water Quality vs Status Group-
It can be clearly seen that if the water is soft there is a very high probability of the water point being functional, while if it is salty, functional and non-functional are almost equally likely.
Region vs Status Group-
It could be clearly seen that in some regions there is a very high probability of a water point being functional against non-functional.
Population vs Status Group
It could be seen that as population grows, functional water points (the blue line) clearly dominate.
Construction year vs Status Group
Gps_height vs Status Group
Latitude vs Status Group
Longitude vs Status Group
Operational Year vs Status Group
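Each of the comparisons above is a feature-vs-status breakdown, which pandas can produce with `crosstab`. A minimal sketch on toy data (the real code would use the full training frame):

```python
import pandas as pd

# Toy stand-in for two columns of the training data.
df = pd.DataFrame({
    "water_quality": ["soft", "soft", "salty", "salty", "soft"],
    "status_group":  ["functional", "functional", "non functional",
                      "functional", "non functional"],
})

# Proportion of each status group within each water_quality level.
table = pd.crosstab(df["water_quality"], df["status_group"],
                    normalize="index")
# table.plot(kind="bar", stacked=True)  # renders the comparison chart
```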
Preprocessing-
Now it's time to reduce the number of features by eliminating the unnecessary ones:
It can be seen that some groups of three features (e.g. extraction_type, extraction_type_group, extraction_type_class) are almost identical, so only one of each group needs to be retained.
Many other features like public_meeting, permit etc. were also dropped because they didn't appear to have any impact on predicting the status group.
A new feature, operational_year, was also added, representing the number of years the water point has been operational.
In the end we are left with 22 features (down from 41), with no missing values in any column.
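The operational-year feature can be derived from the two date columns already in the dataset. A minimal sketch (toy values, real code would run on the full frame):

```python
import pandas as pd

# Toy stand-in rows; the column names match the dataset.
df = pd.DataFrame({
    "date_recorded": ["2013-02-04", "2011-03-15"],
    "construction_year": [1999, 2005],
})

# Years between construction and the date the row was recorded.
df["date_recorded"] = pd.to_datetime(df["date_recorded"])
df["operational_year"] = df["date_recorded"].dt.year - df["construction_year"]
```

In the real data, rows with construction_year recorded as 0 would need the same missing-value treatment as the other features before this subtraction.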
Evaluation-
Now it's time to evaluate different ML algorithms.
First we use pd.factorize to encode the labels as integer categorical codes.
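For example, applied to the three status_group labels:

```python
import pandas as pd

# The three target classes from the competition.
status = pd.Series(["functional", "non functional",
                    "functional needs repair", "functional"])

# codes: integer label per row; uniques: mapping back to the strings.
codes, uniques = pd.factorize(status)
```

`codes` comes out as `[0, 1, 2, 0]`, and `uniques` lets you map predictions back to the original class names for the submission file.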
From here I tried different algorithms, from XGBClassifier to Random Forest to ensembles, with different parameters, and I got the best results with
— RandomForestClassifier(n_estimators=1000)
with a public score of 0.8162 and a rank of 486 out of 5,300 contestants (at that time).
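The winning configuration is a standard scikit-learn fit. A self-contained sketch on synthetic data (the real run uses the preprocessed 22-feature matrix and n_estimators=1000; a smaller forest is used here only to keep the example fast):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed 22-feature, 3-class data.
X, y = make_classification(n_samples=500, n_features=22,
                           n_informative=10, n_classes=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators=1000 in the actual submission; 100 here for speed.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```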
Any suggestions, especially on how to improve the model, are most welcome 😄.