Pump it up — A comprehensive guide to EDA

Brenda Loznik
6 min read · Dec 27, 2021

Even as a working data scientist, I still love working on projects. Machine learning competitions are a great way to expose yourself to new techniques, packages and innovative ways of thinking. And let’s face it, there isn’t much else to do during a pandemic, so why not improve your data science skills?!

When I first started out with competitions, I wasn’t quite sure where to begin. I would go through the repositories of high-ranking competitors, amazed by the impressive models they created but disappointed by the lack of reasoning behind their decisions. Many competitions and a lot of research later, I have developed a workflow that works for me.

Photo by Annie Spratt on Unsplash

This article is the first in a set of four describing my workflow in the Pump it Up competition by DrivenData. At the time of writing I hold a top 4% rank with a score of 0.8235 on the public leaderboard. If you are researching ways to improve your score in this competition or looking for a workflow you could adapt for another competition, these articles are just for you.

Competition description

Let me briefly explain the goal of this competition for those of you who are new to it. Based on data published by the Tanzanian Ministry of Water and Taarifa, we must predict which water pumps are functional, which need some repairs and which don’t work at all. A good understanding of which water points might fail can aid maintenance operations and ensure safe and clean water is available to the rural communities of Tanzania. The training data consists of roughly 60,000 records and 39 features providing information on the location, management and extraction method of the pump. A full description of the competition and data can be found here.

Data quality reports

I like to start my EDA by creating data quality reports. These reports give you basic insights into your data that you might miss when you immediately start visualizing your data.

Numerical data quality report

The data quality report doesn’t disappoint… A longitude of 0 in this part of the world? Hmmm, that can’t be right. It looks like missing data are encoded as zeros.
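A report like this can be sketched in a few lines of pandas. This is a minimal illustration and not the code behind the table above: the function name `numeric_quality_report`, the `zero_pct` column, the underscore-style column names and the toy values are all assumptions.

```python
import pandas as pd


def numeric_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise every numeric column: missing values, zeros, cardinality and range."""
    num = df.select_dtypes(include="number")
    report = pd.DataFrame({
        "count": num.count(),                    # non-missing values
        "missing_pct": num.isna().mean() * 100,  # share of NaN values
        "zero_pct": (num == 0).mean() * 100,     # zeros may be disguised missing values
        "cardinality": num.nunique(),            # number of distinct values
        "min": num.min(),
        "median": num.median(),
        "max": num.max(),
    })
    return report.round(2)


# Toy data (made up for illustration, not the competition data).
toy = pd.DataFrame({
    "longitude": [34.9, 0.0, 35.6, 0.0],
    "construction_year": [1999, 0, 2009, 2010],
})
report = numeric_quality_report(toy)
print(report)
```

A high `zero_pct` next to a minimum of 0 is the kind of signal that gives away a placeholder value such as the zero longitude.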

Let’s have a look at the data quality report for the categorical features.

Categorical data quality report

Cardinality and multicollinearity

The high cardinality of the funder, installer and waterpoint name variables immediately stands out. Simply one-hot encoding these features could result in a high-dimensional feature set and the notorious curse of dimensionality. We should explore these variables further to see if they hold relevant information for our model and think of a proper way to encode them if they do.

I also notice that the num private, public meeting and recorded by variables have a very high mode percentage. This means that the majority of the records contain the same value. These variables will have low variance and therefore provide little information to a model.
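Cardinality and mode percentage are easy to compute yourself. The sketch below assumes pandas; the function name `categorical_quality_report`, the underscore-style column names and the toy values are assumptions made for illustration.

```python
import pandas as pd


def categorical_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise every non-numeric column: missing values, cardinality and mode share."""
    cat = df.select_dtypes(exclude="number")
    rows = {}
    for col in cat.columns:
        counts = cat[col].value_counts()  # sorted descending, NaN excluded
        rows[col] = {
            "count": int(cat[col].count()),
            "missing_pct": cat[col].isna().mean() * 100,
            "cardinality": int(cat[col].nunique()),
            "mode": counts.index[0] if not counts.empty else None,
            # Share of all rows that carry the most frequent value.
            "mode_pct": counts.iloc[0] / len(cat) * 100 if not counts.empty else 0.0,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data (made up for illustration, not the competition data).
toy = pd.DataFrame({
    "funder": ["Danida", "Unicef", "World Vision", "Danida"],
    "recorded_by": ["same company"] * 4,
})
report = categorical_quality_report(toy)
print(report)
```

A `mode_pct` close to 100 flags a near-constant column, while a cardinality close to the row count flags a column that one-hot encoding would blow up.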

Finally, some variables like extraction type, extraction type group and extraction type class seem to hold very similar information. My multicollinearity alarm bells immediately go off. This is certainly something I need to look into during feature selection.
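One quick way to test the suspicion that such variables overlap is to check whether one is simply a coarser version of another. The sketch below is an assumption about how to do that, not the check used for this competition: `is_nested` is a hypothetical helper and the class labels are made up.

```python
import pandas as pd


def is_nested(df: pd.DataFrame, fine: str, coarse: str) -> bool:
    """True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool((df.groupby(fine)[coarse].nunique() <= 1).all())


# Toy data (made up): each detailed type belongs to a single class.
toy = pd.DataFrame({
    "extraction_type": ["gravity", "submersible a", "submersible b",
                        "handpump a", "handpump b"],
    "extraction_type_class": ["gravity", "submersible", "submersible",
                              "handpump", "handpump"],
})
nested = is_nested(toy, "extraction_type", "extraction_type_class")
reverse = is_nested(toy, "extraction_type_class", "extraction_type")
print(nested, reverse)
```

If the fine-grained column fully determines the coarse one, the coarse column adds no new information and the two should not both go into the model unexamined.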

Exploring the distribution of status group

Next, I explore each variable in more detail. I like to compare the distribution of the label (status group) for the different classes of a variable. Why, you may ask. Well, doing this gives you a pretty good idea of whether a variable may contain interesting patterns that your model could learn. This helps you determine if you should keep or drop the variable.

Water quantity

Let’s look at an example. The quantity group variable describes the water availability of the water point. The majority of the pumps (55.9%) have enough water, and these pumps are mostly functional. About 10% of the pumps are dry and almost all of them are non-functional. This feature certainly contains information that will help a model distinguish between functional and non-functional pumps.

Status group distribution for quantity group
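The numbers behind a chart like this come down to a row-normalised cross-tabulation. Here is a minimal sketch, assuming pandas; `status_distribution`, the underscore-style column names, the label spellings and the toy rows are assumptions.

```python
import pandas as pd


def status_distribution(df: pd.DataFrame, feature: str,
                        label: str = "status_group") -> pd.DataFrame:
    """Share (%) of each label value within every class of `feature`."""
    # normalize="index" makes each row sum to 1, so classes of different
    # sizes can be compared directly.
    return (pd.crosstab(df[feature], df[label], normalize="index") * 100).round(1)


# Toy data (made up for illustration, not the competition data).
toy = pd.DataFrame({
    "quantity_group": ["enough"] * 4 + ["dry"] * 2,
    "status_group": ["functional"] * 3 + ["non-functional"] * 3,
})
dist = status_distribution(toy, "quantity_group")
print(dist)
```

If the rows of this table look alike, the feature probably tells the model little. If they differ strongly, as with dry pumps, it is worth keeping.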

In the data quality report we noticed features with either very high or very low cardinality. Let’s see if they hold important information for our model.

Public meeting

The water points that have a public meeting have a higher functionality rate than water points without this meeting. However, the difference is small, so I suspect this feature will have low importance.

Funder

The top 10 funders together funded about 40% of all water points. The Government of Tanzania is the most frequent funder, but their water points have a lower functionality rate than those funded by other parties like Danida (Danish International Development Agency), UNICEF or World Vision. This feature could hold important patterns that a model could learn, but we should consider grouping rare funders together to prevent the model from overfitting on these rare classes.
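Both figures, the share of water points per funder and their functionality rate, can be pulled from one small summary. This is a sketch under assumptions: `top_n_summary`, the column names and the toy rows are hypothetical, and the toy numbers are not the real ones.

```python
import pandas as pd


def top_n_summary(df: pd.DataFrame, feature: str, n: int = 10,
                  label: str = "status_group",
                  positive: str = "functional") -> pd.DataFrame:
    """Share of all records and functional rate for the n most frequent classes."""
    top = df[feature].value_counts().head(n)
    # Mean of a boolean series per class = fraction of functional pumps.
    rate = df[label].eq(positive).groupby(df[feature]).mean()
    return pd.DataFrame({
        "share_pct": top / len(df) * 100,
        "functional_pct": rate.reindex(top.index) * 100,
    })


# Toy data (made up for illustration, not the competition data).
toy = pd.DataFrame({
    "funder": ["Government of Tanzania"] * 4 + ["Danida"] * 3 + ["Unicef"],
    "status_group": ["functional", "non-functional", "non-functional", "non-functional",
                     "functional", "functional", "non-functional",
                     "functional"],
})
summary = top_n_summary(toy, "funder", n=2)
print(summary)
print("combined share:", summary["share_pct"].sum())
```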

Extraction types

While we are at it, let’s also have a better look at the three extraction variables. These variables describe the extraction process of the water point at different levels of granularity. The extraction type variable contains the highest level of granularity, but holds classes that only contain a few water points…

I want to provide my models with as much information as possible, but I don’t want them to overfit on rare classes. This is a trade-off I encountered many times while working on this competition. I decided to group rare classes together and keep the classes that hold a fair number of pumps.
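Grouping rare classes takes only a few lines. The sketch below is one possible way to do it, not necessarily the one used for this competition: `group_rare`, the `min_count` threshold, the "other" label and the toy values are all assumptions.

```python
import pandas as pd


def group_rare(series: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace classes seen fewer than `min_count` times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent classes (and missing values) as they are; relabel the rest.
    return series.where(~series.isin(rare), other)


# Toy data (made up for illustration, not the competition data).
toy = pd.Series(["gravity"] * 3 + ["submersible"] * 2 + ["rare type a", "rare type b"])
grouped = group_rare(toy, min_count=2)
print(grouped.value_counts())
```

The threshold is the knob for the trade-off: a higher `min_count` discards more detail but leaves fewer thin classes to overfit on.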

Beautiful Folium maps

I wonder if the functionality rate is dependent on the location of the pumps. This question is easily answered using a Python package called Folium. Folium is a great library for visualizing geospatial data.

Functional pumps by region
Non-functional pumps by region

There are certainly some regional differences in the functionality rate of the water points. The Iringa region has an exceptionally high rate of functional pumps whereas the Lindi region in the southeast of Tanzania has a large percentage of non-functional pumps.
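For readers who want to try something similar, here is a simplified stand-in for maps like these: it aggregates the functionality rate per region with pandas and draws one Folium circle per region. It is not the code behind the maps above. The helper names, column names, colour threshold, map centre and toy coordinates are assumptions, and rows with a longitude of 0 are dropped because that value looks like a placeholder.

```python
import pandas as pd


def region_summary(df: pd.DataFrame, label: str = "status_group",
                   positive: str = "functional") -> pd.DataFrame:
    """Functional rate (%) and mean coordinates per region."""
    valid = df[df["longitude"] != 0]  # longitude 0 is treated as missing
    summary = (
        valid.assign(is_functional=valid[label].eq(positive))
        .groupby("region")
        .agg(lat=("latitude", "mean"),
             lon=("longitude", "mean"),
             functional_pct=("is_functional", "mean"),
             pumps=("is_functional", "size"))
        .reset_index()
    )
    summary["functional_pct"] *= 100
    return summary


def make_region_map(summary: pd.DataFrame):
    """Draw one circle per region: green if at least half the pumps work, else red."""
    import folium  # imported lazily so region_summary works without Folium installed

    fmap = folium.Map(location=[-6.4, 34.9], zoom_start=6)  # roughly central Tanzania
    for row in summary.itertuples():
        folium.CircleMarker(
            location=[row.lat, row.lon],
            radius=8,
            color="green" if row.functional_pct >= 50 else "red",
            fill=True,
            fill_opacity=0.7,
            tooltip=f"{row.region}: {row.functional_pct:.0f}% functional",
        ).add_to(fmap)
    return fmap


# Toy data (made up; coordinates are only approximate).
toy = pd.DataFrame({
    "region": ["Iringa"] * 2 + ["Lindi"] * 5,
    "latitude": [-7.8, -7.7, -10.0, -9.9, -10.1, -10.0, -10.0],
    "longitude": [35.7, 35.6, 39.7, 39.6, 39.8, 39.7, 0.0],
    "status_group": ["functional", "functional", "functional", "non-functional",
                     "non-functional", "non-functional", "functional"],
})
summary = region_summary(toy)
print(summary)
```

With Folium installed, `make_region_map(summary).save("map.html")` writes an interactive map you can open in a browser.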

Construction year

Another hypothesis I have is that newer pumps are more likely to be functional than older pumps. This hypothesis is easily confirmed using a bit of Seaborn magic.
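A check like this can be sketched by binning pumps per decade and computing the functional share in each bin. The helper names, column names and toy rows are assumptions, and so is the choice to treat a construction year of 0 as missing (by analogy with the zero longitudes in the data quality report).

```python
import pandas as pd


def functional_rate_by_decade(df: pd.DataFrame, label: str = "status_group",
                              positive: str = "functional") -> pd.DataFrame:
    """Functional rate (%) per construction decade, skipping unknown years."""
    known = df[df["construction_year"] > 0]  # assumption: 0 encodes a missing year
    decade = (known["construction_year"] // 10 * 10).rename("decade")
    rate = known[label].eq(positive).groupby(decade).mean() * 100
    return rate.rename("functional_pct").reset_index()


def plot_rates(rates: pd.DataFrame):
    """Bar chart of the functional rate per decade."""
    import seaborn as sns  # imported lazily so the table works without Seaborn

    return sns.barplot(data=rates, x="decade", y="functional_pct")


# Toy data (made up for illustration, not the competition data).
toy = pd.DataFrame({
    "construction_year": [1980, 1985, 0, 2005, 2008, 2009],
    "status_group": ["non-functional", "functional", "functional",
                     "functional", "functional", "non-functional"],
})
rates = functional_rate_by_decade(toy)
print(rates)
```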

Scheme management

One feature that we haven’t discussed yet is scheme management, which describes who operates the water point. The vast majority of the water points are operated by VWC, short for Village Water Committee. A small percentage of the water points are run by private operators. Yet, these pumps have a higher functionality rate than VWC-operated pumps.

I find out that VWC tends to work with gravity extraction, whereas submersibles are much more common for private operators. Could this be another clue?

Extraction type distribution for VWC and Private operators
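Comparing two operators comes down to filtering and then cross-tabulating two features against each other. This is a minimal sketch: `extraction_by_operator`, the column names, the "Private operator" label and the toy rows are assumptions.

```python
import pandas as pd


def extraction_by_operator(df: pd.DataFrame,
                           operators=("VWC", "Private operator")) -> pd.DataFrame:
    """Share (%) of each extraction class within the selected operators."""
    subset = df[df["scheme_management"].isin(operators)]
    table = pd.crosstab(subset["scheme_management"],
                        subset["extraction_type_class"],
                        normalize="index")  # each operator's row sums to 1
    return (table * 100).round(1)


# Toy data (made up for illustration, not the competition data).
toy = pd.DataFrame({
    "scheme_management": ["VWC"] * 4 + ["Private operator"] * 2 + ["Other"],
    "extraction_type_class": ["gravity", "gravity", "gravity", "submersible",
                              "submersible", "submersible", "gravity"],
})
shares = extraction_by_operator(toy)
print(shares)
```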

This is how I explore each of the variables, forming hypotheses and thinking about imputation strategies and which variables to keep as I go.

All code used in this article is available on GitHub.

Up next: dealing with missing data.
