Sharing 200,000 water points — how we did it
At mWater, we believe that sharing data can lead to action: more access to safe water, less waterborne disease, and more economic opportunities in developing countries. But the mechanics of actually sharing data are not simple and seldom talked about. I will probably tag this post with geeky and tedious just to warn casual visitors.
This is the story of how mWater added 200,000 new water points to our free monitoring platform in a single day.
The Water Point Data Exchange Standard
mWater is part of a consortium of water and tech organizations working toward a common standard for water point data sharing. A water NGO called GETF volunteered to take on the thankless task of creating a draft standard, known as the Water Point Data Exchange. It is a work in progress, but was good enough to get a large number of water organizations and governments to contribute.
The goal of a universal standard for water point data is motivated by the movement toward open data, breaking data out of old silos. In aid, data silos have taken two predominant forms: paper and firewalls. Digital (mobile) data collection and cloud-based data storage have brought down these historical barriers, making this the moment for rapid change toward sharing data across global databases.
Shared data means organizations, governments, communities, and local researchers can collaborate on longitudinal monitoring. It also means that water data can be mined across time and geography for the first time ever, opening up the possibility of discovering trends and patterns that will allow the world to get the jump on diarrheal disease.
There is a gap that must be bridged between a standard and an operational platform like mWater. Things that are left as free text entry in the WPDX standard, such as functionality and source type, have very strictly defined choices in our system. We built mWater this way because we want data to be comparable across different regions and organizations. You can learn more about the mWater-WaterAid water point definitions and attributes at mWater’s website.
Types of water points
We found that both the source and the water_tech fields in the WPDX dataset included descriptions of the water point type. mWater uses types that have been defined by the UNICEF/WHO Joint Monitoring Programme. Since these definitions are used in routine household surveys around the world, they have to be limited to what the surveyor can directly observe at the home. Therefore, if you find a tap in the home, you record it as piped into dwelling even if it might be fed by a borehole up the street. In other words, we define the type by the point of collection, not the ultimate source. This may have some shortcomings, but it has worked for over a decade now and will continue to be used by the UN in the post-2015 Sustainable Development Goals.
Many water datasets have more data than just these directly observable properties. For this reason, we worked with WaterAid to define a set of additional attributes that can be added to water point definitions in mWater. Some of these are applicable just to dug wells and boreholes, such as those shown below:
Another set of attributes, Supply and Treatment works, can be added to piped systems or delivered water. These are multiple choice questions, meaning that you can select all that may apply. This can be useful when trying to describe the chain of supply and treatment that takes place upstream of a tap in your home.
The WPDX data contains a mixture of what we would call subtypes (Water Point being the type) and attributes, scattered across two columns. Therefore, we simply merged those columns and did a pivot in Excel to find all the unique values of the combined data. This gave us 722 unique combinations of the WPDX water_source and water_tech fields. This was a manageable number, so we went to work adding all the mWater translations, which were then used as a lookup table to automatically generate the mWater fields from the WPDX data. You can view our work in this Google Doc.
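For readers who would rather script this than use Excel, here is a minimal Python sketch of the same idea: merge the two columns into one key, count the unique combinations, and translate each row through a lookup table. It is not the spreadsheet we actually used. The LOOKUP rows, the mWater type names in them, and the function names are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical lookup table: (water_source, water_tech) -> mWater fields.
# These two rows are made up for illustration; the real table had 722 entries.
LOOKUP = {
    ("borehole", "hand pump"): {"type": "Borehole or tubewell", "attributes": ["Hand pump"]},
    ("piped water", "tap"): {"type": "Piped into dwelling", "attributes": []},
}


def normalize(value):
    """Lowercase and collapse whitespace so near-identical cells share a key."""
    return " ".join((value or "").lower().split())


def combined_key(row):
    """Merge the two WPDX columns into a single lookup key."""
    return (normalize(row.get("water_source")), normalize(row.get("water_tech")))


def unique_combinations(rows):
    """Rough equivalent of the pivot: count each distinct (source, tech) pair."""
    return Counter(combined_key(row) for row in rows)


def translate(row, lookup=LOOKUP):
    """Return the mWater fields for a row, or None if the pair is not mapped yet."""
    return lookup.get(combined_key(row))
```

Rows that come back as None are the ones that still need a translation added by hand.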
We defined a very simple set of functionality definitions in mWater that we believe work well because of their simplicity. They are:
- Functional: the water point is in working condition and capable of delivering water to users on a regular basis. A water point may be functional even if it isn’t delivering water when you visit. Many wells and boreholes run out of water each day and rebound overnight. Likewise, piped water may only be delivered for a few hours each day, due to rationing. We recommend that organizations develop separate indicators for access (such as the round trip time to collect water) and reliability (frequency of outages or breakdowns), if they want to track how functional the water source is over time.
- Needs repair: the water point is capable of delivering water on a regular basis, possibly in a diminished capacity, but is in need of repair to restore full capacity.
- Non-functional: the water point is not in working condition. It may or may not be repairable, but cannot deliver water on a regular basis.
- No longer exists: the water point has been permanently shut down or can no longer be found.
mWater also provides Source Notes as a text entry field that can be added to any water point. This is where you can add additional information, such as what repair is needed or the reason it is non-functional.
The WPDX standard contains functionality data in two fields:
- #status_id: This is actually a yes/no/unknown field that relates to whether any water was available on the day of the visit. We don’t advocate using this as an indicator for reasons explained above in our definition of functional. However, we included it as a custom property in mWater in case others want to use it.
- #status: Most of the functionality data is actually in this field, but it is a free text entry field (yikes!). Free text fields are a nightmare for analysis, since they allow users to say essentially the same thing in many different ways. Here’s one example:
We don’t know what Functional (not in use) — BROKE DOWN means, but we had to deal with it somehow. In this case, we categorized these responses based on the beginning text, Functional (not in use), and noticed that all the comments after the ‘-’ seemed to imply some kind of repair is needed. Thus, we translated these fields as Needs repair.
In the WPDX dataset, there were 12,733 unique ways of describing the water point status! By analyzing the first bit of each answer, we were able to reduce this to a more manageable 54 variations, which were then mapped to the four mWater status types. The original #status field was placed in its entirety into the mWater Source Notes field, so it will appear in the longitudinal status history in the mobile app.
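The prefix approach can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual mapping. The PREFIX_TO_STATUS entries are hypothetical stand-ins for the 54 real variations, and splitting on a dash surrounded by spaces is just one plausible way to isolate "the first bit" of each answer.

```python
import re

# Hypothetical prefix table; the real mapping covered 54 variations.
PREFIX_TO_STATUS = {
    "functional": "Functional",
    "functional (not in use)": "Needs repair",
    "non-functional": "Non-functional",
    "abandoned": "No longer exists",
}

# A hyphen, en dash, or em dash with whitespace on both sides.
# The hyphen inside "Non-functional" has no spaces, so it is left alone.
SEPARATOR = re.compile(r"\s+[-\u2013\u2014]\s+")


def leading_phrase(status):
    """Return the normalized text before the first dash separator."""
    head = SEPARATOR.split(status.strip(), maxsplit=1)[0]
    return " ".join(head.lower().split())


def to_mwater(status):
    """Map a free-text status to an mWater status and keep the original text."""
    return {
        "status": PREFIX_TO_STATUS.get(leading_phrase(status)),  # None = review by hand
        "source_notes": status,  # the original text is preserved in full
    }
```

Anything that maps to None has a prefix that is not in the table yet, which is a useful worklist when building the table up.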
Once we had created all of the mWater columns, it was time to try importing the data. We used the mWater API for the upload (we always eat our own dogfood here). However, since the total file size was 35 MB, we decided to do the upload in automated batches using a script. The entire upload took 20 minutes.
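The batching itself is simple. Below is a generic Python sketch of the pattern rather than our upload script. The send_batch callable is a hypothetical placeholder for whatever function posts one batch to an API, and the default batch size of 500 is an arbitrary assumption.

```python
from itertools import islice


def batches(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def upload_all(records, send_batch, size=500):
    """Send records batch by batch and return how many were sent.

    `send_batch` is supplied by the caller (a hypothetical stand-in for an
    API call); it receives one list of records per invocation.
    """
    sent = 0
    for chunk in batches(records, size):
        send_batch(chunk)
        sent += len(chunk)
    return sent
```

Keeping the network call behind a callable means the batching logic can be tried out with a plain list before pointing it at a live server.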
The result is the mWater global database with its first ever import of a globally open, cross-shared dataset. View it at http://portal.mwater.co/#maps (free login required). The map is computed live for each viewer rather than created in advance as a static view, so what you see is not what everyone else sees: your private sites appear alongside public ones, and you can sort and view the data in the way that matters to you.
If you have data you would like to share with the mWater global water database, email info@mWater.co. To begin using mWater, go to portal.mwater.co for the management portal. For the mobile app, open app.mwater.co in any mobile internet browser (we prefer Chrome) or download it from the Android store.