Open Data Science To Be Proud Of: Geographic Healthcare Analysis with Healthsites.io and WorldPop

Published in Canvas · 12 min read · Feb 22, 2024

A woman giving a free eye test to shoppers at Maponya Mall in Soweto while raising awareness about #myschool #myvillage. Photo by Soweto Graphics on Unsplash

It’s quite rare that a data science portfolio project is particularly impactful — at least, that used to be the case for my portfolio. The problem was that the open, free-to-use datasets that I worked with didn’t tend to be very interesting. I’ve expanded my horizons now, and this article is an account of a piece of work I did with two amazing datasets that combine to have a real impact — not only improving my skills as a data scientist but also creating a methodology that can be used in analysing the healthcare coverage of a country. Included is my motivation for this analysis, my planning and overall execution, and my thoughts on the process, as well as some resources for you to do similar research. This isn’t a step-by-step tutorial — I want to convey to you the experimentation and trial and error I went through to get a result that I’m quite proud of.

I’m Neil Majithia, a researcher at the ODI, passionate about the use of data science for public benefit. Although my foundations are in computational economics, I’ve worked with simulation environments, decision making algorithms, and machine learning to deliver research across a variety of fields.

Open data isn’t a new concept in data science. People use open data all the time, especially when learning the trade — I and countless others have used the California Housing Dataset, the Iris Classification Dataset, and the MNIST gallery to learn and practise, building things from basic statistical analysis to Convolutional Neural Networks. But whenever I start a project with these typical open data sources, I have a pervasive feeling that I’m not doing anything particularly insightful or impactful. They’re good for learning, but when trying to build my portfolio, I get the impression that all potential insights from these datasets have been wrung dry.

Perhaps that’s my natural jaded outlook on life talking [editor’s note: the author is 21], but it’s a surprisingly motivating point of view. It drives me to look for novel datasets on topics that are relevant to me and ones that present the opportunity to perform genuinely insightful analysis. Often, you might hear that the interesting datasets are the ones closed off to the world, held by corporations that pay a lot of money for in-house data scientists. Think of the analysis you’d be able to do with the data that Google holds! But interesting data isn’t always kept closed, and in fact there is a world of open data sources which are unique, exciting, and have the potential to generate some impactful analysis.

I’ve been exploring that world while working at the ODI over the last few months and, after doing some research on open data for humanitarian aid, I stumbled upon healthsites.io.

healthsites.io is the result of the Global Healthsites Mapping Project, a global initiative with a network of OpenStreetMap users mapping every health facility in the world with the aim of establishing an accurate source dataset for humanitarian aid workers to use in the event of a natural disaster or disease outbreak. For any country, healthsites.io provides not only a list of every health facility within it, including pharmacies, clinics, and hospitals, but also information about these facilities, such as number of beds, emergency capability, and contact details. The initiative is a Digital Public Good and a shining example of open data that truly means something.

Given my previous work in the area of humanitarian engineering and data science, I fell in love with healthsites.io. Specifically, I fell in love with the idea of doing something with that data, but I didn’t know what. The datasets were points on a map, with differing amounts of detail on each health site, so anything statistical or machine-learning based would be heavily constrained. I racked my brain for something — anything — and remembered the entire field of study dedicated to points on a map: network science.

Although I was daydreaming during most of my network science lectures at university, and almost comatose during the tutorials afterward, I picked up a fairly good understanding of the principles of network analysis. For this project, studying Dijkstra’s algorithm or max flow/min cut problems could have been interesting (finding the shortest path for a blood biker to take between hospitals, for example), but ultimately I didn’t want to wrestle with time-variable route planning and traffic data. I looked at cluster analysis, but the insights I’d get from that seemed uninteresting (health sites would be clustered around major population centres, of course). But while flicking through notes I had only a vague recollection of writing, I found a methodology that offered a way forward: Voronoi tessellation.

I was first introduced to Voronoi tessellation by a visualisation I saw a long time ago, showing the closest American Football team to each place in the US. A similar, less colourful visualisation is below.

A Voronoi tessellation describing geographic coverage of 2020-21 Premier League football teams (Alasdair Rae via Stats, Maps n Pix, 2021)

This example makes the results of Voronoi tessellation easier to understand. It takes a set of points called “seeds”, in this case the locations of the 2020-21 Premier League football clubs, and divides the map into regions, one per seed, such that every place within a region is closer to that region’s seed than to any other (by Euclidean distance, i.e. as the crow flies). For example, all places in the northernmost region of the above diagram are closest to the Newcastle seed. I’d recommend having a look at its Wikipedia entry if you’re confused.
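The definition above can be sketched in a few lines of Python. This is only an illustration: the seed names and coordinates are made up, and the plane is treated as flat. Assigning each location to its nearest seed is exactly what decides which Voronoi cell that location falls in.

```python
from math import dist

# Hypothetical seeds on a flat plane (not real club or hospital locations).
seeds = {
    "A": (0.0, 0.0),
    "B": (4.0, 0.0),
    "C": (2.0, 3.0),
}


def nearest_seed(point, seeds):
    """Return the name of the seed closest to `point` by Euclidean distance."""
    return min(seeds, key=lambda name: dist(point, seeds[name]))


# Every location that maps to the same seed belongs to that seed's Voronoi cell.
for location in [(0.5, 0.5), (3.5, 0.2), (2.0, 2.5)]:
    print(location, "is in the cell of seed", nearest_seed(location, seeds))
```

A tessellation library does the same job geometrically, returning the cell boundaries instead of answering one location at a time.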

Asking “what’s closest to me” is actually quite a pertinent question in the humanitarian field, especially given the healthcare data that healthsites.io provides. I quickly settled on an objective: Using Voronoi tessellation with a specific country’s health sites to project onto a map the closest hospital for any person in that country. It’s an objective that could generate useful insights — if there’s a large disaster in one area, where’s the closest place for a large group of people to get medical treatment?

I chose to use South Africa as my country of study after asking my colleagues, some of whom had previous connections there that might be interested in what I had to offer. But unfortunately, when I explained to each colleague what I was planning to do, I felt my enthusiasm fade. I realised that the insights I would get were, at their best, not particularly useful once again. Most people know their closest hospital, and those that don’t can find it out in ways easier than consulting a Voronoi tessellation. I needed to find a better research question to ask. So, I turned to open data once again.

I found WorldPop on the UN Humanitarian Data Exchange. The WorldPop project’s mission is to collect high spatial resolution data on human population distributions using censuses and satellite imagery. A main output is gridded population estimates, which are datasets that split a country into 1km x 1km squares (or 100m x 100m) and provide the estimated population within each square. Such an approach provides a lot of information for things like healthcare planning and humanitarian response, where local populations need to be understood. It represents another stellar example of how open data can enable truly impactful analysis.

I could use WorldPop data to get accurate measures of the population in each Voronoi region I constructed to an incredible resolution, in turn making the regions far more interesting given they would now reflect the population burden on each health site. Therefore, combining these two unique datasets, healthsites.io and WorldPop, I would be able to provide insights on health facility coverage in South Africa.

The dataset I used from WorldPop as visualised on its page

Consolidating my objectives

To put it simply, my goal was to use Python with a few specific libraries to perform some geographic analysis so that I could understand population burdens on hospitals in South Africa. This would involve creating Voronoi regions from healthsites.io data and finding the population within each region using WorldPop data, a methodology that would use network science, planar geometry, and a lot of data engineering.

In terms of an output, a choropleth map with Voronoi regions coloured by population within them seemed like a worthy aim. I took the extra step of making it an interactive map, just so I could learn a new skill along the way.

Methods

To be honest, I had never done anything with spatial data or planar geometry before this project. I hadn’t really done any network science in a computational format either, and certainly not Voronoi tessellation. My choice of Python helped ease my worries given the vast resources out there for it, and I got to work with scipy’s Voronoi class (scipy.spatial.Voronoi).

The first steps were pretty straightforward: importing the healthsites.io shapefile for South Africa along with the country’s administrative border (which I got from the UN Humanitarian Data Exchange) and visualising both together (Figure 1), isolating the 104 hospitals (Figure 2), and running the Voronoi tessellation with scipy’s seemingly suitable Voronoi capabilities. Figure 3 shows that it certainly wasn’t suitable enough.

Figure 1: All health sites in South Africa

Figure 3: Initial Voronoi tessellation with unmodified scipy method

Not only is the plot completely uninformative (because the scipy plot couldn’t interact with another layer/trace to show the South African border), but the Voronoi cells it creates are also not all polygonal. That’s not easy to see at first: for the most part, the cells are closed shapes drawn between orange vertices, whose coordinates can be pulled out of the plot and used to build planar geometry. But the cells bordered by one or more dashed lines are not polygonal. It is inherent to Voronoi tessellation that the outermost regions extend to infinity, just as the Everton region does in the Premier League example above. Non-polygonal regions couldn’t be extracted from the diagram, meaning they wouldn’t be able to interact with the WorldPop data. I had to find a way to turn them polygonal.
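To make the problem concrete, here is a minimal sketch of how scipy.spatial.Voronoi reports unbounded cells. The five seed points are made up (they are not the hospital data), and the helper name `is_unbounded` is my own. In scipy's output, a vertex index of -1 inside a region marks a vertex that lies outside the diagram, i.e. at infinity.

```python
import numpy as np
from scipy.spatial import Voronoi

# Hypothetical seeds: one central point surrounded by four others.
points = np.array([
    [0.0, 0.0],
    [2.0, 0.0],
    [-2.0, 0.0],
    [0.0, 2.0],
    [0.0, -2.0],
])
vor = Voronoi(points)


def is_unbounded(vor, seed_index):
    """True if the seed's cell has a vertex at infinity (scipy marks it with -1)."""
    region = vor.regions[vor.point_region[seed_index]]
    return -1 in region


unbounded = [i for i in range(len(points)) if is_unbounded(vor, i)]
print("Seeds with unbounded cells:", unbounded)
```

Only the central seed gets a closed cell here. Any seed on the outer edge of the point set ends up with an open one, which is why the outermost hospitals cause trouble.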

I won’t waste your time with the details, but I eventually found a way to MacGyver the scipy Voronoi tool to get around this problem and give me a set of finite polygons that I could extract and put onto my own plot (Figure 4, where green regions are those that would have been infinite). For aesthetics, I decided to use the natural intersection of planar geometry and set theory to make the polygons consistent with the South African border (Figure 5).
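The article doesn't spell out the workaround, so the following is not the author's method. It is a sketch of one common trick, under stated assumptions: add four dummy seeds far outside the border so that every real cell becomes finite, then clip each cell to the border with Shapely. The function name `bounded_voronoi_cells`, the `pad` parameter, the square border, and the seed coordinates are all hypothetical, and the seeds are assumed to lie inside the border's bounding box.

```python
import numpy as np
from scipy.spatial import Voronoi
from shapely.geometry import MultiPoint, Point, box


def bounded_voronoi_cells(points, border, pad=10.0):
    """Return one finite polygon per seed, clipped to `border`.

    Assumption: four dummy seeds placed far outside the border make every
    real seed's cell finite without changing its shape inside the border.
    """
    points = np.asarray(points, dtype=float)
    minx, miny, maxx, maxy = border.bounds
    span = pad * max(maxx - minx, maxy - miny)
    dummies = np.array([
        [minx - span, miny - span],
        [minx - span, maxy + span],
        [maxx + span, miny - span],
        [maxx + span, maxy + span],
    ])
    vor = Voronoi(np.vstack([points, dummies]))

    cells = []
    for i in range(len(points)):
        region = vor.regions[vor.point_region[i]]
        # Voronoi cells are convex, so the convex hull of the cell's vertices
        # is the cell itself, whatever order the vertices are listed in.
        cell = MultiPoint([tuple(v) for v in vor.vertices[region]]).convex_hull
        cells.append(cell.intersection(border))
    return cells


# Hypothetical data: a square "country" and three seeds inside it.
border = box(0.0, 0.0, 4.0, 4.0)
seeds = [(1.0, 1.0), (3.0, 1.5), (2.0, 3.2)]
cells = bounded_voronoi_cells(seeds, border)
for seed, cell in zip(seeds, cells):
    print(seed, round(cell.area, 3), cell.contains(Point(seed)))
```

With a real, irregular border polygon in place of the square, the same `intersection` call is what would trim each cell to the coastline and national boundary.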

Figure 4: Green regions that would have been infinite are defined, polygonised, and plotted

Figure 5: All Voronoi cells constrained to the South African border

This might not look like much, but it was a month of on-off work. I certainly enjoyed it, and getting Figure 5 as an output made me feel quite fulfilled: fitting the cells to a defined, irregular border is not something I’d seen before with Voronoi tessellation, and I’m coining it as “geo-constrained Voronoi tessellation”. I saved the polygons plotted in Figure 5 into a .csv file, each tagged with an ID number matching its seed hospital, and moved on to using WorldPop data.

The WorldPop data was beautiful in its simplicity. A three-column csv, the dataset contained the latitude and longitude of the centre of each 1km square of the country’s grid, and the population estimated to live in that square. To put it simply, it is a collection of points, something that I could take advantage of with set theory again. I made a script to take each polygonal Voronoi region and run through all points in the dataset, checking whether each is contained within the region and, if so, adding its population to a running total. The result was a population total for each of the 104 regions.
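A rough sketch of that counting step, in plain Python. The region IDs, polygon coordinates, and grid rows below are invented for illustration, and the containment check is a textbook ray-casting test rather than whatever the original script used.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside `polygon`, a list of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def population_per_region(regions, grid_rows):
    """Sum the population of every grid-square centre that falls in each region."""
    totals = {region_id: 0.0 for region_id in regions}
    for lon, lat, population in grid_rows:
        for region_id, polygon in regions.items():
            if point_in_polygon(lon, lat, polygon):
                totals[region_id] += population
                break  # Voronoi cells do not overlap, so stop at the first hit
    return totals


# Hypothetical regions (two adjacent squares) and grid rows (lon, lat, population).
regions = {
    "hospital_1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "hospital_2": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
grid_rows = [
    (0.5, 0.5, 120.0),
    (1.5, 1.5, 80.0),
    (2.5, 0.5, 300.0),
    (3.5, 1.5, 50.0),
    (5.0, 5.0, 999.0),  # outside every region, so it is ignored
]
totals = population_per_region(regions, grid_rows)
print(totals)
```

Looping over every point for every region is slow at country scale. A spatial join such as `geopandas.sjoin` would be the more usual tool, but the logic is the same.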

Finally, I had to build a visualisation for this information. Geopandas’ plot method allowed me to build a static choropleth pretty easily, Figure 6, and already I was quite proud. Building an interactive version took a little configuring, but the end result was even better, hosted online here.
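For anyone attempting the interactive version, one plausible intermediate step is to package the regions and their populations as GeoJSON, a format that web-mapping libraries such as Folium can read. The sketch below uses only the standard library. The function name, property names, and sample data are hypothetical, and it is not taken from the original project.

```python
import json


def to_feature_collection(regions, populations):
    """Build a GeoJSON FeatureCollection with one polygon feature per region."""
    features = []
    for region_id, ring in regions.items():
        # GeoJSON polygon rings must be closed: first and last positions equal.
        closed = list(ring) if ring[0] == ring[-1] else list(ring) + [ring[0]]
        features.append({
            "type": "Feature",
            "properties": {
                "region_id": region_id,
                "population": populations[region_id],
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [[list(position) for position in closed]],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical regions as (longitude, latitude) rings, with made-up populations.
regions = {
    "hospital_1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "hospital_2": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
populations = {"hospital_1": 200.0, "hospital_2": 350.0}

feature_collection = to_feature_collection(regions, populations)
print(json.dumps(feature_collection)[:80], "...")
```

A dictionary like this can be handed to a map layer, with the `population` property driving the fill colour and the hover text.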

Figure 6: A choropleth map with Voronoi cells shaded to indicate the residential population within them

So what do these visualisations actually represent? The regions depicted are Voronoi cells whose seeds are hospitals, so for any person living in a particular cell, the closest hospital to them (in Euclidean space) is the seed of that same cell. The hospitals are represented as black points on the maps. The colour of each region represents the size of the population residing within it, providing a depiction of the relative burden on each hospital.

Exploring the interactive map, I found a few useful insights. The two darkest cells, one in the north-east and the other just south of Lesotho, each contain over 2.4 million residents, meaning the respective hospitals in those regions are significantly overburdened. The largest cells, representing hospitals that are responsible for large geographic regions, are not the most populous, meaning that despite their large geographic responsibility, these hospitals are less burdened than other health sites. This suggests well-planned health coverage in the country, and is especially evident on the western side.

You can explore the interactive map and find further insights. I didn’t make it incredibly interactive, so the healthsites.io data isn’t represented as well as it could be, but I think the map accomplishes my objectives. Hovering over a region gives its estimated population, and hovering over the hospital within it gives its healthsites.io ID (osm_id), its name, and the city from its address, if known.

There are obvious caveats to this kind of analysis. For example, using Euclidean distance as the basis of the Voronoi regions doesn’t take into account travel time or route conditions, both of which are imperative considerations when directing emergency response (an ambulance should go to a hospital further away as the crow flies if it’s faster to get to because the route is on a motorway rather than rural roads). I also treated all hospitals as homogeneous, whereas there may be preferences over both quality and cost that would make a person go to a hospital further away rather than one close by. Ultimately, the visualisation I’ve made isn’t useful on its own in its current state, but if it were used as part of a larger analysis of healthcare in South Africa, it could provide a unique insight that is both interactive and informative.

Conclusion

Over this project, I built on my skills in Python by learning how to work with spatial data via GeoPandas, Voronoi tessellation via SciPy, planar geometry via Shapely, and interactive mapping with Folium. I adapted what I learnt to fit my needs, and the outcome was a visualisation that provides insights in the real world. That is something I have been itching for whenever I’ve done data science, and it would never have been possible without open data such as healthsites.io and WorldPop.

Open data is foundational to learning data science, but I hope this article proves to you that it isn’t all as drab as housing data from the 1990s like the California dataset. There’s data out there that can inspire you, teach you, and, most importantly, make an impact — it’s up to you to find that data and use it.

Has this article lit a fire in you? If so, here are a few things you could do with these data sources:

Do this same methodology with another country and look for similarities and differences in healthcare infrastructure in comparison to South Africa. Do identifiable trends persist between countries? Are similar geographic features covered differently?
Compare the centroids of Voronoi cells to the location of their seed health sites and population centres. What does this say about healthcare coverage per cell?
Try some cluster analysis of the locations of health sites, comparing cluster coefficients with population densities to rate healthcare coverage. Are clusters of health sites in the right places? Why/why not?
Utilise the other aspects of the health sites dataset to determine the number of beds available across the population of a country. Would the country be prepared for a large-scale emergency like a pandemic? Would it be more or less prepared than other countries?

Once I clean up my GitHub repo for this project, I’ll update this article with a link, although I’d recommend ignoring it and forging your own path in your project. The trial and error of this work was the thing that kept me going, prodding my learning towards new pathways and making me rethink things I thought I already knew. I hope you can use the healthsites.io and WorldPop datasets to the same effect.

Written by Neil Majithia