HDX Universe: The shape of the Humanitarian Data Exchange

Simon B Johnson
Jun 18, 2018

HDX Universe: https://brcmapsteam.github.io/hdx_universe/

HDX continues to grow; there are nearly 6000 data sets, of which over 1000 have HXL tags, contributed by over 200 organisations. But what is the shape and texture of the data beyond these headline figures?

What are the themes the data is clustered round?

Which datasets are people using the most?

How widespread is the use of HXL?

These are useful questions to answer to help us improve the HDX service. To investigate them, we built HDX Universe. HDX Universe is an alternative way of browsing HDX via a data visualisation. Rather than finding datasets through the search function, this page visualises all of the data sets on HDX at once, with each data set represented by a circle. That’s a lot of information! And it needs to be arranged in a way that is useful and can provide insights.

Overview of data on HDX

To shape the data and make sense of the relationships between data sets, we used a network map. Similar datasets, such as asylum seekers in Nigeria and asylum seekers in Algeria, have a link and are pulled closer together. Data sets which are less related, such as asylum seekers in Nigeria and a 3W in Nepal, have no link and sit further apart. The size of each circle represents how often the dataset is downloaded.
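As a rough illustration of how such links could be derived, here is a minimal sketch in Python. It assumes similarity is measured by how many tags two datasets share (a Jaccard score) and that a link is drawn above a chosen threshold. The similarity measure, the threshold, the field names and the sample records are all assumptions made for illustration, not necessarily how HDX Universe works.

```python
from itertools import combinations


def jaccard(tags_a, tags_b):
    """Share of tags two datasets have in common, from 0.0 to 1.0."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_links(datasets, threshold=0.5):
    """Compare every dataset with every other and link the similar pairs."""
    links = []
    for left, right in combinations(datasets, 2):
        score = jaccard(left["tags"], right["tags"])
        if score >= threshold:  # assumed cut-off for drawing a link
            links.append((left["name"], right["name"], score))
    return links


# Hypothetical sample records; "downloads" would drive the circle size.
datasets = [
    {"name": "asylum-seekers-nigeria",
     "tags": ["asylum seekers", "refugees"], "downloads": 120},
    {"name": "asylum-seekers-algeria",
     "tags": ["asylum seekers", "refugees"], "downloads": 80},
    {"name": "3w-nepal",
     "tags": ["3w", "who is doing what where"], "downloads": 400},
]

links = build_links(datasets)
```

With these sample records the two asylum seeker datasets are linked, while the Nepal 3W shares no tags with either and stays unlinked. A force-directed layout would then pull linked circles together.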

The end result, when every data set is compared to every other data set, is an overview of all the data sitting on HDX, with constellations forming around themes of data. We see clusters of similar information around asylum seekers, risk, population statistics and more. We can see which datasets are used the most, where they sit in the context of other datasets, and which clusters are providing the most value.

HXL datasets do not have much variation and are clustered together

Datasets with HXL hashtags are coloured red. When we see the headline figure of over 1000 datasets with tags it sounds like usage is quite widespread. In reality, we see that a large number of these are clustered in similar datasets due to them being generated from APIs. It shows that there is still a lot of work to do to increase the penetration and prevalence of HXL and the visualisation helps target the next steps.

High downloads across a range of more varied data

Here we can see a nebula of widely used, disparate data which is only peppered with HXL. It makes a great case for which dataset owners we should target next to get the best value for effort in proliferating HXL.

While the visualisation is useful, there are steps needed to improve the picture. The downloads are total downloads since the upload date. This means that older data sets are biased towards appearing larger, simply because they have been on the site for longer. Another improvement to make is around tagging. The tags at the moment are free text entered by the data uploader, which introduces bias into the clustering. HDX plan to review and assess the tagging as part of this year’s work. Despite these flaws, the exercise has proved very fruitful in answering some of the original questions posed.
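One possible way to reduce the download bias described above is to size circles by downloads per day online rather than by total downloads. The sketch below is only an illustration of that idea. The function name, the dates and the figures are hypothetical, and this is not a change that has been made to HDX Universe.

```python
from datetime import date


def downloads_per_day(total_downloads, uploaded_on, as_of):
    """Average daily downloads since upload, so older data is not favoured."""
    # Treat same-day uploads as one day online to avoid dividing by zero.
    days_online = max((as_of - uploaded_on).days, 1)
    return total_downloads / days_online


# Hypothetical figures: an older dataset with more total downloads,
# and a newer one with fewer total downloads but a higher daily rate.
as_of = date(2018, 6, 18)
older = downloads_per_day(1000, date(2016, 6, 18), as_of)
newer = downloads_per_day(300, date(2018, 3, 10), as_of)
```

In this made-up case the newer dataset would be drawn larger than the older one, even though its total is lower.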

HDX Universe: https://brcmapsteam.github.io/hdx_universe/