Need a map? Want a dashboard? Here’s what to do first.

Colleagues often come to the GIS & Information Management team at the British Red Cross to get help with making a map or a dashboard.

Before we can jump into making anything at all, we need to look at the data. And more often than not, we need to clean and reformat it before we can go any further.

To make data analysis and visualisation possible, data should:
1) be comparable, 2) be measurable, 3) make sense, 4) be constrained at entry, 5) have a structure that is locked against editing, and 6) be owned by someone.

This article outlines these data preparation principles, illustrated by recent tasks I’ve worked on. These examples come from the humanitarian and development sector, but the principles are applicable more widely.

Principle 1 — ensure your data is comparable.

A lack of consistency in data collection can raise more questions than you originally asked.

I built a map of locations in the UK where our Refugee Services directorate can offer support to people fleeing the crisis in Ukraine.

Regional managers were asked to compile the locations, but some listed specific towns or cities, while others named wider areas such as counties or regions of the UK.

For a comparable, legible final product, I could map only points (for specific locations) or only polygons (for wider areas), so how could I capture them all?

(To illustrate the difference, below is an example where points and polygons can be shown on the same map, because they represent different datasets.)

In this example, weather stations are point data, and countries are polygon data. Source — various, via Paul Knight, British Red Cross.

In our example, only one type of data could be used, so points were chosen: most entries were specific locations, and points made the most sense for the context.
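For illustration, here is a minimal Python sketch of one way to harmonise mixed geometries into points, using the geopandas and shapely libraries. The place names and coordinates are invented for the example; the real dataset is not shown.

```python
# A sketch of harmonising mixed geometry types into points.
# Place names and coordinates are invented for illustration.
import geopandas as gpd
from shapely.geometry import Point, Polygon

entries = gpd.GeoDataFrame(
    {
        "service_location": ["Town A", "Wider Region B"],
        "geometry": [
            Point(-1.5, 53.8),  # a specific town
            Polygon([(-3.0, 54.0), (-2.0, 54.0), (-2.0, 55.0), (-3.0, 55.0)]),  # a wider area
        ],
    },
    crs="EPSG:4326",
)

# Keep existing points as they are; reduce any polygon to its centroid
# so every entry can be drawn as a comparable point.
entries["geometry"] = entries.geometry.apply(
    lambda geom: geom if geom.geom_type == "Point" else geom.centroid
)

print(entries)  # every row is now point data
```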

Lesson — ensure your data can be compared from one entry to the next to make analysis or visualisation possible.

Principle 2 — ensure your data is measurable.

A lack of measurement can lead to erroneous, unreliable data, making analysis impossible.

I recently supported a flood response in Central Asia with a WASH (water, sanitation and hygiene) assessment survey.

The pilot survey included questions about access to water, including how many buckets, jerry cans or water coolers of water the household could collect daily before and after the floods.

People queuing to fill water containers in Côte d’Ivoire. Source — Oxfam International via Flickr Commons

Without knowing the volume of the various containers, or whether different sizes had been accounted for, it was unclear how these numeric answers had been produced.

The data varied wildly across locations and enumerators, so it had to be disregarded and the questions rewritten with quantified measurements.
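As a quick illustration of why quantified units matter, here is a minimal Python sketch converting container counts into comparable litres. The container volumes are assumptions for the example, not figures from the survey.

```python
# Convert per-container counts into comparable litres.
# These container volumes are illustrative assumptions, not survey figures.
CONTAINER_LITRES = {
    "bucket": 15,
    "jerry_can": 20,
    "water_cooler": 19,
}

def daily_litres(counts):
    """Total daily water collected, in litres, from per-container counts."""
    return sum(CONTAINER_LITRES[container] * n for container, n in counts.items())

# Two households reporting in different containers become directly comparable:
print(daily_litres({"bucket": 4}))     # 60 litres
print(daily_litres({"jerry_can": 3}))  # 60 litres
```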

Lesson — always define and quantify your data units, whether that is a geographical unit, size, distance, date or currency. This prevents your data from being uninformative after collection.

Principle 3 — ensure your question makes sense.

Illogical questions lead to unusable data.

In the example above, another survey question asked people if they had access to a household tap, and if so, how many ‘taps’ of water they could collect daily.

What does the second question actually ask? Is it about the number of taps? The capacity of a water tank (if not piped)? How many of the containers above could be filled? Or the output flow rate?

Source — StorySet, via Freepik

Unfortunately, again the data had to be disregarded and the question clarified.

Lesson — if you cannot answer a question yourself, it needs rewording. This issue can also arise from translation errors, so always translate, then ask someone else to back-translate into the original language to check accuracy.

Principle 4 — constrain data entry.

Free-text entries lead to typos at best and erroneous data at worst.

I used Kobo Toolbox to digitise an assessment survey for a cash programme being set up in West Africa.

The draft put together by the programme team featured a list of states, followed by a free-text field for the specific location where the survey was being conducted.

This would have left room for errors for a variety of reasons:

  • location names may be similar
  • or appear in more than one place across the country
  • entries may contain spelling mistakes
  • or be written in different languages
  • the wrong state may be selected
  • or something entirely different may be written in the free-text field.

To address this, I designed a cascade select list, in which the state (admin1), then the local government area (admin2), then the specific location are selected in descending order, with each list filtered by the previous selection. This ensures that only the smaller areas present within the selected larger area can be chosen.

A cascade select list in Kobo Toolbox, of provinces and districts in Afghanistan. Source — author’s own, using public data via HDX.
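To show the logic underneath (Kobo Toolbox handles this for you once the cascade is defined), here is a minimal Python sketch with hypothetical place names:

```python
# Cascade-select logic with hypothetical place names: each level's
# options are filtered by the previous selection, so impossible
# state/LGA/location combinations can never be entered.
LOCATIONS = {
    "State A": {
        "LGA A1": ["Town A1-i", "Town A1-ii"],
        "LGA A2": ["Town A2-i"],
    },
    "State B": {
        "LGA B1": ["Town B1-i", "Town B1-ii"],
    },
}

def lga_options(state):
    """Offer only the LGAs that sit within the chosen state."""
    return list(LOCATIONS[state])

def location_options(state, lga):
    """Offer only the locations that sit within the chosen LGA."""
    return LOCATIONS[state][lga]

print(lga_options("State A"))                 # ['LGA A1', 'LGA A2']
print(location_options("State A", "LGA A2"))  # ['Town A2-i']
```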

Lesson — constraining your data entries to only those that are possible keeps your data clean and usable.

Principle 5 — restrict structural editing rights.

When multiple people can edit the data structure, divergent and duplicated data follows, meaning analysis and visualisation are not immediately possible.

I am currently making a dashboard detailing engagements undertaken by a participatory network of refugees and asylum seekers, to help track impact and support better co-production.

Data has been gathered in an Excel spreadsheet created in 2020, which contains a table with data validation constraining cell content to a particular format or drop-down list. These lists have been added to and structurally edited by various people ever since.

One problem is that when new options were added to drop-down lists, existing options were not necessarily accounted for. For example, ‘racism’ was added as an option for the theme of the engagement, but ‘discrimination’ already existed, so the two could overlap.

In this case, many list options had to be amalgamated or recategorised before the data could be visualised in a dashboard. With its high volume of users, Excel is also reaching its limits as a database, so I am now designing a more suitable system in SharePoint Lists, where the ability to amend the data structure can be better restricted and the filters, search and overall appearance are more user-friendly.

A comparison of how two data types looked originally (when managed in Excel, left) and afterwards (when moved to SharePoint Lists, right). The ability for users to intentionally edit the data structure, or accidentally overwrite the data validation criteria, is now better managed. SharePoint Lists also has a ‘select multiple’ function, enabling better classification of engagement themes. Source — British Red Cross
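The amalgamation step itself can be scripted. Below is a minimal Python sketch using the pandas library; the themes and the mapping are hypothetical examples (the real mapping would be agreed with the team, per the lesson below).

```python
# Amalgamating overlapping drop-down options with pandas.
# The themes and the mapping below are hypothetical examples.
import pandas as pd

engagements = pd.DataFrame(
    {"theme": ["racism", "discrimination", "housing", " Discrimination"]}
)

# Agreed mapping from legacy options to the consolidated category list.
RECATEGORISE = {
    "racism": "discrimination",
    "discrimination": "discrimination",
    "housing": "housing",
}

# Normalise stray whitespace and capitalisation, then apply the mapping.
engagements["theme"] = (
    engagements["theme"].str.strip().str.lower().map(RECATEGORISE)
)

print(engagements["theme"].value_counts())
# discrimination    3
# housing           1
```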

Lesson — the solution to this problem is not only technical. Which data categories are appropriate depends on your context and needs, and the more granular the detail, the harder analysis can become. It is important for any team to take the time to reach consensus and common understanding on data types, and to back this up with clear, accessible documentation shared amongst your user group.

From the technical perspective, best practice is to agree and then lock down the data structure to prevent ongoing edits by multiple users, which leads me to my next principle…

Principle 6 — make sure someone owns the data.

If no-one has oversight and responsibility for your dataset, errors and inconsistencies could go unnoticed, compromising data quality.

In all the examples above, a nominated data owner overseeing the entire dataset would have been able to compare and contrast entries, identifying these issues holistically.

Ideally, they should have the authority to make corrective decisions to maintain consistency (even if only interim decisions, pending further discussion).

Being a data owner is a lot of responsibility, requiring you to manage many factors and stakeholders. But remember: if everyone is in charge of data quality, ultimately no-one is. Source — ITMastersMag

Lesson — a data owner is essential for ongoing data consistency, accuracy and quality, with the ultimate aim of making the data usable for analysis and visualisation. Good communication and regular updates from the data owner lead to fewer data entry errors and greater trust in data quality.

Conclusion

Cleaning the data you already have is only part of the job: identifying the cause of each problem and deciding the best way forward also takes time.

Ideally, you should address these considerations before collecting your data, especially for large, complex pieces of work, where there is greater scope for problems.

It is estimated that every hour spent preparing data collection saves four hours of data cleaning (source — IFRC), so this is definitely time well spent.

If you want to collect data from scratch, there are further tips and examples in a blog I wrote last year about ‘Asking the right questions.’

And if you want to make a simple PowerPoint map or Excel dashboard yourself, my colleagues have written some excellent guidance to help you — check out https://medium.com/digital-and-innovation-at-british-red-cross/tagged/maps
