Data Poverty amid Abundance: Challenges for the Global South

Good Data Initiative
Nov 21, 2020

Data, a resource some consider the “new oil”, has played a central role in our daily lives for millennia. Ancient states used censuses to inform tax levying; colonial banks used local land registries to inform loan issuance. While the scale of human activities has changed over time, leading to more data, arguably the most drastic change has been in the ways we capture and store information about our activities and the environment.

Recent technological advances — especially widespread digitalisation — have radically reduced the cost of collecting and storing the large volumes of data that result from our economic and social activities. These increasingly include aspects of our interactions and experiences that used to be conceived as entirely qualitative.

Simultaneously, progress in analytic techniques has allowed for more advanced processing to extract greater value from the available data. Artificial intelligence and machine learning in particular have pushed the use of data across sectors, with prediction algorithms deployed to develop self-driving cars, identify promising new drugs, personalise healthcare, deliver targeted advertising, and improve the efficiency of operations.

Amid this data explosion, some sectors and locations still face a lack of data. The drivers of the data explosion are themselves not equitably distributed: data that feed into economic impact calculations, for instance, are highly favoured, whereas attempts to monetise all impacts (including those of climate change, environmental degradation and health) raise questions of validity and ethics. Compared to the global North, the global South has weaker infrastructure for routine data collection and lacks enough skilled analysts to exploit these new kinds of data.

The resulting challenges become more obvious when attempting to quantify non-economic impacts in African contexts, as I do in my work. As a public health physician from Cameroon and current researcher in Public Health Modelling at the University of Cambridge, I am acutely aware of these data challenges and the impact they have on shaping better public health policies across Africa and the world.

Data Poverty in Estimating Health Impacts

One crucial area where data poverty makes a difference is in estimating the health impact of transport, both in terms of possible causes and solutions. Rather than looking at the impact of urban transport policies from a purely economic perspective, in my research I approach it from a perspective of health and well-being. I find this angle more fascinating because it does not monetise human lives, as is done for goods and services. That said, taking this non-monetary perspective can cause data compatibility issues, since most existing data in this area are suited mainly to economic impact assessment. Our research to quantify the health impacts of urban transport policies in African contexts has regularly grappled with issues of data availability, despite current advancements in the data world.

The research I work on involves mapping transport policies to population travel behaviour, i.e., modes of transport, time spent travelling, and reasons for travelling. This data allows for the estimation of health gains through physical activity from using active modes of transport, such as walking and cycling, while discounting harms resulting from exposure to factors including road traffic injuries and air pollution.
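As a rough illustration of the first step in that chain, the sketch below aggregates trip records of the kind a household travel survey might yield into active-travel minutes per person and the share of travel time by mode. The records, field layout and the `ACTIVE_MODES` set are hypothetical placeholders, not data or code from the research described here. Converting those minutes into health gains, and subtracting harms from injuries and air pollution, would need epidemiological evidence that is deliberately left out.

```python
from collections import defaultdict

# Hypothetical trip records: (person_id, mode, minutes, purpose).
TRIPS = [
    ("p1", "walk", 20, "work"),
    ("p1", "bus", 35, "work"),
    ("p2", "bicycle", 15, "school"),
    ("p2", "walk", 10, "shopping"),
    ("p3", "car", 40, "work"),
]

# Assumption: only walking and cycling count as active travel.
ACTIVE_MODES = {"walk", "bicycle"}


def active_minutes_per_person(trips):
    """Sum the minutes each person spends in active modes."""
    totals = defaultdict(int)
    for person_id, mode, minutes, _purpose in trips:
        # Add zero for non-active trips so every traveller appears in the result.
        totals[person_id] += minutes if mode in ACTIVE_MODES else 0
    return dict(totals)


def mode_share(trips):
    """Share of total travel time spent in each mode."""
    by_mode = defaultdict(int)
    for _person_id, mode, minutes, _purpose in trips:
        by_mode[mode] += minutes
    total = sum(by_mode.values())
    return {mode: minutes / total for mode, minutes in by_mode.items()}


if __name__ == "__main__":
    print(active_minutes_per_person(TRIPS))
    print(mode_share(TRIPS))
```

Even this toy version shows why the survey has to record mode, duration and purpose together: without per-trip durations, neither quantity can be computed.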

Flowchart: Linking transport policy to health through different pathways (courtesy of Lambed Tatah)

Following these different pathways linking transport policy to health outcomes, it is easy to appreciate the disparate nature of the data required for the calculations my team and I do in our research. In addition to identifying possible datasets in geographic areas of data poverty, another obstacle we often face is that pooling together all the valid available data requires expert knowledge in each pathway. To give a sense of the complexity involved, here is (briefly) what my team and I look for when examining datasets on travel behaviour, air pollution and road traffic injuries:

Travel behaviour: ‘Travel behaviour’ data essentially captures people’s daily travel in their living environments. This includes the transport modes used (e.g., car, motorcycle, bicycle, walking, bus, train), time spent travelling, and the reasons for travelling. This information is typically obtained from household travel surveys and, in some cases, censuses and time use surveys, which means that it is affected by the storage, maintenance, and accessibility issues common to most survey data.

While many cities in high-income countries have repeated travel survey data, my team and I have only identified a few sporadic travel surveys in African cities (with the exception of South Africa). Some of the largest Sub-Saharan cities with over four million residents, including Lagos (est. pop. 21M), Kinshasa (est. pop. 11.8M), and Dar es Salaam (est. pop. 5.5M), simply do not have this type of data available. Where these data do exist, obtaining them requires that we tap into our personal and professional networks. Other workarounds, like crunching Google Street View data as a proxy for estimating travel behaviour, are still in their early stages. Over the coming years, we hope to develop and extend approaches like this to narrow the travel behaviour data gap in low-income cities.

Air pollution: Data of interest for understanding the health impact of air pollution includes the background air pollution (PM2.5, PM10, etc.) of areas where people live and travel, as well as how much each transport mode contributes to that pollution (known as ‘source apportionment’). My team and I have struggled to obtain these data from local monitoring stations or from larger repositories, such as the Emission Database for Global Atmospheric Research (EDGAR), since monitoring stations are expensive to operate and their cost has limited their ubiquity in poorer settings. To work around this, we are exploring how we could use measurements of air pollution from high-resolution satellites that could cover most locations. Even if researchers can access that data, however, whoever owns these satellites will ultimately determine their coverage.

Road injury: To better understand road traffic injuries, my team and I look for data on the demographics of persons and the types of vehicles involved in crashes, as well as general vehicle counts on road networks in cities. This data is typically collected by local police and/or the hospitals and mortuaries involved in each incident. These data are not easily available, and both sources are often incomplete; merging the two datasets is also often impractical because they lack reliable shared identifiers, so records can end up incorrectly linked. To work around this, my team and I are exploring whether crowdsourcing information on injuries and triangulating data from different sources could provide another, potentially more accurate and timelier, way of gathering road injury data.
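To make the linkage problem concrete, here is a minimal sketch of matching police and hospital records when no shared case identifier exists. Everything in it is an assumption made for illustration: the records are invented, and the key of crash date, sex and age is just one plausible choice, not the method used in our work.

```python
from collections import defaultdict

# Hypothetical records; the two sources share no common case identifier.
POLICE = [
    {"id": "P1", "date": "2020-01-05", "sex": "M", "age": 34},
    {"id": "P2", "date": "2020-01-05", "sex": "F", "age": 27},
    {"id": "P3", "date": "2020-01-09", "sex": "M", "age": 51},
]
HOSPITAL = [
    {"id": "H1", "date": "2020-01-05", "sex": "M", "age": 34},
    {"id": "H2", "date": "2020-01-05", "sex": "M", "age": 34},
    {"id": "H3", "date": "2020-01-05", "sex": "F", "age": 27},
]


def link_key(record):
    """Assumed linkage key: crash date, sex and age."""
    return (record["date"], record["sex"], record["age"])


def link(police, hospital):
    """Classify each police record by how many hospital records share its key."""
    candidates = defaultdict(list)
    for rec in hospital:
        candidates[link_key(rec)].append(rec["id"])

    result = {"linked": {}, "ambiguous": {}, "unmatched": []}
    for rec in police:
        matches = candidates.get(link_key(rec), [])
        if len(matches) == 1:
            result["linked"][rec["id"]] = matches[0]
        elif matches:
            # Several hospital records fit: picking one risks a false link.
            result["ambiguous"][rec["id"]] = matches
        else:
            result["unmatched"].append(rec["id"])
    return result


if __name__ == "__main__":
    print(link(POLICE, HOSPITAL))
```

With these invented records, only P2 links cleanly (to H3); P1 matches two hospital records and P3 matches none. A fuller approach would also check for several police records sharing one key and would tolerate small errors in recorded age or date.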

Causes of Data Poverty

There are many causes of data poverty, as the three examples from my work above illustrate. It can simply be that data about events or activities is not captured; that data is captured but users are not aware of its existence; that data is captured but not available to users; or that data is available in formats that are not readily useful. In low-income settings, the non-capture of data, especially in the formats users require, is an especially important cause of data poverty. However, data poverty can also be reduced by focusing on data that has already been captured or is being captured passively. Such data can be made available by improving data policies and strengthening data analytic capacity, as I explain in the following four examples of practices leading to data poverty:

  1. Data localisation: Data localisation laws require that individual data be stored within certain borders. In some cases, restrictions are limited to data from specific sectors deemed particularly sensitive to personal or national security (such as health or finance). While data localisation may increase sovereignty and strengthen cyber security, it can also have the unfortunate downside of aggravating data poverty. Data localisation can help governments to protect their citizens’ right to privacy through more stringent data laws and regulation. Local data storage may also make it easier for national governments to physically protect these infrastructures, especially core infrastructures, such as financial services, health provision, or energy distribution, that are all relevant for national security. However, data localisation also encourages data autarky, which discourages external use of data. In addition to these barriers, collecting and storing data to meet local storage requirements and restrictions is inherently costly, and there are further questions of compliance, namely whether providers agree to these rules.
  2. Data hoarding: One characteristic of data is that it is non-rival, meaning that the same data can be used by many without depleting it. Data benefits society the most when it is widely shared, since more users can then use it to increase efficiencies and innovate. However, because data is also a key input in the modern production function, which firms combine with factors such as labour, capital, land, and oil to produce their wide range of goods and services, data providers also have strong economic incentives to hoard data (for another example, see GDI analyst Emma Clarke’s analysis of oceanic data). Large corporations that can set up robust data infrastructures can benefit from economies of scale to outsmart competitors. In the current dispensation where data markets are opaque, companies easily collect vast quantities of both relevant and irrelevant data while paying little attention to privacy; this is made worse by the frequent lack of clarity about rights and obligations around data. Data hoarding for economic advantage not only stifles competition but also reduces possible social benefits, like those being worked on through my team’s research, that could flow from wider access.
  3. Lack of analytical techniques: The rapid fall in the cost of digital sensors (including cameras, microphones, global positioning systems, and accelerometers) and storage technology, together with the proliferation of digital activities, has dramatically increased the volume of data available. To make sense of these data, advanced analytic techniques are required. A lack of data scientists and adequate working equipment in many locations and sectors further reinforces pre-existing data poverty. For example, in my own research, good social network analysis skills could help with crowdsourcing and analysing injury data from social media and news feeds.
  4. Cost barriers: Data is partially excludable: its collection, processing, and storage on interconnected systems require continuous investment to prevent loss through accidental damage or cyber-attacks. It can be argued that it is sensible to at least charge users fees for sustainable management of the data. However, transferring costs to data users may deter many from engaging further in the data economy, especially those in low-income settings. We were not charged for accessing most of the surveys, but it is important to acknowledge this hindrance, especially when data is held by private corporations that impose access costs. In fact, accessing Google Street View data, which we now see as the next approach for estimating travel behaviour, is not free of charge.

3 Ways to Improve our Path Forward

To close the widening gap separating regions of data poverty from those experiencing data abundance, a few key actions have the potential to make a serious impact. My research estimating the public health impacts of transport is just one forum where such changes can make a difference, but the resulting improvements have the potential to radically improve lives everywhere, not just in the global South. Here are three immediate routes that should be considered going forward:

  1. Change data localisation policies: Data localisation not only obscures data from potential users but also risks causing international data fragmentation, precluding important potential gains from cross-border activities. Because data is non-rival and can be transferred anywhere at virtually zero cost, it is the ultimate mobile factor of production that can spur sustained growth and innovation. From my own experience, it is clear that policies should be directed towards making data increasingly global. More immediate local actions can involve the digitalisation of sporadic and routine survey data held by institutions, and connecting these to national databases to increase the availability of data to other potential users.
  2. Increase portability and interoperability of data: Data hoarding can be reduced through radical policies requiring the portability of individual user data. This corresponds to granting users the right to access and transfer the personal data that collectors and processors are holding about them. Implementing portability imposes costs on data processors, who must build an interface from which users can access their data. A complementary policy measure to portability requirements is to require interoperability of data across platforms through common standards.
  3. Increase analytic capacity where needed: There has been a recent boom in the data science industry, resulting in the training of many data scientists. These efforts, however, have historically been concentrated in the West, ignoring the needs and opportunities of those in the global South. As a result, the global South has very few skilled people trained and available to crunch data relevant to their areas, widening data gaps even further. To address this, policy makers locally and worldwide need to focus their attention on fostering data science education and skills training where it is most needed and will make a significant impact. In this regard, efforts such as Data Science Africa and Quantum Leap Africa are to be encouraged.

Key takeaways (TL;DR):

  • Regional and sectoral data poverty is a critical issue that needs to be addressed in the context of the recent data boom, especially since the latter can obscure the former.
  • Advancements in exploiting rapidly increasing volumes of data must consider the inequities that are also generated in other sectors and settings.
  • Policies to delocalise data, increase user data portability, and increase equity in data analytics are urgently needed, especially in the global South.

About the Author: Lambed Tatah, MD, MSc, MPH

Lambed Tatah is a Public Health Physician-Scientist undertaking a PhD in Public Health Modelling at the University of Cambridge. Lambed’s PhD focuses on quantifying the health impacts of active transport resulting from different transport policies in LMICs. He is looking at how to curate data for health impact modelling, generate robust estimates for transport health impacts, and make results relevant to policymakers in these settings. Lambed has previously published peer-reviewed articles on topics involving globalisation and health, and is a current research analyst at GDI.

Good Data Initiative

Think tank led by students from the Univ. of Cambridge. Building the leading platform for intergenerational and interdisciplinary debate on the #dataeconomy