Harnessing Big Data To Solve Big Problems
CEGA Doubles Down on Data Science for Development
This post was written by Program Manager Samuel Fishman, who manages CEGA’s Data Science for Development (DS4D) and Technology portfolios.
Data is being generated in real time at mind-boggling scales, and is becoming easier to analyze in an integrated way. Satellites ply the skies, mapping our planet’s physical environment in real time; billions of mobile phone records are generated every day; countless data points are recorded through crowdsourcing apps, private and public sector data collection, online intermediaries, and other nodes in global information chains. Increasingly, much of this data is being made open access, discoverable, and “analysis ready” through rich text searches, training data repositories, labeling, geo-referencing, and the proliferation of data catalogues.
Reading this, you might be imagining a billion-dollar lab in Palo Alto preparing the world’s most advanced marketing campaign. Arguably, however, some of the greatest value generated by advances in big data and data science will come in the form of welfare benefits for people living in poverty. In low- and middle-income countries (LMICs), where traditional data (like household survey data) is often expensive, difficult, or even dangerous to collect, the use of data from non-traditional sources (including satellites, mobile phones, and sensor networks) has the potential to drive far more reliable and timely insights for researchers and decision-makers at a fraction of the cost.
Many of these advances in big data and predictive data science methods are making their way into the mainstream of social science research. These developments are rapidly increasing the scale at which your average researcher can study phenomena that previously remained invisible in many LMICs.
With this rapid expansion come significant risks. An overconfident hybrid data/social scientist with their head in the literal clouds (satellites, social media, and cell phones) could quickly lose track of the real world and develop biased, poorly validated approaches (more on this later). Additionally, though calls for open data are a familiar feature of many organizations promoting big data applications in LMICs, some have pointed to serious inequities in who benefits from the pressure to “free” data. With the advent of approaches that hold so much promise, and so much risk, for research in the developing world, a robust and rigorous academic ecosystem needs to be constructed.
Seizing the Opportunity
For over a decade, CEGA has been working to build a “Data Science for Development (DS4D)” ecosystem by supporting social scientists, engineers, and data scientists in leveraging new types of data and analytical approaches to address challenges related to poverty in LMICs. Examples include using cell phone data to measure economic well-being, using machine learning algorithms to overcome the gender gap in credit access, and using geospatial modeling and machine learning to improve water resource management.
With the recent launch of CEGA’s DS4D portfolio, we’re doubling down on the promise of large-scale data sources and novel approaches to measuring poverty and facilitating sustainable economic development, while also supporting research on the limitations and concerns surrounding this work. Key to this effort are partnerships with a wide range of stakeholders, including global technology companies, telecommunications companies, satellite companies, and innovation-minded research and policy partners.
CEGA recognizes that coordination across this diverse ecosystem is critical if we want to increase the quantity and quality of data science approaches in development research while effectively addressing associated risks. To this end, CEGA hosts conferences and workshops designed to open channels of communication, learning, and collaboration across sectors. This past year, we co-hosted our annual Geospatial Analysis for Development (Geo4Dev) Symposium and Workshop with the World Bank’s Development Impact Evaluation (DIME) team and Analytics and Tools unit, New Light Technologies (NLT), and 3ie to highlight new applications of geospatial data and to train scholars new to these approaches to incorporate geospatial data in their work.
What does this look like in practice?
So how are data science applications being implemented in the real world to help alleviate poverty? DS4D approaches are changing the way we study social phenomena and construct public policies. A few examples from poverty estimation, urban development, health, and the environment illustrate a small fraction of what’s happening:
- Poverty mapping: CEGA faculty co-director Josh Blumenstock, as well as CEGA affiliates Marshall Burke and David Lobell, have developed machine learning methods that identify signatures of poverty in satellite imagery or cell phone call detail records (CDRs). For example, cheap roofing materials and low cell phone call frequency are associated with lower-income households. Once validated with ground truth data, these methods are as effective as, or more effective than, survey-based approaches at measuring economic well-being at a disaggregated level, and in turn at informing governments in targeting social services. CEGA has now launched the Targeting Aid Better Initiative with Blumenstock to scale these approaches and help governments get emergency aid to those who need it the most.
- Urban Development: Big data is transforming how we study cities and apply urban policy solutions. CEGA affiliate Marta Gonzalez used cell phone data to estimate building occupancy. CEGA affiliate Aprajit Mahajan used machine learning to analyze five years of Indian tax returns in New Delhi to identify fraudulent businesses. Nick Tsivanidis used AI with satellite data to measure the impacts of an urban redevelopment scheme in Mumbai. Marco Gonzalez-Navarro is studying the impact of new subway systems on air pollution using satellite data. These findings are transforming urban planning and policy.
- Health: Data-intensive approaches are transforming health research and practice. In a recent example, CEGA affiliate Maya Peterson used machine learning approaches to identify persons at risk for HIV in Uganda. CEGA affiliate Ziad Obermeyer wrote about how these kinds of algorithmic interventions are advancing at a breakneck pace in medicine. Importantly, one of his recent papers highlighted some pitfalls, finding racial bias against Black patients in an algorithmic approach to estimating patient health risks. Because the algorithm used money spent on the patient as a proxy for health needs, it underestimated the health needs of Black patients: less money is spent on them than on white patients with the same level of illness.
- Environment: Environmental and climate research is a boom sector for big data as well. In one example, CEGA affiliate Jennifer Burney at UCSD uses measurements of spatiotemporal variation in satellite imagery to measure the impact of groundwater extraction on soil erosion. Climate change research is also being transformed by big data. In a 2016 paper, Tamma Carleton and CEGA affiliate Sol Hsiang note “advances in computing, data availability, and study design now allow researchers to draw generalizable causal inferences tying climatic events to social outcomes.” In climate research and other sectors, non-traditional data hasn’t just improved upon traditional data methods; it has opened up entirely new avenues of inference that were formerly invisible to researchers.
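The proxy problem in the Health example above can be made concrete with a toy calculation. The sketch below is not the method or data from Obermeyer’s paper; the patient records, the group labels "A" and "B", and the spending rates are all invented for illustration. It assumes two groups with identical distributions of true illness, where group B receives less spending at every illness level, and compares who gets flagged for extra care when patients are ranked by true need versus by spending.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    group: str       # hypothetical group label, "A" or "B"
    illness: int     # true health need (higher = sicker); invented scale
    spending: float  # dollars spent on care; the proxy a model would see


# Assumption for illustration only: at the same illness level,
# group B receives less spending than group A.
SPENDING_PER_ILLNESS_UNIT = {"A": 1000.0, "B": 650.0}


def make_patients():
    """Build two groups with identical illness levels (1 through 10)."""
    return [
        Patient(group, illness, illness * SPENDING_PER_ILLNESS_UNIT[group])
        for group in ("A", "B")
        for illness in range(1, 11)
    ]


def top_k(patients, key, k):
    """Return the k patients ranked highest by the given key."""
    return sorted(patients, key=key, reverse=True)[:k]


def count_by_group(patients):
    """Count how many selected patients belong to each group."""
    counts = {"A": 0, "B": 0}
    for patient in patients:
        counts[patient.group] += 1
    return counts


patients = make_patients()

# Flag the six highest-priority patients under each ranking rule.
by_need = count_by_group(top_k(patients, lambda p: p.illness, 6))
by_proxy = count_by_group(top_k(patients, lambda p: p.spending, 6))

print("Ranked by true illness:", by_need)
print("Ranked by spending proxy:", by_proxy)
```

In this toy setup, ranking by true illness flags three patients from each group, while ranking by spending flags five from group A and only one from group B, even though both groups are equally sick. The gap comes entirely from the choice of label, which is why the choice of training target deserves as much scrutiny as the model itself.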
Not a replacement, or a panacea
Non-traditional data methods improve upon field work, but often can’t completely replace data collected face-to-face.
In Hsiang’s co-authored paper “Ground Control to Major Tom: the importance of field surveys in remotely sensed data analysis,” researchers found that the size and nature of ground truth data had a huge impact on the usefulness of satellite imagery in predicting socioeconomic well-being.
Other methodological challenges, like algorithmic bias, need to be addressed case by case, especially in low-income settings. Blumenstock writes about data science pitfalls in LMICs, such as the under-representation of marginalized populations in many datasets (for example, if you’re using data from smartphones in a place where most people don’t own smartphones) and the increasingly frequent manipulation of the data used in algorithmic research.
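The smartphone example can be illustrated with a few lines of arithmetic. This is a minimal sketch with made-up numbers, not drawn from any study: it assumes ten hypothetical households, where only the four better-off ones own a smartphone, and compares average consumption among the households a smartphone dataset would cover against the average for everyone.

```python
from statistics import mean

# Hypothetical households: (monthly consumption in USD, owns a smartphone).
# Values are invented; ownership is assumed to be concentrated among
# better-off households.
households = [
    (40, False), (55, False), (60, False), (75, False), (90, False),
    (110, False), (130, True), (180, True), (250, True), (400, True),
]


def coverage_summary(records):
    """Compare the full population with the subset visible in phone data."""
    everyone = [consumption for consumption, _ in records]
    covered = [consumption for consumption, owns in records if owns]
    return {
        "coverage": len(covered) / len(everyone),
        "true_mean": mean(everyone),
        "covered_mean": mean(covered),
    }


summary = coverage_summary(households)
print(summary)
```

Here the smartphone data covers 40 percent of households and suggests an average consumption of 240, while the true average across all households is 139. An analysis built only on the covered households would miss the poorest six entirely, which is the kind of gap that ground truth surveys are needed to detect.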
At a recent Asian Evaluation Week panel, CEGA Research Director Bilal Siddiqi spoke on some of these challenges, noting:
“No matter how smart it gets, Artificial Intelligence can’t protect us from ‘natural stupidity.’ Machine learning algorithms are trained on large amounts of data, and these data capture and reflect human behaviors. So, unless we’re careful, intelligent machines will learn to mirror and even amplify human behavioral and subjective biases — which means we need to be thoughtful when we build these algorithms, and careful when choosing the data on which they’re trained.”
So before we get ahead of ourselves trying to solve all the world’s problems with data science, it’s critical to build well-constructed and validated approaches. Otherwise, we may generalize solutions too quickly and allow new algorithmic biases to seep in.
Where to look for more information
CEGA and its partners are working to provide resources, training, and visibility around DS4D. Below we summarize a few places to turn for consolidated research and resources:
- On March 31 and April 1, CEGA and our partners at the World Bank will host MeasureDev 2021: Emerging Data and Methods in Global Health Research. Click on the link for an agenda and to register.
- The Geospatial Analysis for Development (Geo4Dev) initiative, run by CEGA in collaboration with New Light Technologies (NLT), the World Bank, and 3ie, recently launched the Geo4.dev website, which features an open source repository of geospatial data and resources from diverse sectors and contexts.
- Content from our Geo4Dev 2020 Symposium and Workshop is now available to view online.
- UC Berkeley’s Data-Intensive Development Lab (DIDL), run by CEGA faculty co-director Josh Blumenstock, is actively applying non-traditional data and methods to migration and labor market equilibrium, impacts of urban infrastructure, poverty mapping, and methods research around machine learning and big data solutions.
- The journal Development Engineering recently released a Special Issue on Geospatial Analysis for Development. If you have a paper or research idea that involves geospatial data, please consider submitting!
- The World Bank’s Development Economics Data Group (DECDG) recently released a review of tools and resources for researchers working with data fusion methods, building on MeasureDev 2020: Data Integration and Data Fusion.
- D-Lab at UC Berkeley has a fantastic roundup of data and tools for mapping COVID-19.