Q&A: Divy Thakkar on Valuing Data in AI for Social Good

People + AI Research @ Google
May 17, 2022


Illustration for Google by Aleesha Nandhra

Divy Thakkar is Google’s Program Manager Lead in University Relations and AI for Social Good, based in India. Divy is a human-computer interaction (HCI) researcher interested in examining human-AI interaction for underserved communities in the Global South. His work has advanced the use of AI to support underserved communities in public health, conservation, the future of work, and Responsible AI. This Q&A was collaboratively edited with David Weinberger and is based on a paper to be presented at CHI 2022.

Q: When many of us think about data, we think about sterile computing environments churning through neat rows of numbers. But data doesn’t always start that way. What does data look like before it gets handed to data scientists and machine learning (ML) developers?

Divy: There’s been some recent work on understanding the biases and origins of data, most of it centered on the Global North, a context that tends to be distant from the realities of the Global South. Our group’s interest came from looking at the growing body of work in community public health beyond the Global North. Data workers, including ML developers, data stewards in governments, community health workers, and so forth, often encounter data that they believe is low quality and isn’t usable for ML, yet that data is frequently used for downstream ML applications.

Q: In 2020 you did a study of this, centered in India.

Divy: Yes, this included researchers from Google, Georgia Tech, and the Indian Institute of Technology Madras. We interviewed healthcare workers, data stewards, and ML researchers across 16 villages and 5 states. Each state in India defines its own public health norms, so having that range was helpful.

Q: What did you learn about why the quality of the data is often low?

Divy: Data goes through a supply chain, with humans involved at each stage of the process: collecting it, looking through it, editing and transforming it (including data cleaning and data labeling), and passing it on. It’s very important to understand that the data we are studying changes hands multiple times in this supply chain, and there are various interdependencies within the data that are frequently unknown to other stakeholders in the chain. Some people attribute the low quality to lazy work by the people collecting it. But we found it could be better understood by looking at the disconnects in valuing among the multiple stakeholders, the data workers. Valuing here comes from valuation studies and refers to the set of practices and contexts that shape what counts as “good” data at each stage. This gives us a framework for examining the shared work of valuing an entity, where the value of a “good” entity differs, and is likely to be in tension, across situations and stakeholders. Even though each stakeholder might have different valuing practices, it is their collective action that makes the entity being valued, in this case data, “good.”

Our research found that each person in the data supply chain had a limited view of how the data was used in the next stage. Additionally, each stakeholder had a different valuation associated with the data. We noticed that there were frequent tensions between the stages that directly impacted data quality.

Q: For example?

Divy: In one case, an ML researcher at the end of the supply chain was concerned about possible undersampling and attributed it to missing data. To evaluate that concern, you have to look at the entirety of the supply chain. We found that in this case, a healthcare worker, typically underpaid and often working at risk to their personal safety, had to travel long distances to communities where they might not have built the right level of trust, and then had to deal with complex social factors that prohibit access. For example, the healthcare workers were often unable to conduct interviews with people from certain marginalized communities because of India’s cultural and social structures. So while the ML researcher initially believed the data quality issue was missing data, we found that the issue ran deeper: the undersampling was the result of underlying societal factors that affect the safety and efficacy of the data collection process.
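To make the undersampling point concrete, here is a minimal sketch, not from the paper, of how an ML developer might check whether a group is underrepresented relative to a reference population before concluding that data is simply “missing.” All community names, shares, and thresholds below are hypothetical.

```python
# Hypothetical reference population shares, e.g. from a census.
population_share = {"community_a": 0.40, "community_b": 0.35, "community_c": 0.25}

# Hypothetical counts of survey records actually collected per community.
collected_counts = {"community_a": 820, "community_b": 710, "community_c": 150}

total = sum(collected_counts.values())

for group, expected in population_share.items():
    observed = collected_counts.get(group, 0) / total
    # Flag groups represented at less than half their population share.
    if observed < 0.5 * expected:
        print(f"{group}: observed {observed:.0%} vs. expected {expected:.0%}; "
              "ask about collection conditions upstream.")
```

A flag like this is only a prompt to investigate conditions earlier in the supply chain, not a verdict on the data or the people who collected it.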

Q: What kinds of values?

Divy: Our work unpacks the values that shape data through each stakeholder’s organizational, social, and cultural ecologies. Organizational factors frequently influence the dataset. For example, think of a person who acts as a data steward, pulling the data together and organizing it. They have huge piles of data to digitize. Their organization may have limited resources to automate those processes, especially outside of the Global North. That means more manual work, often quite tedious, and not enough people to do it well. With strained resources, an organization may not provide effective incentives for data stewards to work faster or better. It may not even be able to provide training on managing data quality. That puts the stewards in a difficult situation.

Q: And that could affect the quality of the data they pass along to the next step in the supply chain.

Divy: Yes. In our research we’ve found that healthcare workers gathering data need to build a foundation of trust with the people they’re approaching. Being embedded in a community of trust is especially important because of the sensitive data the workers are frequently gathering. This trust affects whether healthcare workers are able to gain full consent for the data collection, which often involves a kind of negotiation. That’s led our research team to argue for data literacy, so that both the people being surveyed and the health workers understand the implications of agreeing or refusing to provide information. And most importantly, it means everyone in the data supply chain should understand that their primary responsibility is to ensure that the data collected from these communities is not used in harmful ways.

Q: Do the data supply chains support this responsibility?

Divy: Our findings support practices for greater accountability in the data supply chain by encouraging ML developers to understand how data is valued throughout the supply chain and how those valuations differ from their own. Our work therefore urges stakeholders to recognize their responsibility to each other in the data supply chain through practices that create more transparency about the use of data, feedback for collectors, and responsibility to their communities. We also encourage ML developers to think responsibly and critically about their data and its use cases in high-stakes domains.

Q: At the collection stage of the supply chain, do the incentives of the people collecting the data align with the ultimate value of the data?

Divy: Not always. This is another way organizations can shape data. For community health workers, data is performative: they have to show their data is complete, but its quality can’t be judged until later in the chain. That creates an incentive for completeness that can compete with the need for accuracy. In fact, we found in our study that ML developers assumed a string of perfect reports meant some of the data had been fudged (though we lack the data to verify whether that assumption was accurate).
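As an illustration of that completeness-versus-accuracy tension, here is a hedged sketch, not from the study, of how uniformly “perfect” reporting might be surfaced for human follow-up rather than treated as proof of fudging. The worker names and rates are hypothetical.

```python
from statistics import pstdev

# Hypothetical per-month completeness rates reported by two health workers.
reports = {
    "worker_1": [1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    "worker_2": [0.92, 0.97, 0.88, 1.00, 0.95, 0.91],
}

for worker, rates in reports.items():
    # Uniform perfection is a prompt for feedback and conversation,
    # not evidence of fudging; the study could not verify that assumption.
    if min(rates) == 1.0 and pstdev(rates) == 0.0:
        print(f"{worker}: uniformly perfect reporting; schedule a feedback visit.")
```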

Q: At what point in the data supply chain can these sorts of inaccuracies be noted?

Divy: Rather than inaccuracies, I would reframe them as tensions arising from differences in how data is valued, that is, in the practices used to make data ‘good’. Those valuations may not apply uniform criteria across the entire supply chain. Looking at data through the lens of its valuation lets us move beyond arbitrary notions of data quality and focus on what data quality goals are desirable and achievable in a given context, based on existing practices.

_______

Q: What can we do to improve the data supply chains?

Divy: Our findings build on prior work that highlights the importance of data documentation. We find that improving transparency about the data used throughout the supply chain can have an impact on data work, especially for health workers and data stewards. This can take shape through training, increased data literacies, shared data taxonomies, and shared visualizations of the impact of data throughout the supply chain.
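As one illustration of what such documentation could look like, here is a minimal sketch, assuming nothing beyond the practices named above, of a record that travels with the data through the supply chain. The fields are hypothetical, loosely in the spirit of dataset documentation work, and are not the paper’s proposal.

```python
from dataclasses import dataclass, field

@dataclass
class SupplyChainRecord:
    """One stage's entry in a (hypothetical) data documentation trail."""
    stage: str                  # e.g. "collection", "digitization", "labeling"
    actor_role: str             # e.g. "community health worker", "data steward"
    what_good_means_here: str   # how this stakeholder values "good" data
    known_gaps: list[str] = field(default_factory=list)
    downstream_use_explained: bool = False  # was the next stage's use shared?

# Hypothetical entry for the collection stage.
collection = SupplyChainRecord(
    stage="collection",
    actor_role="community health worker",
    what_good_means_here="complete forms submitted on time",
    known_gaps=["households in one marginalized community not reachable"],
    downstream_use_explained=False,
)
print(collection)
```

Passing records like this along with the data would give each stakeholder some visibility into how earlier stages valued the data and what gaps they already knew about.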

Q: Given the multiple links in a data supply chain, where does the accountability fall?

Divy: Though all stakeholders were engaged in developing these ML datasets, we suggest that accountability for ML outcomes be shared between ML developers and the organizations providing data, given the time and resources these two groups invest in the outcomes. Our findings showed that bias was frequently linked to poor workflow organization, excessive bureaucratic data collection requirements, and the challenges data collectors face in reaching all communities. This calls for organizations to be accountable for examining their workflows and making them more equitable to communities and workers. There is also a need for the organizations and the public health officers that data stewards and collectors report to, to be accountable for providing regular feedback and greater transparency around the use of data. Organizations and governments should be accountable for providing data workers with more resources to complete their tasks, and with training in statistical operations so that they feel equipped to do their work. Along these lines, data collectors should be able to hold their superiors accountable for giving feedback.

Q: Accountability to whom?

Divy: Accountability for bias should compel institutions to reconsider whether their workflows are equitable to communities and workers, which improves data quality as well. It is important to evaluate your data through the lens of valuation: ask how individuals were involved in the data collection, how the data was valued at different stages, and what tensions exist, and co-design processes with data collectors and stewards.

It turns out that data is a fully human product that passes through a supply chain supported by humans. In the end, the accountability is owed first and foremost to the human communities who are both the source of the data and the intended beneficiaries.

--

People + AI Research (PAIR) is a multidisciplinary team at Google that explores the human side of AI.