What Can a Data Science Team Learn From Social Science? (and vice versa!)

Alex Mankoo
Published in Wellcome Data
Feb 18, 2020 · 11 min read

Over the last year, I’ve been working with Wellcome’s Data Labs team, who seek to integrate ongoing ethical and social review as a core element of their data science product development work. By training, I’m a sociologist of science and technology, and I was brought into the team to provide a social science perspective on how they go about their data science activities.

The idea behind my involvement has been for me to work with the team’s embedded user researcher, Aoife, who has also written a number of blogs about the project. Because I’m not embedded in the team on a daily basis, I can provide Aoife with broader, less invested feedback and critique on the team’s activities and decisions.

The Social and Technical Make-up of Data Labs

One of the interesting aspects of the project is the wide range of expertise present in the team. Some team members are data scientists, who are highly proficient in the technical skills involved in writing and designing the code for machine learning algorithms. Others are data engineers and software engineers, who build the data infrastructure and maintain the servers and databases that allow the algorithm to ‘exist’ in the first place. Others still, like Aoife, have a different kind of expertise. They’re skilled in understanding the way that users — people — experience and work with these technical tools.

Then you have the product managers, who set and implement the vision for the product and coordinate the multidisciplinary team to build it, and in that respect they are interested in how the different team members interact to achieve the goals they have set. Finally, in the case of Data Labs, you can add myself — an external social scientist whose role is to help identify and flag where, when and how certain assumptions get taken for granted throughout the process as a whole, lest they unwittingly end up being baked into the technology and its development process.

The range of expertise present within the team is important to recognise because it highlights what academics might call the ‘sociotechnical’ nature of the innovation process. Put simply, developing new technologies doesn’t just involve doing ‘technical’ work and then making ‘social’ decisions about how to use or govern the outputs of that work — on the contrary, ‘social’ issues and ‘technical’ issues are, more often than not, intertwined. Making judgements in designing the technical specifications of a technology involves making decisions about its context of use, whose problems it is addressing, and the potential consequences its introduction may have on the lives of those who interact with it, be it directly or indirectly. This last aspect can be particularly challenging, because different contexts will often shape technologies and knowledge about them in very different ways, leading to potentially unintended or unforeseen consequences.

“Developing new technologies doesn’t just involve doing ‘technical’ work and then making ‘social’ decisions about how to use or govern the outputs of that work — on the contrary, ‘social’ issues and ‘technical’ issues are, more often than not, intertwined.”

Consequently, researchers in my field of Science & Technology Studies (STS) like to think about social values and technical knowledge as linked — as ‘sociotechnical’ — rather than draw boundaries between realms of ‘technical’ and ‘social’ considerations and work. It may seem unusual to conceive of data science teams, or technical experts, as making what seem equivalent to governance decisions or even public policy, but thinking in terms of ‘sociotechnical’ systems suggests that this happens regularly. What concerns many STS researchers, however, is that it often happens without recognition — leading to ‘technocratic’ forms of governance that don’t truly provide opportunity for democratic participation.

Three Ways Developing an Algorithm is ‘Sociotechnical’

If this all sounds very abstract, I’ll use a couple of examples throughout this piece to illustrate what I mean. One is from the work of the Data Labs team, and the other is an example that has been widely discussed within the artificial intelligence (AI) and data analytics communities over the last few years. I’ll look at the latter first. Over the last decade, courts in some US states have begun using risk assessment scores to assist in decisions about setting bail and sentencing. These systems aim to predict recidivism — in other words, they score the likelihood that a criminal defendant will reoffend. A higher score means that a defendant is assumed to be more likely to reoffend. The use of these systems caused an uproar when an investigation by ProPublica found that such risk assessment software disproportionately gave black defendants higher risk scores than white defendants.

The case has been a hotbed of discussion for a few years now, but for our purposes I want to highlight a variety of points at which we could scrutinise the sociotechnical choices that were at play.

(1) First, the data sets.

The risk scoring software (COMPAS) examined by ProPublica derived its score from defendant survey data that included questions about family history, residence details, educational qualifications, social circles, and employment, amongst other things. Many of these fields (especially in combination) function as ‘data proxies’ for socioeconomic status and ‘sensitive’ categories of data such as race. Compounding this issue is the fact that the ‘training data’ set for the COMPAS model, so to speak, would have been data from the US criminal justice system, which has historically jailed an overwhelmingly disproportionate number of black citizens.

Designing software like COMPAS therefore involves selecting what data is relevant for determining risk score — this is a social and technical judgement. Which categories are suitable predictors for recidivism? What, if anything, in the data set should be used, despite its inherent social bias? How should data on defendants be collected? How can the reliability of data be dealt with (e.g. what if residential history is incorrect)? How should omissions or categories without data be processed? These are not decisions that technical expertise can provide simple answers or fixes to. They are not decisions that data scientists and developers can make by themselves, nor should they be expected to.

“These are not decisions that technical expertise can provide simple answers or fixes to. They are not decisions that data scientists and developers can make by themselves, nor should they be expected to.”

Some of these data set issues might be addressed by introducing more diverse specialised roles in a product development team, but others may be shaped by external factors (e.g. if training data sets are purchased from a third party, or if there are regulations in place regarding data collection). Choices may also depend on time and resources. What data can actually be gathered, how easily, how quickly, at what cost? Nevertheless, they are still social and technical judgements — choices about design that were made for one reason or another.

(2) The data processing.

If these are some of the ways sociotechnical judgements shape ‘inputs’ into the data system, what about the ‘process’ of developing how an algorithm works? They come into play there too. For instance, data scientists may choose to ‘impute’, or substitute, missing data with estimated values, or to map words to numerical representations (‘word embeddings’) in ways that have been shown to contain bias. Moreover, data scientists have to make decisions about what kind of evaluation metric is appropriate in a given context. In fact, these decisions are two-part: data scientists must decide what evaluation metric is suitable for an algorithm, and then decide how to optimise for that particular metric.
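To give a flavour of how ordinary these choices look in practice, here is a minimal, hypothetical sketch of imputation using scikit-learn. The data, column names and choice of strategy are my own illustrative assumptions, not anything from Data Labs’ or COMPAS’s actual pipelines.

```python
# A hypothetical imputation step: the data and column names are invented.
import pandas as pd
from sklearn.impute import SimpleImputer

records = pd.DataFrame({
    "age": [23.0, 31.0, None, 45.0],
    "prior_offences": [0.0, 2.0, 1.0, None],
})

# Filling gaps with the median looks like a purely technical default, but it
# embeds a judgement: that missingness is unremarkable and that a 'typical'
# value is an acceptable stand-in for a real person's record.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(records), columns=records.columns)
print(imputed)
```

The same point applies to off-the-shelf word embeddings: the corpus they were trained on is a choice someone made, and any bias in that corpus travels with them.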

Let me use an example we’re currently working with to explain evaluation metrics. Data Labs is developing an algorithmic tool called Wellcome Reach. Put briefly, Reach is designed to scan and identify research citations in policy reports (for example, reports by the World Health Organisation) and match these to research papers produced by researchers (in a given data set). What evaluation metric is suitable for such an algorithm? Presumably, an effective metric would maximise the rate at which the algorithm matches a citation in a policy document to the correct author (the true positive rate). Likewise, it will be important to minimise attributions to incorrect authors (the false positive rate) and to minimise correct authors not being identified at all (the false negative rate).
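As a concrete (and entirely invented) illustration of these three rates, here is a toy Python sketch. The labels are made up, and nothing below reflects how Reach is actually implemented.

```python
# Toy citation-matching labels, invented purely for illustration.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # is the citation really by this author?
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # did the algorithm say it was?

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correct matches
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # wrong authors credited
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # correct authors missed
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # correct non-matches

true_positive_rate = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (tp + fn)
print(true_positive_rate, false_positive_rate, false_negative_rate)
```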

This introduces a couple of issues:

  1. It may not be possible to optimise all of these at once. For example, minimising the false negative rate may lead to an increase in the false positive rate. Therefore, decisions need to be made about which of these scenarios matters most given the context in which the algorithm will be applied in society.
  2. Simply optimising for the overall true positive rate may introduce different accuracy rates across different groups in the dataset. For example, what if getting the highest true positive rate means that citations from UK research institutions are recognised more often than those from non-UK institutions?

In the case of the second issue, one way to deal with it is to optimise for parity across prominent groups. So in this example, seeking parity would mean making sure that the true positive rate is the same for research produced in UK institutions and non-UK institutions. This might lead to a drop in the overall true positive rate, but would mean that the algorithm performed more equitably across groups. In fact, if for any reason you wanted to rectify a perceived ‘unfairness’ in a dataset, you could optimise the algorithm to be slightly more accurate for one underrepresented group (though this would again come at a potential cost to other groups).
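Here is a rough sketch of what ‘checking for parity’ might look like in code. The groups, labels and numbers below are invented for illustration and are not Reach’s real data or method.

```python
# Hypothetical per-group true positive rates; everything here is made up.
import pandas as pd

results = pd.DataFrame({
    "institution_group": ["UK", "UK", "UK", "non-UK", "non-UK", "non-UK"],
    "is_true_match":     [1,    1,    1,    1,        1,        1],
    "predicted_match":   [1,    1,    0,    1,        0,        0],
})

# Of the genuine citations in each group, what share did the algorithm find?
tpr_by_group = (
    results[results["is_true_match"] == 1]
    .groupby("institution_group")["predicted_match"]
    .mean()
)
# A gap between these per-group rates is what a parity constraint would target,
# even if closing it reduces the overall true positive rate.
print(tpr_by_group)
```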

“seeking parity would mean making sure that the true positive rate is the same for research produced in UK institutions and non-UK institutions. This might lead to a drop in the overall true positive rate, but would mean that the algorithm performed more equitably across groups.”

Hopefully from this example you can see how different contexts of application might affect the choice of an evaluation metric (in some cases false positives might be very important to avoid). Furthermore, it becomes clear that using such a parity metric also requires a decision about which groups should be checked for parity. Crucially, this is not simply a technical judgement but a deeply social one. One may deem that optimising for parity across gender (i.e. to ensure Reach acknowledges research produced by different genders equally) is more important than optimising for parity across month of publication, for instance. One is a significant social category; the other is not. But what about cases where trade-offs need to be made — say, if parity across gender cannot be achieved simultaneously with parity across country of publication? These are more challenging scenarios, and decisions necessarily require broader participation than that of the technical developers alone.

Moreover, these decisions concern ‘sensitive categories’ of data. Optimising for parity across sensitive categories like gender or race necessitates intentionally identifying and processing those categories. Aside from the logistical complexity and compliance work this introduces with respect to important data laws like GDPR, these decisions involve intentional choices about who matters in a data set, and to what extent. The latter half of this recent article gives readers an interactive tool to think through this sort of dilemma with respect to the black and white defendant categories in COMPAS.

Even after these steps, and assuming a decision can be made, an algorithm like COMPAS requires a further decision about thresholds: where should the risk score threshold between a recommendation of release and a recommendation of jail lie? That threshold does not always correspond to maximising accuracy. Maximising overall accuracy would mean jailing a large number of medium-risk defendants, but a whole host of economic, social and legal issues (e.g. the strain imposed on the resources of an already stretched prison system) understandably dictate that thresholds should be set higher, despite the lower accuracy rate that results.
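To illustrate the point (with invented numbers, and without implying anything about how COMPAS itself is configured), you could sweep a risk-score threshold and watch accuracy and the number of people detained pull in different directions.

```python
# Hypothetical risk scores and outcomes, invented purely for illustration.
import numpy as np

scores   = np.array([1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10])  # risk score, 1-10
reoffend = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1])   # 1 = later reoffended

for threshold in range(1, 11):
    predicted_high_risk = scores >= threshold
    accuracy = float(np.mean(predicted_high_risk == reoffend))
    detained = int(predicted_high_risk.sum())
    print(f"threshold={threshold:2d}  accuracy={accuracy:.2f}  detained={detained}")

# The threshold with the best accuracy may also detain far more people than the
# system can (or should) hold; where to draw that line is a social choice.
```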

This brings me to the third and final aspect of the sociotechnical I want to discuss here.

(3) The outputs and their impacts.

I mentioned the importance of context of application above. Responsible development of an algorithmic technology demands attention to the values, practices, groups and pressures that exist within that technology’s context of application, at both local and potentially broader scales. Technologies like Reach and COMPAS can have significant impacts on existing social structures, practices, identities and understandings of the world. They can generate new topics for attention, new methods, new types of problems, new forms of expertise, and so on. For example:

  • An organisation wishing to make use of an algorithm will need to ensure users are trained in how to interact with the algorithm properly and interpret its results. This might result in new working groups being formed, new training procedures, or new policies on how to act upon outputs.
  • Over time, use of an algorithm like Reach might shape how funding organisations choose to assess their research outputs and allocate their resources. Subsequently, it could have an impact on the relationship between researchers and their funders, as well as the ways in which researchers choose to publish and manage their research outputs.
  • An algorithm like Reach could provide new ways of understanding the world, through insights into large data sets that might not have existed in prior methods of data analysis. These insights have the potential to identify new problems and opportunities for research and technological development.

Unlike some of the problems discussed earlier, these examples may seem to be issues that demand social judgement rather than technical judgement. Again, this is not necessarily so. They are still sociotechnical by nature — knowingly or not, technical decisions in the design process involve imagined contexts of application and user behaviour, as well as context-specific goals. If dialogue and awareness regarding these aspects are not built in from the early stages of development, then we risk falling into what Langdon Winner, a prominent academic in the field of STS, has called ‘technological somnambulism’. In other words, we risk passively sleepwalking into futures shaped by technological developments that we could and should have considered, democratically deliberated, and taken an active role in shaping.

“we risk passively sleepwalking into futures shaped by technological developments that we could and should have considered, democratically deliberated, and taken an active role in shaping.”

In the early stages of my work in Data Labs, I tried to map out some of the ways in which Reach could shape futures. Here were some of my thoughts:

  • If Reach comes to shape impact metrics for research — could it provide visibility to certain disciplines over others? Could this affect funding decisions/allocations?
  • Changes in practices and behaviours of researchers — how, when, where, why they publish? Changes to their attitudes — do they favour/oppose the use of Reach to analyse their work?
  • Changes in practices and behaviours of policy organisations?
  • Will funders need to start employing people trained in managing and analysing data? Will Reach integrate within, or reshape, existing research assessment frameworks?
  • Data analysis could identify commonalities between fields of study, or identify priority areas for policy impacting research/areas where policy impact is already significant.
  • From a very wide perspective, through its ability to shape funding and research culture, Reach may also shape publics (e.g. certain patients, communities).

Clearly, none of these issues can or should be tackled by Data Labs alone (nor do I think anyone would want to take on such a responsibility!). They demand dialogue, participation and awareness from a wider range of stakeholders across the research landscape. They are the sorts of questions that centre on whether innovation or development should even be pursued in the first place, and if so, how, when, why, and for whom. They are big questions, and quite possibly present the most complex challenge of the three sociotechnical aspects I have discussed here, because they require an understanding of local contexts, larger contexts, and the interactions between them.

“These are the sorts of questions that centre upon whether innovation or development should even be pursued in the first place, and if so, how, when, why, and for whom.”

To finish, I’d like to emphasise that the approach Data Labs is taking is experimental and reflexive. Data Labs is developing Reach not only for the product, but also to reflect upon how they as a data science product team grapple with the sociotechnical issues discussed here, the tradeoffs that they make, and how decisions might be made with an awareness of the significance of both social and technical aspects. This is, I think, a worthy starting point (not an end by any means) to better integrate social and ethical review into data science work, and I am looking forward to both seeing and playing a part in what it produces.

Thanks for reading!
