AI in the Public Sector: How Do We Know When it is Safe to Use an Algorithm? (Part 1)

Published in Data & Policy Blog · May 4, 2021
Photo by Yassine Khalfalli on Unsplash

Catherine Inness is Data Science Senior Manager & Tech Delivery Lead at Accenture. She also has an M.Sc. in Machine Learning from University College London and has written on various topics connected to ethical and responsible Artificial Intelligence (AI). This is the first part of her discussion of AI in a public sector context, which uses the summer 2020 British A-Level controversy to expand the conversation around algorithmic outcomes and models. Part 2 will discuss some tools for fairness in machine learning.

Following continued disruption to education from COVID-19, Ofqual (the regulator for qualifications, examinations and assessments in England) has confirmed that grades for school-leaving exams, known as A-Levels, will be determined by students’ teachers in 2021. This position is likely informed by last year’s experience: in August 2020 the UK government revoked the results of a statistical model that determined school-leavers’ expected grades, following claims that the model was harmful to social mobility [1].

The controversy surrounding this grade prediction algorithm has raised two questions: Could a predictive model for this task ever have been deemed fair? And what does this mean for the future application of predictive models in the public sector?


There are two clear learnings:

1. The model will never be 100% accurate, so can we afford to make mistakes?

Just because we can use an algorithm (and have an urgent problem to solve), doesn’t mean we should. But how do we know when we should?

The Ofqual interim report [2] makes clear that significant thought went into the design of last year’s statistical model. For each academic subject, teachers were asked for a predicted grade for each student and a rank order of students for each grade. Ofqual then implemented standardisation methods, with the aim to “ensure fairness to students within the 2020 cohort” [2]. The final model predicted the distribution of grades for each institution based on past performance of that institution, and gave weight both to this predicted distribution and to the set of teacher-predicted grades.
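
To make the weighting idea concrete, here is a minimal, hypothetical sketch in Python of blending an institution’s historical grade distribution with its teacher-predicted grades. The grade scale, the blended_distribution function and the 0.7 weight are illustrative assumptions, not Ofqual’s actual standardisation method, which also drew on teacher rank orders and other adjustments.

```python
import numpy as np

# Illustrative grade scale (assumption, for the sketch only).
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]

def blended_distribution(historical_share, teacher_predicted_grades, weight_historical=0.7):
    """Blend an institution's past grade distribution with this year's teacher predictions.

    historical_share: past proportions of each grade at the institution (sums to 1).
    teacher_predicted_grades: list of teacher-predicted grades for this year's cohort.
    weight_historical: assumed weight placed on the institution's past performance.
    """
    historical_share = np.asarray(historical_share, dtype=float)
    counts = np.array([teacher_predicted_grades.count(g) for g in GRADES], dtype=float)
    teacher_share = counts / counts.sum()
    return weight_historical * historical_share + (1 - weight_historical) * teacher_share

# Example: a school whose past cohorts skewed towards B/C grades,
# but whose teachers predicted mostly A/B this year.
past = [0.05, 0.15, 0.30, 0.30, 0.15, 0.04, 0.01]
predicted = ["A"] * 10 + ["B"] * 8 + ["C"] * 2
print(dict(zip(GRADES, blended_distribution(past, predicted).round(2))))
```

The tension is visible even in this toy version: the heavier the weight on the institution’s history, the less an unusually strong individual cohort can shift its own distribution.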

The problem is that statistical models must make generalisations, and where there are generalisations there will be errors at the individual level. To understand whether a task is suitable for an algorithm, decision-makers must actively agree that errors at the individual level are acceptable. This is a complex judgement and must be weighed against the error rates that persist in the ‘offline’ alternative.

Photo by Dora Dalberto on Unsplash

We must also take into account the type of error: teachers have been found to be more likely to overestimate their students’ ability than to underestimate it [3]. Teacher-predicted grades could therefore leave universities overcrowded, but arguably have a more positive impact on students’ lives than an algorithm that underestimates performance.

The distribution of errors across different subgroups must also be reviewed, to make sure the algorithm makes the same types of mistakes for people with different characteristics. For example, in this case: are just as many state school students underestimated as independent school students?
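
As an illustration of the kind of check this implies, the sketch below uses made-up data to compare under-prediction rates between state and independent school students. The column names, numbers and encoding are hypothetical, and in practice “actual_grade” would need a credible proxy for true performance (for 2020, the exams were never sat).

```python
import pandas as pd

# Made-up toy data: grades encoded numerically (e.g. A* = 6, A = 5, ...).
df = pd.DataFrame({
    "school_type":  ["state", "state", "state", "state",
                     "independent", "independent", "independent", "independent"],
    "actual_grade": [5, 4, 6, 3, 5, 4, 6, 3],
    "model_grade":  [4, 3, 5, 3, 5, 4, 6, 3],
})

# An individual is under-predicted when the model grade is below the grade achieved.
df["under_predicted"] = df["model_grade"] < df["actual_grade"]

# If these rates differ markedly between groups, the model's errors are not being
# shared equally, even if overall accuracy looks acceptable.
print(df.groupby("school_type")["under_predicted"].mean())
```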

Photo by Pattern on Unsplash

2. Outcomes from algorithms must be transparent and there must be a feedback loop

On ‘results day’ 2020, teachers and schools spotted patterns in the algorithm’s predictions that led them to believe it was not fair to all students and was detrimental to social mobility. The media, already primed to report students’ results, picked up the stories of suspected algorithmic bias. There was sufficient loss of confidence in the fairness of the model for the government to take action and revoke the algorithm’s predictions.

This clearly isn’t the feedback loop that Ofqual would have chosen, but it was a course-correction.

For low-stakes applications of machine learning, such as paid online advertising, A/B testing can be implemented to build in a feedback loop. This is far harder for higher-stakes decisions, which can impact lives from the moment they are communicated and often have no mechanism for feedback. We’ll never know, for example, whether an individual denied a loan by an algorithm was actually creditworthy.

Model designers must build in a mechanism to identify when a model is making unfair predictions, and enable corrective action to be taken. This means determining how we can build in an opportunity to analyse real outcomes at the group level, as was possible almost by accident at the school level for A-Level predictions.
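
One way to operationalise that group-level review is sketched below with made-up data: compare the model’s grades against a reference signal (here, teacher predictions) per school, and flag groups with large systematic downgrades. The school names, numbers and half-grade threshold are illustrative assumptions, not a published Ofqual check.

```python
import pandas as pd

# Made-up data: model grade vs teacher-predicted grade for individual students.
results = pd.DataFrame({
    "school":        ["North High", "North High", "South College", "South College"],
    "teacher_grade": [5, 4, 5, 6],
    "model_grade":   [4, 3, 5, 6],
})

# Mean gap between the model and the reference signal for each group.
gap = (results["model_grade"] - results["teacher_grade"]).groupby(results["school"]).mean()

# Assumed tolerance: an average downgrade of more than half a grade triggers a review.
flagged = gap[gap < -0.5]
print(flagged)
```

The point is not the specific threshold but that the check is defined, and acted on, before results are released rather than after the media spots the pattern.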

End of Part 1.

Photo by Wim van ‘t Einde on Unsplash

[1] BBC, “A-levels: Anger over ‘unfair’ results this year,” 2020. [Online]. Available: https://www.bbc.co.uk/news/education-53759832.

[2] Ofqual, “Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report,” 2020. [Online]. Available: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/909368/6656-1_Awarding_GCSE__AS__A_level__advanced_extension_awards_and_extended_project_qualifications_in_summer_2020_-_interim_report.pdf.

[3] R. Murphy and G. Wyness, “Minority report: the impact of predicted grades on university admissions of disadvantaged groups,” Education Economics, pp. 333–350, 2020.

This is the blog for Data & Policy, the partner journal for the Data for Policy conference. You can also find us on Twitter. Here are instructions for submitting an article to the journal.

Blog for Data & Policy, an open access journal at CUP (cambridge.org/dap). Eds: Zeynep Engin (Turing), Jon Crowcroft (Cambridge) and Stefaan Verhulst (GovLab)