What Is Beyond the “Human-Centered” Design of AI-driven Systems?

Claudia Müller-Birn
Nov 12, 2019

The following text summarizes the keynote I gave at the 4th European Technology Assessment Conference “Value-driven Technologies: Methods, Limits, and Prospects for Governing Innovations” on November 5, 2019, in Bratislava (Slovakia). I was asked for my notes and my slides and decided to publish most of the content here. Happy Reading!

My name is Claudia Müller-Birn. I am the head of the research group Human-Centered Computing at the Institute of Computer Science at the Freie Universität Berlin. In my talk, I introduce the research area of Human-Centered Machine Learning. I give an overview of existing techniques and approaches, and I also explore the question of whether such a perspective alone is enough. Why is this topic important? Our society is increasingly reliant on software systems that are driven by artificial intelligence (AI) technologies. We use these systems in ever more sensitive areas, and the results are often not as anticipated.

Challenges in current AI-driven Systems

We all know the examples from the media (or from Cathy O’Neil’s book “Weapons of Math Destruction”). African Americans in the US, for example, were systematically disadvantaged by software for predictive policing and legal risk assessment in court. Software for assigning students to schools in the US discriminated against children from low-income families. Research has shown that the use of facial recognition software in public places leads to systematic misjudgments. We see, on the one hand, fantastic technical progress, but, on the other hand, there has been a qualitative change in our perception of this technology and its impact on our society. I want to get to the bottom of this qualitative change and discuss what approaches exist in computer science to mitigate existing challenges when using AI-driven systems. Let me name some of these challenges.

Firstly, AI-driven systems exhibit a lack of robustness. This means that AI has trouble handling exceptions and unexpected situations. Let us take the example of the driverless car. How does a car behave in unusual traffic patterns and weather conditions, and how does it deal with strange human behavior? People frequently believe that AI has human-like intelligence. They assume that good performance in one context assures reliability in another. Marcus and Davis call this an over-attribution error in their book “Rebooting AI.” The assumption does not hold, for example, for driverless cars, which perform wonderfully in ordinary circumstances but badly in messy situations.

Secondly, AI-driven systems have a limited ability to transfer what they learned from domain-specific data sets. AI-driven technologies use machine learning models that are trained on a data set from one specific domain. Transferring or applying these technologies to another context often fails. Machine-translation systems trained on legal documents, for example, perform poorly when applied to medical documents (an approach to tackle this issue). Voice recognition systems often have problems with accents (Mozilla is working on this with its excellent Common Voice project).

Thirdly, another challenge with the data is that they are historical. We train our machine learning (ML) models on data from the past, and thus we entrench history rather than reflect changing realities. Simply run an image search for the term “professor.” The majority of images show you a white male. Does this represent our current society? The same applies to medicine: Cancer diagnostic programs may be tuned toward white patients and give invalid results for people of color. A core issue is that current AI systems mimic the input data without regard for our social values or for the quality or nature of the data.

Fourthly, the combination of the preexisting societal bias in the data and the so-called echo-chamber-effect, which describes a situation in which beliefs are amplified or reinforced by communication and repetition inside a closed system, can lead to the amplification of social bias.

An excellent example is the COMPAS software from the US, which carries out so-called risk assessments for defendants. These risk assessments can be used to determine bail or the severity of a sentence. ProPublica, a non-profit organization for investigative journalism, obtained the risk scores through a public records request. They were able to show that the risk assessment was strongly dependent on the ethnicity of the defendants. Black defendants were often assigned a higher risk of reoffending than turned out to be the case, while white defendants were often predicted to be less risky than they actually were.
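A disparity of this kind can be made concrete by computing error rates separately for each group. The following is a minimal sketch under stated assumptions: the records are invented toy data (not COMPAS data and not ProPublica's figures), and the function name `error_rates_by_group` is only illustrative.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """Return {group: (false_positive_rate, false_negative_rate)}.

    Each record is (group, predicted_high_risk, reoffended); the last
    two entries are booleans.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, predicted, actual in records:
        if predicted and actual:
            counts[group]["tp"] += 1
        elif predicted and not actual:
            counts[group]["fp"] += 1   # rated risky, did not reoffend
        elif not predicted and actual:
            counts[group]["fn"] += 1   # rated safe, did reoffend
        else:
            counts[group]["tn"] += 1
    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # people who did not reoffend
        positives = c["fn"] + c["tp"]  # people who did reoffend
        fpr = c["fp"] / negatives if negatives else 0.0
        fnr = c["fn"] / positives if positives else 0.0
        rates[group] = (fpr, fnr)
    return rates

# Invented toy records for two groups "A" and "B". NOT real data.
toy_records = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6
    + [("A", True, True)] * 8 + [("A", False, True)] * 2
    + [("B", True, False)] * 2 + [("B", False, False)] * 8
    + [("B", True, True)] * 5 + [("B", False, True)] * 5
)

rates = error_rates_by_group(toy_records)
for group, (fpr, fnr) in sorted(rates.items()):
    print(f"group {group}: false positive rate {fpr:.2f}, "
          f"false negative rate {fnr:.2f}")
```

In this toy data, group A is more often wrongly rated as risky, while group B is more often wrongly rated as safe, even though a single overall accuracy figure would hide the difference.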

How can we respond to these challenges?

Researchers from computer science, social science, policy making and philosophy, to name only some of the disciplines involved, have become increasingly aware of the challenges. Platforms such as the interdisciplinary conference on Fairness, Accountability, and Transparency are flourishing.

Terms such as Responsible AI, Explainable AI and Ethical AI are being used everywhere. One example is the OECD (Organisation for Economic Co-operation and Development). In May 2019, the OECD adopted its Principles on Artificial Intelligence: 42 countries agreed to the first international standard for the responsible stewardship of trustworthy AI.

How can these guidelines help us to make decisions on the design of AI-driven systems? On their own, they remain abstract. I focus, therefore, on Human-Centered ML, since it already provides some concrete approaches and techniques. Human-centered ML reframes existing ML workflows based on situated human working practices. It explores how humans and systems can co-adapt. When we talk about ML, we refer to a narrow area of the broad spectrum of AI technologies. Today, I want to talk primarily about supervised learning, which is a subfield of ML. Let us have a look at a typical ML pipeline.

Typical ML-Pipeline

You select a representative sample from the available data. These data are then separated into a training set and a test set. Let us assume we have decided on a specific model. We train this model and analyze its quality based on the test data. If the quality and accuracy meet our expectations, we use it as a prediction model and apply it to new, i.e., unknown data. How is this technology typically being used? Here is an example from the public domain.
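The pipeline steps can be sketched in a few lines of code. This is a toy illustration under stated assumptions: the data are synthetic, and a one-feature threshold rule stands in for a real model, so that the example runs with the Python standard library alone. In practice, a library such as scikit-learn would typically be used. All function names are illustrative.

```python
import random

def make_sample(n, rng):
    """Synthetic labeled data: one numeric feature per item; the label
    is 1 when the noisy feature exceeds 0.5. Purely illustrative."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x + rng.gauss(0, 0.1) > 0.5 else 0
        data.append((x, label))
    return data

def train_test_split(data, test_fraction, rng):
    """Shuffle a copy of the data and split it into (train, test)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict(threshold, x):
    return 1 if x > threshold else 0

def accuracy(threshold, data):
    correct = sum(1 for x, label in data if predict(threshold, x) == label)
    return correct / len(data)

def train(train_set):
    """'Training' here only picks the threshold with the best
    accuracy on the training set."""
    candidates = [i / 100 for i in range(101)]
    return max(candidates, key=lambda t: accuracy(t, train_set))

rng = random.Random(0)                       # fixed seed for repeatability
sample = make_sample(500, rng)               # 1. sample
train_set, test_set = train_test_split(sample, 0.2, rng)  # 2. split
model = train(train_set)                     # 3. train
test_accuracy = accuracy(model, test_set)    # 4. evaluate on held-out data
print(f"learned threshold: {model:.2f}, test accuracy: {test_accuracy:.2f}")
print("prediction for new value 0.9:", predict(model, 0.9))  # 5. predict
```

The important design point is that the test set is never seen during training, so the reported accuracy is an estimate of how the model behaves on unknown data.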

Case 1: Using ML in the Public Agency

At the beginning of this year, the Public Employment Service Austria started to deploy new software in its offices. This ML-driven software system provides a prognosis of unemployed people’s chances of being integrated into the labor market. The prognosis sorts people into three groups: high chances (66 % chance of a job in the next 7 months), low chances (< 25 % chance of a job in the next 2 years), and medium chances of finding employment for all others.
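The grouping rule described above can be written down as a small function. This is only a sketch: the thresholds follow the figures quoted in the text, but whether the 66 % bound is inclusive and which rule takes precedence when both apply are assumptions of this example, as is the function name.

```python
def assign_group(p_short_term, p_long_term):
    """Assign a chance group from two predicted probabilities.

    p_short_term: predicted chance of a job within the next 7 months
    p_long_term:  predicted chance of a job within the next 2 years

    ASSUMPTIONS: the 66 % bound is treated as inclusive, and the
    'high' rule is checked before the 'low' rule.
    """
    if p_short_term >= 0.66:
        return "high"
    if p_long_term < 0.25:
        return "low"
    return "medium"

for p_short, p_long in [(0.70, 0.90), (0.10, 0.20), (0.40, 0.60)]:
    print(p_short, p_long, "->", assign_group(p_short, p_long))
```

Note how a continuous probability is reduced to one of three labels, which is the information a consultant would then see.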

Chances for employment

The motivation for its introduction is to assign existing support programs provided by the Public Employment Service more efficiently to unemployed people. Neither people with high chances on the job market nor people with low chances should get assigned to expensive support programs. When the decision to use the software in the Public Employment Service went public, it drew a lot of criticism in the net community. The software was developed by a private research institute, “Synthesis Forschung GmbH,” from Vienna. As a reaction to this criticism, the company released a document that contains a description of two of the 98 models used. The models are based on logistic regression. Let us have a look at the features used:

Translated features (Source: AMS Documentation)

The model is based on a reference person (young, healthy, male, Austrian, service sector). Without taking individual data into account, his chances of reintegration into the labor market are 52 %. Under this model, women are given a negative weight, as are disabled people and people over 50. Women with children are also negatively weighted but, remarkably, men with children are not. After the document was published, people were even more concerned about the question of discrimination.
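To see how a logistic regression model turns such weights into a probability, consider the following sketch. The intercept is chosen so that the reference person receives the 52 % quoted above. The weights are HYPOTHETICAL values picked for illustration; they are not the coefficients from the published documentation, and the feature names are likewise assumptions.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))

# Intercept such that a person with no additional features gets 52 %.
INTERCEPT = logit(0.52)

# HYPOTHETICAL weights, for illustration only (not the real coefficients).
WEIGHTS = {
    "female": -0.1,
    "over_50": -0.7,
    "disabled": -0.6,
    "female_with_children": -0.2,
}

def integration_chance(features):
    """features: set of feature names that apply to the person."""
    z = INTERCEPT + sum(WEIGHTS[name] for name in features)
    return sigmoid(z)

print(f"reference person: {integration_chance(set()):.2f}")
print(f"woman over 50:    {integration_chance({'female', 'over_50'}):.2f}")
```

Each negative weight lowers the predicted chance for every person who has that feature, regardless of their individual situation. This is why the choice of features is itself a value-laden design decision.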

Johannes Kopf, the head of the Public Employment Service Austria, however, is less concerned with these results. He said in an interview about the software: “The algorithm does not make any decisions but calculates the integration chances.”

However, it is not clear how these integration chances are presented on the graphical user interface to the employees of the public agency. Research has shown that it is difficult for nontechnical experts to understand probabilities. It is not clear to what extent the job center employees may deviate from the proposed value, or to what extent consultants are able to give feedback to the system. The software is being developed by a private company. The source code is closed. Information on the sampling strategy is not available. Third parties cannot objectively assess the model quality, for example by adding or removing specific features or by testing with protected features. The decision to use logistic regression has not been justified. As a result of this situation, the discussion about the software is ongoing and people are very concerned about a possible bias in the software. What is bias?

Bias in Software Systems

In the broadest sense, the term bias simply means “imbalance.” According to Nissenbaum, the term bias is used in computer science to refer to computer systems that systematically and unfairly discriminate against certain persons or groups of persons in favor of others. What sources of bias exist in our ML pipeline?

Sources of bias, inspired by (Baeza-Yates, 2019)

Data Bias: Many social media applications, such as Twitter or Facebook, do not represent society.

Another issue is sampling bias. Many medical applications, for example, rely on data from the so-called WEIRD population. This population consists of people from Western, educated, industrialized, rich and democratic (WEIRD) societies, who represent as much as 80 % of study participants but only 12 % of the world’s population. These people are not only unrepresentative of humans as a species; on many measures, they are outliers.
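The two percentages above translate directly into a measure of over-representation. The sketch below uses the upper-bound figures from the text (80 % of participants, 12 % of the world’s population). The reweighting step is one common mitigation (post-stratification), added here as an illustration and not something the cited studies necessarily applied.

```python
def representation_ratio(sample_share, population_share):
    """How over-represented (> 1) or under-represented (< 1) a group is
    in a sample relative to its share of the population."""
    return sample_share / population_share

weird = representation_ratio(0.80, 0.12)
non_weird = representation_ratio(0.20, 0.88)

# Post-stratification weight: population share divided by sample share.
# Multiplying each participant by this weight restores population shares.
weird_weight = 0.12 / 0.80
non_weird_weight = 0.88 / 0.20

print(f"WEIRD over-representation factor: {weird:.1f}")
print(f"representation factor for everyone else: {non_weird:.2f}")
print(f"weights: WEIRD {weird_weight:.2f}, others {non_weird_weight:.1f}")
```

Reweighting corrects the proportions, but it cannot add information about groups that are barely present in the sample.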

Algorithmic or model bias relates to the specific properties of models in relation to the data. It is not the case that every model works well with all data.

An underestimated source of bias is the interaction bias. The way in which we represent the results in the graphical user interface, how well we explain the results, influences how we, as humans, understand the probabilities.

Points of Participation in ML Pipelines

Even more important, because of the scalability of the software, it has an impact on society which again influences the underlying data. It might be that we create a system of self-fulfilling prophecies. However, human-centered ML has developed some answers to these problems, as shown in the following image.

Points of participation inspired by (Dudley & Kristensson, 2018)

I call these points of participation since humans are given the opportunity to participate in the ML workflow with the help of graphical user interfaces. I do not explain them here in detail, but I want to show their usefulness based on a current research project.

Case 2: Using ML in Open Peer Production

You might know Wikipedia. It is an open encyclopedia; in other words, everybody can participate. On a daily basis, English Wikipedia receives too many new edits to check them manually for spam. Over the last ten years, Wikipedia has built a quality assurance system. {I skipped here part of the story to shorten the post} One important component of the quality assurance system is the ML backend service ORES. Each edit can be sent to the ORES service, and in reply, you receive the probability that the edit is spam. ORES is embedded in Wikipedia; its development is openly documented and technical decisions are transparent. Let us have a look at some design decisions. A complete discussion is available in Halfaker and Geiger (2019).
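The request-and-score interaction can be illustrated by parsing the reply of such a service. The response below is HYPOTHETICAL: its field names and nesting are only loosely modeled on a scoring service like ORES and may differ from the real format, so the service documentation should be consulted before relying on it. No network request is made, and the model name `damaging` and the 0.5 threshold are assumptions.

```python
import json

# HYPOTHETICAL response body for two revision ids (not a real API reply).
sample_response = json.loads("""
{
  "scores": {
    "123456": {"damaging": {"probability": {"true": 0.87, "false": 0.13}}},
    "123457": {"damaging": {"probability": {"true": 0.04, "false": 0.96}}}
  }
}
""")

def damaging_probabilities(response, model="damaging"):
    """Map each revision id to the probability that the edit is damaging."""
    return {
        rev_id: models[model]["probability"]["true"]
        for rev_id, models in response["scores"].items()
    }

def needs_review(response, threshold=0.5):
    """Revision ids whose score is at or above the assumed threshold."""
    probabilities = damaging_probabilities(response)
    return sorted(r for r, p in probabilities.items() if p >= threshold)

print(needs_review(sample_response))
```

The service returns a probability rather than a verdict. Turning that probability into an action requires a threshold, and choosing it is a decision left to the tools and people who use the scores.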

An early decision relates to the model selection. Shortly after deploying ORES, they received reports from the community that the prediction models were too strongly oriented against newly registered editors. The ORES team used linear SVM (Support Vector Machine) estimators to create classifiers at the beginning. Because of the feedback from the community, they changed the model type and decided in favor of ensemble strategies, more precisely gradient boosting estimators.

The ORES team also uses these closed feedback loops for the training phase. Experienced Wikipedia editors help to label the data needed to train the models. Based on the feedback, context-specific problems, for example, because of the language, can be quickly discovered by the community, and the model can be adapted by the ORES development team.

A major challenge of using an ML system is the question of determining a satisfactory model quality for a specific context of use, in other words, how to balance the costs of false positives and false negatives. I explain the idea behind it based on the confusion matrix. Let us have a look.

Weighting the costs of false positives and false negatives

You have a so-called true condition that is based on the labeled data, and a predicted condition that is produced by the ML pipeline. You then compare how well your predicted condition corresponds to your labeled, i.e. known, condition. What you would like to achieve is that the prediction is correct, i.e. the numbers of true positives and true negatives are high. However, what about the false negative and false positive cases? This is a trade-off. In Wikipedia, for example, where a “positive” means that an edit is classified as spam, a high number of false positives means that you might reject a higher number of valuable edits, and these might come from newcomers. On the other hand, a high number of false negatives might lower the quality of your articles, because spam goes undetected. Deciding on one or the other depends on the context of use.
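The four cells of the confusion matrix can be counted directly from labeled and predicted values. In this sketch, label 1 stands for “spam” (the positive class) and 0 for “good edit”; the ten labels are invented toy values.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Count the four cells; 1 means 'spam', 0 means 'good edit'."""
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(true_labels, predicted_labels):
        if predicted == 1 and actual == 1:
            cells["tp"] += 1   # spam correctly flagged
        elif predicted == 1 and actual == 0:
            cells["fp"] += 1   # good edit wrongly flagged
        elif predicted == 0 and actual == 1:
            cells["fn"] += 1   # spam that slipped through
        else:
            cells["tn"] += 1   # good edit correctly accepted
    return cells

# Invented toy labels for ten edits.
true_labels      = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]

cells = confusion_matrix(true_labels, predicted_labels)
print(cells)  # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 5}
```

Here one good edit would be wrongly rejected (false positive) and one spam edit would go undetected (false negative).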

In some cases, you want to reduce the number of false rejects because you use a bot. In other cases, you are okay if the system has some undetected spam, because humans will check the edits manually anyway. It is a trade-off and it depends on the situation. We introduce bias into the system on purpose. But we know about the bias and we can control it.
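One way to make this trade-off explicit is to assign a cost to each kind of error and pick the decision threshold with the lowest total cost. The sketch below uses invented toy scores, and the cost values are assumptions chosen only to show how the preferred threshold shifts. The second scenario is added for contrast.

```python
def count_errors(scores, labels, threshold):
    """Return (false_positives, false_negatives) at a given threshold.
    An edit is flagged as spam when its score is >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn

def best_threshold(scores, labels, cost_fp, cost_fn, candidates):
    """Pick the candidate threshold with the lowest weighted error cost."""
    def total_cost(threshold):
        fp, fn = count_errors(scores, labels, threshold)
        return cost_fp * fp + cost_fn * fn
    return min(candidates, key=total_cost)

# Invented toy scores (predicted spam probability) and labels (1 = spam).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

# Scenario 1: a bot acts on flags automatically, so false rejects
# are weighted heavily (assumed costs).
bot_threshold = best_threshold(scores, labels, cost_fp=5, cost_fn=1,
                               candidates=candidates)

# Scenario 2 (assumed for contrast): missed spam is weighted heavily.
patrol_threshold = best_threshold(scores, labels, cost_fp=1, cost_fn=5,
                                  candidates=candidates)

print("bot threshold:", bot_threshold)        # 0.7
print("patrol threshold:", patrol_threshold)  # 0.3
```

The same model and the same scores lead to different thresholds once the costs change. The bias is introduced deliberately, and it is visible and adjustable.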

In my group’s research, we introduced a first prototype for a graphical user interface that allows people to make this decision (Kinkeldey et al., 2019).

Let me sum up what we have learned so far from the ORES development. The values embedded in the Wikipedia community are deeply reflected in the ORES development. The source code, all the models and the training data are openly available. The whole development process is transparent, and the developers are open to the community’s feedback in order to understand its specific needs and to adapt the software if needed.

By contrast, the Public Employment Service Austria has developed software that, according to the head of the agency, aims to increase the efficiency of the support provided. Efficiency is indeed important for governmental agencies; however, it is an economic interest. Shouldn’t we also include societal values, such as equality of opportunity, public interest (Gemeinwohl), individual rights (e.g. privacy), etc. in our technological decisions?

What is needed?

The application of AI in societal contexts should be based on value-sensitive design. Direct and indirect stakeholders should thus be included in the design of the system, and their values should be investigated to reveal tensions. It is important to note that values in AI-driven systems exist on a continuum: at design time, configuration time and run time (cf. Mulligan and Bamberger 2018). We have many concepts to ensure that AI-driven systems follow our values, such as Fairness, Accountability and Explainability, and they are important. However, as Lorena Jaume-Palasí wrote in a recent blog post:

“Transparency is not a principle; it is a means of monitoring compliance with a particular principle or procedure. The social effects of algorithmic systems cannot be controlled by transparency alone.”

This also applies to many other terms in this context (see OECD). However, what idea unites all these quotes?

“The design of technologies is doing ethics by other means.”

And this is what Verbeek describes in his mediation theory. The mediation of human-world relations has two dimensions. Technology, on the one hand, shapes human actions and practices in the world. In Austria, for example, unemployed people in the high-chance and low-chance groups will no longer receive any expensive support from the agency. Technology changes our actions and how we organize our lives. On the other hand, technology shapes how we perceive and understand the world. Take the Austrian case again: for the consultants in the Public Employment Service, unemployed people now belong to only three groups. What impact a decision might have on an individual or on society is not considered.

Technology is involved in almost every dimension of society. This profound impact of technology on our society makes all direct and indirect stakeholders, designers, engineers and policy makers responsible for shaping the impact. As Verbeek denotes:

“Technical mediation does not only helps to understand the moral dimensions of technology but also helps to identify the ethical questions that need to be ask when — before — designing technologies.”

And with this perspective, I conclude my talk and I am now open for your questions/comments/remarks. :)


