Giving Power to Machines

Nirisha Manandhar
Published in Safer I
Oct 10, 2022 · 12 min read

Two and a half years back, when I started my journey as an AI Engineer, I remember taking in a lot of news, mostly because it was the peak of the COVID-19 pandemic. Amongst all the stories about COVID casualties and supposed cures, I recall coming across this particular blog: “160k+ high school students will only graduate if a statistical model allows them to”. The title sounds like clickbait, doesn’t it? So I instantly clicked it, because why not. My initial thought was that it must be some futuristic model that isn’t operational yet. Never had I been so wrong.

A statistical model assigning you academic scores

Apparently, when the COVID-19 pandemic disrupted education systems and daily lives, the International Baccalaureate (IB) Diploma Programme, taken by more than 170,000 students a year around the globe, had to cancel its spring exams. But how did they grade their students, you ask? Not by a straightforward algorithm but by a “statistical model”, they said. As per the IB, the statistical model took into account the following metrics (the third only where available) to assign final grades to students:
1. Coursework grades: the scores you received in previously graded assignments and tests.
2. Predicted grades from teachers: what your teachers think you would have received had you appeared for the examinations under normal conditions.
3. Miscellaneous other data (if available): including a school’s historical record of grades in the given subject.

All looks well and fine, or does it? The blog covered a lot of comparative and visual analyses of the verdicts given by the model, but a few points caught my eye. I will try my best to list out what didn’t feel right to me. Of course, a more detailed analysis of the loopholes can be found in the blog itself (linked below), but here are a few things:
- The first metric looks fair enough. In fact, I know of cases where institutions simply take the first metric as the representative grade when final exams get cancelled.
- The second looks sketchy to me. There is the obvious human bias that one cannot ignore here. On top of that, what if the teacher who taught me changed schools midway, and the new teacher had no idea whatsoever about my performance? I’m not sure how much weight this metric would carry in my final grades, but the whole idea of someone else predicting my performance: no thank you.
- The last one, and the most controversial, is the miscellaneous data. Simply put, they would use the school’s historical data (grades from past graduates at that school), if available, to predict what students would have scored had the pandemic not occurred. And if this data isn’t available for a school, they would instead use a formula based on pooled information from every school taking that subject, built from coursework marks and predicted grades. [1]
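To make the concern concrete, here is a small, purely hypothetical sketch in Python of how a model like this might combine the three inputs. The IB never published its exact formula, so the weights, the fallback regression and the synthetic data below are my own assumptions for illustration only.

```python
# Hypothetical sketch of a grade-assignment model, NOT the IB's actual formula.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Pooled data across "all schools taking the subject": coursework mark and
# teacher-predicted grade vs. historical final grade (all synthetic).
coursework = rng.uniform(1, 7, 500)
predicted = np.clip(coursework + rng.normal(0, 0.7, 500), 1, 7)
final = np.clip(0.5 * coursework + 0.4 * predicted + rng.normal(0, 0.5, 500), 1, 7)
pooled_model = LinearRegression().fit(np.column_stack([coursework, predicted]), final)

def assign_grade(coursework_mark, teacher_prediction, school_history=None):
    """Assign a final 1-7 grade for one student."""
    if school_history is not None and len(school_history) >= 30:
        # With enough history, anchor the grade to how past cohorts at the
        # same school performed. This is where school-level bias creeps in:
        # a strong student at a historically weak school gets pulled down.
        school_anchor = np.mean(school_history)
        raw = 0.4 * coursework_mark + 0.3 * teacher_prediction + 0.3 * school_anchor
    else:
        # No usable history: fall back to the regression on pooled data.
        raw = pooled_model.predict([[coursework_mark, teacher_prediction]])[0]
    return int(np.clip(np.rint(raw), 1, 7))

# The same strong student at two schools with different track records:
print(assign_grade(6.5, 7, school_history=[4] * 40))  # pulled down to a 6
print(assign_grade(6.5, 7, school_history=[7] * 40))  # stays at 7
```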

A lot of things to consider here. What if there weren’t many female students getting good grades in STEM subjects in the past? Does this mean I would blindly be assigned a similar grade too? One might argue that hiding personal information like gender or race from the model will prevent it from generalizing based on these factors, an approach simply termed “fairness through unawareness”. [2]
But have you noticed how a salesperson first guesses your gender or race over a call (without you explicitly mentioning either) and then tailors the pitch accordingly? A statistical model, trying to closely mimic human behaviour, is capable of doing just the same: inferring certain characteristics without any explicit mention and using them to predict final grades. As a result, these models were found capable of discriminating against students based on gender, race and socioeconomic status. [2]
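Here is a tiny synthetic sketch of that failure mode, often called proxy discrimination. Every feature and number in it is made up; the only point is that dropping the protected attribute isn’t enough when another feature correlates with it.

```python
# Toy illustration of why "fairness through unawareness" can fail:
# the protected attribute is dropped, but a correlated proxy leaks it back in.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

gender = rng.integers(0, 2, n)                   # protected attribute (0/1)
# A proxy that agrees with gender 80% of the time, e.g. a subject choice.
proxy = np.where(rng.random(n) < 0.8, gender, 1 - gender)
coursework = rng.normal(5, 1, n)                 # genuinely relevant feature

# Historical "high grade" labels that were themselves biased towards group 1.
label = (coursework + 0.8 * gender + rng.normal(0, 1, n)) > 5.4

# Train WITHOUT the protected attribute: "fairness through unawareness".
X = np.column_stack([coursework, proxy])
model = LogisticRegression().fit(X, label)
pred = model.predict(X)

print("Predicted high-grade rate, group 0:", pred[gender == 0].mean())
print("Predicted high-grade rate, group 1:", pred[gender == 1].mean())
# The gap persists: the model reconstructs the hidden attribute from the proxy.
```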

The results? Well, there was a risk of poorer students getting unfair grades because they were judged on the track records of their schools rather than their individual performances. Moreover, after the results were actually published, 25,000 of the roughly 170,000 students in the May 2020 session signed a petition, “Justice for May 2020 IB Graduates — Build a Better Future! #IBSCANDAL”, demanding that the IB “take a different approach with their grading algorithm and to make it fairer”. [3]

Unfortunately, the IB didn’t walk back its decisions. Many graduates complained of being rejected by their dream colleges and courses because their final grades didn’t meet the acceptance criteria. The statistical model definitely looked promising at first glance, but biases, both human and machine, lurked at different stages. All of this comes down to one big question:

“Will a student get to study what they choose in the future?”

figure: A woman looking through a bulb-shaped window

Is Nepal ready?

AI and statistical models today are used in many more fields than just the cool stuff we see surfacing on the internet: policy-level decisions, risk prediction systems, fraud detection, recommendation systems and so much more. In the context of Nepal, though, that is still a distant reality. The AI industry in Nepal is taking its baby steps, with applications mostly in the Natural Language Processing (NLP) and Computer Vision (CV) domains, and the use cases usually revolve around chatbots, object detection, simple recommendation systems or sales prediction.

As for deploying AI systems for policy-level decision-making, I believe there are three major roadblocks:
1. Organizational data isn’t well documented. If anything, the data primarily exists in chunks of paper files stored in a dusty cupboard that hasn’t been opened in years. While organizations are migrating paper-based information to digital databases, it will take a long time before this data is ready for data-driven decision-making systems.
2. No matter how much the AI community talks about shifting from black-box systems to more transparent and interpretable ones, AI interpretability comes at a cost. It is no surprise that developers first target the lowest error they can reach within the given deadline, and only then think about whether the system makes sense to end users in the first place.
3. People in decision-making positions within organizations aren’t always well-versed in the internal workings of AI systems and how machines reach a certain decision. This makes it even more difficult for them to question the fairness of such systems in the long run, let alone come up with effective solutions.

So, all in all, there aren’t many operational AI systems at public institutions in Nepal to scrutinize in terms of fair use and ethics just yet. Still, this area really caught my attention. If AI were to take over the world, are we ready for a world where these machines make decisions for us? Or are we already living in such a world? And if we are, can we fully trust the machines with their decisions? *sips tea and closes laptop*

An algorithm deciding your right to receive Childcare Benefits

A few weeks back, as I was working on a data analysis project for a client, I came across an online workshop organized by ECNL on automated decision-making by public institutions. As someone who was interested in responsible and ethical AI but had never had much first-hand exposure to it, I decided to sign up, as the presentation lineup looked quite insightful.

Fast forward to the workshop itself: it gave me a major reality check. Firstly, there is a world outside Nepal. Secondly, people have blind faith in AI. I remember two specific presentations: one where the presenter talked about the childcare benefits scandal in the Netherlands, and another that showcased VioGén, a system dealing with the rising cases of gender-based violence in Spain.

The Netherlands case was a classic example of the bias in AI that we study in theory. Childcare benefit in the Netherlands, in simple terms, is a scheme where part of the childcare costs is covered by the state; it is available to families in which all parents are either employed or enrolled in secondary or tertiary education or a civic integration course [4]. Between 2013 and 2019, around 26,000 parents in the Netherlands were left in debt after being falsely accused by a black-box risk classification model used to administer these benefits. This algorithmic decision-making system included self-learning elements to create risk profiles of childcare benefits applicants who were supposedly more likely to submit inaccurate applications and renewals and potentially commit fraud.

picture credits: https://www.amnesty.org/en/documents/eur35/4686/2021/en/

Now, one of the reasons this system was put in place was to prevent fraudulent claims, especially after Bulgarian migrants and childminding agencies were seen taking advantage of the social welfare system back in 2013. Since the black-box system was continuously learning from past fraud records and its own results, it started making accusations that shared one distinct feature: parents with an immigration background received higher risk scores. This is quite expected behaviour, isn’t it? When you feed a model data that includes dual citizenship information and records showing that many Bulgarian nationals previously committed fraud, it is pretty obvious that the model will learn this pattern and assign higher risk scores to anyone who holds Bulgarian dual citizenship.
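The “learning from its own results” part is the most insidious bit, so here is a deliberately simplified simulation of that feedback loop. It is not how the Dutch system actually worked internally (those details aren’t public); every number, including a fraud rate that is identical for both groups by construction, is an assumption made purely for illustration.

```python
# Sketch of a feedback loop: a "self-learning" risk model retrained on the
# results of the investigations it triggered. Fraud is equally likely in both
# groups, yet the learned weight on dual citizenship only grows.
import numpy as np

rng = np.random.default_rng(1)
TRUE_FRAUD_RATE = 0.02          # identical for both groups, by construction
weight_dual = 0.5               # initial extra risk weight for dual citizenship

for year in range(5):
    # 100,000 applicants per year, 20% of them with dual citizenship.
    dual = rng.random(100_000) < 0.2
    fraud = rng.random(100_000) < TRUE_FRAUD_RATE         # independent of `dual`
    risk = 1.0 + weight_dual * dual + rng.normal(0, 1, 100_000)  # other factors as noise

    # Only the top 10% of risk scores get manually investigated.
    investigated = risk >= np.quantile(risk, 0.9)
    confirmed = investigated & fraud        # fraud is only "found" where we look

    # "Self-learning": raise the weight in proportion to how over-represented
    # dual citizens are among confirmed cases. This mirrors where investigators
    # were sent, not where fraud actually is.
    share_dual = confirmed[dual].sum() / max(confirmed.sum(), 1)
    weight_dual += 0.5 * (share_dual - 0.2)
    print(f"year {year}: dual-citizen share of confirmed fraud = {share_dual:.2f}, "
          f"learned weight = {weight_dual:.2f}")
```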

While the Dutch government publicly disapproves of racial profiling, it continues to allow ethnicity and other prohibited grounds of discrimination to be used as risk factors, as a basis for suspicion and for decision-making in law enforcement. [5] Although the authorities claim to have deleted the database containing dual-citizenship information, the process still hasn’t been given the level of transparency demanded by the people.
There are other issues with this process as well. It is not only a question of false accusations but also of people’s personal data being used without their consent in the first place. Even worse is how this data became a source of discriminatory decision-making. For multiple violations of the General Data Protection Regulation, the Dutch government was fined almost 2.75 million euros.

There you have a case of

“What could go wrong? Everything.”

An algorithm deciding your right to receive protection against Gender-based Violence

Another interesting case came from Spain, one of the countries where gender-based violence is a pressing problem. To tackle this burning issue, the authorities decided to put matters into an algorithm’s hands. VioGén is a system operational in Spain since 2007, with the objectives of bringing together all public institutions with competence in the area, predicting risk, and monitoring and protecting victims of gender violence [6]. Here’s how it works: it asks you a set of 35 yes/no questions detailing your case of gender-based violence and assigns you one of five risk levels: unappreciated, low, medium, high or extreme. The questions cover things like:
- Physical violence
- Sexual violence
- Use of weapons
- Death threats
- Suicidal thoughts
- Harm to the child
- Addiction
- Mental disorder
- Aggression toward pets
- Signs of jealousy
and so on.

The questions look pretty straightforward, at least for now. Things start going downhill when humans are involved in the process. That should be a good thing, shouldn’t it? Well, not really, if they just assign weights to the questions manually and leave the rest to the model: for example, more weight to physical violence and harm to kids than to signs of jealousy and mental disorders. The weighted sum a victim receives from the system determines the frequency of police intervention and the protection they receive, as it assesses their likelihood of encountering future aggression by the same perpetrator [6]. For instance, someone with an “unappreciated” risk score will be denied further police protection, while a woman falling in the extreme band would receive frequent interventions and constant protection.
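To make those mechanics concrete, here is a tiny sketch of a weighted-sum scorer of this kind. The weights, the thresholds and the trimmed-down question list are entirely my own assumptions (VioGén’s real parameters aren’t public); the point is only how fixed manual weights plus hard cut-offs translate directly into who gets protection.

```python
# Hypothetical weighted-sum risk scorer, NOT VioGén's actual weights or cut-offs.
RISK_BANDS = ["unappreciated", "low", "medium", "high", "extreme"]

# Made-up manual weights per yes/no indicator (the real system uses 35 of them).
WEIGHTS = {
    "physical_violence": 3.0,
    "use_of_weapons": 3.0,
    "death_threats": 2.5,
    "harm_to_child": 2.5,
    "sexual_violence": 2.0,
    "mental_disorder": 0.5,    # deliberately under-weighted, mirroring the critique
    "signs_of_jealousy": 0.5,
}
THRESHOLDS = [2.0, 4.0, 6.0, 8.0]  # hypothetical cut-offs between the five bands

def risk_level(answers: dict) -> str:
    """Map yes/no answers to a risk band via a fixed weighted sum."""
    score = sum(WEIGHTS[q] for q, yes in answers.items() if yes)
    band = sum(score >= t for t in THRESHOLDS)   # number of thresholds crossed
    return RISK_BANDS[band]

# A case dominated by psychological indicators lands in the lowest band,
# and the lowest band means no further police protection.
case = {"signs_of_jealousy": True, "mental_disorder": True, "physical_violence": False,
        "use_of_weapons": False, "death_threats": False, "harm_to_child": False,
        "sexual_violence": False}
print(risk_level(case))   # -> "unappreciated"
```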
And apart from the manual weights (which could be biased depending on the person assigning them), the design rests on a few other assumptions:
- The assigned weights allow no exceptions: regardless of how severe a particular case is, the weight each question carries in the final risk score doesn’t change.
- The victims understand the questions and are informed about the scoring system.
- The victims are in a stable enough emotional state to answer the questions accurately.

In its more than 13 years of operation, the system has evaluated over 3 million cases. To find out whether it was producing practical results, it was externally audited by the Eticas Foundation in 2021. The findings were pretty shocking:
- Only 35% of the interviewed women were informed about their VioGén risk score.
- Over 80% of the women interviewed reported problems with the VioGén questionnaire, stating that questions were generic and in some cases ambiguous.
- 48% of the women interviewed evaluated their experience with the system negatively; 32% highlighted both negative and positive aspects, and only 19% evaluated their overall experience with the VioGén system positively.
- Only 3% of the women who are victims of gender violence receive a risk score of “medium” or above and, therefore, effective police protection.
- Cases of psychological violence and women without kids were undervalued by the algorithm because of the manual weights given to these aspects. [6]

All these loopholes aside, one other thing went completely wrong. The system was designed as a recommendation system to assist the police in their judgement; they even had full authority to manually increase the risk scores when needed. But this rarely happened: 95% of police officers and authorities reported that they did not re-evaluate the system’s decision and simply moved forward with the predicted outcome.
Now, this says a lot about how blindly the system is trusted. With approximately 45% of cases receiving a score of “unappreciated” [6], often without any further review of that score, it is highly likely that victims go through self-doubt and never come back for re-evaluation. As a result, between 2003 and 2021, 71 women were murdered after having previously filed a report without obtaining a risk level that entails police protection [6]. Quite a heavy price to pay for negligence, ain’t it?

The Dynamics of Power

Listening to and reading about all these incidents took me back to a very simple yet complicated question: what actually is power? Or, to be precise, since we are dealing with society, what is social power? We have grown up believing that money is power and education is power. Sometimes seniority and leadership bring power. But recently I read this amazing book by Srilatha Batliwala titled “All About Power”, where she presents social power as “the capacity of different individuals or groups to determine who gets what, who does what, who decides what, and who sets the agenda.” [7]

figure: All about power by Srilatha Batliwala

It makes complete sense, doesn’t it? But in the case of machines and algorithmic systems, where is this social power transferred to, and from whom? If we go back to the three cases I have highlighted so far, there is a clear similarity: people who were already in positions of power, such as:

- State authorities
- Institution Admin members
- Police and Law Authorities

vested their decision-making powers in machines. And along with that power, systemic and historical human biases transferred as well, all of it affecting the lives of individuals and groups already suppressed by the social system. These machines were deployed on a strong foundation of trust, no doubt:

- Trust in historical “unbiased” data — because data is king.
- Trust in the “unbiased” algorithms — because accuracy and metrics cannot lie.
- Trust in the “unbiased” human input, be it in manually weighting certain factors or in the different ways the data was collected.

While these systems can make manual work ten times faster, it is important to note that they can also reinforce existing power imbalances rooted in caste, race, gender and other social factors. The earlier we acknowledge this, the better we can take precautions and address these biases before they cause harm.

In fact,

what good is technology if it is not fair and just for all?

References:

[1] https://www.ibo.org/globalassets/new-structure/covid-19/pdfs/assessment-model-letter-may-2020-en.pdf
[2] http://positivelysemidefinite.com/2020/06/160k-students.html
[3] https://www.wired.com/story/algorithm-set-students-grades-altered-futures/
[4] https://en.wikipedia.org/wiki/Dutch_childcare_benefits_scandal
[5] https://www.amnesty.org/en/documents/eur35/4686/2021/en/
[6] https://eticasfoundation.org/wp-content/uploads/2022/03/ETICAS-FND-The-External-Audit-of-the-VioGen-System.pdf
[7] https://creaworld.org/wp-content/uploads/2020/07/All-About-Power.pdf
