Datasets Can Only Be Unbalanced, Not Biased. But Humans Can Be.

TLDR: Datasets, architectures, models, metrics, algorithms, and math cannot be biased. Humans making choices are biased, by the very nature of the act of choosing. And when those choices lead to bad outcomes, the people making the choices should be held responsible. Professionals have standards.

Arian Prabowo
The Startup
9 min read · Feb 13, 2021


Epistemic status: I started writing this minutes after I finished watching the NeurIPS 2020 keynote “You Can’t Escape Hyperparameters and Latent Variables: Machine Learning as a Software Engineering Enterprise”. So I haven’t put much thought into it. What do I know, right? This is as under-qualified and over-opinionated as it gets.

Tools cannot be biased. The choice of tools can. Photo by Ashim D’Silva on Unsplash

Background story: Yann LeCun vs Timnit Gebru

Months ago, I first heard about the controversy around Timnit Gebru’s exit from Google. I call it an “exit from Google” because that’s what Wikipedia calls it. This inevitably led me to the story about her exchange with Yann LeCun regarding the Face Depixelizer that turns Obama white. Just from reading that one article, I was completely on Yann LeCun’s side: this is a simple problem of an unbalanced dataset, and that’s the end of the story. To be fair, that article doesn’t talk about Gebru’s arguments at all (which is not the article’s fault; it is just reporting a controversy).

And then, a few minutes ago, I finished watching the NeurIPS 2020 keynote “You Can’t Escape Hyperparameters and Latent Variables: Machine Learning as a Software Engineering Enterprise” by Charles Isbell. And now I’m on the complete opposite side. I now understand why Gebru was complaining about “framing”. Although I’m super late to the conversation (not that I’m anyone whose voice has enough weight to join it anyway), this would be my response to Yann LeCun’s original tweet.

My current response to Yann LeCun’s tweet

Face Depixelizer that turns Obama white. From Twitter user Chicken3gg: https://twitter.com/Chicken3gg/status/1274314622447820801

For reference, here is LeCun’s original tweet:

“ML systems are biased when data is biased. This face upsampling system makes everyone look white because the network was pretrained on FlickFaceHQ, which mainly contains white people pics. Train the *exact* same system on a dataset from Senegal, and everyone will look African”

Personal story: near miss with medical malpractice

Picture unrelated. Photo by Austrian National Library on Unsplash

Story time! I once felt very sick, so my parents brought me to a hospital. The doctor said, “It’s appendicitis, it needs an operation.” My parents said “sure”. I said “sure”. Surgery scheduled. And then a visitor came and said, “He looks yellow.” My parents said, “He looks yellow.” I said, “Blaargh,” because I was literally vomiting. The doctor said, “Let’s do a blood test.” The blood test said “hepatitis A”. Surgery canceled.

If the surgery had gone ahead, nobody knows what would have happened. But chances are, it would not have been nice. In more academic terms:

The outcome of patients with acute viral hepatitis undergoing general anaesthesia has never been prospectively investigated (40). In one retrospective study, 9.5% of patients with acute viral hepatitis undergoing laparotomy died, and 12% developed significant morbidity (40).

Lentschener, C., and Y. Ozier. “What Anaesthetists Need to Know About Viral Hepatitis.” Acta Anaesthesiologica Scandinavica, vol. 47, no. 7, 2003, pp. 794–803, doi:10.1034/j.1399-6576.2003.00154.x.

But let’s say the visitor hadn’t come, the surgery had gone ahead, and complications had happened. And then my family tweeted: “botched surgery by bad doctor.” LeCun’s reply is akin to replying to my family’s tweet with “anaesthetic toxicity was enhanced by a liver undermetabolism, leading to morbidity.” That would be a 100% correct statement that misses the whole point entirely. That’s why Gebru was complaining about the “framing”.

The question about the interaction of anaesthesia and hepatitis is an important technical question that should be (and is being) discussed within the medical community. But that would not be the question that my family and I would be interested in. The big questions would be: How could this happen in the first place? Was a blood test not part of the standard procedure? How about simply looking at the whites of the patient’s eyes? These were unacceptable and preventable mistakes.

The other big question is “who should be responsible?” The doctor who ordered the surgery? The surgeon? The nurse who conducted the physical examination? The anaesthetic specialist? Maybe all of the above? I don’t know because I’m not in the medical field. But I’m very sure at least one person should be responsible for this would-be malpractice.

If my mom asked “why did my son die?”, answering it only with “anaesthetic toxicity” is not only callous, but also a refusal to take responsibility.

Choice of dataset

PULSE https://github.com/adamian98/pulse

Back to the Face Depixelizer. Before, I thought this was a simple technical issue. The technical problem is that the dataset is unbalanced. The technical solution is to get better datasets, or to over/undersample, or to weight the samples, or a myriad of other techniques (a minimal sketch of two of them follows below). But the bigger issue at hand is not “how do we fix this mistake?”, but “why did this happen in the first place?”.
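To make that technical side concrete, here is a minimal sketch of two of the rebalancing techniques mentioned above: oversampling rare groups, and reweighting the loss. It assumes a PyTorch classification setup; the variable names (train_labels, train_dataset) are hypothetical placeholders, not anything from the PULSE code.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical, heavily unbalanced group labels: 8 samples of group 0, 2 of group 1.
train_labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
class_counts = np.bincount(train_labels)

# Option 1: oversample the rarer group so each training batch is roughly balanced.
sample_weights = 1.0 / class_counts[train_labels]  # rarer group -> larger sampling weight
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(train_labels),
    replacement=True,
)
# loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, sampler=sampler)

# Option 2: keep the sampling as-is, but weight the loss so rare groups count more.
class_weights = torch.as_tensor(
    len(train_labels) / (len(class_counts) * class_counts), dtype=torch.float
)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```

Neither option creates information the dataset never contained; it only changes how much each existing sample counts, which is itself a choice someone has to make.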

Yes, all of the above are important technical challenges, and they are being addressed right now. That is a good thing. However, in public spaces, people are asking a completely different set of questions: “Why did CVPR publish a paper with an unbalanced dataset without discussing the important ways in which the dataset is unbalanced?” In other words, “why is AI racist?”

(Of course datasets, architectures, models, metrics, algorithms, and math cannot be biased, just like anaesthesia the chemical cannot commit malpractice. AI and ML are, unfortunately, not yet legible to the public. We cannot blame the public for using the wrong jargon and framing. It is on us to make the field more legible.)

The quick answer is: because getting a balanced dataset is hard and expensive. But that’s not the real answer. More difficult things have been done, and more money has been spent on more trivial things. The real answer is that CVPR does not care. They don’t care about using a dataset that is balanced in terms of age, ethnicity, sex, or Fitzpatrick skin type. And that’s the crux of the issue.

Beyond datasets

Choices. Photo by Egor Myznik on Unsplash

Isbell’s talk also opened my eyes to what lies beyond datasets. Bias can creep in through more than just data. Another good example is metrics: micro versus macro F1 scores. A set of metrics has to be chosen by humans, and that choice is a reflection of our values as humans: what we care about more and what we care about less. In fact, every single choice we make is biased, because that’s the nature of choices. The sketch below shows how much that one choice can matter.
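Here is a minimal sketch of how the choice between micro and macro F1 encodes a value judgement, using scikit-learn on a made-up, heavily unbalanced label set (the numbers are purely illustrative).

```python
from sklearn.metrics import f1_score

y_true = [0] * 90 + [1] * 10   # 90 samples of a majority class, 10 of a minority class
y_pred = [0] * 100             # a model that only ever predicts the majority class

# Micro-averaging counts every sample equally, so the majority class dominates.
print(f1_score(y_true, y_pred, average="micro", zero_division=0))  # 0.90
# Macro-averaging counts every class equally, exposing the failure on the minority class.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47
```

The model and the math are identical in both lines; only the human choice of averaging changes, and with it the story the number tells.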

This is where the biases reside. Not in datasets, architectures, models, metrics, algorithms, or math, but in our decisions when we choose one dataset over another, one metric over another, one architecture over another.

A good example: why do we have so many face-related tasks and datasets? The Face Depixelizer is one, as well as YOLO and thispersondoesnotexist.com. Why not elbows or toenails? Because the researchers, the funders, and humans in general care much more about looking at, generating, and differentiating human faces. This is simply a fact, not something good or bad. Bias exists, and it has the potential to lead to bad outcomes.

Getting professional

“But what if I don’t care? I mean, I agree that fairness in AI is an important issue, that people should be working on it, that top venues should care about it more, and that more funding should be put into it. But I just personally don’t care. That’s not the kind of research problem that I want to solve. It is not my cup of tea. I think cancer and black holes and climate change and poverty are important too. The fact that I’m researching something else does not mean that I belittle any of those topics. It is just that I have limited resources and I want to research the topics that I am interested in. And in practice, that simply means picking the most convenient datasets, metrics, architectures, modules, and so on.”

(That’s not just me making a strawman. That is basically me. Right now I’m working on traffic prediction, and I keep thinking to myself, “it would be kind of nice to be working with datasets that I am familiar with.” But I did my research, and the available datasets are biased towards a few places on Earth. Before, my thought was simply “this is not fun, but expected”. Thanks to Isbell’s talk, I realized the implication: my models, and most people’s models, are biased towards certain kinds of traffic. Whether or not they will generalize to other geographies is still a huge unanswered question. And I’m not going to kid myself. There is a negligible chance that I’m going to put any real effort into making my set of datasets more diverse in the near future. I have bigger fish to fry, like graduating. So there’s that.)

Professional. Photo by Javier Reyes on Unsplash

On a personal level, I guess that’s perfectly fine. But as a community of experts, someone has to be responsible. This is not just about being more careful in deployment, but about setting up professional standards. In medicine, law, accounting, and aviation, the standards set out who is responsible for what. Mistakes are not tolerated. Mistakes are evidence of malpractice, and we need to figure out exactly who dropped the ball, or whether the standards themselves should be updated.

ML experts are not professionals, yet. There are no standards, no code of ethics. A failure, however catastrophic, is a technical mistake, not malpractice. No one is responsible because no one has been assigned responsibility. The problem is that the public is less and less forgiving when it comes to mistakes by ML and AI. Either we regulate ourselves, or non-experts will come in with nonsensical, heavy-handed regulations, or, worst of all, mistakes will continue to be made, burdening everyone, especially the most vulnerable in society.

If I’m making models, I don’t want to get sued because someone handed me a bad dataset. And when I’m preparing a dataset, I don’t want to get sued because someone picked a bad metric. Or maybe I should be responsible for both, regardless of my role. Maybe there should be a duty of care and multiple checks, just like a teacher has to do mandatory reporting if they suspect someone else is abusing a child, or how a pharmacist can’t just blindly hand over whatever the doctor prescribed. Maybe ML engineers should check a model card signed off by ML professionals before deployment, or face legal consequences (a hypothetical sketch of what such a check might look like follows below). I don’t know how it should work, but I know that these are the questions we should be asking.
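The following is a purely hypothetical sketch, not an existing standard or API, of what a machine-readable model-card check before deployment might look like. The field names loosely follow the sections proposed in Mitchell et al.’s “Model Cards for Model Reporting” (2019); everything else, including the example values and the sign-off field, is made up for illustration.

```python
# Hypothetical illustration only: a deployment gate that refuses to ship a model
# unless its model card is complete and someone accountable has signed it off.
REQUIRED_FIELDS = [
    "model_details", "intended_use", "evaluation_data", "metrics",
    "ethical_considerations", "caveats_and_recommendations", "signed_off_by",
]

def check_model_card(card: dict) -> None:
    """Raise an error, blocking deployment, if any required field is missing or empty."""
    missing = [field for field in REQUIRED_FIELDS if not card.get(field)]
    if missing:
        raise ValueError(f"Model card incomplete; deployment blocked. Missing: {missing}")

example_card = {
    "model_details": "Face upsampling demo, v0.1",
    "intended_use": "Research demonstration only; not for identification or surveillance",
    "evaluation_data": "Held-out face dataset split; known skew in age and skin type",
    "metrics": "Reconstruction quality plus per-subgroup error rates",
    "ethical_considerations": "Outputs skew towards over-represented groups",
    "caveats_and_recommendations": "Do not deploy without subgroup evaluation",
    "signed_off_by": "responsible.engineer@example.com",
}
check_model_card(example_card)  # would raise if any field above were left empty
```

The point is not this particular schema, but that a named person signs off and a concrete check stands between a model and its deployment.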

(And that is just on the deployment side. There are also questions about ethics in ML research. Are broader impact statements actually working? I don’t know.)

More than social justice

For me, this is about more than just social justice. When bias in society gets compounded into bias in technology, both in research and in deployment, resulting in further harm to minority groups, that is a tragedy. But it can still get worse. Bias in science, leading to bad science, can hurt, has hurt, and will hurt everyone.

In the meantime, while we are getting there, when someone asks, “why did my son die?”, the appropriate answer is not “anaesthetic toxicity”, but “I’m sorry. There has been malpractice. There will be consequences for the people responsible. Here is some compensation, though some hurt cannot be undone.”

References

  • The sources for all images are given in their captions.
