Data: Garbage In, Garbage Out

Moomal Shaikh
7 min read · Nov 30, 2022


I’m going to argue that the most important and urgent real-world problem for us to solve as a global tech community is how we improve the quality of data that goes into Artificial Intelligence systems: a pivotal step before we can focus on scaling AI for all the positive potential it has.

We already rely on AI so heavily in our day-to-day lives that it’s important for the collective “us” to understand what we’re dealing with. AI depends on data to exist. In fact, the data behind the algorithm matters far more than the algorithm itself.

Garbage in ➡️ Garbage out.

[Illustration: a boy throws smelly garbage into a trash can representing the AI algorithm; the can turns out broken light bulbs, symbolizing bad ideas and outcomes from bad data.]

Three Aspects of Data to Examine More Closely:

  1. Data quality for training AI
  2. Infrastructures to gather, store, and process data
  3. Ethics in Data & AI

Data Quality for Training AI

During the design phase of an AI algorithm, teams determine where the data to train it will come from. Biased data creates biased algorithms, and ultimately biased outcomes and decisions. The real-world implications are far-reaching and quite dangerous. If you’re like me, you learn best with examples:

  • Healthcare | X-ray AI Models: If only X-rays of men are used to train an AI algorithm for image detection, the AI may fail to recognize diseases when tasked with diagnosing a woman (sketched in code after this list).
  • Security & Surveillance | Smart AI Cameras: If the training data fed to smart AI security cameras is drawn only from US news coverage of Muslims from the last 10 years, the system will learn to treat anyone with physical features from that region, or anyone who practices Islam, as a threat. A similar unfortunate application is the security surveillance of African American communities, which we’re all too familiar with.
  • Facial Recognition | Social Media Tagging: If the dataset used to train the AI algorithm contains primarily caucasian faces and features, the algorithm will exclude those of other ethnicities. This goes much deeper into the topic of representation for all, the negative self-fulfilling prophecies it can feed, and the barriers it creates for progress. On the flip side of the same application is the concern of surveillance by security forces, ultimately perpetuating unfair discrimination against certain communities.
  • Content Recommendation: If AI training data is built by those with limited experiences, perspectives, and backgrounds, these recommendation engines can draw lines between what content is recommended to certain groups, perpetuate narratives, limit critical thinking, and restrict access to new information. This also touches on availability bias, where people believe the content they read because it is the only content available to them.
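To make the X-ray example concrete, here is a minimal sketch (not a real diagnostic model; the single “imaging feature,” the group shift, and the 95/5 split are all made-up assumptions) of how a skewed training sample produces skewed accuracy across groups:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_patients(n, group_shift):
    # Hypothetical 1-D "imaging feature"; the disease signal sits at a
    # slightly different location for each demographic group.
    healthy = rng.normal(0.0 + group_shift, 1.0, (n, 1))
    sick = rng.normal(2.0 + group_shift, 1.0, (n, 1))
    X = np.vstack([healthy, sick])
    y = np.array([0] * n + [1] * n)
    return X, y

# Training set: 95% men (shift 0.0), only 5% women (shift 1.5).
X_men, y_men = make_patients(950, group_shift=0.0)
X_women, y_women = make_patients(50, group_shift=1.5)
model = LogisticRegression().fit(
    np.vstack([X_men, X_women]), np.hstack([y_men, y_women])
)

# Evaluating on balanced held-out sets exposes the gap: the decision
# boundary was fit to men, so accuracy drops for women.
for name, shift in [("men", 0.0), ("women", 1.5)]:
    X_test, y_test = make_patients(500, group_shift=shift)
    print(name, "accuracy:", round(model.score(X_test, y_test), 2))
```

The model isn’t “wrong” by its own metrics; it simply never saw enough of one group to serve them well.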

Recruitment algorithms, loan applications, friend suggestions. The list goes on. But you get the idea.

“Data doesn’t lie. People do. If your data is biased, it is because it has been sampled incorrectly or you asked the wrong question (whether deliberately or otherwise).”
- Lee Baker, Truth, Lies & Statistics: How to Lie with Statistics

If the foundational training data is biased and incomplete, that same algorithm (or even an improved version of it) will keep learning from that flawed foundation with more usage, further exacerbating the problem.
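As a toy illustration of that feedback loop (every number here is invented), consider a recommender retrained on the very clicks it shaped. Items shown more become more familiar, familiar items get clicked more, and each retraining round narrows what users see:

```python
import numpy as np

# Two topics the user actually likes equally; the initial dataset
# over-represents topic A.
exposure = np.array([0.8, 0.2])      # share of recommendations per topic
familiarity = exposure.copy()        # users click what they recognize

for round_ in range(6):
    clicks = exposure * familiarity               # observed engagement
    exposure = clicks / clicks.sum()              # retrain on those clicks
    familiarity = 0.5 * familiarity + 0.5 * exposure
    print(f"round {round_}: topic A share = {exposure[0]:.2f}")
```

Run it and topic A’s share climbs toward 1.0: the echo chamber builds itself without anyone intending it.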

My first real jolt to reality on this issue was when Donald Trump won the presidential election in 2016. I realized I had been in an echo chamber based on what content was designed for me, and I continued to be fed more of that content theme as I continued to consume it.

Downside? I felt totally blindsided by the results of the election.

Upside? I am now hyper-curious and have sharpened my critical thinking skills.

Infrastructures to Gather, Store, and Process Data

The reality is we haven’t been following much of a standardized method or system for how we collect, store, and process data. The result is enormous amounts of data spread across platforms that don’t play nice with each other: very siloed systems without seamless integrations between them to share and combine data. This is not to say all systems are like this (many are in the process of resolving the concern), but it remains a real problem for the tech community to address in order to maximize the value of data from various sources.

And worse? The quality of data collected by each system varies, leading to inaccuracies and inconsistencies when combined with other datasets. A pretty awful cocktail of problems for the “data-driven strategy” you hear everyone talking about.
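Anyone who has merged two such systems will recognize this minimal sketch. The schemas, identifiers, and values below are entirely hypothetical, but the pattern (same customers, different keys, different date formats, different units) is the everyday reality of siloed data:

```python
import pandas as pd

# System 1: a CRM with ISO dates and dollar amounts.
crm = pd.DataFrame({
    "customer_id": ["C-001", "C-002"],
    "signup_date": ["2021-03-01", "2021-04-15"],
    "revenue_usd": [1200.0, 560.0],
})

# System 2: billing, with its own key style, US-style dates, and cents.
billing = pd.DataFrame({
    "cust": ["c001", "c002"],
    "signup": ["03/01/2021", "04/15/2021"],
    "revenue": [120000, 56000],
})

# Keys, dates, and units must be normalized before any join;
# skipping this step silently corrupts a "data-driven" analysis.
billing["customer_id"] = billing["cust"].str.upper().str.replace(r"^C", "C-", regex=True)
billing["signup_date"] = pd.to_datetime(billing["signup"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
billing["revenue_usd"] = billing["revenue"] / 100

merged = crm.merge(
    billing[["customer_id", "signup_date", "revenue_usd"]],
    on="customer_id", suffixes=("_crm", "_billing"),
)
print(merged)
```

Multiply this by dozens of systems and years of inconsistent entry, and you have the infrastructure problem in a nutshell.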

Ethics in Data & AI: It’s Complicated.

To make any meaningful progress in developing a standard of ethics for technology and AI, we must first acknowledge how incredibly complex the issue of ethics is. What one group considers “moral” and “right” could be completely obscene and offensive to another group, held with exactly the same conviction.

In 2017, I attended a phenomenal talk by Michael Schidlowsky at the Flatiron School in NYC that continues to inspire me to this day. He walked the audience through a number of thought experiments to illustrate the complexity behind what we “consider” ethics and morals, how quickly we jump to conclusions initially, and how the lines get blurry when it’s time to execute on them.

My favorite thought experiment: the trolley dilemma. This experiment is an actual real-life dilemma for those designing and training self-driving cars today!

In the classic version of the experiment, a trolley is heading down a track toward five people who are tied to the track and unable to move out of its path. The person making the decision must choose between two options: either allow the trolley to continue and kill the five people, or divert the trolley onto a different track where it will kill only one person. (This description was generated using OpenAI Playground.)

Let’s take this one step further. How would YOU choose to train a self-driving car algorithm if making the choice between killing or saving an old person vs a child? A man vs a woman? A black person vs a white person? A pregnant woman vs a woman with a small child in her arms? A man with an amputated leg vs a perfectly healthy able-bodied man?

Uncomfortable yet? Yeah, it’s complicated.
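Here is a deliberately uncomfortable sketch of why it’s complicated. Every weight below is a placeholder I made up, not a recommendation: the point is that any collision policy a car ships with reduces the dilemma to numbers someone had to choose.

```python
# Hypothetical harm weights: who decides these values? Setting them
# all equal is itself an ethical choice, and so is every alternative.
HARM_WEIGHTS = {
    "child": 1.0,
    "adult": 1.0,
    "elderly": 1.0,
}

def expected_harm(outcome):
    """Sum the chosen weights over the people harmed in one outcome."""
    return sum(HARM_WEIGHTS[person] for person in outcome)

def pick_path(paths):
    # The car "solves" the trolley problem by minimizing a number,
    # which simply hides the ethical judgment inside HARM_WEIGHTS.
    return min(paths, key=expected_harm)

print(pick_path([["adult"] * 5, ["child"]]))  # -> ["child"] under these weights
```

Change a single weight and the car changes who it saves. The code is trivial; deciding the weights is the hard part.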

While our goal as a tech community, and as members of the human race, should be to reduce as much bias as possible, the reality is there will always be some bias in the datasets selected to train AI algorithms, and that bias will shift depending on the environment around us and what is “normalized” at the time.

An uncomfortable example with some ugly truths: if self-driving cars were being trained in the American South during the early 1900s, at the peak of the KKK movement, it’s not hard to imagine those making decisions on training datasets choosing to value a white person’s life over a black person’s. There are countless other examples from the world we live in today.

Garbage in, Garbage out.

Good Data ➡️ Good AI: But how do we get there?

Some level of bias, conscious or unconscious, will always exist. The collective goal is to reduce the swing of the bias pendulum, as much as humanly possible.

Here are some ideas on how we can get there:

  1. Intentional Diversity across Data and AI Teams 🌏:
    It’s critical to represent as many groups of people as possible in creating and training AI algorithms. This step of inclusion must be meaningful and action-oriented, not just a coat of PR-paint. Diversity of thought, perspective, experience, and background will strengthen our datasets and help dial down the pendulum swing of bias in data, especially as we scale AI applications globally.
  2. Be Hyper-curious 🧐:
    Learn more about Artificial Intelligence and unpack those buzzwords. Ask questions. Don’t be afraid to investigate and dig further with business partners and tech vendors: what datasets are being used and represented, how data is gathered and processed, what AI methodologies are applied, and so on. Be hyper-curious so you’re armed with the information you need to make the best decisions you possibly can for your business (and yourself).
  3. Leverage AI Tech for Better Data ⚡︎:
    Use AI technology to automate monotonous tasks around data collection. For example, many expense report systems let employees simply upload or email a photo of a receipt, then automatically scan and extract all the necessary information (see the sketch after this list).
  4. Gamification 🎮 🎯 🎰:
    Cleaning up data and ensuring data quality can be one of the less exciting jobs one would sign up for, yet it requires thoughtful human input. There are ways to creatively gamify the process of collecting higher-quality data, cleaning up existing data, and aggressively working towards reducing bias and increasing diversity in datasets. Done effectively, gamification can drive the change we need with less friction.
  5. Most importantly, Accept the Complexity of Ethics ⚖️:
    Instead of fighting for an absolute truth in an increasingly global and diverse world, we would do best to accept the complexity of designing ethical standards and keep doing our best to increase diversity and representation while reducing bias. This will be a constant work in progress (as it should be!), and we’re going to get it wrong a lot. But as Maya Angelou so beautifully said: “Do the best you can until you know better. Then when you know better, do better.”
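A minimal sketch of idea #3 above: letting OCR tooling do the tedious data entry. This assumes the open-source pytesseract package and a receipt image on disk; the field patterns are illustrative, not production-ready, and anything the regexes miss is flagged for human review rather than guessed.

```python
import re
import pytesseract          # OCR wrapper around Tesseract (assumed installed)
from PIL import Image

def extract_receipt_fields(image_path):
    # Pull raw text off the receipt photo, then fish out the fields
    # an expense system needs.
    text = pytesseract.image_to_string(Image.open(image_path))
    total = re.search(r"TOTAL\s*\$?([\d,]+\.\d{2})", text, re.IGNORECASE)
    date = re.search(r"\d{1,2}/\d{1,2}/\d{2,4}", text)
    return {
        "total": total.group(1) if total else None,  # None = needs human review
        "date": date.group(0) if date else None,
        "raw_text": text,                            # keep the source for auditing
    }

# Hypothetical usage:
# print(extract_receipt_fields("receipt.jpg"))
```

The win isn’t just convenience: automating rote capture removes a whole class of manual-entry errors from the dataset at the source.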

There is so much power and innovation in diversity. We simply need to take a step back and focus on the quality of data going into AI to put us back on the right path.

Garbage in, Garbage out.
