Okay, ignoring that horrendous pun.
Humans are able to see from birth, so we often take it for granted. Teaching a machine to see from scratch, however, is a whole other ball game. But first, what even is "seeing"? I've broken down true vision into three parts:
- Being able to physically “see” the object in front of you
- Understanding and recognizing what it is
- Being able to respond to it
In humans, this process corresponds to our eyes capturing what is in front of us, our brains recognizing what it is, and our bodies responding accordingly.
For instance, imagine you see someone walking down the street. Your brain receives and processes this image, eventually recognizing that it’s your childhood friend from 2nd grade. From this new information, you’re able to respond by saying hi to your friend. Eventually, the two of you end up grabbing a coffee together. Pretty simple, right?
Well, these simple processes that we conduct every day are hugely important for properly functioning AI. We got the first part down pretty early with the invention of the camera in the early 19th century, nearly 200 years ago!
The next step is a bit harder. The computer can capture an image, but it has no idea what's happening in the picture; there is no sense of understanding or intelligence behind it.
“Just like to hear is not the same as to listen, to take pictures is not the same as to see, and by seeing, we really mean understanding.”
However, developments in image recognition have given computers the intelligence to do just that.
But, first, let’s take a step back to the beginning.
What Do Computers Really See?
Let’s go through how a computer views a normal colored image. Essentially, the computer sees it as a bunch of pixels. Each pixel is made up of a certain amount of red, green, and blue light, which your computer quantifies as RGB values.
Because these are the three additive primary colors, any other color can be produced by mixing different amounts of red, green, and blue light. Thus, each pixel has three values, one per channel. For instance, pure red is represented in RGB as (255, 0, 0): maximum red light and zero amounts of green and blue light.
The same methodology applies to black-and-white pictures, except there is only one value for each pixel. This number measures the brightness of the pixel, with 0 representing black and 255 representing white.
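To make this concrete, here's a minimal NumPy sketch of how an image lives in memory. The pixel values are made up for illustration, and the grayscale conversion uses one standard luminance weighting (there are others):

```python
import numpy as np

# A tiny 2x2 color image: each pixel holds (R, G, B) values from 0 to 255.
color_image = np.array([
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
], dtype=np.uint8)

print(color_image.shape)  # (2, 2, 3): height, width, 3 color channels

# Grayscale collapses the three channels into one brightness value
# per pixel, weighting green most heavily (as human eyes do).
grayscale = (0.299 * color_image[..., 0]
             + 0.587 * color_image[..., 1]
             + 0.114 * color_image[..., 2]).astype(np.uint8)

print(grayscale.shape)  # (2, 2): one value per pixel
```

A real photo works exactly the same way, just with millions of pixels instead of four.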
Okay, so the computer can store and display images in color. That's a pretty good step forward, but what's next? How can we get the computer to recognize what's in the photo, so that the next time you take a picture of a cute cat, it actually understands, "Okay, that's a cute cat!"?
This is where artificial intelligence comes in.
Breaking Down CNNs
Image recognition technology uses convolutional neural networks (CNNs) to interpret pictures.
A neural network is a set of algorithms used to recognize patterns and relationships in a dataset. It's loosely inspired by how neurons in a human brain function and interact, allowing us to perceive the natural world.
First, there’s an input — this would be the image. The network breaks up the image into smaller portions, to make it more manageable and easier to process.
Then, each portion undergoes a series of convolutions. A convolution is a mathematical operation that slides a small filter (a grid of weights) over the image and computes a weighted sum at each position. Under the hood this involves a lot of matrix multiplication, weights, and biases, but the ultimate goal of these convolutions is to identify a feature, like an edge or a shadow. As more and more simple features are identified, they can be combined and interpreted as something more complex. So, with a picture of a child, the network would first recognize edges, then gradually a head, eyes, and ears, and ultimately determine that it's a child.
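The sliding-filter idea can be sketched in a few lines of NumPy. This is a bare-bones convolution (no padding, stride 1) with a hand-picked vertical-edge filter; in a real CNN the filter weights are learned during training, not chosen by hand:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, taking a weighted sum
    at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector: bright-to-dark transitions produce
# large responses, flat regions produce zero.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])

# A half-white, half-black image: the edge runs down the middle.
image = np.zeros((5, 5))
image[:, :2] = 255

response = convolve2d(image, edge_kernel)
print(response)  # strongest values where white meets black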
Max pooling is a process used to shrink the computation by keeping only the largest number from each subsection of the array. This preserves the most important parts of the array while making the data easier to manage, and it can be applied multiple times at different stages in the network.
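Max pooling is even simpler to sketch. Here's a minimal version with non-overlapping 2x2 windows (the feature-map numbers are made up for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Downsample by keeping the largest value in each
    size x size window (stride = size, no overlap)."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 8, 1, 1],
                 [0, 2, 9, 5],
                 [1, 1, 3, 7]])

print(max_pool(fmap))
# [[8. 2.]
#  [2. 9.]]
```

A 4x4 array becomes 2x2: a quarter of the numbers, but the strongest activations survive.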
The final output is a set of numbers representing the probabilities of the image belonging to each class the network was trained to identify. Using the child example again, it'd have a high likelihood for the "child" class and a significantly lower one for, let's say, the "hammer" class. The machine then takes the class with the highest probability, labels the image as such, and voila! CNNs in a nutshell.
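That last step, turning raw network scores into probabilities, is usually done with a softmax function. A quick sketch, with hypothetical class names and made-up scores:

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exps / exps.sum()

# Hypothetical raw scores from the network's final layer.
classes = ["child", "hammer", "cat"]
scores = np.array([5.1, 0.3, 1.2])

probs = softmax(scores)
for name, p in zip(classes, probs):
    print(f"{name}: {p:.2%}")

# The predicted class is simply the one with the highest probability.
prediction = classes[int(np.argmax(probs))]
print(prediction)
```

Because softmax probabilities always sum to 1, the network is forced to commit: every bit of confidence it gives "hammer" is confidence taken away from "child".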
Wow, that was a lot. At least now the computer can see and actually understand what it's looking at. That's parts one and two of true vision checked off. But what about the third part? How does the computer use this information effectively and respond to it in real-world scenarios?
The Third Part — Applications
Asking why computers need to see is like asking why humans need to see. Imagine if you were blind. You couldn't walk outside without a cane or a guide dog to help. You wouldn't be able to see anything. Life just got a lot harder.
The same goes for machines. Many functions of AI depend on being able to recognize images. It's a tool used in a variety of fields and industries. Law enforcement is using AI to recognize criminals in security camera recordings. In healthcare, doctors use AI to detect the early stages of brain tumors.
The technology is no longer limited to huge tech corporations and geniuses. Microsoft has a fun and easy-to-use face recognition tool to detect if two people in different photos are the same person or not. It even provides some extra attributes, like the gender, age, and even emotion of the person. They were pretty spot on with my photos (even if they added a year or two to my age!)
Nothing is Perfect
But of course, AI isn’t without its problems.
With CNNs, you need a lot of labeled data. Just think about it — the network needs a diverse set of data to adapt to all sorts of different scenarios it may encounter. In the case of image recognition, this is especially true. After all, not every cat picture is exactly alike.
Thus, the more limited and homogenous the dataset is, the more limited the neural network’s abilities and intelligence will be.
It’ll only be able to identify pictures that have the same pose, lighting, and, in the case of humans, even skin color. See where I’m headed? This leads to the issue of algorithmic bias.
“The Coded Gaze” — Fighting Algorithmic Bias
In 2018, Joy Buolamwini from MIT Media Lab noticed a huge problem. Face recognition technology often wouldn’t recognize her face, despite working perfectly for her white male coworkers.
She decided to research this bias, which she called "the coded gaze". With a team of scientists, she composed a dataset of 1,270 individuals using their self-developed Pilot Parliaments Benchmark (PPB). They used pictures of people serving in the parliaments of three European countries (Iceland, Finland, and Sweden) and three African countries (Rwanda, Senegal, and South Africa), then classified them by gender and skin tone. The final dataset was 44.6% female and 46.4% darker-skinned individuals.
They purposefully chose countries whose parliaments contain people with a variety of skin colors, as well as a more evenly-distributed gender ratio. This study became one of the first intersectional studies regarding face classification across different races and genders.
Afterward, the scientists ran their data on some of the most widely-used classifiers to date — Microsoft, IBM, and Face++. They found that all of the classifiers had the least accuracy classifying darker females and the highest accuracy classifying lighter males, with a discrepancy as large as 34.4%.
What does this mean for image classifiers in their real-world applications? Are algorithms already prejudiced and discriminating against women of color? As we depend more and more on AI to conduct major tasks in society, such as identifying criminals, this is a huge issue to be aware of.
My thoughts — even the field of AI isn’t immune to the historic gender and racial imbalance rooted in American society. Women and people of color continue to be the most vulnerable, most likely due to the lack of representation in the tech field. Black and Latinx tech workers comprise 5% of the tech workforce, while women make up 24%. Less representation and a lack of diversity make it much easier for algorithmic bias to occur.
Of course, this includes the training datasets. Bias may come from training datasets that are unintentionally skewed towards a particular gender or skin color. After all, AI is only as good as the data being used. It makes sense: as obvious as it sounds, data is the foundation for data science! If the data is biased, the algorithm built from it will be flawed as well. Go figure.
The Big Idea
It’s no revolutionary idea that minorities are more vulnerable to being discriminated against — and the tech industry is no exception. To combat it, companies must recognize this and be sure to prioritize diversity in their training datasets and workforce to prevent the ill-fated “coded gaze”.
■ True vision requires both sight and understanding. Hearing is not the same as listening!
■ Computers perceive images as a bunch of numbers, called RGB values, in each pixel. These numbers range from 0 to 255 depending on the amount of red, green, and blue light detected. The same concept applies to black-and-white photos.
■ Convolutional neural networks (CNNs) are used to train computers in image recognition. A CNN detects different features of an image and uses them to classify it.
■ A neural network is only as good as its training data — the more homogenous it is, the more limiting the network’s abilities will be!
■ A "coded gaze", or algorithmic bias, has been demonstrated in major image classifiers, including those from Microsoft, IBM, and Face++.
Thanks for reading! Feel free to follow me on Medium, LinkedIn, or shoot me an email to hear more. Until next time…