Giving my Computer Eyes

An Introduction to YOLO and the future of Object Detection

You might not know this but… you have eyes!

Without our vision, humans would have gone extinct a really, really long time ago! Our vision lets us react to the things happening around us, understand and evaluate our surroundings, discover new lands, and even figure out how our solar system works. Without vision, you wouldn’t be reading this right now.


Now, of course, it isn’t just our eyes that play a role in our vision. Our brains play a massive role as well.

Our brains are the ones that take what we are seeing and tell us how we should react in milliseconds.

So, is there anything else out there that can react to its surroundings almost exactly like us?

Welcome to the Field of Computer Vision

That’s right, we are now able to get computers to see, with something called Computer Vision (CV). Just like the name says, computer vision is when we give computers… vision. Pretty self-explanatory.

But this is easy, right?

Unfortunately, taking your camera and strapping it on your computer won’t do the job. You see, computers don’t see the same way that we do.

[Image: The difference between our brain and a computer brain]

As humans, we see a bunch of different shapes and edges that our brain pieces together to make up an image. Computers, on the other hand, see arrays filled with numbers.

A black-and-white picture is a single 2-dimensional array of brightness values. Regular images, though, are in RGB colour, which means the computer has to look at 3 separate channels. So instead of looking at a 2-dimensional array of numbers, it is looking at a 3-dimensional one.
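
To make that concrete, here is a tiny sketch (using Python with NumPy and Pillow, my choice purely for illustration) of what a computer actually “sees” when it opens a picture:

```python
import numpy as np
from PIL import Image

# "cat.jpg" is a placeholder filename -- any photo on your computer works here.
img = Image.open("cat.jpg")
pixels = np.array(img)

print(pixels.shape)   # e.g. (480, 640, 3): height x width x 3 colour channels (RGB)
print(pixels[0, 0])   # the top-left pixel, e.g. [142  97  53] (red, green, blue)

# The black-and-white version is just one 2-D array of brightness values.
gray = np.array(img.convert("L"))
print(gray.shape)     # e.g. (480, 640)
```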

If it is this complicated to see an image… how do computers do it?

If our brain can see these images clearly…

Why can’t we just find a way to replicate our brain into computers?

Introducing Neural Networks

[Image: How a Neural Network works]

Neural Networks are artificial networks inspired by the human brain. They are made up of layers of interconnected neurons, and this is what allows computers to teach themselves to recognize complex patterns when given a specific task. In terms of Computer Vision, our goal is to teach these machines to understand and recognize different patterns in the images of the environment they are in.
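
Here is a minimal sketch of what such a network looks like in code. I’m using PyTorch purely as an example framework, and the layer sizes are made up for illustration:

```python
import torch
import torch.nn as nn

# A tiny neural network: layers of interconnected "neurons" passing signals forward.
network = nn.Sequential(
    nn.Linear(784, 128),  # 784 inputs (a 28x28 image, flattened) feeding 128 neurons
    nn.ReLU(),            # activation: lets the network learn non-linear patterns
    nn.Linear(128, 64),   # another layer of interconnected neurons
    nn.ReLU(),
    nn.Linear(64, 10),    # 10 outputs: one score per possible class
)

fake_image = torch.rand(1, 784)   # a random "image", just to show the data flowing through
scores = network(fake_image)
print(scores.shape)               # torch.Size([1, 10])
```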

So how are we going to do this?

This is where Convolutional Neural Networks (CNN) come in.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are mainly used for Computer Vision because they can preserve the spatial relationship between pixels by learning image features using small squares of input data. This is important because even if an object shifts around the image or is slightly squeezed or distorted, a CNN can still recognize it.

The first few convolutional layers recognize low-level features such as edges and corners. As the image passes through more and more layers, the features the network picks up become more and more complex.
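
As a rough sketch (again assuming PyTorch), a tiny CNN like the one described above might look like this. The early layers slide small squares over the image; the later layers build more complex features on top of them:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # slides small 3x3 squares over the RGB image
    nn.ReLU(),
    nn.MaxPool2d(2),                              # shrinks the image, keeping the strongest features
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # a deeper layer: combines edges into more complex shapes
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 2),                   # classify into 2 made-up classes (e.g. cat vs. not-cat)
)

image = torch.rand(1, 3, 224, 224)   # one fake 224x224 RGB image
print(cnn(image).shape)              # torch.Size([1, 2])
```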

[Image: How the convolutional layers work]

As powerful as CNNs are, they are mainly used for image classification. This means the network assigns the whole image to a single class. For example, if you have an image of a cat, it will classify it as a cat. Even if there are flowers in the picture too, the CNN only gives you one label.

[Image: CNNs will classify this picture as a cat instead of locating the cat]

This is the problem, because what if you wanted to know where each object in the image is located? Or what if you wanted to classify multiple objects in the same image? Simply classifying the image as a whole won’t help. So, how are we going to do this?

Enter the Realm of Object Detection

This is where object detection comes into play.

Object detection takes the classification properties of CNNs and uses them to locate different classes within the image rather than classifying the image as a whole.


One of the most popular ways of doing this is through an algorithm called YOLO (You Only Look Once).

YOLO can perform real-time object detection!

This algorithm is also incredibly fast! It can process up to 155 frames per second!

So it’s crazy, but how does it work?

You Only 𝖫̶𝗂̶𝗏̶𝖾̶ Look Once

As I mentioned, YOLO can operate at up to 155 frames per second in real time and is really accurate! YOLO only needs to look once, which means that it can almost immediately detect and locate objects.

By using a single CNN, YOLO can simultaneously predict multiple different bounding boxes and classes for those boxes.

Bounding Boxes???

When YOLO is running, it divides the image into an S x S grid to locate different objects. Then it creates bounding boxes around potential objects within each grid cell. These boxes surround possible objects in the image so that YOLO can go on to predict what they are.
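
For a sense of scale, the original YOLO paper uses S = 7 grid cells per side, B = 2 boxes per cell, and C = 20 classes, so the entire prediction for one image is just a 7 x 7 x 30 block of numbers:

```python
S, B, C = 7, 2, 20

# Each grid cell predicts B boxes, and each box carries 5 numbers:
# x, y (box centre), w, h (box size), and a confidence score.
# Each cell also predicts C class probabilities.
values_per_cell = B * 5 + C

print(values_per_cell)           # 30
print((S, S, values_per_cell))   # (7, 7, 30): the whole prediction for one image
```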


Intersection over Union

Now that we know more about that, let’s talk about how YOLO evaluates these bounding boxes. For starters, the machine uses something called IoU (Intersection over Union).


The IoU is what the machine uses to determine how well the predicted bounding box matches the actual bounding box around the object. This is extremely important because even if YOLO detects the right object, its box might not be surrounding the whole object.

So how does YOLO do this? The IoU is calculated by dividing the area of overlap by the area of union. This calculation gives you an IoU score between 0 and 1, which tells you how good the bounding box is.

[Image: Samples of IoU scores]

Hold up, hold up! The area of the what now?

The area of overlap is the area where the predicted bounding box overlaps the actual bounding box. The area of union is the total area covered by both bounding boxes combined (counting the overlapping part only once).

[Image: The Intersection over Union formula]
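
Here is a small, self-contained sketch of that calculation in Python. The box format (top-left and bottom-right corner coordinates) and the example numbers are my own assumptions for illustration:

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2): top-left and bottom-right corners.
    # Corners of the overlapping rectangle (if the boxes overlap at all).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    overlap = max(0, x2 - x1) * max(0, y2 - y1)              # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap                        # area of union

    return overlap / union

predicted = (50, 50, 150, 150)
actual = (60, 60, 160, 160)
print(round(iou(predicted, actual), 2))   # 0.68: a decent but not perfect box
```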

Always be Confident like YOLO

Here is another example of YOLO in action!

[Image: How YOLO knows what is more important]

Okay! So YOLO divides the image into an S x S grid. Then it creates bounding boxes… Wait! Why are there tons of bounding boxes?!

This is because, in the beginning, YOLO detects every tiny detail in the image and creates bounding boxes around each one. When YOLO creates bounding boxes, it also decides which objects are more important. You can see in the image that it’s detecting parts of the image that don’t even have objects in them.

You might notice that some boxes have a thicker outline than others. The thickness represents how confident YOLO is about that box.

When YOLO decides which objects are more important, it gives each box a confidence score. Whichever boxes score higher than a certain threshold are kept, and YOLO goes on to predict what those objects are.
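
A tiny sketch of that filtering step (with made-up detections and an arbitrary threshold, just for illustration) looks like this:

```python
detections = [
    {"label": "dog",   "confidence": 0.91, "box": (30, 40, 200, 220)},
    {"label": "bike",  "confidence": 0.78, "box": (100, 60, 380, 300)},
    {"label": "shrub", "confidence": 0.12, "box": (5, 5, 40, 60)},    # probably not a real object
]

THRESHOLD = 0.5   # an arbitrary cut-off chosen for this example

# Keep only the boxes YOLO is reasonably confident about.
kept = [d for d in detections if d["confidence"] >= THRESHOLD]

for d in kept:
    print(d["label"], d["confidence"])   # dog 0.91 / bike 0.78 -- the shrub box is filtered out
```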

[Image: YOLO filters out any non-important objects in the image]
[Image: Another image of YOLO in action!]

This is the general overview of what YOLO is and how it works. Now, let’s discuss a potential application of YOLO in the real world.

The Crime Stopper

In 2018, there were over 2.2 million police-reported crimes, and over 150,000 more in 2019. Not to mention, those are only the ones that got reported. Tons more happen every day.

What if we could get AI to detect when someone is breaking the law? Now, there are tons upon tons of different laws, so training a machine to recognize every single one would take far too long. Instead, we can train YOLO to detect whether someone is about to get hurt or already has been hurt. This could be from falling, getting robbed, or even having a seizure.

So how can YOLO help?

We can train YOLO to recognize multiple different objects such as guns, knives, etc. So, if YOLO detects these objects, then what?

[Image: YOLO isn’t just detecting the gun, but also the person with the gun]
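
As a rough sketch of the idea, here is what running a detector over a camera frame and flagging dangerous classes might look like. I’m assuming the ultralytics Python package here, and “weapons.pt” is a hypothetical custom-trained model, not something from this article:

```python
from ultralytics import YOLO

model = YOLO("weapons.pt")            # hypothetical weights trained on classes like "gun" and "knife"
DANGEROUS = {"gun", "knife"}

results = model("camera_frame.jpg")   # hypothetical still frame from a security camera
for box in results[0].boxes:
    label = results[0].names[int(box.cls)]
    if label in DANGEROUS and float(box.conf) > 0.6:
        print(f"Possible threat detected: {label} ({float(box.conf):.2f})")
```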

What if the gun belongs to a cop? What if those knives aren’t being used? There wasn’t actually any crime committed, so what can we do? Well, we can use something called Pose Detection. Like the name states, it can detect the pose a person is in: standing, sitting, lying down, etc.

[Image: Pose Detection]

How can we use this?

If someone falls, we can use pose detection to recognize that it happened and send an alert. This can also be used to identify if a person is holding a gun to someone, or if they are trying to rob someone.
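
As a rough sketch of how a fall check could work, here is an example assuming the mediapipe and opencv-python packages (my choice for illustration, not part of the original article):

```python
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose(static_image_mode=True)

frame = cv2.imread("camera_frame.jpg")                      # hypothetical camera frame
results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # Each landmark is a body joint (nose, shoulders, hips, knees, ...) with x/y positions.
    landmarks = results.pose_landmarks.landmark
    head_y = landmarks[mp.solutions.pose.PoseLandmark.NOSE].y
    hip_y = landmarks[mp.solutions.pose.PoseLandmark.LEFT_HIP].y

    # y grows downward in image coordinates, so a head "below" the hips is suspicious.
    if head_y > hip_y:
        print("Person may have fallen -- send an alert")
```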

Using YOLO and Pose detection, we could potentially be able to save tons of lives.

This is only one of the millions of other applications of YOLO.

Are we already done?

Wow! That was a long one! Here are some of the key takeaways from this article.

  • Neural Networks are artificial networks of interconnected neurons, inspired by the human brain.
  • Neural Networks are what let computers teach themselves to recognize patterns in a given environment.
  • Convolutional Neural Networks (CNNs) are a type of Neural Network mainly used in Computer Vision because of how well they classify images and/or videos.
  • These networks preserve the spatial relationship between pixels by learning image features using small squares of input data.
  • Object Detection locates and recognizes the different objects within an image and/or video.
  • YOLO (You Only Look Once) is a very powerful object detection algorithm because it can detect these objects at up to 155 frames per second in real time!
  • YOLO only uses a single CNN to do this.
  • YOLO uses bounding boxes that surround a potential object and predicts the object’s class.
  • YOLO uses Intersection over Union to measure how well a bounding box surrounds the whole object.
  • IoU = Area of Overlap / Area of Union
  • YOLO uses confidence scores to decide which objects are important enough to be labelled.
  • There are many different applications for YOLO, such as detecting crimes and reporting them in real time.

Want to Learn More?

Whether it’s about Computer Vision itself or the applications of YOLO, I hope you learned at least one thing from this article.

Thank you for reading this article. If you want to have a conversation or are interested in meeting with me, feel free to check me out on LinkedIn!


Written by Bagavan Marakathalingasivam

A 14-year-old passionate kid whose one goal is wanting to impact the world | AI Enthusiast | Innovator @The Knowledge Society

students x students

Providing a platform to uplift student voices and give them greater confidence and fulfillment in their writing.
