Giving my Computer Eyes
An Introduction to YOLO and the future of Object Detection
You might not know this but… you have eyes!
Without vision, humans would probably have gone extinct a really, really long time ago! Vision lets us react to the things happening around us and understand and evaluate our surroundings. Without it, we wouldn’t have been able to discover new lands or figure out how our solar system works. Without vision, you wouldn’t be reading this right now.
Now, of course, it isn’t just our eyes that play a role in our vision. Our brains play a massive role as well.
Our brains take what we are seeing and tell us how to react, all within milliseconds.
So, is there anything else that can react to its surroundings almost exactly like us?
Welcome to the Field of Computer Vision
That’s right, we are now able to get computers to see, with something called Computer Vision (CV). Just like the name says, computer vision is when we give computers… vision. Pretty self-explanatory.
But this is easy, right?
Unfortunately, taking your camera and strapping it on your computer won’t do the job. You see, computers don’t see the same way that we do.
As humans, we see a bunch of different shapes and edges that our brain pieces together to make up an image. Computers, on the other hand, see arrays of numbers, where each number is the brightness of one pixel.
And that is just for a black and white picture, which is a single 2-dimensional array. Regular images are in RGB colour, which means that computers need to look at 3 separate channels (red, green, and blue). So instead of looking at a 2-dimensional array, they are looking at a 3-dimensional one.
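To make the difference between a black and white image and an RGB image concrete, here is a minimal sketch in plain Python. The pixel values and the `shape` helper are made up for illustration; real images are much larger and are usually loaded with an image library.

```python
# A tiny 2x3 grayscale image: one brightness value (0-255) per pixel.
gray = [
    [0, 128, 255],
    [64, 192, 32],
]

# The same size image in RGB: every pixel now holds three values
# (red, green, blue), which adds a third dimension to the array.
rgb = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[255, 255, 0], [0, 0, 0], [255, 255, 255]],
]


def shape(array):
    """Return the dimensions of a regular nested list (hypothetical helper)."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)


print(shape(gray))  # (2, 3): rows, columns
print(shape(rgb))   # (2, 3, 3): rows, columns, channels
```

So the computer never "sees" a cat. It only sees a grid of numbers like these and has to work out the rest.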
If it is this complicated to see an image… how do computers do it?
If our brain can see these images clearly…
Why can’t we just find a way to replicate our brain into computers?
Introducing Neural Networks
Neural Networks are artificial networks inspired by the human brain. They are made up of many layers of interconnected neurons. This is what allows computers to learn to recognize complex patterns when given a specific task. In terms of Computer Vision, our goal is to teach these machines to understand and recognize different patterns in images of the environment they are in.
So how are we going to do this?
This is where Convolutional Neural Networks (CNN) come in.
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are mainly used in Computer Vision because they can preserve the spatial relationship between pixels by learning image features using small squares of input data. This is important because a CNN can still recognize an object when it shows up in a different part of the image and, if it was trained on enough varied examples, when the image is somewhat rotated or squeezed.
The first few convolutional layers recognize low-level features such as edges and corners. As the image passes through more and more layers, the features become more and more complex.
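Here is a rough sketch of what one of those "small squares" does when it looks for an edge. It is written in plain Python for readability, and the `convolve2d` function, the toy image, and the hand-picked kernel are all assumptions for illustration. In a real CNN the kernel values are learned during training, and a deep learning library does this sliding-window step for you.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            # Multiply the kernel with the patch under it and add everything up.
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 3x6 image: dark on the left (0), bright on the right (9).
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Hand-picked 3x3 kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 27, 27, 0]]
```

The output is 0 where the image is flat and large right where the dark side meets the bright side, which is exactly the kind of low-level "edge" feature the first layers pick up.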
Although CNNs are so powerful, they are mainly used for image classification. This means that the network assigns the whole image to a single class. For example, if you have an image of a cat, it will classify it as a cat. Even if there are also flowers in the picture, a plain classification CNN only gives you one label for the entire image.
This is a problem, because what if you wanted to know where each object in the image is located? Or what if you wanted to classify multiple objects in the image? Simply classifying the image won’t help. So, how are we going to do this?
Enter the Realm of Object Detection
This is where object detection comes into play.
Object detection takes the classification properties of CNNs and uses them to locate different classes within the image rather than classifying the image as a whole.
One popular algorithm for this is called YOLO (You Only Look Once).
YOLO can perform real-time object detection!
This algorithm is also incredibly fast! Its fastest version can process up to 155 frames per second!
Crazy, right? So how does it work?
You Only 𝖫̶𝗂̶𝗏̶𝖾̶ Look Once
As I mentioned, YOLO can run at up to 155 frames per second in real-time and still be really accurate! The name comes from the fact that YOLO only needs to look at the image once, in a single pass through the network, which means that it can detect and locate objects almost immediately.
By using a single CNN, YOLO can simultaneously predict multiple bounding boxes and the classes for those boxes.
When YOLO is running, it divides the image into an S x S grid. Each grid cell then predicts bounding boxes for objects whose centre falls inside that cell. These boxes surround potential objects in the image so that YOLO can go on to predict what they are.
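As a small illustration of the S x S grid, the sketch below works out which grid cell a point (for example, the centre of an object) lands in. The function name `grid_cell`, the grid size S = 7, and the 448 x 448 image are illustrative assumptions, not values taken from this article.

```python
def grid_cell(center_x, center_y, image_width, image_height, s):
    """Return (row, col) of the S x S grid cell that contains a point."""
    col = int(center_x * s // image_width)
    row = int(center_y * s // image_height)
    # A point exactly on the right or bottom border belongs to the last cell.
    return min(row, s - 1), min(col, s - 1)


# An object centred at pixel (300, 100) in a 448 x 448 image with a 7 x 7 grid.
print(grid_cell(300, 100, 448, 448, 7))  # (1, 4)
```

Each cell is 64 pixels wide here, so x = 300 lands in column 4 and y = 100 lands in row 1 (counting from 0).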
Intersection over Union
Now that we know more about that, let’s talk about how YOLO evaluates these bounding boxes. For starters, it uses something called IoU (Intersection over Union).
The IoU is what the machine uses to determine how well the predicted bounding box matches the actual bounding box around the object. This is extremely important because even if YOLO detects the object correctly, the box might not surround the whole object.
So how does YOLO do this? The IoU is calculated by dividing the area of overlap by the area of the union. This calculation gives you an IoU score between 0 and 1, which tells you how good the bounding box is: 0 means no overlap at all and 1 means a perfect match.
Hold up, hold up! The area of the what now?
The area of overlap is the area where the predicted bounding box overlaps the actual bounding box. The area of the union is the total area covered by the two bounding boxes together, with the overlapping part counted only once.
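The overlap and union described above can be sketched in a few lines of Python. This is a minimal version that assumes each box is given as (x_min, y_min, x_max, y_max); the function name `iou` and the example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the rectangle where the two boxes overlap.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # If the boxes do not touch, the width or height would be negative.
    overlap = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # Union: both areas added together, minus the overlap counted twice.
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
actual = (1, 0, 3, 2)
print(iou(predicted, actual))     # overlap 2, union 6, so about 0.33
print(iou(predicted, predicted))  # 1.0, a perfect match
```

Subtracting the overlap when computing the union is the important step: without it, the shared area would be counted twice.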
Always be Confident like YOLO
Here is another example of YOLO in action!
Okay! So YOLO divides the image into an S x S grid. Then it creates bounding boxes… Wait! Why are there tons of bounding boxes?
This is because, in the beginning, every grid cell predicts bounding boxes, whether or not there is really an object there. You can see in the image that YOLO is drawing boxes around parts of the image that don’t even have objects in them. So when YOLO creates bounding boxes, it also has to decide which ones are more important.
You might notice that some boxes have a thicker outline than the others. The thickness shows how confident YOLO is in that box.
To decide which boxes are more important, YOLO gives each one a confidence score. Only the boxes whose score is higher than a chosen threshold are kept and labelled.
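Filtering by confidence can be sketched like this. The detections, their scores, the 0.5 threshold, and the `filter_by_confidence` helper are all made-up values for illustration, not real YOLO output.

```python
# Hypothetical detections: a label plus a confidence score between 0 and 1.
detections = [
    {"label": "dog", "confidence": 0.92},
    {"label": "bicycle", "confidence": 0.81},
    {"label": "car", "confidence": 0.12},
    {"label": "person", "confidence": 0.05},
]


def filter_by_confidence(detections, threshold):
    """Keep only the detections whose confidence is above the threshold."""
    return [d for d in detections if d["confidence"] > threshold]


kept = filter_by_confidence(detections, threshold=0.5)
print([d["label"] for d in kept])  # ['dog', 'bicycle']
```

A higher threshold gives you fewer but more trustworthy boxes, while a lower one keeps more boxes, including more of the mistaken ones.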
This is the general overview of what YOLO is and how it works. Now, let’s discuss the potential application of YOLO in the real world.
The Crime Stopper
In 2018, there were over 2.2 million police-reported crimes, and over 150,000 more in 2019. Not to mention that these are only the ones that got reported. Tons more happen every day.
What if we could get AI to detect whether someone is breaking a law? Now, there are tons upon tons of different laws, so training the machine to recognize each one would take too long. Instead, we can train YOLO to detect whether someone is about to get hurt or already has been. This could be from falling, getting robbed, or even having a seizure.
So how can YOLO help?
We can train YOLO to recognize multiple different objects such as guns, knives, etc. So, if YOLO detects these objects, then what?
What if the gun belongs to a cop? What if the knives aren’t being used? In those cases no crime was actually committed, so what can we do? Well, we can use something called Pose Detection. Like the name states, it can detect whether someone is standing, sitting, etc.
How can we use this?
If someone falls, we can use Pose Detection to notice it and alert someone who can help. This can also be used to identify whether a person is pointing a gun at someone or trying to rob them.
Using YOLO and Pose Detection, we could potentially save tons of lives.
This is only one of the many applications of YOLO.
Are we already done?
Wow! That was a long one! Here are some of the key takeaways from this article.
- Neural Networks are Artificial Networks full of interconnected neurons inspired by the human brain.
- Neural Networks are what let computers learn to recognize patterns in a certain environment.
- Convolutional Neural Networks (CNNs) are a type of Neural Network that is mainly used in Computer Vision because of their powerful way of classifying images and/or videos.
- These networks can preserve the spatial relationship between pixels by learning image features using small squares of input data.
- Object Detection is what locates and recognizes different objects within an image and/or video.
- YOLO (You Only Look Once) is a very powerful object detection algorithm because it can detect these objects in real-time, at up to 155 frames per second!
- YOLO only uses a single CNN to perform this.
- YOLO uses bounding boxes that surround a potential object and predicts the object class.
- YOLO uses Intersection over Union (IoU) to measure how well a predicted bounding box matches the actual bounding box around the object.
- IoU = Area of Overlap / Area of Union
- YOLO uses confidence scores to decide which bounding boxes are more important and should be kept and labelled.
- There are many different applications for YOLO such as detecting crimes and reporting them in real-time.
Want to Learn More?
Whether it was about Computer Vision or the applications of YOLO, I hope you learned at least one thing from this article.
Thank you for reading this article. If you want to have a conversation or are interested in meeting with me, feel free to check me out on LinkedIn!