Deep-Learning-Enabled Technology Will Become A Fundamental Part Of Our Life
Maybe you’ve already heard about things like "Machine Learning", Artificial Intelligence and all those weird things that scare most people, mainly because they think of Hollywood blockbuster movies where robots take over the world.
As a believer in the good of all things, and especially in technology, I’m genuinely excited about what is already available today, as I see hundreds of use cases that would make my life much easier, and possibly the lives of others, too. Let me show you what we can already do with this technology, and what might come in the near future.
First of all: when I talk about machine learning here, I actually mean deep learning, which is a very specific kind of machine learning. To not confuse you too much, I will say machine learning when it’s about the core concepts. Just think of a much improved way of learning complex matters when you read "deep learning".
To understand machine learning at least a little bit, let me explain in a very simple way what these systems do, and this is the same for deep learning, too.
Given one or more sensors, the data from those sensors is captured and analyzed immediately to find patterns that trigger specific events. The analysis is based on information provided from outside (the "knowledge"). The simplest example would be a camera recording its environment: the video stream gets analyzed for certain patterns, like detecting a human face (not a specific face, just that it is a human face) or certain actions, like "a human plays a guitar".
Analyzing such streams of data for patterns (like "this is a human face") is not something you solve by programming an algorithm or writing some code by hand that processes the data in the video stream. Well, some have tried, but it never led to anything usable.
Instead, this is where artificial intelligence and machine learning come into play.
Machine learning means that you train a system that is capable of learning what counts as a good case and what counts as a bad case. It’s as simple as that! In our example, the good case would be "this is a human face", and the bad case is "this is not a human face".
Think about this process in the following way: you take several thousand images and label them with "has a human face" or "has no human face". You then feed this information into a learning system. It analyzes each image together with the label that is assigned to it, and tries to detect patterns. This process is called "training". It is a huge computing task, and on a standard computer (not the nerdy stuff; think of your parents’ computer, yes, the one you need to fix every time you visit them) this would take weeks to months (as long as there is no Windows update that forces a reboot). Luckily, there are much faster systems, which basically consist of a normal CPU and a very, very, VERY capable graphics processor (GPU, or graphics processing unit). We don’t use GPUs because we process images in this case, but because the "learning" part involves mathematical computation with floating point values (like 1.23456, compared to 1 or 2 or 3, which are integer values), and GPUs are extremely good at working with floating point values.
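If you are curious what such a training run looks like in code, here is a minimal sketch in Python with PyTorch. The folder layout, the choice of network, and the training settings are my own illustrative assumptions; real systems are bigger, but the principle is exactly this.

```python
# Minimal sketch of "training" a face / no-face classifier with PyTorch.
# Folder layout, network choice and training settings are illustrative assumptions.
import torch
from torch import nn
from torchvision import datasets, models, transforms

# Labeled images sorted into two folders: data/face and data/no_face
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Use a GPU if available; this is where the floating point math runs fast
device = "cuda" if torch.cuda.is_available() else "cpu"

# A small standard network with a 2-class output: "face" vs. "no face"
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):  # a few passes over all labeled images
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)  # how wrong the model currently is
        loss.backward()                        # adjust internal weights to find patterns
        optimizer.step()

torch.save(model.state_dict(), "face_model.pt")  # the resulting "model"
```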
So, the machine learning system has "learned" what is a human face and what is not by trying to find patterns in each of the images it has processed, and in the end it produces a so-called "model" as the outcome. Luckily, when this task is done, it also tells us how good this model is. If it has a low probability of finding a human face, we could simply add another hundred thousand images to make the model better, but there are also other ways to improve it.
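"How good is the model?" usually means checking its predictions against labeled images it never saw during training. Continuing the sketch above (the validation folder layout and file name are again my own assumptions):

```python
# Sketch: measure how good the trained model is on images it has never seen.
# Folder layout ("validation/face", "validation/no_face") and file name are assumptions.
import torch
from torch import nn
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Reload the model produced by the training sketch above
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load("face_model.pt", map_location=device))
model.to(device).eval()

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
val_set = datasets.ImageFolder("validation", transform=transform)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32)

correct = total = 0
with torch.no_grad():
    for images, labels in val_loader:
        images, labels = images.to(device), labels.to(device)
        predictions = model(images).argmax(dim=1)  # pick the most likely class
        correct += (predictions == labels).sum().item()
        total += labels.numel()

print(f"Correct on unseen images: {correct / total:.1%}")
```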
Nevertheless, we now have a model, and we can use it to analyze the video stream of our camera and detect human faces.
This is still a very demanding task, and this is why such a system also has to have a powerful GPU inside. It is still dealing with lots of floating point values: think of a billion of these values per second for just a single camera stream, and a delay of a few seconds to detect the faces in the stream even with a very capable GPU. Maybe you get a feeling now for what it means to process the six cameras and several other sensors in a modern Tesla car for autonomous driving… the car has to react within a tenth of a second, not within several seconds.
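For those who want to see what this looks like in practice, here is a minimal sketch that runs the trained model from the example above against a live camera stream, using Python with OpenCV and PyTorch. The camera index, model file name and the 0.9 threshold are my own illustrative assumptions, not a recipe for a production system.

```python
# Sketch: run the trained face / no-face model on a live camera stream.
# Camera index, file name and detection threshold are illustrative assumptions.
import cv2
import torch
from torch import nn
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load("face_model.pt", map_location=device))
model.to(device).eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

cap = cv2.VideoCapture(0)  # first attached camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    batch = preprocess(rgb).unsqueeze(0).to(device)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)[0]
    if probs[0] > 0.9:  # assuming class 0 is "face" (ImageFolder sorts labels alphabetically)
        print("human face detected")
```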
Ok, so until now I may have bored you with some weird technical stuff. Let me tell you what you can do with these capabilities, as I think that most people don’t understand the impact they will soon have on our lives.
Think about the classic babyphone/baby monitor that you put into the room of your baby or toddler to know when it wakes up or cries. This was basically triggered by a certain noise level. Some better ones also measured how long the noise level stayed raised, so they didn’t inform you on the first cough or some other unrelated noise in the room.
With a deep-learning-enabled system, you can not only detect whether the sound is really noise you should take care of, it can also detect actions through a camera. So when your toddler moves around too much, or tries to climb out of the crib, you get notified.
So far you might think that this is a "nice to have", but who would pay a lot of money for an expensive device that detects hyperactive toddlers, only to have no use for it a few months later?
Well, this device wasn’t created for just this purpose. It would be a device that can detect ANYTHING; it just needs a different model as input. It could record how long your child works on the day’s homework, or you could put it in your front yard to get notified when a carrier delivers something to your door. Or you could track how often your cat left the house through the cat flap, and how long it was out, and so on. Or anything else you might think of!
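To make the "it just needs a different model" point concrete: in code, the only things that change between the baby monitor, the parcel watcher and the cat-flap tracker are the model file and the list of labels. A hypothetical sketch (all file names and labels are made up for illustration):

```python
# Sketch: the same detector code serves every use case; only the model file
# and the labels change. File names and labels are illustrative assumptions.
import torch
from torch import nn
from torchvision import models

def load_detector(model_path: str, labels: list[str], device: str = "cpu"):
    """Load a trained classifier; swapping model_path and labels changes the task."""
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, len(labels))
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.to(device).eval()
    return model, labels

# The very same device could run any of these, one at a time:
face_detector = load_detector("face_model.pt", ["face", "no_face"])
parcel_detector = load_detector("parcel_model.pt", ["carrier_at_door", "nobody"])
cat_detector = load_detector("cat_flap_model.pt", ["cat_enters", "cat_leaves", "no_cat"])
```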
And there is ABSOLUTELY NO CODING INVOLVED! Machine learning means that you TRAIN the system from real-life input, and the more data it has, the more accurate it becomes.
Regarding privacy: we now have systems-on-a-chip (SoCs; a single chip containing everything from the CPU and GPU up to the input and output interfaces) that are so small and energy-efficient that the analysis of the data can happen locally, and the information never leaves the device. This is crucial for a new technology to be accepted in our daily life. These systems don’t need a connection to the Internet at all. You could send new models onto them from your smartphone via Bluetooth, so you will be in charge of what happens on your device, and when. They can’t even be compromised over the network, as they are not connected.
For nerds like me: we can already work with these things today. There are systems out there that are extremely capable, like the NVIDIA Jetson TX2, which uses less energy while detecting faces in a 4K high-resolution stream than the smallest LED bulb in your house. If you dry your hair for 10 minutes, you use about the same amount of energy as this system needs in 36 hours at 100% load, during which it could analyze the streams of six cameras in parallel for human face detection.
There are pre-trained models we can use to detect simple actions like "playing the guitar" or "brushing the teeth", to tell cats and dogs apart, to detect faces in an image or a stream, and so on. We can do this on incredibly capable devices that are not much bigger than your smartphone. A device as tiny (and as energy-efficient) as an Amazon Fire TV stick or Google Chromecast stick is already capable of analyzing a 4K video stream with a latency of less than a second, or a few seconds for more complex tasks. Even your up-to-date smartphone can detect anything you want with its camera and a given model when you point it at something.

A few years from now, a device smaller than a car key will be able to detect all sorts of actions or things in a video or sensor stream within a fraction of a second. Models will be provided and shared based on information from all over the world, and they will become better and better. In the same way we share code today via Open Source projects, app stores and other channels, we will work with models in the future. They will become a commodity. What is sold to us now as a "smart home" will change radically: the system will understand your behavior and act upon it, BUT it will not share this information with anyone!
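As a small taste of how accessible pre-trained models already are, here is a sketch that classifies a single image with a publicly available model from torchvision; the image file name is my own example, and many other libraries ship similar ready-made models.

```python
# Sketch: use a publicly available pre-trained model to classify an image.
# The image file name is an illustrative assumption.
import torch
from PIL import Image
from torchvision import models
from torchvision.models import ResNet50_Weights

# Download a model that was already trained on millions of labeled images
weights = ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()  # the preprocessing this model was trained with

image = Image.open("cat_or_dog.jpg")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    probs = torch.softmax(model(batch), dim=1)[0]

top = probs.argmax().item()
print(f"{weights.meta['categories'][top]}: {probs[top].item():.1%}")
```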
This is so exciting! This technology will help us in many, many ways, and it is non-intrusive because it runs locally, without sharing any data, and it will still yield results we can’t even imagine today!
PS: the day I can get rid of the presence detectors in my smart home and the problems they cause (e.g. not detecting persons when they don’t move, or move too little) and replace them with a small detector based on a camera and a trained model (one that cannot be hacked and can’t send anything to the outside world) will be one of my luckiest days! This is not a far-away future; it will be reality within the next 5 to 10 years!