How to use AI and Computer vision to make an analytical watchdog

A methodology for building a comprehensive structure with computer vision and AI algorithms that lets us deploy a virtual watchdog (one that watches videos, streams, and images to audit and analyze information)

AlphENsign

--

Overview

Computer vision is not something that only came into existence in the 21st century. Dating back to the 1960s, it was introduced as a subfield of information processing by means of computation. The hope was to convert visual information about objects into machine-readable code, in order to see how the human interpretation of real-world objects could be replicated by computers. The early algorithms were simple and flat: they could not process (human) contextual information, and could only describe objects through geometric expressions, and barely accurately at that.

Pretty darn boring for us humans!

In this article, let’s talk about the methodology: from deploying deep image-training infrastructures to the frameworks and tools that can make the computer’s vision intelligible.

What the heck is the ‘Virtual Watchdog’?

‘Watchdog’ is a synonym for something that monitors and audits the activities of a target, a role similar to that of a detective. The vast volume of activity in the digital world is hosted by virtual environments, and to keep these environments safe, virtual watchdogs are deployed for monitoring, auditing, and administration purposes. Usually, such systems are built on non-visual I/O: most conventional anti-virus engines, for example, read the metadata of various software and then query a database for context through a simple (or slightly more complex) ML algorithm.

A visual implementation of such a system, however, scans the environment for contextual understanding. A regression model can be used to watch for changes, and deep learning can train the system to become more intelligent and aware of its surroundings.
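To make the “watch for changes” idea concrete, here is a minimal sketch using plain frame differencing with OpenCV. This is not the regression or deep-learning model itself, just the simplest possible change detector; the video path and the thresholds are placeholder values.

```python
import cv2

# Minimal change detector: flags frames that differ noticeably from the previous one.
# "watchdog_feed.mp4" and both thresholds are placeholder values for illustration.
cap = cv2.VideoCapture("watchdog_feed.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pixel-wise absolute difference between consecutive frames
    diff = cv2.absdiff(prev_gray, gray)
    changed = cv2.countNonZero(cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)[1])
    if changed > 5000:  # arbitrary "something happened" threshold
        print("Change detected:", changed, "pixels differ")
    prev_gray = gray

cap.release()
```

A learned model would replace the hard-coded threshold with something that understands what kind of change actually matters, which is where the deep-learning part comes in later.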

Building & deploying Deep computer vision 1960s vs Today

Now, to appreciate how much easier yet more powerful the resources for working with deep computer vision have become (and will keep becoming in the coming years), consider the two most fundamental breakthroughs over the years:

  1. Hardware (storage technology, faster networks, and powerful yet cheaper computing resources)
  2. Software (efficient algorithms, better training methodology, and great scaling of deep-learning performance through powerful GPUs)

From the seventies (70s) through the nineties (90s), the still-young visual capturing and processing technologies faced a solid obstacle: limited (and expensive) data storage and transfer. Computing resources were extremely limited too, and, most importantly, so were the early image-capturing sensors (a big part of better CV); the photos and videos they produced often weren’t detailed enough for us humans to understand what was going on, let alone computers. Today, however, camera technology has undergone a revolution. High-res image sensors are rapidly getting smaller and cheaper year over year, and vastly superior network connectivity allows detailed high-res video to be streamed wirelessly. That becomes revolutionary when implemented in IoT devices: large networks of them can index and model big environmental sectors, for example roadways.

For high-res video streams and big-dataset processing, computation and storage have shifted towards cloud infrastructure. That has made the whole field more open to hobbyists and amateurs, letting them play a role in experimenting and innovating together.

With any of the major cloud providers (AWS, Google Cloud, IBM, Oracle), one can spin up asynchronous runtime environments with clusters (for running and training CV models), deploy virtual pipelines (to process and analyze structured data), and use versatile database resources (to store pre-trained and trained models). Thus, a large-scale video/image analysis and training infrastructure can be hosted by you, yourself, alone, from the comfort of your own bed.

From 1965 to the 2010s: image segmentation (by real-world objects)

On the software side of things, there have also been some outstanding breakthroughs. For starters, we went from letting the computer draw geometric shapes matching those in the real world (using linear mathematical expressions) to utilizing various types of deep-learning algorithms (such as convolutional neural networks trained on the ImageNet dataset) that establish a contextual understanding of real-world objects and can annotate them by content and context. This way, computers can see and understand the world much like we humans do: recognizing the content of an image, such as objects and their context (category), and working with 3D vectors (one metric being distance), which also gives computers an understanding of depth.

And in the last decade, SLAM (Simultaneous Localization and Mapping) solutions have become far more sophisticated thanks to newly matured, state-of-the-art ML and deep-learning algorithms. Today, many open-source SLAM frameworks, CV libraries, and visualization tools are available to practice training image models; that’s what we are going to take a brief look at.

Blueprint of the watchdog

Our watchdog is supposed to audit and analyze information from visual input and, in the best-case scenario, do so in real time. But before we do all of that, we need to start with the basics. There are different parts we need to focus on individually —

  • Frameworks and tools
  • Algorithms (DL, NN & Datasets) and scaling
  • Hosting & processing infrastructure

Our CV interface is supposed to take visual input (static image sets or motion video) and spit out outputs classified by context. For example, given a visual input such as an image of a city, it should split the content into data types: text (language), object classification by context, and depth. Information from these visual inputs can thus be decoded into versatile data that can later be used for analysis at scale (i.e., it can be put into mathematical expressions).
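As a rough illustration of what “classified by context” could look like in code, here is a hypothetical output schema sketched with Python dataclasses. The field names are my own assumptions for illustration, not a fixed format defined by the project.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical output schema for one analysed frame/image.
# Field names are illustrative assumptions, not a standard format.

@dataclass
class DetectedObject:
    label: str                        # e.g. "car", "traffic sign"
    category: str                     # broader context, e.g. "vehicle", "signage"
    confidence: float                 # 0.0 - 1.0
    depth_m: Optional[float] = None   # estimated distance, if depth is available

@dataclass
class FrameAnalysis:
    source_id: str                                        # which camera/stream/image the frame came from
    texts: List[str] = field(default_factory=list)        # recognised text (language content)
    objects: List[DetectedObject] = field(default_factory=list)

# Downstream analytics can then work on plain structured data, e.g.:
# cars = [o for o in frame.objects if o.category == "vehicle"]
```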

Let’s see what we need, but first, build a blueprint

Frameworks and tools

Although we want to make our system versatile enough to work with any dataset of visual inputs, we will start with open-source datasets and practice training locally.

This is our first step: to process basic images and videos, we are going to use OpenCV, which is an excellent choice for the task, and for scripting and automation, Python is what I prefer to use the most. To experiment and play with our models, we are going to use the Pangolin 3D visualization library, which is based on OpenGL. When we want to go a little more advanced and high-level, such as real-time environment mapping and localization (SLAM), we shall move on to some nonlinear SLAM tools; OpenSLAM and maplab are two solid choices.
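As a minimal sketch of the OpenCV-plus-Python part, here is what the very first “process basic images and videos” step can look like. The file names below are placeholders.

```python
import cv2

# Basic image preprocessing: load, resize, convert to grayscale.
# "city.jpg" is a placeholder path.
img = cv2.imread("city.jpg")
img = cv2.resize(img, (640, 480))
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cv2.imwrite("city_gray.jpg", gray)

# Basic video handling: iterate over the frames of a clip
# (pass 0 instead of a file name to read from a webcam).
cap = cv2.VideoCapture("street.mp4")
frame_count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_count += 1
cap.release()
print("Processed", frame_count, "frames")
```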

Algorithms (DL, NN & Datasets)and scaling

Making sense of the visual input we give to our machines requires the helping hand of fine-tuned deep-learning CNN (convolutional neural network) algorithms. What needs to be done: the processed and broken-down visual input now has to go through the algorithmic layer(s) so that the computer establishes an understanding of the content it processed.

A convolutional neural network mapping a multi-session, large-scale environment (OpenSLAM)

But fine-tuning deep-learning algorithms so they can do the job more efficiently and accurately takes months, if not years. It’s a tedious process altogether, which is why the algorithmic side of this field is, as a matter of fact, the most expensive one. So should we commit to spending years developing our own algorithms? No. Fear not, our open-source community is HERE to back us up, having already developed solutions for many computer-vision and deep-learning problems. Over the years, many competitions have been hosted that crowdsource help to come up with solutions to these notorious problems.

AlexNet (for CNNs) is one such solution. Others are WordNet, for training NLP (natural-language) models, and ImageNet, with millions of labelled high-res images.
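Since AlexNet and ImageNet are mentioned above, here is a minimal sketch of what “sending visual input through the algorithmic layers” can look like in practice, assuming PyTorch and torchvision as the framework (the article itself doesn’t prescribe one). The image path is a placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load AlexNet pre-trained on ImageNet (weights are downloaded on first use).
# Older torchvision versions use models.alexnet(pretrained=True) instead.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

# Standard ImageNet preprocessing: resize, crop, normalise.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("city.jpg").convert("RGB")   # placeholder image path
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)
    top5 = torch.topk(logits.softmax(dim=1), k=5)
    print(top5.indices, top5.values)   # top-5 ImageNet class indices and probabilities
```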

Hosting & processing infrastructure

Here comes the part that will fuel our watchdog system: the infrastructure (on the cloud). This will be the skeleton of our system.

(Among the many cloud providers, I prefer to use AWS)

The basic structure goes somewhat like this —

  • A database to store pre-trained objects (datasets). For this we can use traditional relational databases, non-relational databases, or even S3 buckets on AWS.
  • We then feed the un-trained data into code scripts to serialize it and get it ready for the next step. The industry is going serverless, and so are we: we shall use Lambda functions and code deployment here (see the sketch after this list).
  • Next comes training as a DL/ML job. We shall set up pipelines; the training process begins and continues from there.
  • In case we require more compute power, we can set up clusters with nodes.
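A minimal sketch of the first two steps, assuming boto3 on AWS. The bucket name, object keys, and the idea of writing a “manifest” for the training pipeline are my own placeholders; in a real deployment the handler below would run as a Lambda function.

```python
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Hypothetical Lambda handler: pull a raw dataset object from S3,
    serialize a simple manifest for it, and write that back so the
    training pipeline can pick it up. Names are placeholders."""
    bucket = event.get("bucket", "watchdog-datasets")
    key = event.get("key", "raw/images_batch_001.tar")

    # Fetch the un-trained (raw) data object
    obj = s3.get_object(Bucket=bucket, Key=key)
    size = obj["ContentLength"]

    # Minimal "serialization": record what we have for the next (training) step
    manifest = {"source_key": key, "size_bytes": size, "status": "ready_for_training"}
    s3.put_object(
        Bucket=bucket,
        Key=key.replace("raw/", "manifests/") + ".json",
        Body=json.dumps(manifest).encode("utf-8"),
    )
    return manifest
```

From here, the training job itself (the pipeline step) would read the manifest and run on a cluster when more compute is needed.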

Our job afterward is to play with the model and use it on example and real-world data to see whether it is doing its job as intended. This process is a little non-linear.

However, when actually practicing and executing, these things turn out quite differently, and unexpected challenges come up out of nowhere. There will be a lot of trial and error, which is why I thought of dividing this into a multi-part series (I don’t know how many parts it will take, but I will try to give interesting insights in each one).

Note: that’s about it; this post discusses the structure of the project at an abstract level. The tutorial and documentation of the project will come in the next posts. For now, you can check out my socials, where I post small, frequent updates as well as cool tech and engineering resources and insights.

Twitter: https://twitter.com/alphensigntv (this is where I’m most active and share insights)

Instagram: https://www.instagram.com/ (I post infographic posts on interesting tech topics)

Thanks for the read :)
