
5 Computer Vision Trends for 2021

ML Engineer Sayak Paul presents key trends in Computer Vision

Benedict Neo

Published in

bitgrit Data Science Publication

6 min read · Jun 9, 2021

Computer Vision is a fascinating field of Artificial Intelligence that has tons of value in the real world. A huge wave of billion-dollar computer vision startups is coming, and Forbes expects the computer vision market to reach USD 49 billion by 2022.

The goal

The main goal of computer vision is to give computers the ability to understand the world through sight and make decisions based on their understanding.

In application, this technology allows the automation and augmentation of human sight, creating many use cases.

If AI enables computers to think, computer vision enables them to see, observe and understand. — IBM

Use cases

Use cases of computer vision range from transportation to retail.

A quintessential example for transportation is the company Tesla, which manufactures electric self-driving cars that rely solely on cameras powered by computer vision models.

You also see computer vision revolutionizing the retail space, such as the Amazon Go program, which introduces checkout-free shopping using smart sensors and computer vision systems, taking convenience to the next level.

Computer Vision has a lot to offer in terms of practical applications. Whether you are a practitioner or just having fun with deep learning, it is essential to follow the newest progress in the field and keep up with the latest trends.

Trends in Computer Vision

In this article, I will be sharing the thoughts of Sayak Paul, an ML Engineer at Carted who recently gave a talk for bitgrit. You can find him on LinkedIn and Twitter.

Note that this article won't cover everything in the talk; it only serves as a summary of the key takeaways. You can find the slides for the talk here, with similar content plus helpful links related to the topics. The talk is also published on YouTube, where the topics are covered in more detail.

The goal of this article is the same as that of his talk, which is to help you:

Discover what might be more exciting to work on in the coming days.
Inspire your next project idea.
Get up to speed on some of the cutting-edge work happening in the field.

Now let’s dive into the trends.

Trend I: Resource-Efficient Models

Why

State-of-the-art models are often very hard to run offline on tiny devices such as mobile phones, Raspberry Pis, and other microprocessors.
Heavier models tend to have significant latency (here, the time a model takes to run a single forward pass) and can raise infrastructure costs significantly.
What if cloud-based model hosting is not an option (cost, network connectivity, privacy concerns, etc.)?
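
To make the latency definition above concrete, here is a minimal sketch of how you might time a forward pass. The helper name `measure_latency` and the `toy_model` stand-in are hypothetical; in practice you would pass in your own model's prediction function.

```python
import statistics
import time

def measure_latency(forward_fn, example_input, warmup=3, runs=20):
    """Return the median time in seconds of one forward pass."""
    for _ in range(warmup):
        forward_fn(example_input)      # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        forward_fn(example_input)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Hypothetical stand-in for a model: any callable works here.
def toy_model(x):
    return [v * 0.5 for v in x]

latency = measure_latency(toy_model, list(range(1000)))
print(f"median latency: {latency * 1e3:.3f} ms")
```

The median is used rather than the mean so that a single slow run (for example, due to another process) does not dominate the number.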

Build Process

1. Sparse Training

Sparse training introduces zeros into the weight matrices used to train a neural network. This works because not every dimension interacts with the others; in other words, not every weight is significant.
Although accuracy might take a hit, sparsity greatly reduces the number of multiplications, which shortens the time it takes to train the network.
One very closely related technique is pruning, where you discard network parameters that are below a certain threshold (other criteria exist as well).
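
As a rough illustration of threshold-based pruning, the sketch below zeroes out the smallest-magnitude weights of a single matrix with NumPy. The function name `magnitude_prune` and the `sparsity` parameter are assumptions made for this example; real frameworks ship their own pruning utilities, usually applied gradually during training.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(sparsity * weights.size)            # number of weights to drop
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest absolute value acts as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold          # True for weights that survive
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 5))
pruned, mask = magnitude_prune(w, sparsity=0.5)
print(f"zeros after pruning: {(pruned == 0).sum()} of {pruned.size}")
```

The returned mask can be reapplied after each training step so that pruned weights stay at zero.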

2. Post-Training Inference

Quantization lowers the numerical precision of a model (for example, to FP16 or INT8) to reduce its size.
With Quantization-aware Training (QAT), you can compensate for information loss caused by lowering precision.
Pruning + quantization can be the best of both worlds for many use cases.
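
To see what lowering precision to INT8 means in practice, here is a minimal sketch of symmetric per-tensor quantization in NumPy. This is only one of several possible schemes and is chosen here as an assumption for illustration; the function names are hypothetical, and production toolchains handle calibration and per-channel scales for you.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float array to INT8."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # float value of one integer step
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to approximate float values."""
    return q.astype(np.float32) * scale

weights = np.array([-1.0, -0.3, 0.0, 0.2, 1.0], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print("int8 values:", q)
print("max rounding error:", np.abs(restored - weights).max())
```

The INT8 array takes a quarter of the memory of the FP32 original, and the rounding error it introduces is the information loss that QAT tries to compensate for.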

3. Knowledge Distillation

Train a high-performing teacher model, then distill its “knowledge” by training a smaller student model to match the labels yielded by the teacher.
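
One way to make "matching the teacher's labels" concrete is a loss that compares the student's predicted distribution with the teacher's. The sketch below uses a cross-entropy on temperature-softened probabilities; the temperature and the function names are assumptions for this example and are not taken from the talk.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean cross-entropy between softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

teacher = np.array([[5.0, 1.0, 0.0]])
good_student = np.array([[4.5, 1.2, 0.1]])
bad_student = np.array([[0.0, 1.0, 5.0]])
print("agreeing student:", distillation_loss(good_student, teacher))
print("disagreeing student:", distillation_loss(bad_student, teacher))
```

A student whose predictions agree with the teacher gets a lower loss, which is the signal used to train it.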

Action plan

Train a bigger and high-performing teacher model.
Perform knowledge distillation, preferably with QAT.
Prune and quantize the distilled model.
Deploy

Trend II: Generative Deep Learning for Creative Applications

Why

Generative Deep Learning has come a long way.
Examples of its achievements can be found on thisxdoesnotexist.com.

Applications

1. Image Super-Resolution

Upscale images for use cases such as surveillance.

2. Domain Transfer

Transfer images into another domain.
Example: turn pictures of humans into cartoon or anime-style versions.

3. Extrapolation

Generate novel context for the masked regions in images.
Used in domains like image editing, simulating features seen in Photoshop-style apps.

4. Implicit Neural Representations and CLIP

Ability to generate images from captions (e.g., a human riding a bicycle in the streets of New York).

Action Plan

Study these works and implement them. It’s okay to skip a few parts.
Develop an end-to-end project.
Try improving their elements and who knows — you may find something novel!

Trend III: Self-supervised Learning

Self-supervised learning doesn't use any ground-truth labels; it relies on pretext tasks instead. Given a large unlabeled dataset, we ask the model to learn useful representations of the data by solving those tasks.

How does it compare with supervised learning?

Limitations of Supervised Learning

A humongous amount of labeled data is needed to push performance.
Labeled data is costly to prepare and can be biased as well.
Training time is very long with such large data regimes.

Learning with unlabeled data

The model is asked to be invariant to different views of the same image.
Intuitively, the model learns the content that makes two images visually different, e.g., a cat and a mountain.
Preparing an unlabeled dataset is way cheaper!
SEER (a self-supervised model) performs better than its supervised counterparts on object detection and semantic segmentation.
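
The idea of being invariant to different views of the same image is often expressed as a contrastive loss. The sketch below uses an NT-Xent-style loss (the kind popularized by SimCLR); the article does not name a specific loss, so this choice, the function name and the temperature value are all assumptions for illustration.

```python
import numpy as np

def nt_xent_loss(view_a, view_b, temperature=0.5):
    """Contrastive loss; row i of view_a and row i of view_b embed two views of image i."""
    n = view_a.shape[0]
    z = np.concatenate([view_a, view_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit vectors give cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                       # a sample is never its own positive
    # The positive for row i is row i + n, and vice versa.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    sim = sim - sim.max(axis=1, keepdims=True)           # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())

embeddings = np.eye(4)                                   # four toy "images"
matched = nt_xent_loss(embeddings, embeddings)           # views agree
mismatched = nt_xent_loss(embeddings, np.roll(embeddings, 1, axis=0))
print("matched views:", matched, "mismatched views:", mismatched)
```

The loss is low when two views of the same image land close together and far from every other image in the batch, and no labels are involved at any point.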

Challenges

Self-supervised learning requires a very large data regime to perform well for real-world tasks such as image classification.
Contrastive self-supervised learning is still computationally expensive.

Trend IV: Transformers and Use of Self-Attention

Why

Attention helps a network learn to align important contexts inside data by quantifying the pairwise entity interactions.
The idea of “Attention” has been around in computer vision in many forms (GC blocks, SE networks, etc.), but their gains have been marginal.
Self-attention blocks form the foundation of Transformers.
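
To show what "quantifying pairwise interactions" looks like, here is a minimal single-head scaled dot-product self-attention in NumPy. The projection matrices are random placeholders and the function name is hypothetical; a real Transformer block adds multiple heads, residual connections and normalization on top of this.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence of tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])              # one score per pair of tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))                        # e.g. 5 image patches, 16 features each
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
output, attention = self_attention(tokens, w_q, w_k, w_v)
print(output.shape, attention.shape)
```

Row i of the attention matrix says how much token i draws from every other token, which is the pairwise interaction the bullet above refers to.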

Pros and Cons

Pros

Fewer inductive priors, so they can be thought of as a general computation primitive for different learning tasks.
Parameter efficiency, with performance gains on par with those of CNNs.

Cons

Large data regimes are important during pre-training because transformers lack the well-defined inductive priors that CNNs have.

Another trend is combining self-attention with CNNs, which establishes strong baselines (e.g., BoTNet).

Trend V: Robust Vision Models

Vision Models are susceptible to many vulnerabilities that affect their performance.

Problems Vision Models face

1. Perturbations

Deep models are brittle to imperceptible changes in input data.
Imagine if a pedestrian gets predicted as an empty road!
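
A small change flipping a prediction can be demonstrated even on a toy model. The sketch below applies a one-step sign-gradient perturbation (in the style of FGSM) to a logistic-regression classifier; the attack, the weights and the input are all assumptions chosen for illustration, and the perturbation here is larger than what would be imperceptible on a real image.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, epsilon):
    """One-step sign-gradient perturbation for a logistic-regression classifier."""
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w              # gradient of binary cross-entropy w.r.t. the input
    return x + epsilon * np.sign(grad_x)

w = np.array([2.0, -3.0, 1.0])        # hypothetical trained weights
b = 0.0
x = np.array([0.2, -0.1, 0.1])        # input whose true label is 1
x_adv = fgsm_perturb(x, y=1.0, w=w, b=b, epsilon=0.2)

print("clean prediction:    ", sigmoid(x @ w + b))
print("perturbed prediction:", sigmoid(x_adv @ w + b))
```

Each feature moves by at most `epsilon`, yet the predicted class changes because every feature is nudged in the direction that increases the loss.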

2. Corruptions

Deep models can easily latch onto high-frequency regions, which makes them brittle to common corruptions like blur, contrast changes, zoom, etc.
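
To get a feel for these corruptions, the sketch below applies a box blur and a contrast reduction to a grayscale image stored as a 2D NumPy array with values in [0, 1]. The function names are hypothetical and the implementations are deliberately simple; they are meant for testing a model's sensitivity, not as a benchmark.

```python
import numpy as np

def reduce_contrast(image, factor):
    """Pull pixel values toward the image mean; factor=1 leaves the image unchanged."""
    mean = image.mean()
    return np.clip(mean + factor * (image - mean), 0.0, 1.0)

def box_blur(image, k=3):
    """Average each pixel with its k x k neighbourhood (k odd, edges padded by replication)."""
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (k * k)

# A checkerboard is almost pure high-frequency content.
image = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)
blurred = box_blur(image)
flat = reduce_contrast(image, factor=0.5)
print("std original/blurred/low-contrast:", image.std(), blurred.std(), flat.std())
```

Blurring wipes out most of the checkerboard pattern, which is exactly the kind of high-frequency signal a model may have been relying on.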

3. Out-of-distribution (OOD) data

Two kinds:

Domain shifted but label intact: we want our models to perform as consistently as they did on their training data.
Anomalous data points: we want our models to predict with low confidence when faced with anomalies.
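
A simple baseline for the second case is to flag inputs whose top softmax probability is low. The sketch below assumes this max-probability heuristic and a hypothetical threshold of 0.7; neither comes from the talk, and models can still be overconfident on anomalies, so treat this as a starting point only.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def flag_low_confidence(logits, threshold=0.7):
    """Return predicted classes and a mask of inputs whose top probability is below threshold."""
    probs = softmax(logits)
    return probs.argmax(axis=-1), probs.max(axis=-1) < threshold

logits = np.array([
    [6.0, 0.5, 0.1],    # one class clearly dominates
    [1.0, 0.9, 1.1],    # nearly uniform, so the model is unsure
])
predictions, uncertain = flag_low_confidence(logits)
print(predictions, uncertain)
```

Inputs flagged as uncertain can be routed to a fallback, such as a human reviewer, instead of being acted on automatically.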

How to make them robust

Many techniques deal with those specific problems to build robust vision models.