Recapping the Computer Vision Meetup — November 2022

Michelle Brinich
Published in Voxel51
Nov 15, 2022 · 11 min read

Last week we hosted the November 2022 Computer Vision Meetup and had a blast! The speakers were engaging, the virtual room was packed, and Q&A was lively. In this blog post we provide the recordings, as well as recap some highlights and Q&A from the presentations. We’ll also share the Meetup locations and upcoming schedule so that you can join us at a future event.

First, Thanks for Voting for Your Favorite Charity!

In lieu of swag, we gave Meetup attendees the opportunity to vote for their favorite charity and help guide our monthly donation to charitable causes. The charity that received the highest number of votes was the World Literacy Foundation. We are pleased to be making a donation of $200 to them on behalf of the computer vision community!

Meetup Recap at a Glance

Talk #1: Scaling Autonomous Vehicles with End2end

The first talk was by Sri Anumakonda, an autonomous vehicle developer focusing on creating Computer Vision software to help push the boundaries of self-driving cars.

Computer Vision has been primarily used as a way to perform scene understanding when deploying autonomous vehicles. But recently, there has been more and more research into how we can leverage Deep Learning and neural networks to create learned mappings from raw data to control outputs. In this talk, Sri introduced a method to train self-driving cars using only camera input, through a technique known as end2end learning. He described how we can create internal mapping spaces to go from camera images to control, how we can interpret these “black-box networks”, and how we can use end2end learning to solve self-driving. Additionally, he introduced some of the most promising research in the space and how we can create end2end systems that scale faster than any other method.

If you’re interested in learning about these concepts and more, then Sri’s presentation is for you:

  • Self driving & autonomous vehicles (AVs)
  • Deep learning
  • Convolutional Neural Networks
  • End2end learning
  • Semantic segmentation
  • Transposed convolutions
  • Optical (motion) flow
  • Depth estimation
  • 3D reconstruction
  • Localization

Talk #1 Video Replay

Talk #1 Q&A Recap

Here’s a recap of the live Q&A following the presentation during the virtual Computer Vision Meetup:

What are some examples of more modern autonomous vehicle architectures? More specifically, have there been major changes from the two-stream 3D-CNN approach from the 2019 paper cited? Or are companies still widely using similar architectures?

Assuming this question is regarding the Wayve Urban Driving paper, Sri’s answer to that is yes and no. The approach for end-to-end learning is very straightforward. You’re given this image, you’re given this neural network that’s able to process this image, and then you have this output, which is your lateral and longitudinal control. That’s the main meat of deep learning or in this case end-to-end learning.

But there’s a lot still being worked on in modern architectures. Check out a recent blog post by Wayve that describes how to add other complexities to your model, such as bird’s-eye view and more, in order to improve performance.
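To make the image-to-control mapping in this answer concrete, here is a minimal, untrained sketch in plain Python. It is not the architecture from the talk or from the Wayve paper: a single linear layer stands in for the deep convolutional network, the random weights stand in for learned parameters, and the names `end2end_policy`, `weights`, and `biases` are hypothetical.

```python
import math
import random

def end2end_policy(image, weights, biases):
    """Map a grayscale camera image directly to (steering, throttle)."""
    # Flatten the image and scale pixel values to [0, 1].
    pixels = [p / 255.0 for row in image for p in row]
    # One linear layer with two outputs; a real system would use a deep CNN here.
    raw = [sum(w * p for w, p in zip(row_w, pixels)) + b
           for row_w, b in zip(weights, biases)]
    steering = math.tanh(raw[0])                 # lateral control in [-1, 1]
    throttle = 1.0 / (1.0 + math.exp(-raw[1]))   # longitudinal control in [0, 1]
    return steering, throttle

rng = random.Random(0)
image = [[rng.randrange(256) for _ in range(8)] for _ in range(8)]    # fake 8x8 frame
weights = [[rng.uniform(-0.1, 0.1) for _ in range(64)] for _ in range(2)]  # untrained
biases = [0.0, 0.0]

steering, throttle = end2end_policy(image, weights, biases)
print(f"steering={steering:+.3f}, throttle={throttle:.3f}")
```

In a trained system, the weights would be fit so that the predicted controls match recorded human driving; the point here is only that pixels go in and lateral and longitudinal commands come out, with no hand-written rules in between.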

Elon Musk predicted that AVs would be commonplace by 2021. What went wrong and what is your prediction about AVs?

Sri answers that self-driving is a very complex problem. Even though we’re able to create driving that works in certain environments, it’s very hard to create generalizable driving if you don’t have cars that drive all over the place. That’s why companies like Tesla perform so well. Because they have so many users living in different places across America, they’re able to collect a large amount of data about what’s going on in the real world.

As for what went wrong, it’s likely something that will fix itself with time. End-to-end learning is a very complex problem, especially with computer vision. The challenge in using deep neural networks and convolutional nets to go from input to output is essentially a data problem: the more diverse and high-quality data you have, the better your model can be and the more it can generalize to the real world. Sri states that we’ve come very far in the past 10 to 15 years, but predicts that it will take five to 10 more years to nail it down. We’re at the point where we have solved 99% of self-driving, but for each additional nine we add beyond that (99.9%, 99.99%, and so on), the challenge of self-driving becomes orders of magnitude harder than it was before.

Have you experimented with the ViT vision transformer in your work?

Sri has not so far, but notes that it is very interesting. Vision transformers in self-driving have had a really big boost over just the past 12 to 18 months, as researchers explore how to apply an idea originally developed for NLP to self-driving. Sri finds it fascinating to think about how we can incorporate better scene understanding through transformers. So the answer is no, but it is a very interesting space to think about.

For optical flow, if the cameras are on the driving car, wouldn’t the background move faster than the other cars that are driving along with the self-driving car?

Sri answers no, mainly because the background is more static relative to the foreground. The change in pixel values from, for example, trees moving in the background is much smaller than the change in pixel values from other cars in the foreground. This is especially true for cars in the opposite lane, which move much faster across the image than the trees in the background. You can leverage that information to make better predictions.
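The difference in apparent motion can be illustrated with a toy example. The sketch below uses simple block matching on two made-up one-dimensional scanlines as a stand-in for optical flow; it is not the method from the talk, and the pixel values, the shift sizes, and the helper names `shift_right` and `best_shift` are all assumptions chosen for illustration.

```python
def shift_right(row, n):
    """Return a copy of row moved n pixels to the right, padded with zeros."""
    return [0] * n + row[:len(row) - n]

def best_shift(prev, curr, start, size, max_shift):
    """Find the horizontal shift that best aligns a block of prev with curr."""
    block = prev[start:start + size]
    best, best_cost = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        lo = start + shift
        if lo < 0 or lo + size > len(curr):
            continue  # candidate window falls outside the frame
        # Sum of absolute differences between the block and the candidate window.
        cost = sum(abs(a - b) for a, b in zip(block, curr[lo:lo + size]))
        if cost < best_cost:
            best, best_cost = shift, cost
    return best

# Scanline through a background tree: barely moves between frames (assumed 1 px).
prev_bg = [0, 0, 0, 0, 0, 9, 3, 7, 0, 0, 0, 0, 0, 0, 0, 0]
curr_bg = shift_right(prev_bg, 1)

# Scanline through an oncoming car: moves much further (assumed 4 px).
prev_fg = [0, 0, 8, 2, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
curr_fg = shift_right(prev_fg, 4)

bg_motion = best_shift(prev_bg, curr_bg, start=4, size=5, max_shift=5)
fg_motion = best_shift(prev_fg, curr_fg, start=1, size=5, max_shift=5)
print("background shift:", bg_motion, "px; foreground shift:", fg_motion, "px")
```

A larger estimated shift for the foreground block is the kind of signal a model can use to separate nearby moving vehicles from distant scenery.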

How can I set up a similar environment to the one you’ve discussed?

Sri replies that there are a lot of really good simulators online for self-driving, including a very popular one known as CARLA, which has everything you need for self-driving. It provides LiDAR data, semantic segmentation, instance segmentation, and more, in addition to being very fun to play around with.

Humans mainly rely on vision for driving, but sound is also important. For example in case of an ambulance coming, you may not see it, but you will hear it. How important a role do you think sound will play in autonomous vehicles and can this help computer vision in any way?

Sri notes that he’s been thinking of this as well because sound is very important, especially in the cases of needing to hear emergency vehicles. Sri has a hypothesis on how to add this to a computer vision model that involves having a separate network that does sound detection. For example, if you hear a police car, an ambulance, or fire truck, then maybe you apply a one-hot encoding vector where you declare if sound == fire truck, then execute a command that has the car move to the right and start driving really slowly to increase caution.
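Sri’s hypothesis can be sketched in a few lines. This is only an illustration of the idea as described in the answer: the class list, the command dictionary, the 3.0 m/s speed cap, and the function names are hypothetical, and the separate sound-detection network is assumed to exist and to output one of the listed labels.

```python
SOUND_CLASSES = ["none", "police_car", "ambulance", "fire_truck"]

def one_hot(label):
    """Encode a detected sound label as a one-hot vector over SOUND_CLASSES."""
    return [1 if c == label else 0 for c in SOUND_CLASSES]

def adjust_command(command, sound_vector):
    """Override the planned command when the sound vector flags an emergency vehicle."""
    detected = SOUND_CLASSES[sound_vector.index(1)]
    if detected == "none":
        return command  # nothing heard, keep the vision-based plan
    # Emergency vehicle heard: move to the right and slow down to increase caution.
    return {
        "lane_shift": "right",
        "target_speed_mps": min(command["target_speed_mps"], 3.0),
    }

planned = {"lane_shift": "none", "target_speed_mps": 12.0}
print(adjust_command(planned, one_hot("fire_truck")))
print(adjust_command(planned, one_hot("none")))
```

Keeping the sound network separate, as in this sketch, means the vision model does not need to be retrained; the trade-off is a hand-written rule, which is exactly what end-to-end approaches try to avoid.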

Do end-to-end models with multiple sensor modalities perform better or worse than just implementing sensor fusion separately from the rest of the decision making?

Sri answers that there are two ways to think about this. First, with end-to-end sensor fusion, everything runs through one larger network, which gives you a better representation of what’s going on in the scene. More importantly, it is no longer a human who determines what types of patterns to look for. One problem with implementing sensor fusion separately from the rest of the decision making is that humans tend to hand-implement how the separate modalities go into the sensor fusion itself.

But with end-to-end learning, you can have the network create its own internal mapping so that it can pick up on patterns that humans are not able to pick up on, which allows it to not only have a better understanding of its world itself, but also have better performance.

How do models perform in different lighting conditions?

Sri replies that lighting and weather are important to computer vision, and to self-driving in particular. Driving at night is much more complex than driving during the day because objects are much clearer in daylight than at night or in adverse conditions.

Talk #1 Additional Resources

Check out these additional resources on the presentation and the speaker:

Thank you so much to Sri on behalf of the entire Computer Vision Meetup community for sharing your knowledge and inspiring us!

Talk #2: Synthetic Data Generators and Deploying Highly Accurate Retail Supply Chain Computer Vision Apps

The second talk in the Computer Vision Meetup was by Tarik Hammadou, a Senior Developer Relations Manager at NVIDIA.

Training data for product recognition within a large retail supply chain context is hard to get at scale because it’s dynamic in nature, with new products being introduced frequently. Supervised learning models rely on the training data, and this problem becomes significant with the scale of the machine learning models. In this talk, Tarik presented a method based on creating a digital twin of the fulfillment or distribution center facility and generating photorealistic digital assets to train and optimize the classification model to be deployed in the real world. The performance of the training process is then used in a feedback loop to adjust the synthetic data generator until an acceptable result is achieved. Furthermore, he shared his deployment orchestration methodology over a large number of compute nodes. This method can also be extended to product inspection and other more complex computer vision tasks.

If you have a scenario, such as material handling optimization, where you think trying it out in a “digital twin” to optimize performance would be a preferred step before attempting to implement it in your physical space (think depalletization, conveyor belts, picking and sorting stations), then this talk is for you!

Talk #2 Video Replay

Talk #2 Q&A Recap

Here’s a recap of the live Q&A following the presentation during the virtual Computer Vision Meetup:

Did you combine real and synthetic data for training?

Tarik shared that in this specific example, they trained the classifier using a pre-trained model on synthetic data.

Which pretrained model did you use?

Tarik answered that they used YOLOv5. They have also done some instance segmentation with Mask R-CNN, and used Faster R-CNN as well.

How significant was the improvement from the automatic tuning of the dataset generation parameters?

Tarik replied that with Replicator and the synthetic data generator, you tune the generator’s parameters as you train your neural network in a feedback loop, and you can get significant results, very close to state of the art, in terms of performance.

When using synthetic data for testing, is the feedback to the dataset generation parameters manual or automatic?

Tarik shared that when he performed these experiments, the feedback loop was manual but they are in the process of automating it. Tarik is working with a company, Kinetic Vision, and in their workflow there is an automatic feedback loop — when you are training a network, you use the accuracy of your result to go back into a feedback loop and change those parameters.
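An automated version of such a loop might look like the sketch below. It is a generic illustration, not Tarik’s or Kinetic Vision’s workflow and not the Omniverse Replicator API: `train_and_evaluate` is a hypothetical stand-in that returns a toy score instead of rendering data and training a model, and the `lighting` and `clutter` parameters are invented for the example.

```python
import random

def train_and_evaluate(params):
    """Hypothetical stand-in: render synthetic data with params, train, return accuracy."""
    # Toy score that peaks at lighting=0.7, clutter=0.3; a real run would train a model.
    return 1.0 - abs(params["lighting"] - 0.7) - abs(params["clutter"] - 0.3)

def tune_generator(params, target_accuracy, max_rounds, rng):
    """Perturb generator parameters, keeping a change only when accuracy improves."""
    best_acc = train_and_evaluate(params)
    for _ in range(max_rounds):
        if best_acc >= target_accuracy:
            break  # acceptable result reached, stop adjusting the generator
        # Propose a small random change to every parameter, clamped to [0, 1].
        candidate = {k: min(1.0, max(0.0, v + rng.uniform(-0.1, 0.1)))
                     for k, v in params.items()}
        acc = train_and_evaluate(candidate)
        if acc > best_acc:
            params, best_acc = candidate, acc
    return params, best_acc

start = {"lighting": 0.2, "clutter": 0.9}
tuned, accuracy = tune_generator(start, target_accuracy=0.9, max_rounds=500,
                                 rng=random.Random(0))
print("tuned parameters:", tuned)
print("accuracy proxy:", round(accuracy, 3))
```

Random perturbation is the simplest possible search strategy; since each evaluation in a real pipeline means generating data and training a model, a more sample-efficient optimizer would likely be worth the extra complexity.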

Do you use any differentiable rendering techniques inside Omniverse?

Tarik answered that, yes, there are two different types of methods and techniques in terms of rendering inside Omniverse. Stay tuned for some additional resources discussing this topic.

You said it took 4 months from the start of the project to its deployment. How many people were working on the project?

Tarik explains that the biggest challenge he has seen in the AI world is that proofs of concept (POCs) can be painful. It sometimes takes six to eight months just to conduct a POC. In this four-month project, they had six people working on it. It gave them an indication that if they have pre-trained models, and a workflow and pipeline that are well established, they can accelerate the enablement and the deployment of those applications.

What was your ratio of synthetic data to real image to train the advanced computer vision models or does the ratio matter?

Tarik shared that in this case, they trained those models only on synthetic data. The pretrained model had been trained from scratch on real data, and it was then fine-tuned using only the synthetic data.

Talk #2 Additional Resources

You can find the talk transcript here.

Thank you Tarik on behalf of the entire Computer Vision Meetup community for sharing your knowledge and the retail supply chain use case with us!

Computer Vision Meetup Locations

Computer Vision Meetup membership has grown to more than 1,600 members in just a few months! The goal of the meetups is to bring together a community of data scientists, machine learning engineers, and open source enthusiasts who want to share and expand their knowledge of computer vision and complementary technologies. If that’s you, we invite you to join the Meetup closest to your timezone:

Upcoming Computer Vision Meetup Speakers & Schedule

We recently announced an exciting lineup of speakers for December, January, and February. Become a member of the Meetup closest to you, then register for the Zoom for the Meetups of your choice.

December 8

January 12

February 9

  • Breaking the Bottleneck of AI Deployment at the Edge with OpenVINO — Paula Ramos, PhD (Intel)
  • Understanding Speech Recognition with OpenAI’s Whisper Model — Vishal Rajput (AI-Vision Engineer)
  • Zoom Link

Get Involved!

There are a lot of ways to get involved in the Computer Vision Meetups. Reach out if any of these describe you:

  • You’d like to speak at an upcoming Meetup
  • You have a physical meeting space in one of the Meetup locations and would like to make it available for a Meetup
  • You’d like to co-organize a Meetup
  • You’d like to co-sponsor a Meetup

Reach out to Meetup co-organizer Jimmy Guerrero on Meetup.com or ping him over LinkedIn to discuss how to get you plugged in.

The Computer Vision Meetup network is sponsored by Voxel51, the company behind the open source FiftyOne computer vision toolset. FiftyOne enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster. It’s easy to get started, in just a few minutes.
