Review — Look, Listen and Learn (Self-Supervised Learning)

Self-Supervised Learning Using L³-Net for the Audio-Visual Correspondence (AVC) Task

Audio-visual correspondence (AVC) task: by seeing and hearing many unlabelled examples, a network should learn to determine whether a (video frame, short audio clip) pair corresponds or not.
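To make the task concrete, here is a minimal sketch of how labelled pairs can be generated for free from unlabelled videos. It assumes that a corresponding pair takes its frame and audio from the same video and a non-corresponding pair takes them from two different videos. The function name `sample_avc_pair`, the video identifiers, and the 50/50 balance between the two classes are hypothetical choices for illustration, not details stated in this review. Temporal alignment within a video is ignored here.

```python
import random


def sample_avc_pair(video_ids, rng):
    """Return (frame_video, audio_video, label) for one AVC training pair.

    label 1: frame and audio come from the same video (corresponding).
    label 0: frame and audio come from two different videos (mismatched).
    """
    frame_video = rng.choice(video_ids)
    # Assumption: positives and negatives are drawn with equal probability.
    if rng.random() < 0.5:
        return frame_video, frame_video, 1
    # Negative pair: pick the audio from any other video.
    others = [v for v in video_ids if v != frame_video]
    audio_video = rng.choice(others)
    return frame_video, audio_video, 0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
videos = ["vid_a", "vid_b", "vid_c"]  # hypothetical video identifiers
pairs = [sample_avc_pair(videos, rng) for _ in range(1000)]
```

No human annotation is involved: the label follows directly from whether the two video identifiers match, which is what makes the task self-supervised.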


1. Core Idea

1.1. Binary Classification Task

1.2. Difficulties

2. L³-Net: Network Architecture

2.1. Vision Subnetwork

2.2. Audio Subnetwork

2.3. Fusion Network

3. Training Data Sampling & Datasets

3.1. Training Data Sampling & Other Details

3.2. Datasets

3.2.1. Flickr-SoundNet

3.2.2. Kinetics-Sounds

4. Audio-Visual Correspondence (AVC) Results

Audio-visual correspondence (AVC) results

5. Transfer Learning Results

5.1. Audio Features on ESC-50 & DCASE

Sound Classification

5.2. Video Features on ImageNet

Visual classification on ImageNet

6. Qualitative Results

6.1. Visual Features

Learnt visual concepts
Visual semantic heatmap

6.2. Audio Features

Learnt audio concepts
Audio semantic heatmaps


Self-Supervised Learning

My Other Previous Paper Readings


