Self-supervised Deep Learning for Modelling Audio-Visual Correspondences — Human perception is multidimensional and a balanced combination of hearing, vision, smell, touch, and taste. Recently, many pieces of research have tried to step forward on the road of improving machine perception by transitioning from single-modality learning to multimodality learning. If you are wondering what modality is, it is the…