Relatively little work has explored the use of audio for classifying object interactions, whether in conjunction with vision or as a single modality.