Multimodal Deep Learning

Cypherlynk
2 min read · Jan 25, 2023


If you are an intermediate machine learning developer, you have probably come across the term, or maybe this is your first time seeing it. I am writing this blog to explain the term and share a really cool idea about multimodal deep learning. Multimodal learning is maybe one of the closest things AI/ML has come to human intelligence. These models accept two different types of input data, or to rephrase it, two different human senses, like giving a model both text and image data as input. What can it do? That's where things get interesting. With multimodal models we can extract information from images, and in another example, with audio and video as the input data, we can even estimate how confident a speaker is.

If we use a CNN as one input branch and an ANN as another, we can combine text and image data into one multimodal model. I tried looking up code samples or examples on how to implement multimodal learning, but all I could find were really complicated projects, a million files (just a figure of speech) with code I couldn't study for very long (because of exams coming up). Still, from what I saw I could piece together the core idea: we use two different neural networks and feed their outputs as inputs to a final, third neural network, which produces a combined output.
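Here is a minimal sketch of that idea in Keras. Everything concrete in it is an assumption for illustration, not something from a real project: the 64x64 RGB image size, the 1000-dimensional bag-of-words text vector, and the single binary label. A real model would size these to fit its dataset.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Branch 1: a small CNN for the image input (assumed 64x64 RGB).
image_in = layers.Input(shape=(64, 64, 3), name="image")
x = layers.Conv2D(16, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
image_out = layers.Dense(32, activation="relu")(x)

# Branch 2: a plain ANN (MLP) for the text input. Assumes the text is
# already vectorized into a fixed-length feature vector (e.g. a
# 1000-dim bag of words); a real pipeline would tokenize and embed.
text_in = layers.Input(shape=(1000,), name="text")
y = layers.Dense(64, activation="relu")(text_in)
text_out = layers.Dense(32, activation="relu")(y)

# Fusion: concatenate both branch outputs and feed them to a third
# network that produces the combined prediction.
combined = layers.concatenate([image_out, text_out])
z = layers.Dense(32, activation="relu")(combined)
output = layers.Dense(1, activation="sigmoid", name="label")(z)

model = Model(inputs=[image_in, text_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training then just means passing both inputs at once, e.g. `model.fit({"image": images, "text": text_vectors}, labels)`. The nice part of this design is that each branch can be swapped out independently, say, a pretrained CNN for the images, without touching the fusion network.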

This is Multimodal Deep Learning

Signing Off

Cypher De. Lyncan
