This blog post introduces recent advances in multimodal information retrieval and conditional language models. In other words, it is about machine learning that deals with both image and textual data.
Everything written here is the absolute ground truth, limited only by the understanding of the author.
Why do we want to do this?
First of all, we need to explain why we are interested in extracting knowledge from both image and text data.