Beyond imagination, be real: Multi-modal AI

Minseo Jang
Published in Cochl
6 min read · Dec 19, 2023

Recently, a question has been lingering in my mind: ‘When will AI replace me?’ This contemplation wasn’t sudden, but it gained momentum after watching Google’s latest Multi-modal AI, Gemini. Previously, I believed there might be inherent limitations to the development of AI. However, even though the video was somewhat edited, the Gemini clip changed my perspective. It left me with the impression that machines can now take in diverse senses simultaneously, comprehend each of them, and employ them much like human beings do.

<Video> Hands-on with Gemini: Interacting with multimodal AI

I aimed to capture your interest by introducing Gemini, and indeed the hottest keyword in the recent AI scene is Multi-modal AI. Existing AI models are designed for specific inputs and outputs, such as text-to-image and speech-to-text. In contrast, Multi-modal AI excels at comprehending and handling more complex information that demands diverse perspectives: essentially, it’s anything-to-anything. This AI can seamlessly process various types of data, including text, sound, image, and video.

Before the emergence of Multi-modal AI, AI models were limited to uni-modal information processing. With Multi-modal AI, however, understanding and utilizing diverse channels is now possible, akin to how humans use their senses to comprehend information. As a result, AI has reached a level where it can communicate in a more natural, human-like manner.

<Image 1> The five human senses

AI models predating Multi-modal AI struggled to comprehend information from complex perspectives because they were uni-modal, interpreting only one type of information, such as vision or sound. In contrast, Multi-modal AI can separate information from both sound and vision, infer their respective meanings, and then present integrated results with appropriate reactions.

While it’s not a perfect analogy, consider explaining uni-modal AI through dogs that seem to comprehend human language. When their owner speaks, dogs don’t actually understand the words but instead try to discern the meaning from the speed or tone of the voice. For instance, if the owner joyfully exclaims, ‘Which naughty dog made a mess on the bed!’ in a cheerful, high-pitched tone, dogs perceive that their owner is in a good mood. Conversely, if the owner sternly declares, ‘Who wants to go out for a walk!’ in an angry tone, dogs interpret it as a sign of displeasure. The YouTube Shorts below will show you exactly how it works!

With the emergence of Multi-modal AI, individuals can access information more seamlessly, enriching their daily lives. For instance, when asked, ‘Please provide me with some useful expressions for my overseas trip,’ a conventional Large Language Model (LLM) would present results as text or provide a link to a website; its response is typically text-based. In contrast, Multi-modal AI presents results in diverse ways. It can fetch a YouTube video demonstrating useful expressions, highlight key scenes, teach pronunciation, and even offer a summarized overview of the video. This versatility allows it to deliver a range of responses regardless of the answer’s format.

As Multi-modal AI evolves, the integration of AI into our daily lives becomes commonplace and natural. Take ChatGPT, for instance: while it efficiently responds to prompts, it is limited in its ability to ‘suggest’ solutions by comprehending the broader context independently. Its functionality is confined to answering specific queries, highlighting a current constraint in AI capabilities.

<Image 2>

If Multi-modal AI becomes commonplace, it holds the potential to comprehend situations by harnessing both visual and audio information, providing diverse suggestions tailored to each context. For instance, when a user joins a virtual meeting, Multi-modal AI can analyze audio signals such as talking, keyboard typing, mouse clicks, and breathing. If it detects only the sound of a door closing and none of those sounds, it can naturally deduce that the user is absent. In response, it may autonomously initiate actions such as recording the meeting, transcribing spoken words into text, summarizing key points, and adding a new entry to the calendar.
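To make the idea concrete, here is a minimal sketch of how such presence detection might be wired together. The sound labels, the decision thresholds, and the follow-up actions are all hypothetical placeholders for illustration, not any specific product’s API.

```python
# Hypothetical sketch: inferring user absence from detected sound events
# and triggering follow-up actions. All names below are placeholders.

PRESENCE_SOUNDS = {"speech", "keyboard_typing", "mouse_click", "breathing"}
ABSENCE_SOUNDS = {"door_close"}

def infer_presence(detected_events: set[str]) -> bool:
    """Return True if the user appears to be present at the meeting."""
    if detected_events & PRESENCE_SOUNDS:
        return True
    if detected_events & ABSENCE_SOUNDS:
        return False
    # No informative sounds: assume presence rather than acting prematurely.
    return True

def handle_meeting(detected_events: set[str]) -> list[str]:
    """Decide which assistant actions to take for this audio window."""
    if infer_presence(detected_events):
        return []
    # The user seems to have left: record, transcribe, summarize, reschedule.
    return ["start_recording", "transcribe_audio",
            "summarize_meeting", "add_calendar_entry"]

if __name__ == "__main__":
    # Only a door closing was heard in the last audio window.
    print(handle_meeting({"door_close"}))                 # all four follow-up actions
    print(handle_meeting({"speech", "keyboard_typing"}))  # [] (user present)
```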

Moreover, consider individuals dealing with ADHD. In such cases, Multi-modal AI can leverage eye-tracking technology to monitor their focus. If the eyes remain fixed on one point for an extended period or move constantly, Multi-modal AI can interpret these signals as indicators of waning attention. In response, it might play attention-grabbing music or record the duration of focused engagement to show how much time the user spends concentrating on a particular task.
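As a rough illustration of that heuristic, the snippet below flags waning attention from a stream of gaze samples. The thresholds and the sample format are invented for the example, not taken from any real eye-tracking SDK.

```python
# Hypothetical sketch: flagging waning attention from gaze samples.
# Each sample is (timestamp_seconds, x, y) in screen coordinates.

import math

FIXATION_RADIUS_PX = 30      # gaze counts as "fixed" within this radius (assumed)
FIXATION_LIMIT_S = 120       # staring this long at one point suggests zoning out (assumed)
JITTER_LIMIT_PX_PER_S = 800  # constant rapid movement also suggests distraction (assumed)

def attention_waning(samples: list[tuple[float, float, float]]) -> bool:
    if len(samples) < 2:
        return False
    t0, x0, y0 = samples[0]
    t1, _, _ = samples[-1]
    duration = t1 - t0
    # Case 1: eyes fixed on one point for too long.
    max_drift = max(math.dist((x0, y0), (x, y)) for _, x, y in samples)
    if max_drift <= FIXATION_RADIUS_PX and duration >= FIXATION_LIMIT_S:
        return True
    # Case 2: eyes moving constantly and rapidly.
    path = sum(
        math.dist((xa, ya), (xb, yb))
        for (_, xa, ya), (_, xb, yb) in zip(samples, samples[1:])
    )
    return duration > 0 and path / duration >= JITTER_LIMIT_PX_PER_S

if __name__ == "__main__":
    stare = [(t, 500.0, 300.0) for t in range(0, 180, 5)]
    print(attention_waning(stare))  # True: gaze fixed on one point for ~3 minutes
```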

<Image 3>

Additionally, Multi-modal AI can play a pivotal role in assessing patients’ conditions within the medical field. In hospitals, the task of checking on each patient’s condition often strains available medical personnel, and variations in skill levels may lead to different interpretations of the same signals. Addressing this discrepancy is crucial in healthcare.

For instance, when dealing with a patient exhibiting severe symptoms in the bronchial tubes, Multi-modal AI can aggregate comprehensive data. This includes the patient’s inhalation and exhalation history, the frequency and severity of coughing, and any changes observed before and after surgeries or medical interventions. By capturing this information as audio data and transforming it into visual results, healthcare professionals gain a more accurate and standardized dataset. Doctors can then leverage this data to improve the precision of their prescriptions and diagnoses.
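As a rough sketch of the aggregation step, the snippet below turns a log of detected cough events into a per-day summary that could feed a chart for before/after comparison. The event format and the 0–1 severity scale are assumptions made for illustration, not clinical conventions.

```python
# Hypothetical sketch: summarizing detected cough events per day so that
# clinicians can compare periods before and after an intervention.

from collections import defaultdict
from datetime import date

# (day, severity on an assumed 0-1 scale)
cough_events = [
    (date(2023, 12, 1), 0.4), (date(2023, 12, 1), 0.7),
    (date(2023, 12, 2), 0.3),
    (date(2023, 12, 3), 0.8), (date(2023, 12, 3), 0.9), (date(2023, 12, 3), 0.6),
]

def daily_summary(events):
    """Group cough events by day and report count and mean severity."""
    by_day = defaultdict(list)
    for day, severity in events:
        by_day[day].append(severity)
    return {
        day: {"count": len(sev), "mean_severity": round(sum(sev) / len(sev), 2)}
        for day, sev in sorted(by_day.items())
    }

if __name__ == "__main__":
    for day, stats in daily_summary(cough_events).items():
        print(day, stats)
```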

Because Multi-modal AI relies on diverse data formats for input and output, its success hinges on its ability to precisely understand and effectively utilize such data. In this regard, Cochl’s Sound AI Foundation model stands out for its impressive performance in comprehending audio data. This capability empowers Multi-modal AI to exhibit more human-like perception and behavior.

In alignment with Cochl’s vision statement, ‘Creating ears for artificial intelligence,’ Cochl’s Sound AI model can comprehend and utilize audio data at a level comparable to human beings. This means Cochl can equip an LLM with human-level sound detection skills, thereby enhancing its ability to provide better and clearer options and suggestions.

Particularly noteworthy is Cochl.Sense’s ability to detect specific sounds and present the results in various formats, termed ‘post-actions.’ The Sound AI Foundation model is offered in two formats, an API and an SDK, and this flexibility allows users to easily customize our services according to their specific needs.
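As a loose illustration of the ‘post-action’ idea, the sketch below maps sound tags returned by a detector to follow-up behavior. The detector stub, tag names, and action mapping are placeholders invented for this example; refer to the Cochl.Sense API and SDK documentation for the actual interfaces.

```python
# Hypothetical sketch of a "post-action" layer: mapping detected sound tags
# to follow-up behavior. The detector stub and tag names are placeholders,
# not the actual Cochl.Sense API.

POST_ACTIONS = {
    "glass_break": "send_security_alert",
    "baby_cry": "notify_parent_app",
    "siren": "lower_music_volume",
}

def get_detected_tags(audio_window: bytes) -> list[str]:
    """Placeholder for a real sound-recognition call (e.g. via an API or SDK)."""
    return ["baby_cry"]  # pretend result for the example

def run_post_actions(audio_window: bytes) -> list[str]:
    """Translate detected tags into the configured post-actions."""
    tags = get_detected_tags(audio_window)
    return [POST_ACTIONS[tag] for tag in tags if tag in POST_ACTIONS]

if __name__ == "__main__":
    print(run_post_actions(b"\x00" * 16000))  # -> ['notify_parent_app']
```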

Thanks to Multi-modal AI, AI is becoming a close companion in our daily lives. It is still far from JARVIS, the AI butler in the Iron Man movies, but it certainly enriches our everyday lives, improves accessibility, and opens up new opportunities for the future.

Various aspects of AI draw our attention, yet we must conscientiously address both ethical and technical challenges. Additionally, from a user experience perspective, it’s crucial to contemplate how AI can communicate with us seamlessly.

Considering AI’s increasing resemblance to human beings, what are your thoughts on this matter?

Published in Cochl

Creating ears for artificial intelligence.

Written by Minseo Jang

I’ve decided that in this life, I want to be defined by the things I love — putting myself into new challenges, continuously questioning and connecting the dots