Who made the sound? - From sound identification to speaker recognition
When I was young, I had a habit whenever my parents called home. I would imitate my younger brother on the phone, and only when my parents said, “Put your sister on the phone,” would I reveal that it had been me all along. Even after growing up, when my friend’s boyfriend called her, instead of passing the phone to her, I would talk to him while pretending to be my friend. I guess many of us have had similar experiences.
If you have ever been deceived by this trick or if you’ve done similar things yourself, let’s discuss ‘Speaker Recognition,’ one of the Sound AI technologies. With this technology, no one will be able to deceive you from now on!
1. Sound Identification and Speaker Recognition
In the story of “Ali Baba and the Forty Thieves,” Ali Baba knows the secret phrase to enter the cave where treasures are hidden. The secret phrase is “Open, Sesame!” which Ali Baba has to shout to gain entry into the cave.
If you’re a big fan of the Harry Potter series like me, then you may be familiar with the “Marauder’s Map.”
This map was handed down to Harry by the Weasley twins, and it served as a perfect accomplice for secretly carrying out mischief. To activate the map, users had to tap it with their wands and say “I solemnly swear that I am up to no good.”
What do “Open, Sesame!” and “I solemnly swear that I am up to no good” have in common? The key is that anyone who knows the secret code or phrase can access the associated information or location. This concept is similar to sound identification technology (more commonly called speech recognition), in which spoken human language (the input) is recognized and transcribed into text that the computer can process. The computer then compares the input text to the information it already has and decides whether they match.
In other words, ‘anyone who knows the phrase’ can access the information, which is an obvious security weakness. This is where ‘speaker recognition’ technology comes in.
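The compare-the-text step can be sketched in a few lines of Python. This is only an illustration: it assumes the transcript has already been produced by some speech-to-text system, and the helper names `normalize` and `phrase_matches` are made up for this example.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def phrase_matches(transcript: str, secret: str) -> bool:
    """Return True when the transcribed utterance equals the stored phrase."""
    return normalize(transcript) == normalize(secret)


# Hypothetical stored secret; the transcript would come from speech-to-text.
SECRET = "open sesame"

print(phrase_matches("Open, Sesame!", SECRET))  # True
print(phrase_matches("Open, Barley!", SECRET))  # False
```

Note that nothing in this check looks at who is speaking. The cave opens for anyone whose words match.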
2. Speaker Recognition by validation — Speaker Identification and Speaker Verification
Speaker recognition technology identifies or confirms a person based on the unique traits or characteristics of their voice. Speaker recognition technology can be divided into two broad categories — speaker identification and speaker verification — based on the validation process.
Speaker identification technology matches a new input voice against the enrolled voices. This allows us to determine the speaker’s identity: is it A’s voice or B’s voice?
However, a critical downside is that even when a voice has not been enrolled beforehand (say, C’s voice), speaker identification still picks the enrolled voice that most resembles it and may treat the unenrolled speaker as that person. This happens because speaker identification relies on ‘similarity’ alone, which is its weakness here.
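The weakness is easy to see in a toy sketch. Assume each voice has already been converted into a fixed-length vector (an “embedding”) by some model; the three-dimensional vectors and the `identify` helper below are invented for illustration and do not describe any particular product.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(embedding, enrolled):
    """Return (name, score) for the most similar enrolled speaker."""
    scores = ((name, cosine_similarity(embedding, ref))
              for name, ref in enrolled.items())
    return max(scores, key=lambda pair: pair[1])


# Toy embeddings (assumed values); only A and B are enrolled.
enrolled = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0]}
voice_c = [0.6, 0.5, 0.62]  # C never enrolled

name, score = identify(voice_c, enrolled)
print(name, round(score, 2))  # picks "A" even though the score is only about 0.6
```

Because `identify` always returns the best match, C is labeled as A even though the two voices are not very similar.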
On the other hand, speaker verification confirms whether the identity of a new input speaker matches the enrolled voice. It calculates the similarity between the uttered voice and the enrolled voice, then decides whether to accept or reject the utterance. The output is simply ‘Yes’ or ‘No’.
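The accept-or-reject decision can be sketched as a similarity score compared against a threshold. This is a minimal illustration under assumptions: the vectors are toy embeddings, cosine similarity is just one common choice of score, and the threshold of 0.8 is arbitrary (real systems tune it on data).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(embedding, enrolled_embedding, threshold=0.8):
    """Answer 'Yes' (True) or 'No' (False): is this the enrolled speaker?"""
    return cosine_similarity(embedding, enrolled_embedding) >= threshold


# Toy embeddings (assumed values, not real voice data).
enrolled_voice = [0.9, 0.1, 0.2]
same_speaker = [0.85, 0.15, 0.25]  # new utterance from the enrolled person
stranger = [0.1, 0.9, 0.3]

print(verify(same_speaker, enrolled_voice))  # True
print(verify(stranger, enrolled_voice))      # False
```

Raising the threshold rejects more impostors but also rejects more genuine attempts, so the value is a trade-off between security and convenience.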
Therefore, when a stranger’s voice is input, speaker identification identifies the most similar enrolled voice to the utterance, while speaker verification determines whether the utterance matches any enrolled voices.
Speaker verification technology can play an important role in sectors that require individual profiles, such as banking or other services that need advanced biometric verification, where it can strengthen security. Speaker identification technology can be used for meeting transcripts or for controlling IoT devices shared by multiple users.
3. Speaker Recognition — Text-Dependent and Text-Independent
Speaker recognition technology can be used in two different ways — text-dependent or text-independent. For text-dependent methods, the recognition of the speaker requires the utterance to be in a designated form or phrase, such as the phrase “I solemnly swear that I am up to no good”. Although this method is less convenient, it demonstrates a higher recognition rate.
Unlike text-dependent methods, text-independent methods do not limit the form or phrases uttered. With this technology, even if the speaker says “I thoughtlessly swear that I am up to good”, it can still identify and verify the speaker. However, it requires more advanced technology, and the overall recognition rate is comparatively lower.
Although not covered in detail here, speaker separation technology can separate the individual speech signals of each speaker when multiple speakers are talking simultaneously. This technology can be used to make transcripts of video content or to provide additional assistance for the hearing-impaired.
To improve speaker recognition technology, several factors need to be considered. For example, the speaker might cough while speaking, the environment might be noisy, and a speaker’s voice can change with age, health condition, or the voice change of puberty. Handling these complex variables is therefore key to advanced speaker recognition technology.
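One simplified way to picture the difference is as an extra gate on the words themselves. The sketch below is an assumption-heavy illustration, not how any specific system is built: `voice_score` stands in for a similarity score from a speaker model, the transcript is assumed to come from speech-to-text, and `accept_speaker` is a made-up helper.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def accept_speaker(voice_score, transcript, required_phrase=None, threshold=0.8):
    """Text-dependent when required_phrase is given, text-independent otherwise."""
    if required_phrase is not None:
        # Text-dependent: the words must match the designated phrase.
        if normalize(transcript) != normalize(required_phrase):
            return False
    # In both modes the voice itself must be similar enough.
    return voice_score >= threshold


PHRASE = "I solemnly swear that I am up to no good"
spoken = "I thoughtlessly swear that I am up to good"

print(accept_speaker(0.9, spoken, required_phrase=PHRASE))  # False
print(accept_speaker(0.9, spoken))                          # True
```

In practice, a text-dependent system also benefits from comparing the same sounds at enrollment and at test time, which is one reason it tends to reach a higher recognition rate.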
4. What’s going on at Cochl?
In keeping with our motto, ‘Creating ears for artificial intelligence’, we conduct various experiments to enhance our Sound AI service. Today, I’d like to introduce one of these experiments, the ‘Meeting Note’ project.
Meeting Note uses Cochl’s speaker identification technology to make meeting time more meaningful for attendees. Users register their voices by recording designated sentences, and when the meeting starts, the system automatically creates notes about who said what during the meeting.
This project combines three technologies: speech identification (transforming utterances into text), speaker identification and speaker verification (identifying and confirming the right speakers), and speaker separation (isolating each speaker’s utterances when multiple speakers are talking simultaneously).
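To make the “who said what” output concrete, here is a small sketch of the final assembly step only. It is not Cochl’s implementation: the `Segment` structure, the `format_notes` helper, and the sample data are all hypothetical, and the speaker labels and text are assumed to have been produced already by the three technologies above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the start of the meeting
    speaker: str   # label assumed to come from speaker identification
    text: str      # transcript assumed to come from speech-to-text


def format_notes(segments):
    """Sort segments by time and merge consecutive turns by the same speaker."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1][1] == seg.speaker:
            stamp, speaker, text = turns[-1]
            turns[-1] = (stamp, speaker, text + " " + seg.text)
        else:
            minutes, seconds = divmod(int(seg.start), 60)
            turns.append((f"{minutes:02d}:{seconds:02d}", seg.speaker, seg.text))
    return [f"[{stamp}] {speaker}: {text}" for stamp, speaker, text in turns]


# Hypothetical sample data.
segments = [
    Segment(65.0, "Bora", "I agree."),
    Segment(0.0, "Alex", "Let's start."),
    Segment(4.5, "Alex", "First item."),
]

for line in format_notes(segments):
    print(line)
# [00:00] Alex: Let's start. First item.
# [01:05] Bora: I agree.
```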
Cochl is continuously conducting new experiments to enhance Cochl.Sense for universal use in the field of Sound AI. Look forward to significant improvements in the upcoming Cochl.Sense release!