The On-Device Machine Learning Behind Google's New Pixel 4 Recorder App

Anikesh · Published in Analytics Jobs · 5 min read · Jan 1, 2020

Over the past two decades, Google has made information widely accessible through search: from text, videos, and photos, to jobs and maps. But a lot of the world's information is conveyed through speech. Although almost everyone uses recording devices to capture important information from meetings, interviews, lectures, and more, it can be extremely difficult to later parse through hours of recordings to find and extract the information of interest. But imagine if you could transcribe and tag long recordings in real time, letting you intuitively find the information you need at exactly the right moment.

How much easier would life be if an application transcribed your recorded lectures or interviews into notes for you? And what if you could easily see which parts of those hour-long lectures carried the most emphasis?

There is no question that Google is at the cutting edge of artificial intelligence (AI) and machine learning (ML). From industry-leading computational photography to suggestions as we write email, AI and ML are at the core of much of Google's work.

Machine learning is among the most impressive new features of our smartphones, though it is a phrase that is frequently used and rarely understood. In a blog post, Google took the time to describe in detail how machine learning algorithms are used and implemented in the new Recorder app for Pixel phones, and in particular how machine learning makes it one of the best recording apps you will ever use.

Recorder's simple interface is deceiving. Behind it is code designed to listen to, transcribe, understand, and classify the speech and other audio your phone hears while recording with the Recorder app. While recording, you will immediately notice a couple of things: alongside the waveform and the timeline, categories in different colors show up in the main tab, while the words being spoken appear in the transcription tab in real time.

Currently, this feature is only available for Pixel 4 users.

The Pixel 4's Recorder app is yet another instance of Google's ML prowess. The company introduced the smart audio recorder app with the Pixel 4, using on-device machine learning to automatically transcribe recordings. Google has now explained in detail exactly how the new Recorder app functions.

Transcribing

Recorder transcribes speech in real time using an on-device automatic speech recognition model based on improvements announced earlier this year. A key component of many of Recorder's smart features, this model was built to transcribe long audio recordings (a few hours) reliably, while simultaneously indexing the conversation by mapping words to timestamps computed by the speech recognition model. This makes it possible for the user to tap a word in the transcription and start playback from that point in the recording, or to search for a word and jump to the exact point where it was spoken.
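Conceptually, this index is just a mapping from each recognized word to its start and end timestamps. Here is a minimal sketch of the idea; names like `TranscriptIndex` and `Word` are illustrative, not Recorder's actual code:

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_ms: int  # when the word begins in the recording
    end_ms: int    # when the word ends

class TranscriptIndex:
    """Toy index mapping transcribed words to timestamps."""

    def __init__(self, words: list[Word]):
        self.words = words
        self.starts = [w.start_ms for w in words]

    def word_at(self, position_ms: int) -> Word | None:
        """Return the word under a tap at position_ms (tap-to-seek)."""
        i = bisect_right(self.starts, position_ms) - 1
        if i >= 0 and self.words[i].end_ms >= position_ms:
            return self.words[i]
        return None

    def find(self, term: str) -> list[int]:
        """Return every start time where term was spoken (search)."""
        term = term.lower()
        return [w.start_ms for w in self.words if w.text.lower() == term]

index = TranscriptIndex([Word("hello", 0, 400), Word("world", 450, 900)])
print(index.find("world"))  # [450] -> jump playback to 450 ms
```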

Recording Content Visualization via Sound Classification

While showing a transcript for a recording is useful and lets you search for specific words, it is often more helpful (especially for long recordings) to visually scan for sections of a recording based on particular moments or sounds. To enable this, Recorder also presents the audio visually as a colored waveform in which each color is associated with a different sound category. This is accomplished by combining research on using convolutional neural networks (CNNs) to classify audio (e.g., recognizing a dog barking or a musical instrument playing) with previously published datasets for sound event detection, in order to classify apparent sound events in the recorded audio frames.
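Google has not published Recorder's exact classifier, but a publicly available model in the same family is YAMNet, a CNN trained on the AudioSet sound-event dataset. The sketch below shows frame-level sound classification with it; the model choice is an assumption for illustration, not necessarily what Recorder runs on-device:

```python
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# YAMNet: a CNN sound-event classifier trained on AudioSet (521 classes).
model = hub.load('https://tfhub.dev/google/yamnet/1')

# The model expects mono float32 audio at 16 kHz in [-1.0, 1.0].
waveform = np.random.uniform(-0.1, 0.1, 16000 * 3).astype(np.float32)

# scores: [num_frames, 521] per-frame class scores.
scores, embeddings, spectrogram = model(waveform)

# Map class indices to human-readable names from the bundled class map.
class_map = model.class_map_path().numpy().decode('utf-8')
with tf.io.gfile.GFile(class_map) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]

# Dominant sound per frame, e.g. "Speech", "Music", "Dog".
per_frame = [class_names[i] for i in np.argmax(scores.numpy(), axis=1)]
print(per_frame[:5])
```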

In many situations, several sounds can occur at the same time. To visualize the audio as clearly as possible, each waveform bar is colored in a single color representing the most dominant sound in a given time frame (50 ms bars, in this case). The colorized waveform lets users see at a glance what kind of content was captured in a recording and navigate an ever-growing audio library with ease. It gives users a visual representation of their recordings and lets them browse through the audio events within them.
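Choosing the bar color then reduces to taking the highest-scoring class in each 50 ms slot, provided it clears a confidence threshold. A hypothetical sketch; the class list and color palette are made up for illustration:

```python
import numpy as np

CLASS_COLORS = {  # illustrative palette, not Recorder's actual one
    'speech': '#4285F4', 'music': '#EA4335',
    'applause': '#FBBC05', 'silence': '#9AA0A6',
}
CLASSES = list(CLASS_COLORS)

def bar_colors(frame_scores: np.ndarray, threshold: float = 0.5) -> list[str]:
    """frame_scores: [num_bars, num_classes], one row per 50 ms bar.
    Returns one color per bar; falls back to a neutral color when no
    class is confident enough to count as dominant."""
    colors = []
    for scores in frame_scores:
        top = int(np.argmax(scores))
        if scores[top] >= threshold:
            colors.append(CLASS_COLORS[CLASSES[top]])
        else:
            colors.append('#DADCE0')  # neutral: no dominant sound
    return colors
```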

Recorder implements a sliding window capability that processes partially overlapping 960 ms audio frames at 50 ms intervals and outputs a sigmoid score vector representing the probability of each supported audio class within the frame. A linearization process is applied to the sigmoid scores in combination with a thresholding mechanism, in order to maximize the system's precision and report the correct sound classification. Analyzing the content of the 960 ms window with small 50 ms offsets makes it possible to pinpoint exact start and end times in a manner less prone to error than analyzing consecutive, non-overlapping 960 ms slices on their own.
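In code, the sliding window amounts to classifying a 960 ms slice every 50 ms and thresholding the resulting sigmoid scores. A minimal sketch follows, where `classify_frame` is a placeholder standing in for the on-device CNN, not a real API:

```python
import numpy as np

SAMPLE_RATE = 16000                 # assumed input rate
WIN = int(0.960 * SAMPLE_RATE)      # 960 ms analysis window
HOP = int(0.050 * SAMPLE_RATE)      # 50 ms step between windows

def classify_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the CNN: returns per-class sigmoid scores."""
    raise NotImplementedError

def sliding_scores(waveform: np.ndarray, threshold: float = 0.5):
    """Yield (time_ms, thresholded score vector) every 50 ms."""
    for start in range(0, len(waveform) - WIN + 1, HOP):
        scores = classify_frame(waveform[start:start + WIN])
        # Zero out low-confidence classes so that only confident
        # detections are reported, favoring precision over recall.
        scores = np.where(scores >= threshold, scores, 0.0)
        yield start * 1000 // SAMPLE_RATE, scores
```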

Since the model analyzes each frame independently, it can be susceptible to rapid jittering between audio classes. This is resolved with an adaptive-size median filtering technique applied to the model's audio class outputs, yielding a smoothed, consistent output over time. The process runs continuously in real time, which requires it to meet very strict power consumption limits.
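One way to implement such smoothing is to median-filter each class's score trajectory over time before picking the dominant class. The article says Recorder adapts the filter size; the fixed-size version below is only a sketch of the idea:

```python
import numpy as np

def median_smooth(scores: np.ndarray, kernel: int = 9) -> np.ndarray:
    """Median-filter each class's scores across time to suppress
    single-frame jitter. scores: [num_frames, num_classes];
    kernel: odd window length (fixed here; Recorder's is adaptive)."""
    assert kernel % 2 == 1
    pad = kernel // 2
    padded = np.pad(scores, ((pad, pad), (0, 0)), mode='edge')
    out = np.empty_like(scores)
    for t in range(scores.shape[0]):
        out[t] = np.median(padded[t:t + kernel], axis=0)
    return out

# After smoothing, the per-frame label is the argmax as before:
# labels = np.argmax(median_smooth(raw_scores), axis=1)
```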

Suggesting Tags for Titles

When a recording is done, Recorder suggests three tags that the app deems to represent the most memorable content, allowing the user to quickly compose a meaningful title. To be able to suggest these tags as soon as the recording ends, Recorder analyzes its content while it is being transcribed. First, Recorder counts term occurrences along with their grammatical role in the sentence; the terms identified as entities are capitalized. Next, an on-device part-of-speech tagger (a model that labels each word in the sentence according to its grammatical role) is used to detect common nouns and proper nouns, which appear to be more memorable to users. Recorder uses a prior-scores table supporting both unigram and bigram term extraction; to generate the scores, Google trained a boosted decision tree on conversational data, using textual features such as document word frequency and specificity. Finally, stop words and swear words are filtered out and the top tags are output.
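Put together, the pipeline looks roughly like: collect candidate unigrams and bigrams, keep the nouns, score them, filter stop words, and emit the top three. Below is a simplified sketch using NLTK's part-of-speech tagger as a stand-in for Google's on-device tagger; plain frequency scoring replaces the boosted-decision-tree ranking described above:

```python
from collections import Counter

import nltk  # stand-in for the on-device part-of-speech tagger

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

STOP_WORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'it', 'is'}

def suggest_tags(transcript: str, k: int = 3) -> list[str]:
    """Suggest k tags: frequent nouns and noun bigrams, minus stop words.
    Frequency scoring is a simplification of Recorder's prior-scores
    table and boosted decision tree."""
    tokens = nltk.word_tokenize(transcript.lower())
    tagged = nltk.pos_tag(tokens)
    nouns = [w for w, tag in tagged
             if tag.startswith('NN') and w.isalpha()
             and w not in STOP_WORDS]
    # Bigrams of consecutive entries in the filtered noun list
    # (a simplification of true in-sentence adjacency).
    bigrams = [' '.join(p) for p in zip(nouns, nouns[1:])]
    counts = Counter(nouns) + Counter(bigrams)
    return [term for term, _ in counts.most_common(k)]

print(suggest_tags("The quarterly budget meeting covered the budget "
                   "deficit and the marketing budget."))
```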

If you have used the new Google Recorder App, share your experience in the comment section below…
