“MeowTalk” — How to train a YAMNet audio classification model for mobile devices

Danny Kosmin
5 min read · Sep 17, 2020

Each cat has its own unique vocabulary that it uses consistently with its owner in a given context. For example, each cat has a distinct meow for “food” or “let me out.” This is not necessarily a language, as cats do not share the same meows to communicate the same things, but we can use machine learning to interpret the meows of individual cats.

In this article, we provide an overview of the MeowTalk app along with a description of the process we used to implement the YAMNet acoustic detection model for the app.

MeowTalk Project and Application Overview

In this section, we provide an overview of the MeowTalk project and app along with how we use the YAMNet acoustic detection model. In short, our goal was to translate cat vocalizations (meows) to intents and emotions.

MeowTalk application — Main Screen (left), Intent Recognition Results (right)

We use the YAMNet acoustic detection model (converted to a TFLite model) with transfer learning to make predictions on audio streams while operating on a mobile device. There are two model types in the project:

  1. A general cat vocalization model that detects whether a sound is a cat vocalization;
  2. A cat-specific intent model that detects specific intents and emotions for an individual cat (e.g., angry, hungry, happy).

If the general model returns a high score for a cat vocalization, then we send features from the general model to the cat-specific intent model.
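
As a rough sketch of that two-stage flow (the threshold value and the model handles here are placeholders, not the production code):

import numpy as np

CAT_SCORE_THRESHOLD = 0.5  # placeholder; tune on validation data

def classify_meow(waveform, general_model, intent_model):
    """Two-stage inference: first detect a cat vocalization, then classify its intent."""
    # The general model returns a cat-vocalization score plus the YAMNet features.
    cat_score, features = general_model(waveform)
    if cat_score < CAT_SCORE_THRESHOLD:
        return None  # not a cat vocalization; skip the intent model
    # The cat-specific model maps the shared features to intent/emotion scores.
    intent_scores = intent_model(features)
    return int(np.argmax(intent_scores))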

High-level representation

I highly recommend becoming familiar with the YAMNet project — it is incredible. YAMNet is a fast, lightweight, and highly accurate model. You can find all the details about installation and setup in the TensorFlow repo.

Next, I will go over the main stages of the model development, training, and conversion to the TFLite format.

How to change the YAMNet architecture for transfer learning

The YAMNet model predicts 521 classes from the AudioSet-YouTube corpus. In order to use this model in our app, we need to get rid of the network’s final Dense layer and replace it with one of our own.

Last layers of the model

As you can see in the image, there is a global average pooling layer that produces feature tensors of size 1024. To train the last dense layers of the network, we have to create a set of inputs and outputs.

First, we need to choose the type of input. I propose training the last layers on our own data and connecting them to the YAMNet model after training. In practice, that means we will extract YAMNet features from our audio samples, add a label to each feature set, train the network on these features, and attach the resulting model to YAMNet. Depending on the length of an audio sample, we will get a different number of feature vectors. According to the “params.py” file, we have the following properties:

patch_window_seconds = 0.96 
patch_hop_seconds = 0.48
stft_window_seconds = 0.025
stft_hop_seconds = 0.010

In the “features.py” file, you can find that the minimum length of audio is:

min_waveform_seconds = (
    params.patch_window_seconds +
    params.stft_window_seconds - params.stft_hop_seconds)

So the minimum audio length is 0.975 s, or 15,600 samples (since the sample rate is 16,000 Hz), with a hop (offset) of 0.48 s between patches. In our case, it looks like this:

According to the picture, a two-second audio sample gives us four feature vectors from the YAMNet model (the feature extractor pads the clip up to a whole number of hops, which is why we get four patches rather than three).
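
To make that arithmetic concrete, here is a small helper that mirrors the pad-then-frame logic in “features.py” (treat it as a sketch, not the repo’s code):

import math

SAMPLE_RATE = 16000
PATCH_WINDOW_SECONDS = 0.96
PATCH_HOP_SECONDS = 0.48
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010

# Minimum accepted waveform: 0.975 s, i.e. 15,600 samples at 16 kHz.
MIN_WAVEFORM_SECONDS = (PATCH_WINDOW_SECONDS +
                        STFT_WINDOW_SECONDS - STFT_HOP_SECONDS)

def num_patches(clip_seconds):
    """Number of 0.96 s patches (feature vectors) for a clip, with a 0.48 s hop."""
    clip_seconds = max(clip_seconds, MIN_WAVEFORM_SECONDS)  # short clips are padded up
    # Any partial hop beyond the minimum is padded up to one more full patch.
    extra_hops = math.ceil((clip_seconds - MIN_WAVEFORM_SECONDS) / PATCH_HOP_SECONDS)
    return 1 + extra_hops

print(num_patches(2.0))  # -> 4, matching the example above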

How to get features from YAMNet?

Update: since the latest update of the repo, the authors have added the feature (embedding) extraction to the model’s outputs, so we no longer need to change its structure ourselves.

The outputs of the model (shown in the snippet below) are:

  1. predictions — scores for each of the 521 classes;
  2. embeddings — the YAMNet features;
  3. log_mel_spectrogram — the log mel spectrogram of the patches.
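
For reference, here is roughly how the three outputs look. This sketch loads YAMNet from TensorFlow Hub for convenience; the Keras model in the TensorFlow repo exposes the same three tensors:

import numpy as np
import tensorflow_hub as hub

yamnet = hub.load('https://tfhub.dev/google/yamnet/1')

# 16 kHz mono waveform in [-1.0, 1.0]; one second of silence as a stand-in.
waveform = np.zeros(16000, dtype=np.float32)

scores, embeddings, log_mel_spectrogram = yamnet(waveform)
print(scores.shape)               # (num_patches, 521) class scores
print(embeddings.shape)           # (num_patches, 1024) features we train on
print(log_mel_spectrogram.shape)  # (num_frames, 64) log mel spectrogram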

To create the training dataset, we need to build a set of embeddings paired with labels. We assume that all the sounds in a file belong to one class and that the samples of each class are stored in a directory named after that class. You can easily adapt this pipeline for a multi-label classification problem.
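
A sketch of that dataset-building step, assuming a directory layout like data/<class_name>/*.wav and reusing the yamnet handle from the previous snippet (the file-loading details are placeholders):

import os
import numpy as np
import soundfile as sf  # or any loader that yields 16 kHz mono float32 audio

def build_dataset(data_dir, yamnet):
    """Collect (embedding, label) pairs; every file inherits its folder's class."""
    class_names = sorted(os.listdir(data_dir))
    features, labels = [], []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(data_dir, class_name)
        for fname in os.listdir(class_dir):
            waveform, sr = sf.read(os.path.join(class_dir, fname), dtype='float32')
            assert sr == 16000, 'resample to 16 kHz mono before feature extraction'
            _, embeddings, _ = yamnet(waveform)
            for emb in embeddings.numpy():       # one 1024-d vector per patch
                features.append(emb)
                labels.append(label)
    return np.array(features), np.array(labels), class_names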

After this step, we have a training dataset.

Note: if your audio files contain more than just the target sound, I advise implementing silence removal to improve the training process. It increases accuracy significantly.
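
One simple way to do this (not necessarily what the MeowTalk pipeline uses) is energy-based trimming, for example with librosa:

import numpy as np
import librosa

def remove_silence(waveform, top_db=30):
    """Drop low-energy regions of a waveform (threshold in dB below the peak)."""
    intervals = librosa.effects.split(waveform, top_db=top_db)
    if len(intervals) == 0:
        return waveform
    return np.concatenate([waveform[start:end] for start, end in intervals])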

Training the model

First of all, we need to define the model. We assume that each cat audio sample has exactly one label. I propose creating two dense layers ending in a softmax activation. During our experiments, we concluded that this is the best configuration for our task, but you can freely change the network structure depending on your own results.

Model architecture

Now let’s create the last layers of the model. Remember, the input shape is 1024:
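
A minimal sketch of such a classifier head; the layer sizes and activations are illustrative, and NUM_CLASSES is a placeholder for your number of intents:

import tensorflow as tf

NUM_CLASSES = 10  # placeholder: number of intents/emotions for one cat

def build_classifier(num_classes=NUM_CLASSES):
    """Dense head trained on 1024-dimensional YAMNet embeddings."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation='relu', input_shape=(1024,)),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])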

After that, we are ready to train our last layers.
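
For example, training the head on the (features, labels) arrays built earlier might look like this (the hyperparameters and file names are placeholders):

features, labels, class_names = build_dataset('data', yamnet)

model = build_classifier(num_classes=len(class_names))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# features: (N, 1024) YAMNet embeddings, labels: (N,) integer class ids
model.fit(features, labels, epochs=50, batch_size=32, validation_split=0.2)
model.save_weights('classifier_weights.h5')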

At this point, we already have trained weights for our classifier, but that is not the whole pipeline yet. We now need to replace the last dense layer of the original YAMNet with our classifier. For this task, we need to modify the YAMNet model creation code.
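
A rough sketch of that wiring, reusing build_classifier from above and assuming the yamnet_frames_model factory and params module from the tensorflow/models YAMNet code (exact names and signatures may differ between repo versions):

import tensorflow as tf
import params as yamnet_params   # YAMNet hyperparameters from the repo
import yamnet as yamnet_lib      # defines yamnet_frames_model()

# Build the original YAMNet and load the released weights.
yamnet_full = yamnet_lib.yamnet_frames_model(yamnet_params)
yamnet_full.load_weights('yamnet.h5')

# Rebuild our classifier head and load the weights we trained above.
classifier = build_classifier()
classifier.load_weights('classifier_weights.h5')

# Outputs are (predictions, embeddings, log_mel_spectrogram); keep only the
# embeddings and feed them through our dense layers instead of the original scores.
_, embeddings, _ = yamnet_full.output
intent_scores = classifier(embeddings)
custom_model = tf.keras.Model(inputs=yamnet_full.input, outputs=intent_scores)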

And now you are ready to use it.

TensorFlow Lite model

In this section, we will discuss how to convert the custom YAMNet model into the TFLite model.

Note: Since the last update of the YAMNet model, you don’t have to change the spectrogram generation process.

So all we need to do is modify the export function to make it compatible with our model: replace the model creation function with our custom one and add the path to the trained weights.
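
If you convert the combined Keras model directly, the core of the conversion is just the standard TFLite converter (a sketch; adapt it to the repo’s export function as described above):

converter = tf.lite.TFLiteConverter.from_keras_model(custom_model)
# Allowing select TF ops is a safe fallback in case some ops in YAMNet's audio
# frontend are not available as TFLite builtins in your TensorFlow version.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
with open('meowtalk_yamnet.tflite', 'wb') as f:
    f.write(tflite_model)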

Also, we got rid of the spectrogram output when we modified the model, so we have to take this into account and remove the now-redundant lines of code in the exporter.

Now you are ready to train your own great audio classification models and run them on a mobile device.

The finished project and all the instructions are available here.

Danny Kosmin

Machine learning engineer at Akvelon Inc., Kharkiv, Ukraine.