Tutorial 6: Speech Recognition through Computer Vision

David Yang · Published in Fenwicks · 5 min read · Apr 17, 2019

Prerequisite: Tutorial 0 (setting up Google Colab, TPU runtime, and Cloud Storage)

When we study deep learning, we most likely start with image classification, such as MNIST and CIFAR-10. A not-so-well-known fact is that these image classification models apply directly to speech recognition. That is what we are going to do today.

Speech is audio. An audio clip is simply an array of numbers, one for each time instance, such as a millisecond. Each number tells the loudspeaker how much magnetic force to apply to push (or pull) the surrounding air, which generates sound.

Audio is nothing more than a 1-dimensional sequence of numbers. © DeepMind

So audio is a 1-dimensional time series. How is this related to computer vision? The connection is a technique called the Fourier transform, which decomposes sound (represented as a time series) into a sum of simple sine waves with different frequencies. In other words, the Fourier transform exposes another dimension of sound: frequency. Plotting frequency content over time gives a 2-dimensional diagram, called a spectrogram.

Example spectrogram © Wikipedia

So the task becomes to analyze those spectrograms. If we view each of them as an image, we can apply our familiar computer vision techniques, such as convolutional neural networks (ConvNets). Instead of listening to sound, the AI reads the spectrogram, and makes a prediction.
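To make this concrete, here is a small, self-contained illustration (not part of the tutorial's pipeline) that turns a synthetic two-tone signal into a spectrogram with SciPy. The tones and parameters are made up purely for demonstration.

import numpy as np
from scipy import signal

sr = 16000                                  # samples per second
t = np.arange(sr) / sr                      # 1 second of time stamps
wave = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

freqs, times, spec = signal.spectrogram(wave, fs=sr)
print(spec.shape)   # (frequency bins, time frames): a 2D "image" of the sound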

In this tutorial, we classify short voice commands using TensorFlow’s Speech Commands dataset.

Setup. As usual, we download Fenwicks, set the hyperparameters, and initialize Google Cloud Storage (GCS).

import tensorflow as tf

if tf.io.gfile.exists('./fenwicks'):
    tf.io.gfile.rmtree('./fenwicks')
!git clone -q https://github.com/fenwickslab/fenwicks.git

from IPython.display import Audio
from scipy.io import wavfile
import librosa
import fenwicks as fw
import os
import numpy as np

ROOT_DIR = 'gs://gs_colab'
PROJECT = 'tutorial6'

BATCH_SIZE = 128 #@param ["128", "256", "512"] {type:"raw"}
EPOCHS = 24 #@param {type:"slider", min:0, max:100, step:1}
LEARNING_RATE = 0.001 #@param ["0.001", "0.01", "0.1"] {type:"raw"}
WARMUP = 0.1 #@param {type:"slider", min:0, max:0.5, step:0.05}

fw.colab_utils.setup_gcs()
data_dir, work_dir = fw.io.get_project_dirs(ROOT_DIR, PROJECT)

You may have noticed that we import a library called “librosa”. Librosa is to audio what OpenCV is to images: it provides common functions for reading, writing, and transforming audio data, which we will use shortly.
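As a quick, illustrative example of what librosa gives us, the snippet below loads a clip and resamples it to 16 kHz; the file name here is a placeholder, not a file from the dataset.

y, sr = librosa.load('some_clip.wav', sr=16000)   # y: float waveform in [-1, 1]
print(y.shape, sr)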

Preparing data. We download TensorFlow’s Speech Commands dataset, version 0.01. The same dataset was used in a Kaggle competition.

data_dir_local = fw.datasets.untar_data(fw.datasets.URLs.SPEECH_CMD_001,
                                        './speech001')

This dataset contains short audio clips (around 1 second each) covering 30 different words, which become 31 classes once we add the “silence” class below. Some of these words are commands, such as “up”, “down”, “stop” and “go”. If we can recognize these commands using a smartphone’s AI chip, available in both Apple and Samsung phones, we won’t need to transmit the user’s voice to a server. This protects the user’s privacy.

Let’s listen to a sample audio file from the dataset, saying “happy”:
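The embedded audio player does not survive in text form, but the playback code is simple. Here is a sketch that picks an arbitrary clip from the happy/ folder; example_audio_fn is a name introduced here for illustration and reused below.

import glob

# Pick any clip of the word "happy"; the exact file does not matter.
example_audio_fn = glob.glob(os.path.join(data_dir_local, 'happy', '*.wav'))[0]
sr, samples = wavfile.read(example_audio_fn)
Audio(data=samples, rate=sr)   # renders an audio player in the notebook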

This dataset also contains a folder called “_background_noise_”. We extract segments from these noise files to represent silence. Additionally, we generate synthesized silence from random noise. The following code is adapted from an open solution:

# The noise folder name and the 16 kHz sampling rate come from the text.
NOISE_DIR = '_background_noise_'
SAMPLE_RATE = 16000

def gen_silence():
    NUM_SEGMENT = 400
    NUM_SYNTH = 500

    path = os.path.join(data_dir_local, NOISE_DIR)
    out_path = os.path.join(data_dir_local, 'silence/')

    fw.io.create_clean_dir(out_path)
    files = fw.io.enum_files(path, 'wav')

    for filename in files:
        _, samples = wavfile.read(filename)
        for i in range(NUM_SEGMENT):
            out_name = f'segment_{i}_{os.path.basename(filename)}'
            # Take a 1-second window and scale it by a random (possibly zero) factor.
            data = (samples[i * 200: i * 200 + SAMPLE_RATE] *
                    max(0, 2 * (np.random.random() - 0.25))).astype('int16')
            if data.max() != 0:
                wavfile.write(out_path + out_name, SAMPLE_RATE, data)

    for i in range(NUM_SYNTH):
        d = fw.audio_io.gen_synth_silence(sr=SAMPLE_RATE, n_rand=4600)
        wavfile.write(os.path.join(out_path, f'new_synthesized_{i}.wav'),
                      SAMPLE_RATE, d)

gen_silence()

Converting audio to Mel-spectrograms. So far we have been dealing with raw wave files, which contain 1D time-series representations of audio. Next, we convert the audio to images, called Mel-spectrograms. A Mel-spectrogram is a transformed spectrogram that places more emphasis on the frequencies that the (adult) human ear is more sensitive to. In general, people are more sensitive to some frequencies than others, and everybody is slightly different. For example, some schools have reportedly played high-pitched, mosquito-like sounds that annoy teenagers but not adults.

Let’s get the Mel-spectrogram for the “happy” audio clip we have just played:

x_example = fw.audio_io.read_logmelspectrogram(example_audio_fn)
x_example.shape

The shape of x_example is 40x101, using Fenwicks’ default parameters (this is configurable). Internally, Fenwicks pads (or clips) the audio to 1 second, which corresponds to 16000 numbers in the 1D time-series representation, since the sampling rate is 16000 Hz. It then applies a short-time Fourier transform and maps the result onto 40 Mel frequency bands, so the height of the output image is 40. Fenwicks does this once every 160 samples, leading to a horizontal dimension of 16000/160 + 1 = 101, which is the width of the resulting image.
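For intuition, here is roughly what such a log-Mel-spectrogram computation looks like with librosa. This is a sketch, not Fenwicks’ actual implementation; n_mels=40 and hop_length=160 are assumptions inferred from the shape above.

def logmel_sketch(fn, sr=16000, n_mels=40, hop_length=160):
    y, _ = librosa.load(fn, sr=sr)
    y = librosa.util.fix_length(y, size=sr)        # pad or clip to 1 second
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                       hop_length=hop_length)
    return librosa.power_to_db(S)                  # log scale

print(logmel_sketch(example_audio_fn).shape)       # (40, 101)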

Now we do the same for all audio files, except for the ones corresponding to background noise:

# train_fn / valid_fn: output paths for the training and validation TFRecord
# files (e.g. on GCS under data_dir).
paths_train, paths_valid, y_train, y_valid, labels = (
    fw.data.data_dir_tfrecord_split(data_dir_local, train_fn, valid_fn,
                                    extractor=fw.audio_io.read_logmelspectrogram,
                                    file_ext='wav', exclude_dirs=[NOISE_DIR]))

n_classes = len(labels)
n_train, n_valid = len(y_train), len(y_valid)

The function data_dir_tfrecord_split does the same job as data_dir_tfrecord, which we saw in previous tutorials, except that it splits the dataset into two parts: a training set and a validation set. By default, the train/validation split is 80/20.
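A quick sanity check of the resulting split:

print(n_classes, n_train, n_valid)
print(f'validation fraction: {n_valid / (n_train + n_valid):.2f}')   # about 0.2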

Building a ConvNet. Now that we have turned our speech recognition problem into image classification, we can apply a familiar tool: a ConvNet. Let’s build our network, a customized VGG-style model:

def build_nn(c=16, kernel_size=(2, 5), c_dense=256, drop_rate=0.5):
    model = fw.Sequential()
    model.add(fw.layers.ConvBlk(c, convs=2, kernel_size=kernel_size))
    model.add(fw.layers.ConvBlk(c * 2, convs=2, kernel_size=kernel_size))
    model.add(fw.layers.ConvBlk(c * 4, convs=2, kernel_size=kernel_size))
    model.add(fw.layers.ConvBlk(c * 8, convs=2, kernel_size=kernel_size))
    model.add(fw.layers.GlobalPools2D())
    model.add(fw.layers.DenseBN(c_dense, drop_rate=drop_rate))
    model.add(fw.layers.DenseBN(c_dense, drop_rate=drop_rate))
    model.add(fw.layers.Classifier(n_classes))
    return model

In the above code, a ConvBlk with convs=2 contains two convolutional layers, each followed by batch normalization and a ReLU activation. The main customization here is the kernel size. Normally, for image classification, we use square kernels, typically 3x3. Each of our images, however, has dimensions 40x101, an aspect ratio of roughly 2:5. Accordingly, we set the kernel size to 2x5.
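If you prefer to see the block spelled out, here is a rough plain-Keras equivalent of one ConvBlk, under the assumption that it is a VGG-style stack of conv, batch norm, and ReLU followed by max-pooling; Fenwicks’ actual layer may differ in its details.

def conv_blk_sketch(c, kernel_size=(2, 5)):
    # Assumed structure: (Conv -> BatchNorm -> ReLU) x 2, then 2x2 max-pooling.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(c, kernel_size, padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(c, kernel_size, padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPool2D(),
    ])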

Training the model. We train and evaluate our model similarly to the last tutorial, with the Adam optimizer and a cosine learning rate schedule. First, we build the optimizer and the model function:

n_all = n_train + n_valid
# Take roughly 1/5 of the data for validation, rounded down to a multiple
# of 8 (convenient for the 8-core TPU).
n_valid = n_all // 5 // 8 * 8
n_train = n_all - n_valid

steps_per_epoch = n_train // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP)

cosine_decay = tf.train.cosine_decay_restarts
lr_func = fw.train.one_cycle_lr(LEARNING_RATE, total_steps, warmup_steps,
                                cosine_decay)
opt_func = fw.train.adam_optimizer(lr_func)
model_func = fw.train.get_clf_model_func(build_nn, opt_func)
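To see the shape of the schedule that one_cycle_lr builds, here is a minimal sketch of linear warmup followed by a single cosine decay; it is for intuition only and is not Fenwicks’ exact implementation.

import math

def warmup_cosine_lr(step, max_lr=LEARNING_RATE, total=total_steps,
                     warmup=warmup_steps):
    # Linear warmup from 0 to max_lr, then cosine decay back toward 0.
    if step < warmup:
        return max_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))

print([round(warmup_cosine_lr(s), 5)
       for s in range(0, total_steps, max(1, total_steps // 8))])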

Then the parser and TPUEstimator:

parser = lambda x: fw.data.tfexample_numpy_image_parser(x, 40, 101, 1)

est = fw.train.get_tpu_estimator(n_train // BATCH_SIZE, model_func, work_dir,
                                 trn_bs=BATCH_SIZE, val_bs=n_valid)
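The parser converts each serialized TF Example back into a (40, 101, 1) image and a label. The sketch below shows the general idea; the feature names 'image' and 'label' are my guesses, not necessarily what Fenwicks writes into its TFRecords.

def parse_example_sketch(serialized, h=40, w=101, c=1):
    features = {
        'image': tf.io.FixedLenFeature([h * w * c], tf.float32),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, features)
    x = tf.reshape(parsed['image'], [h, w, c])
    return x, parsed['label']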

Note that we can’t apply image augmentations such as flipping, cropping, and so on here, since they don’t make sense on a Mel-spectrogram. So we end up with plain input functions:

train_input_func = lambda params: fw.data.tfrecord_ds(
    train_fn, parser, params['batch_size'], training=True)
valid_input_func = lambda params: fw.data.tfrecord_ds(
    valid_fn, parser, params['batch_size'], training=False)

Let’s train the TPUEstimator and evaluate its accuracy:

est.train(train_input_func, steps=total_steps)
result = est.evaluate(valid_input_func, steps=1)

Accuracy here is about 97%. If we tune the hyperparameters (for example, to train longer), we can get over 99%.

Here’s the complete Jupyter notebook:

All tutorials:
