A Journey to Speech Recognition Using TensorFlow

Arnaud · Published in The Startup · 17 min read · Oct 28, 2020

Nowadays, we can use high-precision voice recognition on our smartphones and other smart devices. However, those systems are provided by big companies like Google, Amazon, or Apple, and are not free.

Many people, including myself, thought that this was because of a lack of free data. However, nowadays we can easily find free data on the Internet.

Last update: 12/02/2023 (Article and GitHub code)

Voice datasets

Dataset sizes for some languages (current sizes, with 2020/10/12 sizes in parentheses)

  • English → 74GB (50 GB in 2020/10/12)
  • German → 30GB (16GB in 2020/10/12)
  • French → 25GB (15GB in 2020/10/12)
  • Japanese → 2.88GB (265MB in 2020/10/12)

Some other data here:

Tools

Maybe it was because no tools were available; however, TensorFlow I/O is available and provides the necessary tools to manipulate sound.

So we have everything we need to implement a simple speech recognition system.
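For instance, here is a minimal sketch of loading an audio file with TensorFlow I/O (the file path is a placeholder, and the int16 normalization assumes 16-bit PCM audio):

import tensorflow as tf
import tensorflow_io as tfio

# Lazily open an audio file; .rate and .shape are read from the header
audio = tfio.audio.AudioIOTensor("clips/sample.wav")
print(audio.shape, audio.rate)
waveform = tf.squeeze(audio.to_tensor(), axis=-1)   # drop the channel axis
waveform = tf.cast(waveform, tf.float32) / 32768.0  # int16 -> floats in [-1, 1]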

Processing explanation (drawing)

Convert sentence to vector

Sentence to vector

Predict next word with LSTM

LSTM prediction

Predict next word based on voice input

Speech to text model

Basic knowledge

RNN

Recurrent neural networks (RNNs) are generally used to predict the next values of a sequence, such as the next temperature from the previous n temperatures. The best-known RNN variant is probably the LSTM.
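For example, here is a minimal sketch of next-value prediction with an LSTM, using a toy sine wave in place of real temperatures (all sizes are arbitrary):

import numpy as np
import tensorflow as tf

n_steps = 8
series = np.sin(np.arange(0, 100, 0.1)).astype("float32")  # toy "temperature" curve
X = np.array([series[i:i+n_steps] for i in range(len(series)-n_steps)])[..., None]
y = series[n_steps:]  # the value that follows each window

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_steps, 1)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:1]))  # predicted next value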

Seq2Seq model

Seq2Seq models are used to predict an output sequence from an input sequence. A common example is taking a sentence in English as input and predicting the corresponding sentence in French.

Speech can be seen as a sequence of sounds that we want to convert into a sequence of words, so we should be able to use a Seq2Seq model for speech recognition.
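Conceptually, a Seq2Seq model is an encoder that compresses the input sequence into state vectors, plus a decoder that generates the output sequence starting from those states. A minimal sketch of that wiring (the shapes and sizes are arbitrary placeholders, not the ones used later in this article):

import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

input_len, output_len, in_dim, out_vocab, units = 20, 10, 64, 1000, 128

# Encoder: read the input sequence and keep only its final states
encoder_inputs = Input(shape=(input_len, in_dim))
_, state_h, state_c = LSTM(units, return_state=True)(encoder_inputs)

# Decoder: generate the output sequence, initialized with the encoder states
decoder_inputs = Input(shape=(output_len, out_vocab))
decoder_outputs, _, _ = LSTM(units, return_sequences=True, return_state=True)(
    decoder_inputs, initial_state=[state_h, state_c])
outputs = Dense(out_vocab, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], outputs)
model.summary()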

Let’s code

The main aim of this article is to provide a simple base for your own development. Consequently, I will not explain every portion of the code. I will also not focus on the final precision, but simply on getting the code running, and I will try to keep the libraries used to a minimum. I hope we can discuss how to increase precision in the comment section.

Files are on Github: https://github.com/aruno14/speechRecognition

Used libraries: pip install tensorflow tensorflow_io

1) Preparation Phase: Load the data

We plan to use Mozilla’s Common Voice dataset. First, let’s simply load the data. We set a maxData variable to obtain a quick result.

import csv

maxData = 10
dataString = []
string_max_length = 0
with open('validated.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    #header = (client_id path sentence up_votes down_votes age gender accent locale segment)
    next(reader)#skip header
    for row in reader:
        if len(dataString) >= maxData:
            break
        sentence = row[2].split(" ")
        print("sentence: ", sentence)
        dataString.append(sentence)
        string_max_length = max(len(sentence), string_max_length)
print("string_max_length:", string_max_length)
print("len(dataString):", len(dataString))

File on GitHub: https://github.com/aruno14/speechRecognition/blob/main/test_load.py

2) Preparation Phase: Predict next words using LSTM

Here, we predict the next words of a sentence, for example the words after “She is”; the output could be “very clever”. Actually, I obtained “She is mozilla going to handle ambiguities like queue and cue?”.

import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import LSTM, Input, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

maxData = 10
dataString = []
string_max_length = 0
with open('validated.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    #header = (client_id path sentence up_votes down_votes age gender accent locale segment)
    next(reader)#skip header
    for row in reader:
        if len(dataString) >= maxData:
            break
        sentence = ("start " + row[2] + " end").split(" ")
        dataString.append(sentence)
        string_max_length = max(len(sentence), string_max_length)
print("string_max_length: ", string_max_length)

tokenizer = Tokenizer(num_words=2000, lower=True, oov_token="<rare>")
tokenizer.fit_on_texts(dataString)
sequences = tokenizer.texts_to_sequences(dataString)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

X, y = list(), list()
for i, seq in enumerate(sequences):
    for j in range(1, len(seq)):
        in_seq, out_seq = seq[:j], seq[j]
        in_seq = pad_sequences([in_seq], maxlen=string_max_length)[0]
        in_seq = to_categorical([in_seq], num_classes=vocab_size)[0]
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        X.append(in_seq)
        y.append(out_seq)
print('Total Sequences:', len(X))

model = Sequential()
model.add(Input(shape=(string_max_length, vocab_size)))
model.add(LSTM(units=32))
model.add(Dense(units=vocab_size, activation='softmax'))
model.summary()
tf.keras.utils.plot_model(model, to_file='model_lstm.png', show_shapes=True)

epoch = 300
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(np.array(X), np.array(y), epochs=epoch)
model.save_weights('model_lstm.h5')

metrics = history.history
plt.plot(history.epoch, metrics['loss'], metrics['accuracy'])
plt.legend(['loss', 'accuracy'])
plt.savefig("learning-lstm.png")
plt.show()
plt.close()

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

input_str = "She is"
in_text = 'start ' + input_str
print("in_text: ", in_text)
for i in range(string_max_length):
    sequence = tokenizer.texts_to_sequences([in_text])[0]
    sequence = pad_sequences([sequence], maxlen=string_max_length)
    sequence = to_categorical([sequence], num_classes=vocab_size)[0]
    pred = model.predict(sequence, verbose=0)
    pred = np.argmax(pred)
    word = word_for_id(pred, tokenizer)
    if word is None:
        break
    in_text += ' ' + word
    if word == 'end':
        break
print("out_text: ", in_text)

File on GitHub: https://github.com/aruno14/speechRecognition/blob/main/test_lstm.py

Model graph
Used Model
Fit history

3) Preparation Phase: Convert audio to single word

First, we need to convert the sound data into a format we can feed into our model. The most common way is to split the sound into a sequence of frames and then extract features from them.

Sound \a\ (x-axis:sample, y-axis:power)

What audio frame length?

The speech samples obtained from analog-to-digital conversion (ADC) are segmented into small frames with a length in the range of 20 to 40 ms.
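For example, with 16 kHz audio a 25 ms frame is 400 samples; a small sketch with tf.signal.frame (the hop size and the dummy audio are only for illustration):

import tensorflow as tf

sample_rate = 16000                      # Speech Commands sample rate
frame_length = int(sample_rate * 0.025)  # 25 ms -> 400 samples
frame_step = int(sample_rate * 0.010)    # 10 ms hop -> 160 samples

audio = tf.random.uniform([sample_rate], -1.0, 1.0)  # 1 s of dummy audio
frames = tf.signal.frame(audio, frame_length, frame_step)
print(frames.shape)  # (98, 400): 98 overlapping frames of 400 samples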

How to extract features?

The common way is to use a spectrogram.

Spectrogram of sound \a\
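A minimal spectrogram sketch with tf.signal.stft, using the same frame_length and frame_step as the training scripts below (the dummy audio is a placeholder):

import tensorflow as tf

audio = tf.random.uniform([16000], -1.0, 1.0)  # placeholder for a 1 s, 16 kHz clip
spectrogram = tf.signal.stft(audio, frame_length=512, frame_step=128)
magnitude = tf.abs(spectrogram)  # keep the magnitude only
print(magnitude.shape)           # (frames, 512//2 + 1) = (122, 257)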

Let’s try!

In this example, we use the Speech Commands dataset. The code is inspired by:

Here, there is no strong reason to use an LSTM; we only use it to get closer to our goal, which is to implement a Seq2Seq model.

Important: We need to arrange the data so that all labels appear in each batch. If we do not, fitting will simply not work.

Main point: We use the TimeDistributed layer in order to process the frames of sound as a sequence. Each frame is processed like an image by Convolution layers before being fed into the LSTM layer.

import numpy as np
import glob
import os
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import LSTM, Input, Dense, BatchNormalization, Conv2D, MaxPooling2D, Dropout, Flatten, TimeDistributed
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

dataFolder = "mini_speech_commands/"
words = [os.path.basename(x) for x in glob.glob(dataFolder + "*")]
if "_background_noise_" in words:
    words.remove("_background_noise_")

batch_size = 32
epochs = 16
block_length = 0.500#->500ms
audio_max_length = int(2/block_length)#->2s
frame_length = 512
fft_size = int(frame_length//2+1)
step_length = 0.008
split_count = 7
latent_dim = 512

def audioToTensor(filepath:str):
    audio_binary = tf.io.read_file(filepath)
    audio, audioSR = tf.audio.decode_wav(audio_binary)
    audioSR = tf.get_static_value(audioSR)
    audio = tf.squeeze(audio, axis=-1)
    frame_step = int(audioSR * step_length)#16000*0.008=128
    if len(audio) < audioSR*audio_max_length:
        audio = tf.concat([np.zeros([int(audioSR*audio_max_length)-len(audio)], dtype=np.float32), audio], 0)
    else:
        audio = audio[-int(audioSR*audio_max_length):]
    spectrogram = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step)
    spect_real = tf.math.real(spectrogram)
    spect_real = tf.abs(spect_real)
    spect_real = (tf.math.log(spect_real)/tf.math.log(tf.constant(10, dtype=tf.float32))*20)-60
    spect_real = tf.where(tf.math.is_nan(spect_real), tf.zeros_like(spect_real), spect_real)
    spect_real = tf.where(tf.math.is_inf(spect_real), tf.zeros_like(spect_real), spect_real)
    return spect_real

wordToId, idToWord = {}, {}
testParts = audioToTensor(os.path.join(dataFolder, 'go/0a9f9af7_nohash_0.wav'))
print("Test", testParts.shape)

X_audio, Y_word = [], []
for i, word in enumerate(words):
    wordToId[word], idToWord[i] = i, word#keep both mappings to decode predictions later
    for file in glob.glob(os.path.join(dataFolder, word) + '/*.wav'):
        X_audio.append(file)
        Y_word.append(np.array(to_categorical([i], num_classes=len(words))[0]))
X_audio, Y_word = np.asarray(X_audio), np.asarray(Y_word)

class MySequence(tf.keras.utils.Sequence):
    def __init__(self, x_audio, y_word, batch_size):
        self.x_audio, self.y_word = x_audio, y_word
        self.batch_size = batch_size
    def __len__(self):
        return len(self.x_audio) // self.batch_size
    def __getitem__(self, idx):
        batch_y = self.y_word[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = np.zeros((self.batch_size, testParts.shape[0], testParts.shape[1]))
        for i in range(0, self.batch_size):
            batch_x[i] = audioToTensor(self.x_audio[idx * self.batch_size + i])
        return batch_x, batch_y

X_audio, X_audio_test, Y_word, Y_word_test = train_test_split(X_audio, Y_word)
print("X_audio.shape: ", X_audio.shape)
print("Y_word.shape: ", Y_word.shape)
print("X_audio_test.shape: ", X_audio_test.shape)
print("Y_word_test.shape: ", Y_word_test.shape)

encoder_inputs = Input(shape=(testParts.shape[0], testParts.shape[1]))
normalization = BatchNormalization()(encoder_inputs)
split = tf.keras.layers.Reshape((normalization.shape[1]//split_count, -1, normalization.shape[2], 1))(normalization)
conv2d = TimeDistributed(Conv2D(34, 3, activation='relu'))(split)
conv2d = TimeDistributed(Conv2D(64, 3, activation='relu'))(conv2d)
maxpool = TimeDistributed(MaxPooling2D())(conv2d)
dropout = TimeDistributed(Dropout(0.25))(maxpool)
flatten = TimeDistributed(Flatten())(dropout)
encoder_lstm = LSTM(units=latent_dim)(flatten)
decoder_dense = Dense(len(words), activation='softmax')(encoder_lstm)
model = Model(encoder_inputs, decoder_dense)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()
tf.keras.utils.plot_model(model, to_file='model_words.png', show_shapes=True)

history = model.fit(MySequence(X_audio, Y_word, batch_size), shuffle=True, batch_size=batch_size, epochs=epochs, validation_data=MySequence(X_audio_test, Y_word_test, batch_size))
model.save("model_words")

metrics = history.history
plt.plot(history.epoch, metrics['loss'], metrics['acc'])
plt.legend(['loss', 'acc'])
plt.savefig("learning-words.png")
plt.show()
plt.close()

score = model.evaluate(MySequence(X_audio_test, Y_word_test, batch_size), verbose=0)#evaluate through the generator so audio files are loaded
print('Test loss:', score[0])
print('Test accuracy:', score[1])

print("Test voice recognition")
for test_path, test_string in [('mini_speech_commands/go/0a9f9af7_nohash_0.wav', 'go'), ('mini_speech_commands/right/0c2ca723_nohash_0.wav', 'right')]:
    print("test_string: ", test_string)
    test_audio = audioToTensor(test_path)
    result = model.predict(np.array([test_audio]))
    maxIndex = np.argmax(result)
    print("decoded_sentence: ", result, maxIndex, idToWord[maxIndex])

File on GitHub: https://github.com/aruno14/speechRecognition/blob/main/test_words.py

Used Model
Learning history

Accuracy was 0.9012 after 100 epochs.

Realtime word recognition:

I also wrote a small script to record sound from the microphone and feed it into our model. You may need to adjust the meanNoise trigger value according to your system.

Script: https://github.com/aruno14/speechRecognition/blob/main/test_voice.py
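The script on GitHub is the reference; below is only a rough sketch of the idea, assuming the sounddevice library (pip install sounddevice) and the "model_words" model trained above. The preprocessing must match audioToTensor() exactly, and the meanNoise threshold value is arbitrary.

import numpy as np
import sounddevice as sd
import tensorflow as tf

model = tf.keras.models.load_model("model_words")
n_frames = model.input_shape[1]  # spectrogram frames expected by the model
sample_rate, frame_length, frame_step = 16000, 512, 128
n_samples = (n_frames - 1) * frame_step + frame_length  # audio samples needed
meanNoise = 0.02  # arbitrary trigger threshold, adjust to your microphone

while True:  # Ctrl+C to stop
    audio = sd.rec(n_samples, samplerate=sample_rate, channels=1, dtype="float32")
    sd.wait()
    audio = tf.constant(audio[:, 0])
    if tf.reduce_mean(tf.abs(audio)) < meanNoise:
        continue  # too quiet, probably silence
    # Same spectrogram computation as in audioToTensor()
    spect = tf.abs(tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step))
    spect = (tf.math.log(spect) / tf.math.log(10.0)) * 20 - 60
    spect = tf.where(tf.math.is_finite(spect), spect, tf.zeros_like(spect))
    prediction = model.predict(spect[tf.newaxis, ...], verbose=0)
    print("predicted word index:", int(np.argmax(prediction)))  # map back with idToWord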

4) Preparation Phase: Translate text into the same language using Seq2Seq

Since we only want to check our code, we will implement a Seq2Seq model that outputs the same sentence as the input.

import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import LSTM, Embedding, Input, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

maxData = 10
dataString = []
string_max_length = 0
with open('validated.tsv') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')
    #header = (client_id path sentence up_votes down_votes age gender accent locale segment)
    next(reader)#skip header
    for row in reader:
        if len(dataString) > maxData:
            break
        sentence = ("start " + row[2] + " end").split(" ")
        print("sentence: ", sentence)
        dataString.append(sentence)
        string_max_length = max(len(sentence), string_max_length)
print("string_max_length: ", string_max_length)

tokenizer = Tokenizer(num_words=2000, lower=True, oov_token="<rare>")
tokenizer.fit_on_texts(dataString)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

X_voice, X_string, Y_string = list(), list(), list()
for i, seq in enumerate(dataString):
    seq_no_tag = seq[1:-1]
    seq = tokenizer.texts_to_sequences([seq])[0]
    seq_no_tag = tokenizer.texts_to_sequences([seq_no_tag])[0]
    seq_full_no_tag = pad_sequences([seq_no_tag], maxlen=string_max_length-2)[0]
    for j in range(1, len(seq)):
        in_seq, out_seq = seq[:j], seq[:j+1]
        in_seq = pad_sequences([in_seq], maxlen=string_max_length-1)[0]
        out_seq = pad_sequences([out_seq], maxlen=string_max_length-1)[0]
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        X_voice.append(seq_full_no_tag)
        X_string.append(in_seq)
        Y_string.append(out_seq)

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

X_voice, X_string, Y_string = np.array(X_voice), np.array(X_string), np.array(Y_string)
latent_dim = 32
print("x_voice.shape: ", X_voice.shape)
print("x_string.shape: ", X_string.shape)
print("y_string.shape: ", Y_string.shape)
num_encoder_tokens = X_voice.shape[1]
num_decoder_tokens = Y_string.shape[1]
print("num_encoder_tokens: ", num_encoder_tokens)
print("num_decoder_tokens: ", num_decoder_tokens)

# Set up the encoder
encoder_inputs = Input(shape=(num_encoder_tokens,))
enc_emb = Embedding(vocab_size, latent_dim)(encoder_inputs)
encoder_lstm = LSTM(units=latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(num_decoder_tokens,))
dec_emb_layer = Embedding(vocab_size, latent_dim)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(units=latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary(line_length=200)
tf.keras.utils.plot_model(model, to_file='model_1.png', show_shapes=True)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])
batch_size = 32
epochs = 500
model.fit([X_voice, X_string], Y_string, epochs=epochs)

# Encode the input sequence to get the "Context vectors"
encoder_model = Model(encoder_inputs, encoder_states)
#encoder_model.save_weights('model_encoder.h5')
encoder_model.summary(line_length=200)
tf.keras.utils.plot_model(encoder_model, to_file='model_encoder.png', show_shapes=True)

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_state_input = [decoder_state_input_h, decoder_state_input_c]
# Get the embeddings of the decoder sequence
dec_emb2 = dec_emb_layer(decoder_inputs)
# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_state_input)
decoder_states2 = [state_h2, state_c2]
# A dense softmax layer to generate prob dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_outputs2)
# Final decoder model
decoder_model = Model([decoder_inputs] + decoder_state_input, [decoder_outputs2] + decoder_states2)
#decoder_model.save_weights('model_decoder.h5')
decoder_model.summary(line_length=200)
tf.keras.utils.plot_model(decoder_model, to_file='model_decoder.png', show_shapes=True)

def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    decoded_sentence = "start"
    stop_condition = False
    while not stop_condition:
        print("decoded_sentence: ", decoded_sentence)
        sequence = tokenizer.texts_to_sequences([decoded_sentence.split(" ")])[0]
        sequence = pad_sequences([sequence], maxlen=string_max_length-1)
        sequence = np.array(sequence)
        output_tokens, h, c = decoder_model.predict([sequence] + states_value)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        print("sampled_token_index: ", sampled_token_index)
        sampled_word = word_for_id(sampled_token_index, tokenizer)
        if sampled_word is None:#index 0 is the padding value and has no word
            break
        decoded_sentence += ' ' + sampled_word
        if (sampled_word == 'end' or len(decoded_sentence.split(" ")) > string_max_length):
            stop_condition = True
        # Update states
        #states_value = [h, c]
    return decoded_sentence

print("Test translation")
for test_string in ['Chữ Đức chữ TÂM']:
    print("test_string: ", test_string)
    wordList = ("start "+ test_string + " end").split(" ")
    print("wordList: ", wordList)
    in_seq = tokenizer.texts_to_sequences([wordList])[0]
    in_seq = pad_sequences([in_seq], maxlen=string_max_length-2)
    sequence = np.array(in_seq)
    decoded_sentence = decode_sequence(in_seq)
    print("decoded_sentence: ", decoded_sentence)

File on GitHub: https://github.com/aruno14/speechRecognition/blob/main/test_trad.py

Training Model
Encoder Model
Decoder Model

5) Final Phase: Voice recognition using Seq2Seq

First, we convert the Common Voice dataset .mp3 files to .wav with a small bash script.

for i in *.mp3; do
    name=`echo "$i" | cut -d'.' -f1`
    echo "$name"
    ffmpeg -i "$i" "${name}.wav"
done

Let’s combine everything we have seen so far!

In order to simplify processing, we remove the non-meaningful parts of the sound (cleaning), mostly silence. We simply use:

tfio.experimental.audio.trim(audio_slice, axis=0, epsilon=0.065)
Sound \a\ after noise cleaning
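tfio.experimental.audio.trim does not return the trimmed audio directly: it returns the start and stop positions of the non-silent part, which we then use to slice the tensor. A minimal usage sketch (the clip path is only an example):

import tensorflow as tf
import tensorflow_io as tfio

audio_binary = tf.io.read_file("en/clips/common_voice_en_346569.wav")
audio, _ = tf.audio.decode_wav(audio_binary)
audio = tf.squeeze(audio, axis=-1)

position = tfio.experimental.audio.trim(audio, axis=0, epsilon=0.065)
start, stop = position[0], position[1]
audio = audio[start:stop]  # keep only the non-silent portion

Now, the full training script: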
import csv
import io
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.layers import LSTM, Embedding, Input, Dense, BatchNormalization, Conv2D, MaxPooling2D, Dropout, Flatten, TimeDistributed
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

tf.config.set_visible_devices([], 'GPU')

reuse = True
maxData = 64
max_num_words = 2000
batch_size = 32
epochs = 1
latent_dim = 512
model_name = "model_sentence"
data_folder = "en/"
clips_folder = os.path.join(data_folder, "clips")
block_length = 0.500#->500ms
frame_length = 512
voice_max_length = int(8/block_length)#->8s
print("voice_max_length:", voice_max_length)

def audioToTensor(filepath:str):
    audio_binary = tf.io.read_file(filepath)
    audio, audioSR = tf.audio.decode_wav(audio_binary)
    audioSR = tf.get_static_value(audioSR)
    audio = tf.squeeze(audio, axis=-1)
    audio_length = int(audioSR * block_length)#20-> 50ms 40 -> 25ms
    frame_step = int(audioSR * 0.010)# 128 when rate is 1600 -> 8ms
    required_length = audio_length*voice_max_length
    if len(audio) < required_length:
        audio = tf.concat([np.zeros([required_length-len(audio)], dtype=np.float32), audio], 0)
    else:
        audio = audio[-required_length:]
    spectrogram = tf.signal.stft(audio, frame_length=frame_length, frame_step=frame_step)
    spectrogram = (tf.math.log(tf.abs(tf.math.real(spectrogram)))/tf.math.log(tf.constant(10, dtype=tf.float32))*20)-60
    spectrogram = tf.where(tf.math.is_nan(spectrogram), tf.zeros_like(spectrogram), spectrogram)
    spectrogram = tf.where(tf.math.is_inf(spectrogram), tf.zeros_like(spectrogram), spectrogram)
    return spectrogram

def sampleFromFile(filepath):
    print("Load data from", filepath)
    with open(filepath) as tsvfile:
        reader = csv.reader(tsvfile, delimiter='\t')
        next(reader)#skip header
        for row in reader:
            sentence = row[2].replace(".", "")
            wordList = ("start " + sentence + " end").split(" ")
            if(len(wordList)<5):
                continue
            return row[1]+".wav"

samplePath = sampleFromFile(os.path.join(data_folder, 'train.tsv'))
testParts = audioToTensor(os.path.join(clips_folder, samplePath))
print(testParts.shape)

def loadDataFromFile(filepath):
    print("Load data from", filepath)
    dataVoice, dataString = [], []
    string_max_length = 0
    with open(filepath) as tsvfile:
        reader = csv.reader(tsvfile, delimiter='\t')
        next(reader)#skip header
        for row in reader:
            if len(dataString) > maxData:
                break
            sentence = row[2].replace(".", "")
            wordList = ("start " + sentence + " end").split(" ")
            if(len(wordList)<5):
                continue
            print(row[1], row[2], wordList)
            string_max_length = max(len(wordList), string_max_length)
            dataString.append(wordList)
            #dataVoice.append(row[1].replace(".mp3", '.wav'))
            dataVoice.append(row[1]+'.wav')
    return dataVoice, dataString, string_max_length

dataVoice, dataString, string_max_length = loadDataFromFile(os.path.join(data_folder, 'train.tsv'))
print("voice_max_length: ", voice_max_length)
print("string_max_length: ", string_max_length)

tokenizer = Tokenizer(num_words=max_num_words, lower=True, oov_token="<rare>")
tokenizer.fit_on_texts(dataString)
with io.open('tokenizer.txt', 'w', encoding='utf-8') as f:
    for word, index in tokenizer.word_index.items():
        f.write(word + ":" + str(index) + "\n")
vocab_size = min(len(tokenizer.word_index) + 1, max_num_words)
print('Vocabulary Size: %d' % vocab_size)

def prepareData(dataString, dataVoice):
    X_voice, X_string, Y_string = list(), list(), list()
    all_seq = tokenizer.texts_to_sequences(dataString)
    for i, seq in enumerate(all_seq):
        voice = dataVoice[i]
        for j in range(1, len(seq)):
            in_seq, out_seq = seq[:j], [seq[j]]
            in_seq = pad_sequences([in_seq], maxlen=string_max_length)[0]
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            X_voice.append(voice)
            X_string.append(in_seq)
            Y_string.append(out_seq)
    return X_voice, X_string, Y_string

X_voice, X_string, Y_string = prepareData(dataString, dataVoice)
print("len(X_voice): ", len(X_voice))

class MySequence(tf.keras.utils.Sequence):
    def __init__(self, x_voice, x_string, y_string, batch_size):
        self.x_voice, self.x_string, self.y_string = x_voice, x_string, y_string
        self.batch_size = batch_size
    def __len__(self):
        return len(self.x_voice) // self.batch_size
    def __getitem__(self, idx):
        batch_x_string = self.x_string[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y_string = self.y_string[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x_voice = np.zeros((self.batch_size, testParts.shape[0], testParts.shape[1]))
        for i in range(0, self.batch_size):
            batch_x_voice[i] = audioToTensor(os.path.join(clips_folder, self.x_voice[idx * self.batch_size + i]))
        batch_x_string = np.array(batch_x_string)
        batch_y_string = np.array(batch_y_string)
        return [batch_x_voice, batch_x_string], batch_y_string

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

if os.path.exists(model_name) and reuse:
    print("Load: " + model_name)
    model = load_model(model_name)
else:
    encoder_inputs = Input(shape=(testParts.shape[0], testParts.shape[1]))
    encoder_expanded = tf.expand_dims(encoder_inputs, axis=-1)#keep the Input layer itself for Model()
    resized = preprocessing.Resizing(400, testParts.shape[1]//2)(encoder_expanded)
    normalization = BatchNormalization()(resized)
    split = tf.keras.layers.Reshape((voice_max_length, -1, normalization.shape[2], normalization.shape[3]))(normalization)
    conv2d = TimeDistributed(Conv2D(34, 3, activation='relu'))(split)
    conv2d = TimeDistributed(Conv2D(64, 3, activation='relu'))(conv2d)
    maxpool = TimeDistributed(MaxPooling2D())(conv2d)
    dropout = TimeDistributed(Dropout(0.25))(maxpool)
    flatten = TimeDistributed(Flatten())(dropout)
    encoder_lstm = LSTM(units=latent_dim, return_state=True)
    encoder_outputs, state_h, state_c = encoder_lstm(flatten)
    encoder_states = [state_h, state_c]
    decoder_inputs = Input(shape=(string_max_length,))
    dec_emb = Embedding(vocab_size, latent_dim)(decoder_inputs)
    decoder_outputs = LSTM(units=latent_dim)(dec_emb, initial_state=encoder_states)
    decoder_outputs = Dense(vocab_size, activation='softmax')(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    tf.keras.utils.plot_model(model, to_file='model_sentence.png', show_shapes=True)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

model.fit(MySequence(X_voice, X_string, Y_string, batch_size), epochs=epochs, batch_size=batch_size)
model.save(model_name)

def decode_sequence(input_seq):
    decoded_sentence = tokenizer.texts_to_sequences(["start"])[0]
    while len(decoded_sentence) < string_max_length:
        sequence = pad_sequences([decoded_sentence], maxlen=string_max_length)
        output_tokens = model.predict([input_seq, sequence], verbose=0)
        sampled_token_index = np.argmax(output_tokens[0])
        decoded_sentence.append(sampled_token_index)
    return tokenizer.sequences_to_texts([decoded_sentence])[0]

print("Test voice recognition")
for test_path, test_string in [('common_voice_en_346569.wav', "Do you want me?"), ('common_voice_en_12677.wav', 'Man in red tshirt and baseball cap viewed from above he is has a pile of posters'), ('common_voice_en_590694.wav', 'Touchscreens do not provide haptic feedback')]:
    print("test_string: ", test_string)
    test_voice = audioToTensor(os.path.join(clips_folder, test_path))
    print(np.array([test_voice]).shape)
    decoded_sentence = decode_sequence(np.array([test_voice]))
    print("decoded_sentence: ", decoded_sentence)

File on GitHub: https://github.com/aruno14/speechRecognition/blob/main/sentence.py

Used model

Training took too long to finish on my laptop, but precision seems very bad. I hope you can use this sample as a base and improve it to obtain a satisfying result :)

Convert to TensorFlow lite

To use the model on embedded devices or on Android, we can convert it to TensorFlow Lite with the command below:

~/.local/bin/tflite_convert --saved_model_dir=model_reco/ --output_file=model.tflite
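The same conversion can also be done from Python; a short sketch (any SavedModel directory works, model_reco/ here simply mirrors the command above, and the quantization line is optional):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("model_reco/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional post-training quantization
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)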

Convert to TensorFlow.js

We can also convert it to TensorFlow.js to use it in the browser:

~/.local/bin/tensorflowjs_converter model_reco/ quantized_model/ \
--input_format tf_saved_model --output_format tfjs_graph_model --quantize_float16

Use our model in browser

First, we need a web server to serve our files; let’s do it in a single line:

python3 -m http.server 3000

Then, the HTML page. We are lucky: FFT is natively implemented in the browser with the Web Audio API.

However, its output is in decibels, while the TensorFlow output is linear:

inline float ConvertLinearToDecibels(float aLinearValue, float aMinDecibels) {
    return aLinearValue ? 20.0f * std::log10(aLinearValue) : aMinDecibels;
}

We update our learning script to match it. I don’t know why, but I had to shift by -60 decibels to match the output of my browser; it may differ in your case.
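In other words, the AnalyserNode already gives 20·log10(x) decibels, so the training script applies the same transform plus the empirical -60 dB shift. A quick sanity check in Python:

import numpy as np

def linear_to_decibels(x, min_db=-100.0):
    # mirrors the Web Audio ConvertLinearToDecibels above
    return 20.0 * np.log10(x) if x > 0 else min_db

x = 0.5
print(linear_to_decibels(x))   # ~ -6.02 dB, what the browser reports
print(20 * np.log10(x) - 60)   # the shifted value used when training the model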

The code below uses the model created in “Preparation Phase: Convert audio to single word”.

<html>
<head>
<meta charset="utf-8">
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js" integrity="sha384-DfXdz2htPH0lsSSs5nCTpuj/zy4C+OGpamoFVy38MVBnE+IbbVYUew+OrCXaRkfj" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js"></script>
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/css/bootstrap.min.css" integrity="sha384-9aIt2nRpC12Uk9gS9baDl411NQApFmC26EwAOH8WgZl5MYYxFfc+NcPb1dKGj7Sk" crossorigin="anonymous">
<title>Speech Recognition in browser</title>
<script>
const WIDTH=800;
const HEIGHT=100;
const enDic = ['down', 'go', 'left', 'no', 'right', 'stop', 'up', 'yes'];
let spectroOffset = 0;
let modelEn = null;
let captureStream = null;
let source = null;
let analyser = null;
let audioCtx = null;
let stop = true;
const blockSize = 6;
const maxLength = 39*6;
let silenceCount = 0;
let buffer = [];
let interval = null;
let timeFromLastReco = 100;
function fillBuffer()
{
buffer=[];
for(i=0;i<150;i++)
{
buffer.push(new Array(513).fill(0));
}
}
async function loadModel()
{
modelEn = await tf.loadGraphModel('http://localhost:3000/quantized_model_en/model.json');
}
async function predictWord()
{
console.log("buffer:", buffer);
let newBufferEn = [];
for(i=buffer.length;i>0;)
{
let newBlock = [];
for(j=0;j<blockSize;j++)
{
if(buffer[i-1-j])
{
newBlock.unshift(buffer[i-1-j]);
}
else {
newBlock.unshift(new Array(513).fill(0));
}
}
if(newBufferEn.length>=57) {break;}
newBufferEn.unshift(newBlock);
i-=blockSize/2;
}
let tensorEn = tf.tensor(newBufferEn).expandDims(-1);
tensorEn = tf.where(tf.isInf(tensorEn), tf.zerosLike(tensorEn), tensorEn);
let predictionEn = await modelEn.executeAsync(tensorEn.expandDims(0));
predictionEn = predictionEn.squeeze();
const predictionArrayEn = await predictionEn.array();
const wordEn = await predictionEn.argMax(-1).array();
console.log(predictionArrayEn, wordEn, predictionArrayEn[wordEn], enDic[wordEn]);
if(predictionArrayEn[wordEn]>0.6)
{
$('#recognitionResult').append($('<p></p>').html("en: " + enDic[wordEn] + " " + predictionArrayEn[wordEn]));
}
return wordEn;
}
function catchData()
{
const spectrum = new Float32Array(analyser.frequencyBinCount);
analyser.getFloatFrequencyData(spectrum);
let arrayData = Array.from(spectrum);
timeFromLastReco++;
const volume = (arrayData[0] + arrayData[1] + arrayData[2] + arrayData[3])/4;
if(volume < -60)
{
//console.log("Silence skip:", volume);
silenceCount++;
if(silenceCount>100)
{
fillBuffer();
}
return;
}
silenceCount=0;
arrayData.push(arrayData[arrayData.length-1]);
buffer.push(arrayData);
if(buffer.length>(maxLength*blockSize/2-1)/4&&(timeFromLastReco>10))
{
timeFromLastReco=0;
buffer.shift();
predictWord();
}
}
function catchDataFile()
{
const spectrum = new Float32Array(analyser.frequencyBinCount);
analyser.getFloatFrequencyData(spectrum);
let arrayData = Array.from(spectrum);
const volume = (arrayData[0] + arrayData[1] + arrayData[2] + arrayData[3])/4;
arrayData.push(arrayData[arrayData.length-1]);
buffer.push(arrayData);
}
function catchDataRec()
{
interval = setInterval(function(){ catchData() }, 8);
}
function catchDataRecFile()
{
interval = setInterval(function(){ catchDataFile() }, 8);
}
async function drawSignal()
{
if(stop) return;
let bufferLength = analyser.frequencyBinCount;
let dataArray = new Uint8Array(bufferLength);
let canvas = document.getElementById('oscilo');
let canvasCtx = canvas.getContext('2d');
let drawVisual = requestAnimationFrame(drawSignal);
analyser.getByteTimeDomainData(dataArray);
canvasCtx.fillStyle = 'rgb(200, 200, 200)';
canvasCtx.fillRect(0, 0, WIDTH, HEIGHT);
canvasCtx.lineWidth = 2;
canvasCtx.strokeStyle = 'rgb(0, 0, 0)';
canvasCtx.beginPath();
let sliceWidth = WIDTH * 1.0 / bufferLength;
let x = 0;
for(let i = 0; i < bufferLength; i++) {
let v = dataArray[i] / 128.0;
let y = v * HEIGHT/2;
if(i === 0) {
canvasCtx.moveTo(x, y);
} else {
canvasCtx.lineTo(x, y);
}
x += sliceWidth;
}
canvasCtx.lineTo(canvas.width, canvas.height/2);
canvasCtx.stroke();
}
async function drawSpect()
{
if(stop) return;
const spectrum = new Uint8Array(analyser.frequencyBinCount);
analyser.getByteFrequencyData(spectrum);
const spectroCanvas = document.getElementById('spectrogram');
const spectroContext = spectroCanvas.getContext('2d');
requestAnimationFrame(drawSpect);
const slice = spectroContext.getImageData(0, spectroOffset, spectroCanvas.width, 1);
for (let i = 0; i < spectrum.length; i++) {
slice.data[4 * i + 0] = spectrum[i] // R
slice.data[4 * i + 1] = spectrum[i] // G
slice.data[4 * i + 2] = spectrum[i] // B
slice.data[4 * i + 3] = 255 // A
}
spectroContext.putImageData(slice, 0, spectroOffset);
spectroOffset += 1;
spectroOffset %= spectroCanvas.height;
}
async function startAudioCapture()
{
console.log("startAudioCapture");
stop = false;
let constraints = {audio: {channelCount:1, echoCancellation:true, noiseSuppression:true, sampleRate:16000}, video: false};
try {
captureStream = await navigator.mediaDevices.getUserMedia(constraints);
console.log("captureStream: ", captureStream);
console.log("getAudioTracks: ", captureStream.getAudioTracks());
const audio_track = captureStream.getAudioTracks()[0];
console.log("Use mic: ", audio_track.label);
audioCtx = new (window.AudioContext || window.webkitAudioContext)();
console.log("audioCtx.sampleRate:", audioCtx.sampleRate);
analyser = audioCtx.createAnalyser();
source = audioCtx.createMediaStreamSource(captureStream);
source.connect(analyser);
analyser.fftSize = 1024;
analyser.smoothingTimeConstant = 0;
console.log("frequencyBinCount: ", analyser.frequencyBinCount)
drawSignal();
drawSpect();
catchDataRec();
} catch(err) {
console.error("Error: " + err);
}
}
$(document).ready(function() {
loadModel();
fillBuffer();
$("#startAudioCapture").click(function() {
startAudioCapture();
});
$(".fileToWord").click(function() {
const filename=$(this).attr('file');
const sr = $(this).attr('sr');
fillBuffer();
stop=false;
audioCtx = new (window.AudioContext || window.webkitAudioContext)({"sampleRate":sr});//16000}); 44100
source = audioCtx.createBufferSource();
analyser = audioCtx.createAnalyser();
source.connect(analyser);
source.onended = function(){
predictWord();
clearInterval(interval);
source.stop(0);
stop=true;
};
analyser.fftSize = 1024;
analyser.smoothingTimeConstant = 0;
console.log("frequencyBinCount: ", analyser.frequencyBinCount);
let request = new XMLHttpRequest();
request.open('GET', filename, true);
request.responseType = 'arraybuffer';
request.onload = function() {
let audioData = request.response;
audioCtx.decodeAudioData(audioData, function(buffer) {
source.buffer = buffer;
source.connect(audioCtx.destination);
source.loop = false;
console.log(audioCtx.sampleRate);
const spectrum = new Uint8Array(analyser.frequencyBinCount);
analyser.getByteFrequencyData(spectrum);
const spectroCanvas = document.getElementById('spectrogram');
console.log(spectrum.length)
spectroCanvas.width = spectrum.length;
spectroCanvas.height = 200;
source.start(0);
catchDataRecFile();
drawSignal();
drawSpect();
},
function(e){"Error with decoding audio data" + e.err});
}
request.send();
});
$("#stopAudioCapture").click(function() {
console.log("stopAudioCapture");
stop=true;
clearInterval(interval);
if(audioCtx!=null) audioCtx.close();
if(captureStream!=null) captureStream.getAudioTracks().forEach(function(track) {if (track.readyState == 'live') {track.stop();}});
});
});
</script>
</head>
<body>
<div class="container">
<div class="jumbotron">
<h1>Speech Recognition in browser</h1>
<div class="align-center">
<a id="startAudioCapture" class="btn btn-primary">Share mic</a>
<a id="stopAudioCapture" class="btn btn-primary">Stop mic</a>
<a class="fileToWord btn btn-primary" file="clips/00f0204f_nohash_0.wav" sr="16000">File to word (down)</a>
<a class="fileToWord btn btn-primary" file="clips/0b40aa8e_nohash_0.wav" sr="16000">File to word (yes)</a>
<a class="fileToWord btn btn-primary" file="clips/1c6e5447_nohash_0.wav" sr="16000">File to word (no)</a>
<a class="fileToWord btn btn-primary" file="clips/2a89ad5c_nohash_0.wav" sr="16000">File to word (left)</a>
</div>
<hr>
<canvas id="oscilo" width="800" height="100"></canvas>
<canvas id="spectrogram" width="1024" height="400"></canvas>
<h2>Recognition results</h2>
<p id="recognitionResult"></p>
</div>
</div>
</body>
</html>

File on GitHub: https://github.com/aruno14/speechRecognition/blob/main/html/index.html

HTML page result

Next Journey: Accuracy improvement

Finally, to create a usable system, we need to improve accuracy.

When I searched for information about voice feature extraction, Wikipedia told me:

MFCCs are commonly used as features in speech recognition[6] systems, such as the systems which can automatically recognize numbers spoken into a telephone.

We could implement it ourselves.

However, TensorFlow I/O has functions for the Mel-frequency cepstrum.

Mel graph of sound \a\
Decibels Mel graph of sound \a\

More details: https://aruno14.medium.com/comparaison-of-audio-representation-in-tensorflow-b6c33a83d77f
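A sketch of the TensorFlow I/O mel pipeline (the parameter values are only illustrative; MFCCs can then be derived with tf.signal if needed):

import tensorflow as tf
import tensorflow_io as tfio

audio = tf.random.uniform([16000], -1.0, 1.0)  # placeholder for a 1 s, 16 kHz clip

spectrogram = tfio.experimental.audio.spectrogram(audio, nfft=512, window=512, stride=128)
mel = tfio.experimental.audio.melscale(spectrogram, rate=16000, mels=128, fmin=0, fmax=8000)
db_mel = tfio.experimental.audio.dbscale(mel, top_db=80)
mfcc = tf.signal.mfccs_from_log_mel_spectrograms(tf.math.log(mel + 1e-6))[..., :13]
print(db_mel.shape, mfcc.shape)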

Nevertheless, there are no such functions in the Web Audio API; consequently, we would need to implement them ourselves in JavaScript to use them in the browser.

More explanation about FFT

Comment: On the Internet, I found that the frequency range of human speech varies between 80 and 260 Hz (up to 450 Hz for children). However, when I experimented myself using Audacity, I felt that frequencies up to ~1 kHz are important to hear a clear sound.

In order to understand how the system works, we need some knowledge about spectrograms and the FFT (Fast Fourier Transform).

The Mozilla Common Voice dataset sampling rate is 48000 Hz, so we have 48000 values per second. Let’s take a frame of 25 ms or 40 ms.

1s/40 = 25ms or 1s/25 = 40ms

48000/40 = 1200 samples or 48000/25 = 1920 samples

44100/40 = 1102.5 samples or 44100/25 = 1764 samples

Referring to the first link above, the lowest detectable frequency is F0 = 5 × (SamplingRate / WindowSize). 5 seems like a very big coefficient; the theoretical coefficient is 2.

F0 = 2*(48000/1200) = 80Hz -> Not bad

F0 = 2*(48000/1920) = 50Hz -> Too precise

F0 = 2*(48000/1024) = 93Hz -> Not bad

F0 = 2*(44100/1024) = 86Hz -> Not bad

1024 samples seem enough for 48000 Hz and 44100 Hz files. Since a power-of-two value is required, the next one, 2048, would be too precise for our purpose.

The bins or strides (the horizontal lines of the graph) represent the frequency resolution of the resulting spectrogram: a 1024-sample window gives 512 bins with a resolution of 46.87 Hz, and the maximum detected frequency is 23997 Hz. In order to optimize the system, we could remove unnecessary high-frequency values.
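The arithmetic above, written out as a quick check (assuming a 48 kHz sampling rate and a 1024-sample window):

sampling_rate = 48000
window_size = 1024

lowest_frequency = 2 * sampling_rate / window_size  # ~93.75 Hz, lowest detectable F0
bin_count = window_size // 2                        # 512 bins
bin_resolution = sampling_rate / window_size        # ~46.875 Hz per bin
max_frequency = bin_count * bin_resolution          # ~24 kHz (the Nyquist frequency)
print(lowest_frequency, bin_count, bin_resolution, max_frequency)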
