Deep Speech: Train Native Languages with Transfer Learning Part #0b01

Loghi · Published in Analytics Vidhya · Oct 17, 2020

Introduction

Deep Speech is an open-source Speech-To-Text engine. Project Deep Speech uses TensorFlow to make the implementation easier.

Deep Speech is composed of two main subsystems:

  1. Acoustic model: a deep neural network that receives audio features as inputs and outputs character probabilities.
  2. Decoder: uses a beam search algorithm to transform the character probabilities into textual transcripts, which are then returned by the system.
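The set of characters the acoustic model can emit is defined by an alphabet file (the alphabet.txt referenced in the training command later on). As a rough sketch, assuming the usual one-character-per-line layout with the space character first, the following builds a toy four-character alphabet (alphabet_demo.txt is a made-up file name; a real file lists every character of the target script):

```shell
# Assumed alphabet.txt layout: one output character per line,
# space character first. A real file lists every character the
# model may emit (for Tamil, the full Tamil script).
printf ' \na\nb\nc\n' > alphabet_demo.txt
wc -l < alphabet_demo.txt
```

In a Colab cell, prefix each line with `!` as in the commands below.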

Transfer learning is the reuse of a pre-trained model on a new problem. It is currently very popular in deep learning because it can train a deep neural network with comparatively little data. This is very useful in data science, since most real-world problems do not have the millions of labeled data points needed to train such complex models.

Most native languages lack the resources needed to train a neural network from scratch. Transfer learning makes it possible to build your own model from a small speech-to-text corpus.

Benchmarks

English and Mandarin (along with some European languages) are the flagship examples of Deep Speech ASR models. They show that completely different linguistic features can be learned by the same network, which can therefore be adapted to new languages easily. Support for several more languages is in progress.

Steps

  1. Clone the Deep Speech repository from https://github.com/mozilla/DeepSpeech
  2. Prepare a speech and transcript corpus from https://commonvoice.mozilla.org/en/datasets
  3. Build a language model using KenLM
  4. Get the relevant pre-trained English model from https://github.com/mozilla/DeepSpeech/releases
  5. Train while freezing layers of the pre-trained model

*Note: Building the language model is the time-consuming part of this approach. Depending on the response, I will write a separate article on building a custom language model for training Deep Speech.

Prepare workspace in Colab

Download and prepare Common Voice Data

  • Use the wget command to download the compressed file, then extract it.

! wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-4-2019-12-10/ta.tar.gz
! tar -xzvf ta.tar.gz

  • Extract and prepare voice data

! python3 /content/DeepSpeech/bin/import_cv2.py /content/DeepSpeech/data/ta
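The importer converts the Common Voice release into the CSV format DeepSpeech trains on, writing train.csv, dev.csv, and test.csv next to the clips. A sketch of the expected three-column layout, with a made-up sample row and directory name:

```shell
# Sketch of the CSV layout the importer produces (sample row is made up).
mkdir -p clips_demo
{
  echo 'wav_filename,wav_filesize,transcript'
  echo 'common_voice_ta_0000001.wav,179882,வணக்கம்'
} > clips_demo/train.csv
head -n 1 clips_demo/train.csv
```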

Install KenLM

  • Build KenLM binaries

! git clone https://github.com/kpu/kenlm.git
! mkdir -p kenlm/build
! cd kenlm/build && cmake .. && make -j 4

  • Prepare your text corpus
  • Build your language model

! python3 /content/DeepSpeech/data/lm/generate_local_lm.py
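The text corpus fed to the script above should be plain text, conventionally one sentence per line and matching the alphabet (lower-cased, punctuation stripped). A minimal cleanup sketch under those assumptions (file names are made up; adapt the filters to your language's script):

```shell
# Toy corpus cleanup: lower-case and strip punctuation so the text
# matches the alphabet (assumed conventions; file names are made up).
printf 'Vanakkam, World!\nDeep Speech rocks.\n' > corpus_raw.txt
tr '[:upper:]' '[:lower:]' < corpus_raw.txt | tr -d '[:punct:]' > corpus.txt
cat corpus.txt
```

If you build the language model by hand instead of with the helper script, KenLM's lmplz and build_binary tools turn such a corpus into lm.arpa and lm.binary.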

  • Generate your scorer from Language model

! python3 /content/DeepSpeech/data/lm/generate_package.py --lm /content/DeepSpeech/data/lm/lm.binary --vocab /content/DeepSpeech/data/alphabet.txt --default_alpha 0.75 --default_beta 1.85 --package /content/DeepSpeech/data/lm/kenlm_tamil.scorer

Download pre-trained English model

  • Get pre-trained model

! curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.0/deepspeech-0.8.0-checkpoint.tar.gz

  • Extract the archive

! tar xvf deepspeech-0.8.0-checkpoint.tar.gz

Train your model

  • Freeze 2 layers and train

! python3 /content/DeepSpeech/DeepSpeech.py --drop_source_layers 2 --train_cudnn 1 --load_checkpoint_dir /content/20epochs --epochs 10 --alphabet_config_path=/content/DeepSpeech/data/alphabet.txt --scorer /content/DeepSpeech/data/lm/kenlm_tamil.scorer --export_dir /content/DeepSpeech/model --utf8 1 --train_files /content/DeepSpeech/data/ta/clips/train.csv --dev_files /content/DeepSpeech/data/ta/clips/dev.csv --one_shot_infer '/content/decode_data/Tamil_Dataset'

Conclusion

In brief, we can train our own model using transfer learning with a small amount of data. In industry settings, this approach tends to be more practical than unsupervised learning techniques.

For further clarifications and notebooks, feel free to drop a question or request.

✌️ Peace!
