Use of Transformers Hugging Face — Automatic Speech Recognition (ASR) with Poetry and Python

Alexander Guillin
Published in Allient
Mar 27, 2023 · 6 min read

In an increasingly technology-driven world, Automatic Speech Recognition has become a vital tool in a wide range of applications, from virtual assistants like Siri and Alexa to call transcriptions and video subtitling.

Automatic Speech Recognition uses algorithms and deep learning models to transcribe human speech into text. These models are trained with large datasets of speech to learn how to identify patterns and features of human speech and map them to the corresponding text transcriptions.

What is Poetry?

Poetry is a dependency management tool for Python. It lets you manage the dependencies, packages, and libraries of a project. The file in charge of managing the dependencies is called pyproject.toml. This file replaces setup.py, requirements.txt, setup.cfg, MANIFEST.in, and Pipfile.

Advantages of Poetry:

  • Keeps dependency versions compatible with each other and with the project.
  • Makes it easy to add dependencies to a project.
  • Simple file structure.

Poetry installation:

In Windows PowerShell (this uses the older get-poetry.py installer; recent Poetry versions provide a newer script at install.python-poetry.org):

(Invoke-WebRequest -Uri https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py -UseBasicParsing).Content | python -

Using pip

pip install poetry

To verify that Poetry is installed:

poetry --version

Activating virtual environment

poetry shell

To deactivate the virtual environment you can use the command:

exit

Starting a new project

poetry new project-name

The structure of the project is like this:

  • pyproject.toml: The most important file; it contains project metadata and other information such as project dependencies and developer dependencies with their versions. As we add new dependencies to the project, they are recorded here, so the file rarely needs to be edited by hand.
  • README.rst: A file that the project maintainer can edit; it should describe what the project is and how to use it.
  • tests: This folder contains unit tests for the different units/functions of the program.
  • The project folder: a subfolder with the same name as the main directory, containing the program's code files.
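As an illustration, a freshly generated pyproject.toml looks roughly like this (the exact contents vary with the Poetry version; the name and author values below are placeholders):

```toml
[tool.poetry]
name = "project-name"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

Running poetry add later appends entries under [tool.poetry.dependencies] automatically.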

Adding new dependencies

poetry add new-dependency1 new-dependency2 …

Installing dependencies

poetry install

Running python scripts

poetry run python project-name/main.py

What is automatic speech recognition?

Automatic speech recognition (ASR) is a technology that automatically transcribes the human voice into written text.

ASR is used in many applications, such as voice dictation systems, voice control of electronic devices, and speech transcription, among others. It is also an important technology in natural language processing and artificial intelligence research.

Figure 1. Automatic Speech Recognition (ASR)

What are Transformers — Hugging Face?

The Transformers library provides APIs to easily download and train state-of-the-art pre-trained models. Using pre-trained models can reduce computational costs and model training times. The models can be used in different modalities, such as:

  • Text: text classification, information extraction, question answering, summarization, translation, and text generation in more than 100 languages.
  • Images: image classification, object detection, and segmentation.
  • Audio: speech recognition and audio classification.
  • Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.
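Beyond the task-specific classes used later in this post, Transformers also exposes a high-level pipeline API that wraps model loading, preprocessing, and decoding in one call. A minimal sketch for ASR, reusing the same facebook/wav2vec2-base-960h checkpoint that appears below (the import is done lazily so the model download is deferred until the function is actually called):

```python
def transcribe(audio_path: str) -> str:
    """Transcribe an audio file with the Hugging Face pipeline API."""
    # Lazy import: the model is only downloaded when transcribe() is called
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition",
                   model="facebook/wav2vec2-base-960h")
    return asr(audio_path)["text"]
```

Calling transcribe("data/noise-free-audio.wav") would return the transcription as a string.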

Create a Project with Poetry and load automatic speech recognition models

This post shows how to develop a project with Poetry and Python and load different automatic speech recognition models.

Step 1: Starting a new project with Poetry.

poetry new projectTransformers

This will create a new project with the name “projectTransformers” and you will have a different directory structure within it.

Figure 2. Directory structure

Step 2: Activate virtual environment.

poetry shell
Figure 3. Virtual Environment

Step 3: Add required dependencies.

poetry add transformers soundfile librosa huggingsound

This will add the necessary dependencies to develop the related project.

Step 4: Define the function in charge of loading audio.

Inside the inner project folder, create the config and data folders. Then, inside the config folder, create the definitions.py file:

import os
ROOT_DIR = os.path.realpath(os.path.join(os.path.dirname(__file__), '..'))

Within another file (model-ASR.py), the following libraries are imported and the load_audio(name_audio: str) function is added:

import soundfile as sf
import librosa
import torch
import os
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from huggingsound import SpeechRecognitionModel
from transformers import WhisperProcessor, WhisperForConditionalGeneration, HubertForCTC
from config.definitions import ROOT_DIR


def load_audio(name_audio: str):
    # Get the path of the audio file to be transcribed
    file_path: str = os.path.join(ROOT_DIR, 'data', name_audio)

    # librosa returns a NumPy array resampled to 16 kHz
    input_audio, _ = librosa.load(file_path, sr=16000)
    return input_audio

The code inside the definitions.py file makes it possible to obtain the real path of project files regardless of the operating system. The load_audio(name_audio: str) function then builds the full path of the audio file to be loaded. It uses the librosa library to load the audio file at that path with a sample rate (sr) of 16,000 Hz, and the result is stored in the input_audio variable.
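To make the path handling concrete, here is a small standalone sketch of what ROOT_DIR resolves to (the directory below is a hypothetical example location of the config folder):

```python
import os

# Hypothetical location of the config folder inside the project
config_dir = "/home/user/projecttransformers/projecttransformers/config"

# definitions.py computes the parent of the config folder as the project root
root_dir = os.path.realpath(os.path.join(config_dir, ".."))

# load_audio() then joins the root with the data folder and the file name
audio_path = os.path.join(root_dir, "data", "noise-free-audio.wav")
print(audio_path)
# /home/user/projecttransformers/projecttransformers/data/noise-free-audio.wav
```

Because os.path.join and os.path.realpath handle separators, the same code works on Windows, macOS, and Linux.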

Step 5: Load Facebook’s pre-trained Wav2Vec2 speech recognition model to transcribe an audio file.

First, the Wav2Vec2 model and associated tokenizer are loaded via the Transformers library.

tokenizer = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Then you need to use the processor (here named tokenizer) to prepare the model's input data, turning the raw audio into a proper input tensor for the model.

input_values = tokenizer(input_audio, return_tensors="pt").input_values

The model is then used to make an inference about the input data. The logits (raw values before applying the softmax function) are calculated from the input data using the model.

logits = model(input_values).logits

The PyTorch function argmax is then used to get the most likely token id for each time frame from the logits. These ids are decoded into a text transcription using the tokenizer.

predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.batch_decode(predicted_ids)[0]

Finally, the function in full form would be:

def model1() -> str:
    tokenizer = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # input_audio is the module-level variable set by load_audio()
    input_values = tokenizer(input_audio, return_tensors="pt").input_values
    logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.batch_decode(predicted_ids)[0]

    return transcription

To see how to load other automatic speech recognition models, you can visit the Hugging Face Models page.

Step 6: Load audio with noise and without noise and see the results.

To load the different audio files, the load_audio() function is used, and the model1() function is called to run the corresponding model.

# Load noise-free-audio.wav
input_audio = load_audio('noise-free-audio.wav')
model1_transcription: str = model1()
print(f"\nTranscription 1 with facebook/wav2vec2-base-960h: {model1_transcription}")

# Load audio-with-noise.wav
input_audio = load_audio('audio-with-noise.wav')
model1_transcription = model1()
print(f"\nTranscription 2 with facebook/wav2vec2-base-960h: {model1_transcription}")

Step 7: Run the model-ASR.py file.

To run model-ASR.py, first change into the folder that contains it. Then run the command:

poetry run python model-ASR.py
Figure 4. Project execution

An example of the project with other models uploaded to Github can be obtained here:

AlexanderG1999/Transformers-HuggingFace-ASR (github.com)

Note from author

That’s all, folks. Thank you for reading this article; I hope you found it useful.

You can also follow 👍 me on my LinkedIn account 😄

Do you need some help?

We are ready to listen to you. If you need some help creating the next big thing, you can contact our team on our website or at info@jrtec.io
