Simplify Your Speech Recognition Workflow with SparkNLP

Using SparkNLP’s state-of-the-art Speech Recognition Models

Abdullah Mubeen
spark-nlp
May 29, 2024

--

Natural Language Processing is an exciting technology with daily breakthroughs and limitless possibilities for how we express ourselves. Speech recognition, a key area of NLP, adds layers of complexity. When transcribing spoken language, we encounter a myriad of challenges: accents and dialects, background noise, varied speaking speeds, and even homophones that sound alike but have different meanings. So, how do we address these challenges and ensure accurate speech-to-text conversion?

Enter SparkNLP, an open-source library that simplifies the process and integrates seamlessly with Apache Spark. In this article, we’ll explore how SparkNLP can streamline your speech recognition workflow, making it easier and more efficient than ever before.

📌 Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the capability to automatically understand audio inputs and transcribe them to text.

📌 This allows applications in many fields such as automatic caption generation in videos, transcribing business meetings, helping people with typing messages through voice, and much more.

We currently support three types of models: Whisper, Wav2Vec2, and HuBERT. All three are end-to-end implementations of ASR: Wav2Vec2 and HuBERT encode the audio input and transcribe it using a Connectionist Temporal Classification (CTC) decoder, while Whisper uses an encoder-decoder transformer architecture.
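To build intuition for what a CTC decoder does, here is a minimal, illustrative sketch (not Spark NLP's implementation) of greedy CTC decoding: the model emits one token per audio frame, and decoding collapses consecutive repeats and drops the special blank token. The token IDs below are arbitrary example values.

```python
def ctc_greedy_collapse(token_ids, blank=0):
    """Greedy CTC decoding: drop blanks and merge consecutive repeats."""
    out = []
    prev = None
    for t in token_ids:
        # keep a token only if it differs from the previous frame and is not blank
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Example frame-level output; 0 is the CTC blank.
# Note the blank between the two 12s preserves a genuine double letter.
frames = [8, 8, 0, 5, 5, 12, 0, 12, 15]
print(ctc_greedy_collapse(frames))  # [8, 5, 12, 12, 15]
```

The blank token is what lets CTC distinguish a genuinely repeated character (as in "hello") from one character stretched across several audio frames.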

To find all pre-trained models and pipelines, you can follow these links:

Now let's get into it!

All you have to do is install Spark NLP:

For this project, I’ll be using Colab, for which SparkNLP conveniently offers setup code ⬇️

!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

After installing, we can import the required functions.

import librosa
import pandas as pd
import sparknlp
from IPython.display import Audio
from pyspark.sql import functions as F
from pyspark.sql.types import (
    ArrayType,
    FloatType,
    LongType,
    StringType,
    StructField,
    StructType,
)
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline

Don't worry if the imports look unfamiliar; they’ll make sense later.

Loading audio files

▶︎ First, let’s download a sample WAV file:

!wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/audio/samples/wavs/ngm_12484_01067234848.wav

▶︎ We will use the librosa library to load and resample our WAV file to 16 kHz, the sampling rate these models expect:

FILE_PATH = "ngm_12484_01067234848.wav"
data, sampling_rate = librosa.load(FILE_PATH, sr=16000)

# convert the NumPy float32 samples to plain Python floats for Spark
data = [float(x) for x in data]
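As a quick sanity check on what `sr=16000` does: librosa resamples the audio, so the number of samples changes in proportion to the ratio of target and original rates. A small, self-contained sketch of that arithmetic (the clip durations below are hypothetical examples, not taken from the downloaded file):

```python
def resampled_length(n_samples, orig_sr, target_sr=16000):
    """Approximate number of samples after resampling to target_sr."""
    return round(n_samples * target_sr / orig_sr)

# A 2-second clip recorded at 44.1 kHz (88,200 samples)
# becomes ~32,000 samples at 16 kHz, still 2 seconds of audio.
print(resampled_length(88200, 44100))  # 32000
```

The duration of the audio is unchanged; only the number of samples per second is reduced to match what the model was trained on.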

► This is how we can create a PySpark DataFrame from the librosa results:

# start a Spark session (returned by sparknlp.start())
spark = sparknlp.start()

schema = StructType([
    StructField("audio_content", ArrayType(FloatType())),
    StructField("sampling_rate", LongType())
])

df = pd.DataFrame({
    "audio_content": [data],
    "sampling_rate": [sampling_rate]
})

spark_df = spark.createDataFrame(df, schema)
spark_df.printSchema()
Output

Using Pretrained Pipelines

► The simplest and fastest way is to use a pre-trained pipeline for ASR:

# Download a pre-trained pipeline
pipeline = PretrainedPipeline('asr_whisper_tiny_english_pipeline', lang='en')

pipelineDF = pipeline.transform(spark_df)
pipelineDF.printSchema()
Schema
pipelineDF.select("text.result", "text.metadata").show(truncate=False)
Result

OR

Using Pretrained Models with Custom Pipeline

► You can also construct your own custom pipeline using Spark NLP pretrained models. This way, you have more control and flexibility over the entire pipeline.

Whisper Model

There are many pretrained Whisper models; a good starting point is the official OpenAI models:

audio_assembler = (
    AudioAssembler()
    .setInputCol("audio_content")
    .setOutputCol("audio_assembler")
)

speech_to_text = (
    WhisperForCTC.pretrained("asr_whisper_tiny_opt")
    .setInputCols("audio_assembler")
    .setOutputCol("text")
)

pipeline = Pipeline(stages=[audio_assembler, speech_to_text])

pipelineDF = pipeline.fit(spark_df).transform(spark_df)
pipelineDF.select("text.result", "text.metadata").show(truncate=False)
Result

Want to see this in action? Check out our live demo where you can try it out with your own audio files! It’s super easy and fun to see how it converts speech to text right before your eyes. Give it a go here and explore what SparkNLP can do!

For further reading and resources, consider exploring the following:
