Building a highly accurate Speech to Text API supporting dozens of languages with Python and Whisper

Did you know that you can now build a highly accurate speech to text API for free? Something big recently happened in the industry: the OpenAI team open-sourced Whisper, their speech to text AI model that rivals anything we have had so far, including paid offerings like AWS's speech to text API.

I recently participated in a Hackathon at my workplace (Slid) where we spent 18 hours building cool features. My team set out to build a feature for recording an audio segment from an online video and then transcribing the audio recording.

Anticipating that there might be other hackers out there interested in trying out the Whisper model, I figured I would write a bit about how to set it up in a Django project.

Our hackathon project would have 2 main parts:

  1. Recording a desired audio range from a video and uploading the recorded file to a cloud based storage.
  2. Invoking an API that should perform the transcription task and save the results in a database.

In this article I will focus on the 2nd part; in other words, how to build an API that, given the URL of an audio recording, can produce a transcript of it. There is a wide range of use cases for such an API. For the sake of learning, I am going to summarize the process as best I can within my time constraints.

To benefit most from this article, a reader would need the following:

  1. Decent knowledge of the Python Programming Language
  2. Some knowledge of Django and Django Rest Framework
  3. Basic knowledge of the Linux operating system
  4. Some knowledge of AWS (EC2, S3 & IAM services)

A summary of what we will be building

We are going to assume that the audio files you wish to transcribe have been uploaded to internet-accessible storage, where your API can reach and download them.

Your frontend will call the API, sending it the URL of the audio file, and after that it will poll the API to ask whether the task is done and, if so, retrieve the results.

Using Django & DRF we will create an endpoint that receives a POST request and kicks off the transcription task, and from there onward receives GET requests for getting the results of the task.

On the backend, we will set up a Celery service to perform the actual work asynchronously and, when it is done, save the results in a database table.

PART 0: Setting up a Django API and deploying it

If you already have an up and running API server, you can skip this part. Since setting up a Django project and deploying it is outside the scope of this article, I will link you to 2 articles, either of which I think is a nice guide for getting that part done:

  • How To Set Up Django with Postgres, Nginx, and Gunicorn on Ubuntu 20.04 (here)
  • How To Set Up Django with Postgres, Nginx, and Gunicorn on Ubuntu 22.04 (here)

Different engineers structure their Django projects differently. For the purposes of this article, we are going to assume a structure like this:

Sample django project structure
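
Concretely, here is a rough sketch of the layout assumed in the rest of this article. The directory and file names are inferred from the paths used in later sections, so adjust them to match your own project:

project/
├── manage.py
├── celery_stt.service
├── project/
│   ├── settings/
│   │   └── settings.py
│   ├── celery.py
│   └── urls.py
└── app/
    ├── models.py
    ├── serializers.py
    ├── urls.py
    ├── utils.py
    ├── views/
    │   └── stt_views.py
    ├── async_task/
    │   └── stt_task.py
    └── migrations/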

PART 1: Installing the required dependencies

Assuming you have a ready and running Django & DRF project, activate your virtual environment and install the Redis server, Celery, and Whisper:

sudo apt update

# redis server
sudo apt install redis
redis-server
pip install redis

# celery
pip install celery

# whisper (https://github.com/openai/whisper)
sudo apt install ffmpeg
pip install setuptools-rust
pip install git+https://github.com/openai/whisper.git
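
Before wiring Whisper into Django, it can be worth a quick sanity check that Whisper and ffmpeg were installed correctly. Here is a minimal throwaway sketch; sample.wav is a placeholder for any short audio file you have lying around:

# quick_whisper_check.py (a throwaway sanity check; sample.wav is a placeholder)
import whisper

model = whisper.load_model("base")       # downloads the model weights on the first run
result = model.transcribe("sample.wav")  # whisper uses ffmpeg under the hood to decode the audio
print(result["text"])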

PART 2: Writing a model for the new table

We will be creating a new database table for storing the results of the transcription tasks. Let's call our model STTResult.

Here is an example of our model definition:

# project/app/models.py

from django.db import models
from django.utils import timezone


class STTResult(models.Model):
    user = models.ForeignKey(YourUserTable, models.DO_NOTHING, blank=True, null=True)  # user who created the task
    stt_result_key = models.CharField(unique=True, max_length=120)  # unique key for each stt result
    stt_result_status = models.CharField(max_length=50, blank=True, null=False, default="pending")  # 'pending', 'success', 'failed'
    stt_result_script = models.TextField(blank=True, null=True)  # transcript
    stt_start_time = models.DateTimeField(blank=True, null=False, default=timezone.now)
    stt_end_time = models.DateTimeField(blank=True, null=False, default=timezone.now)
    audio_source_url = models.CharField(max_length=200, blank=True, null=True)  # url of the audio file

    class Meta:
        managed = True
        db_table = "stt_result"  # table name

    def __str__(self):
        return self.stt_result_key

Some of the fields in the model above are optional; you can choose to exclude them depending on your business logic, but at a minimum you would need:

  • A unique key column for identifying each task. This is also the key that will be used by the Frontend to request the results of each task
  • A status column for tracking the progress of each task
  • A results column for storing the results of the task

PART 3: Making migrations (creating the actual table in the database)

Having defined our model we can now ask Django to create a database table for us. This will be done in 2 steps:

  • First we will ask django to take a look at our model definition and write instructions for creating the database table we want.
  • Then we will ask django to run the migration, which will essentially create the actual table in the database
"""
step 0:
- in your terminal navigate to the root directory of your project
(this is the folder containing your manage.py file).
- make sure your virtual environment is activated if you are using one.
(which you should)
"""

# step 1: creating a migration file
python manage.py makemigrations
"""
this will create a new file in: project/app/migrations/
(ex: 0002_stt_result.py)
"""

# step 1.5: inspect the SQL instructions (OPTIONAL: if you know some SQL)
python manage.py sqlmigrate app 0002
"""
this will print to the screen, the SQL command that will be executed on
your database when you run the next command. This is great for a
cautious engineer.
"""

# step 2: creating the actual table
python manage.py migrate

If all went well, your database table should now have been created.
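
If you want to confirm that the table really exists, a quick check from the Django shell (python manage.py shell) works well; on a fresh table the count below should simply be 0:

# inside: python manage.py shell
from app.models import STTResult

print(STTResult.objects.count())  # 0 on a freshly created table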

PART 4: Creating a serializer for the STTResult model

According to the DRF documentation:

Serializers allow complex data such as querysets and model instances to be converted to native Python datatypes that can then be easily rendered into JSON, XML or other content types. Serializers also provide deserialization, allowing parsed data to be converted back into complex types, after first validating the incoming data.

Let’s make a quick one:

# project/app/serializers.py

from rest_framework import serializers
from app.models import STTResult


class STTResultSerializer(serializers.ModelSerializer):
    class Meta:
        model = STTResult
        fields = "__all__"
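
To make the serializer's two roles concrete, here is a small illustrative sketch you could run in python manage.py shell (the sample values are made up): it validates incoming data into a model instance and then converts that instance back into plain Python types ready to be rendered as JSON.

# an illustrative shell session, not part of the project code
from app.serializers import STTResultSerializer

# incoming data -> validated model instance (deserialization)
serializer = STTResultSerializer(data={"stt_result_key": "abc123", "audio_source_url": "https://example.com/a.mp3"})
if serializer.is_valid():
    task = serializer.save()  # creates and returns an STTResult row
    # model instance -> native Python types, ready to render as JSON (serialization)
    print(STTResultSerializer(task).data)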

PART 5: Creating an API route

Now we need to define an API route which our frontend will call to initiate a task, and to retrieve the results.

First, let's tell Django: "for all requests to endpoints starting with api/, please refer to the file project/app/urls.py to figure out which code to run."

# project/project/urls.py
from django.urls import include, path

urlpatterns = [
    path("api/", include("app.urls")),
]

Then we need to go into project/app/urls.py and map the requests to the appropriate code:

# project/app/urls.py
from django.urls import path
from app.views import stt_views

urlpatterns = [
    path("stt/", stt_views.stt_general),  # for the initial POST request
    path("stt/<stt_result_key>/", stt_views.stt_detail),  # for subsequent GET requests
    # ...
]

This instructs Django to run the code located in the stt_views.py file. In the next step we will make that code available.

PART 6: STT API Logic

In this part we will write the logic for handling the POST request that is sent initially to kick off the stt task, and the GET request that is sent to retrieve the task results.

# project/app/views/stt_views.py

import uuid
from django.utils import timezone
from rest_framework import status
from rest_framework.response import Response
from rest_framework.decorators import api_view
from app.utils import my_authentication_decorator
from app.models import STTResult
from app.serializers import STTResultSerializer
from app.async_task.stt_task import celery_stt


@api_view(["POST"])
@my_authentication_decorator
def stt_general(request):
    """
    [POST] /api/stt/ - Initiate STT task
    """
    data = request.data

    # we definitely need an audio url
    if "audio_url" not in data:
        return Response(
            {"message": "'audio_url' Not Provided"}, status=status.HTTP_400_BAD_REQUEST
        )
    # save a new STT task in the db
    task = {}
    task["user"] = request.user.id  # assuming your authentication decorator attaches the user information to the request
    task["stt_result_key"] = uuid.uuid4().hex
    task["stt_result_status"] = "pending"
    task["stt_result_script"] = ""
    task["stt_start_time"] = timezone.now()
    task["stt_end_time"] = timezone.now()
    task["audio_source_url"] = data["audio_url"]
    serializer = STTResultSerializer(data=task)
    if serializer.is_valid():
        stt_task = serializer.save()  # saves the task record to the db
        celery_stt.delay(stt_task.stt_result_key)  # kicks off the async stt task
        response = generateSTTResultResponse(stt_task)
        return Response(response, status=status.HTTP_200_OK)
    return Response({"message": serializer.errors}, status=status.HTTP_400_BAD_REQUEST)


@api_view(["GET"])
@my_authentication_decorator
def stt_detail(request, stt_result_key):
    """
    [GET] /api/stt/<stt_result_key>/ - Get STT task results
    """
    # look for the requested task in the db
    try:
        stt_task = STTResult.objects.get(stt_result_key=stt_result_key)
    except STTResult.DoesNotExist:
        return Response(
            {"message": "No matching STT task found"}, status=status.HTTP_404_NOT_FOUND
        )
    # verify that the user owns the task they requested (if your business logic is as such)
    if stt_task.user_id != request.user.id:
        return Response(
            {"message": "User is not the owner of the stt task"},
            status=status.HTTP_403_FORBIDDEN,
        )
    # all is good
    response = generateSTTResultResponse(stt_task)
    return Response(response, status=status.HTTP_200_OK)


# A helper function for generating a response
def generateSTTResultResponse(stt_task):
    response = {}
    response["stt_result_key"] = stt_task.stt_result_key
    response["stt_result_status"] = stt_task.stt_result_status
    response["stt_result_script"] = stt_task.stt_result_script
    response["stt_start_time"] = stt_task.stt_start_time
    response["stt_end_time"] = stt_task.stt_end_time
    response["audio_source_url"] = stt_task.audio_source_url
    return response

Depending on your business logic, your API might need more code, but the core behavior is generally as defined above.

Notice how we assumed the existence of my_authentication_decorator for handling authentication. Since an authentication system is itself a large topic, we will not cover it in this article, but if you end up struggling to set one up, kindly leave a comment and I could publish a separate article for it.

In summary, what that decorator would do is verify and decode the authentication token sent by the user and figure out whether they are a legitimate user as far as our db is concerned. If the user does exist, it would attach the user to the request. This way we are able to do request.user.id to get the user id.
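
For illustration only, here is a minimal sketch of what such a decorator could look like. It is not the decorator we actually used; get_user_from_token is a hypothetical helper that maps the token in the Authorization header to a user row, and your real authentication scheme will differ:

# project/app/utils.py (illustrative sketch only; get_user_from_token is a hypothetical helper)
from functools import wraps
from rest_framework import status
from rest_framework.response import Response


def my_authentication_decorator(view_func):
    @wraps(view_func)
    def wrapper(request, *args, **kwargs):
        token = request.headers.get("Authorization", "")
        user = get_user_from_token(token)  # hypothetical: verify the token and look up the user
        if user is None:
            return Response({"message": "Unauthorized"}, status=status.HTTP_401_UNAUTHORIZED)
        request.user = user  # attach the user so views can do request.user.id
        return view_func(request, *args, **kwargs)

    return wrapper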

PART 7: The asynchronous STT task logic

Our STT API logic is almost done, except we haven’t written any code for performing actual STT 😁

The last coding part will be to write the code that gets invoked when we do celery_stt.delay(stt_task.stt_result_key). Our task code will do the following:

  1. Obtain the audio url
  2. Download the audio
  3. Use the open-source Whisper model to transcribe the audio
  4. Save the results in the database

# project/app/async_task/stt_task.py

from django.utils import timezone
from app.models import STTResult
from project.celery import celery
import whisper


@celery.task(queue="stt_task")
def celery_stt(stt_result_key):
    """Transcribe an audio file from the internet"""

    # get the task from our db
    try:
        stt_task = STTResult.objects.get(stt_result_key=stt_result_key)
    except STTResult.DoesNotExist:
        return

    # check if we have a good url
    audio_url = stt_task.audio_source_url
    if not isValidUrl(audio_url):
        markAsFailed(stt_task)
        return

    # transcribe the audio using the whisper model
    # (whisper hands the path or URL to ffmpeg, which can fetch remote files directly)
    model = whisper.load_model("base")
    try:
        result = model.transcribe(audio_url)
    except Exception:
        markAsFailed(stt_task)
        return

    # save the results
    stt_task.stt_result_script = result["text"]
    stt_task.stt_result_status = "success"
    stt_task.stt_end_time = timezone.now()
    stt_task.save()


# helper function to check if we have a valid audio url
# feel free to enhance it as desired
def isValidUrl(audio_url):
    if not audio_url.startswith("http"):
        return False
    acceptable_formats = {"mp3", "wav", "webm"}  # and more...
    for format in acceptable_formats:
        if audio_url.endswith(format):
            return True
    return False


# helper function to mark a task as failed
def markAsFailed(stt_task):
    stt_task.stt_result_status = "failed"
    stt_task.stt_end_time = timezone.now()
    stt_task.save()

This is the code that gets executed whenever a POST request is made to initiate a new STT task. As you can see, we are leveraging the open-source Whisper AI model to do the work, and we are going to set up Celery to manage the task.
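
One quick note on the model = whisper.load_model("base") line: Whisper ships in several sizes, and swapping the name is the simplest accuracy/speed trade-off to experiment with. A small sketch (larger models download more weights and need noticeably more memory):

import whisper

# available sizes include "tiny", "base", "small", "medium" and "large";
# larger models are slower and heavier, but generally more accurate
model = whisper.load_model("small")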

PART 8: A note on decoupling API & Celery STT Services

Because the transcribing work may take a while, we want to free our API and perform the STT tasks asynchronously and in the background.

To achieve that we will decouple the STT service from the API service. From an architecture point of view, decoupling these 2 services brings a lot of benefits, such as:

  1. We can scale each service independently to maximize efficiency
  2. We can select the best infrastructure upon which to deploy each service (ex: compute optimized vs memory optimized server instances)
  3. Cost optimization, etc…

Normally to decouple services we need some sort of a message queue to which the services subscribe, some as publishers and others as consumers.

In our case, we are going to use Redis as our message queue. Our API will publish messages to Redis, meanwhile a Celery service that we are about to make will consume the messages and perform the tasks.
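
If you want to convince yourself that .delay() really just drops a message onto Redis, a quick sketch like the one below can help once the broker is configured in PART 9. It assumes the local Redis instance from PART 1 and Celery's default behavior of using the queue name ("stt_task") as the Redis list key:

# peek_queue.py (a throwaway sketch for inspecting the broker; assumes Redis on the default local port)
import redis

r = redis.Redis(host="127.0.0.1", port=6379)
print(r.llen("stt_task"))  # number of messages currently waiting in the stt_task queue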

PART 9: Setting up Celery Service configurations

In the project's root directory, let's create a service file named celery_stt.service with the following content:

[Unit]
Description=STT Service worker
After=network.target

[Service]
Type=simple
User=your_user
Environment=DJANGO_SETTINGS_MODULE=project.settings.settings
WorkingDirectory=/home/your_user/your_git_dir/project
ExecStart=/home/your_user/your_git_dir/venv/bin/celery -A project worker -l info --concurrency=2 -Q stt_task -n stt_task_worker.%%h
Restart=always

[Install]
WantedBy=multi-user.target

Things to pay attention to in the above file:

  • I have assumed a file system layout of /home/your_user/your_git_dir/project; if yours is set up differently, please modify the paths accordingly
  • Make sure User is set to your Linux user
  • Make sure Environment points to your Django settings module
  • Make sure the WorkingDirectory is set to your main project directory
  • Make sure ExecStart points all the way into your virtual environment directory, where we installed Celery. Also, on this line -Q must be followed by the queue name we set in the Celery task code (remember @celery.task(queue="stt_task"))

Having created a service file, now we need to do a few more django configurations:

First, we need to define the address of the message queue where celery will be finding messages.

# project/project/settings/settings.py
# redis location
"""
here we are using a local redis server, but notice that your message broker
could be hosted virtually anywhere
"""
CELERY_BROKER_URL = "redis://127.0.0.1:6379"
CELERY_RESULT_BACKEND = "redis://127.0.0.1:6379"

Next, we need to create a Celery application instance:

# project/project/celery.py
import os
from celery import Celery
from django.conf import settings

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "project.settings.settings")
celery = Celery(
    "tasks",
    broker=settings.CELERY_BROKER_URL,
    include=["app.async_task.stt_task"],  # so the worker registers our task module
)

PART 10: Starting & Enabling the Celery service with Systemd

We want our Celery service to always be up and running, and to be restarted right away if the server gets restarted. This is where systemd comes in.

According to systemd.io:

systemd is a suite of basic building blocks for a Linux system. It provides a system and service manager that runs as PID 1 and starts the rest of the system.

The commands below are being run from the project’s root directory.

First, we need to copy our service file to the location where systemd stores service files.

sudo cp celery_stt.service /etc/systemd/system/celery_stt.service

Next, we need to start the celery_stt service

sudo systemctl start celery_stt.service

Then we need to enable this service so that, in the event that the server gets restarted, this service will be started automatically.

sudo systemctl enable celery_stt.service

Finally, we can check if our service is running well.

sudo systemctl status celery_stt.service

If the service was configured correctly, you should see a very noticeable green status reading: active (running). If not, kindly retrace your steps and check whether everything was configured properly.

If you still can not figure out the problem, please leave a comment to this article so that I can help.

With that, you should have the Celery service running; when your API receives a POST request, the Celery task will kick off and its results will be saved in your database table.

If you want some live visibility into the STT service, you can use the following command to see the live logs:

sudo journalctl -f -u celery_stt.service

To test that your system is running well, you can run the above command for your API service as well as your STT service and watch how the API receives a POST request and then the STT service picks up a task. The command above will also show you how long the tasks are taking.
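
For an end-to-end check from the client side, here is a minimal polling sketch using the requests library (pip install requests). The base URL, authorization header and audio URL are placeholders, while the field names match the responses produced by generateSTTResultResponse above:

# test_stt_client.py (an illustrative end-to-end check; URLs and token are placeholders)
import time
import requests

API_BASE = "https://your-api.example.com/api"
HEADERS = {"Authorization": "your-token-here"}

# 1. kick off a transcription task
resp = requests.post(
    f"{API_BASE}/stt/",
    json={"audio_url": "https://your-bucket.s3.amazonaws.com/recording.mp3"},
    headers=HEADERS,
)
stt_result_key = resp.json()["stt_result_key"]

# 2. poll until the task succeeds or fails
while True:
    result = requests.get(f"{API_BASE}/stt/{stt_result_key}/", headers=HEADERS).json()
    if result["stt_result_status"] in ("success", "failed"):
        break
    time.sleep(5)

print(result["stt_result_status"])
print(result["stt_result_script"])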

Conclusion & Final words

In this article I tried to cover the most important pieces in setting up a speech to text API.

We set up and used the recently open-sourced Whisper, a general-purpose speech recognition model, to transcribe the audio.

We used Celery to set up an asynchronous STT task so that we could decouple our services as we please.

Lastly, I am aware that there are many details not covered in this article. If you are using this article as a guide to actually build this system, I am happy to get in touch and help you. Kindly leave a comment.

I would like to thank the OpenAI team for open-sourcing the Whisper project.

Please consider following me and sharing this article. It offers great encouragement to me, at no additional cost to you 🤗

Thanks for reading, let me know if there is a topic you would like me to write about next time.

Until next time 👋🏽
