How to Create a Text-to-Speech API Using the `pyttsx3` Library

4 min read · May 12, 2022

Have you ever been bored reading a huge chunk of text, wishing that someone could read it aloud to you, or that you could save it as an audio file to play back anytime?

Getting Started

In this article, I will briefly guide you through creating your own API to convert text to speech using the Python library pyttsx3. We will explore two different methods:

Converting text to speech in real time
Taking in a text and converting it to an audio file

pyttsx3

The pyttsx3 module in Python converts text to speech and, unlike many other text-to-speech libraries, works offline. First, we need to install this module along with the other requirements for creating a Flask API.

You can read more on the Flask implementation over here.

pip3 install -r requirements.txt

Now that the environment is set up, we can create our first endpoint.

Converting Text To Speech Real Time

Text-to-speech real time

Let’s break down the code.

First, we import all the relevant modules for both Flask and pyttsx3. Next, we create a POST endpoint, /text-to-speech. When this file is running, you can send a POST request to http://0.0.0.0/text-to-speech and get a real-time voice response.

This endpoint takes a JSON request body as shown above. In our code, we first validate the request and do some simple string processing on the text. To optimise the conversion from text to audio, we remove punctuation and extra spaces from the text.
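The validation and clean-up step described above could be sketched roughly as follows. This is a minimal illustration, not the article's original code: the field name `text` and the helper names `clean_text` and `validate_request` are assumptions, and "removing spaces" is interpreted here as collapsing extra whitespace so that words stay separated.

```python
import string


def clean_text(text: str) -> str:
    """Strip punctuation and collapse runs of whitespace into single spaces."""
    without_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(without_punctuation.split())


def validate_request(payload) -> str:
    """Return the cleaned text, or raise ValueError if the body is unusable.

    Assumes the JSON body looks like {"text": "..."} (hypothetical field name).
    """
    if not isinstance(payload, dict):
        raise ValueError("Request body must be a JSON object.")
    text = payload.get("text")
    if not isinstance(text, str):
        raise ValueError("Field 'text' must be a string.")
    cleaned = clean_text(text)
    if not cleaned:
        raise ValueError("Field 'text' must contain some speakable content.")
    return cleaned
```

Calling `validate_request({"text": "Hello,   world!"})` would hand the speech engine a tidy string with the comma, exclamation mark and repeated spaces removed.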

Next, we have a helper function, set_up, that initialises the text-to-speech engine and configures the speech rate and volume. After setting up the engine, we call the say method, which plays the audio whenever this endpoint is called.
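As a rough sketch of what such a helper might look like: the default rate and volume values and the `speak` wrapper are assumptions made for illustration. In pyttsx3, `say` only queues the text, and `runAndWait` is what processes the queue and produces the sound.

```python
def set_up(rate: int = 150, volume: float = 1.0, engine=None):
    """Create (or reuse) a pyttsx3 engine and set its speech rate and volume.

    The defaults here are illustrative assumptions, not the article's values.
    """
    if engine is None:
        # Imported here so the sketch can be loaded on a machine
        # that has no speech driver installed.
        import pyttsx3

        engine = pyttsx3.init()
    engine.setProperty("rate", rate)      # speaking rate in words per minute
    engine.setProperty("volume", volume)  # 0.0 (silent) to 1.0 (full volume)
    return engine


def speak(text: str, engine=None) -> None:
    """Hypothetical wrapper: configure the engine, then read the text aloud."""
    engine = set_up(engine=engine)
    engine.say(text)     # queue the utterance
    engine.runAndWait()  # block until everything queued has been spoken
```

Passing an existing engine through the `engine` parameter is a design choice of this sketch; it makes the helper easy to exercise with a stand-in object during testing.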

This endpoint returns a JSON response. If an error occurs while processing the text, it returns status code 500 with the relevant error message.
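One way to picture this response handling is the sketch below. The `type` and `message` keys mirror the sample response shown later in this article, while the `"error"` value, the message wording, the `text` field name, and the `text_to_speech_handler` and `create_app` helpers are all assumptions. The `speak` argument stands for whatever callable actually plays the text.

```python
def text_to_speech_handler(payload, speak):
    """Return (response_body, status_code) for a text-to-speech request."""
    try:
        text = (payload or {}).get("text")  # hypothetical field name
        if not text:
            raise ValueError("Field 'text' is required.")
        speak(text)
    except Exception as error:
        # Any failure during processing is reported as a 500, as described above.
        return {"type": "error", "message": str(error)}, 500
    return {"type": "success", "message": "Text converted to speech."}, 200


def create_app(speak):
    """Wire the handler into a Flask app exposing POST /text-to-speech."""
    # Imported here so the handler above can be tried without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/text-to-speech", methods=["POST"])
    def text_to_speech():
        payload = request.get_json(silent=True)
        body, status = text_to_speech_handler(payload, speak)
        return jsonify(body), status

    return app
```

Keeping the handler separate from the Flask route is a choice made for this sketch so the success and failure paths can be checked without starting a server.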

Taking in a text and converting it to an audio file

The second method is slightly more complex, as we will return an audio file as the response instead of playing the text in real time.

We will be using AWS S3 to store the audio file.

Prerequisites

An AWS Account with your Access Key information. Refer to this guide to set up your Access Key.

We will add this code to the main.py file above.

text to speech audio file

Let’s break down the code into two parts.

Processing of Text

We will create a POST endpoint, /text-to-speech/audio-file. When this file is running, you can send a POST request to http://0.0.0.0/text-to-speech/audio-file and receive a response containing an S3 URL that gives access to your audio file.

This endpoint takes a JSON request body as shown above. The text processing is the same as in the real-time conversion setup. The only difference is the method used: here we call save_to_file(text, "filename") instead of the say method.
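A minimal sketch of the file-saving step, assuming a hypothetical helper named `save_speech`. As with `say`, pyttsx3's `save_to_file` only queues the work; `runAndWait` must be called for the file to be written.

```python
def save_speech(text: str, filename: str = "audio.mp3", engine=None) -> str:
    """Write the spoken form of `text` to `filename` and return the filename."""
    if engine is None:
        # Imported here so the sketch can be loaded on a machine
        # that has no speech driver installed.
        import pyttsx3

        engine = pyttsx3.init()
    engine.save_to_file(text, filename)  # queue the file-writing command
    engine.runAndWait()                  # run the queue so the file is written
    return filename
```

Returning the filename is a convenience added in this sketch so the caller can pass it straight to an upload step.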

You can see from the code that the filename is hardcoded as audio.mp3. Running the save_to_file() method saves an audio file in your current directory. To enhance this implementation, we want to store the audio file somewhere secure and easily accessible through a URL. Hardcoding the audio filename lets the S3 upload function reference the file easily. Next, we will go into detail on how we use Boto3, the AWS SDK for Python, to upload and retrieve the audio file.

S3 Configuration

Before uploading the audio file, we have to set up the boto3 SDK configuration. This is where we create the boto3.client with our access_key and secret_key.
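As an illustration of this configuration step, the sketch below reads the two keys from environment variables rather than hardcoding them. Reading from the environment and the helper names `s3_credentials` and `create_s3_client` are assumptions of this sketch; the article only states that the client is created with an access key and a secret key.

```python
import os


def s3_credentials(env=os.environ) -> dict:
    """Collect the keyword arguments that boto3.client expects for static keys."""
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }


def create_s3_client(env=os.environ):
    """Build an S3 client from the credentials found in `env`."""
    # Imported here so the credential helper can be tried without boto3 installed.
    import boto3

    return boto3.client("s3", **s3_credentials(env))
```

Keeping the keys out of the source file also makes it harder to commit them to a repository by accident.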

Next, we create the upload_data function, which takes the following parameters: file_name (audio.mp3), object_name (the filename in the request body) and bucket_name. Within the function, we call the S3 upload file API. You will then see the audio file uploaded to your S3 bucket in the audio directory.
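A sketch of how such an upload function might be written with boto3's `upload_file` client method. Passing the client in as the first argument, and treating `object_name` as already including its extension (for example `sample.mp3`), are assumptions of this sketch; the `audio/` prefix follows the description above.

```python
def upload_data(s3_client, file_name: str, object_name: str, bucket_name: str) -> str:
    """Upload the local file to the bucket's audio/ prefix and return its key."""
    key = f"audio/{object_name}"
    # boto3 signature: upload_file(Filename, Bucket, Key)
    s3_client.upload_file(file_name, bucket_name, key)
    return key
```

With a real client, `upload_data(client, "audio.mp3", "sample.mp3", "my-bucket")` would place the file at `audio/sample.mp3`, where `my-bucket` is a placeholder bucket name.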

After uploading the audio file, you will want to retrieve and access it. We have a function, get_presigned_s3_url, that takes bucket_name and file_name as parameters. It returns a URL to the audio file that expires in 60 minutes (configurable).

Note that file_name here is the filename passed in the request body.
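The retrieval step could be sketched with boto3's `generate_presigned_url` as shown below. The extra `s3_client` and `expires_in` parameters and the `audio/` key prefix are assumptions of this sketch; 3600 seconds corresponds to the 60-minute expiry mentioned above.

```python
def get_presigned_s3_url(
    s3_client, bucket_name: str, file_name: str, expires_in: int = 3600
) -> str:
    """Return a time-limited download URL for audio/<file_name> in the bucket."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket_name, "Key": f"audio/{file_name}"},
        ExpiresIn=expires_in,  # lifetime of the URL in seconds
    )
```

Anyone holding the returned URL can download the file until it expires, so a shorter `expires_in` may be preferable for sensitive audio.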

Sample Response

{
  "type": "success",
  "message": "You have successfully process the text...",
  "data": "https://text-to-speech-api.s3.amazonaws.com/audio/sample.mp3?AWSAccessKeyId=AKIA6FCVG6J67DGJW2OB&Signature=5Aq%2FXIQ8z5z2D7Z9B4zg%2BiiL8xA%3D&Expires=1652376411"
}

Conclusion

So far, you have learned how to use the Python library to create a text-to-speech API with additional functions. I have deployed sample Swagger documentation covering all the endpoints. You can test the endpoints by accessing the documentation here. Do comment below or reach out to me via LinkedIn if you need clarification on the implementation.