How to convert PDFs into Audiobooks using OpenAI’s Text-to-Speech API

Anurag Kumar
Coinmonks
8 min read · Nov 29, 2023


Image created using DALL·E

I absolutely love audiobooks. Whether I’m driving, working out, or just unwinding with an evening walk, audiobooks make a perfect companion.

Today, I put my coding skills to work and came up with a Python program that turns PDFs into audiobooks. It’s neat to see how OpenAI’s text-to-speech model can bring a written document to life.

Let’s dive in and see how we can turn any PDF into an audiobook.

This problem can be divided into 5 steps:

  1. Extract text from PDF
  2. Chunking the text: OpenAI’s API currently accepts text-to-speech input in chunks of at most 4096 characters, so we need to break the PDF text into pieces of fewer than 4096 characters each.
  3. Text-to-Speech Conversion
  4. Combine Individual Audio Clips into a Single File
  5. (Optional) Creating a Video File with Cover Image and Audio

And this is how we do it in Python.

You can download the code from my GitHub Repo. The code is self-explanatory, and I have added tons of comments.

Install the necessary Libraries

First, we need to install the necessary libraries. Here is the list of libraries we would need to accomplish the different steps mentioned above.

  • pdfplumber: For reading and extracting data from PDFs, crucial for accessing the text content within the PDF file.
  • openai: The OpenAI Python library, used here for its text-to-speech capabilities, allowing us to convert extracted text into audio.
  • pydub: A library for audio manipulation, handy if you want to slice or post-process the generated clips.
  • moviepy: A video-editing library that can also handle audio; we use it to combine the audio clips into one file and to render the optional video.
!pip install pdfplumber
!pip install openai
!pip install pydub
!pip install moviepy

Extract text from PDF

Let’s define the pdf_to_markdown function, which extracts text from a PDF file and formats it as Markdown. The function takes the path of the PDF file as input and returns the formatted text.

For this example I used a PDF file titled Kant_What_is_Enlightenment.pdf, which I downloaded from the Northampton Community College website.

import pdfplumber

def pdf_to_markdown(pdf_path):
    # Open the PDF file at the given path
    with pdfplumber.open(pdf_path) as pdf:
        markdown_content = ""
        # Loop through each page in the PDF
        for page in pdf.pages:
            # Extract text from each page
            text = page.extract_text()
            if text:
                # Format the text with basic Markdown: double newline for new paragraphs
                markdown_page = text.replace('\n', '\n\n')
                # Add a separator line between pages
                markdown_content += markdown_page + '\n\n---\n\n'

    return markdown_content

# Function Usage
pdf_path = 'Kant_What_is_Enlightenment.pdf'  # Replace with the actual PDF file path
markdown_text = pdf_to_markdown(pdf_path)
print(markdown_text)  # Print the extracted and formatted text

Next, we create the markdown_to_plain_text function, which converts Markdown-formatted text into plain text using regular expressions (regex).

The purpose of this function is to clean the extracted text from any Markdown formatting, making it suitable for speech synthesis in the next steps.

# Importing the 're' module for regular expression operations
import re

def markdown_to_plain_text(markdown_text):
    # Remove Markdown URL syntax ([text](link)) and keep only the text
    text = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', markdown_text)

    # Remove Markdown formatting for bold and italic text
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Bold with **
    text = re.sub(r'\*([^*]+)\*', r'\1', text)      # Italic with *
    text = re.sub(r'\_\_([^_]+)\_\_', r'\1', text)  # Bold with __
    text = re.sub(r'\_([^_]+)\_', r'\1', text)      # Italic with _

    # Remove Markdown headers, list items, and blockquote symbols
    text = re.sub(r'#+\s?', '', text)  # Headers
    text = re.sub(r'-\s?', '', text)   # List items
    text = re.sub(r'>\s?', '', text)   # Blockquotes

    return text

# Function Usage
plain_text = markdown_to_plain_text(markdown_text)
print(plain_text)  # Printing the converted plain text
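To see what the regex pipeline does without loading a PDF, here is a small self-contained check on a sample Markdown snippet (the function body is repeated so the example runs on its own):

```python
import re

def markdown_to_plain_text(markdown_text):
    # Same regex pipeline as above, reproduced so this example is self-contained
    text = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', markdown_text)  # Links
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Bold with **
    text = re.sub(r'\*([^*]+)\*', r'\1', text)      # Italic with *
    text = re.sub(r'\_\_([^_]+)\_\_', r'\1', text)  # Bold with __
    text = re.sub(r'\_([^_]+)\_', r'\1', text)      # Italic with _
    text = re.sub(r'#+\s?', '', text)  # Headers
    text = re.sub(r'-\s?', '', text)   # List items
    text = re.sub(r'>\s?', '', text)   # Blockquotes
    return text

sample = "# Title\n**Bold** and *italic* with a [link](https://example.com)."
print(markdown_to_plain_text(sample))
```

The link keeps only its display text, the emphasis markers disappear, and the header symbol is stripped.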

The text may require some more cleaning, but it should be more or less ready. If you need to clean the text further, you can use the str.replace method. Here is an example.

# Further cleaning of the plain text
# Here, we are removing a specific unwanted word or artifact from the text
# Replace "artifact" with any specific word or symbol you need to remove
plain_text = plain_text.replace("artifact", "")

# Printing the cleaned text to verify the changes
print(plain_text)

Chunking the text

Given the character limit of 4096 for the OpenAI text-to-speech API, let’s create a function called split_text designed to divide the cleaned text into smaller chunks. Each chunk adheres to the maximum character limit, ensuring compatibility with the API. The process is as follows:

  1. The function splits the text into sentences.
  2. It then iteratively adds sentences to a chunk until adding another sentence would exceed the character limit.
  3. Once the limit would be exceeded, the current chunk is saved, and a new chunk starts with the next sentence.
  4. This process continues until all sentences are allocated to chunks.
def split_text(text, max_chunk_size=4096):
    chunks = []  # List to hold the chunks of text
    current_chunk = ""  # String to build the current chunk

    # Split the text into sentences and iterate through them
    for sentence in text.split('.'):
        sentence = sentence.strip()  # Remove leading/trailing whitespace
        if not sentence:
            continue  # Skip empty sentences

        # Check if adding the sentence would exceed the max chunk size
        if len(current_chunk) + len(sentence) + 1 <= max_chunk_size:
            current_chunk += sentence + "."  # Add sentence to current chunk
        else:
            chunks.append(current_chunk)  # Add the current chunk to the list
            current_chunk = sentence + "."  # Start a new chunk

    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk)

    return chunks

# Function Usage
chunks = split_text(plain_text)

# Printing each chunk with its number
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n---\n")
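Before spending API credits, it’s worth sanity-checking the chunking invariant with a tiny limit. The function is repeated here so the check runs on its own; the 30-character limit is just for illustration:

```python
def split_text(text, max_chunk_size=4096):
    # Same logic as above, repeated so this check is self-contained
    chunks = []
    current_chunk = ""
    for sentence in text.split('.'):
        sentence = sentence.strip()
        if not sentence:
            continue
        if len(current_chunk) + len(sentence) + 1 <= max_chunk_size:
            current_chunk += sentence + "."
        else:
            chunks.append(current_chunk)
            current_chunk = sentence + "."
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

# With a 30-character limit, each sentence lands in its own chunk
sample = "One short sentence. Another short sentence. A third one."
chunks = split_text(sample, max_chunk_size=30)
print(chunks)
assert all(len(c) <= 30 for c in chunks)
```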

Text-to-Speech Conversion

Next, let’s create a text_to_speech function, which uses OpenAI's text-to-speech API to convert text into audio. The function performs the following steps:

  1. Initializes an OpenAI client to interact with the API.
  2. Sends a request to the Audio API with the specified text, model, and voice parameters. The model parameter defines the quality of the text-to-speech conversion, while the voice parameter selects the voice type.
  3. Receives the audio response from the API and streams it to a specified output file.

⚠️ Please note that I have set my OpenAI API key in the OPENAI_API_KEY environment variable; otherwise you will need to provide an API key to the client.
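A minimal, stdlib-only way to check for the key before making any API calls (just a sketch; the openai client also accepts an explicit api_key argument):

```python
import os

# The OpenAI client reads OPENAI_API_KEY from the environment automatically
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    print("OPENAI_API_KEY is not set; pass api_key=... to openai.OpenAI() instead.")
else:
    print("OPENAI_API_KEY found in the environment.")
```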

# Importing necessary modules
from pathlib import Path
import openai

def text_to_speech(input_text, output_file, model="tts-1-hd", voice="nova"):
    # Initialize the OpenAI client
    client = openai.OpenAI()

    # Make a request to OpenAI's Audio API with the given text, model, and voice
    response = client.audio.speech.create(
        model=model,       # Model for text-to-speech quality
        voice=voice,       # Voice type
        input=input_text   # The text to be converted into speech
    )

    # Define the path for the output audio file
    speech_file_path = Path(output_file)

    # Stream the audio response to the specified file
    response.stream_to_file(speech_file_path)

    # Print confirmation message after saving the audio file
    print(f"Audio saved to {speech_file_path}")

Converting Text Chunks to Audio Files

Let’s define the convert_chunks_to_audio function, which processes each text chunk through the text_to_speech function and saves the resulting audio files. The steps are as follows:

  1. Iterate over the chunks of text.
  2. For each chunk, create a filename for the output audio file, ensuring it is saved in the specified output folder.
  3. Convert each text chunk to an audio file using the text_to_speech function defined earlier.
  4. Store the path of each generated audio file in a list.
# Importing necessary modules
import os

def convert_chunks_to_audio(chunks, output_folder):
    audio_files = []  # List to store the paths of generated audio files

    # Iterate over each chunk of text
    for i, chunk in enumerate(chunks):
        # Zero-pad the chunk number so the files also sort correctly by name
        output_file = os.path.join(output_folder, f"chunk_{i+1:03d}.mp3")

        # Convert the text chunk to speech and save as an audio file
        text_to_speech(chunk, output_file)

        # Append the path of the created audio file to the list
        audio_files.append(output_file)

    return audio_files  # Return the list of audio file paths

# Function Usage
output_folder = "chunks"  # Define the folder to save audio chunks
audio_files = convert_chunks_to_audio(chunks, output_folder)  # Convert chunks to audio files
print(audio_files)  # Print the list of all the audio files generated

Note: make sure the output folder exists before running the code. In our example it is called chunks.
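You can also create the folder directly from Python; exist_ok=True makes this safe to re-run:

```python
import os

output_folder = "chunks"
# Create the folder if it doesn't exist; no error if it already does
os.makedirs(output_folder, exist_ok=True)
print(os.path.isdir(output_folder))  # → True
```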

Combine Individual Audio Clips into a Single File

The combine_audio_with_moviepy function combines multiple audio clips into a single audio file using the moviepy library. The function follows these steps:

  1. Iterate through the files in the specified folder, filtering for .mp3 files.
  2. For each audio file, create an AudioFileClip object and add it to a list.
  3. Once all audio clips are collected, use concatenate_audioclips to merge them into a single continuous audio clip.
  4. Write the combined clip to an output file.
# Importing necessary modules from moviepy
from moviepy.editor import concatenate_audioclips, AudioFileClip
import os
import re

def combine_audio_with_moviepy(folder_path, output_file):
    audio_clips = []  # List to store the audio clips

    # Sort files by the number in their name (chunk_2 before chunk_10),
    # since a plain lexicographic sort would put chunk_10 before chunk_2
    def chunk_number(file_name):
        match = re.search(r'\d+', file_name)
        return int(match.group()) if match else 0

    # Iterate through each file in the given folder
    for file_name in sorted(os.listdir(folder_path), key=chunk_number):
        if file_name.endswith('.mp3'):
            # Construct the full path of the audio file
            file_path = os.path.join(folder_path, file_name)
            print(f"Processing file: {file_path}")

            try:
                # Create an AudioFileClip object for each audio file
                clip = AudioFileClip(file_path)
                audio_clips.append(clip)  # Add the clip to the list
            except Exception as e:
                # Print any errors encountered while processing the file
                print(f"Error processing file {file_path}: {e}")

    # Check if there are any audio clips to combine
    if audio_clips:
        # Concatenate all the audio clips into a single clip
        final_clip = concatenate_audioclips(audio_clips)
        # Write the combined clip to the specified output file
        final_clip.write_audiofile(output_file)
        print(f"Combined audio saved to {output_file}")
    else:
        print("No audio clips to combine.")

# Function Usage
combine_audio_with_moviepy('chunks', 'combined_audio.mp3')  # Combine audio files in the 'chunks' folder

(Optional) Creating a Video File with Cover Image and Audio

I created an image in Canva which I will render as a video while audio plays in the background.

The create_mp4_with_image_and_audio function combines an image and an audio file to create an MP4 video. This can be particularly useful for presentations or other scenarios where an audio track needs to be accompanied by a static image, like a YouTube video. The function performs the following steps:

  1. Load the audio file as an AudioFileClip.
  2. Create a video clip from the specified image using ImageClip, setting its duration to match the length of the audio.
  3. Set the frames per second (fps) for the video clip.
  4. Assign the audio clip as the audio track of the video clip.
  5. Write the final video clip to an output file, specifying the video and audio codecs.
from moviepy.editor import AudioFileClip, ImageClip

def create_mp4_with_image_and_audio(image_file, audio_file, output_file, fps=24):
    # Load the audio file
    audio_clip = AudioFileClip(audio_file)

    # Create a video clip from an image, matching the audio's duration
    video_clip = ImageClip(image_file, duration=audio_clip.duration)

    # Set the fps for the video clip
    video_clip = video_clip.set_fps(fps)

    # Set the audio of the video clip as the audio clip
    video_clip = video_clip.set_audio(audio_clip)

    # Write the result to a file
    video_clip.write_videofile(output_file, codec='libx264', audio_codec='aac')

# Example usage
image_file = 'cover_image.png'    # Replace with the path to your image
audio_file = 'combined_audio.mp3' # The combined audio file
output_file = 'output_video.mp4'  # Output MP4 file
create_mp4_with_image_and_audio(image_file, audio_file, output_file)

And that’s it. Once this code finishes running, we have an audiobook.

You can download the code for this from our GitHub Repo and do check out our website.


Anurag Kumar

Founder, Prex Learning Studio. Sharing thoughts on ChatGPT, Midjourney & Generative AI use-cases. IITB-IIMB alumnus. ex-Wipro Global 100.