Developing an Audio Summarization Tool with ChatGPT and Python for Fast Knowledge Digests

Converting Audio Files to Text using Python

(KJH) Kuan-Jung, Huang
KJH 新知分享
Mar 2, 2023

Why I want to do this

My motivation for this project stems from a desire to efficiently consume the valuable content I come across daily. Despite dedicating a significant amount of time to reading and watching videos, I struggle to keep up with an ever-growing backlog. Inspired by ChatGPT's ability to summarize articles quickly and effectively, I want to build a project that converts audio files to text using Python and then leverages ChatGPT to summarize them, so I can consume content more efficiently.

Start the project

Follow the guide below to complete the project, or fork the code from my GitHub. Please fork the project or give it a star if you like it =).

The GitHub repository has the final code, so if you just want to start using the program, you can download it directly. But if you want to dive a little deeper into the tech, read on.

  1. Open a text editor on your computer.
  2. Create a new file and name it requirements.txt.
  3. In the file, add the names of the packages you want to install, one per line, pinned to a version number with ==. For example:
tqdm==4.64.1
imageio==2.26.0
pydub==0.25.1
SpeechRecognition==3.9.0
moviepy==1.0.3

And then we can save the file.

To install the packages from the requirements.txt file, you can use the following command:

pip install -r requirements.txt

This command will install all the packages listed in the requirements.txt file with their respective versions. It's a convenient way to ensure that all the required packages are installed on your system and can be used in your Python projects.
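
If you want a quick sanity check that everything was installed into the environment you plan to run the script from, here is a minimal sketch using importlib.metadata (part of the standard library since Python 3.8; the script name check_requirements.py is just an example):

# check_requirements.py -- confirm the pinned packages are installed
from importlib.metadata import version, PackageNotFoundError

# Distribution names exactly as they appear in requirements.txt
packages = ["tqdm", "imageio", "pydub", "SpeechRecognition", "moviepy"]

for name in packages:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name} is NOT installed")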

Writing some code

After installing the required packages, create an index.py file in the root folder and paste in the following code:

import speech_recognition as sr
import moviepy.editor as mp

# Uncomment these two lines if you are starting from a video file:
# clip = mp.VideoFileClip(r"test.mp4")
# clip.audio.write_audiofile(r"converted.wav")

# Create a Recognizer and load the WAV audio file
r = sr.Recognizer()
audio = sr.AudioFile("converted.wav")

# Record the audio data and send it to Google's speech recognition API
with audio as source:
    audio_file = r.record(source)
    result = r.recognize_google(audio_file)

# Export the result
with open('recognized.txt', mode='w') as file:
    file.write("Recognized Speech:")
    file.write("\n")
    file.write(result)

print("ready!")

The code performs speech recognition on an audio file that has been extracted from a video file using the moviepy package (a sketch of that extraction step follows the list below).

  1. The speech_recognition and moviepy libraries are imported at the beginning of the script.
  2. A Recognizer instance is created from the speech_recognition library, which is used to transcribe the audio file.
  3. An audio file (converted.wav) is loaded using the AudioFile class from the speech_recognition library.
  4. The record method of the Recognizer instance is called on the audio file to record the audio data from the file.
  5. The recognize_google method of the Recognizer instance is called on the recorded audio data to perform speech recognition on it.
  6. The resulting text from the speech recognition is stored in a variable called result, which is then written to the recognized.txt file.
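
If your source is a video rather than an audio file, the two commented-out lines at the top of index.py handle that extraction. Here is a minimal sketch of that step on its own, assuming the video is named test.mp4 (any file name works as long as the paths match):

import moviepy.editor as mp

# Load the video and write its audio track out as a WAV file
clip = mp.VideoFileClip(r"test.mp4")
clip.audio.write_audiofile(r"converted.wav")
clip.close()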

The issue you may need to tackle

After you download a video and run the code, you may find that it throws an error:

Traceback (most recent call last):
File "C:\Python39\lib\site-packages\speech_recognition\__init__.py", line 894, in recognize_google
response = urlopen(request, timeout=self.operation_timeout)
File "C:\Python39\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Python39\lib\urllib\request.py", line 523, in open
response = meth(req, response)
File "C:\Python39\lib\urllib\request.py", line 632, in http_response
response = self.parent.error(
File "C:\Python39\lib\urllib\request.py", line 561, in error
return self._call_chain(*args)
File "C:\Python39\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Python39\lib\urllib\request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Kevin\Desktop\pyvideototext\index.py", line 14, in <module>
result = r.recognize_google(audio_file)
File "C:\Python39\lib\site-packages\speech_recognition\__init__.py", line 896, in recognize_google
raise RequestError("recognition request failed: {}".format(e.reason))
speech_recognition.RequestError: recognition request failed: Bad Request

This is because the audio file is too long (longer than about 10 seconds), which causes the sample code to fail. To solve this, we will use pydub's AudioSegment to split the file into smaller pieces.

pydub provides a simple and efficient way to load, manipulate, and save audio files of various formats, such as MP3, WAV, and more. To install the library, simply run the command (if you added pydub to requirements.txt earlier, it is already installed):

pip install pydub
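
As a quick illustration of what pydub can do, here is a minimal sketch that loads an MP3, checks its length, and saves it back out as WAV (the file names are placeholders, and MP3 support assumes ffmpeg is installed on your system):

from pydub import AudioSegment

# Load an MP3, inspect its duration, and export it as WAV
audio = AudioSegment.from_file("input.mp3", format="mp3")
print(f"Duration: {len(audio) / 1000:.1f} seconds")  # len() is in milliseconds
audio.export("output.wav", format="wav")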

Then we can add the following code to our existing index.py:

from pydub import AudioSegment
audio_file = AudioSegment.from_file("large_audio_file.mp3", format="mp3")

Split the audio file into 10-second chunks so each piece is short enough to transcribe (this also keeps memory usage low):

chunk_size = 10 * 1000  # split into 10-second chunks (pydub slices in milliseconds)
chunks = []
for i in range(0, len(audio_file), chunk_size):
    chunks.append(audio_file[i:i+chunk_size])

Then modify the current Recognizer logic to read from multiple chunks.

From this:

# Create a Recognizer instance and transcribe each chunk
r = sr.Recognizer()
audio = sr.AudioFile("converted.wav")

with audio as source:
    audio_file = r.record(source)
    result = r.recognize_google(audio_file)

To this:

# Create a Recognizer instance and transcribe each chunk
r = sr.Recognizer()
result = ""
for chunk in chunks:
    # Export each chunk as WAV (pydub returns a file-like object) and transcribe it
    with sr.AudioFile(chunk.export(format="wav")) as audio:
        audio_data = r.record(audio)
        result += r.recognize_google(audio_data) + " "

The code above uses the AudioSegment class from the pydub library to split the large audio file into 10-second chunks. Then, it uses a for loop to iterate over each chunk and transcribe it with the Recognizer instance from speech_recognition. The transcribed text is appended to the result variable.

Note that the recognize_google method used in the code above requires an internet connection and uses Google's speech recognition API to transcribe the audio. If you don't want to use an internet connection or prefer to use a different speech recognition engine, you can replace recognize_google with another method provided by speech_recognition, such as recognize_sphinx.
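
Putting the pieces together, the whole index.py might look something like the sketch below. It is based on the snippets above, keeps the converted.wav and recognized.txt file names used earlier, and still needs an internet connection for recognize_google (the recognize_sphinx alternative in the comment requires the pocketsphinx package):

import speech_recognition as sr
from pydub import AudioSegment

# Load the audio file produced by moviepy (or any WAV you already have)
audio_file = AudioSegment.from_file("converted.wav", format="wav")

# Split into 10-second chunks (pydub slices in milliseconds)
chunk_size = 10 * 1000
chunks = [audio_file[i:i + chunk_size] for i in range(0, len(audio_file), chunk_size)]

# Transcribe each chunk and concatenate the results
r = sr.Recognizer()
result = ""
for chunk in chunks:
    with sr.AudioFile(chunk.export(format="wav")) as audio:
        audio_data = r.record(audio)
        result += r.recognize_google(audio_data) + " "
        # Offline alternative: result += r.recognize_sphinx(audio_data) + " "

# Export the result
with open("recognized.txt", mode="w") as file:
    file.write("Recognized Speech:\n")
    file.write(result)

print("ready!")

In practice you may also want to wrap the recognize_google call in a try/except, because it raises sr.UnknownValueError when a chunk contains no recognizable speech.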

Finally, just run the script and you're done. It's as easy as that! For a sample, check out the link provided below. (I made some slight adjustments to ensure clarity.)

Additionally, feel free to ask more questions related to the topic. This approach can be an effective way to quickly absorb fragmented knowledge, especially if you are trying to learn new things.
