Talking ChatBot: From chatting to talking to any source of data

Juan Abascal
Published in LatinXinAI · 7 min read · Nov 16, 2023

Converse with any source of data while doing any activity (walking on the treadmill, cooking, gardening, …): a tutorial on leveraging OpenAI’s GPT and Whisper models.

Image generated from the prompt “A tourist talking to a humanoid ChatBot not in Paris” (the model fails to embed text in images)

In a previous tutorial, Chat to any source of data, we showed how to exploit LangChain and OpenAI to chat with any data (pdf, url, youtube link, xlsx, tex, …). In this tutorial, we level up and show how to convert text to speech and vice versa while interacting with both GPT and Whisper models. Make your API talk, just like the now-talking ChatGPT!

Let’s start!

Requirements

First, we install the required dependencies in the environment of choice.

# Create and activate a virtual environment, then install the required libraries
# (run these commands in a terminal; "source" does not persist across notebook cells)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

where the requirements.txt file is given below

openai==1.2.4
langchain==0.0.335
chromadb==0.3.26
pydantic==1.10.8
langchain[docarray]
gTTS==2.4.0
pvrecorder==1.2.1
playsound==1.3.0
bs4==0.0.1
tiktoken==0.5.1
pypdf

OpenAI API Key

You will need an OpenAI API key, which we will read from a JSON file (check previous tutorial if needed).
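For reference, here is a minimal sketch of loading the key, assuming it is stored in a JSON file such as key.json with an api_key field (both names are illustrative); the OpenAI library also picks the key up from the OPENAI_API_KEY environment variable.

# Load the OpenAI API key from a JSON file (hypothetical file and field names)
import json
import os

with open("key.json") as f:
    os.environ["OPENAI_API_KEY"] = json.load(f)["api_key"]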

Now that we are all set, let’s kick off with the fun stuff.

Transcribe text to audio

We will explore the functionality required to convert text to speech and vice versa. The easiest part is converting text to natural-sounding speech. For this, we compare two different methods:

  • gTTS (Google Text-to-Speech), which leverages Google Translate’s speech functionality, providing text-to-speech conversion for unlimited lengths of text while keeping proper intonation and handling abbreviations. It supports several languages and accents. To see the available languages, run gtts-cli --all. For instance, we can use the following commands:
gTTS('hello', lang='en', tld='co.uk')
gTTS('bonjour', lang='fr')
  • OpenAI’s text-to-speech model tts-1. It offers different voice options (`alloy`, `echo`, `fable`, …) and supports a wide range of languages (the same as the Whisper model). It also supports real-time audio streaming using chunked transfer encoding.

First, we convert text to speech and write it to a file with `gTTS`. To play the audio, we use the Python library playsound.

from gtts import gTTS
from playsound import playsound

def play_text(text, language='en', accent='co.uk', file_audio="../tmp/audio.wav"):
    """
    play_text: Play text with the gTTS and playsound libraries. It writes the audio file
    first and then plays it.
    """
    gtts = gTTS(text, lang=language, tld=accent)
    gtts.save(file_audio)
    playsound(file_audio)

text = "Hello, how are you Today? It's a beautiful day, isn't it? Have a nice day!"
play_text(text, file_audio="../tmp/hello.wav")

We compare it to OpenAI’s TTS model

# Text to speech with OpenAI
import openai
from playsound import playsound

def play_text_oai(text, file_audio="../tmp/audio.mp3", model="tts-1", voice="alloy"):
    """
    play_text_oai: Play text with the OpenAI and playsound libraries. It writes the audio
    file first and then plays it.
    """
    response = openai.audio.speech.create(
        model=model,
        voice=voice,
        input=text
    )
    response.stream_to_file(file_audio)
    playsound(file_audio)

play_text_oai(text, file_audio="../tmp/audio.mp3")

OpenAI’s model sounds more natural and even handles the spelling mistake in the sample text! However, if you provide a foreign address, it will be pronounced incorrectly.

Transcribe audio to text

This is the hard part. For this, we use openai.audio.transcriptions, which provides speech-to-text transcription for many languages and translation to English, based on OpenAI’s Whisper model. It supports several audio formats (mp3, mp4, wav and others), with a 25 MB limit, and several response formats (json by default).

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual (98 languages) and multitask supervised data collected from the web. Because it was trained on such a large and diverse dataset, it may be less accurate than models trained on specific datasets, but it should be more robust to new data. It also performs well on speech translation. It is based on an encoder-decoder transformer architecture: audio is split into 30-second chunks, converted to a log-Mel spectrogram, and the model is trained to predict the next token on several tasks (language identification, transcription, and to-English speech translation).
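As a quick illustration of the translation capability mentioned above, here is a minimal sketch that sends a recording in another language to Whisper’s translation endpoint and gets English text back (the file path is just an example).

# Translate a non-English recording directly into English text
# (the file path is illustrative)
import openai

with open("../tmp/audio_french.wav", "rb") as audio_file:
    translation = openai.audio.translations.create(model="whisper-1",
                                                   file=audio_file,
                                                   response_format="text")
print(translation)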

Record audio

First of all, we need to record audio and write it to a file. For recording audio, we use PvRecorder, an easy-to-use, cross-platform audio recorder designed for real-time speech audio processing. For writing the audio to a file, we use wave, which allows us to easily read and write WAV files. Other options are soundfile and pydub.

import wave
import struct

from pvrecorder import PvRecorder

# List the available audio input devices (pick one by index, or use -1 for the default)
devices = PvRecorder.get_available_devices()
print(devices)


def write_audio_to_file(audio,
                        audio_frequency=16000,
                        file_audio="tmp.wav"):
    """
    write_audio_to_file: Write audio samples to a WAV file with the wave library.
    """
    with wave.open(file_audio, 'w') as f:
        f.setparams((1, 2, audio_frequency, len(audio), "NONE", "NONE"))
        f.writeframes(struct.pack("h" * len(audio), *audio))


def record_audio(device_index=-1,
                 frame_length=512,      # audio samples read per frame
                 num_frames=600,        # 600 frames of 512 samples at 16 kHz is ~20 seconds
                 audio_frequency=16000,
                 file_audio="tmp.wav"):
    """
    record_audio: Record audio with the pvrecorder library and save it to a WAV file.
    """
    # Initialize the recorder
    recorder = PvRecorder(frame_length=frame_length, device_index=device_index)

    print("\nRecording...")
    try:
        audio = []
        recorder.start()
        for fr_id in range(num_frames):
            frame = recorder.read()
            audio.extend(frame)
        write_audio_to_file(audio, audio_frequency=audio_frequency, file_audio=file_audio)
        recorder.stop()
    except KeyboardInterrupt:
        recorder.stop()
        write_audio_to_file(audio, audio_frequency=audio_frequency, file_audio=file_audio)
    finally:
        recorder.delete()
        print("Recording finished.")

Now, we test the audio recording. Run the following code to record and play back the audio. With frame_length=512 at 16 kHz, num_frames=150 corresponds to roughly 5 seconds of audio (150 × 512 / 16000 ≈ 4.8 s).

# Record audio sample and play it
play_text_oai("Please, say something (you have 5 seconds)", file_audio="../tmp/tmp.wav")
record_audio(file_audio="../tmp/audio.wav", num_frames=150, device_index=-1)
play_text_oai("You said", file_audio="../tmp/tmp.wav")
playsound("../tmp/audio.wav")

Transcribe audio file to text

Now, we are ready to transcribe audio to text using openai.audio.transcriptions. For this, we set the model name to `whisper-1` and specify the user’s language. Specifying the language is key to getting good results; otherwise, the model may get confused by accents.

To test it, we record some audio, play it back and print the transcribed text.

import os

# LLM name
llm_audio_name = "whisper-1"

# Language of user speech (for better accuracy; otherwise accents lead to errors)
language_user = "en"

# Record an audio sample and play it back
play_text_oai("Please, say something (you have 10 seconds)", file_audio="../tmp/tmp.wav")
record_audio(file_audio="../tmp/audio.wav", num_frames=300, device_index=-1)
play_text_oai("Now, we print what you said:", file_audio="../tmp/tmp.wav")

# Read the audio file and transcribe it
# Legacy (openai < 1.0) API: openai.Audio.transcribe(llm_audio_name, audio_file, language=language_user)
with open(os.path.join("../tmp", "audio.wav"), "rb") as audio_file:
    text = openai.audio.transcriptions.create(model=llm_audio_name,
                                              file=audio_file,
                                              response_format="text",
                                              language=language_user)
print(f"\nQuestion: {text}")

It is not in real time, though; that would require chunked streaming and a lot of optimization.

Recap on LangChain and OpenAI

Refer to the previous tutorial, Chat to any source of data, for a short introduction to LangChain: loading data, splitting it into chunks, computing embeddings, creating vector database stores, and building high-level chains to easily interact with an LLM.
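For completeness, here is a minimal sketch of how the qa_chain used below could be set up for a URL document, following the previous tutorial; the URL, chunk sizes and persist path are placeholders.

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# Load the document and split it into overlapping chunks
docs = WebBaseLoader("https://example.com/article").load()
splits = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150).split_documents(docs)

# Embed the chunks and store them in a Chroma vector database
vectordb = Chroma.from_documents(splits, OpenAIEmbeddings(), persist_directory="./docs/chroma")

# Conversational retrieval chain with memory, so follow-up questions keep the context
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa_chain = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(model_name="gpt-4", temperature=0),
    retriever=vectordb.as_retriever(),
    memory=memory,
)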

Build a talking ChatBot

Finally, we have reached the point where we can hold a conversation with our data. We start by defining some parameters.

# Parameters

# LLM name
llm_name = "gpt-4"
llm_audio_name = "whisper-1"

# Document
example_type = "url" # Document type: "pdf" or "url" or "youtube"
chunk_size = 1500 # Parameters for splitting documents into chunks
chunk_overlap = 150
mode_input = "file" # "file" to load and process the document, "db" if the vector database is already saved to disk

# Mode of interaction
question_mode = "audio" # "text" or "audio"
language_user = "en" # Language of user speech (For better accuracy; otherwise accents lead to errors)
language_answer = "en" # Desired language for reply speech (gTTS)

# Parameters for recording audio
audio_frequency = 16000
frame_length = 512 # audio samples at each read
num_frames = 300 # ~10 seconds (600 frames would be ~20 seconds)

Then, we define the text prompts used to interact with the user.

path_tmp = "../tmp"                         # Path to save audio
name_tmp_audio = "audio.mp3"
file_audio_intro = "../tmp/talk_intro.mp3"  # Temporary audio files
file_audio_question = "../tmp/question.mp3"
file_audio_answer = "../tmp/answer.mp3"

persist_path = "./docs/chroma"  # Persist path to save the vector database

if not os.path.exists(path_tmp):
    os.makedirs(path_tmp)
file_tmp_audio = os.path.join(path_tmp, name_tmp_audio)

# Text prompts played to the user
text_intro = f"""
You are chatting with {llm_name}, with transcriptions by {llm_audio_name},
about the provided {example_type} link.
You can ask questions or chat about the document provided, in any language.
You have 10 to 20 seconds to make your questions.
Answers will be played back to you and printed out in the language selected.
To end the chat, say 'End chat' when providing a question.
"""

text_question = "Ask your question"

Finally, we are ready to chat with our data.

# Start the interaction
play_text_oai(text_intro, file_audio=file_audio_intro)
qa_on = True  # Keep asking the user for questions
while qa_on:
    # Prompt the user to ask a question
    print(text_question)
    # play_text(text_question, language=language_answer, file_audio=file_audio_question)
    play_text_oai(text_question, file_audio=file_audio_question)

    # Record audio
    record_audio(device_index=-1,
                 frame_length=frame_length,   # audio samples read per frame
                 num_frames=num_frames,       # 300 frames is about 10 seconds
                 audio_frequency=audio_frequency,
                 file_audio=file_tmp_audio)

    # Transcribe the audio
    with open(file_tmp_audio, "rb") as audio_file:
        question = openai.audio.transcriptions.create(model=llm_audio_name,
                                                      file=audio_file,
                                                      response_format="text",
                                                      language=language_user)
    print(f"\nQuestion: {question}")

    # Stop when the user says "End chat" (Whisper may add punctuation, so match loosely)
    if "end chat" in question.lower():
        break

    # -------------------------
    # Run the QA chain
    result = qa_chain({"question": question})
    print(f"Answer: {result['answer']}")

    # Text to speech
    if question_mode == "audio":
        # play_text(result['answer'], language=language_answer, file_audio=file_audio_answer)
        play_text_oai(result['answer'], file_audio=file_audio_answer)
    # -------------------------

We have reached the end of this tutorial! The full code is available at the link below!

If you like the tutorial, give it a thumbs up, share it and subscribe for more!

References

This notebook was inspired by the DeepLearning.AI course “LangChain: Chat with Your Data” and uses the following resources:

Code

A notebook for this tutorial: talk_to_your_data_medium.ipynb

