AI Audio Conversations Using OpenAI Whisper

David Richards
4 min read · Mar 11, 2023


Goal: Create an AI Chat Bot that you can talk to

Tools: Javascript (React), Python (Backend API Alternate), NodeJS (Backend API)

Frontend Client

Let's start by setting up our frontend service to record audio and forward those recordings to our backend service. Once the backend service receives the audio data, it can be transcribed using OpenAI's Whisper API. The transcribed text then becomes our prompt, completing the loop from spoken audio to a question answered by the OpenAI Completions API.
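To make that loop concrete, here is a rough sketch of the two backend calls the frontend will make. The endpoint paths and response fields match the backend code later in this post; treat it as an outline rather than a fixed API:

// Sketch of the two backend calls the frontend makes (assumes the backend
// later in this post is running on http://localhost:3000).
import axios from "axios";

async function askByVoice(base64AudioDataUrl) {
  // 1. Send the base64-encoded WebM recording; the backend proxies it to
  //    Whisper and returns the transcription, e.g. { text: "what is 2 + 2" }
  const { data: transcription } = await axios.post(
    "http://localhost:3000/whisper",
    { audio: base64AudioDataUrl }
  );

  // 2. Send the transcribed text as the prompt; the backend returns the raw
  //    completion object, whose answer is in choices[0].text
  const { data: completion } = await axios.post(
    "http://localhost:3000/completions",
    { prompt: transcription.text }
  );

  return completion.choices[0].text;
}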

Here are a few resources I found to get started on browser audio recording:

Hark — Recognizes when someone starts and stops speaking by monitoring the audio level of a stream at regular intervals. It can be used to generate speaking/stopped triggers for your audio stream; once you know when speaking started and stopped, you can send snippets of audio to your backend service for transcription (a minimal sketch follows this list).
https://github.com/latentflip/hark

React Speech Recognition — This library records audio and sends it to the browser's native transcription service. Note that Chrome on desktop works best with this service. The library could be extended to use the Whisper API as the transcription backend.
https://www.npmjs.com/package/react-speech-recognition

MediaRecorder — This built-in browser API records audio or video in the browser to the WebM format. WebM is already supported by OpenAI, so there is no need to convert. https://developer.mozilla.org/en-US/docs/Web/API/MediaRecorder
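Before wiring hark into the React component below, here is a minimal standalone sketch of its event API; the interval and threshold values shown are illustrative options, not recommendations:

import hark from "hark";

async function watchForSpeech() {
  // Ask the browser for a microphone stream, then let hark watch its volume
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // interval (ms between volume checks) and threshold (dB cutoff) are optional;
  // the values here are illustrative, not tuned
  const speechEvents = hark(stream, { interval: 100, threshold: -65 });

  speechEvents.on("speaking", () => {
    console.log("speaking started");
  });

  speechEvents.on("stopped_speaking", () => {
    console.log("speaking stopped");
  });
}

watchForSpeech();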

import { useState } from "react";
import axios from "axios";
import hark from "hark";

let mediaRecorder;

export default () => {
  const [messages, setMessages] = useState([]);

  const sendRecording = async (audioData) => {
    // First convert the audio blob to a base64 data URL
    const reader = new FileReader();
    reader.readAsDataURL(audioData);
    return new Promise((resolve, reject) => {
      reader.onloadend = function () {
        // Send the base64 string to the backend service for transcription
        axios
          .post("http://localhost:3000/whisper", { audio: reader.result })
          .then((res) => {
            resolve(res.data);
          })
          .catch((err) => {
            reject(err);
          });
      };
    });
  };

  const sendPrompt = async (prompt) => {
    return axios.post("http://localhost:3000/completions", { prompt });
  };

  const record = async () => {
    if (navigator.mediaDevices && navigator.mediaDevices.getUserMedia) {
      console.log("Starting to record");
      // Get the audio stream
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: true,
        video: false,
      });
      // Create the media recorder with the stream from media devices
      // Starting position is paused (not recording)
      mediaRecorder = new MediaRecorder(stream);
      // Also pass the stream to hark to create speaking events
      const speech = hark(stream, {});
      // Start recording when hark recognizes someone is speaking into the mic
      speech.on("speaking", function () {
        console.log("Speaking!");
        mediaRecorder.start();
      });
      // When hark recognizes the speaking has stopped we can stop recording
      // The stop action triggers ondataavailable()
      speech.on("stopped_speaking", function () {
        console.log("Not Speaking");
        if (mediaRecorder.state === "recording") mediaRecorder.stop();
      });
      // Transcribe each recorded snippet, then send the text as a prompt
      mediaRecorder.ondataavailable = (e) => {
        sendRecording(e.data).then((newMessage) => {
          sendPrompt(newMessage.text).then((aiRes) => {
            setMessages((prev) => [
              ...prev,
              newMessage,
              { text: aiRes.data.choices[0].text },
            ]);
          });
        });
      };
    } else {
      console.log("recording not supported");
    }
  };

  const stopRecording = async () => {
    if (mediaRecorder) {
      if (mediaRecorder.state === "recording") mediaRecorder.stop();
      mediaRecorder.stream.getTracks().forEach((track) => track.stop());
    }
  };

  return (
    <div>
      <button onClick={record}>Record</button>
      <button onClick={stopRecording}>Stop</button>
      {messages.map((message, i) => (
        <p key={i}>{message.text}</p>
      ))}
    </div>
  );
};

Backend

Our backend service proxies requests from the browser so the OpenAI API key is never exposed to the client; both versions read the key from the OPENAI_API_KEY environment variable. We will show two backend options, so you only need to run one of them.

Why are we writing the base64 data to a file and reading it back? The Whisper API expects a file object, so to use the OpenAI library we need to provide one. Alternatively, we could skip the OpenAI library and call the API directly, as shown below.

function createFormDataFromBase64(base64String, fieldName, fileName) {
  // base64String is expected to be a data URL, e.g. "data:audio/webm;codecs=opus;base64,..."
  const byteString = atob(base64String.split(',')[1]);
  const mimeType = base64String.split(';')[0].split(':')[1];

  const arrayBuffer = new ArrayBuffer(byteString.length);
  const intArray = new Uint8Array(arrayBuffer);

  for (let i = 0; i < byteString.length; i += 1) {
    intArray[i] = byteString.charCodeAt(i);
  }

  const blob = new Blob([intArray], { type: mimeType });

  const formData = new FormData();
  formData.append(fieldName, blob, fileName);

  return formData;
}

const formData = createFormDataFromBase64(base64Str, 'file', 'audio.webm');
// The transcription endpoint also requires the model name as a form field
formData.append('model', 'whisper-1');

axios({
  method: 'post',
  url: 'https://api.openai.com/v1/audio/transcriptions',
  data: formData,
  headers: {
    'Content-Type': 'multipart/form-data',
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
  },
})
  .then(function (response) {
    // handle success
    console.log(response.data);
  })
  .catch(function (error) {
    // handle error
    console.log(error);
  });

NodeJS

const express = require("express");
const fs = require("fs");
const { Configuration, OpenAIApi } = require("openai");

const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);

const app = express();
const port = 3000;

// Parse JSON bodies; the base64 audio payload can be large
app.use(express.json({ limit: "50mb" }));

app.post("/whisper", (req, res) => {
  // Strip the data URL prefix and write the decoded audio to a temp file
  fs.writeFileSync(
    "/tmp/tmp.webm",
    Buffer.from(
      req.body.audio.replace("data:audio/webm;codecs=opus;base64,", ""),
      "base64"
    )
  );
  return openai
    .createTranscription(fs.createReadStream("/tmp/tmp.webm"), "whisper-1")
    .then((response) => {
      console.log("audio res", response.data.text);
      return res.json(response.data);
    })
    .catch((err) => {
      console.log(err);
      console.log(err.response.data.error);
      return res.status(500).json(err.response.data);
    });
});

app.post("/completions", async (req, res) => {
  const response = await openai.createCompletion({
    model: "text-davinci-003",
    prompt: req.body.prompt,
    max_tokens: 100,
    temperature: 0,
  });

  return res.json(response.data);
});

app.listen(port, () => {
  console.log(`Example app listening on port ${port}`);
});

Python (Flask)

import os
import base64

import flask
import openai
from flask import Flask, request

openai.api_key = os.environ.get('OPENAI_API_KEY')
app = Flask(__name__)


@app.route('/whisper', methods=['POST'])
def whisper_api():
    body = request.json
    # Strip the data URL prefix before decoding the base64 audio
    audio_b64 = body['audio'].replace('data:audio/webm;codecs=opus;base64,', '')
    decoded_data = base64.b64decode(audio_b64)
    with open('/tmp/audio.webm', 'wb') as f:
        f.write(decoded_data)
    with open('/tmp/audio.webm', 'rb') as audio_file:
        transcript = openai.Audio.transcribe('whisper-1', audio_file)
    return flask.jsonify(transcript)


@app.route('/completions', methods=['POST'])
def completions_api():
    body = request.json
    completion = openai.Completion.create(
        engine='text-davinci-003',
        prompt=body['prompt'],
        max_tokens=100,
        temperature=0,
    )
    return flask.jsonify(completion)


if __name__ == '__main__':
    # Run on port 3000 so the frontend's localhost:3000 URLs resolve
    app.run(port=3000)
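With either backend running on port 3000, you can sanity-check the completions endpoint without the browser in the loop. A small script like the following (the prompt text is just a placeholder) should print a short answer:

// test-completions.js — assumes one of the backends above is listening on port 3000
const axios = require("axios");

axios
  .post("http://localhost:3000/completions", {
    prompt: "Say hello in one sentence.",
  })
  .then((res) => {
    // Both backends return the raw OpenAI completion object
    console.log(res.data.choices[0].text);
  })
  .catch((err) => {
    console.error(err.response ? err.response.data : err.message);
  });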

Thank you for reading! Stay tuned for more.
