Seamless Multilingual Audio Transcription with OpenAI Whisper and Node.js

Mehdi Sadeghian
3 min read · Jun 24, 2024


Introduction

In today’s fast-paced digital world, the ability to accurately and efficiently transcribe audio across multiple languages is invaluable. Whether for customer service, content creation, or accessibility, having a robust transcription system can save time and enhance communication. In this post, I will walk you through creating a Node.js application that leverages OpenAI’s Whisper model for multilingual audio transcription.

Setting Up the Project

To get started, you will need Node.js installed on your machine. We will use Express.js to handle the server-side operations and Formidable for parsing incoming form data, including file uploads.

Step 1: Initialize Your Node.js Project

First, create a new directory for your project and initialize it with npm init:

mkdir openai-speech-to-text
cd openai-speech-to-text
npm init -y

Next, install the necessary dependencies (fs and path are built into Node.js, so they do not need to be installed separately):

npm install express formidable openai fluent-ffmpeg

Note that fluent-ffmpeg is only a wrapper: it requires the ffmpeg binary to be installed on your system and available on your PATH.

Step 2: Create the Server with Express.js

Next, we set up an Express.js server to handle file uploads and initiate the transcription process. Incoming audio is converted to MP3 with ffmpeg so the file is in a format the Whisper API accepts, regardless of what the client uploads. Create a file named app.js and add the following code:

const express = require('express');
const formidable = require('formidable');
const fs = require('fs');
const path = require('path');
const ffmpeg = require('fluent-ffmpeg');
const { transcribeAudioWithWhisper } = require('./OpenaiSpeechToText');

const app = express();
const port = 3000;

app.post('/upload_voice', (req, res) => {
  const form = new formidable.IncomingForm();
  form.parse(req, async function (err, fields, files) {
    if (err) {
      console.error('Form parsing error:', err);
      return res.status(500).json({ message: "Form parsing error", success: false });
    }
    if (!files.file) {
      // A missing file is a client error, so respond with 400 rather than 500.
      return res.status(400).json({ message: "No file uploaded", success: false });
    }
    // Recent versions of Formidable return fields and files as arrays.
    const languageCode = Array.isArray(fields.lang_cd) ? fields.lang_cd[0] : fields.lang_cd;
    const uploadedFile = files.file[0];

    // Make sure the uploads directory exists before writing to it.
    const uploadsDir = path.join(__dirname, 'uploads');
    if (!fs.existsSync(uploadsDir)) {
      fs.mkdirSync(uploadsDir, { recursive: true });
    }

    const fileName = uploadedFile.originalFilename;
    const mp3FileName = fileName.replace(/\.[^/.]+$/, "") + ".mp3";
    const mp3FilePath = path.join(uploadsDir, mp3FileName);

    // Convert the uploaded audio to MP3, then transcribe it.
    ffmpeg(uploadedFile.filepath)
      .toFormat('mp3')
      .on('error', function (err) {
        console.error('An error occurred: ' + err.message);
        return res.status(500).json({ message: "Audio conversion error", success: false });
      })
      .on('end', async function () {
        try {
          const transcription = await transcribeAudioWithWhisper(mp3FilePath, languageCode);
          // Clean up the temporary MP3 file once transcription is done.
          fs.unlink(mp3FilePath, function (err) {
            if (err) {
              console.error('Error deleting the mp3 file:', err);
            }
          });
          return res.status(200).json({
            message: transcription ? transcription.text : "",
            data: transcription ? {
              language: transcription.language,
              duration: transcription.duration
            } : null,
            success: true
          });
        } catch (err) {
          console.error('Transcription error:', err);
          return res.status(500).json({ message: "Transcription error", data: null, success: false });
        }
      })
      .save(mp3FilePath);
  });
});

app.listen(port, () => {
  console.log(`Server running at http://localhost:${port}`);
});

Step 3: Create the OpenAI Speech-to-Text Module

In this step, we create a module to handle audio transcription using OpenAI’s Whisper model. We request the verbose_json response format so the API returns the detected language and audio duration alongside the transcribed text. Create a file named OpenaiSpeechToText.js and add the following code:

const fs = require('fs');
const OpenAI = require("openai");

// Read the API key from the environment instead of hardcoding it in source.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function transcribeAudioWithWhisper(filePath, languageCode) {
  // verbose_json includes the detected language and duration along with the text.
  // If the request fails, the SDK throws, and the caller's try/catch handles it.
  const transcriptResponse = await openai.audio.transcriptions.create({
    model: "whisper-1",
    file: fs.createReadStream(filePath),
    response_format: "verbose_json",
    temperature: 0.1,
    language: languageCode
  });
  return transcriptResponse;
}

module.exports = {
  transcribeAudioWithWhisper,
};
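If you want to try the module on its own before wiring it into the server, here is a minimal sketch, assuming an audio.mp3 file in the project directory and OPENAI_API_KEY set in your environment:

const { transcribeAudioWithWhisper } = require('./OpenaiSpeechToText');

transcribeAudioWithWhisper('./audio.mp3', 'en')
  .then((result) => console.log(result.text, result.language, result.duration))
  .catch(console.error);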

Step 4: Run the Server

To start the server, set your OpenAI API key in the environment and run the following command in your terminal:

OPENAI_API_KEY=your-api-key node app.js

Your server should now be running at http://localhost:3000. You can test the transcription by sending a POST request to /upload_voice with an audio file and a language code, as in the sketch below.
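As a quick way to exercise the endpoint, here is a minimal test script. It assumes Node.js 18+ (for the global fetch, FormData, and Blob) and a local audio file named sample.wav, which is just a placeholder; any format ffmpeg can read should work:

// test-upload.js — minimal client for the /upload_voice endpoint
const fs = require('fs');

async function main() {
  const form = new FormData();
  // "sample.wav" is a placeholder; substitute any audio file on disk.
  form.append('file', new Blob([fs.readFileSync('sample.wav')]), 'sample.wav');
  form.append('lang_cd', 'en'); // language code passed through to Whisper

  const res = await fetch('http://localhost:3000/upload_voice', {
    method: 'POST',
    body: form,
  });

  // On success the server responds with:
  // { message: "<transcribed text>", data: { language, duration }, success: true }
  console.log(await res.json());
}

main().catch(console.error);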

Conclusion

In this post, we created a Node.js application that transcribes multilingual audio using OpenAI’s Whisper model. By leveraging Express.js and Formidable, we handled file uploads and seamlessly converted audio files to MP3 format for transcription. This setup can be further enhanced and integrated into larger systems to provide robust and accurate transcription services.

Feel free to customize and expand this project to suit your specific needs. Happy coding!
