Transcribing Conference Audio Using the Deepgram WebSocket API

Chris Carrington
7 min read · Feb 28, 2023


A Missed Opportunity

Back when I was laid off nearly four years ago, I had an idea for a business: teleconferences with transcription. I worked for Nuance and wanted to use their software; however, they didn’t appear to have a publicly available API with an easy sign-up process. So… I abandoned the idea.

Boy, was that a missed opportunity. There’s now an entire industry of AI meeting note-takers; Otter.ai is perhaps the most notable. Otter.ai does a lot more than transcription, though: it also summarizes meetings and extracts keywords. But transcription is the basis for all of its features. In any case, it was an opportunity I did not pursue.

Recently, meeting transcription came to mind again as I applied for a job at Deepgram. Out of curiosity, I signed up for access to their API, looked through the docs, and experimented. All of this inspired me to see how Deepgram could be used to transcribe Jitsi meetings.

Researching Existing Transcription Support

Since Jitsi is open-source, I figured there must be some way to alter its code to use Deepgram for transcription. That led me to Jitsi’s existing support for transcription, which lives in Jigasi, its SIP gateway component.

In studying the codebase, I learned that Jigasi supported Google transcription as well as Vosk transcription.

I attempted to run an instance of Vosk-server, but it appeared to need more memory than I wanted to pay for. In any case, the Jigasi configuration that enables Vosk transcription is instructive:

# Vosk server
org.jitsi.jigasi.transcription.customService=org.jitsi.jigasi.transcription.VoskTranscriptionService
org.jitsi.jigasi.transcription.vosk.websocket_url={"en": "ws://localhost:2700", "fr": "ws://localhost:2710"}

This suggested that all that’s needed is to implement some interface and supply the implementation’s fully qualified class name in place of org.jitsi.jigasi.transcription.VoskTranscriptionService. Simple enough, eh?
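In other words, the new class just has to live on Jigasi’s classpath and implement the same interface as the Vosk service. A minimal sketch, assuming that interface is TranscriptionService (the class name here is hypothetical, and the stub is declared abstract so it compiles without all of the interface’s methods):

package org.jitsi.jigasi.transcription;

// Sketch only: implement the interface that VoskTranscriptionService
// implements, then name this class in the customService config property.
public abstract class MyTranscriptionService
    implements TranscriptionService
{
    // streaming-session setup, audio forwarding, and result parsing go here
}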

So I then looked over VoskTranscriptionService to see if I could use it as a guide; its code is very straightforward.

I used its structure to build a new class with the following key differences:

1. I needed to send the authentication token. To do this, I set a header on the client upgrade request:

DeepgramWebsocketStreamingSession(String debugName, String apiKey)
    throws Exception
{
    this.debugName = debugName;
    WebSocketClient ws = new WebSocketClient();
    ws.start();
    // Deepgram expects the API key in an Authorization header,
    // which must be set on the HTTP upgrade request
    ClientUpgradeRequest clientUpgradeRequest = new ClientUpgradeRequest();
    clientUpgradeRequest.setHeader("Authorization", "Token " + apiKey);
    ws.connect(this, new URI(websocketUrl), clientUpgradeRequest);
}

2. I needed to parse the message from the server differently (per Deepgram’s documentation, though more on that below).

3. I needed to stop sending the sample rate in a string message to the server.

4. I needed to send a different message when closing the WebSocket:

private static final String EOF_MESSAGE = "{ \"type\": \"CloseStream\" }";
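For context, here is a minimal sketch of how that message can be sent when the session ends, assuming a Jetty Session field named session as in the Vosk implementation (the sendEnd name is mine):

private void sendEnd()
{
    try
    {
        // Deepgram flushes any pending results once it sees CloseStream
        session.getRemote().sendString(EOF_MESSAGE);
    }
    catch (IOException e)
    {
        logger.error(debugName + " Error sending CloseStream", e);
    }
}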

Building a Custom Transcriber

I built DeepgramTranscriptionService with these four changes. While testing, I encountered the following problems:

1. The response coming back from Deepgram doesn’t match their documentation at all:
2023-02-08 00:48:09.906 INFO: [121] JvbConference$JvbCallChangeListener.callStateChanged#1439: [ctx=16758172889271508243393] JVB conference call IN_PROGRESS.
2023-02-08 00:48:12.526 INFO: [256] DeepgramTranscriptionService$DeepgramWebsocketStreamingSession.onMessage#161: extraservantssentencemainly/bb9e988c Received response: {"transaction_key":"deprecated","request_id":"e4d3e6f1-eaec-4842-a3f3-1f70ca351a60","sha256":"510a01b3a3933de364ab0caaa6406d5f0b19e61f41327683c9960a0082e062e4","created":"2023-02-08T00:48:07.358Z","duration":0.0,"channels":0}
2023-02-08 00:48:12.531 SEVERE: [256] DeepgramTranscriptionService$DeepgramWebsocketStreamingSession.onError#181: Error while streaming audio data to transcription service
org.json.JSONException: JSONObject["is_final"] not found.
at org.json.JSONObject.get(JSONObject.java:587)
at org.json.JSONObject.getBoolean(JSONObject.java:628)

Deepgram’s example response is:

{
  "metadata": {
    "transaction_key": "string",
    "request_id": "uuid",
    "sha256": "string",
    "created": "string",
    "duration": 0,
    "channels": 0,
    "models": [
      "string"
    ]
  },
  "results": {
    "channels": [
      {
        "search": [
          {
            "query": "string",
            "hits": [
              {
                "confidence": 0,
                "start": 0,
                "end": 0,
                "snippet": "string"
              }
            ]
          }
        ],
        "alternatives": [
          {
            "transcript": "string",
            "confidence": 0,
            "words": [
              {
                "word": "string",
                "start": 0,
                "end": 0,
                "confidence": 0
              }
            ]
          }
        ]
      }
    ]
  }
}

…which matches neither the response schema documentation nor what Deepgram actually sends back:

{
  "transaction_key": "deprecated",
  "request_id": "e4d3e6f1-eaec-4842-a3f3-1f70ca351a60",
  "sha256": "510a01b3a3933de364ab0caaa6406d5f0b19e61f41327683c9960a0082e062e4",
  "created": "2023-02-08T00:48:07.358Z",
  "duration": 0.0,
  "channels": 0
}

So… I adjusted the parsing of the message to match what Deepgram actually sends back.

2. When testing, I just got the following logging:

2023-02-09 00:36:06.612 INFO: [548] DeepgramTranscriptionService$DeepgramWebsocketStreamingSession.sendRequest#188: sendRequest bytes: 48000
2023-02-09 00:36:07.112 INFO: [548] DeepgramTranscriptionService$DeepgramWebsocketStreamingSession.sendRequest#188: sendRequest bytes: 48000
2023-02-09 00:36:07.612 INFO: [548] DeepgramTranscriptionService$DeepgramWebsocketStreamingSession.sendRequest#188: sendRequest bytes: 48000
2023-02-09 00:36:08.112 INFO: [548] DeepgramTranscriptionService$DeepgramWebsocketStreamingSession.sendRequest#188: sendRequest bytes: 48000
2023-02-09 00:36:08.250 INFO: [498] DeepgramTranscriptionService$DeepgramWebsocketStreamingSession.onMessage#161: embarrassingestimatesagestrictly/d8f7253d Received response: {"transaction_key":"deprecated","request_id":"3414ddab-9b3b-4484-9aec-e6d451a67b76","sha256":"a7bfb505d3323e099931d60f5ad91f2283aac59597d8d182eb438c244a1ca156","created":"2023-02-09T00:36:06.020Z","duration":0.0,"channels":0}

…with this as the response from Deepgram:

{
  "transaction_key": "deprecated",
  "request_id": "3414ddab-9b3b-4484-9aec-e6d451a67b76",
  "sha256": "a7bfb505d3323e099931d60f5ad91f2283aac59597d8d182eb438c244a1ca156",
  "created": "2023-02-09T00:36:06.020Z",
  "duration": 0.0,
  "channels": 0
}

We send a few seconds of audio, get back a single response… and then nothing else happens. After asking the Jitsi community for assistance, I had the idea of kicking up the logging level. Searching for “Deepgram”, I found the following log entries:

2023-02-09 02:02:32.090 FINE: [84] org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.closeConnection: closeConnection() {1011=SERVER_ERROR,NET-0001} WSCoreSession@1595f2f0{CLIENT,WebSocketSessionState@619f783a{CLOSED,i=NO-OP,o=NO-OP,c={1011=SERVER_ERROR,NET-0001}},[wss://api.deepgram.com:443/v1/listen?language=en&interim_results=true,null,true.[]],af=true,i/o=4096/4096,fs=65536}->JettyWebSocketFrameHandler@12f04174[org.jitsi.jigasi.transcription.DeepgramTranscriptionService$DeepgramWebsocketStreamingSession]
2023-02-09 02:02:32.090 FINE: [84] org.eclipse.jetty.websocket.core.internal.WebSocketCoreSession.abort: abort(): WSCoreSession@1595f2f0{CLIENT,WebSocketSessionState@619f783a{CLOSED,i=NO-OP,o=NO-OP,c={1011=SERVER_ERROR,NET-0001}},[wss://api.deepgram.com:443/v1/listen?language=en&interim_results=true,null,true.[]],af=true,i/o=4096/4096,fs=65536}->JettyWebSocketFrameHandler@12f04174[org.jitsi.jigasi.transcription.DeepgramTranscriptionService$DeepgramWebsocketStreamingSession]

This told me that Deepgram was closing the connection with the close frame {1011=SERVER_ERROR,NET-0001}. Looking over the documentation, I saw that NET-0001 indicates Deepgram received no audio data before the connection timed out.

This did not make any sense. Logging clearly showed that Jigasi was sending 48,000 bytes four times. I then took a wild guess and theorized that Jigasi sends raw, headerless packets, which meant I needed to include the encoding and sample rate in the URL, as shown below.
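For reference, here is roughly what the final URL looks like with those parameters added. The encoding and sample_rate values are my inference from the logs above: 48,000 bytes every half second is consistent with 48 kHz, 16-bit, mono linear PCM.

wss://api.deepgram.com/v1/listen?language=en&interim_results=true&encoding=linear16&sample_rate=48000

Once I did that, I encountered another problem: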

3. While I did start getting transcription, its format was not like what I received before including the encoding and sample rate:

{
  "channel_index": [
    0,
    1
  ],
  "duration": 4.3,
  "start": 0.0,
  "is_final": true,
  "speech_final": true,
  "channel": {
    "alternatives": [
      {
        "transcript": "this is a test of transcription",
        "confidence": 0.9892578,
        "words": [
          {
            "word": "this",
            "start": 2.1648612,
            "end": 2.3634722,
            "confidence": 0.99609375
          },
          {
            "word": "is",
            "start": 2.3634722,
            "end": 2.5620835,
            "confidence": 0.9892578
          },
          {
            "word": "a",
            "start": 2.5620835,
            "end": 2.8004167,
            "confidence": 0.8413086
          },
          {
            "word": "test",
            "start": 2.8004167,
            "end": 3.0784724,
            "confidence": 0.94140625
          },
          {
            "word": "of",
            "start": 3.0784724,
            "end": 3.5784724,
            "confidence": 0.46875
          },
          {
            "word": "transcription",
            "start": 3.6743057,
            "end": 4.071528,
            "confidence": 0.99902344
          }
        ]
      }
    ]
  },
  "metadata": {
    "request_id": "40d29e06-66bc-4553-8a9c-1629dea52770",
    "model_info": {
      "name": "general",
      "version": "2022-01-18.1",
      "tier": "base"
    },
    "model_uuid": "c12089d0-0766-4ca0-9511-98fd2e443ebd"
  }
}

It’s closer to the documented schema than to the example in the documentation, but it’s still not the same. For example, the documented schema indicates “transcript” is a member of the channel object, when it is actually a member of the ResultAlternative objects in the “alternatives” array.

After noting the differences between documentation and actual behavior, I coded the parsing of each message as follows:

@OnWebSocketMessage
public void onMessage(String msg)
{
    if (logger.isDebugEnabled())
    {
        logger.debug(debugName + " Received response: " + msg);
    }
    JSONObject obj = new JSONObject(msg);
    // Metadata-only messages carry neither "is_final" nor "channel",
    // so missing fields are treated as an empty, non-final result
    boolean isFinal = obj.has("is_final") && obj.getBoolean("is_final");
    String result = obj.has("channel") && obj.getJSONObject("channel").has("alternatives") ?
        obj.getJSONObject("channel").getJSONArray("alternatives").getJSONObject(0)
            .getString("transcript") : "";
    if (logger.isDebugEnabled())
    {
        logger.debug(debugName + " parsed result " + result);
    }
    // Notify listeners only when there is new text or a final result
    if (!result.isEmpty() && (isFinal || !result.equals(lastResult)))
    {
        lastResult = result;
        for (TranscriptionListener l : listeners)
        {
            l.notify(new TranscriptionResult(
                null,
                uuid,
                !isFinal,
                transcriptionTag,
                0.0,
                new TranscriptionAlternative(result)));
        }
    }

    // A fresh UUID marks the start of the next utterance
    if (isFinal)
    {
        this.uuid = UUID.randomUUID();
    }
}
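One design note on the method above: the lastResult comparison keeps listeners from being re-notified when Deepgram, with interim_results=true in the URL, sends successive interim messages whose transcript hasn’t changed, and the UUID is regenerated after each final result so the next utterance gets its own identity.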

The rest of this class is available in a friendly fork of Jitsi’s Jigasi: https://github.com/megafarad/jigasi.

Deploying to a Test Server

I have deployed Jitsi Meet and my friendly fork of Jigasi. To deploy, I:

1. Installed Jitsi Meet on an Ubuntu instance per documentation.

2. Checked out the code

git clone https://github.com/megafarad/jigasi.git

3. Built the Debian package

cd jigasi/script
./build_deb_package.sh

4. Installed the package (your .deb file may have a different name)

cd ~
sudo apt install ./jigasi_1.2-12-g47458e7-1_all.deb

5. Made some adjustments to the configuration in /etc/jitsi/jigasi/sip-communicator.properties

org.jitsi.jigasi.ENABLE_TRANSCRIPTION=true
org.jitsi.jigasi.ENABLE_SIP=true

...

# delivering final transcript
org.jitsi.jigasi.transcription.DIRECTORY=/var/lib/jigasi/transcripts
org.jitsi.jigasi.transcription.BASE_URL=https://omoiomoi.org/transcripts
org.jitsi.jigasi.transcription.jetty.port=9000
org.jitsi.jigasi.transcription.ADVERTISE_URL=true

# save formats
org.jitsi.jigasi.transcription.SAVE_JSON=false
org.jitsi.jigasi.transcription.SAVE_TXT=true

# send formats
org.jitsi.jigasi.transcription.SEND_JSON=true
org.jitsi.jigasi.transcription.SEND_TXT=true

...

# Deepgram server
org.jitsi.jigasi.transcription.customService=org.jitsi.jigasi.transcription.DeepgramTranscriptionService
org.jitsi.jigasi.transcription.deepgram.websocket_url=wss://api.deepgram.com/v1/listen
org.jitsi.jigasi.transcription.deepgram.api_key=<<ENTER_YOUR_API_KEY_HERE>>

6. Added the following location block to my site’s Nginx config, so that it acts as a reverse proxy to 127.0.0.1:9000 for serving transcripts:

# Transcripts
location ~ ^/transcripts/(.*) {
    proxy_pass http://127.0.0.1:9000/$1;
}
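One step worth calling out explicitly: Jigasi has to be restarted to pick up the configuration changes. With the Debian package, that should be its systemd unit (an assumption on my part; adjust for your setup):

sudo systemctl restart jigasi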

Try it for yourself

1. Go to omoiomoi.org

2. Start a meeting

3. At the bottom, click “more actions,” then “invite people”

4. Next to “invite your contacts,” type “jitsi_meet_transcribe”

5. Click “Telephone,” then click “Invite.”

The transcriber will then join the room and begin transcribing.

Final Thoughts

While functional, this solution can hardly compete with Otter.ai as it stands. But with some UI improvements and a few additional features (like emailing a transcript of the meeting to participants), it could provide conference transcription at a fraction of Otter.ai’s cost.

Aside from its usefulness, this was a good exercise in Jetty WebSocket programming, even if I did run into some inconsistencies between Deepgram’s documentation and its actual behavior.

As for that missed opportunity: who knows? Perhaps I can propose some improvements to the Jitsi community, and make its transcription feature much better.


Chris Carrington

A software engineer who’s getting back into the workforce after a long career break