Building a transcript editor for the web

Allison King
6 min read · Apr 13, 2022

Cortico’s mission of bringing underheard voices to the center of a stronger dialogue means we have a lot of transcribed words. Though we run two passes of transcription (one automated via AssemblyAI and one human via Rev), with over 14 million words in our corpus, there is always a good chunk of words that will be mis-transcribed. Sometimes this is because of ambient noise, a misspelled name, or location-specific words that only a local would know (e.g., the name of an elementary school).

Now that we’ve worked with over 65 partner organizations to convene and make sense of nearly 2,000 hours of audio data, it has become clear we need a way to give our partners the ability to make edits to their transcripts. So we’ve spent the past few months on the technically challenging problem of building a ✨ transcript editor ✨.

What we started with

A simplified version of Cortico’s data pipeline looks like:

  1. Audio is uploaded to our servers
  2. We send audio out for automated transcription through AssemblyAI and save to S3
  3. We send audio out for human transcription through Rev and save to S3
  4. We force align the words and timings between the two transcripts. This allows us to highlight words as they are playing in a piece of audio and gives us more word accuracy.
A play button is clicked, then words are highlighted as they are spoken in audio
Demo of our embed emphasizing words as audio plays

  5. We save the transcripts as “snippets” in our database
  6. We display the transcript for members of the Local Voices Network (LVN) to read, listen to, and make highlights which can then be shared.

A screenshot of a conversation, featuring its transcript along with audio controls.
Sample conversation transcript from the Local Voices Network

We knew that building out a proper transcript editor would require a big backend refactor, so two years ago we decided we would handle transcript corrections manually until we hit enough scale where it would no longer be feasible. During the manual phase, to correct a transcript, our flow looked like:

  • Users submit a request for a transcript edit
A form on a website showing highlighted text and a field to submit what the text should actually say
Original workaround for submitting transcript corrections
  • The Cortico team receives a notification that a transcript request has been submitted
  • Our amazing Partner Support Lead Kelly painstakingly addresses every request by fixing the transcript through the Rev editor
  • Kelly clicks a special secret button which tells our pipeline to fetch the latest version of the transcript from Rev to store in S3
  • Changes eventually propagate to our site.

This flow was slow to update and put a lot of manual labor on Kelly. Now that we have enough conversations flowing through the product, we can justify building our own transcript editor, and we’re excited to make everything easier for both Kelly and our partners!

The problem

Building a transcript editor required a large amount of frontend, backend, and pipeline work for the team. We’ll write about the different parts in more detail in separate posts, but this post will serve as an overview of the problem in general. Here are just some of the problems we had to face:

  • Our speech pipeline operated on a periodic interval. This needed to be refactored to be totally event driven so that users could submit a correction and see their results within a few seconds, instead of waiting on, say, a five-minute interval.
  • Our force alignment algorithm was slow, sometimes taking 20 minutes on our longer conversations. This was fine for a pipeline that chugged through conversations in the background, but not fine for a transcript edit that a user expects feedback on right away. We were able to get this down to below one second, a journey which deserves its own blog post.
  • We needed to build a whole new interface on the frontend for text editing. Luckily there were other open source libraries which helped pave the way for us (more info on these later in this post).
  • We needed to handle the event of multiple users trying to edit the same transcript at the same time. Collaborative editing was beyond the scope of this first iteration, and so we approached the problem a different way. We created a transcript_edits table which has a unique constraint on version number and conversation id. When the frontend submits a request, it attempts to increment the version number. If that version number already exists in the database, a constraint error is thrown, and the user is asked to refresh the page. This way the transcript that the user sees is always up to date.
  • We needed to handle whole paragraph insertions and deletions while preserving order. Since we stored all paragraphs in our snippets table, doing an insert or delete could potentially mess with our IDs, and so we needed a different way of ordering them, via a column index_in_conversation .
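
To make the version check described above concrete, here is a minimal sketch using Python’s built-in sqlite3 module. It is not our production code: the column names, the `submit_edit` helper, and the `StaleTranscriptError` exception are all hypothetical, and only the table name transcript_edits and the unique constraint on conversation id plus version come from the description above.

```python
import sqlite3


class StaleTranscriptError(Exception):
    """Raised when another edit already claimed the next version number."""


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical, pared-down table: one row per saved edit.
    conn.execute(
        """
        CREATE TABLE transcript_edits (
            conversation_id INTEGER NOT NULL,
            version INTEGER NOT NULL,
            editor TEXT NOT NULL,
            UNIQUE (conversation_id, version)
        )
        """
    )


def submit_edit(
    conn: sqlite3.Connection, conversation_id: int, seen_version: int, editor: str
) -> int:
    """Record an edit made against `seen_version` and return the new version.

    If someone else already saved `seen_version + 1`, the unique constraint
    rejects the insert and we ask the caller to refresh.
    """
    new_version = seen_version + 1
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO transcript_edits (conversation_id, version, editor) "
                "VALUES (?, ?, ?)",
                (conversation_id, new_version, editor),
            )
    except sqlite3.IntegrityError as exc:
        raise StaleTranscriptError(
            "Transcript changed since it was loaded; please refresh the page."
        ) from exc
    return new_version
```

The appeal of this design is that the database itself arbitrates the race: two editors who both loaded version 3 will both try to write version 4, and only the first insert can succeed.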

Researching existing frontend tools

Very fortunately for us, the transcript editor problem is not an entirely new one. We learned a lot from going through the source code of BBC’s React Transcript Editor and at first thought we would implement ours in a similar way. Their library uses Draft.js, a framework for text editing created by Facebook. However, one of their pinned issues was about poor performance in audio over an hour long. This was a big deal for us as almost all of our conversations are over an hour long.

Luckily the fine engineers over at the BBC also tried out building a transcript editor with similar features using Slate, a different, less opinionated text editing library. Slate is still in beta, but its performance proved a lot better both in the BBC’s example and in our own Storybook tests. Slate was also much easier to pick up since it did not require you to learn all of Draft.js’ models, and you can use whatever schema you want.

We are so grateful to these open source libraries which helped us out. One feature we didn’t find in these libraries was find/replace, so to give back, we published our own method of doing this. The code can be found here, and a demo here.

The force alignment problem

The BBC wrote up a good summary of this problem in their repo notes, but it basically comes down to:

  • A transcript starts off with word timings on each word. If I start a conversation with “Welcome to this conversation of the Local Voices Network”, on the backend, that data structure might look like
{ word: "Welcome", start: 0, end: 0.5 },
{ word: "to", start: 0.5, end: 0.6 },
{ word: "this", start: 0.6, end: 0.8 }
  • Now if I correct the transcript to say “Welcome to this amazing conversation of the Local Voices Network”, we have to figure out a start and end time for the new word “amazing”. This isn’t the worst problem: we can do a linear interpolation between the neighboring words’ timings, for example.
  • However, if you start deleting words and adding back words, it’s all too easy to start losing timing information. In our product in particular, lost timing data affects how highlights show up and how the words are emphasized as audio plays, and so it becomes a very noticeable problem.
  • As such, we want to run a proper force alignment on the server whenever we can in order to keep words in good shape.
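
As a rough illustration of the linear interpolation fallback mentioned above, here is a small sketch in Python. It assumes words are dictionaries shaped like the example data structure, with newly typed words carrying `None` for `start` and `end`; the `interpolate_timings` name and that `None` convention are assumptions for illustration, not our actual implementation.

```python
def interpolate_timings(words: list[dict]) -> list[dict]:
    """Fill in start/end for untimed words by evenly splitting the gap
    between the previous timed word's end and the next timed word's start.

    Assumes every timed word has both `start` and `end` set.
    """
    result = [dict(w) for w in words]  # copy so the input is left untouched
    i = 0
    while i < len(result):
        if result[i]["start"] is not None:
            i += 1
            continue
        # Find the end of this run of untimed words: indices [i, j).
        j = i
        while j < len(result) and result[j]["start"] is None:
            j += 1
        left = result[i - 1]["end"] if i > 0 else 0.0
        # With no timed word after the run, the new words collapse to
        # zero duration at `left`.
        right = result[j]["start"] if j < len(result) else left
        step = (right - left) / (j - i)
        for k in range(i, j):
            result[k]["start"] = left + step * (k - i)
            result[k]["end"] = left + step * (k - i + 1)
        i = j
    return result


words = [
    {"word": "this", "start": 0.5, "end": 1.0},
    {"word": "truly", "start": None, "end": None},
    {"word": "amazing", "start": None, "end": None},
    {"word": "conversation", "start": 2.0, "end": 2.5},
]
timed = interpolate_timings(words)
```

Note the limitation: when the neighboring words sit back to back with no gap, the inserted words get zero duration, which is one reason a proper force alignment on the server is still worth running.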

Putting it all together

We split our small but mighty team into a few streams of work:

  1. Refactoring our speech pipeline to be event driven using Celery
  2. Refactoring our force alignment algorithm to be fast ⚡️
  3. Building out a frontend transcript editor capable of adding words/paragraphs, deleting words/paragraphs, changing speaker names, adding speakers, playing audio while editing, find/replace/replace all, undo/redo, etc.
  4. Building out the backend API which would accept a list of snippets and figure out all of the edits to save to the database. We ended up using Hypothesis to help us write unit tests to make sure this was all working correctly—it was a lifesaver!
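
To give a feel for the fourth stream of work, here is a hedged sketch of how a backend might compare the stored snippets with the list submitted by the editor and decide what to insert, update, or delete. The `Snippet` and `EditPlan` classes and the `plan_edits` function are hypothetical names for illustration; the real API has more fields and more edge cases. Each planned change carries the snippet’s position in the submitted list, which is the value that would be written to index_in_conversation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snippet:
    id: Optional[int]  # None for a paragraph the user just added
    speaker: str
    text: str


@dataclass
class EditPlan:
    inserts: list = field(default_factory=list)  # (index_in_conversation, Snippet)
    updates: list = field(default_factory=list)  # (index_in_conversation, Snippet)
    deletes: list = field(default_factory=list)  # ids of removed snippets


def plan_edits(stored: list[Snippet], submitted: list[Snippet]) -> EditPlan:
    """Work out which rows to insert, update, or delete.

    `stored` is the current transcript in order; `submitted` is the full
    list sent by the editor. Stored snippets are assumed to have ids.
    """
    stored_by_id = {s.id: (i, s) for i, s in enumerate(stored)}
    plan = EditPlan()
    kept_ids = set()
    for index, snippet in enumerate(submitted):
        known = stored_by_id.get(snippet.id) if snippet.id is not None else None
        if known is None:
            plan.inserts.append((index, snippet))
            continue
        kept_ids.add(snippet.id)
        old_index, old = known
        # Update when the content changed or the paragraph moved.
        if old != snippet or old_index != index:
            plan.updates.append((index, snippet))
    plan.deletes = [s.id for s in stored if s.id not in kept_ids]
    return plan
```

A pure function like this, with many possible combinations of inserts, deletes, and moves, is the kind of logic that lends itself to property-based testing, for instance checking that applying the plan to `stored` always reproduces `submitted`.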

It took pretty much all of the first quarter of 2022 for our team to build this out, but we are so excited to roll it out to our partners and give them better control over their data. We’re very pleased with how it has turned out, and had fun working on this challenging problem.

User editing some words and speaker names in our transcript editor
The transcript editor in action!

If you are interested in this kind of work, check out our Careers page!