Mercury Milestone One

A web based tool for video and audio transcription, translation, captioning and video creation.

Published in glitch digital, Aug 4, 2017

Less than a month ago we released the first public version of our automated video transcription and translation tool Mercury, supported by prototype funding from the Google Digital News Initiative.

We now have hundreds of registered users and hundreds of videos uploaded. This is a progress update on how things are going — including challenges and the sort of feedback we have had so far.

A raw computer generated transcription and translation ready for editing

We launched with support for computer assisted transcription of videos in English, Chinese, Japanese, Arabic, French, Spanish and Portuguese. Within a couple of weeks we had added transcript translation (without audio translation yet) and export to closed caption (VTT) format, and had improved the transcription editor.
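To make the caption export concrete, here is a minimal sketch of turning timed transcript segments into a WebVTT file. It is an illustration only, not Mercury's actual exporter: the function names and the `(start, end, text)` segment shape are assumptions made for this example.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(segments) -> str:
    """Build a WebVTT document from (start_seconds, end_seconds, text) tuples.

    The segment shape is hypothetical; a real transcript format may differ.
    """
    lines = ["WEBVTT", ""]  # required header followed by a blank line
    for start, end, text in segments:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text.strip())
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_webvtt([(0.0, 2.5, "Hello and welcome."), (2.5, 5.0, "Here is the news.")]))
```

The resulting text can be saved with a `.vtt` extension and attached to a video through an HTML `<track>` element.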

In contrast to commercial projects we have been prepared to launch publicly at a very early stage to get as much feedback as soon as possible (starting with publishing the source for a proof of concept back in April 2016).

Summary of feedback and challenges

Negative feedback has fallen under one of three headings.

1. Videos being slow to upload

Slow uploads were an early complaint, one we started getting back in the pre-launch phase.

The problem wasn’t so much the upload speed to the servers but the size of raw videos being uploaded — and then we started hitting Google Drive performance issues and quota limits, which we are still working on.

2. Video not playing once uploaded

Video formats come in all shapes and sizes and in a public facing tool things get wild fast, and browser video handling support varies wildly too.

We rolled out a quick fix to help address this (we now convert all videos to H.264/MPEG-4 AVC on upload) but there is more we can do here.

3. Transcriptions of interviews can be hilariously bad

The intent of the transcription and translation functionality is to create a workflow tool that can help create transcriptions, translations and captions for broadcast quality content.

This works fairly well when dealing with clean audio (typically read from a script, spoken by a presenter and captured in a studio on professional equipment) especially if the source video is in English.

However even current state-of-the-art speech to text tends to fail hilariously when dealing with low-fi audio of typically casual conversations, often recorded on phones placed on desks, with ambient background noise.

We need to do a better job of setting expectations, and of providing tips to help people record usable audio in interviews where possible.

The version we sent out to a closed group in the first couple of weeks of July was also mangling translations when displaying them, due to a bug. Ouch.

Videos come in all shapes and sizes

The expectation (and the guidance in our help documentation) is that uploads are in MP4 or MP3 format and are typically shorter segments, with a 10 minute Standard Definition video being about 50 MB.

Mercury doesn’t stop you from uploading larger videos though, so of course people started uploading large raw videos over 1 GB in size. This worked (Mercury is hosted on AWS and we don’t impose rate limits on uploads) but it probably took a while unless they were on a high speed connection.
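A quick back-of-the-envelope calculation shows why raw uploads feel slow. The sketch below uses the figures above (50 MB for 10 minutes, 1 GB for a raw file) together with an assumed 10 Mbit/s uplink; that uplink speed is purely an illustrative assumption, and the estimate ignores protocol overhead.

```python
def average_bitrate_mbps(size_mb: float, duration_s: float) -> float:
    """Average bitrate in megabits per second (decimal units, 1 MB = 8 Mbit)."""
    return size_mb * 8 / duration_s


def upload_minutes(size_mb: float, uplink_mbps: float) -> float:
    """Best-case upload time in minutes, ignoring protocol overhead."""
    return size_mb * 8 / uplink_mbps / 60


if __name__ == "__main__":
    # A 50 MB, 10 minute SD video averages roughly 0.67 Mbit/s.
    print(f"{average_bitrate_mbps(50, 600):.2f} Mbit/s")
    # On an assumed 10 Mbit/s uplink: well under a minute for 50 MB...
    print(f"{upload_minutes(50, 10):.1f} min")
    # ...but over 13 minutes for a 1 GB raw file.
    print(f"{upload_minutes(1000, 10):.1f} min")
```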

Videos being in formats other than MPEG4 was also the reason why people reported not being able to play them back in their browser. Which video formats are supported varies widely depending on your browser, operating system, media handlers and browser plug-ins.

Chrome on a Mac can use QuickTime (and so QuickTime Plugins) and can play back a huge range of things, but support on older, unsupported operating systems like Windows XP and older versions of Internet Explorer is more limited.

While MPEG4 is now broadly supported, some platforms (including some Android devices) are also quite specific about the encoding options.

Clearly we need to do a better job of providing guidance here, and include steps to check media format and video size as part of the upload process.
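As a rough idea of what such an upload-time check could look like, here is a small sketch. The allowed extensions and the 50 MB warning threshold are taken from the guidance described earlier, but the constant names, the function name and the wording of the warnings are all hypothetical. Note that a file extension says nothing about the codec inside the container, so this is only a first line of defence.

```python
import os

# Hypothetical limits for illustration; not Mercury's actual policy.
ALLOWED_EXTENSIONS = {".mp4", ".mp3"}
WARN_SIZE_BYTES = 50 * 1000 * 1000  # about a 10 minute SD video


def check_upload(filename: str, size_bytes: int) -> list:
    """Return a list of human-readable warnings for a file about to be uploaded."""
    warnings = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        warnings.append(
            f"'{extension or 'no extension'}' files may not play in every browser; "
            "MP4 or MP3 is recommended."
        )
    if size_bytes > WARN_SIZE_BYTES:
        size_mb = size_bytes / 1_000_000
        warnings.append(f"This file is {size_mb:.0f} MB and may take a while to upload.")
    return warnings


if __name__ == "__main__":
    for message in check_upload("interview_raw.MOV", 1_200_000_000):
        print(message)
```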

We are also considering shifting some of the work client side, either doing more in the browser or in a companion desktop tool.

As an immediate fix, we quickly added in a step to normalize all files to H.264/MPEG-4 AVC as part of the upload process. This change did not impact videos already uploaded or actually improve upload times though.

This step was always planned, but just how big a problem its absence would be was unexpected and only really became apparent after launching.

Unfortunately this normalisation process can in some cases (chiefly depending on the format of the original video) take some time, which can slow the upload process down further. At the very least, we need to manage the experience better by indicating this is a background task.
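For readers curious what a normalisation step like this involves, here is a minimal sketch that assumes the widely used `ffmpeg` command line tool is installed. This post does not say which tool Mercury uses, so treat the choice of `ffmpeg`, the specific options and the function names as assumptions rather than a description of our pipeline.

```python
import subprocess


def build_normalize_command(src: str, dst: str) -> list:
    """Build an ffmpeg command that re-encodes src as H.264 video with AAC audio."""
    return [
        "ffmpeg",
        "-y",                       # overwrite the output file if it exists
        "-i", src,                  # input file in whatever format was uploaded
        "-c:v", "libx264",          # encode video as H.264/MPEG-4 AVC
        "-pix_fmt", "yuv420p",      # pixel format most players can decode
        "-c:a", "aac",              # encode audio as AAC
        "-movflags", "+faststart",  # move metadata to the front for web playback
        dst,
    ]


def normalize(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_normalize_command(src, dst), check=True)


if __name__ == "__main__":
    print(" ".join(build_normalize_command("upload.mov", "normalized.mp4")))
```

Re-encoding is CPU intensive and can take longer than the upload itself, which is why it makes sense to run it as a background job rather than make the user wait on it.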

Google Drive performance and quota limits

We want Mercury to be as affordable and accessible to everyone as possible, so we took on the complexity of integrating with Google Drive so that users could take advantage of cheap, flexible storage and Google Drive’s generous free quota of ~15 GB.

The plan has always been to support alternative storage options for paid subscribers, but the performance issues and quota limits have had a more severe impact than we anticipated; we are using Google Drive in an unconventional way and for much larger files than is typical.

We are looking at migrating all metadata and transcripts from users’ personal Google Drive storage to our own database to improve the performance and responsiveness of the application.

This is the sort of thing the prototype is intended to help shake out and we are working to rectify it. We may end up opening up the premium storage platform to everyone in the next few weeks.

Google Drive storage will still be available as an option and we’ll keep Google+ for sign-in (and will eventually enable other options).

Setting expectations

We are trying to create a tool to make things easier and to provide a computer assisted workflow, so that there is ultimately less work to do to transcribe and translate content, meaning publishing content in other languages becomes cheap enough to be economically feasible.

We want people to be able to access content from sources they might not have heard from before, whether that’s from foreign broadcasters or regional stories from global publishers that are not always published in their language. If we can make it cost effective, it can open up new markets and new revenue streams to publishers.

With the editor still a work in progress, without full audio translation (as seen in the working open source proof-of-concept we published) and with the video creation tools still in development, it’s perhaps hard to grasp quite what the tool is for when jumping into it for the first time.

One thing journalists are clamouring for is a tool to help with transcribing interviews. It’s a time consuming task and there is strong demand for a tool that would make it less odious.

Unfortunately the best speech-to-text software in the world currently doesn’t work well with low-fi recordings done in noisy environments and this leads to major disappointment when journalists upload their interviews and get back useless garbage transcripts.

We can do some work to help here, like noise reduction and audio cleanup, and provide guidance on how best to record audio (such as using a couple of cheap clip mics when recording and asking subjects to speak clearly) but there are limits to what is technically possible right now and to what journalists might be willing to ask folks to do when being interviewed.

There is a wide expectation gap here, as iPhones, Android phones and devices like Amazon Echo or Google Home seem to work reasonably well, right?

Of course, when we speak to our devices we tend to make a point of speaking clearly and directly to them. It’s also easier for them to predict the sorts of things we might be saying, because we are issuing specific commands and services combine the probability of recognized words with the most likely search query.
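A toy example helps show why that prior knowledge matters so much. The sketch below picks between two similar sounding candidates by combining a recognition probability with the probability that the phrase is a likely query. All of the numbers and names are invented for illustration, and this is not a description of how any particular assistant or speech service works.

```python
import math


def rescore(candidates: dict, prior: dict, prior_weight: float = 1.0) -> str:
    """Pick the candidate with the best combined log score.

    candidates maps text to the recogniser's probability for that text.
    prior maps text to how likely the phrase is as a command or query.
    """
    floor = 1e-6  # small probability for phrases the prior has never seen

    def score(text: str) -> float:
        return math.log(candidates[text]) + prior_weight * math.log(prior.get(text, floor))

    return max(candidates, key=score)


if __name__ == "__main__":
    # Invented numbers: the audio alone slightly favours the wrong phrase.
    heard = {"whether in london": 0.55, "weather in london": 0.45}
    likely_queries = {"weather in london": 0.2, "whether in london": 0.001}
    print(rescore(heard, likely_queries, prior_weight=0.0))  # audio only
    print(rescore(heard, likely_queries))                    # audio plus prior
```

An open-ended interview offers no such narrow set of likely phrases, so the recogniser has far less to lean on when the audio is ambiguous.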

We need to communicate better what Mercury is and isn’t good for and to help people make the best use of it.

In the immediate future, a professional transcriber will continue to be more accurate than speech-to-text software, a professional translator will provide a better quality translation than a machine translation, a professional presenter will provide the best sounding voiceovers and a professional video editor will be able to create the best video packages.

If we can provide a tool where individual multilingual journalists can quickly and easily translate and re-dub (using synthetic voices or by recording new spoken audio in the browser) and publish videos without any special training or complicated software then we can help newsrooms, media outlets and independent journalists reach new and wider audiences.

What’s next

We are going to address the issues we have, work on improving performance, features and workflow, and publish API documentation. We will keep you updated with progress over the next few months.

Our support from Google DNI generously covers hosting, admin and other staff costs on the project, with development self-funded by glitch.digital through our other work.

If you are interested in supporting our work, in partnering with us or in getting access to some advanced features, get in touch with mercury@glitch.digital.

What else is out there?

There is some really interesting related work being done by folks like Pietro Passarelli on autoEdit and by Mark Boas and Laurian Gridinoc (both of whom have worked on Trint).

Work on tools like oTranscribe by Elliot Bentley has also been extremely helpful in understanding some of the challenges, and no-longer maintained projects like PopcornJS from Mozilla continue to be invaluable.

BBC News Labs and Deutsche Welle are also doing work in this area, both together and on projects of their own.

If you are working on video transcription, translation or web based video or audio editing tools or libraries, leave a comment and a link to your project!