Screenshot showing an extract of the spreadsheet with the columns, auto CRAP and correct CAPS, and the words, side-by-side in a Google Sheet and colour coded as green, orange and red.

YouTube automatic craptions score an incredible 95% accuracy rate!

TheDeafCaptioner
4 min read · Jul 25, 2015

--

For more than 5 years I’ve been waging a battle with Google and YouTube on the poor quality of automatic craptions.

Today, I would like to offer a very humble and sincere apology.

Sorry Google and YouTube.

You are getting better, and I’m ready to “man up” and admit that my mind was totally blown by the accuracy of one YouTube video earlier today, which I scored at just over 95% accuracy.*

This is the video that achieved this incredible feat of automatic captioning accuracy, which also marks a significant milestone — as it’s the very first YouTube video I’ve been able to comprehend just by watching with the automatic captions.

YouTube video screenshot showing Australia’s Prime Minister, Tony Abbott

This is how I conducted my calculations.

I first created a Google Sheet and entered each word in the automatic craptions file in the first column. Then I repeated this step in the adjacent column, entering each word of the corrected caption file.

Then I did some basic colour coding to score each word for accuracy on the simplistic basis of whether it was comprehensible to me (or not).

Green colour code = no error or difference between the auto craptions file and the corrected caption file.

Orange colour code = a difference between the two files that is just a matter of punctuation (e.g. a full stop at the end of a sentence) or capitalisation (e.g. at the start of a sentence).

Note: Words coded as orange were not treated as errors, as they did not detract from my ability to comprehend the video.

Red colour code = errors or differences between the auto craptions and the corrected captions, i.e. cases where Google and YouTube’s voice recognition and deep NLP tools got it wrong.
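The three-colour rule above can be sketched in a few lines of Python. This is my own rough rendering of the manual process, not the author’s actual spreadsheet logic; the function name and examples are hypothetical.

```python
import string

def classify(auto_word, correct_word):
    """Classify a word pair using the green/orange/red scheme:
    green  = exact match
    orange = cosmetic difference only (punctuation or capitalisation)
    red    = a genuine recognition error"""
    if auto_word == correct_word:
        return "green"
    # Ignore surrounding punctuation and letter case (the "orange" cases)
    normalise = lambda w: w.strip(string.punctuation).lower()
    if normalise(auto_word) == normalise(correct_word):
        return "orange"
    return "red"

print(classify("means", "means"))           # green
print(classify("australia", "Australia."))  # orange
print(classify("Maine's", "means"))         # red
```

Only the red results count as errors in the accuracy calculation that follows.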

The first 20 words of the spreadsheet, with the Auto CRAP and Correct CAPS columns side-by-side
14 words side by side in the Auto CRAP and Correct CAPS columns with an error coded in red — that versus than.

Then I used a very simple calculation of accuracy (which completely ignores any errors in the timing and/or syncing of the captions).

There should have been 236 words in total in the automatic craption file.

And there were only 11 errors (or rows colour-coded red) in the automatic craption file.

This represents an error rate of 4.66%, or more importantly, an accuracy rate of 95.34%!
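The arithmetic is simple enough to verify in a couple of lines, using the counts quoted above:

```python
# Reproducing the calculation: 11 red-coded rows out of 236 words.
total_words = 236
red_errors = 11

error_rate = red_errors / total_words * 100
accuracy = 100 - error_rate

print(f"error rate: {error_rate:.2f}%")  # 4.66%
print(f"accuracy:   {accuracy:.2f}%")    # 95.34%
```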

Memo to Google and YouTube — Don’t rest on your laurels just yet

This YouTube video was chosen because it represents the best possible chance for Google and YouTube to “get it right” using their voice recognition and natural language processing (NLP) tools, as it:

  • has only a single speaker, who speaks clearly and in a slow, consistent, monotone and deliberate manner
  • has good-quality audio, with no background noise, sound effects or music
  • has a speaker whose style often involves repetition of certain phrases (although interestingly, “this means” was correctly picked up once by the automatic craptions, but on the second occasion it became “this Maine’s”!)
  • was published within the last month, so it should be using the very latest iteration of Google and YouTube’s voice recognition / NLP toolset.

Most YouTube videos do not have these elements.

Rather, they tend to have multiple speakers, rapid-fire and often unintelligible dialogue, sound effects and background music, and the list goes on and on.

In these circumstances, Google and YouTube’s auto-generated craptions will simply not be able to get anywhere near the magic 95% level that we have demonstrated above.

And it’s up to YouTube and Google to be much more proactive than they currently are in pushing content creators to manually review and correct the automatic craptions, or to upload correct transcripts. That could mean punitive measures, such as reducing the AdSense revenue paid to content creators who don’t provide captions, or perhaps paying a little extra to reward those channels that actually do the right thing!

In the interests of balance, I’ve also done the calculations on the first minute or so of another recent YouTube video that has two speakers, just to demonstrate that we very quickly end up back in the 60%/70%/80% “accuracy” zones once we move away from automatic craptioning’s ideal operating conditions.

YouTube video screenshot showing Malcolm Turnbull and Paul Shetler
Screenshot which includes “I’m here with Paul Shetler” but auto craptions thought it was “I’m he would pull shit like”!

There should have been 196 words in total in the automatic craption file for the first part of the clip with Malcolm Turnbull and Paul Shetler.

But there were noticeably more errors this time: a total of 51.

This represents an error rate of 26.02%, or more importantly, an (in)accuracy rate of just 73.98%!

Resources:

Link to the Google Sheet calculations
