Make OCR great again

Aaron Edell
Published in Machine Box
Feb 15, 2019 · 5 min read

OCR, or Optical Character Recognition, is an awesome tool for turning printed text into digital deliciousness. Existing open source libraries and tools work great… on documents. Where they fall down is in just about every other scenario… until now (hopefully).

If you are a lawyer, and have been for 30 or so years, you’ve probably got lots and lots of file cabinets filled to the brim with paper, as well as a general malaise about being a lawyer for so long. I’m sure there are contracts, agreements, trusts, memos, letters, and even the occasional random menu from the local Chinese take-out place stuffed into hanging file folders and other fascinating file cabinet accoutrement. The annoying part is that you’ve spent your career working very hard to keep all that paper organized just so you can find the relevant docs again when you need them.

You can scan your docs and make a digital copy of everything so that when you spill coffee on your file cabinet you don’t lose your life’s work.

But what good is having all of those digital files if you can’t search them? Imagine replacing all those file cabinets with PDFs and Word docs in your (hopefully secure) Dropbox account.

This is where OCR comes in. When you scan your documents, instead of just getting a photo of the fascinating prose typed on white legal paper, you actually get a document containing text that the computer can understand. It is the difference between a picture of a printed page and a Word doc you can edit on your computer.

OK, you get it; you know what OCR and computers are. Let’s move on to when OCR doesn’t really work.

Videos

Video, a series of photos put together and… no, no, no, you know what video is. So why does OCR have a hard time with it? Well, consider the following example:

(Image: a frame from Amazon’s first NFL live stream, via https://www.geekwire.com/2017/amazons-first-nfl-live-stream-overcomes-early-glitches-long-weather-delay/)

If you run this through one of the many cloud APIs for OCR out there, this is the kind of data you get back:

{
  "description": "10",
  "boundingPoly": {
    "vertices": [
      { "x": 400, "y": 628 },
      { "x": 419, "y": 628 },
      { "x": 419, "y": 662 },
      { "x": 400, "y": 662 }
    ]
  }
}

10? 10 what? Is that the score? Is that how many minutes are left on the clock? Is it the down or the distance until a first down? Is that a player’s jersey number?

Every number, and in some cases every letter, has its own entry in the results. This data isn’t very useful at scale because I don’t know what any given number refers to.

I guess we can’t do it

Not so fast. We can still glean meaningful information from this kind of video, but first we have to think long and hard about why we’re running OCR on it. Then, we can use a tool like Objectbox (a brand-new capability from Machine Box) to zero in on the numbers we really want.

So, let’s assume we want to extract the game clock from the screen, so that we have reproducible time indexes for all the hundreds of thousands of hours of NFL content being managed at my hypothetical company.

The first thing I need to do is grab some sample videos from my collection and download them to my computer.

Next, after installing Docker, I just need to follow the simple instructions for running Objectbox here.
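If you haven’t run a Machine Box box before, it boils down to a single Docker command. A rough sketch, assuming Objectbox follows the same pattern as the other boxes (the image name and the MB_KEY environment variable are the standard Machine Box setup, but check the docs for the exact command):

    docker run -p 8080:8080 -e "MB_KEY=$MB_KEY" machinebox/objectbox

Once the container is up, the annotation tool should be waiting for you in the browser at http://localhost:8080.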

Annotating

Now comes the fun part. When you run Objectbox, you get a little tool that lets you draw boxes around things in a video, which is then used to train the object detection engine inside the box. This is machine learning at its best, folks!

Draw a box, label it, click train, done.

What I want to do is draw boxes around the game clock only, and label it as such. Start with 4 or 5 examples, preferably from 2 or 3 different videos.

Fortunately, you’ll know pretty quickly whether you’ve got enough examples from the live detection feedback the tool gives you.

You’ve just won at machine learning

Congratulations! You win! You’ve just accomplished something that used to take data scientists and ML engineers months to do manually. Somebody should probably give you some money or something.

Now that you have a model trained to detect the game clock in your NFL videos… you can use that model to detect the game clock in NFL videos. ALL YOUR VIDEOS.

Create image files of just the game clock for every frame you sample, using the bounding box information you get back from Objectbox in the JSON response.
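Here’s a minimal sketch of that step in Python, using OpenCV to sample one frame per second and crop out whatever Objectbox detects. The /objectbox/check endpoint and the shape of the JSON response (a "detections" list with a "rect" box) are my assumptions based on how the other Machine Box APIs look, so check the Objectbox docs for the real schema:

    import cv2
    import requests

    OBJECTBOX_URL = "http://localhost:8080/objectbox/check"  # assumed endpoint

    video = cv2.VideoCapture("nfl_sample.mp4")
    fps = int(video.get(cv2.CAP_PROP_FPS)) or 30  # fall back if FPS is unknown
    frame_index = 0

    while True:
        ok, frame = video.read()
        if not ok:
            break
        if frame_index % fps == 0:  # sample roughly one frame per second
            _, jpeg = cv2.imencode(".jpg", frame)
            resp = requests.post(
                OBJECTBOX_URL,
                files={"file": ("frame.jpg", jpeg.tobytes(), "image/jpeg")},
            )
            for det in resp.json().get("detections", []):  # assumed response shape
                r = det["rect"]  # assumed keys: top, left, width, height
                crop = frame[r["top"]:r["top"] + r["height"],
                             r["left"]:r["left"] + r["width"]]
                cv2.imwrite("clock_%06d.jpg" % frame_index, crop)
        frame_index += 1

    video.release()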

Then, build a little service that sends just those JPEGs to your favorite OCR tool.
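For example, here’s a tiny loop that runs every cropped JPEG through Tesseract (standing in here for whatever OCR tool you prefer) via the pytesseract library. Telling Tesseract to expect a single line of digits and colons cuts down on noise:

    import glob

    from PIL import Image
    import pytesseract

    for path in sorted(glob.glob("clock_*.jpg")):
        # --psm 7 treats the crop as one line; the whitelist keeps clock characters
        text = pytesseract.image_to_string(
            Image.open(path),
            config="--psm 7 -c tessedit_char_whitelist=0123456789:",
        ).strip()
        print(path, "->", text)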

See how much better your results are?

You’ve just built an ML pipeline

My goodness you’re on fire. Not only did you train a single machine learning model, but you’ve built a machine learning pipeline, using AI on top of AI to bring about change in the world.

You know what the data coming back from OCR represents; it has context and meaning, and you can use it to solve your use case.
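For instance, since you know every string coming back is the game clock, turning it into a searchable time index is a small parsing job. A hypothetical helper (assuming the on-screen clock reads minutes:seconds):

    import re

    def clock_to_seconds(text):
        """Turn an OCR'd game clock like '10:42' into seconds remaining."""
        match = re.match(r"^(\d{1,2}):(\d{2})$", text.strip())
        if not match:
            return None  # OCR noise or a partial read; skip this frame
        minutes, seconds = int(match.group(1)), int(match.group(2))
        return minutes * 60 + seconds

    print(clock_to_seconds("10:42"))  # 642
    print(clock_to_seconds("1O:42"))  # None (the letter O, a classic OCR miss)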

How do I get this into production?

Good question. If you’re like me, you’re not really comfortable writing code, especially code that other people will depend on. That’s OK; what you’re doing here is experimenting and proving that something is possible. That is always the first step.

If you’re actually looking to solve this in production, without having to build your own deployment schemes around machine learning pipelines, you can always try Veritone’s aiWare platform. It has Objectbox in it, as well as some very capable OCR engines that you can pick from.

