Automated text extraction at Bolt

Francesco Pochetti
Bolt Labs
10 min read · Sep 2, 2022


What do we use text extraction for?

Bolt is a multimodal transportation company operating ride-hailing, delivery, and rental services in hundreds of cities across Europe and Africa. On the one hand, we need to know who the drivers, couriers, and riders are to provide a safe and high-quality service. On the other hand, we need to know who’s renting our scooters, bikes, and cars to prevent damage and theft.

To know our customers and partners, we collect various documents to identify them — an ID card, driving licence, or passport — but also ensure they have a valid taxi licence and insurance should they need it.

Manually processing all these documents is an immense effort at our scale and a substantial part of our operating cost. Data entry quality is also laborious to manage: you need multiple people to enter the same data (double entry), plus the accompanying training and management resources, to produce high-quality records.

Manual data entry errors have been studied extensively in clinical research. Most results are recorded in a laboratory setting, where people take notes in notepads and key them in later. Because the data points are relatively few and errors can cost pharmaceutical companies dearly, the topic has attracted academic interest, and external benchmarks are available from several studies.

The expected error rate per text field type — e.g. numeric, date, text — can vary significantly. Hong MKH et al.¹ found that across all fields, the error rate was 2.8%, while individual field error rates ranged from 0.5% to 6.4%. Fields in text formats were significantly more error-prone than those with direct measurements or numerical figures.

Getting to zero data entry mistakes is a worthy yet often unattainable goal. This is universal among tech companies, government agencies, and pharmaceuticals. Data entry is a manual, repetitive job; people get tired and make mistakes.

For context, Goldberg, Saveli I et al.² found that the detected error rates in existing medical databases ranged from 2.3 to 26.9%, significantly higher than expected. So, what can be done about it?

Optical Character Recognition (OCR) systems can considerably lower the human resources required for data entry and, simultaneously, reduce the expected error rate.

What do we mean by OCR exactly? As per Wikipedia, optical character recognition is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text.

OCR systems have been around for a while, so there’s nothing much to invent here. We just submit the document we want to scan to one of those OCR engines, and we’re good to go, right? No.

Let us show you the kind of pictures we receive.

If your first reaction was, “wait, the first one isn’t a document!” then you’re both right and wrong. That’s an image a user uploaded as a supposed driving licence. So, in our systems, it was tagged as such when it’s not. The other images speak for themselves.

Some of those issues can be detected at the device level, prompting the user to retake a picture. However, given that we don’t always collect images from our app, we don’t fully control the image quality and just have to deal with what we gather.

We can apply the OCR system in these cases, but not as a standalone procedure. It needs to be part of a more complex text extraction process, composed of multiple steps aimed at cleaning the image, identifying only the parts of the doc requiring our attention and then, only then, feeding those regions to a character recognition system.

OCR and text extraction aren’t synonyms. The former is part of the latter, just a step in a much more complex pipeline.

In the following sections, we’ll explain how we implemented automated text extraction from ID documents at Bolt. So fasten your seatbelts as we get started!

Document processing pipeline

The following diagram shows the document processing pipeline we put in place.

Let’s dissect it step by step.

For this exercise, we’ll use an actual document we recently added automated support for: the Proof of Insurance Car Sticker (PICS) in Ghana. For obvious reasons, the PICS is part of the set of mandatory documents Ghanaian drivers must provide to Bolt when signing up.

We extract multiple text fields from the PICS, but for privacy and simplicity in this post, we’re going to see what it takes to read the expiry date only, e.g. “Expiry: 2022–09–27”.

Sample image of a Proof of Insurance Car Sticker (PICS) in Ghana

1. Document type detection

As previously shown, when a user tells us they’ve uploaded a PICS, we still have to check that they actually provided the correct document before moving on. So the first step is a document type detector, which, behind the scenes, is nothing more than a binary classifier figuring out whether the image we received is a PICS or not.

If it’s not the correct document, we send it to our internal Human Review team for them to process it manually.

In case of a match, we know we’re dealing with the expected document type and move to the next step.

For completeness, the binary classifier is a simple (yet accurate and effective) pre-trained ResNet34 Convolutional Neural Network that we fine-tune on fewer than 1,000 pictures.
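Our exact training code isn’t shown here, but as a rough sketch of what such a fine-tuning setup could look like with torchvision (the layer replacement, optimiser, and decision threshold below are illustrative assumptions, not our production values):

```python
# Illustrative sketch only: a binary "PICS / not PICS" classifier built on a
# pre-trained ResNet34. Hyperparameters and the threshold are assumptions.
import torch
import torch.nn as nn
from torchvision import models

def build_document_classifier() -> nn.Module:
    # Start from ImageNet weights and swap the final layer for a 2-class head.
    model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 2)
    return model

model = build_document_classifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fine-tune end to end

# At inference time, a probability threshold decides whether the image moves on
# in the pipeline or is routed to Human Review.
model.eval()
dummy_batch = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch
with torch.no_grad():
    is_pics = model(dummy_batch).softmax(dim=1)[:, 1] > 0.5
```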

2. Image preprocessing

Now that we’re sure a driver sent us a PICS, what do we do with it? We preprocess it to make it easier for text extraction to work.

Specifically, we go through the following steps:

  • Detect the region of the photo occupied by the document and crop it. We detect it via a semantic segmentation model (U-Net with a pre-trained ResNet34 encoder), which classifies each image pixel as either document or background. The rationale behind this is that often the actual document occupies a relatively small part of the input image. The rest is the background that’s of no interest to us. The background noise can introduce false positive readings for the text box detection and OCR downstream, so we cut it out.
  • Bird’s eye view. This operation detects four reference points (typically corners) of the document in question and applies a perspective transform to obtain a top-down view of the image, taking care of warping and rotation. The four corners aren’t always easy to spot, given the quality of our input. If we can’t find them, we skip this transform and move on.
  • Recognise the expected document dimensions. For example, if we know we’re dealing with an ID card format, then we also know the aspect ratio a correct document is supposed to have. This information allows us to check the orientation. If the aspect ratio of the cropped mask is the inverse of the expected one, e.g. portrait instead of landscape, we can be pretty sure the document is rotated by 90 degrees and fix it. You can see an example below with a vehicle document from South Africa.
  • Text deskewing (optional). The previous steps generally fix most of our issues. If that isn’t enough, we can play the final card of deskewing: estimating the “skew” (angle in degrees) and rotating by its inverse so that the text runs across the page rather than at an angle. Skewed text causes multiple problems. The clearest one is that when we try to locate regions of interest across the document (see section 3 for details), the bounding boxes are loose and tend to overlap with adjacent pieces of text. Let’s check the fictitious example below, a crop from the PICS doc. As you can see, when detecting the expiry date in the skewed image, the bounding box inevitably includes a chunk of the Inception field, which can cause problems for the OCR engine performing the read. On the unskewed image, this issue is not present. A rough OpenCV sketch of these geometric steps follows this list.
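To make the geometric steps above concrete, here’s a rough OpenCV sketch of the bird’s eye view transform and the deskewing, assuming the segmentation mask is already available. The corner-ordering and skew-estimation heuristics below are illustrative assumptions, not our exact production code:

```python
# Rough OpenCV sketch of the "bird's eye view" and deskewing steps.
# Assumes `mask` is the 0/255 uint8 document mask from the segmentation model.
import cv2
import numpy as np

def order_corners(pts: np.ndarray) -> np.ndarray:
    # Order 4 (x, y) points as top-left, top-right, bottom-right, bottom-left.
    s, d = pts.sum(axis=1), np.diff(pts, axis=1).ravel()
    return np.array([pts[np.argmin(s)], pts[np.argmin(d)],
                     pts[np.argmax(s)], pts[np.argmax(d)]], dtype=np.float32)

def birds_eye_view(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # The largest contour of the mask ~ the document; approximate it with 4 points.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    doc = max(contours, key=cv2.contourArea)
    approx = cv2.approxPolyDP(doc, 0.02 * cv2.arcLength(doc, True), True)
    if len(approx) != 4:
        return image  # corners not found: skip the transform, as described above
    corners = order_corners(approx.reshape(4, 2).astype(np.float32))
    tl, tr, br, bl = corners
    w = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    h = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(corners, dst)
    return cv2.warpPerspective(image, M, (w, h))

def deskew(crop: np.ndarray) -> np.ndarray:
    # Heuristic: estimate skew from the minimum-area rectangle around dark
    # (text) pixels and rotate by the inverse angle. The angle convention of
    # minAreaRect differs across OpenCV versions, so treat this as a sketch.
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    coords = np.column_stack(np.where(gray < 200)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = angle - 90 if angle > 45 else angle
    h, w = crop.shape[:2]
    R = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(crop, R, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```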

Let’s check the result of image preprocessing on our PICS sample document.

Nice! The segmentation mask detector and the subsequent bird’s eye view transform did a great job. The result looks clear enough to keep going.

3. Detect and crop regions of interest

Now that we’ve taken care of preprocessing, we can focus on the exact text fields we need to read. As stated before, we’ll focus on the expiry date only.

We can detect the location of the needed field within the document by applying an object detection model trained on text fields from the same document type.

It might surprise many, but we opted for the good old COCO pre-trained Faster RCNN. Its performance is pretty much spot on, at least for our needs. Check out how it manages to detect the expiry date in this case. This level of accuracy in placing the bounding box is achieved with less than 1000 training images — just a reminder of how powerful transfer learning is!
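For a sense of how little code that takes with torchvision (the backbone and the class list below are assumptions for this example, not necessarily our production setup), the COCO pre-trained detector just needs its box-predictor head swapped for the text-field classes before fine-tuning:

```python
# Illustrative: adapt a COCO pre-trained Faster R-CNN to detect text fields.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 2  # background + "expiry_date" (in practice, one class per text field)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.COCO_V1
)
# Swap the COCO box-predictor head for one matching our text-field classes,
# then fine-tune on the annotated document images.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
```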

Once detected, we can crop the region of interest (ROI) using the bounding box coordinates.
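A minimal sketch of that inference-and-crop step, assuming the torchvision detector from the previous snippet (the helper name and the highest-confidence-box logic are ours, for illustration):

```python
# Illustrative: run the fine-tuned detector and crop the expiry-date region.
from typing import Optional

import numpy as np
import torch

def crop_expiry_date(model, image: np.ndarray) -> Optional[np.ndarray]:
    # image: HxWx3 RGB array; torchvision detectors expect CHW floats in [0, 1].
    tensor = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0
    model.eval()
    with torch.no_grad():
        pred = model([tensor])[0]
    if len(pred["boxes"]) == 0:
        return None  # nothing detected: this image goes to Human Review
    best = pred["scores"].argmax()  # keep the highest-confidence box
    x1, y1, x2, y2 = pred["boxes"][best].round().int().tolist()
    return image[y1:y2, x1:x2]
```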

The result is precisely what we need to finally apply the OCR engine magic. Let’s see how this works in the next section.

4. Text recognition

Can we run the previously cropped region of interest through an OCR system as is? We could, yes, but there are a couple of transformations it’s almost always a good idea to perform first.

Convert the image to grayscale (or binarise it altogether), denoise it via filtering, increase the contrast, and so on. These steps dramatically improve the quality of OCR predictions, as pointed out by the official Tesseract docs.

The following is the result of converting the original ROI to grayscale and denoising it with a 5x5 blurring kernel in OpenCV. As you can see, it’s much cleaner than what we started with.
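In code, that preprocessing boils down to a couple of OpenCV calls; the file name below is just a placeholder for the cropped region of interest:

```python
# Grayscale conversion and 5x5 blurring of the cropped region before OCR.
import cv2

roi = cv2.imread("expiry_date_crop.png")      # placeholder path for the cropped ROI
gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)  # drop colour information
clean = cv2.blur(gray, (5, 5))                # denoise with a 5x5 averaging kernel
```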

Now that we have an excellent crop of the image we want to read, which OCR engine shall we use? The choice is overwhelming here.

From our experiments, we can tell that Tesseract, EasyOCR (both open source), or paid options such as Amazon Textract or Google OCR all provide satisfactory results at different levels. It depends on what your needs are.

In our case, we’re using a combination of Tesseract as the main engine and Amazon Textract as a fallback option invoked when Tesseract fails to return a read or returns a confidence value lower than a specific threshold. For the expiry date in question, Tesseract alone extracts 2022–09–27 with 95% confidence.
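A simplified sketch of that fallback logic, assuming pytesseract and boto3 as clients; the confidence threshold and the way words are aggregated are illustrative assumptions rather than our production values:

```python
# Illustrative: Tesseract first, Amazon Textract as a fallback on low confidence.
# Assumes AWS credentials/region are configured for the Textract client.
from typing import Optional

import boto3
import cv2
import numpy as np
import pytesseract
from pytesseract import Output

CONFIDENCE_THRESHOLD = 80  # assumption: tune to your own data

def read_field(crop: np.ndarray) -> Optional[str]:
    data = pytesseract.image_to_data(crop, output_type=Output.DICT)
    words = [(t, float(c)) for t, c in zip(data["text"], data["conf"])
             if t.strip() and float(c) >= 0]
    if words and min(c for _, c in words) >= CONFIDENCE_THRESHOLD:
        return " ".join(t for t, _ in words)

    # Fallback: send the crop bytes to Amazon Textract.
    _, png = cv2.imencode(".png", crop)
    response = boto3.client("textract").detect_document_text(
        Document={"Bytes": png.tobytes()}
    )
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return " ".join(lines) if lines else None
```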

An important detail to mention is that, before returning, we also post-process the raw OCR outputs using a couple of tricks. The most notable one is regex pattern matching, which we apply at the text field level. In the case of dates, for instance, we know that days of the month, months of the year, and years must follow specific rules. We can be even more precise and refine the pattern when dealing with expiry dates, given that we know the date cannot be in the past.
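As an illustration of that kind of check (the pattern and helper below are ours, not our exact production rules), an expiry-date read can be validated roughly like this:

```python
# Illustrative post-processing: validate an OCR'd expiry date with a regex
# and reject dates that are malformed or already in the past.
import re
from datetime import date
from typing import Optional

EXPIRY_PATTERN = re.compile(r"(\d{4})[-–](\d{2})[-–](\d{2})")  # e.g. 2022-09-27

def parse_expiry(raw_text: str) -> Optional[date]:
    match = EXPIRY_PATTERN.search(raw_text)
    if not match:
        return None
    try:
        parsed = date(*map(int, match.groups()))
    except ValueError:  # e.g. month 13 or day 32
        return None
    # Expiry dates, by definition, should not be in the past.
    return parsed if parsed >= date.today() else None

parse_expiry("Expiry: 2022-09-27")  # valid only while that date is in the future
```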

Are we done?

Not yet. A final critical detail. If the OCR engines either fail to extract the text or do so with low confidence, we send the image to our internal Human Review team. Their job consists of manually checking the document contents and filling the gaps.

However, our primary goal is to maximise automation. We want to build reliable, robust, and accurate OCR pipelines, so our agents review only tough cases. Agents’ work is super important. It closes the loop, guaranteeing we always return the best possible results to our customers.

Conclusion

In this post, we explained how we implemented a fully automated text extraction pipeline at Bolt.

Starting from why we need an OCR solution, we deep-dived into details of the document processing production flow, peeling the onion step by step.

Everything begins with a document classifier to ensure the image we receive from users is what we expect. We then preprocess the raw input by removing any background and cutting out the document only.

Only once an object detector has precisely identified the location of the text fields we’re interested in do we crop them one by one and run the OCR engine to read their contents.

I hope you found this walkthrough helpful. Stay tuned for more!

Join Bolt!

Bolt is serving over 100 million customers in 45 countries across Europe and Africa.

You can bet that there are fascinating engineering challenges involved with this kind of growth!

If you’d like to join us in building the future of urban transport, visit our careers page.

Sources

[1] Hong MKH, Yao HHI, Pedersen JS, et al. Error rates in a clinical data repository: lessons from the transition to electronic data transfer — a descriptive study. BMJ Open, 2013.

[2] Goldberg SI, et al. Analysis of data errors in clinical research databases. AMIA Annual Symposium Proceedings, 2008, pp. 242–246.

