Pipeline to process and OCR historical news archive

Steven Yeung
SCMP — Inside the Wonton
8 min read · Dec 15, 2020

This is a follow-up article to the previous blog post: Using AI to denoise scanned newspapers.

TL;DR

The idea of using a Denoising Autoencoder (DAE) to denoise scanned newspapers has proven very successful. To further improve OCR accuracy, we have added traditional image processing to the pipeline to remove unwanted elements from the image, e.g. borders and graphics.

The following image shows the original scanned copy versus the end result of the pipeline.

Figure 1. Original image (left) and final result (right) processed by the pipeline.

We will go through each step of the pipeline, and see how we gradually improve the image for OCR.

The main components of the pipeline include:

  • upload: The stage where we parse the segmented data from XML to JSON, and upload source files to a central repository for processing in the later stages.
  • prep: Using traditional image processing to remove borders and graphics.
  • ocr: Using the DAE model to denoise the image and Tesseract to perform the OCR.

Because we have over 100 years of news archive to process, the pipeline uses Celery to manage the task queue and Kubernetes to run the tasks at scale.

API and Storage

The OCR text and PDFs will eventually be migrated into our in-house archiving system, “Nexus”. For Nexus to access the artifacts and for the other pipeline components to upload them, we need a central repository: the API.

The API is based on FastAPI, with MongoDB as the database and AliCloud Object Storage Service (OSS) as file storage. The API generates signed URLs so that clients manage files on OSS directly; the API itself never handles file uploads or downloads, which allows it to scale smoothly with the Kubernetes Horizontal Pod Autoscaler (HPA).
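As a rough illustration of the signed-URL flow (the endpoint path, bucket name and environment variables below are made up for illustration, not the actual service), the handler can be as small as this:

import os
import oss2
from fastapi import FastAPI

app = FastAPI()

# Illustrative credentials and bucket; the real service reads its own config.
auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])
bucket = oss2.Bucket(auth, "https://oss-cn-hongkong.aliyuncs.com", "scmp-ocr-artifacts")

@app.get("/signed-url/{object_key:path}")
def signed_url(object_key: str, method: str = "GET"):
    # The client uploads or downloads directly against OSS with this URL,
    # so the API never streams file contents itself.
    url = bucket.sign_url(method, object_key, 3600)  # valid for one hour
    return {"url": url}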

FastAPI is very easy to build with. Because it is based on OpenAPI, it generates the JSON schema on the fly, and developers can browse the API endpoints using Swagger UI. FastAPI is also very fast; below is a simple benchmark against a listing of OCR records.

siege -c10 -r100 "https://example.com/scmp-ocrs/?model=s152-2020-11-09.h5&year=1980&limit=20"
{
  "transactions": 1000,
  "availability": 100,
  "elapsed_time": 6.06,
  "data_transferred": 18.45,
  "response_time": 0.06,
  "transaction_rate": 165.02,
  "throughput": 3.04,
  "concurrency": 9.11,
  "successful_transactions": 1000,
  "failed_transactions": 0,
  "longest_transaction": 0.35,
  "shortest_transaction": 0.02
}

During the pipeline development, we also found a useful project: OpenAPI Generator. It takes the OpenAPI schema and generates an SDK for your preferred programming language. We used the generated SDK to write the pipeline components, and everything worked without any problems.
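For example, generating a Python client from the schema that FastAPI exposes at /openapi.json can be as simple as this (the spec URL and output directory are placeholders):

openapi-generator-cli generate -i https://example.com/openapi.json -g python -o ./ocr-api-sdk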

Task: “upload”

We had two vendors helping with the historical archives. One of them scanned microfilms and newspapers into TIFF images. The other vendor segmented the pages into articles. Metadata is saved in XML format, articles are stored as PDFs, and graphics found inside articles are stored as JPG files.

Because these two repositories are only available within the company network through SMB, I had to upload the following data to the API from my workstation:

  • Upload scanned TIFF images to OSS.
  • Parse article XML to JSON and store in API.
  • Upload article PDF and graphic JPG files to OSS.

This is the only step that cannot be scaled, but luckily it only has to be performed once.
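As a rough sketch of what a single record of this step looks like (the endpoint paths and XML field names below are illustrative, not the vendor’s actual schema):

import pathlib
import xml.etree.ElementTree as ET

import requests

API = "https://example.com"  # same host as the benchmark above

def upload_article(xml_path: pathlib.Path):
    # Parse the vendor XML; field names here are purely illustrative.
    root = ET.parse(xml_path).getroot()
    record = {
        "headline": root.findtext("headline"),
        "page": root.findtext("page"),
        "date": root.findtext("publication_date"),
    }
    # Store the parsed metadata as JSON in the API (MongoDB behind it).
    record_id = requests.post(f"{API}/scmp-ocrs/", json=record).json()["id"]
    # Ask the API for a signed URL, then PUT the article PDF straight to OSS.
    pdf_path = xml_path.with_suffix(".pdf")
    signed = requests.get(f"{API}/signed-url/{record_id}.pdf", params={"method": "PUT"}).json()
    requests.put(signed["url"], data=pdf_path.read_bytes())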

Task: “prep”

Getting the article from the scanned TIFF image

At the beginning of this project, the idea was to denoise the article PDF image and run the OCR again. But we noticed that some character details were lost in the article PDF when compared to the scanned page TIFF. The image below shows that the characters lose detail in the PDF image, while the text in the TIFF image is still clear to read.

Figure 2. The PDF image from OCR vendor (left) and image from scanned TIFF (right).

In this case, we have to use the article PDF as a clue and find the corresponding article on the page TIFF. We use the OpenCV ORB algorithm to extract features, and then use FLANN to find the best matches between the two images. From the matches we can compute the homography between the two images and crop the article out of the scanned page TIFF.

Figure 3. The green lines show the matching features.

For a detailed tutorial, please refer to the official OpenCV documentation.
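A trimmed-down version of the matching step might look like this (parameter values are illustrative, not our production settings):

import cv2
import numpy as np

def find_article(page_tiff: np.ndarray, article_pdf_img: np.ndarray) -> np.ndarray:
    # Detect ORB features in both images.
    orb = cv2.ORB_create(nfeatures=5000)
    kp1, des1 = orb.detectAndCompute(article_pdf_img, None)
    kp2, des2 = orb.detectAndCompute(page_tiff, None)
    # FLANN with an LSH index, the usual setup for binary ORB descriptors.
    flann = cv2.FlannBasedMatcher(
        dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1),
        dict(checks=50),
    )
    matches = flann.knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only the distinctive matches.
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.7 * m[1].distance]
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # H maps article PDF coordinates onto the page; warping the page with
    # the inverse homography crops a clean copy of the article from the TIFF.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = article_pdf_img.shape[:2]
    return cv2.warpPerspective(page_tiff, np.linalg.inv(H), (w, h))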

Removing graphics

If we apply the DAE to articles with graphics, the graphics tend to be “washed out”: parts of them are blurred or erased, and Tesseract may try to perform OCR on them, producing a lot of meaningless text. Therefore, we use the same technique as above: take the graphic JPG as a clue, find its location in the article, and remove it.

Figure 4. The green lines show the matching features of both images.

At this stage, we can get the article from the scanned TIFF image and remove all graphical elements.

Figure 5. Before (left) and after (right) graphic removal.
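Masking out a matched graphic is then straightforward. The sketch below assumes a grayscale article image and a homography H already estimated from the graphic JPG with the same ORB + FLANN matching as above:

import cv2
import numpy as np

def remove_graphic(article: np.ndarray, graphic: np.ndarray, H: np.ndarray) -> np.ndarray:
    # Project the four corners of the graphic into the article image.
    h, w = graphic.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    projected = cv2.perspectiveTransform(corners, H)
    # Paint the projected quadrilateral white so Tesseract has nothing to read there.
    cleaned = article.copy()
    cv2.fillPoly(cleaned, [np.int32(projected)], 255)
    return cleaned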

Removing borders

According to Tesseract best practices, we need to remove the borders to improve accuracy. So the next step in the pipeline is to find all vertical and horizontal lines in the article image.

A common technique is to use erode and dilate to find all horizontal lines (100×1) and vertical lines (1×100), and then subtract the lines from the original image in order to remove them.

import cv2
import numpy as np

def del_lines(img: np.ndarray, horizontal: int, vertical: int, enlarge: int = 0):
    # Invert the image so the dark lines become white on black.
    img = cv2.bitwise_not(img)
    # Get all horizontal lines.
    horizontal_structure = cv2.getStructuringElement(cv2.MORPH_RECT, (horizontal, 1))
    img_horizontal = cv2.erode(img, horizontal_structure)
    img_horizontal = cv2.dilate(img_horizontal, horizontal_structure)
    # Get all vertical lines.
    vertical_structure = cv2.getStructuringElement(cv2.MORPH_RECT, (1, vertical))
    img_vertical = cv2.erode(img, vertical_structure)
    img_vertical = cv2.dilate(img_vertical, vertical_structure)
    # Combine the two images.
    img_combined = cv2.add(img_horizontal, img_vertical)
    # Make the lines thicker to cover more pixels.
    if enlarge:
        rect_structure = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (enlarge, enlarge))
        img_combined = cv2.dilate(img_combined, rect_structure)
    # A simple subtraction removes the lines from the article image.
    result = cv2.subtract(img, img_combined)
    result = cv2.bitwise_not(result)
    return result
Figure 6. The detected lines (top). Before (bottom left) and After (bottom right) border removal.

This is the final step of using traditional image processing. In the later stage, we will use DAE to denoise the images and perform OCR on them.

P.S. You may notice that some processed images differ between figures. This is because we adjusted the pipeline settings while writing this article. The overall workflow remains the same.

Task: “ocr”

Use DAE to denoise the article image

Once we finish the steps in the “prep” task, we get a satisfying “pre-cleaned” article image. It is then run through the DAE model to produce a noise-free image. For more details on the DAE model, please see the previous blog post.

Before Tesseract starts the OCR process, it binarizes the image with the Otsu algorithm. In our case, we found that a simple fixed threshold is good enough to create a clean and clear image.

Figure 7. Before (left) and after (right) noise removal by DAE + simple threshold.
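Put together, the denoising step looks roughly like this (the model file name is taken from the benchmark URL above; the sketch glosses over the patching and resizing that the real model input requires):

import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("s152-2020-11-09.h5")

def denoise(img: np.ndarray) -> np.ndarray:
    # Scale to [0, 1] and add the batch and channel dimensions the model expects.
    x = img.astype("float32") / 255.0
    x = x[np.newaxis, ..., np.newaxis]
    y = model.predict(x)[0, ..., 0]
    # After the DAE has removed the noise, a simple fixed threshold is enough.
    _, binary = cv2.threshold((y * 255).astype("uint8"), 127, 255, cv2.THRESH_BINARY)
    return binary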

Perform OCR by Tesseract

At this stage, all the image processing steps are finished and the denoised image is processed by Tesseract. The OCR outputs are a text file and a “text-only” PDF. The text file is parsed and stored via the API. The transparent “text-only” PDF is merged with a PDF containing the original article image to produce a searchable PDF.
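In essence, a single Tesseract invocation can emit both outputs, and the text-only PDF can then be laid over the image PDF. The snippet below is a simplified sketch using the tesseract CLI and the pypdf library; file names are placeholders and the actual pipeline may use different tooling:

import subprocess
from pypdf import PdfReader, PdfWriter

# One run produces out.txt and a transparent, text-only out.pdf.
subprocess.run(
    ["tesseract", "denoised.png", "out", "-c", "textonly_pdf=1", "txt", "pdf"],
    check=True,
)

# Overlay the invisible text layer on the PDF that holds the article image.
image_page = PdfReader("article_image.pdf").pages[0]
image_page.merge_page(PdfReader("out.pdf").pages[0])
writer = PdfWriter()
writer.add_page(image_page)
with open("searchable.pdf", "wb") as f:
    writer.write(f)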

Google Colab

The DAE model was first trained on an “on-demand” compute instance in AliCloud. We used a GPU instance and had to manually install the graphics card drivers before TensorFlow could utilize the GPU. When new datasets were available, we had to copy the files to the instance by hand, and from time to time we had to wait for an available instance before training could start. Because of all this, we switched to Google Colab.

Google Colab provides a Jupyter-like online service with Google Drive integration. We can simply mount our datasets from Google Drive onto the Colab instance, and it provides free GPU instances (Nvidia K80s, T4s, P4s and P100s) to run the training. That makes it a perfect place for us to experiment with different models and datasets efficiently.
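Mounting the datasets inside a notebook is only a couple of lines:

from google.colab import drive

# Make the datasets stored on Google Drive visible to the Colab runtime.
drive.mount("/content/drive")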

Our first model was trained on datasets prepared manually. Thanks to my colleague who helped remove the noise by hand. The number of samples was just enough for the PoC: to prove that a DAE is the right approach to help us clean up the images. But we needed a much larger number of samples to train a model for production use, one that can handle the noise patterns across the 100-year news archive.

We decided to augment the dataset programmatically. First, we pick article images that are clean enough to produce high OCR accuracy. Then we randomly select noise samples from the news archive. During training, we use ImageDataGenerator with a custom preprocessing_function to randomly mix these noise samples into the training images. This way, the model becomes general enough to handle different noise types.
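The idea, in sketch form (the real preprocessing_function also handles scaling, cropping and blend weights):

import random

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Grayscale noise crops sampled from the archive, scaled to [0, 1] and the
# same size as the training patches (loaded elsewhere).
noise_samples: list = []

def mix_noise(img: np.ndarray) -> np.ndarray:
    # img arrives as (height, width, channels) floats in [0, 1]; np.minimum keeps
    # the darker pixel, i.e. stamps the dark noise onto the white page.
    noise = random.choice(noise_samples)
    return np.minimum(img, noise[..., np.newaxis])

datagen = ImageDataGenerator(preprocessing_function=mix_noise)
# Training pairs: noisy input (x is augmented), clean target (y is untouched), e.g.
# model.fit(datagen.flow(clean_patches, clean_patches, batch_size=32), ...)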

Orchestration

Since all Newsroom-related services are hosted on Kubernetes, it was a natural choice to put the API and the pipeline-related services on Kubernetes as well.

Once the upload step is finished, a simple Python script walks the lists of record IDs from the API and passes them to Celery, using RabbitMQ as the message broker. The prep and ocr services subscribe to the RabbitMQ queues, pick up record IDs and run the processes described above.
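On the worker side, the Celery setup is roughly this (task and queue names are illustrative):

from celery import Celery

app = Celery("pipeline", broker="amqp://rabbitmq:5672//")

@app.task(name="prep")
def prep(record_id: str):
    ...  # fetch the record from the API and run the image processing steps above

@app.task(name="ocr")
def ocr(record_id: str):
    ...  # denoise with the DAE, run Tesseract, upload text and PDF via the API

# The dispatch script simply walks the record IDs returned by the API:
for record_id in record_ids:  # record_ids fetched from the API beforehand
    prep.apply_async(args=[record_id], queue="prep")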

A Horizontal Pod Autoscaler (HPA) is set on the “prep” and “ocr” services, with a maximum of 200 pods running at the same time. Because we are using AliCloud Elastic Container Instance (ECI), we can create pods outside our cluster nodes: AliCloud provisions compute instances to run those pods, and the cost is billed per second. That gives us a very cost-efficient way to run a large number of short-lived jobs.

Figure 8. A simple diagram shows the interactions between different components in the project.

This is the current setup of the OCR project, and we look forward to migrating the improved OCR text into Nexus.

Special thanks

  • Celery
  • FastAPI
  • Kubernetes
  • OpenAPI Generator
  • OpenCV
  • TensorFlow
