Improving the latency of an OCR-based system
Are you stuck in a slow OCR-based system? Here are some tips and tricks.
Recently, I had to work on a system that relied heavily on OCR. When I searched the internet, there were almost no practical guides on improving the latency of an OCR-based system, just the same generic tips and tricks repeated over and over.
In this article, I aim to share the insights, tips, and tricks I learned while working on an OCR-based system with many implementation restrictions (no GPU, no cloud-based APIs, no 3rd-party software, etc.) that can help you improve the latency of your system.
In this article, we are going to discuss the following points.
- Choosing the Right OCR Engine
- Using Multi-processing and Multi-threading
- Appropriate pre-processing
- Appropriate parameters/hyperparameters for your OCR engine
- Using better hardware
- Do you really need to OCR all frames?
- Quantized models
- Is OCR the worst part of your system?
You most probably won't need all of these techniques if you are free of such restrictions. If you have a good GPU or a cloud-based API, your OCR will already be running in near real-time. But combining one or more of these techniques with your good hardware/API will most probably boost the overall performance of your system.
1. Choosing the Right OCR Engine
Choosing the right OCR engine for your problem is important. Are you working with PDFs or scanned documents? Go for Tesseract, as it works really well with these kinds of documents (with some appropriate pre-processing). Are you working with handwriting, billboards, or table-like data? Try EasyOCR, PaddleOCR, or KerasOCR. If you have cloud infrastructure, go for a fast cloud-based OCR such as Google Cloud Vision OCR, Amazon Textract, or Dropbox's OCR. Here are some comparisons between Tesseract, EasyOCR, and KerasOCR.
2. Using Multi-processing and Multi-threading
Multiprocessing and multithreading are useful techniques for tackling CPU-bound tasks and I/O-bound tasks respectively. While many engines come with built-in multiprocessing/multithreading capabilities, you can always add your own on top to further improve the system's throughput.
In EasyOCR, you can simply pass the number of workers to the readtext method.
ocr.readtext(img, workers=4)
This will perform the OCR on the image in multiple processes.
Similarly, Tesseract can run multiple threads in parallel, as well as use separate multiprocessing. You can set the thread count via the OMP_THREAD_LIMIT environment variable. In Python, this is easy to do with the os module.
import os
os.environ["OMP_THREAD_LIMIT"] = "8"
Similarly, you can use Tesseract's built-in tessedit_parallelize. In my experience, though, Tesseract's built-in multi-threading/processing is not worth using. You can check the last message in this thread, this answer, and this answer on Stack Overflow, which show that external multiprocessing gives a significant boost over Tesseract's built-in module.
You can also use Joblib’s reusable workers if you can formulate your problem that way :). It would make your work a lot easier.
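As a minimal sketch of the external-multiprocessing idea, here is how a batch of images could be fanned out across processes with the standard library. The ocr_image stub is a placeholder for your real engine call (e.g., pytesseract.image_to_string), not an actual OCR implementation:

```python
from concurrent.futures import ProcessPoolExecutor

def ocr_image(path):
    # Stand-in for a real engine call such as pytesseract.image_to_string;
    # replace the body with your OCR engine of choice.
    return f"text from {path}"

def ocr_batch(paths, workers=4):
    # Each image is OCR'd in its own process, side-stepping the GIL
    # for this CPU-bound workload.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_image, paths))

if __name__ == "__main__":
    results = ocr_batch(["page1.png", "page2.png"])
```

Because each worker is a separate process, this scales with your CPU cores even when the engine itself is single-threaded.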
3. Appropriate pre-processing
Using appropriate pre-processing techniques can boost both your accuracy as well as your latency. Some of the common techniques that are widely used in pre-processing images for OCR-based systems are
- Gray-scaling image
Dropping the RGB color channels and working with a single-channel grayscale image reduces the latency of the OCR system and often improves its accuracy as well.
gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
- Normalization
It is good to have the image pixel values on a single scale. This often speeds up the OCR process.
mn = np.min(frame)
mx = np.max(frame)
norm = (frame - mn) * (1.0 / (mx - mn))
Alternatively, you can also divide each pixel by 255 to bring it down to the range 0 and 1.
frame = frame / 255.0
- Noise removal
You can use erode and dilate from OpenCV to perform salt-and-pepper noise removal. You can refer to this article for more details and a Python implementation.
- Binarization and Thresholding
Binarization and thresholding remove the extra noise above a threshold and convert the image to a binary image. The OpenCV documentation covers this really well.
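As a rough sketch of global binarization, written with NumPy instead of OpenCV's cv2.threshold so the logic is explicit (the threshold value 127 is an arbitrary choice):

```python
import numpy as np

def binarize(gray, threshold=127):
    # Pixels above the threshold become white (255), the rest black (0),
    # mimicking cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY).
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

frame = np.array([[10, 200], [130, 90]], dtype=np.uint8)
binary = binarize(frame)  # [[0, 255], [255, 0]]
```

In practice you would use cv2.threshold (or adaptive/Otsu thresholding) directly, but the effect on the image is the same.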
- Image Resizing
Resize while maintaining the aspect ratio of the image (using the fx, fy arguments of OpenCV's resize and an appropriate interpolation method).
resized = cv2.resize(gray_frame, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_CUBIC)
- Here is a great article in the Tesseract documentation that dives into the different techniques you can use according to your use case.
4. Appropriate parameters for your engine
Each OCR engine has tons of different parameters. Tesseract alone provides 600+ parameters that you can explore. Here is a list of all the parameters present in Tesseract. You won't need most of them, but some, such as tessedit_do_invert, can provide a significant speed and accuracy boost, as can an appropriate Page Segmentation Mode based on your input types and output requirements.
Similarly, EasyOCR lets you choose different models based on your preference; for example, you can use the english_g2 model if you don't have much compute power. You can also set the GPU option and the quantization parameter to speed up the process, and select the number of workers to run in parallel.
TLDR; Read the documentation for your OCR engine, and see how other people are using it for different use cases.
5. Using better hardware?
Do you have the option to use better hardware? Are you using a CPU when you have the option of hardware acceleration? Then it would be best to utilize it. OCR engines implemented with popular deep learning frameworks such as Keras, PyTorch, or TensorFlow (i.e., EasyOCR, KerasOCR, etc.) can utilize NVIDIA GPUs via CUDA. Sadly, Tesseract lacks the option of using CUDA. Some people have used OpenCL support to accelerate Tesseract, but setting that up is very difficult and the support is still experimental, so it is not recommended unless you have a good grip on OpenCL.
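As a quick sanity check before committing to a CPU-only pipeline, you can probe for an NVIDIA driver from Python. This is only a rough heuristic based on nvidia-smi being present; frameworks like PyTorch expose torch.cuda.is_available() for a definitive answer:

```python
import shutil
import subprocess

def nvidia_gpu_available():
    # Heuristic: nvidia-smi on the PATH and exiting cleanly suggests a
    # usable NVIDIA GPU, so a CUDA-capable OCR engine can be enabled.
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return False
    return subprocess.run([smi], capture_output=True).returncode == 0

use_gpu = nvidia_gpu_available()
# e.g. ocr = easyocr.Reader(["en"], gpu=use_gpu)
```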
6. Do you really need OCR on all frames?
If you are working with videos and want a real-time OCR system, you need to analyze whether you actually need to perform OCR on all the frames or not. Most probably you won't need to perform an OCR call on every frame.
You can use several different techniques to reduce the number of OCR calls and improve the system's latency. Some of them are:
- Skip a few frames and then OCR. You can easily skip frames in OpenCV via
frame_counter = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame_counter += 1
    if frame_counter % 10 == 0:  # process every 10th frame
        results = ocr(frame)  # OCR the frame
- Perform the OCR call (localization + recognition) and then track the detected bounding boxes over the next few frames using, for example:
- CSRT Tracker
- Optical Flow
- Template Matching for bounding boxes etc.
In this way, you get the localization as well as the OCR.
- Perform the OCR call based on a trigger. Let's say you want to perform the OCR call only when the difference between two frames exceeds a threshold. The difference metric and the threshold are subjective to your application, but the Structural Similarity Index or Optical Flow is a good starting point. Alternatively, you can measure how similar two frames are and perform OCR only when they are less similar than a certain threshold, using techniques such as Cosine Similarity or a Siamese Network.
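A minimal sketch of such a trigger, using mean absolute pixel difference as the change metric (an assumed stand-in for SSIM or Optical Flow; the threshold of 10 is arbitrary):

```python
import numpy as np

def should_ocr(prev_frame, frame, threshold=10.0):
    # Trigger OCR only when the average per-pixel change between two
    # grayscale frames exceeds the threshold.
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    return float(diff.mean()) > threshold

a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 50, dtype=np.uint8)
should_ocr(a, a)  # False: identical frames
should_ocr(a, b)  # True: mean difference of 50 exceeds the threshold
```

Frames that fail the trigger simply reuse the previous OCR result, so the expensive call runs only when the scene actually changes.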
7. Quantized Models
One very effective way to reduce latency is to use quantized models. You can learn more about quantization at this link. Every big OCR engine provides quantized models. For example, Tesseract provides "fast" models that are quantized to 8-bit integers. These are the fastest models while still maintaining good accuracy.
Similarly, EasyOCR also provides a quantization option. You can simply pass it to the Reader's init method.
ocr = easyocr.Reader(
["en"],
gpu=True,
quantize=True,
)
8. Is OCR the worst part of your system?
Is your system really slow because of OCR? A profiler such as cProfile can help you see how much of your code's time is spent on OCR and how much is spent on other parts.
Having a clear picture of where your code spends most of its time can change your perspective, letting you adjust the pipeline or architecture to optimize other portions of the code and thus improve the system as a whole.
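A minimal profiling sketch using only the standard library; fast_step and slow_step are placeholders for your own pipeline stages (pre-processing, OCR, post-processing):

```python
import cProfile
import io
import pstats

def fast_step():
    return sum(range(1_000))

def slow_step():
    return sum(i * i for i in range(200_000))

def pipeline():
    fast_step()
    slow_step()

profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Print the functions sorted by cumulative time; the stage that dominates
# is your real optimization target, which may or may not be the OCR call.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

If the report shows that OCR is only a small slice of the total, optimizing the engine further won't move the needle.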
Learning Outcomes
In this article, I have briefly touched on some of the ways you can improve the latency of your OCR system. There are definitely more details to each of these points, and you can find excellent resources to explore each of them in depth. The main goal of this article was to give you a different way of thinking about these problems.
Let me know your thoughts in the comment section.