Tesseract OCR implementation in .NET Core & Spring Boot

Fatih Yıldızlı
Published in turkcell · Oct 4, 2019

My Purpose

This article shows how to implement Tesseract OCR with .NET Core and with Spring Boot. Both projects were written as proofs of concept, without any high-level architecture or software design patterns, so they can quickly illustrate the core of a Tesseract OCR integration. For that reason I chose two enterprise platforms, .NET Core and Java, and implemented both projects as REST APIs. That is enough introduction. Let’s begin ↩

What is Tesseract OCR (Optical Character Recognition)?

Tesseract OCR is an open-source OCR engine. Google began sponsoring its development in 2006. 🤙

Basically, this technology recognises text inside images such as scanned documents, photos, screenshots, and PDFs. OCR converts virtually any kind of image containing scanned, written, or photographed text into machine-readable text data.

Basic schema for OCR

History

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP.

Tesseract has unicode (UTF-8) support, and can recognize more than 100 languages “out of the box”.

Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.

( Reference: https://github.com/tesseract-ocr/tesseract#brief-history )
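To make the TSV output mentioned above a bit more concrete, here is a minimal Java sketch that keeps only the words Tesseract recognised with a given confidence. It assumes the usual 12-column TSV layout (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text) in which word rows have level 5. The class name TsvWordFilter, the method confidentWords, and the threshold value are hypothetical names chosen for illustration; they are not part of Tesseract or of the two projects in this article.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: filter words from Tesseract TSV output by confidence (hypothetical helper). */
final class TsvWordFilter {

    // Assumed column positions in the TSV layout:
    // level page_num block_num par_num line_num word_num left top width height conf text
    private static final int LEVEL = 0;
    private static final int CONF = 10;
    private static final int TEXT = 11;
    private static final String WORD_LEVEL = "5"; // rows at level 5 describe single words

    /** Returns the non-empty words whose confidence is at least minConf. */
    static List<String> confidentWords(String tsv, double minConf) {
        List<String> words = new ArrayList<>();
        for (String line : tsv.split("\\r?\\n")) {
            // Limit 12 keeps the text column in one piece and preserves an empty text field.
            String[] cols = line.split("\t", 12);
            if (cols.length < 12 || !WORD_LEVEL.equals(cols[LEVEL])) {
                continue; // header row, page/block/line rows, or malformed input
            }
            double conf;
            try {
                conf = Double.parseDouble(cols[CONF]); // accepts "96" as well as "96.06"
            } catch (NumberFormatException e) {
                continue;
            }
            String text = cols[TEXT].trim();
            if (!text.isEmpty() && conf >= minConf) {
                words.add(text);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-written sample rows, not real Tesseract output.
        String sample =
              "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
            + "5\t1\t1\t1\t1\t1\t10\t10\t50\t20\t96\tHello\n"
            + "5\t1\t1\t1\t1\t2\t70\t10\t50\t20\t41\tw0rld\n";
        System.out.println(confidentWords(sample, 60.0));
    }
}
```

Filtering on the conf column is a simple way to drop doubtful words before the text is returned from a REST endpoint; the right threshold depends on the images being scanned.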

🚀 .NET CORE IMPLEMENTATION

🔗 Dependencies

System.Reflection.Emit - Version 4.6.0

Tesseract - Version 3.3.0

📌 Tesseract OCR implementation code block in .NET Core

Repository link: https://github.com/fatihyildizli/dotnetcore-tesseract-ocr

📌 Input Image:

📌 Result:

9.83 seconds elapsed

🍃 SPRING BOOT IMPLEMENTATION

🔗 Dependencies

net.sourceforge.tess4j - Version 3.4.0 (pom.xml)

Java - Version 1.8

📌 Tesseract OCR implementation code block in Spring Boot

Repository link: https://github.com/fatihyildizli/springboot-tesseract-ocr
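The actual Tess4J-based code lives in the repository above. As a dependency-free illustration of the same image-in, text-out step, here is a rough sketch that calls the tesseract command-line tool through ProcessBuilder instead of using Tess4J. It assumes the tesseract executable is installed and on the PATH, and it relies on the CLI convention that passing stdout as the output base prints the recognised text to standard output. The class TesseractCli and its methods are hypothetical names for this sketch, not code from the repository.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch: run the tesseract CLI and capture its text output (not the Tess4J approach). */
final class TesseractCli {

    /** Builds the command line: tesseract <image> stdout -l <language>. */
    static List<String> buildCommand(String imagePath, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");  // assumes the executable is on the PATH
        cmd.add(imagePath);
        cmd.add("stdout");     // write the recognised text to standard output
        cmd.add("-l");
        cmd.add(language);     // e.g. "eng" or "tur"
        return cmd;
    }

    /** Runs tesseract on one image and returns the recognised text. */
    static String recognize(String imagePath, String language)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(imagePath, language));
        // Send diagnostics to this process's stderr so they do not mix with the text.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TesseractCli <image> [language]");
            return;
        }
        String language = args.length > 1 ? args[1] : "eng";
        System.out.print(recognize(args[0], language));
    }
}
```

In a Spring Boot REST API, a method like recognize could be called from a controller after the uploaded image is saved to a temporary file. Tess4J avoids the extra process by binding to the native library directly.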

📌 Input Image:

📌 Result:

23.52 seconds elapsed
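The article does not show how the elapsed times were measured. One plausible way to produce a figure like the one above is to wrap the OCR call with System.nanoTime, as in the sketch below. OcrTimer, Timed, time, and format are hypothetical names for illustration, and the lambda in main is only a stand-in for a real OCR call.

```java
import java.util.Locale;
import java.util.function.Supplier;

/** Sketch: measure how long a task (for example an OCR call) takes. */
final class OcrTimer {

    /** Holds a task's result together with its wall-clock duration in seconds. */
    static final class Timed<T> {
        final T value;
        final double seconds;

        Timed(T value, double seconds) {
            this.value = value;
            this.seconds = seconds;
        }
    }

    /** Runs the task once and records the elapsed wall-clock time. */
    static <T> Timed<T> time(Supplier<T> task) {
        long start = System.nanoTime();
        T value = task.get();
        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        return new Timed<>(value, seconds);
    }

    /** Formats a duration in the same style as the results quoted in this article. */
    static String format(double seconds) {
        return String.format(Locale.ROOT, "%.2f seconds elapsed", seconds);
    }

    public static void main(String[] args) {
        // Placeholder task; a real measurement would call the OCR engine here.
        Timed<String> result = time(() -> "recognised text");
        System.out.println(result.value);
        System.out.println(format(result.seconds));
    }
}
```

A single run like this includes one-time costs such as loading the language data, so repeating the call a few times gives a fairer comparison between the two implementations.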
