Mastering Optical Character Recognition (OCR) with Tesseract on Your Local System

Shreyash
Oct 22, 2023


Introduction
In this tutorial, we’ll walk through Optical Character Recognition (OCR) with Tesseract, an open-source OCR engine. You’ll learn how to set up Tesseract on your local system, install the Python packages that drive it, and extract text from scanned PDFs. Let’s get started!

Step 1: Installing Tesseract

Before you can use Tesseract, you need to install it on your local system. The installation process varies depending on your operating system. Here’s how you can do it on a few common systems:

For Windows:
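- Download a Tesseract installer, starting from the [Tesseract GitHub Releases page](https://github.com/tesseract-ocr/tesseract/releases). If no Windows installer is listed there, the Tesseract documentation links to the Windows builds maintained by UB Mannheim.
- Run the installer and follow the installation instructions. Note the install directory; you may need the path to the executable in Step 3.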

For macOS (Homebrew):
- Open your terminal and run the following command to install Tesseract using Homebrew:

```bash
brew install tesseract
```

For Linux (Ubuntu/Debian):
- Open your terminal and run the following command to install Tesseract:

```bash
sudo apt-get install tesseract-ocr
```
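
Before moving on, it can help to confirm that the install worked. The following is a minimal sketch, assuming the executable is named `tesseract` and is reachable on your PATH; the helper names `parse_language_list` and `check_tesseract` are made up for this example. It relies on the `tesseract --list-langs` command to report the installed language data.

```python
import shutil
import subprocess


def parse_language_list(output):
    """Parse the output of `tesseract --list-langs` into a list of language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first non-empty line is a header such as "List of available languages (2):"
    return lines[1:]


def check_tesseract():
    """Return (path, languages), or (None, []) if tesseract is not on PATH."""
    path = shutil.which("tesseract")
    if path is None:
        return None, []
    result = subprocess.run(
        [path, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case this build prints the list there
    return path, parse_language_list(result.stdout or result.stderr)


if __name__ == "__main__":
    path, languages = check_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print(f"Found tesseract at {path}")
        print("Installed languages:", ", ".join(languages))
```

If a language you need is missing from the list, install its language data before running the OCR script. On Ubuntu/Debian these are typically separate packages, for example `tesseract-ocr-ara` for Arabic.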

Step 2: Installing Required Python Packages

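Now that you have Tesseract installed, let’s set up the Python environment for OCR. `pytesseract` is a Python wrapper around the Tesseract executable, `pdf2image` converts PDF pages to images, and `pillow` provides the image objects both of them use. Note that `pdf2image` relies on the Poppler utilities, which are installed separately (for example, `brew install poppler` on macOS or `sudo apt-get install poppler-utils` on Ubuntu/Debian). Install the Python packages with: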

```bash
pip install pdf2image pytesseract pillow
```
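
A quick preflight check can catch a missing dependency before the OCR script fails halfway through. This is a rough sketch, not part of any of these libraries: the helper names are hypothetical, and it assumes Poppler's `pdftoppm` tool (which `pdf2image` calls to render pages) should be on your PATH.

```python
import importlib.util
import shutil

# Import names of the packages installed above (PIL is provided by pillow)
REQUIRED_MODULES = ["pdf2image", "pytesseract", "PIL"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def poppler_available():
    """Return True if Poppler's pdftoppm executable is on PATH."""
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    if not poppler_available():
        print("pdftoppm (Poppler) was not found on PATH")
    if not missing and poppler_available():
        print("All dependencies look available")
```

On systems where Poppler is installed but not on PATH, `convert_from_path` also accepts a `poppler_path` argument pointing at the directory that contains the Poppler binaries.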

Step 3: Writing the OCR Code

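You can now use Tesseract to extract text from scanned PDFs. The example script below converts each page of the PDF to an image with `pdf2image`, runs Tesseract on every page through `pytesseract`, and concatenates the results. The `lang` argument is set to `ara+eng` (Arabic and English); change it to match your document, and make sure the corresponding language data is installed.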

```python
from pdf2image import convert_from_path
import pytesseract


def extract_text_from_scanned_pdf(pdf_path):
    # Only needed if the tesseract executable is not on your PATH:
    # pytesseract.pytesseract.tesseract_cmd = '/path/to/your/tesseract/executable'  # Update this path

    # Convert PDF pages to PIL images using pdf2image
    pages = convert_from_path(pdf_path)

    # Initialize an empty string to store the extracted text
    extracted_text = ""

    # Process each page image
    for page_image in pages:
        # Perform OCR on the image; lang selects the language data to use
        # ('ara+eng' = Arabic and English, both must be installed)
        page_text = pytesseract.image_to_string(page_image, lang='ara+eng')

        # Append the extracted text to the result
        extracted_text += page_text

    return extracted_text


if __name__ == "__main__":
    pdf_path = 'your_scanned_document.pdf'  # Replace with the path to your scanned PDF
    extracted_text = extract_text_from_scanned_pdf(pdf_path)

    # Print or save the extracted text
    print(extracted_text)
```
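
The script above appends the text of each page directly to the previous one, so the output carries no trace of where one page ends and the next begins. If you need to map text back to a page, one option is to collect the per-page strings in a list and join them with a marker. The `join_pages` helper and the marker format below are illustrative choices, not something `pytesseract` provides:

```python
def join_pages(page_texts):
    """Join per-page OCR strings, labelling each with its 1-based page number."""
    sections = []
    for page_num, text in enumerate(page_texts, start=1):
        # strip() removes the trailing whitespace OCR output often ends with
        sections.append(f"--- Page {page_num} ---\n{text.strip()}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    # Stand-in strings; in the real script these would be the OCR results per page
    print(join_pages(["First page text\n", "Second page text\n"]))
```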

Step 4: Running the Code

To use the script, replace `'your_scanned_document.pdf'` with the path to your own scanned PDF document. Run the script, and it will extract the text and print it to the console. You can also save the extracted text to a file for further use.
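
As a sketch of the save-to-file option, the helper below writes the text with an explicit UTF-8 encoding, which matters for non-Latin scripts such as Arabic. The `save_text` name and the output filename are placeholders chosen for this example.

```python
from pathlib import Path


def save_text(text, output_path):
    """Write the extracted text to output_path as UTF-8 and return the path."""
    path = Path(output_path)
    path.write_text(text, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Stand-in string; in the real script this would be the extracted text
    saved = save_text("مرحبا Hello", "extracted_text.txt")
    print(f"Saved text to {saved}")
```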

Conclusion

With Tesseract and this tutorial, you now have the knowledge to unlock the hidden text in scanned PDFs and images. Whether it’s for data extraction, content analysis, or information retrieval, OCR with Tesseract is a valuable skill that can be applied to various domains.
