Document Parsing with Python & OCR
Detect and extract text, figures, and tables from any type of document with Computer Vision
Summary
In this article, using Python and Computer Vision, I will show how to parse documents, such as PDFs, and extract information.
Document Parsing involves examining the data in a document and extracting useful information from it. It is essential for companies because it removes a lot of manual work. Just imagine going through 100 pages by hand in search of a table, only to copy and paste it somewhere else… how cool would it be to have a program that does it in 1 second?
A popular parsing strategy is to convert the document into an image and apply Computer Vision. Document Image Analysis refers to techniques applied to images of documents to obtain information from pixel data. It can be tricky because, in many cases, there is no clear answer to what the expected result should look like (text, images, charts, numbers, tables, formulas, …). The most widely used technique is OCR.
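To make "obtaining information from pixel data" a bit more concrete, here is a minimal toy sketch. It is not OCR, and it is not meant as the way a real library works: it only assumes that a page has already been turned into a grid of dark and light pixels, and it locates the horizontal bands that contain ink (a simple row-projection idea). The `PAGE` grid and the helper names `find_text_rows` and `bounding_box` are made up for illustration.

```python
# Toy "page image": "1" is a dark (ink) pixel, "0" is background.
# This grid is hypothetical data, standing in for a rendered document page.
PAGE = [
    "0000000000",
    "0111011100",
    "0101010100",
    "0000000000",
    "0000000000",
    "0011111000",
    "0010001000",
    "0011111000",
    "0000000000",
]


def find_text_rows(page):
    """Return (first_row, last_row) spans of consecutive rows that contain ink."""
    spans = []
    start = None
    for y, row in enumerate(page):
        has_ink = "1" in row
        if has_ink and start is None:
            start = y  # a new band of ink begins here
        elif not has_ink and start is not None:
            spans.append((start, y - 1))  # the band ended on the previous row
            start = None
    if start is not None:
        # the last band runs to the bottom edge of the page
        spans.append((start, len(page) - 1))
    return spans


def bounding_box(page, span):
    """Return (left, top, right, bottom) of the ink inside a row span."""
    top, bottom = span
    cols = [
        x
        for y in range(top, bottom + 1)
        for x, pixel in enumerate(page[y])
        if pixel == "1"
    ]
    return (min(cols), top, max(cols), bottom)


blocks = [bounding_box(PAGE, span) for span in find_text_rows(PAGE)]
print(blocks)
```

On this toy grid the script finds two regions, one per band of ink. With a real document, the grid would come from a rendered page image, and each region would still have to be classified (text, table, figure, …), which is exactly the part that makes Document Image Analysis tricky.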
OCR (Optical Character Recognition) is the process of detecting and extracting text in images through Computer Vision. It was invented during World War I, when Israeli scientist Emanuel Goldberg created a machine that could read characters and convert them into…