We are on the brink of discovering how document digitization can help us unlock the value trapped inside documents. The technology startup turicode specializes in extracting data from documents. In this four-part series of articles, we examine the opportunities and challenges of document digitization from different perspectives:
Part I — Business Solution: “Documents to Value”
Part II — Machine Learning
Part III — User Experience
Part IV — Software Architecture
Part I — Business Solution: “Dear document, please provide me with all IBAN numbers!”
Imagine sitting in front of a big pile of printed invoices, plus a few more bills you received by email. Your task is to enter the information contained in these documents into your payment software, which until now has meant a lot of cumbersome and error-prone manual typing. What if you could simply say: “Dear document, please provide me with all IBAN numbers, ASAP!” The value is evident: data retrieval becomes faster and more reliable.
Digitization brings this wish within reach in big strides. Automatically grasping the meaning of text, images and tables in unstructured documents opens up completely new possibilities:
- Firstly, tasks that previously required manual intervention can now be automated.
- Secondly, linking the extracted data with the company’s own data model enables analytics at an unprecedented level of detail.
Phase 1: From paper to a digital scan
In the first step of digitization, paper documents are transformed into an electronic copy. Scanned documents can be distributed faster and viewed regardless of place and time, naturally in compliance with the applicable data protection rules. However, for both business and science, scanned documents are of limited use because the data is only available as an image (e.g., in TIFF or JPG format). Although we have now collected the stack of paper in an electronic format, searching for account information, like any further processing, still involves considerable manual effort.
Phase 2: From a scan to a ‘searchable’ PDF
A first remedy is to transform the scan into a searchable PDF (Portable Document Format) by means of OCR (Optical Character Recognition). In this automatic process, letters and numbers are recognized as such, which enables a keyword search within the document. In recent years, OCR technologies have evolved significantly, and today’s solutions cover a wide range of price and quality requirements.
In our example, a keyword search for “IBAN” might now yield 72 hits. However, the manual task of copying and pasting the individual account numbers remains. The truly relevant information is still stuck in the documents and cannot be retrieved automatically.
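To illustrate why a keyword search alone falls short, here is a minimal Python sketch. It assumes the OCR step has already produced one plain-text string per page; the function name `keyword_hits` and the sample pages are hypothetical and only serve the illustration.

```python
import re


def keyword_hits(page_texts, keyword):
    """Return a (page_number, character_offset) pair for each keyword occurrence."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for match in pattern.finditer(text):
            hits.append((page_number, match.start()))
    return hits


# Hypothetical OCR output, one string per page.
pages = [
    "Invoice 2017-001\nIBAN: CH93 0076 2011 6238 5295 7\nAmount: 120.50",
    "Please transfer the amount to the IBAN stated on page 1.",
]

hits = keyword_hits(pages, "IBAN")
print(len(hits), "hits:", hits)
```

The search reports where the word “IBAN” occurs, but not the account numbers themselves. The second hit is not even followed by one, so someone still has to look at every hit.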
Phase 3: From a searchable PDF to an intelligent document
The deciding factor for the automatic analysis of unstructured data is whether the machine is able to understand information in its context, much as humans do. Thanks to technological advances and new methods based on artificial intelligence, we now have tools that enable an additional dimension of information retrieval. Bridging the gap from a searchable PDF to an intelligent document requires interaction between humans and machines on two levels.
- Machine empowers human: In a first step, the machine prepares the document so that the previously flat representation is hierarchically structured. For instance, the textual part of the document is broken down into individual letters and numbers (so-called glyphs), enriched with information on position, color, font, etc., and then reassembled into words, sentences and entire text sections. Other content, such as images, spreadsheets or vector graphics, undergoes similar structuring processes. The document is thereby transformed into a structured data tree that the machine can then set out to interpret.
- Human empowers machine: In a second step, the user gives meaning to the elements by telling the machine what information is relevant and how to retrieve it (e.g. IBAN numbers). This is done either with traditional rule-based approaches, such as keywords or the position in the document, or with modern machine learning algorithms that enable the machine to understand semantic content in a cognitive fashion. What’s left is to decide on the desired export format to integrate the structured output into existing systems, or to publish the information obtained via a web or mobile application.
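The first step above, reassembling positioned glyphs into words, can be sketched in a few lines of Python. This is a simplified illustration, not turicode’s implementation. The `Glyph` and `Word` classes, the `gap_factor` threshold and the sample coordinates are assumptions made for the example, and it only handles glyphs that share exactly the same baseline.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline of the glyph
    width: float
    font: str


@dataclass
class Word:
    text: str
    x: float
    y: float
    font: str


def _to_word(glyphs):
    """Build a Word from a non-empty run of glyphs, keeping the first glyph's position and font."""
    first = glyphs[0]
    return Word("".join(g.char for g in glyphs), first.x, first.y, first.font)


def group_glyphs_into_words(glyphs, gap_factor=0.5):
    """Merge glyphs on the same baseline into words.

    A new word starts when the baseline changes or when the horizontal gap
    to the previous glyph exceeds gap_factor times that glyph's width.
    """
    words = []
    current = []
    # Sort so that glyphs on the same baseline are adjacent and ordered left to right.
    for glyph in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if current:
            prev = current[-1]
            gap = glyph.x - (prev.x + prev.width)
            if glyph.y != prev.y or gap > gap_factor * prev.width:
                words.append(_to_word(current))
                current = []
        current.append(glyph)
    if current:
        words.append(_to_word(current))
    return words


def glyphs_for(text, start_x, y=100.0, width=6.0, font="Helvetica"):
    """Hypothetical helper: lay out the characters of text side by side."""
    return [Glyph(c, start_x + i * width, y, width, font) for i, c in enumerate(text)]


glyphs = glyphs_for("IBAN", 0.0) + glyphs_for("CH93", 30.0)
words = group_glyphs_into_words(glyphs)
print([w.text for w in words])
```

Words built this way can in turn be grouped into lines and text sections, which is how the structured data tree described above would take shape.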
Returning to our example, we can now extract all IBAN numbers, or any other information, from a large body of documents within seconds and save them, for instance, in an Excel spreadsheet together with the invoice number and the amount to be paid.
New technologies and better interaction between humans and machines unlock the inherent value of information in documents more efficiently and more effectively. In the future, connected thinking and a healthy dose of creativity will replace diligence and patience in data retrieval and data processing.
turicode’s solution unlocks the full potential of documents. In Part II, we explain in more detail how humans can train machines to grasp the meaning of content in documents.
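As a rough sketch of what such a rule-based extraction could look like, the following Python example finds IBAN candidates with a regular expression, keeps only those that pass the standard mod-97 checksum, and writes the result as CSV, a format Excel can open. The “Invoice” and “Amount:” labels, the sample documents and all function names are hypothetical. Country-specific IBAN lengths are not checked here.

```python
import csv
import io
import re

# Two letters, two check digits, then groups of up to four characters,
# optionally separated by single spaces.
IBAN_PATTERN = re.compile(
    r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
)


def is_valid_iban(candidate):
    """Check the mod-97 checksum of a candidate IBAN."""
    compact = candidate.replace(" ", "")
    rearranged = compact[4:] + compact[:4]
    # Letters become two-digit numbers: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def extract_ibans(text):
    """Return all checksum-valid IBANs found in text, without spaces."""
    return [
        m.group().replace(" ", "")
        for m in IBAN_PATTERN.finditer(text)
        if is_valid_iban(m.group())
    ]


def extract_row(text):
    """Pull invoice number, amount and IBANs out of one document's text."""
    invoice = re.search(r"Invoice\s+(\S+)", text)
    amount = re.search(r"Amount:\s*([\d.]+)", text)
    return {
        "invoice_number": invoice.group(1) if invoice else "",
        "amount": amount.group(1) if amount else "",
        "iban": ";".join(extract_ibans(text)),
    }


def to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["invoice_number", "amount", "iban"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical document texts; the third contains an IBAN with wrong check digits.
documents = [
    "Invoice 2017-001\nIBAN: CH93 0076 2011 6238 5295 7\nAmount: 120.50",
    "Invoice 2017-002\nIBAN: DE89 3704 0044 0532 0130 00\nAmount: 89.90",
    "Invoice 2017-003\nIBAN: CH94 0076 2011 6238 5295 7\nAmount: 10.00",
]

rows = [extract_row(text) for text in documents]
csv_text = to_csv(rows)
print(csv_text)
```

The checksum filters out misread or mistyped account numbers, which is why the third row ends up with an empty IBAN field. Rules like these work well for rigidly formatted values such as IBANs.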