Comprehensive Analysis of OCR Solutions for High-Volume French Document Processing: Performance, Accuracy, and Cost Evaluation

Aimen Louafi
Published in Inside Doctrine
Apr 24, 2024

Imagine a world where navigating through volumes of detailed corporate documents is as easy as a simple keyword search or skimming a Wikipedia page: every legal alteration, significant business event, or structural amendment neatly catalogued and made readily accessible in a few clicks. This isn’t a feature of some distant futuristic movie but a contemporary reality that is reshaping how businesses manage and retrieve critical corporate documents.

Company deeds

Company deeds, or actes d’entreprises in French, serve as the backbone of legal and corporate governance, documenting crucial activities and changes within a company. These documents range from company bylaws (statuts) and their amendments, minutes of general meetings, and contracts, to mandatory legal formalities such as the registration and filing of trademarks and patents. Traditionally, handling these vital records has been predominantly manual, labor-intensive, and paper-based: a method fraught with inefficiencies, prone to errors, and inconveniently slow.

But what if there was a way to transform this cumbersome process into an efficient, streamlined, and error-minimized operation?

Doctrine is at the forefront of this revolution. By digitizing the content of company deeds and making them available online, Doctrine is making strides in simplifying the huge task of managing and navigating through these documents.

Challenges and uncertainties

Despite the advantages, the path to digital transformation is not without its challenges. Company deeds are lengthy and complex documents, some extending to dozens of pages and containing multiple sub-documents. They are frequently stored as image-based PDF files, making data extraction particularly challenging. When you consider the sheer volume (over 33 million public company deeds), the task becomes daunting.

Moreover, specific data points, like financial balance sheets or capital distribution, are often presented in tables, a format that traditional data extraction software struggles to handle due to its complexity.

Critical challenges include selecting appropriate tools for content extraction, balancing computation time against cost, and extracting complex table formats.

These demands necessitate rigorous benchmarking of tools and deep evaluations focusing on performance, accuracy, and adaptability.

Listing all OCR solutions

Content Extraction from PDF Documents

For extracting content from PDFs, various publicly available Optical Character Recognition (OCR) modules were reviewed. OCR modules are used to extract text from image documents. This includes:

  • Public libraries such as Tesseract, EasyOCR, and PaddleOCR, which are prominent for their text extraction capabilities.
  • Commercial APIs from major players like AWS Textract and Google Vision were also tested to gauge their effectiveness in complex OCR tasks.
  • Scientific literature provided additional insights into evolving OCR techniques and their applications.

Table Extraction from PDFs

Extracting tables from PDF documents poses unique challenges due to the variety of formats and structures encountered:

  • Tools like PaddleOCR and TableTransformer were examined for their ability to efficiently navigate and extract tabular data.
  • Commercial solutions such as Google Document AI and AWS Textract Tables were scrutinized to understand their proprietary capabilities.
  • Innovations from scientific research were also explored, offering new approaches to table extraction.

Note that this analysis was conducted in July 2023. Since then, there may have been developments in performance metrics, costs, and inference times, as well as the release of new tools.

OCR performance

Defining OCR Evaluation Metrics

In OCR technology, accuracy is pivotal. Errors during text extraction can broadly be categorized into three types: substitution errors, deletion errors, and insertion errors. A substitution error occurs when a character from the document is incorrectly recognized as another character. A deletion error is noted when a character present in the document is not recognized at all. Lastly, an insertion error takes place when a character not present in the original document is erroneously added to the extracted text.

Deletion, Insertion, Substitution

Given a target text (the accurate transcription of the document) and the OCR-extracted text, we can assess the OCR performance by comparing these two texts. The evaluation involves counting the number of substitutions (S), deletions (D), and insertions (I).

Character Error Rate (CER)

One crucial metric we employ is the Character Error Rate (CER), defined as the sum of these three errors (S, D, and I) divided by the total number of characters (N) in the target text. This metric provides a granular view of the errors at the character level between two textual documents.

CER = (S + D + I) / N

Word Error Rate (WER)

Similarly, the Word Error Rate (WER) measures errors at the word level. It is the sum of word substitutions (Sw), word deletions (Dw), and word insertions (Iw), divided by the total number of words (Nw) in the target text. Both metrics aim to provide a detailed account of the quality of the transcription rendered by OCR tools.

WER = (Sw + Dw + Iw) / Nw
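The two metrics above can be sketched in plain Python, using a standard Levenshtein alignment to count substitutions, deletions, and insertions. This is an illustrative, stdlib-only sketch, not the evaluation code we actually ran:

```python
def edit_ops(ref, hyp):
    """Count (substitutions, deletions, insertions) needed to turn
    the reference sequence into the hypothesis (Levenshtein alignment)."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimal edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Backtrack through the table to attribute each error to a type.
    S = D = I = 0
    i, j = m, n
    while i > 0 or j > 0:
        diag_cost = 0 if (i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]) else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + diag_cost:
            S += diag_cost
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return S, D, I

def cer(target, ocr_text):
    """Character Error Rate: (S + D + I) / N over characters."""
    s, d, i = edit_ops(list(target), list(ocr_text))
    return (s + d + i) / len(target)

def wer(target, ocr_text):
    """Word Error Rate: (Sw + Dw + Iw) / Nw over whitespace-split words."""
    s, d, i = edit_ops(target.split(), ocr_text.split())
    return (s + d + i) / len(target.split())
```

For example, misreading "société anonyme" as "societe anonyme" produces two character substitutions (CER = 2/15) but one whole wrong word (WER = 1/2), which already hints at why the two metrics can diverge.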

Challenges with CER and WER

While CER and WER are widely used and offer valuable insights, they also have their limitations. For instance, they are sensitive to word segmentation issues. If two words are conjoined without a space, it severely impacts these metrics. Additionally, if the text is recognized correctly, but the order of lines or words is slightly altered, these metrics could unfairly penalize the OCR output despite all target text words being present.

Jaccard Similarity

To complement these traditional metrics and to address some of their limitations, we have incorporated the Jaccard Similarity index into our evaluation framework. This metric measures the proportion of common words between the target text and the OCR output, effectively assessing text similarity irrespective of word order.
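Because the Jaccard index operates on sets of words, it is insensitive to word order by construction. A minimal sketch:

```python
def jaccard_similarity(target, ocr_text):
    """Proportion of shared words between target and OCR output:
    |intersection| / |union|, ignoring order and duplicates."""
    a = set(target.lower().split())
    b = set(ocr_text.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A reordered but otherwise perfect transcription scores 1.0 here, whereas CER and WER would penalize it.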

Evaluating OCR Solutions

To assess the performance of various OCR modules, we established a dataset containing documents which were transcribed verbatim by humans. These provide the ‘gold standard’ target texts against which OCR outputs are evaluated.

Selected OCR Solutions for Evaluation:

1. AWS Textract by Amazon Web Services
2. Google Vision by Google Cloud
3. ABBYY Finereader by ABBYY
4. pytesseract, an open-source Python tool based on Tesseract, configured for French
5. PaddleOCR, an open-source solution
6. EasyOCR, another open-source solution

Transformer-based solutions, such as Donut or TrOCR, were not selected due to their very poor performance on French documents.

We executed these OCR solutions on our dataset and calculated the three chosen metrics: CER, WER, and Jaccard Similarity.

Performance results

Preliminary results indicate that AWS Textract is notably robust across various metrics, demonstrating high performance. Google Vision also shows strong performance levels. Interestingly, the open-source PaddleOCR competes closely with these paid solutions, particularly in Character Error Rate, though it lags slightly in Word Error Rate and Jaccard Similarity.

Insights and Limitations

From a qualitative standpoint, it was observed that PaddleOCR sometimes struggles with French accents, often confusing grave accents with acute ones. This significantly impacts the Word Error Rate and Jaccard Similarity, as it affects entire words, whereas its impact on CER is minimal since it affects only a single character.

OCR inference time

In our evaluation of OCR solutions, we also considered the processing time required to extract content from a single page. This is crucial for large-scale usage: processing 30 million multi-page documents at an average of 3 seconds per page would take about 16,000 days of sequential computation.
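The back-of-the-envelope arithmetic behind that figure, assuming an average of roughly 15 pages per document (an assumed value for illustration; only the totals are quoted above):

```python
# Throughput estimate: page count per document is an assumption.
documents = 30_000_000
pages_per_document = 15        # assumed average
seconds_per_page = 3

total_pages = documents * pages_per_document       # 450 million pages
total_seconds = total_pages * seconds_per_page
days_sequential = total_seconds / 86_400           # ~15,625 days, i.e. ~16,000

# Parallelism reduces this linearly: 100 concurrent workers -> ~156 days.
days_with_100_workers = days_sequential / 100
```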

Inference time per page

Processing Time Evaluation

Among the OCR tools tested, PaddleOCR stood out for its exceedingly fast inference time per page, thanks largely to its optimization for GPU usage. This makes it particularly suitable for rapid processing of substantial datasets.

Parallel Processing Capabilities

For paid services like AWS Textract, Google Vision, and ABBYY Finereader, we explored their ability to execute multiple requests concurrently. This parallel processing can significantly cut down total computation time. However, despite the ability to make several API calls simultaneously, there are built-in limits to these services that may restrict full utilization of parallel processing, such as API rate limits and cost considerations which could affect scalability and efficiency.
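A minimal sketch of this fan-out with Python's standard library, where `ocr_page` is a hypothetical stand-in for a real API call (e.g. to AWS Textract) and `max_workers` caps concurrency to stay under provider rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_id):
    """Hypothetical placeholder for a real OCR API call."""
    return f"text of page {page_id}"

def ocr_batch(page_ids, max_workers=8):
    # Threads suit this I/O-bound workload; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, page_ids))
```

In practice the worker count has to be tuned against each provider's documented quota, and throttling or retry logic added around the actual API call.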

Table extraction performance

Following the strategy we adopted for text extraction, evaluating different table extraction solutions was the next step. The selection of evaluation metrics again required a deep dive into scientific literature to determine how table extraction should be assessed. Unlike text extraction, reliable and robust metrics for evaluating the quality of table transcription are harder to define due to several complications:

1. Complex Table Structures: Tables can vary widely, featuring merged cells, missing rows or columns, and repeated headers. This diversity in table structures complicates the creation of universal metrics that can adequately handle such variability.
2. Recognition Errors: Besides structural inaccuracies, OCR recognition errors within tables (e.g., misreading text within a cell) further challenge the assessment process.
3. Existing Metrics: While certain metrics like the Tree Edit Distance-based Similarity (TEDS) have been proposed, they often fall short in practical applications, especially when dealing with merged cells or other structural anomalies.

Given these challenges, we opted for a qualitative evaluation approach for table extraction, scoring each tool on a scale from 0 to 100%. We penalized any structural errors, such as cell merging, misalignments in rows/columns, as well as OCR mistakes within the table content.
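Conceptually, such a rubric amounts to subtracting penalty points from a perfect score. The sketch below illustrates the idea; the penalty weights are invented for the example, not the ones used in our evaluation:

```python
def table_score(structural_errors, ocr_errors,
                structural_penalty=10, ocr_penalty=2):
    """Qualitative 0-100 score for one extracted table.
    Structural errors (merged cells, misaligned rows/columns) are
    penalized more heavily than OCR mistakes inside cells.
    Penalty weights are illustrative assumptions."""
    score = 100 - structural_errors * structural_penalty \
                - ocr_errors * ocr_penalty
    return max(score, 0)
```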

To conduct this evaluation, we compiled a dataset comprising around twenty tables, deliberately including a diverse mix of table styles — from simple grids to complex tables with merged cells and unclear separators between columns and lines. We tested both free and paid table extraction solutions:

Selected Table Extraction Solutions:
1. AWS Textract by Amazon Web Services, which includes a specific module for table extraction.
2. Google Document AI by Google Cloud.
3. ABBYY Finereader by ABBYY, also featuring a table extraction module.
4. PaddleOCR, an open-source solution that includes a table extraction module named PPStructure.

TableTransformer was also considered, but the performance was not as good as other solutions on our dataset.

Table extraction performance

Performance Outcomes

Our findings indicate that AWS Textract appears to be the undisputed leader in table extraction performance. Its only significant shortcoming was its handling of a particular table oriented horizontally, which might require pre-processing to adjust the image orientation. Despite this, its overall capability remained unmatched.

On the other hand, the free, open-source module from PaddleOCR couldn’t compete with the performance levels of top-tier paid solutions. Nonetheless, it deserves recognition for its accuracy in detecting tables within documents, which is corroborated by scientific publications related to the PPStructure module and public evaluations.

This qualitative approach, while more subjective than quantitative metrics, allowed for a comprehensive assessment of each tool’s ability to handle real-world table complexities. This evaluation methodology enabled us to better understand which solutions were suitable for various scenarios, providing valuable insights for businesses and researchers dealing with data extraction from structured documents.

Costs assessment

In selecting the ideal OCR and table extraction solution for our needs, evaluating the cost implications of each potential option was essential. For paid solutions, the costs are typically calculated based on the price per API call, which includes fees for both OCR processing and table extraction functionalities. In contrast, for free solutions like PaddleOCR, while there is no direct cost per use, there remains the overhead of hosting the required server infrastructure to run the solution, which includes maintenance and operational expenses.

Cost Calculation Overview

We undertook a comprehensive assessment of the total cost necessary for processing all the documents in our dataset, factoring in the respective pricing models of the selected solutions.

For paid solutions, we take into account the cost of processing each page. Extracting tables adds an additional per-page cost. For the table estimate, we only count pages that actually contain tables (around 11% of all pages).

For free solutions, we take into account the number of pages, the time required to process each page, and the cost per hour of a GPU instance on AWS (ml.g4dn.xlarge in our case).
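The two cost models can be sketched as follows. The per-page prices and GPU hourly rate below are placeholder values for illustration, not the actual figures from our assessment:

```python
def paid_cost(pages, price_per_page, table_price_per_page,
              table_page_share=0.11):
    """API pricing: every page is billed for OCR; table extraction is
    billed only on the ~11% of pages that actually contain tables."""
    return (pages * price_per_page
            + pages * table_page_share * table_price_per_page)

def gpu_cost(pages, seconds_per_page, hourly_rate):
    """Self-hosted pricing: total GPU-hours times the instance rate
    (e.g. an AWS ml.g4dn.xlarge)."""
    return pages * seconds_per_page / 3600 * hourly_rate
```

With placeholder prices, 1 million pages at $0.0015/page plus $0.015/page for the 11% of pages with tables comes to $3,150, which shows how quickly per-call fees accumulate at scale.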

Complete cost estimation

Cost Analysis Results

Upon analyzing the costs, it became evident that the expenses associated with extracting tables are generally higher than those for plain text OCR. This is likely due to the additional complexity involved in accurately detecting and structuring tabular data, which often requires more advanced processing capabilities.

Interestingly, the free PaddleOCR solution appeared to be exceptionally cost-effective when compared to paid alternatives. While the initial setup and ongoing server costs need to be considered, it offers a compelling value, especially for large volumes of documents where API call costs of paid services can accumulate significantly.

Economic Considerations

The economic feasibility of PaddleOCR makes it an attractive choice for organizations with tighter budget constraints or those who prefer more control over their processing infrastructure. However, for businesses requiring the highest accuracy and sophistication in table extraction, investing in more costly, robust solutions like AWS Textract might be justified by the superior performance and less hands-on maintenance they provide.

Conclusion

Ultimately, the decision on which OCR and table extraction service to employ should balance both cost and performance needs. This financial analysis adds a vital dimension to our evaluation process, ensuring that our chosen solution not only meets our technical and accuracy requirements but also aligns with our budgetary constraints and long-term operational strategies.

For organizations prioritizing accuracy and requiring high-level table extraction capabilities under tight operational timelines, investing in AWS Textract or similar solutions seems justified. For entities with more modest budgets, or where full control over data processing is necessary, PaddleOCR presents a viable alternative, provided there is capacity to manage the infrastructure. Given our volume and cost constraints, for company deeds we settled on PaddleOCR, as it offers the best tradeoff between cost and performance.

The adoption of OCR technologies and table extraction tools needs to align with specific organizational needs, balancing cost, accuracy, performance, and operational capacity. As technology evolves, so too will the capabilities of these tools, promising future enhancements that may address current limitations.

Our comprehensive analysis establishes a foundation for informed decision-making, ensuring that the selected OCR and table extraction tools not only fulfill technical requirements but also align with financial strategies and long-term business goals.
