Open Sourcing Our PDF Parser

Our engineering team at Funding Societies has open sourced a library in the hope of helping developers, especially those working in the finance sector, parse PDF files more reliably and use them as a source of information for subsequent work.

Funding Societies is a peer-to-peer digital lending company, and a significant part of our effort goes into automating the assessment of borrowers' credit risk. PDF files are a pain point in this process. They are often the only source of financial reports and financial statements, which otherwise need to be read manually. We developed a tool to extract data from PDF files accurately and reliably, using tabula-java. Because tabula-java was initially complicated to use, we wrote a wrapper on top of it and called it tabula-plus. As of today, we have open sourced this wrapper for the benefit of other developers.

How tabula-plus works

Like tabula-java, tabula-plus focuses on extracting table data from PDF files. To extract a table, tabula-plus needs enough information to determine where the table starts and where it ends. Top and bottom identifiers are required, whereas left and right identifiers are optional. A top identifier is a word or phrase that marks the start of the table in the PDF file; similarly, a bottom identifier is a word or phrase that marks the end of the table. Once the identifiers have been defined, tabula-plus can start the extraction. The result is a two-dimensional array containing the data from the table.
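To make the idea of top and bottom identifiers concrete, here is a minimal sketch in plain Java. It is not tabula-plus's actual implementation: the class IdentifierSlicer, its slice method, the rule of splitting cells on two or more spaces, and the sample lines are all assumptions made for illustration. It works on lines of text that have already been pulled out of a page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of tabula-plus or tabula-java.
final class IdentifierSlicer {

    /**
     * Returns the rows between the first line containing topId and the next
     * line containing bottomId. The identifier lines themselves are kept only
     * when the matching "included" flag is true.
     */
    static String[][] slice(List<String> lines, String topId, boolean topIncluded,
                            String bottomId, boolean bottomIncluded) {
        List<String[]> rows = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (!inside) {
                if (line.contains(topId)) {
                    inside = true;
                    if (topIncluded) {
                        rows.add(splitCells(line));
                    }
                }
            } else if (line.contains(bottomId)) {
                if (bottomIncluded) {
                    rows.add(splitCells(line));
                }
                break; // the table ends at the first bottom identifier
            } else {
                rows.add(splitCells(line));
            }
        }
        return rows.toArray(new String[0][]);
    }

    // Assumption: cells are separated by runs of two or more whitespace characters.
    private static String[] splitCells(String line) {
        return line.trim().split("\\s{2,}");
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for text extracted from a PDF page.
        List<String> page = Arrays.asList(
                "Some introductory text",
                "Table 7: Quarterly revenue",
                "Quarter  Revenue",
                "Q1  100",
                "Q2  120",
                "Table 8: Quarterly costs");
        String[][] table = slice(page, "Table 7:", false, "Table 8:", false);
        System.out.println(Arrays.deepToString(table));
        // prints [[Quarter, Revenue], [Q1, 100], [Q2, 120]]
    }
}
```

The point of the sketch is that the caller only names phrases that appear in the document, rather than working out page coordinates.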

This approach makes it easier for developers to indicate which table they are interested in and where it is. With tabula-java, developers need to define the area of the table they are interested in, which is often not apparent.

How to use tabula-plus library

To demonstrate how to use tabula-plus, we will go through an example of extracting data of a table named Table 7 from sample-tables.pdf.

The sample code below extracts data for the aforementioned table.

    PdfSection section = new PdfSection("Table 7");

    // The phrase "Table 7:" marks the start of the table and is not part of its data.
    String[] topIdentifiers = {"Table 7:"};
    section.setTopIdentifiers(topIdentifiers);
    section.setTopIncluded(false);

    // The phrase "Table 8:" marks the end of the table and is not part of its data.
    String[] bottomIdentifiers = {"Table 8:"};
    section.setBottomIdentifiers(bottomIdentifiers);
    section.setBottomIncluded(false);

    PdfSection[] sections = {section};
    PdfParser pdfParser = new PdfParser(sections);
    Map<String, NormalizedTable> tableMap = pdfParser.parse("sample-tables.pdf");

In the code snippet above, the top and bottom identifiers are defined as arrays, so a table can have several of each. This is useful when the same type of table must be extracted from multiple PDF files in which the identifiers differ. In that case, the table starts as soon as any top identifier is encountered and ends as soon as any bottom identifier is encountered.
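The "any identifier" rule can be sketched as follows. This is only an illustration of the matching behaviour described above, not code from tabula-plus: the class AnyIdentifierMatch, the firstMatch helper, and the two sample reports are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of tabula-plus.
final class AnyIdentifierMatch {

    /** Index of the first line at or after 'from' that contains any identifier, or -1. */
    static int firstMatch(List<String> lines, String[] identifiers, int from) {
        for (int i = from; i < lines.size(); i++) {
            for (String id : identifiers) {
                if (lines.get(i).contains(id)) {
                    return i;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two made-up reports that title the same kind of table differently.
        String[] topIdentifiers = {"Balance Sheet", "Statement of Financial Position"};
        String[] bottomIdentifiers = {"Total equity", "Total shareholders' funds"};

        List<String> reportA = Arrays.asList(
                "Balance Sheet", "Cash  10", "Total equity  10");
        List<String> reportB = Arrays.asList(
                "Notes", "Statement of Financial Position", "Cash  7",
                "Total shareholders' funds  7");

        for (List<String> report : Arrays.asList(reportA, reportB)) {
            int start = firstMatch(report, topIdentifiers, 0);
            int end = firstMatch(report, bottomIdentifiers, start + 1);
            System.out.println("table spans lines " + start + " to " + end);
        }
        // prints "table spans lines 0 to 2" and then "table spans lines 1 to 3"
    }
}
```

One set of identifiers covers both reports, so the same definition can be reused across files from different sources.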

tabula-plus also lets developers indicate whether identifiers are part of the table. For example, to indicate that the top identifier is not part of the table, developers can call section.setTopIncluded(false);.

The extracted data for Table 7 is as follows:

Besides defining identifiers programmatically, tabula-plus also lets developers define them in a schema file. Below is a sample schema file:

    Table_7:
      top: Table 7 | false
      bottom: Table 8 | false
    Table_10:
      top: layout problems) | self-contained year-end | false
      bottom: Table 11 | false

Table_10 has two top identifiers, "layout problems)" and "self-contained year-end". The trailing false indicates that these top identifiers are not part of the table's data.
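To show how such an entry breaks down, here is a small sketch that splits the value of a top or bottom entry into its identifiers and its included flag. It assumes that identifiers are separated by "|" and that the last field is always the boolean flag, which is what the sample schema suggests. The SchemaEntry class is hypothetical and is not how tabula-plus itself reads schema files.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the parser used by tabula-plus.
final class SchemaEntry {
    final String[] identifiers;
    final boolean included;

    SchemaEntry(String[] identifiers, boolean included) {
        this.identifiers = identifiers;
        this.included = included;
    }

    /**
     * Parses the value part of an entry such as
     * "layout problems) | self-contained year-end | false".
     * Assumption: every field but the last is an identifier, and the last
     * field is the flag saying whether identifiers belong to the table.
     */
    static SchemaEntry parse(String value) {
        String[] parts = value.split("\\|");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                    "expected at least one identifier and a flag: " + value);
        }
        String[] ids = new String[parts.length - 1];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = parts[i].trim();
        }
        boolean included = Boolean.parseBoolean(parts[parts.length - 1].trim());
        return new SchemaEntry(ids, included);
    }

    public static void main(String[] args) {
        SchemaEntry top = parse("layout problems) | self-contained year-end | false");
        System.out.println(Arrays.toString(top.identifiers) + " included=" + top.included);
        // prints [layout problems), self-contained year-end] included=false
    }
}
```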

To tell tabula-plus to read identifiers from a schema file, do the following:

    PdfParser pdfParser = new PdfParser("example_2.schema");
    Map<String, NormalizedTable> tableMap = pdfParser.parse("sample-tables.pdf");

The result is a map from each table's name to that table's data.

How to run the sample code

This library requires a Java Runtime Environment compatible with Java 7 (i.e. Java 7, 8 or higher).

  • Mac OS X, Linux
      • Go to the project directory, and build the project by running the command ./gradlew build
      • Run the sample code with the command ./gradlew run
  • Windows
      • Go to the project directory, and build the project by running the command ./gradlew.bat build
      • Run the sample code with the command ./gradlew.bat run

The sample code defines identifiers for two tables, Table 7 and Table 10, extracts the data for both, and prints the result.

End note

By open sourcing this library, we hope that developers can treat PDF files as a source of information and reliably get data out of them.

About Funding Societies

Funding Societies | Modalku is the largest digital peer-to-peer (P2P) lending platform in Southeast Asia. It enables small and medium-sized enterprises (SMEs) by providing business financing which is crowdfunded by retail and institutional investors. It is currently licensed and operating in Singapore, Indonesia and Malaysia, and is the most well-funded P2P lending platform in Southeast Asia, backed by SoftBank Ventures Korea and Sequoia India amongst other investors. It is dedicated to the vision of funding underserved SMEs and improving societies in Southeast Asia.

Visit us at www.fundingsocieties.com | Follow us on Facebook | LinkedIn | Check out our blog & forum!
