Complex Financial PDF Documents to Excel. A Case Study.

Published in

CodeX

8 min readMay 19, 2024

Goal

The goal of this article is to accurately extract data from both structured and unstructured financial PDFs into Excel using Python and Excel functionality to achieve this.

PDF Types

I classify financial PDFs into 2 categories — a structured PDF and a non-structured PDF.

A structured PDF — This is a structured PDF, meaning its layout makes the content clear. When someone looks at it, they can easily see how the elements are related because of how they’re positioned. Elements which refer to the same record are all grouped on a single row.

Example:

Image source: https://support.microsoft.com/en-us/office/overview-of-excel-tables-7ab0bb7d-3a9e-4b56-a3c9-6c94334e492c.

From this table, it is easy to tell that each row is referring to a single product and its price. We do not need to make any inferences with this table structure.

A non-structured PDF — This PDF lacks structure, meaning that someone who hasn’t worked with it before won’t be able to easily see how the different parts relate to each other.

Example:

Image source: https://www.excelstemplates.com/income-statement-template

The PDF is a bit confusing at first glance, especially when trying to figure out what the 900,000 refers to. The irregular spacing doesn’t help either, making it harder to see how everything fits together. Plus, it’s not clear why things are grouped as a, b, and c. We need someone with expertise to organize the PDF into a logical structure before we can put it into Excel.

PDF Handling

For structured PDFs, utilizing Excel’s in-built functionality is logical because Excel is designed to efficiently manage structured data, recognize patterns, and offer extensive capabilities for data manipulation. On the other hand, for unstructured PDFs, employing coordinate mapping is suitable as it allows for flexible data extraction from diverse formats, adaptation to varying layouts, and customized extraction criteria based on specific coordinates within the document.

Let’s dive in!

Unstructured PDF — Coordinate Mapping

Please note that sample data has been utilized consistently throughout. Bear this in mind while attempting to execute the code.

Here is the process overview for coordinate mapping:

Retrieve the document’s source from the financial software system that generated it.
Convert the document into an image format and utilize Optical Character Recognition (OCR) to extract both the graphical components and textual content.
Utilize outlines or structural analysis to identify and define distinct text groups within the Excel table.
Import the extracted dataset into Excel, arranging it according to the identified text groups.

Pro tip: if your PDF contains underlined text, check out this article https://medium.com/towards-data-science/get-underlined-text-from-any-pdf-b7b068ca3d64

PDF Software Underlayer

To ascertain the document’s origins and potential structure, our initial step is to access the software responsible for generating the PDF. This inquiry will discern whether the document was manually crafted or generated through software. Typically, financial PDFs are software-generated, simplifying the structuring process.

A code snippet to print out the software that was used to create the PDF file.

PDF Component Extraction

We then use the PDF query library to extract the fields and the text from the PDF file.

For an in-depth explanation of how PDF query works and the best way to use it, have a look at this article: https://medium.com/dev-genius/building-a-high-precision-financial-pdf-extraction-tool-part-1-c6db2073eeee

Code to extract the boxes around PDF components and text from a PDF file.

We meticulously craft rectangular components that encapsulate each element of the page, seamlessly integrating text within them. Through this process, we faithfully replicate the entire structure of the document.

Extracting Data Into a Dataframe

Each of our elements is defined by 4 coordinates — x0, y0, x1, y1 . If 2 elements align, their x or y coordinates would be equal, for example:

Image source: https://www.investopedia.com/terms/f/financial-statements.asp, image edited by the author.

In the image above, we can see a few types of aligment — left, right and bottom aligment. The elements which are positioned on the red lines would have the same x1, y1 and x2 coordinates respectively.

Use the code below to get the aligned components in a more complex scenario:

Element alignment finder. Please adjust this code according to your XML coordinates and desired values.

Code Breakdown

To begin, we extract essential data from the file, such as identification codes, positions, shareholders, and coding dates. These details can be located within the PDF XML file.

We’ve constructed two lists: ‘bene’ (short for beneficiaries) and ‘pos’ (short for positions). In this instance, these lists delineate shareholders and their respective shareholdings, but their usage can extend to various contexts (e.g., products and prices). When encountering a termination condition, denoted by the string ‘End’ in our case, we then check if there is any data present in the arrays.

Our objective is to merge the two datasets. To achieve this, we begin by rounding the coordinates. Subsequently, we proceed to update the Excel file with the new values.

The key part of this code which exports the data from the processed PDF and into the Excel is:

Though I am using a temporary Excel, you can use a non-temporary file as well. Here is what the output from your PDF would look like:

The data has been extracted and placed into the Excel columns

Structured PDF — Excel Import

Now we will consider a structured financial document, for which we will use the Excel native import.

1. Creating an Empty Excel File

To begin, we’ll utilize the xlwings library in Python for interacting with Excel. First, install the library using pip install xlwings. Then, we'll write a function to handle this initial step:

import xlwings as xw

def create_excel_from_pdf(pdf_file_path):
    # Start Excel in the background
    app = xw.App(visible=False)

    # Create a new workbook
    wb = app.books.add()

    # Implement your logic to extract data from the PDF and populate the Excel file
    # Example:
    ws = wb.sheets.active
    ws.range('A1').value = 'Example Data'

    # Save the workbook
    excel_file_path = "output.xlsx"
    wb.save(excel_file_path)

    # Close the workbook and quit Excel
    wb.close()
    app.quit()

    return excel_file_path

    def process_pdf_unstructured(self, pdf_file_path):
        # Start Excel in the background
        app = xw.App(visible=False)

        # Create a new workbook
        wb = app.books.add()

        macro_code_template = '''

Now, we will:

Start Excel in the background: Initiates an instance of Excel in the background, which is invisible to the user.
Create a new workbook: Generates a new Excel workbook within the Excel instance.
Prepare the macro code template: A template for the VBA macro code is defined. This macro is responsible for importing data from a PDF file into the Excel workbook.
Replace placeholder with the actual file path: Inserts the actual file path of the PDF into the macro code template.
Add VBA module and write macro code: A VBA module is added to the workbook, and the modified macro code is written into this module. This macro, named “getData,” includes instructions to import data from the PDF into the Excel workbook and manipulate it.
Run the macro if it exists: Checks if the macro named “getData” exists in the workbook. If it does, the macro is executed.

Here is what the code looks like:

    def process_pdf_unstructured(self, pdf_file_path):
        # Start Excel in the background
        app = xw.App(visible=False)

        # Create a new workbook
        wb = app.books.add()

        macro_code_template = '''SOME VBA MACRO'''
        # Fill in the placeholder with the actual file path
        macro_code = macro_code_template.replace('{}', pdf_file_path)

        print(macro_code)
        # Add a VBA module to the workbook and write the macro code to it
        wb_vba = wb.api.VBProject.VBComponents.Add(1)  # 1 means vbext_ct_StdModule, which is a standard VBA module
        wb_vba.CodeModule.AddFromString(macro_code)

        # Run the macro if it exists
        if wb.macro('getData') is not None:
            wb.macro('getData')()

        # Generate the file name for the resulting workbook
        base_name = os.path.basename(pdf_file_path)

        file_name = f"{base_name}output.xlsx"
        print(base_name)
        target_directory = r"SOME/PDF/PATH"
        excel_file_path = os.path.join(target_directory, file_name)

        # Save the workbook to the target directory
        wb.save(excel_file_path)
        print("Path to the resulting workbook:", excel_file_path)

        # Close the workbook
        wb.close()

        # Quit Excel
        app.quit()

        # Print the path to the resulting workbook
        print("Processed PDF file path is:", pdf_file_path)

        return excel_file_path

In VBA code, the fundamental components are:

Uploading the PDF file.
Employing the Excel vision algorithm to extract table data from the PDF file.

The VBA code:

Sub getData()
    Dim pdfFilePath As String
    pdfFilePath = "actual_pdf_file_path.pdf"  ' Placeholder for the actual PDF file path

    ' Import data from the PDF into the Excel workbook
    ActiveWorkbook.Queries.Add Name:="Table001 (Page 1)", Formula:= _
        "let" & Chr(13) & "" & Chr(10) & "    Source = Pdf.Tables(File.Contents(""" & pdfFilePath & """), [Implementation=""1.3""])," & Chr(13) & "" & Chr(10) & "    Table001 = Source{[Id=""Table001""]}[Data]," & Chr(13) & "" & Chr(10) & "    #""Changed Type"" = Table.TransformColumnTypes(Table001,{{""Column1"", type text}, {""Column2"", type text}, {""Column3"", type text}, {""Column4"", type text}, {""Column" & _
        "5"", type text}, {""Column6"", type text}, {""Column7"", type text}})" & Chr(13) & "" & Chr(10) & "in" & Chr(13) & "" & Chr(10) & "    #""Changed Type"""

    ' Create a new worksheet and import data into it
    ActiveWorkbook.Worksheets.Add
    With ActiveSheet.ListObjects.Add(SourceType:=0, Source:= _
        "OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=""Table001 (Page 1)"";Extended Properties=""""" _
        , Destination:=Range("$A$1")).QueryTable
        .CommandType = xlCmdSql
        .CommandText = Array("SELECT * FROM [Table001 (Page 1)]")
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .PreserveColumnInfo = True
        .ListObject.DisplayName = "Table001__Page_1"
        .Refresh BackgroundQuery:=False
    End With

    ' Display a message box
    MsgBox "Hello from Python!"
End Sub

Overall, the Python function to transfer a structured PDF into an Excel is:

def process_pdf_unstructured(self, pdf_file_path):
    # Start Excel in the background
    app = xw.App(visible=False)

    # Create a new workbook
    wb = app.books.add()

    macro_code_template = '''
    Sub getData()
        Dim pdfFilePath As String
        pdfFilePath = "{}"

        ActiveWorkbook.Queries.Add Name:="Table001 (Page 1)", Formula:= _
            "let" & Chr(13) & "" & Chr(10) & "    Source = Pdf.Tables(File.Contents(""" & pdfFilePath & """), [Implementation=""1.3""])," & Chr(13) & "" & Chr(10) & "    Table001 = Source{[Id=""Table001""]}[Data]," & Chr(13) & "" & Chr(10) & "    #""Changed Type"" = Table.TransformColumnTypes(Table001,{{""Column1"", type text}, {""Column2"", type text}, {""Column3"", type text}, {""Column4"", type text}, {""Column" & _
            "5"", type text}, {""Column6"", type text}, {""Column7"", type text}})" & Chr(13) & "" & Chr(10) & "in" & Chr(13) & "" & Chr(10) & "    #""Changed Type"""
        ActiveWorkbook.Worksheets.Add
        With ActiveSheet.ListObjects.Add(SourceType:=0, Source:= _
            "OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=""Table001 (Page 1)"";Extended Properties=""""" _
            , Destination:=Range("$A$1")).QueryTable
            .CommandType = xlCmdSql
            .CommandText = Array("SELECT * FROM [Table001 (Page 1)]")
            .RowNumbers = False
            .FillAdjacentFormulas = False
            .PreserveFormatting = True
            .RefreshOnFileOpen = False
            .BackgroundQuery = True
            .RefreshStyle = xlInsertDeleteCells
            .SavePassword = False
            .SaveData = True
            .AdjustColumnWidth = True
            .RefreshPeriod = 0
            .PreserveColumnInfo = True
            .ListObject.DisplayName = "Table001__Page_1"
            .Refresh BackgroundQuery:=False
        End With
        MsgBox "Hello from Python!"
    End Sub
    '''
    # Fill in the placeholder with the actual file path
    macro_code = macro_code_template.replace('{}', pdf_file_path)

    # Add a VBA module to the workbook and write the macro code to it
    wb_vba = wb.api.VBProject.VBComponents.Add(1)  # 1 means vbext_ct_StdModule, which is a standard VBA module
    wb_vba.CodeModule.AddFromString(macro_code)

    # Run the macro if it exists
    if wb.macro('getData') is not None:
        wb.macro('getData')()

    # Generate the file name for the resulting workbook
    base_name = os.path.basename(pdf_file_path)
    file_name = f"{base_name}output.xlsx"
    target_directory = r"YOUR\OUTPUT\DIRECTORY"
    excel_file_path = os.path.join(target_directory, file_name)

    # Save the workbook to the target directory
    wb.save(excel_file_path)

    # Close the workbook
    wb.close()

    # Quit Excel
    app.quit()

    return excel_file_path

Throughout this article, we streamlined the extraction of financial data from PDFs, whether they’re structured or unstructured. We categorized financial PDFs into two main types: structured and unstructured. In doing so, we elucidated the distinction between them, highlighting how structured PDFs have a clear and organized layout that facilitates easy comprehension of the interrelation between different elements. By focusing on this clarity, we were able to devise methods to efficiently extract data, ensuring that elements pertaining to the same record are neatly grouped together for seamless integration into Excel.

PS to see more of my articles, you can follow me on LinkedIn here: https://www.linkedin.com/in/sasha-korovkina-5b992019b/