An Alternative PDF Data Extraction Method in MuleSoft RPA Using MuleSoft Anypoint and Python

Published in Another Integration Blog · 9 min read · May 22, 2024

In today’s digital era, businesses manage vast amounts of data daily, making the accurate and efficient extraction of information from digital documents like PDFs essential. To address this need, organizations seek robust solutions to streamline their processes. MuleSoft RPA is one automation platform that can take on this kind of document processing.

MuleSoft RPA often utilizes AWS’s integrated OCR service, which employs machine learning (ML) to extract data from PDFs. However, this service comes with a price tag and requires complex configuration.

In this article, we present a simple and cost-effective alternative specifically for digital PDFs (which contain selectable and copyable text). This method leverages the combined power of MuleSoft RPA, MuleSoft Anypoint, and Python. We will explore how this approach can efficiently extract data, providing a viable alternative to relying on OCR services.

The Scenario

The process I’m going to implement starts with extracting text from a PDF file. Then, I’ll invoke a MuleSoft API, which will utilize a Python script to extract the necessary data and send it back as a response. Finally, the response will be organized and stored in variables within RPA.

The Business Process Model and Notation (BPMN) process

This is the PDF from which we will extract the necessary data:

The Process:

Step 1: Extracting Text from PDF

The initial step entails extracting text from a PDF document. By employing the “Read PDF” tool, we convert the PDF contents into plain text. This extracted text is then stored in a global variable for subsequent utilization as input for our API.

The configuration of the “Read PDF” tool is simple and requires specifying the directory path and the PDF file name. Additionally, we choose the option to read the entire PDF.

Next, we use the “Set Variable” feature to store the extracted text in the global variable “input.” This allows us to utilize the text across all workflows seamlessly.

Configuration for “Set Variable” for storing the extracted text

Here is the text that was extracted from the PDF and stored in the input variable:

Step 2: Invoking API to Extract Data

→ Implementation of Workflow:

In this step, we will use the “REST call” component to invoke an API that we will create in Anypoint Studio. Upon receiving the response from the API, we will store the output in a variable for further use in the process.

To invoke the API, we need to configure the “REST call” component. Here are the key configuration steps:

Method Type: We select the POST method to send data in the body of the request.
Base URL: This is the base URL of the API, which we will configure when creating the API in Anypoint Studio.
URL Extension: This part of the URL will also be configured during the API creation to handle the request appropriately.
Request Body: For the request body, we choose to use URL encoded format since the JSON option does not support multiline text.

In the request body, we write {A}, where A represents our global variable “input”, which contains the text extracted from the PDF.
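For testing the API without MuleSoft RPA, the request can be approximated in plain Python. This is only a sketch for CPython 3: the endpoint URL is hypothetical, and the exact way RPA percent-encodes the text (for example, spaces as %20 rather than +) is an assumption. What it illustrates is that the body carries only the encoded text, with no "name=" prefix.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical endpoint: the real base URL and URL extension are the ones
# configured for the API in Anypoint Studio.
API_URL = "http://localhost:8081/extract"


def build_request(pdf_text):
    """Build a POST request whose body is only the percent-encoded text."""
    # No "name=" prefix: the whole text travels as a single form key.
    body = quote(pdf_text, safe="").encode("ascii")
    return Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# The request is built but not sent, so no running API is needed here.
request = build_request("Invoice Number: 42\r\nTOTAL : 100")
print(request.get_method())
print(request.data.decode("ascii"))
```

Passing the request to urllib.request.urlopen would send it once the API is running.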

Next, we utilize the “Set Variable” feature to store the extracted data in the global variable “output”.

Configuration for “Set Variable” to store data

→ Implementation of the API in Anypoint Studio:

1. Configuration of API Setup

After creating a project in Anypoint Studio, we need to configure the pom.xml file to support the implementation of Python code within our API.

I added the following configuration to the <plugins> section of the pom.xml file so that Jython is available as a shared library for executing Python code inside our API:

<plugin>
    <groupId>org.mule.tools.maven</groupId>
    <artifactId>mule-maven-plugin</artifactId>
    <version>${mule.maven.plugin.version}</version>
    <extensions>true</extensions>
    <configuration>
        <sharedLibraries>
            <sharedLibrary>
                <groupId>org.python</groupId>
                <artifactId>jython-standalone</artifactId>
            </sharedLibrary>
        </sharedLibraries>
    </configuration>
</plugin>

Additionally, I added the following two dependencies to the <dependencies> section of the pom.xml file:

the Jython standalone dependency:

<dependency>
    <groupId>org.python</groupId>
    <artifactId>jython-standalone</artifactId>
    <version>2.7.3</version>
</dependency>

the Mule Scripting Module connector:

<dependency>
    <groupId>org.mule.modules</groupId>
    <artifactId>mule-scripting-module</artifactId>
    <version>2.1.0</version>
    <classifier>mule-plugin</classifier>
</dependency>

Alternatively, if the connector version in the example above is no longer available, you can add the appropriate version of the Scripting Module dependency by searching for it in the Exchange.

In our Python script, we use regular expressions to extract the desired data. Note that the script shown later imports the standard-library ‘re’ module, so the step below matters mainly when a script depends on a third-party module such as ‘regex’, which isn’t built into Python. To incorporate such a module into our Mule project, we create a package named ‘lib’ within the ‘src/main/resources’ folder to house it, and then install the module locally by invoking the following command from a shell: pip install regex.

Once installed, the module can be found in the directory C:\Python3.3\Lib\site-packages. We then copy it from this location and insert it into our Mule project, specifically within the ‘lib’ package beneath the ‘src/main/resources’ folder.

The regex module installed within the lib package under the src/main/resources folder

In addition, it is necessary to add the following value to the ‘VM arguments’ field of the ‘Arguments’ tab of the project’s ‘Run Configuration’: -M-Dpython.path=path_of_lib_folder_inside_Anypoint_Studio (replace the placeholder with the actual path of the ‘lib’ folder). This step is crucial: without it, the project cannot access the ‘lib’ package folder where our ‘regex’ module resides.

Updating the Arguments for the Run Configuration for the project
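As far as I understand it, Jython treats the python.path property much like CPython treats PYTHONPATH: its entries end up on the module search path (sys.path). The sketch below reproduces that idea in plain CPython 3. The temporary folder and the module name demo_module are hypothetical stand-ins for the project’s ‘lib’ folder and its contents.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the project's lib folder (hypothetical, created on the fly).
lib_dir = Path(tempfile.mkdtemp()) / "lib"
lib_dir.mkdir()
(lib_dir / "demo_module.py").write_text("GREETING = 'hello from lib'\n")

# Putting the folder on the search path is what makes the import resolvable;
# without this line, the import below would fail with ModuleNotFoundError.
sys.path.insert(0, str(lib_dir))

demo_module = importlib.import_module("demo_module")
print(demo_module.GREETING)
```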

2. The implementation of API Logic

Our Mule flow starts with an HTTP listener that receives the URL-encoded input data in a POST request. Two Transform Message components then decode the data so that a Python script can extract the relevant information. Finally, the extracted data is transformed into JSON format before being sent as a response to MuleSoft RPA.

HTTP Listener Configuration:

The listener is set up to handle POST requests. It will receive data encoded in URL format from MuleSoft RPA.

First Transform Message:

The first transformation extracts the keys from the payload and creates an array, outputting it as JSON. The reason for this step is that the data received from the RPA is passed as a key in the URL-encoded format, not as a value. Therefore, extracting the keys allows us to access the actual data content.

%dw 2.0
output application/json
var keysArray = keysOf(payload)
---
{
    keys: keysArray
}
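The key-versus-value behaviour can be reproduced outside Mule. The following sketch uses Python’s standard form parser rather than Mule’s, and the body is a hypothetical example. It shows that a URL-encoded body without any "=" is parsed as a single key with an empty value, which is why the text has to be read from the keys.

```python
from urllib.parse import parse_qsl

# Hypothetical request body: percent-encoded text with no "name=" prefix.
raw_body = "Invoice%20Number%3A%2042%0D%0ATOTAL%20%3A%20100"

# keep_blank_values=True keeps pairs whose value is empty instead of
# silently dropping them.
pairs = parse_qsl(raw_body, keep_blank_values=True)

# Same idea as keysOf(payload): collect the keys, then take the first one.
keys = [key for key, _ in pairs]
print(pairs)
print(repr(keys[0]))
```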

The following is what the data looks like before transformation:

and post-transformation:

Second Transform Message:

The second transformation replaces the \r\n character sequences in the multiline text from the URL-encoded payload with \n, and then outputs the result as text.

%dw 2.0
output text

fun decodeMultiLineText(encodedText: String): String =
    encodedText replace "\r\n" with "\n"
---
decodeMultiLineText(payload.keys[0])
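For comparison, the same normalization takes one line in Python. This is a sketch that mirrors the DataWeave function above; the sample text is made up.

```python
def decode_multiline_text(encoded_text):
    """Replace Windows line endings (CRLF) with Unix line endings (LF)."""
    return encoded_text.replace("\r\n", "\n")


# Hypothetical sample with Windows line endings.
sample = "Invoice Number: 42\r\nDate: 22 May, 2024\r\nTOTAL : 100"
normalized = decode_multiline_text(sample)
print(normalized.split("\n"))
```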

Data after transformation:

Execute Component:

In this step, we configure the “Execute” component to run our Python script using Jython.

— General Configuration:

Engine: Set to “jython” to specify that the script is written in Python and will be executed using the Jython engine.
Code: Specifies the path to the Python script. In this case, ${file::script/extraction.py} points to the script located in the ‘script’ package folder under the ‘src/main/resources’ directory within the project.
Parameters: Here, {text:payload} indicates that the payload from the previous step is passed to the script as a parameter named text.

General settings for the Execute component
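The contract between the flow and the script can be pictured with a small analogy in plain Python: the parameters become variables that the script can read, and the script’s output is read back from a variable named result. This is not how the Scripting Module is implemented internally; run_script and the tiny inline script are hypothetical, used only to illustrate the idea.

```python
# Tiny stand-in for a script file: it reads `text` and must assign `result`.
script_source = """
word_count = len(text.split())
result = {'words': word_count}
"""


def run_script(source, parameters):
    """Run `source` with `parameters` bound as variables and return `result`."""
    namespace = dict(parameters)
    exec(source, namespace)
    # A script that never assigns `result` yields None here.
    return namespace.get("result")


payload = "Invoice Number: 42"
output = run_script(script_source, {"text": payload})
print(output)
```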

The Python script extraction.py is placed in the ‘script’ package folder under the ‘src/main/resources’ directory.

Location of the Python script within the project

This script uses regular expressions to extract specific data from the invoice text, such as items, invoice number, invoice date, and total amount.

The data must be stored in a variable named result to be available in the payload afterward.

import re

# Regex patterns (raw strings, so backslashes reach the regex engine unchanged)

items_re = r"(\w+(?: \w+)*) (\d+) \$ (\d{2}) (\d+) \$"
invoice_number_re = r'Invoice Number:\s*(\d+)'
invoice_date_re = r'Date: (\d+ \w+, \d{4})'
total_re = r'TOTAL : ([\d\s]+)'

################## items ##################

# Find all item rows in the invoice text.
# `text` is the parameter supplied by the Execute component.
item_matches = re.findall(items_re, text)

# Build one dictionary per extracted item row
items = []
for match in item_matches:
    item = {
        'name': match[0],
        'price': match[1],
        'quantity': match[2],
        'total': match[3]
    }
    items.append(item)

################## invoice_number ##################
# Indexing with [0] takes the first match; it raises IndexError if none is found
invoice_number_matches = re.findall(invoice_number_re, text)
invoice_number = invoice_number_matches[0]

################## invoice_date ##################
invoice_date_matches = re.findall(invoice_date_re, text)
date_livraison = invoice_date_matches[0]

################## total ##################
total_matches = re.findall(total_re, text)
total = total_matches[0]

# The output must be assigned to `result` so that it becomes the payload
result = {'items': items, 'invoice_number': invoice_number, 'date_livraison': date_livraison, 'total': total}
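To try the patterns outside Mule, the same logic can be wrapped in a function and run against a sample. This is a sketch for CPython 3: the sample text is invented to fit the patterns (the real PDF text is not reproduced here), and, unlike the script above, missing fields come back as None and captured values are stripped of surrounding whitespace.

```python
import json
import re

ITEMS_RE = re.compile(r"(\w+(?: \w+)*) (\d+) \$ (\d{2}) (\d+) \$")
INVOICE_NUMBER_RE = re.compile(r"Invoice Number:\s*(\d+)")
INVOICE_DATE_RE = re.compile(r"Date: (\d+ \w+, \d{4})")
TOTAL_RE = re.compile(r"TOTAL : ([\d\s]+)")


def first_group(pattern, text):
    """Return the first captured group, stripped, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def extract_invoice(text):
    """Extract items and header fields from the plain text of an invoice."""
    items = [
        {"name": name, "price": price, "quantity": quantity, "total": total}
        for name, price, quantity, total in ITEMS_RE.findall(text)
    ]
    return {
        "items": items,
        "invoice_number": first_group(INVOICE_NUMBER_RE, text),
        "date_livraison": first_group(INVOICE_DATE_RE, text),
        "total": first_group(TOTAL_RE, text),
    }


# Invented sample text, shaped to match the patterns above.
sample_text = (
    "Invoice Number: 12345\n"
    "Date: 22 May, 2024\n"
    "Blue Widget 10 $ 02 20 $\n"
    "TOTAL : 20 $\n"
)
invoice = extract_invoice(sample_text)
print(json.dumps(invoice, indent=2))
```

Returning None instead of raising an error lets the caller decide how to handle an invoice whose layout does not match.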

Third Transform Message:

The third transformation converts the received payload into JSON format.

%dw 2.0
output application/json
---
payload

3. The execution output:

Step 3: Storing data into variables

Step 3 of the Process: store data to variables

In this step, we’ll utilize the output of our API to fetch each desired piece of data and store it into global variables for later use.

(Example) Fetching Items Data with JSON Query Component:

We’ll extract the desired data from the API response, using the JSON Query component. This tool allows for targeted data extraction by enabling users to define exact JSON queries, ensuring only relevant information is retrieved.

Configuration for “JSON Query” to extract data from JSON

The fetched value will be stored in another global variable for further use.

Configuration for “Set Variable” to store data in a global variable
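The JSON Query component has its own query syntax, which is not reproduced here. As a rough Python equivalent of what this step does, the sketch below parses a response and picks out individual values. The response body is a hypothetical example shaped like the script’s result dictionary.

```python
import json

# Hypothetical API response, shaped like the script's `result` dictionary.
response_body = """
{
  "items": [
    {"name": "Blue Widget", "price": "10", "quantity": "02", "total": "20"}
  ],
  "invoice_number": "12345",
  "date_livraison": "22 May, 2024",
  "total": "20"
}
"""

response = json.loads(response_body)

# Each value fetched here corresponds to one global variable in the RPA process.
items = response["items"]
first_item_name = items[0]["name"]
invoice_number = response["invoice_number"]
print(first_item_name, invoice_number)
```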

Limitations of the Methodology

While the proposed methodology offers an innovative approach to data extraction, it is important to consider its limitations to ensure its applicability in various scenarios.

Suitability for Digital PDFs Only: This method is exclusively applicable to digital document PDFs. Scanned image PDFs will necessitate alternative approaches such as OCR or emerging solutions like MuleSoft IDP. Consequently, workflows involving non-digital documents may find this method unsuitable.
Dependency on Consistent PDF Structures: This method works well for PDFs whose structure stays consistent across versions, but it struggles with significant structural variations. Because the regular expressions target specific parts of the PDF text, PDFs with frequently changing layouts may not be extracted reliably.
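One way to notice a layout change early is to check which required patterns fail to match before trusting the output. The sketch below is an illustration under assumptions: REQUIRED_PATTERNS, missing_fields, and both sample texts are hypothetical, and only two of the article’s patterns are reused.

```python
import re

# Two of the patterns used by the extraction script, keyed by field name.
REQUIRED_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\d+)"),
    "total": re.compile(r"TOTAL : ([\d\s]+)"),
}


def missing_fields(text):
    """Return the names of required fields whose pattern finds no match."""
    return [
        name
        for name, pattern in REQUIRED_PATTERNS.items()
        if not pattern.search(text)
    ]


# Invented samples: one with the expected labels, one with changed labels.
expected_layout = "Invoice Number: 12345\nTOTAL : 20\n"
changed_layout = "Invoice No. 12345\nTotal due: 20\n"

print(missing_fields(expected_layout))
print(missing_fields(changed_layout))
```

A non-empty list signals that the document should go to a fallback such as manual review or OCR rather than being processed silently.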

Conclusion

In summary, the methodology presented provides a comprehensive approach to data extraction from digital document PDFs using MuleSoft RPA, MuleSoft Anypoint, and Python. By integrating these technologies, organizations can streamline their data extraction processes and enhance overall efficiency.

Written by Omar ee