Alteryx Python Tools to Extract Text from Searchable PDF

Astry Citrasari
4 min readNov 10, 2023

There is always a first timer for everything and so do I. This year time, exactly on February, it’s the first time I touch with Alteryx.

Alteryx has built in Python Tools with Jupyter Notebook as its interface. I am so glad at this point because I have already familiar with Jupyter and a little bit basic python. People may familiar to python, Jupyter and Alteryx, yet, there’s still a gap. Only a few knowledge in the internet explaining how to bridge the Alteryx to python.

I’ve learnt along the way with my great teammates and trying to connect the dot. In this article, I will demonstrate how Python Tools in Alteryx behave and how it can be use in Alteryx main workflow.

Python Tools Icon

Alteryx Python tools have the same icon as usual python and there are 5 number as it depicts 5 possible output from Jupyter. Below is the standard code automatically put there by the tools.

# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
from ayx import Alteryx
# read in data from input anchor as a pandas dataframe
# (after running the workflow)
df = Alteryx.read("#1")

# display the pandas dataframe
df

Python tools directly put the input data stream into dataframe. For business case extracting text from PDF, the input data we feed into Jupyter is the full file path and name of the PDF.

Input Data into Python Tools

After filename is stored into dataframe in Jupyter, we can start writing the code to extract text. PDF that I used is mostly in the table format as below. So, the most suitable python package for this case is Tabula Package. For reminder, this Tabula Package have dependency on Java Package, make sure your system also installing Java Package to run the whole package.

The details explanation to install Tabula can be found in below links

As usual, we need to import the package

import tabula as tb
from tabula import read_pdf
import pandas as pd

Then from the dataframe we can draw filename only for better looping.

FileName = df['FileName']
FileName
Coding Result

PDF text extraction using Tabula package. For reminder, you can find the details explanation about area and columns in the link above. To get the exact number of columns that suits the PDF columns, it takes trial and error. I have done my trial and error and here’s what I got.

for f in FileName:
df2= tb.read_pdf(f, pages = 'all', area = (0, 0, 800, 900), columns = [115, 260, 340, 360, 420], pandas_options={'header': None}, stream=True)[0]
Result from Tabula package

Then we just need to feed the dataframe into Python tools output. You can choose any number of output, I choose output number 2

Alteryx.write(df2,2)
Result of feed dataframe into Python Tool output
Result from Alteryx Designer side

The picture above is the result from Alteryx main workflow. You can take a look in python tools that I click number 2 as the output number.

The thing is I actually not only working with 1 PDF at the time. There are several PDF’s and I need to give the data identification from which PDF files is the data coming from. To accommodate such case, I am using Append Fields tool from Alteryx, to get back the file path that is originally become the input.

Append Fields tool have 2 input streams. First is the Target ( T symbol) and second is Source (S symbol). Basically, I would like to append data from Source to Target, with the Source will be filename path and Target will be all the data from PDF. The result is filename path will be the new column after column 5. Thus, it can identify this data coming from that filename.

New column as FileName identifier

Save the workflow as macro workflow and this macro will be ready to use many times in our automation for extracting PDF.

--

--

Astry Citrasari

Developed automation using Alteryx and Python. If you love my article don't forget https://www.buymeacoffee.com/astrycitrao