Alteryx Python Tools to Extract Text from Searchable PDF

4 min readNov 10, 2023

There is always a first timer for everything and so do I. This year time, exactly on February, it’s the first time I touch with Alteryx.

Alteryx has built in Python Tools with Jupyter Notebook as its interface. I am so glad at this point because I have already familiar with Jupyter and a little bit basic python. People may familiar to python, Jupyter and Alteryx, yet, there’s still a gap. Only a few knowledge in the internet explaining how to bridge the Alteryx to python.

I’ve learnt along the way with my great teammates and trying to connect the dot. In this article, I will demonstrate how Python Tools in Alteryx behave and how it can be use in Alteryx main workflow.

Alteryx Python tools have the same icon as usual python and there are 5 number as it depicts 5 possible output from Jupyter. Below is the standard code automatically put there by the tools.

# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
from ayx import Alteryx
# read in data from input anchor as a pandas dataframe
# (after running the workflow)
df = Alteryx.read("#1")

# display the pandas dataframe
df

Python tools directly put the input data stream into dataframe. For business case extracting text from PDF, the input data we feed into Jupyter is the full file path and name of the PDF.

After filename is stored into dataframe in Jupyter, we can start writing the code to extract text. PDF that I used is mostly in the table format as below. So, the most suitable python package for this case is Tabula Package. For reminder, this Tabula Package have dependency on Java Package, make sure your system also installing Java Package to run the whole package.

PDF - image

This example injects an image (the DataTables logo) into the PDF. PDFMaker uses based64 encoded images (online tools…

datatables.net

The details explanation to install Tabula can be found in below links

How to Scrape and Extract Data from PDFs Using Python and tabula-py

You want to make friends with tabula-py and Pandas

towardsdatascience.com

As usual, we need to import the package

import tabula as tb
from tabula import read_pdf
import pandas as pd

Then from the dataframe we can draw filename only for better looping.

FileName = df['FileName']
FileName

PDF text extraction using Tabula package. For reminder, you can find the details explanation about area and columns in the link above. To get the exact number of columns that suits the PDF columns, it takes trial and error. I have done my trial and error and here’s what I got.

for f in FileName:
    df2= tb.read_pdf(f, pages = 'all', area = (0, 0, 800, 900), columns = [115, 260, 340, 360, 420], pandas_options={'header': None}, stream=True)[0]

Then we just need to feed the dataframe into Python tools output. You can choose any number of output, I choose output number 2

Alteryx.write(df2,2)

Result of feed dataframe into Python Tool output

The picture above is the result from Alteryx main workflow. You can take a look in python tools that I click number 2 as the output number.

The thing is I actually not only working with 1 PDF at the time. There are several PDF’s and I need to give the data identification from which PDF files is the data coming from. To accommodate such case, I am using Append Fields tool from Alteryx, to get back the file path that is originally become the input.

Append Fields tool have 2 input streams. First is the Target ( T symbol) and second is Source (S symbol). Basically, I would like to append data from Source to Target, with the Source will be filename path and Target will be all the data from PDF. The result is filename path will be the new column after column 5. Thus, it can identify this data coming from that filename.

Save the workflow as macro workflow and this macro will be ready to use many times in our automation for extracting PDF.

Alteryx Python Tools to Extract Text from Searchable PDF

PDF - image

This example injects an image (the DataTables logo) into the PDF. PDFMaker uses based64 encoded images (online tools…

How to Scrape and Extract Data from PDFs Using Python and tabula-py

You want to make friends with tabula-py and Pandas

Written by Astry Citrasari