Converting PDFs to EXCEL using Python: A Step-by-Step Guide

Mikayla Morgan
5 min readMay 3, 2024

--

This is a step-by-step tutorial on scrubbing data and putting valuable data points and information into self-created Excel files for accounting-specific uses.

Why is this useful?

This tutorial is specifically tailored for corporate tax preparation, ensuring relevance to your needs because accounting firms often receive client files in PDF form.

Have you ever found yourself in a situation where essential data arrives from clients in an inconvenient PDF format? It’s a common challenge, wishing for the simplicity of Excel for easier handling. But fear not, as there’s a Python solution to convert those PDFs into Excel spreadsheets effortlessly.

Let’s dive in and simplify your data workflow!

Introduction:

When working at my public accounting firm last summer, I was often faced with the daunting task of extracting tabular data from various sources, including intricate financial reports, new client tax returns, and IRS documents. One particularly memorable instance involved a new client and their 1120 tax form. This document is a vital report to understand a company’s tax position as well as helps set up our software to make the tax return process smoother, using prior year information. Recognizing the importance of this data, I set out to extract and analyze it to ensure accuracy and analyze it to ensure accuracy and compliance.

In this tutorial, I will show you how to analyze a PDF and convert it into an Excel file. You will need a PDF document containing data you want to extract and sift through to accomplish this task.

Gathering the Data:

For this guide, we’ll be using “PDF B,” which you can download below to follow along. Let’s dive in and unlock Python’s potential for PDF data extraction!

Setting up to Create a Python Code:

When deciding where to run the code, you can use Google Colab or your local computer. In this tutorial, we’ll choose Google Colab because of its simplicity and accessibility.

However, it is important to note that uploading data to Google Colab means it’s stored on their server, which may raise privacy concerns.

Step #1: Opening up Google Collab & Finding Data

o Link to Google Collab

o Click here to download our Tutorial Example and rename it to “PDF B”

Step #2: Loading in Data to Google Collab

o On the left-hand side of your screen, you’ll notice a file icon with an arrow pointing upwards. Click on this icon to upload your PDF file. This will allow you to follow our tutorial and apply the techniques we discuss directly to your document. Let’s get started by accessing the PDF file you’d like to convert to Excel or CSV format!

o If you’re following the tutorial, your left-hand side should resemble the following layout.

Create the Python Code:

Step #1: Install PDFplumber

o PDFplumber is a Python library designed for extracting text, tables, and images from PDF files.

!pip install pdfplumber

Step #2: Import pandas, namedtuple, and pdfplumber

o This will read each page of the PDF file “B.PDF” (the PDF file example being imputed), extract the text content from each page, and concatenate all the extracted text into a single string stored in the variable ‘all_text.’

from collections import namedtuple
import pandas as pd
import pdfplumber
import re
all_text = " "

with pdfplumber.open('B.pdf') as pdf:
for page in pdf.pages:
all_text = all_text + page.extract_text()

What is this code doing — A line-by-line explanation

o from collections import namedtuple 

o This line imports the ‘namedtuple’ class from the ‘collections’ module.

o ‘namedtuple’ is a factory function for creating type subclasses.

o import pandas as pd

o This line imports the ‘pandas’ library and assigns it an alias ‘pd,’ a standard convention.

o ‘pandas’ is a powerful data manipulation and analysis library in Python.

o import pdfplumber

o This line imports the ‘pdfplumber’ library, which extracts text, tables, and images from PDF files.

o import re

o This line imports the “re” module, which supports regular expressions. Regular expressions are used for pattern matching and string manipulation.

o all_text = " "

o This line initializes an empty string variable named ‘all_text.

o This variable will store all the text extracted from the PDF pages.

o with pdfplumber.open('B.pdf') as pdf:

o This line opens the PDF file named “B.pdf” using pdfplumber and assigns it to the variable ‘pdf’.

o The ‘with’ statement ensures that PDF files are appropriately closed after use.

o for page in pdf.pages:

o This line starts a loop that iterates over each page of the PDF file

o all_text = all_text + page.extract_text() 

o Inside the loop, this line extracts the content from the current page using the ‘extract_text( )’ method provided by ‘pdfplumber’.

o It then appends this text to the ‘all_text’ string variable.

Step #3: Extract all of the PDF data

o Use the variable “all_text.”

o This variable container stores all the text extracted from the imported PDF pages.

all_text

Step #4: Analyze the Data

o This is where we will “ paraphrase” the text document line by line, extract specific information using a regular expression pattern, and store the extracted data in a structured format using named tuples.

from collections import namedtuple
import pandas as pd
import pdfplumber
import re

with pdfplumber.open('B.pdf') as pdf:
page = pdf.pages[0]
text = page.extract_text()

allinfore = re.compile(r'(\d{5}) (\d{2}\/\d{2}\/\d{4}) Printed (.+?) (\d{1,3}(?:,\d{3})*\.\d{2})')
allinfo = allinfore.findall(text)

checkregister= namedtuple('CheckRegister', ['CheckNumber', 'CheckDate', 'VendorName', 'Amount'])

line_items = []
for line in text.split('\n'):
if allinfore.match(line):
vend_num, *vend_name = line.split()
vend_name = ' '.join(vend_name)

line = allinfore.search(line)
if line:
check = line.group(1)
check_date = line.group(2)
vendor = line.group(3)
check_amt = line.group(4)
line_items.append(checkregister(check, check_date, vendor, check_amt))

Step #5: View the line items

line_items

Step #6: Put the Named Tuples into a Data Frame

df = pd.DataFrame(line_items)
df

Step #7: Save Your Extractions to an Excel File

df.to_excel('ExtractedPDFB.xlsx')

Step #8: Access Saved Excel File

o On the left-hand side of your screen, where you uploaded your original PDF, click on the file icon with an arrow pointing upwards.

o This is where you will see your newly created Excel file.

o You can download this file in any format (Excel, CSV, etc.)

Conclusion

This tutorial taught us how to convert PDF documents to Excel documents using Python. Scrubbing a PDF into an Excel document offers significant everyday benefits by streamlining data extraction and analysis. By automating the conversion, Python eliminates the need for manual data entry, saving time and reducing errors.

--

--