How To Easily Extract Text From Any PDF With Python

Easier than ever

Published in Analytics Vidhya, Feb 3, 2021

Data scientists often have to deal with information contained in PDFs. Some will simply copy and paste the data they need, but this is a terrible practice: it is the slowest and least effective way to work in the long term, and depending on the PDF it may not even be possible.

Before we start, thanks to Carlos Melo (Sigmoidal) for allowing me to use the fake PDF reports created for his Data Science course, in which I am a student and which I love very much. If you don’t know him, I highly encourage you to follow him on Instagram, his blog and YouTube. He is my favourite source of Data Science knowledge.

If you want to follow along with the whole project and not just the PDF Plumber functions, take a look at my Google Colab Notebook, which covers everything I talk about in this post.

The tool we are using in this tutorial is PDF Plumber (pdfplumber), an open-source Python package that is simple and powerful.

Click here if you want to check out the PDF I am using in this example.

1. Import your module.

pip install pdfplumber -q

import pdfplumber

Now let’s take a look at the main functions PDF Plumber has:

2. open('path/to/file.pdf')

This function opens the file whose path you pass as an argument and returns a PDF object. Here we store that object in a variable called pdf:

pdf = pdfplumber.open('/content/file.pdf')

3. pages[ ]

After opening the file, select the page that holds the information you’re looking for. Let’s say it is on the first page: the index is 0 because Python starts counting from 0:

page = pdf.pages[0]

Imagine you’re reading a book: first you open the book, then you look for the page you want, and then you read it (i.e., extract information from it). PDF Plumber works the same way.
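Because pages is zero-indexed while people usually count pages from 1, a small helper can do the conversion. This is only a sketch: get_page is a hypothetical name, not part of PDF Plumber, and it assumes pdf is the object returned by pdfplumber.open().

```python
def get_page(pdf, page_number):
    """Return the page with the given 1-based number from an opened PDF.

    get_page is a hypothetical helper, not a pdfplumber function.
    It only relies on pdf.pages behaving like a list.
    """
    total_pages = len(pdf.pages)
    if not 1 <= page_number <= total_pages:
        raise IndexError(
            "page {} requested, but the file has {} page(s)".format(
                page_number, total_pages
            )
        )
    # Python counts from 0, so page 1 lives at index 0
    return pdf.pages[page_number - 1]
```

With this helper, get_page(pdf, 1) returns the same page as pdf.pages[0].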

4. extract_text()

Now that you’ve opened a page you need to extract the text from it:

text = page.extract_text()
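The same call works on every page, not just the first. The sketch below joins the text of all pages. extract_all_text is a hypothetical helper name, and the `or ""` guard is an assumption to cover pages where extract_text() finds nothing (depending on the version, that may come back as None or as an empty string).

```python
def extract_all_text(pdf):
    """Return the text of every page of an opened PDF, joined by newlines."""
    page_texts = []
    for page in pdf.pages:
        # Fall back to "" if a page yields no text (for example, a scanned image)
        page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts)


# Usage with PDF Plumber (path taken from the example above):
# with pdfplumber.open('/content/file.pdf') as pdf:
#     print(extract_all_text(pdf))
```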

If you pass the variable text to print(), the output is formatted like this:

print(text)
SIGMOIDAL 
   
Relatório Diário 
 
Data: 10/08/2020 
 
RECEITA: R$ 1.397,00 
DADOS ATUALIZADOS POR CARLOS MELO
 
 
 Visitantes: 1367 
A quantidade de visitantes diz respeito a visitantes únicos visitando qualquer 
página do domínio ou subdomínio sigmoidal.ai. Compreende, então, cursos, 
blogs e landing pages. 
 Inscritos: 33 
É considerado aqui o número de leads gerados por meio de cadastro 
voluntário nos formulários do cabeçalho, rodapé ou materiais ricos (como 
eBook, infográficos, entre outros). 
 Assinantes: 6 
Clientes assinantes da Escola de Data Science, considerando-se o plano 
renovável de assinatura mensal.

The print() function renders ‘\n’ as a line break and ‘\t’ as a tab, so the text comes out formatted. By the way, this is the extracted text I use throughout this post; your output will differ if you use another PDF.

However, if you just evaluate the variable (for example, as the last line of a notebook cell), the output is the raw string:

SIGMOIDAL \n \nRelatório Diário \n \nData: 10/08/2020 \n \nRECEITA: R$ 1.397,00 \nDADOS ATUALIZADOS POR CARLOS MELO\n \n \n Visitantes: 1367 \nA quantidade de visitantes diz respeito a visitantes únicos visitando qualquer \npágina do domínio ou subdomínio sigmoidal.ai. Compreende, então, cursos, \nblogs e landing pages. \n Inscritos: 33 \nÉ considerado aqui o número de leads gerados por meio de cadastro \nvoluntário nos formulários do cabeçalho, rodapé ou materiais ricos (como \neBook, infográficos, entre outros). \n Assinantes: 6 \nClientes assinantes da Escola de Data Science, considerando-se o plano \nrenovável de assinatura mensal. \n \n \n
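The difference between the two outputs can be reproduced without a PDF. Here is a minimal sketch with a short sample string copied from the report above: print() renders the newline characters, while repr() (what a notebook shows when you evaluate a variable) displays them as \n.

```python
# A short sample copied from the extracted text above
sample = "RECEITA: R$ 1.397,00 \nDADOS ATUALIZADOS POR CARLOS MELO\n"

print(sample)        # newlines are rendered as real line breaks
print(repr(sample))  # newlines are shown as the two characters \n

# Splitting on "\n" turns the text into a list of lines
lines = sample.split("\n")
print(lines[0])
```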

This raw string is what you want to work with. Imagine we want the revenue value (RECEITA) that this file contains, which appears as ‘1.397,00’. We have to clean it until we get ‘1397.00’ as a string, and then cast it to a float. If you want to see this process step by step, have a look at the notebook I made for this project. The code would be:

float(text.split("\n")[6].replace("\t", "").split("R$")[1].replace(".", "").replace(",", "."))
1397.0
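Taking the line at index 6 works only while every report has exactly the same layout. As an alternative sketch (not the approach shown above), a regular expression can look for the RECEITA label instead. parse_brl and find_revenue are hypothetical helper names, and the pattern assumes the amount is always written like R$ 1.397,00.

```python
import re

# Matches e.g. "RECEITA: R$ 1.397,00" and captures "1.397,00"
REVENUE_PATTERN = re.compile(r"RECEITA:\s*R\$\s*([\d.]+,\d{2})")


def parse_brl(amount):
    """Convert a Brazilian-formatted amount such as '1.397,00' to a float."""
    # Drop the thousands separator, then use a dot as the decimal mark
    return float(amount.replace(".", "").replace(",", "."))


def find_revenue(text):
    """Return the RECEITA value found in text, or None if it is missing."""
    match = REVENUE_PATTERN.search(text)
    if match is None:
        return None
    return parse_brl(match.group(1))


sample = "Data: 10/08/2020 \n \nRECEITA: R$ 1.397,00 \nDADOS ATUALIZADOS"
print(find_revenue(sample))  # 1397.0
```

Returning None when the label is missing lets the caller decide whether to skip the report or stop with an error.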

Imagine you have lots of files that follow the same text pattern. You could write a for loop so that Python iterates over all of them, prints the revenue value of each one and keeps a running total:

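total = 0  # running total (named total to avoid shadowing the built-in sum)

# week_files is assumed to be a list of PDF paths
for report_path in week_files:
    with pdfplumber.open(report_path) as report:  # the file is closed after this block
        page = report.pages[0]
        text = page.extract_text()  # extracting the text
    value = text.split("\n")[6].replace("\t", "").split("R$")[1]
    value = float(value.replace(".", "").replace(",", "."))  # '1.397,00' -> 1397.0
    total += value
    print("{} ----> {}".format(report_path, value))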
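The loop assumes week_files already holds the paths of the reports. The post does not show how that list is built, so the following is only one possible sketch: list_pdfs is a hypothetical helper, and the folder in the usage comment is an assumption.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the paths of all .pdf files in folder, sorted by name."""
    return sorted(str(path) for path in Path(folder).glob("*.pdf"))


# Usage (the folder is an assumption, adjust it to where your reports live):
# week_files = list_pdfs("/content")
```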

If you liked this tutorial, please share it with your friends and leave a comment on what you liked the most and what I could have done better. Don’t forget to add me on LinkedIn and GitHub, and don’t hesitate to reach out if you have any questions.

Reference:

Background vector created by starline — www.freepik.com

Google Colaboratory, colab.research.google.com

https://sigmoidal.ai/blog-sigmoidal/