QUANTRIUM GUIDES

Encrypted PDF Identification and Decryption using Python

Identifying encrypted PDF documents and decrypting password protected PDFs using Python

Bhargav Sridhar

Published in

Quantrium.ai

5 min read · Jul 2, 2022

Introduction

Portable Document Format, popularly known as PDF, is one of the most commonly used data storage formats today and offers numerous benefits. Encryption of the document and its data is one of its major features, and it keeps the information safe. However, this feature becomes a snag in AI-based automation, where you want to automate tasks such as document identification and extraction of useful information from PDF documents.

At Quantrium, we faced this issue while working on our payslip digitisation platform, where loan applicants uploaded encrypted PDF payslips.

In this article, I will cover the following topics:

How to identify encrypted PDFs using Python.
Some PDF encryption types and how to identify them.
How to decrypt password protected PDFs using Python.

Types of PDF Encryption

There are two main types of PDF encryption:

Password Protection: A password protected PDF cannot be opened or viewed straight away because it is locked and requires a password to open. This is the most common type of encryption used for PDF documents.
Text Encryption: You can open and view these PDFs normally, but you cannot copy or edit the text/data stored in them for analysis or any other purpose. To identify this manually, select some text in the PDF and try to copy it into a text document; you will not be able to paste the selected text.

Some PDFs may also have both types of encryption applied. Now, let us discuss how we can identify these types of PDF documents using Python.

Identifying Encrypted PDFs using Python

Here, we will use PyMuPDF, one of the more powerful PDF processing and management libraries available for Python. In fact, I used the same library in one of my previous articles to build a PDF Classifier using Python. The article can be found here and the PyMuPDF documentation can be found here. PyMuPDF is imported under the name fitz, and we will use that module to identify encrypted PDFs.

First, let’s install the required module:

!pip install PyMuPDF

Now, import the module in your Python script:

import fitz

This installation and import are common to both tasks that we are going to perform later in this article.

Identification of Password Protected PDFs

You have two methods to identify password protected PDFs using Python:

Method 1

Define and implement the is_password_protected_pdf() function as follows:

def is_password_protected_pdf(pdf_file_path):
    doc = fitz.Document(pdf_file_path)
    if doc.needs_pass:
        return True
    return False

You can test the above function with the following code snippet:

pdf_file_path = "test_pdf.pdf" #Specify your PDF path here
print(is_password_protected_pdf(pdf_file_path))

Here, doc.needs_pass returns True if the PDF is password protected and False if it is an ordinary (unprotected) PDF. In this way, you can easily tell password protected PDFs apart from other PDF files.
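When many files have to be screened, as with uploaded payslips, the same needs_pass check can be applied to a whole batch. The following is a minimal sketch rather than part of the original workflow: the helper names open_pdf and split_by_password_protection and the opener parameter are hypothetical, and it assumes that any exception raised while opening a file simply means the file should be set aside as unreadable.

```python
def open_pdf(path):
    """Open a PDF with PyMuPDF; imported here so the rest works without it."""
    import fitz  # PyMuPDF

    return fitz.open(path)


def split_by_password_protection(paths, opener=open_pdf):
    """Sort paths into (locked, unlocked, unreadable) lists.

    `opener` is a hypothetical hook: any callable that returns an object
    with a `needs_pass` attribute and a `close()` method.
    """
    locked, unlocked, unreadable = [], [], []
    for path in paths:
        try:
            doc = opener(path)
        except Exception:
            # Assumption: a file that cannot be opened is set aside, not fatal.
            unreadable.append(path)
            continue
        try:
            if doc.needs_pass:
                locked.append(path)
            else:
                unlocked.append(path)
        finally:
            doc.close()  # release the file once the flag has been read
    return locked, unlocked, unreadable
```

With real files, split_by_password_protection(list_of_paths) opens each one through PyMuPDF. Passing a different opener makes it possible to try the sorting logic without any PDF at hand.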

Method 2

We will define and implement the is_password_protected_pdf() function here as in Method 1, but slightly differently:

def is_password_protected_pdf(pdf_file_path):
    doc = fitz.Document(pdf_file_path)
    if doc.metadata is None:
        return True
    return False

Here, doc.metadata returns the PDF metadata, such as the PDF format, creator and date of creation. It also includes encryption data (if any). However, if the PDF is password protected, its contents cannot be read by the fitz module until the password is supplied, and hence the metadata cannot be extracted. Thus, doc.metadata will be None for a password protected PDF.

Now, if a sample PDF is tested in the same way as in Method 1 and the function returns True, the PDF is password protected; if the metadata is not None and the function returns False, the PDF is not password protected.

Now, let us identify PDFs where the text is encrypted.

Identifying Text Encrypted PDFs

Let us define the is_pdf_text_encrypted() function:-

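def is_pdf_text_encrypted(pdf_file_path):
    doc = fitz.Document(pdf_file_path)
    # metadata is None while a password protected PDF is still locked,
    # so check it before looking up the "encryption" entry
    if doc.metadata is not None and doc.metadata["encryption"] is not None:
        return True
    return False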

This method is very similar to the second method of password protected PDF identification.

In the case of an ordinary PDF that is not password protected, the metadata will be a non-empty dictionary containing the PDF details (you can verify this by printing the PDF metadata). The "encryption" entry of the metadata will be None if no encryption scheme is used within the PDF. In such cases, the PDF data can be extracted efficiently using Python modules like pdftotext. But if the PDF text is encrypted, the "encryption" entry will hold the name of the encryption scheme used, which we can use when trying to decrypt the PDF data.

When the above function was tested with a text encrypted PDF, the encryption scheme that I obtained was Standard V2 R3 128-bit RC4. Other PDFs may report different schemes, which identify the encryption implemented internally in the PDF.
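The two checks can be combined into a single classifier. The sketch below is one possible arrangement, not the article's own code: the function name classify_pdf and its three labels are hypothetical, and it expects an already opened document, for example the result of fitz.Document(pdf_file_path), that has not been authenticated yet. It tests needs_pass first because the metadata of a locked PDF is None and cannot be indexed.

```python
def classify_pdf(doc):
    """Return a label for a freshly opened (not yet authenticated) document.

    The labels are illustrative names, not PyMuPDF values.
    """
    if doc.needs_pass:
        return "password_protected"
    # metadata is a dict for readable PDFs; fall back to {} defensively
    metadata = doc.metadata or {}
    if metadata.get("encryption") is not None:
        return "text_encrypted"
    return "ordinary"
```

Returning a label instead of a boolean lets a pipeline route each file in one step: ask for a password, try an alternative extraction route, or process the file directly.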

Decrypting Password Protected PDFs

Having understood how to detect encrypted PDFs using Python, let us create a method to decrypt password protected PDFs:-

def decrypt_pdf(pdf_file_path, password):
    doc = fitz.Document(pdf_file_path)
    if doc.authenticate(password):
        file_name = "pdf_decrypted.pdf"
        doc.save(file_name)
        print("\t Successfully decrypted PDF")
    else:
        print("\t Password incorrect!! Cannot decrypt PDF!!!")

Now, let us test this function together with the password protected PDF identification function, using the following block:-

from getpass import getpass
pdf_file_path = "password_pdf.pdf" #Specify your PDF path here
if is_password_protected_pdf(pdf_file_path):
    password = getpass("Enter password to decrypt PDF:")
    decrypt_pdf(pdf_file_path, password)
else:
    print("PDF is not password protected!")

The getpass function in Python can be used to get the password/input from the user in a masked format (similar to the Linux password prompt, where characters are not displayed during input). Its description can be found here. If the correct decryption password is passed, the PDF will be saved without the password and can then be opened directly and processed; the function reports a failure if the wrong password is entered.
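In an automated pipeline it is often more convenient for the decryption step to report its outcome than to print it. The sketch below makes that change under stated assumptions: the names AUTH_RESULTS and decrypt_document and the output_path parameter are hypothetical. It relies on authenticate() returning an integer, where 0 means failure and the non-zero values indicate which password matched, as described in the PyMuPDF documentation.

```python
# Meaning of the integer returned by Document.authenticate() in PyMuPDF
AUTH_RESULTS = {
    0: "authentication failed",
    1: "no password was required",
    2: "authenticated with the user password",
    4: "authenticated with the owner password",
    6: "authenticated; user and owner passwords are the same",
}


def decrypt_document(doc, password, output_path):
    """Try to unlock `doc`; on success save a copy to `output_path`.

    Returns the authenticate() result code so the caller can act on it.
    """
    result = doc.authenticate(password)
    if result:
        # Same save() call as in decrypt_pdf, with a caller-chosen file name
        doc.save(output_path)
    return result
```

With a real file, doc would be fitz.Document(pdf_file_path), and AUTH_RESULTS.get(result, "unknown code") gives a readable message for logging.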

Conclusion

We have seen how to identify encrypted PDFs and decrypt password protected PDFs, all using the fitz module. There are also some ways to deal with text encrypted PDFs. However, such PDFs will fail with modules like pdftotext (one of the modules used to extract PDF text while retaining the text positions/alignments present in the PDF) if not dealt with properly.

Hence, one may have to devise a different method to obtain data from text encrypted PDF documents. One can also try extracting the PDF data using the built-in methods of the fitz module; the PyMuPDF documentation describes them in detail. Overall, these techniques come in handy when dealing with a large PDF dataset containing a mix of ordinary and encrypted PDFs.

I hope this article has given you a good understanding of the topic and that it proves useful in the long run. I would be more than willing to answer any questions posted in the comments. Thanks!
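As a starting point for the built-in route, the sketch below collects the text of every page with PyMuPDF's page.get_text() method. The function name extract_text is hypothetical, doc is assumed to be an opened (and, if necessary, authenticated) document, and whether usable text comes back for a given text encrypted PDF depends on the file, so treat it as an experiment rather than a guaranteed fix.

```python
def extract_text(doc):
    """Join the plain text of all pages, separated by newlines."""
    # Iterating over a PyMuPDF document yields its pages in order
    return "\n".join(page.get_text() for page in doc)
```

For example, extract_text(fitz.Document(pdf_file_path)) returns a single string that can be inspected or passed on to later processing steps.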