PDF content verification in Playwright
Are you looking for a blog post about how to use Playwright and Node.js to verify the content of a PDF file? If so, you’re in the right place! In this blog post, we’ll go over how to use Playwright, pdf-parse (a JavaScript library for parsing and extracting text from PDF files) and fs (a built-in Node.js module for working with the file system) to verify the contents of a PDF file.
Verifying PDF files is a crucial part of web application testing, because PDF files often carry the most important information given to users, such as invoices, contracts, and other legal documents.
First, I would like to introduce Playwright, an open-source automation framework for testing web applications. Like many other open-source test frameworks, it has no built-in verification for PDFs or other local files. Playwright is powerful compared to other automation tools, but checking the content of a PDF file is one of its gaps. That’s why I’ve written this blog post.
Now that we have a basic understanding of what Playwright is, let’s go over how to use it to verify the content of a PDF file. First, we’ll set up Playwright and install pdf-parse via npm (fs is built into Node.js, so it does not need to be installed):
npm init playwright@latest
npm install pdf-parse
I will verify that the text content of a simple PDF file matches an expected value, using Playwright with fs and pdf-parse. First, we need to extract the content once by hand using pdf-parse to create the expected-value file (a .txt file).
The PDF file is downloaded through the web UI, as a real user would do it.
Then I print the PDF’s text content to the console using pdf-parse.
Output:
A Simple PDF File
This is a small demonstration .pdf file -
just for use in the Virtual Mechanics tutorials. More text. And more
text. And more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. And more text. Boring, zzzzz. And more text. And more text. And
more text. And more text. And more text. And more text. And more text.
And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. And more text. And more text. Even more. Continued on page 2 ...
Simple PDF File 2
...continued from page 1. Yet more text. And more text. And more text.
And more text. And more text. And more text. And more text. And more
text. Oh, how boring typing this stuff. But not as boring as watching
paint dry. And more text. And more text. And more text. And more text.
Boring. More, a little more text. The end, and just as well.
This content needs to be verified manually once; it is then saved as expected.txt in the ExportData directory.
We now have an expected.txt file to compare with the actual data, so we can create an automated Playwright test case that compares the expected text with the text extracted from the actual PDF file. By the end of this post, the directory hierarchy will look like this:
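If you prefer to save the reviewed text from a script rather than by hand, something like the sketch below could do it. The helper name saveBaseline is hypothetical. It relies on the 'wx' file flag from Node's fs module, which makes the write fail when the file already exists, so an already reviewed baseline is not silently replaced:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: stores manually reviewed text as the baseline file.
// Throws if the baseline already exists ('wx' = write, fail if path exists).
function saveBaseline(baselinePath, reviewedText) {
  // Create the target directory (e.g. ExportData) if it is missing
  fs.mkdirSync(path.dirname(baselinePath), { recursive: true });
  fs.writeFileSync(baselinePath, reviewedText, { encoding: 'utf-8', flag: 'wx' });
}
```

For example, saveBaseline('./ExportData/expected.txt', reviewedText) would create the baseline once; to replace it on purpose, delete the old file first.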
- ExportData
  - actual.txt
  - expected.txt
  - sample.pdf
- node_modules
- playwright-report
- tests
  - example.spec.js
- package-lock.json
- package.json
- playwright.config.js
example.spec.js
In example.spec.js, the first line imports the test and expect functions from Playwright's @playwright/test module. The second line imports the fs module, which is used to read and write files on the file system.
The file then defines a test case using Playwright's test function. The test navigates to a URL that serves a PDF file using the page.goto function. It starts waiting for the download event and clicks on a link to download the PDF file.
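The important detail in this step is the ordering: the listener for the download event has to exist before the click happens, otherwise the event can be missed. Here is a small sketch of that pattern as a reusable function. The name downloadViaClick and its linkName parameter are hypothetical; page.waitForEvent and page.getByRole are the same Playwright calls used in the test:

```javascript
// Hypothetical helper. `page` is a Playwright Page object and `linkName`
// is the accessible name of the link that triggers the download.
async function downloadViaClick(page, linkName) {
  // Start waiting first, without awaiting yet, so the event cannot be missed
  const downloadPromise = page.waitForEvent('download');
  // Then perform the action that triggers the download
  await page.getByRole('link', { name: linkName }).click();
  // Resolves to Playwright's Download object
  return downloadPromise;
}
```

Inside a test this could be called as const download = await downloadViaClick(page, 'some link name'), which has the same effect as the Promise.all form used in this post.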
Once the download has completed, the code saves the PDF file into the ExportData directory using the filename suggested by the download event. It then uses the pdf-parse module to extract the text from the PDF file and saves it to a file called actual.txt in the same directory.
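One thing to watch in this step is that the file read back for parsing should be the file that was just saved, whatever name the server suggested. The sketch below keeps the two in sync by building both paths from the same values. The helper name saveActualText and its parameters are hypothetical; download.suggestedFilename and download.saveAs are Playwright's Download methods, and parsePdf stands for a pdf-parse style function that resolves to an object with a text property:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: saves the downloaded PDF into `exportDir`, extracts
// its text with `parsePdf`, and writes the text to actual.txt next to it.
async function saveActualText(download, exportDir, parsePdf) {
  fs.mkdirSync(exportDir, { recursive: true });

  // Save under the suggested name and remember that exact path
  const pdfPath = path.join(exportDir, download.suggestedFilename());
  await download.saveAs(pdfPath);

  // Parse the file that was just saved, not a hard-coded file name
  const data = await parsePdf(fs.readFileSync(pdfPath));

  const actualPath = path.join(exportDir, 'actual.txt');
  fs.writeFileSync(actualPath, data.text);
  return actualPath;
}
```

In the test from this post it could be called as saveActualText(download, 'ExportData', require('pdf-parse')).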
The last step reads the expected and actual values from the files that were saved earlier and uses Playwright's expect function to assert that the values match. If they do not match, the test fails.
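A comparison of raw text can fail for reasons that have nothing to do with the PDF's content, for example when expected.txt was saved with Windows line endings or has extra blank lines. One way to make the check less brittle is to normalize both strings before comparing them. The sketch below is only one possible normalization, and the helper name normalizePdfText is hypothetical. Note that it deliberately ignores blank lines and repeated spaces, which is an assumption about what counts as "the same content" that may not suit every document:

```javascript
// Hypothetical helper: reduces text to trimmed, non-empty lines with
// single spaces, so that only the visible words are compared.
function normalizePdfText(text) {
  return text
    .replace(/\r\n?/g, '\n')                 // unify CRLF and CR line endings to LF
    .split('\n')
    .map((line) => line.replace(/[ \t]+/g, ' ').trim()) // collapse runs of spaces and tabs
    .filter((line) => line.length > 0)       // drop blank lines
    .join('\n');
}
```

In the test, the assertion could then be written as expect(normalizePdfText(actual)).toBe(normalizePdfText(expected)), where actual and expected are the strings read from the two text files.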
Put together, the complete example.spec.js file looks like this:
const { test, expect } = require('@playwright/test');
const fs = require('fs');
const pdf = require('pdf-parse');

// Define a test using Playwright's `test` function
test('verify content', async ({ page }) => {
  // Navigate to a URL that serves a PDF file
  await page.goto('https://www.africau.edu/images/default/sample.pdf');

  // Wait for the download event and click on a link to download the PDF file
  const [download] = await Promise.all([
    page.waitForEvent('download'),
    page.getByRole('link', { name: 'A Simple PDF File https://www.africau.edu › images › default › sample' }).click()
  ]);

  // Use the suggested filename from the download event to save the file
  const filePath = 'ExportData/' + download.suggestedFilename();
  await download.saveAs(filePath);

  // Use the 'pdf-parse' module to extract the text from the saved PDF file
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  fs.writeFileSync('./ExportData/actual.txt', data.text);

  // Read the expected and actual values from the saved files
  const expectedExportValues = fs.readFileSync('./ExportData/expected.txt', 'utf-8');
  const actualExportValues = fs.readFileSync('./ExportData/actual.txt', 'utf-8');

  // Assert that the actual text is exactly equal to the expected text
  expect(actualExportValues).toBe(expectedExportValues);
});
Conclusion
Playwright, like many other open-source test automation tools, has no built-in PDF file verification. When I needed to verify a PDF file generated by the web application that I am testing with Playwright, I could not find any blog post, document, or tutorial online. That is why I wrote this post, so that you can learn how to test the content of a PDF file in a Playwright automated web application test suite. 🚀