Automating Validation of PDF Document Content

Dwiki Nugraha
Blibli.com Tech Blog
3 min read · May 31, 2023

Testing requirements grow day by day as features become more complex, so we keep learning how to validate various cases under various conditions.

In this article, I will explain how to automate validation of the content of a PDF document.

As an example, SIPLah (Sistem Informasi Pengadaan Sekolah) is a procurement application in Indonesia that schools use to purchase what they need with government funding. As required, siplah.blibli.com generates documents that schools may use as legal reports to other stakeholders, including, but not limited to, the BAST, comparison, invoice, negotiation, and SPK documents.

Invoice Document

Given this background, the validity of these documents is clearly essential to the application. Thus, as the SDETs of the B2G squad, we explored how to automate validation of not only the details shown in the application but also the content of the PDF documents that users download.

One solution is to automate the validation of the result, which also minimizes the duration of the test.

In Java, there is already a library for working with PDF documents called Apache PDFBox, which can be used to extract the text of a PDF document as a String. After that, we can add an assertion to verify the text against the expected data shown in the PDF document.

Implementation

1. Add the dependency to pom.xml
https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox

<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.26</version>
</dependency>

2. Add code to read a PDF document and convert it into a string

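import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Extracts all text from the PDF; returns an empty string if the file cannot be read.
public String convertPDFtoString(File path) {
    String contentString = "";
    // try-with-resources closes the document even if text extraction fails
    try (PDDocument pdDocument = PDDocument.load(path)) {
        contentString = new PDFTextStripper().getText(pdDocument);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return contentString;
}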

3. Add an assertion

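import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsString;

// filePath points to the downloaded PDF document
String orderID = "123/221/22";
String pdfContent = convertPDFtoString(new File(filePath));
assertThat("Order ID is incorrect or does not exist",
    pdfContent,
    containsString(orderID));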
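
Text extracted from a PDF often contains line breaks and irregular spacing, so a value that looks contiguous on the page may not match a plain substring check. Below is a minimal sketch of normalizing whitespace before comparing. The class `PdfTextNormalizer`, its method names, and the sample text are hypothetical and only illustrate the idea; real extractor output depends on the document.

```java
import java.util.regex.Pattern;

// Hypothetical helper: collapses whitespace so expected values can be matched
// even when the extracted PDF text wraps them across lines.
final class PdfTextNormalizer {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces every run of whitespace (spaces, tabs, line breaks) with one space.
    static String normalize(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    // Returns true if the normalized content contains the normalized expected value.
    static boolean containsNormalized(String pdfContent, String expected) {
        return normalize(pdfContent).contains(normalize(expected));
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for extracted PDF content.
        String extracted = "Order ID   :\n123/221/22\r\nBuyer :  SD Negeri 1";
        System.out.println(containsNormalized(extracted, "Order ID : 123/221/22"));
    }
}
```

The result of `containsNormalized` could then be passed to the same kind of assertion used in step 3.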

Example

We have a test scenario that validates the grand total of the invoice document to make sure the calculation is correct, because an 11% tax has to be applied.

Invoice Document

After converting the document, the library returns its content as a string.

Console result
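
The grand-total check described in this example can be sketched roughly as follows: compute the expected total from the subtotal and the 11% tax, format it the way the invoice prints it, and look for that string in the extracted text. The class `InvoiceTotalCheck`, the rounding rule, the `Rp` number format, and the sample values are all assumptions for illustration; the real invoice layout may differ.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical helper for checking an invoice grand total in extracted PDF text.
final class InvoiceTotalCheck {
    private static final BigDecimal TAX_RATE = new BigDecimal("0.11");

    // Subtotal plus 11% tax; the tax is rounded to whole rupiah (assumed rounding mode).
    static BigDecimal grandTotal(BigDecimal subtotal) {
        BigDecimal tax = subtotal.multiply(TAX_RATE).setScale(0, RoundingMode.HALF_UP);
        return subtotal.add(tax);
    }

    // Formats an amount such as 1110000 as "Rp 1.110.000" (assumed display format).
    static String formatRupiah(BigDecimal amount) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setGroupingSeparator('.');
        symbols.setDecimalSeparator(',');
        return "Rp " + new DecimalFormat("#,##0", symbols).format(amount);
    }

    // True if the extracted text contains the formatted expected grand total.
    static boolean containsGrandTotal(String pdfContent, BigDecimal subtotal) {
        return pdfContent.contains(formatRupiah(grandTotal(subtotal)));
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for extracted PDF content.
        String pdfContent = "Subtotal Rp 1.000.000\nPPN 11% Rp 110.000\nGrand Total Rp 1.110.000";
        System.out.println(containsGrandTotal(pdfContent, new BigDecimal("1000000")));
    }
}
```

Using `BigDecimal` instead of `double` avoids floating-point drift when the expected amount is compared as text.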

Implementing daily automation

As a reference, our automation validates 49 documents across all scenarios in the regression suite, which runs daily to make sure every document has valid values, calculations, and other notable information.

Cucumber report

Conclusion

There are always two sides to a coin: this approach still has a weakness, which is that we can only validate the content without verifying its position. Thus, if the expected value appears in the document but in the wrong field or position, the test will not catch it as an issue.
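
This limitation can be illustrated with a small sketch. A plain substring check passes even when the value sits under the wrong label, while a check anchored to the label narrows the gap somewhat. This is only a partial mitigation: it assumes the label and its value end up on the same line of the extracted text, which PDF text extraction does not guarantee. The class `LabeledValueCheck` and the sample text are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks that a value appears right after its label.
final class LabeledValueCheck {
    // True if the label is followed on the same line by an optional ':' and the value.
    static boolean hasLabeledValue(String pdfContent, String label, String value) {
        Pattern pattern = Pattern.compile(
            Pattern.quote(label) + "[ \\t]*:?[ \\t]*" + Pattern.quote(value) + "(?!\\S)");
        return pattern.matcher(pdfContent).find();
    }

    public static void main(String[] args) {
        // Made-up text where the expected grand total sits under the wrong label.
        String wrongField = "Subtotal : Rp 1.110.000\nGrand Total : Rp 1.000.000";
        String expected = "Rp 1.110.000";

        // The plain substring check does not notice the misplaced value.
        System.out.println(wrongField.contains(expected));
        // The label-anchored check reports that the grand total is wrong.
        System.out.println(hasLabeledValue(wrongField, "Grand Total", expected));
    }
}
```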

Reference

https://www.javatpoint.com/pdfbox-tutorial

https://pdfbox.apache.org/

https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox

Credit
Blibli — B2G Squad — SDET Team
Faizatunnisa
Stella Suharli
Vincent Novanto
Abhipraya Radhityaqso
Dwiki Nugraha
