Spring Boot: Extract Text From PDF

George Berar

2 min read · Mar 16, 2022

In today’s article I will show you a simple way to extract text from a PDF file.

Image credits to https://www.2pdfconverter.com/

Before we start, you can find the entire code here.

What Are We Building?

A simple REST API endpoint which accepts a PDF file as input and returns the extracted content as text.

How Do We Build It?

Maven Dependency

Add the Apache PDFBox dependency to your pom.xml:
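A minimal sketch of that dependency entry is shown below. The groupId and artifactId are the published PDFBox coordinates; the version is only an assumption (a 2.0.x release), so pick whichever release fits your project.

```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <!-- Assumed version: a 2.0.x release; adjust to the one you need -->
    <version>2.0.25</version>
</dependency>
```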

Other versions here

Define The API

Create a REST endpoint which accepts a PDF file as input:
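A minimal sketch of such an endpoint could look like the following. The path `/api/pdf/extract`, the request parameter name `file`, and the names `PdfController`, `PdfContentExtractor` and `PdfContentResponse` are all assumptions made for illustration. The two small types at the bottom are stand-ins so the block compiles on its own.

```java
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.io.InputStream;

@RestController
@RequestMapping("/api/pdf") // hypothetical base path
class PdfController {

    private final PdfContentExtractor extractor;

    // Constructor injection: Spring supplies the extractor bean
    PdfController(PdfContentExtractor extractor) {
        this.extractor = extractor;
    }

    // Accepts a multipart upload whose part is named "file" (assumed name)
    @PostMapping(value = "/extract", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    public ResponseEntity<PdfContentResponse> extract(@RequestParam("file") MultipartFile file)
            throws IOException {
        if (file.isEmpty()) {
            // Nothing to parse, so reject the request early
            return ResponseEntity.badRequest().build();
        }
        try (InputStream in = file.getInputStream()) {
            return ResponseEntity.ok(new PdfContentResponse(extractor.extract(in)));
        }
    }
}

// Stand-in for the extractor service (hypothetical signature)
interface PdfContentExtractor {
    String extract(InputStream pdf) throws IOException;
}

// Stand-in for the response body (hypothetical shape: one "content" field)
class PdfContentResponse {
    private final String content;

    PdfContentResponse(String content) {
        this.content = content;
    }

    public String getContent() {
        return content;
    }
}
```

Keep in mind that Spring Boot limits uploaded files to 1MB by default; the `spring.servlet.multipart.max-file-size` property raises that limit for larger PDFs.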

where the response looks like this:

Note: I’m using Lombok to generate getters/setters/constructors etc.
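If you prefer not to use Lombok, a hand-written equivalent might look like the sketch below. The class name `PdfContentResponse` and its single `content` field are assumptions for illustration; the methods mirror what Lombok would otherwise generate for an immutable value class.

```java
import java.util.Objects;

// Hypothetical response body: wraps the extracted text in one field
final class PdfContentResponse {

    private final String content;

    PdfContentResponse(String content) {
        this.content = content;
    }

    // Public getter so a JSON serializer can expose the "content" property
    public String getContent() {
        return content;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof PdfContentResponse)) {
            return false;
        }
        return Objects.equals(content, ((PdfContentResponse) other).content);
    }

    @Override
    public int hashCode() {
        return Objects.hash(content);
    }

    @Override
    public String toString() {
        return "PdfContentResponse(content=" + content + ")";
    }
}
```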

and the content extractor service looks like this:
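One possible shape for this service is sketched below, assuming the PDFBox 2.x API, where documents are opened with `PDDocument.load` (PDFBox 3.x moved loading to the `Loader` class). The class name `PdfContentExtractor` and the method name `extract` are hypothetical.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.io.InputStream;

@Service
class PdfContentExtractor {

    /**
     * Reads the whole PDF from the stream and returns its text content.
     * Assumes PDFBox 2.x; throws IOException if the input is not a readable PDF.
     */
    String extract(InputStream pdf) throws IOException {
        // try-with-resources closes the document and releases its buffers
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }
}
```

`PDFTextStripper` only returns text that is actually stored in the PDF, so scanned pages that contain nothing but images come back empty unless an OCR step is added.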

Test

For testing purposes I use a simple PDF file with only two paragraphs:
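If you don't have a suitable file at hand, a similar one can be generated with PDFBox itself. The sketch below assumes the PDFBox 2.x API; the class name `SamplePdfFactory` and the placeholder paragraph text are made up for illustration and are not the content of the file used in this article.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

class SamplePdfFactory {

    /** Builds a one-page PDF with two short placeholder paragraphs and returns its bytes. */
    static byte[] twoParagraphPdf() throws IOException {
        try (PDDocument document = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            PDPage page = new PDPage();
            document.addPage(page);

            // The content stream must be closed before the document is saved
            try (PDPageContentStream content = new PDPageContentStream(document, page)) {
                content.beginText();
                content.setFont(PDType1Font.HELVETICA, 12);
                content.newLineAtOffset(50, 700);   // start near the top-left corner
                content.showText("This is the first paragraph.");
                content.newLineAtOffset(0, -30);    // move down for the second paragraph
                content.showText("This is the second paragraph.");
                content.endText();
            }

            document.save(out);
            return out.toByteArray();
        }
    }
}
```

The returned bytes can be written to disk, for example with `Files.write(Paths.get("sample.pdf"), SamplePdfFactory.twoParagraphPdf())`, and then uploaded to the endpoint.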

Sending the request to the API returns the expected text:

Conclusion

As always, please keep in mind that this approach may or may not suit your project's context or needs, and I'm in no position to say there isn't a different or better way to do it. I really hope you enjoyed reading it.

Stay safe and remember you can find the code here.