Spring Boot: Extract Text From PDF
In today’s article I will show you a simple way to extract text from a PDF file.
Before we start you can find the entire code here.
What Are We Building?
A simple REST API endpoint which accepts a PDF file as input and returns the extracted content as text.
How Do We Build It?
Maven Dependency
Add the Apache PDFBox dependency in your pom.xml:
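A minimal sketch of that dependency entry. The group and artifact IDs are the ones Apache PDFBox publishes under; the version shown is an assumption (a 2.0.x release), so pick whichever release fits your project.

```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <!-- Assumed version for illustration; not necessarily the one used in the original project -->
    <version>2.0.27</version>
</dependency>
```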
Other versions are available on Maven Central.
Define The API
Create a REST endpoint which accepts a PDF file as input:
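A minimal sketch of such an endpoint, assuming Spring Web MVC. The path `/api/pdf/extract`, the `file` form field, and the nested `TextExtractor` and `ExtractResponse` types are hypothetical placeholders chosen for illustration; the original project may name and structure them differently.

```java
import java.io.IOException;
import java.io.InputStream;

import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController
public class PdfExtractController {

    /** Hypothetical contract for the extraction service (an assumption, not from the article). */
    public interface TextExtractor {
        String extract(InputStream pdf) throws IOException;
    }

    /** Hypothetical response payload; Jackson would serialize it as {"content": "..."}. */
    public static class ExtractResponse {
        private final String content;

        public ExtractResponse(String content) {
            this.content = content;
        }

        public String getContent() {
            return content;
        }
    }

    private final TextExtractor extractor;

    // Constructor injection: Spring needs a TextExtractor bean in the application context.
    public PdfExtractController(TextExtractor extractor) {
        this.extractor = extractor;
    }

    // Accepts a multipart upload whose file part is named "file" (assumed field name).
    @PostMapping(path = "/api/pdf/extract", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    public ExtractResponse extract(@RequestParam("file") MultipartFile file) throws IOException {
        // Close the upload stream once the extractor has consumed it.
        try (InputStream in = file.getInputStream()) {
            return new ExtractResponse(extractor.extract(in));
        }
    }
}
```

Keeping the controller free of PDF logic means it can be exercised with a stub extractor, without loading a real document.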
where the response looks like this:
Note: I’m using Lombok to generate getters/setters/constructors etc.
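A sketch of what such a response type could look like, written without Lombok so that it compiles on its own. The class name `PdfContentResponse` and the single `content` field are assumptions made for illustration.

```java
/** Hypothetical response payload; Jackson would serialize it as {"content": "..."}. */
class PdfContentResponse {

    private String content;

    // Stands in for what Lombok's @NoArgsConstructor would generate.
    public PdfContentResponse() {
    }

    // Stands in for what Lombok's @AllArgsConstructor would generate.
    public PdfContentResponse(String content) {
        this.content = content;
    }

    // Accessors that Lombok would otherwise generate.
    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }
}
```

With Lombok on the classpath, `@Data`, `@NoArgsConstructor` and `@AllArgsConstructor` on the class would generate these members (plus `equals`, `hashCode` and `toString`), leaving only the field declaration.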
and the content extractor service looks like this:
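A minimal sketch of such a service, assuming PDFBox 2.x, where `PDDocument.load(InputStream)` is available (PDFBox 3.x moved document loading to `org.apache.pdfbox.Loader`). The class name `PdfContentExtractor` and the `extract` method are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.springframework.stereotype.Service;

@Service
public class PdfContentExtractor {

    /**
     * Reads the whole PDF from the stream and returns its text content.
     * Assumes the PDFBox 2.x API; the caller remains responsible for closing the stream.
     */
    public String extract(InputStream pdf) throws IOException {
        // PDDocument holds parser resources, so close it as soon as the text is read.
        try (PDDocument document = PDDocument.load(pdf)) {
            // PDFTextStripper walks every page and concatenates the text it finds.
            return new PDFTextStripper().getText(document);
        }
    }
}
```

Scanned PDFs that contain only images carry no text layer, so a call like this would be expected to return little or nothing for them; extracting text from those needs OCR, which is outside what this sketch covers.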
Test
For testing purposes I use a simple PDF file with only two paragraphs:
Doing the request against the API returns the expected text:
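For readers who prefer to issue the request from code, here is a sketch that uses only the JDK's `java.net.http` client (Java 11+). The `file` field name and the `PdfUploadClient` class are assumptions for illustration; the article does not say which tool was used for the original request.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfUploadClient {

    /** Builds a multipart/form-data body that contains a single PDF file part. */
    static byte[] multipartBody(String boundary, String field, String filename, byte[] data)
            throws IOException {
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + field
                + "\"; filename=\"" + filename + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.write(partHeader.getBytes(StandardCharsets.UTF_8));
        body.write(data);
        // The closing delimiter marks the end of the multipart message.
        body.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return body.toByteArray();
    }

    /** Posts the PDF to the endpoint under the assumed "file" field and returns the raw response body. */
    static String upload(URI endpoint, Path pdf) throws IOException, InterruptedException {
        String boundary = "pdf-upload-" + System.nanoTime();
        byte[] body = multipartBody(
                boundary, "file", pdf.getFileName().toString(), Files.readAllBytes(pdf));
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}
```

A call such as `PdfUploadClient.upload(URI.create("http://localhost:8080/api/pdf/extract"), Path.of("sample.pdf"))` would then return the response body as a string; the URL and file name here are placeholders, so substitute your own.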
Conclusion
As always, please keep in mind that this approach may or may not suit your project's context or needs, and there may well be other or better ways to do it. I really hope you enjoyed it and had fun reading it.
Stay safe and remember you can find the code here.