Extract text from PDF in .NET Standard 2.0

Andriy Andruhovski
Published in asposepdf
Mar 19, 2018

Extracting text from a PDF is a common data processing task. In this story, we’ll look at an example of how to extract text from a whole PDF document using Aspose.PDF for .NET. Using this library has some advantages:

  • It’s a single cross-platform API for .NET/.NET Standard 2.0, Java, and Android platforms with a native core, so you don’t need additional dependencies;
  • A large set of data converters from and to PDF;
  • A wide range of functions for editing PDFs, plus additional functions such as signing, encryption, and printing.

In this example, we used the standard “Console App (.NET Core)” template in Visual Studio 2017.

Please note that Aspose.PDF is a commercial product, but you can obtain a temporary license from Aspose.

The easiest way to extract text from a PDF is to use the PdfExtractor class. To extract the text, we need to complete the following steps:

  1. Create an object of the PdfExtractor class.
  2. Bind the input PDF file using the BindPdf() method.
  3. Call the ExtractText() method to extract all the text into memory.
  4. Call the GetText() method to save the extracted text to a stream or a file.
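
The four steps above can be sketched roughly as follows. This is a minimal sketch, not the article's original listing: the class name, the default file name input.pdf, and the choice to print the result to the console are assumptions made for illustration. It also assumes that the parameterless ExtractText() produces Unicode (UTF-16) output, so the buffer is read back with that encoding.

```csharp
using System;
using System.IO;
using System.Text;
using Aspose.Pdf.Facades;

class ExtractAllText
{
    static void Main(string[] args)
    {
        // Assumption: the input path comes from the first argument,
        // falling back to a hypothetical "input.pdf".
        string inputPath = args.Length > 0 ? args[0] : "input.pdf";

        // Step 1: create the extractor.
        var extractor = new PdfExtractor();

        // Step 2: bind the input PDF file.
        extractor.BindPdf(inputPath);

        // Step 3: extract all the text into memory.
        extractor.ExtractText();

        // Step 4: write the extracted text into a stream and read it back.
        var buffer = new MemoryStream();
        extractor.GetText(buffer);
        buffer.Position = 0;

        // Assumption: the extracted text is UTF-16 encoded.
        // Disposing the reader also disposes the underlying stream.
        using (var reader = new StreamReader(buffer, Encoding.Unicode))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}
```

If the text should go straight to disk instead, GetText() can be given an output file path rather than a stream.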

The PdfExtractor can also extract the text of each page into a separate file. To use this feature, run a while loop that checks the value returned by the HasNextPageText() method and retrieves each page's text with the GetNextPageText() method.
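
A page-by-page loop of this kind might look like the sketch below. The class name, the output directory, and the page_N.txt naming scheme are hypothetical and chosen only for illustration; the extractor calls are the ones named in the paragraph above.

```csharp
using System;
using System.IO;
using Aspose.Pdf.Facades;

class ExtractTextPerPage
{
    static void Main(string[] args)
    {
        // Assumptions: hypothetical default input file and output directory.
        string inputPath = args.Length > 0 ? args[0] : "input.pdf";
        string outputDir = args.Length > 1 ? args[1] : "pages";
        Directory.CreateDirectory(outputDir);

        var extractor = new PdfExtractor();
        extractor.BindPdf(inputPath);
        extractor.ExtractText();

        // Each iteration saves the text of the next page to its own file.
        int pageNumber = 1;
        while (extractor.HasNextPageText())
        {
            string outputPath = Path.Combine(outputDir, "page_" + pageNumber + ".txt");
            extractor.GetNextPageText(outputPath);
            pageNumber++;
        }

        Console.WriteLine("Wrote " + (pageNumber - 1) + " page file(s) to " + outputDir);
    }
}
```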

So, let's start. The Command Line Parser Library was used to parse command-line arguments in this example, which keeps the implementation of the console application pretty straightforward.

The following snippet shows the class for command-line options:
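
A minimal sketch of such an options class, assuming version 2.x of the CommandLineParser NuGet package. The option names (-i/--input, -o/--output) and the property names are hypothetical and may differ from the original application. The Main method only shows how the parsed options would be handed to the extraction code.

```csharp
using System;
using CommandLine;

// Hypothetical options class: names and help texts are illustrative.
class Options
{
    [Option('i', "input", Required = true, HelpText = "Path to the input PDF file.")]
    public string InputFile { get; set; }

    [Option('o', "output", Required = false, HelpText = "Path to the output text file.")]
    public string OutputFile { get; set; }
}

class Program
{
    static int Main(string[] args)
    {
        // Parse the arguments; return a non-zero exit code on invalid input.
        return Parser.Default.ParseArguments<Options>(args)
            .MapResult(
                options => Run(options),
                errors => 1);
    }

    static int Run(Options options)
    {
        // Assumption: when no output file is given, the text goes to the console.
        string target = options.OutputFile ?? "(console)";
        Console.WriteLine("Input:  " + options.InputFile);
        Console.WriteLine("Output: " + target);

        // The PdfExtractor calls (BindPdf, ExtractText, GetText) would go here.
        return 0;
    }
}
```

With these assumed names, the application would be started with arguments such as `-i input.pdf -o output.txt`; when the required input option is missing, the parser prints the generated help text.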

Conclusion

This article covered a simple example, but the Aspose.PDF for .NET library also has tools for more complex cases. To learn more, please follow the Developer Guide or leave a comment on this article.