How to Programmatically Extract Data from a PDF Using .NET C#

MESCIUS inc.
13 min read · May 5, 2022

PDF is one of the most common formats for agreements, contracts, forms, invoices, books, and many other documents. Each PDF document consists of a variety of elements such as text, images, and attachments (among others). When working with a PDF document, you may need to extract different types of elements from it.

GrapeCity Documents for PDF lets you parse the PDF document and extract the data and elements, allowing developers to utilize data in other applications or projects, such as databases or other documents.

This blog will help developers get an understanding of how to use the GrapeCity Documents for PDF API and C# to programmatically retrieve the data and elements they need from PDF files:

  1. Extract text from a PDF document using C#
  2. Extract invoice total amount and customer email address in a PDF with Regex using C#
  3. Extract PDF document information using C#
  4. Extract images from a PDF document using C#
  5. Extract attachments from a PDF document using C#
  6. Extract PDF form data and save it as XML using C#
  7. Extract Fonts using C#
  8. Extract data from a table using C#
  9. Extract data from a multi-page table using C#
  10. Extract data from structure tags using C#

To begin with, an understanding of how to start working with GrapeCity Documents for PDF is required. Refer to the documentation and demo quick start to get up and running quickly. Once that setup is complete, we can get started and understand each of the above-listed implementations in detail using the GcPdf API members and C# code snippets.

Extract text from a PDF document using C#

At the heart of every PDF is text, which normally makes up the majority of the document. Extracting text therefore tends to be the most commonly required function. Developers can extract all text from a document, extract the text of a single page, or extract text from a specific location on a page. The sections ahead describe how to perform each of these extractions on a PDF document.

Extract all text from a document using C#

The code snippet below shows how to extract all the text from a PDF document using the GetText method of GcPdfDocument class, which returns the document text as a string.

// Requires: using System; using System.IO; using GrapeCity.Documents.Pdf;
void ExtractAllText()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "ItalyDoc.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        //Extract All text from PDF document
        var text = doc.GetText();
        //Display the results:
        Console.WriteLine("PDF Text: \n \n" + text);

        //Extract specific page text:
        var pageText = doc.Pages[0].GetText();
        //Display the results:
        Console.WriteLine("PDF Page Text: \n" + pageText);
    }
}

Extract all text from a specific page using C#

Similarly, the text of a particular page in a PDF document is extracted by invoking the GetText method of the Page class. The page can be accessed using the Pages property of the GcPdfDocument class.

void ExtractPageText()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "ItalyDoc.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        //Extract specific page text:
        var pageText = doc.Pages[0].GetText();

        //Display the results:
        Console.WriteLine("PDF Page Text: \n" + pageText);
    }
}

Extract text from predefined bounds using C#

This section and code snippet focus on understanding how to extract text from a known physical position in the document. Begin by getting the page text map using the GetTextMap method of Page class. After retrieving the page’s text map, extract the text fragment at a specific location by invoking the GetFragment method of TextMap class and passing in the known bounds as parameters to the GetFragment method.

The extracted text is returned via an out parameter passed to the GetFragment method.

void ExtractTextFromBounds()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "ItalyDoc.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        //Get Page TextMap
        var tmap = doc.Pages[0].GetTextMap();

        //Retrieve text at a specific (known to us) geometric location on the page.
        //The values below are multiplied by 72 before being passed to HitTest:
        float tx0 = 7.1f, ty0 = 2.0f, tx1 = 3.1f, ty1 = 3.5f;
        HitTestInfo htiFrom = tmap.HitTest(tx0 * 72, ty0 * 72);
        HitTestInfo htiTo = tmap.HitTest(tx1 * 72, ty1 * 72);
        tmap.GetFragment(htiFrom.Pos, htiTo.Pos, out TextMapFragment range1, out string text);

        //Display the results:
        Console.WriteLine("Text extracted from specific bounds: \n \n" + text);
    }
}

The image below depicts the result of the above code, which showcases the text extracted from specific bounds of the document displayed in the console window:
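The snippet above multiplies each coordinate by 72, which suggests the values are written in inches and converted to points (1/72 inch) before being passed to the API. As a minimal sketch, assuming that convention holds for your document, a small helper can keep the conversion in one place. PdfUnitsSketch and its members are hypothetical names used for illustration and are not part of the GcPdf API; only System.Drawing types from .NET are used.

```csharp
using System;
using System.Drawing;

// Hypothetical helper (not part of GcPdf): converts inch-based
// coordinates to points, assuming 72 points per inch.
public static class PdfUnitsSketch
{
    public const float PointsPerInch = 72f;

    // Converts a single length from inches to points.
    public static float InchesToPoints(float inches) => inches * PointsPerInch;

    // Builds a point from coordinates given in inches.
    public static PointF PointFromInches(float x, float y) =>
        new PointF(InchesToPoints(x), InchesToPoints(y));

    // Builds a rectangle from a position and size given in inches.
    public static RectangleF RectFromInches(float x, float y, float width, float height) =>
        new RectangleF(InchesToPoints(x), InchesToPoints(y),
                       InchesToPoints(width), InchesToPoints(height));

    public static void Main()
    {
        // Example values only; pick coordinates that match your own document.
        PointF to = PointFromInches(3.1f, 3.5f);
        RectangleF bounds = RectFromInches(0f, 3f, 8.5f, 3.75f);
        Console.WriteLine(to);
        Console.WriteLine(bounds);
    }
}
```

With such a helper, a hit-test call could take the X and Y of the converted point instead of repeating the multiplication inline.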

Extract invoice total amount and customer email address in a PDF with regex using C#

PDF documents are often used to generate invoices, purchase orders, delivery receipts, and many similar documents. Because of the prevalence of these documents, there might be a need to process these documents and extract useful information such as invoice total or customer email address. To accomplish this, the GcPdf API offers the use of regular expressions so that specific known pieces of information can be quickly and easily extracted from documents.

void ExtractRegex()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "InvoiceDemo.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        //Find Invoice total amount
        FindTextParams searchParam1 = new FindTextParams(@"(Total)\r\n\$([-+]?[0-9]*\.?[0-9]+)", false, false, 72, 72, true, true);
        IList<FoundPosition> pos1 = doc.FindText(searchParam1);
        //Guard against documents where the pattern does not match
        if (pos1.Count > 0)
        {
            string totalAmount = pos1[0].NearText.Substring(pos1[0].PositionInNearText + pos1[0].TextMapFragment[0].Length).TrimStart();
            Console.WriteLine("Total amount found using regex in FindText method: " + totalAmount);
        }

        //Find customer's email address from Invoice
        FindTextParams searchParam2 = new FindTextParams(@"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+", false, false, 72, 72, true, true);
        IList<FoundPosition> pos2 = doc.FindText(searchParam2);
        if (pos2.Count > 0)
        {
            string foundEmail = pos2[0].NearText.Substring(pos2[0].PositionInNearText, pos2[0].TextMapFragment[0].Length);
            Console.WriteLine("Email Address found using regex in FindText method: " + foundEmail);
        }
    }
}

The image below depicts the result of the above code displayed in the console window:
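The same two patterns can also be tried against a plain string, for example the text returned by GetText, using the standard System.Text.RegularExpressions classes. The following is a minimal sketch of that alternative and not the FindText approach shown above. InvoiceRegexSketch, its method names, and the sample text are hypothetical; the total pattern is also loosened slightly (optional carriage return, a named group) as an assumption about how line breaks may appear in extracted text.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: applies regular expressions to already-extracted text.
public static class InvoiceRegexSketch
{
    // "Total", a line break, then a dollar amount captured in the "amount" group.
    private static readonly Regex TotalPattern =
        new Regex(@"Total\s*\r?\n\s*\$(?<amount>[-+]?[0-9]*\.?[0-9]+)");

    // Same email pattern as in the article's FindTextParams call.
    private static readonly Regex EmailPattern =
        new Regex(@"[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+");

    // Returns the parsed total, or null when no match or no valid number is found.
    public static decimal? FindTotal(string text)
    {
        Match m = TotalPattern.Match(text);
        if (!m.Success)
            return null;
        return decimal.TryParse(m.Groups["amount"].Value, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out decimal amount)
            ? amount
            : (decimal?)null;
    }

    // Returns the first email-like string, or an empty string when none is found.
    public static string FindEmail(string text)
    {
        Match m = EmailPattern.Match(text);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical text standing in for the string extracted from an invoice.
        string text = "Customer: jane.doe@example.com\r\nTotal\r\n$1250.75\r\n";
        Console.WriteLine(FindTotal(text));
        Console.WriteLine(FindEmail(text));
    }
}
```

Note that the email pattern accepts dots at the end of the domain part, so an address followed directly by a sentence-ending period would include that period; trimming or a stricter pattern may be needed for real invoices.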

Extract PDF document information using C#

All PDF documents have data associated with them that is not necessarily part of the visible portion of the document. These items include the author, the producer, the date the document was created, and other properties. This information may need to be extracted for different purposes, such as crediting the appropriate authors or searching for similar documents by the same author.

As the snippet below shows, this can be accomplished by using the DocumentInfo property of GcPdfDocument class.

void ExtractPDFInformation()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "C1Olap-QuickStart.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        //Extract PDF file information using the DocumentInfo property
        Console.WriteLine("Displaying PDF file information:\n \n");
        Console.WriteLine("Author: " + doc.DocumentInfo.Author);
        Console.WriteLine("Creator: " + doc.DocumentInfo.Creator);
        Console.WriteLine("Producer: " + doc.DocumentInfo.Producer);
        Console.WriteLine("Subject: " + doc.DocumentInfo.Subject);
        Console.WriteLine("Title: " + doc.DocumentInfo.Title);
        Console.WriteLine("Creation Date: " + doc.DocumentInfo.CreationDate);
        Console.WriteLine("Keywords: " + doc.DocumentInfo.Keywords);
    }
}

The image below depicts the result of the above code:

Extract images from a PDF document using C#

Images are another common element of a PDF document that users might need to extract from the document to either create an image repository or to use in creating a new PDF document. The GetImages method of GcPdfDocument class can be used to extract all the images from a PDF document.

This method returns a collection of images that can be saved as individual images in different image formats including jpg, png, and bmp using the GcBitmap class.

The code snippet below shows how easy it is to extract all the images from a PDF document and save each one as an individual image:

void ExtractImages()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "ItalyDoc.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        //Extract images from the loaded PDF
        var imageInfos = doc.GetImages();

        int n = 1;
        foreach (var imageInfo in imageInfos)
        {
            //Save extracted images; the bitmap is disposed when the using block ends
            using (GcBitmap extImage = imageInfo.Image.GetNativeImage(Color.Transparent, Color.Transparent))
            {
                extImage.SaveAsPng("Image" + n.ToString() + ".png");
            }
            n++;
        }
    }
}

The image below displays the result of the above code, showcasing the image extracted from the PDF document saved as a PNG file:

Extract attachments from a PDF document using C#

Another element that is found in a PDF document is an attachment, which can be images, Word documents, PowerPoint presentations, or Excel worksheets containing important information. When working with a PDF document, users may want to extract a specific attachment from the PDF document for external use. An attachment can either be associated with a specific page using the FileAttachmentAnnotation or it can be added to the PDF document as an embedded file.

For details, you may refer to the Attachment documentation topic.

The code snippet below implements both these scenarios:

void ExtractAttachments()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "DocFileAttachment.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        //Make sure the output folder exists before writing files to it
        Directory.CreateDirectory("Attachments");

        //Get File Attachment using FileAttachmentAnnotation
        //Get first page from document
        Page page = doc.Pages[0];

        //Get the annotation collection from pages
        AnnotationCollection annotations = page.Annotations;

        //Iterates the annotations
        foreach (AnnotationBase annot in annotations)
        {
            //Check for the attachment annotation
            if (annot is FileAttachmentAnnotation fileAnnot)
            {
                FileSpecification.FileInfo file = fileAnnot.File.File;
                //Extracts the attachment and saves it to the disk.
                //Path.GetFileName drops any directory part stored in the attachment name.
                string path = Path.Combine("Attachments", Path.GetFileName(file.FileName));
                using (FileStream stream = new FileStream(path, FileMode.Create))
                {
                    file.EmbeddedFileStream.GetStream().CopyTo(stream);
                }
            }
        }

        //Get Document Attachments (document level attachments)
        //Iterates through the attachments
        if (doc.EmbeddedFiles.Count != 0)
        {
            foreach (KeyValuePair<string, FileSpecification> attachment in doc.EmbeddedFiles)
            {
                //Extracts the attachment and saves it to the disk
                FileSpecification.FileInfo file = attachment.Value.File;
                string path = Path.Combine("Attachments", Path.GetFileName(file.FileName));
                using (FileStream stream = new FileStream(path, FileMode.Create))
                {
                    file.EmbeddedFileStream.GetStream().CopyTo(stream);
                }
            }
        }
    }
}

The image below showcases the results of extracting file attachments and document attachments from a PDF document and saving them as individual files:

Extract PDF form data and save it as XML using C#

PDF is one of the most commonly used formats for forms that gather information, such as employment applications, legal forms, and banking forms. In all such instances, it is important to extract the information from these forms for further processing. XML is a common format for data processing, so saving PDF form data as XML is a basic requirement in many applications. The ExportFormDataToXML method of the GcPdfDocument class, used in the snippet below, writes the form data either to a stream or directly to a file.

void ExtractFormData()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "filledForm.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        using (MemoryStream stream = new MemoryStream())
        {
            //Export the form data to a stream
            doc.ExportFormDataToXML(stream);
        }

        //Alternatively, we can even export to a file of appropriate format
        //Export the form data to an XML file
        doc.ExportFormDataToXML("sampleFormData.xml");
    }
}

The image below showcases the data extracted from an event feedback form, saved as an XML file using the above code:
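Once the form data is saved as XML, it can be read back with the standard System.Xml.Linq classes. The sketch below is a schema-agnostic illustration: it lists every leaf element with its path and text, because the exact layout of the exported file is not described in this post. FormXmlSketch, ReadLeafValues, and the sample XML are hypothetical and do not represent the actual output of ExportFormDataToXML; values stored in attributes would need additional handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: lists leaf elements of an XML document as path/value pairs.
public static class FormXmlSketch
{
    public static List<KeyValuePair<string, string>> ReadLeafValues(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        var result = new List<KeyValuePair<string, string>>();
        if (doc.Root == null)
            return result;

        // A leaf is an element with no child elements; its text is the field value.
        foreach (XElement el in doc.Root.DescendantsAndSelf().Where(e => !e.HasElements))
        {
            // Build a path such as "fields/name" from the root down to the leaf.
            string path = string.Join("/",
                el.AncestorsAndSelf().Reverse().Select(a => a.Name.LocalName));
            result.Add(new KeyValuePair<string, string>(path, el.Value.Trim()));
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical XML used only to exercise the helper,
        // NOT the real format written by the library.
        string xml = "<fields><name>Jane Doe</name><rating>5</rating></fields>";
        foreach (KeyValuePair<string, string> pair in ReadLeafValues(xml))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

For a file on disk, the text could be loaded first with File.ReadAllText and passed to the same method; inspecting the real exported file is the safest way to decide which elements or attributes to read.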

Extract Fonts using C#

A PDF document consists of many different types of fonts used to format the document text. It is not easy to recognize all the fonts used in a document by visually inspecting the words. It would be nice to have a method that lists all the fonts used in a document for quick editing or font replacement.

The GetFonts method of GcPdfDocument class returns a list of all the fonts used in the PDF document. The code snippet below shows how to use this method to extract fonts from a PDF document:

void ExtractFonts()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "ItalyDoc.pdf")))
    {
        //Load Sample PDF document
        GcPdfDocument doc = new GcPdfDocument();
        doc.Load(fs);

        //Extract list of fonts used in loaded PDF document
        var fonts = doc.GetFonts();

        Console.WriteLine($"Total of {fonts.Count} fonts found in PDF file: \n");

        int i = 1;
        foreach (var font in fonts)
        {
            Console.WriteLine($"{i}. BaseFont: {font.BaseFont}; IsEmbedded: {font.IsEmbedded}.");
            ++i;
        }
    }
}

The image below showcases the result of executing the above code:

Please refer to the following demo, which showcases the use of GetFonts method.

Extract data from a table using C#

Tables are common in PDF documents, whether they list the items in an invoice or purchase order or present data and statistics in a financial report. Because of this, extracting tables and their data from a PDF document is a requirement in many cases. In the snippet below, the GetTable method of the Page class is called with the approximate bounds of the table; it returns the table found within those bounds, or null when no table is recognized there. The text of each extracted cell is then written to a new PDF document.

void ExtractTableData()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
    {
        const float DPI = 72;
        const float margin = 36;
        var doc = new GcPdfDocument();

        var tf = new TextFormat()
        {
            FontSize = 9,
            ForeColor = Color.Black
        };
        var tfHdr = new TextFormat(tf)
        {
            FontSize = 11,
            ForeColor = Color.DarkBlue
        };
        var tfRed = new TextFormat(tf) { ForeColor = Color.Red };

        //The approx table bounds:
        var tableBounds = new RectangleF(0, 3 * DPI, 8.5f * DPI, 3.75f * DPI);

        var page = doc.NewPage();
        page.Landscape = true;
        var g = page.Graphics;

        var tl = g.CreateTextLayout();
        tl.MaxWidth = page.Bounds.Width;
        tl.MaxHeight = page.Bounds.Height;
        tl.MarginAll = margin;
        tl.DefaultTabStops = 150;
        tl.LineSpacingScaleFactor = 1.2f;

        var docSrc = new GcPdfDocument();
        docSrc.Load(fs);

        var itable = docSrc.Pages[0].GetTable(tableBounds);

        if (itable == null)
        {
            tl.AppendLine("No table was found at the specified coordinates.", tfRed);
        }
        else
        {
            tl.Append($"\nThe table has {itable.Cols.Count} column(s) and {itable.Rows.Count} row(s), table data is:", tfHdr);
            tl.AppendParagraphBreak();
            for (int row = 0; row < itable.Rows.Count; ++row)
            {
                var tfmt = row == 0 ? tfHdr : tf;
                for (int col = 0; col < itable.Cols.Count; ++col)
                {
                    var cell = itable.GetCell(row, col);
                    if (col > 0)
                        tl.Append("\t", tfmt);
                    if (cell == null)
                        tl.Append("<no cell>", tfRed);
                    else
                        tl.Append(cell.Text, tfmt);
                }
                tl.AppendLine();
            }
        }

        //Draw the extracted data, adding pages as needed:
        TextSplitOptions to = new TextSplitOptions(tl) { RestMarginTop = margin, MinLinesInFirstParagraph = 2, MinLinesInLastParagraph = 2 };
        tl.PerformLayout(true);
        while (true)
        {
            var splitResult = tl.Split(to, out TextLayout rest);
            doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
            if (splitResult != SplitResult.Split)
                break;
            tl = rest;
            doc.NewPage().Landscape = true;
        }
        doc.Save("ExtractTableData.pdf");
    }
}

The image below showcases the result of executing the above code:

You can refer to the following demo for detailed implementation.

Extract data from a multi-page table using C#

In this section, we extend the usage of the GetTable method described in the last section to extract tables that are split over multiple pages in a PDF document. The code snippet below showcases how the GetTable method can be used to extract data from a multi-page table.

void ExtractMultiPageTableData()
{
    const float DPI = 72;
    const float margin = 36;
    var doc = new GcPdfDocument();

    var tf = new TextFormat()
    {
        FontSize = 9,
        ForeColor = Color.Black
    };
    var tfHdr = new TextFormat(tf)
    {
        FontSize = 11,
        ForeColor = Color.DarkBlue
    };
    var tfRed = new TextFormat(tf) { ForeColor = Color.Red };

    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "product-list.pdf")))
    {
        var page = doc.NewPage();
        page.Landscape = true;
        var g = page.Graphics;

        var tl = g.CreateTextLayout();
        tl.MaxWidth = page.Bounds.Width;
        tl.MaxHeight = page.Bounds.Height;
        tl.MarginAll = margin;
        tl.DefaultTabStops = 165;

        var docSrc = new GcPdfDocument();
        docSrc.Load(fs);

        for (int i = 0; i < docSrc.Pages.Count; ++i)
        {
            // TableExtractOptions allow you to fine-tune table recognition accounting for
            // specifics of the table formatting:
            var teo = new TableExtractOptions();
            var GetMinimumDistanceBetweenRows = teo.GetMinimumDistanceBetweenRows;
            // In this particular case, we slightly increase the minimum distance between rows
            // to make sure cells with wrapped text are not mistaken for two cells:
            teo.GetMinimumDistanceBetweenRows = (list) =>
            {
                var res = GetMinimumDistanceBetweenRows(list);
                return res * 1.2f;
            };
            var top = i == 0 ? DPI * 2 : DPI;

            // Get the table at the specified bounds:
            var itable = docSrc.Pages[i].GetTable(new RectangleF(DPI * 0.25f, top, DPI * 8, DPI * 10.5f - top), teo);

            // As in the single-page example, handle pages where no table is found:
            if (itable == null)
            {
                tl.AppendLine($"No table was found on page {i + 1} at the specified coordinates.", tfRed);
                continue;
            }

            // Add table data to the text layout:
            tl.Append($"\nTable on page {i + 1} of the source document has {itable.Cols.Count} column(s) and {itable.Rows.Count} row(s), table data is:", tfHdr);
            tl.AppendParagraphBreak();
            for (int row = 0; row < itable.Rows.Count; ++row)
            {
                var tfmt = row == 0 ? tfHdr : tf;
                for (int col = 0; col < itable.Cols.Count; ++col)
                {
                    var cell = itable.GetCell(row, col);
                    if (col > 0)
                        tl.Append("\t", tfmt);
                    if (cell == null)
                        tl.Append("<no cell>", tfRed);
                    else
                        tl.Append(cell.Text, tfmt);
                }
                tl.AppendLine();
            }
        }

        // Print the extracted data:
        TextSplitOptions to = new TextSplitOptions(tl) { RestMarginTop = margin, MinLinesInFirstParagraph = 2, MinLinesInLastParagraph = 2 };
        tl.PerformLayout(true);
        while (true)
        {
            var splitResult = tl.Split(to, out TextLayout rest);
            doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
            if (splitResult != SplitResult.Split)
                break;
            tl = rest;
            doc.NewPage().Landscape = true;
        }

        doc.Save("ExtractMultiPageTableData.pdf");
    }
}

The image below showcases the results of executing the above code:

You can refer to the following demo for detailed implementation.

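Extract data from structure tags using C#

A tagged PDF is a PDF document that contains tags. Tags provide a logical structure describing how the content of the PDF is presented through assistive technology, making a tagged PDF document accessible to everyone. Therefore, extracting data based on these structure tags is another type of extraction expected from a PDF document API. The snippet below reads the logical structure of the document and prints its headings (elements whose type is H1, H2, and so on), indented by heading level.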

void ExtractStructureTagData()
{
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "TaggedPdf.pdf")))
    {
        var doc = new GcPdfDocument();
        doc.Load(fs);

        // Get the LogicalStructure and top parent element:
        LogicalStructure ls = doc.GetLogicalStructure();
        Element root = ls.Elements[0];

        foreach (Element e in root.Children)
        {
            string type = e.StructElement.Type;
            if (string.IsNullOrEmpty(type) || !type.StartsWith("H"))
                continue;
            // Note: topmost level is 1:
            if (!int.TryParse(type.Substring(1), out int headingLevel) || headingLevel < 1)
                continue;

            // Get the element text:
            string text = e.GetText();
            text = new string(' ', (headingLevel - 1) * 2) + text;
            Console.WriteLine(text);
        }
    }
}

The image below showcases the result of executing the above code:

Refer to the following demo to see how different types of data can be extracted from a PDF document using varied structure tags.

Download the sample to go through the detailed implementation of all the code snippets described in this post. You may even try replacing the resource PDF files in this sample to see how these PDF data extraction techniques can be helpful for you.

For further information on GrapeCity Documents for PDF, you may refer to the demos and documentation.

Originally published at https://www.developer.mescius.com on May 5, 2022.
