Programmatically Search and Highlight Text in PDFs using C# in .NET

MESCIUS inc.

Published in

MESCIUS inc.

8 min read · Feb 21, 2024

What You Will Need
• Visual Studio Code
• .NET 8
• NuGet Package: DS.Documents.Pdf 7.0.3

Controls Referenced
• Document Solutions for PDF — A .NET/C# PDF API
Documentation | Online Demo Explorer

Tutorial Concept
This tutorial discusses programmatically conducting text searches and highlighting found text in PDFs using a C#/.NET PDF API.

This tutorial covers several ways to programmatically search for, find, and highlight text within PDF documents using a .NET/C# API. We will go over loading a PDF, conducting text searches, and creating highlight markups with different colors and shapes. The examples use Document Solutions for PDF (DsPdf, formerly GcPdf), a C#/.NET API for generating and processing PDFs. The generated PDFs are shown in the included JavaScript Document Solutions PDF Viewer.

This blog will cover how to conduct the following PDF text searches programmatically using a C# .NET PDF API:

Find and Highlight Text in a PDF Document
Search for Text on a Specific PDF Page
Find and Highlight Text From a Specific Range of PDF Pages
Search for Text in a PDF Based on Structure Tags
Find and Markup Transformed Text in PDFs

To Follow Along, Download a Sample App for this Tutorial Here.

Find and Highlight Text in a PDF Document Using C#

DsPdf supports programmatic text searches in PDF documents through its FindText method, which locates all instances of the specified text. Each found item can then be highlighted by drawing on the page's Graphics object, using the bounds of the found text and a System.Drawing color. Search behavior is set through the FindTextParams constructor, with options such as wholeWord and matchCase. These parameters determine whether the search should match whole words, be case-sensitive, or both.

Note: To follow along with this section, you must include the GrapeCity.Documents.Common namespace.

The following code will search for the whole word “wetlands” in a PDF and then highlight the found text:

// Initialize the DsPdf document instance
var doc = new GcPdfDocument();
  
using (var fs = new FileStream(Path.Combine("wetlands.pdf"),FileMode.Open, FileAccess.Read))
{
   // Load a sample PDF  
   doc.Load(fs);
   // Use the FindText method to search for "wetlands", using a case-insensitive, whole-word match
   var findsWetlands = doc.FindText(new FindTextParams("wetlands", true, false), OutputRange.All);

   // Highlight all found text using semi-transparent orange red
   foreach (var find in findsWetlands)
      doc.Pages[find.PageIndex].Graphics.FillPolygon(find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
   
   doc.Save("1 - Search and Highlight Text.pdf");
}
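
The wholeWord and matchCase flags described in this section can be illustrated without a PDF. The sketch below is not DsPdf's implementation: CountMatches is a hypothetical helper that uses .NET regular expressions on a plain string, and DsPdf's exact word-boundary rules may differ. It only shows what each flag combination means conceptually.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not a DsPdf API): counts matches in plain text
// to show what the wholeWord and matchCase flags mean conceptually.
static int CountMatches(string text, string term, bool wholeWord, bool matchCase)
{
    // Escape the term so characters such as '.' are matched literally
    var pattern = Regex.Escape(term);
    // \b anchors the term to word boundaries on both sides
    if (wholeWord) pattern = $@"\b{pattern}\b";
    var options = matchCase ? RegexOptions.None : RegexOptions.IgnoreCase;
    return Regex.Matches(text, pattern, options).Count;
}

var sample = "Wetlands filter water. Coastal wetlands border wetland-adjacent farms.";

Console.WriteLine(CountMatches(sample, "wetlands", wholeWord: true, matchCase: false));  // 2
Console.WriteLine(CountMatches(sample, "wetlands", wholeWord: true, matchCase: true));   // 1
Console.WriteLine(CountMatches(sample, "wetland", wholeWord: false, matchCase: false));  // 3
Console.WriteLine(CountMatches(sample, "wetland", wholeWord: true, matchCase: false));   // 1
```

The first call mirrors the (true, false) arguments passed to FindTextParams in the example above: a whole-word, case-insensitive match.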

Developers can do a multitude of searches and apply different types of markups. See our online documentation and demo explorer to learn more.

Search for Text on a Specific PDF Page using C#

In some scenarios, you may want to limit a text search to a particular page rather than scanning the entire PDF document. To do this, get the text map of the page by its index and run the search only within that text map. The steps are: create a new FindTextParams instance, get the page's text map with the GetTextMap method, and call the text map's FindText method.

The following code demonstrates this by searching and highlighting the word “the” on the 2nd page of the PDF document.

        // Create new instance of PDF document  
        GcPdfDocument doc = new GcPdfDocument();
        using (var fs = new FileStream(Path.Combine("wetlands.pdf"), FileMode.Open, FileAccess.Read))
        {
           // Load existing PDF  
            doc.Load(fs);
            // 1. Create a new instance of FindTextParams
            var ftp = new FindTextParams("the", true, false);
            // 2. Get the text map of a page by its index. Note: the index starts at 0, so this searches page 2
            var tm = doc.Pages[1].GetTextMap();
            if (tm != null)
                // 3. Search within the text map using the FindText method and highlight found text in orange
                tm.FindText(ftp, (p_) => {
                    doc.Pages[1].Graphics.FillPolygon(p_.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
                });
            doc.Save("2 - Search Text Only Page 2.pdf");
        }

Find and Highlight Text From a Specific Range of PDF Pages Using C#

Searching for text within a specific page range in a PDF is useful for focused analysis. This targeted approach can improve performance and isolates content for detailed examination. To do this programmatically, pass an OutputRange instance to the FindText method. In the code below, the searchRange variable holds an OutputRange created from the first and last page numbers of the range.

Note: To follow along with this section, you must include the GrapeCity.Documents.Common namespace.

The code below will search and highlight text only on pages 2 and 3 of the provided PDF document.

 // Initialize the DsPdf document instance
 var doc = new GcPdfDocument();
 using (var fs = new FileStream(Path.Combine("wetlands.pdf"),
       FileMode.Open, FileAccess.Read))
 {
     // Load an existing document from file stream  
     doc.Load(fs);
     // Create a new FindTextParams instance
     var ftp = new FindTextParams("the", true, false);
     // Define to and from page range properties  
     OutputRange searchRange = new OutputRange(2, 3);
     // Find all text using case-insensitive word search within the page range  
     var findsTextThe = doc.FindText(ftp, searchRange);

     // Highlight each find on its page using semi-transparent orange red
     foreach (var find in findsTextThe)
         doc.Pages[find.PageIndex].Graphics.FillPolygon(find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
     doc.Save("3 - Find and Highlight Text From a Specific Range of PDF Pages.pdf");
 }
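
The two preceding examples count pages differently: Pages[1] uses a 0-based index to reach page 2, while OutputRange(2, 3) is given the page numbers 2 and 3. The sketch below shows one way to convert a page-number range into page indices, which could be useful when looping over Pages and searching each page's text map. ToPageIndices is a hypothetical helper, not part of DsPdf, and it assumes page numbers are 1-based, as the range example suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not a DsPdf API): converts a 1-based, inclusive
// page-number range into 0-based page indices, clamped to the page count.
static List<int> ToPageIndices(int fromPage, int toPage, int pageCount)
{
    if (fromPage < 1 || toPage < fromPage)
        throw new ArgumentOutOfRangeException(nameof(fromPage), "Expected 1 <= fromPage <= toPage.");
    // Clamp the end of the range to the last page of the document
    int last = Math.Min(toPage, pageCount);
    // The result is empty when fromPage lies past the last page
    return Enumerable.Range(fromPage, Math.Max(0, last - fromPage + 1))
                     .Select(pageNumber => pageNumber - 1)
                     .ToList();
}

Console.WriteLine(string.Join(", ", ToPageIndices(2, 3, pageCount: 5)));  // 1, 2
Console.WriteLine(string.Join(", ", ToPageIndices(4, 9, pageCount: 5)));  // 3, 4
```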

Search for Text in a PDF Based on Structure Tags

Searching by structure tags is an alternative way to narrow a text search. For instance, to locate headings such as H1, H2, or H3, use the GetLogicalStructure method to retrieve the PDF document’s logical structure. Then filter the children of the root element for the desired tag type, such as “H1”, and loop through the matching elements to highlight the ones containing the desired text.

Note: To follow along with this section, you must include the GrapeCity.Documents.Pdf.Recognition.Structure namespace.

The following code will get the PDF’s H1 tags and search through them for the text “C1Olap”.

 GcPdfDocument doc = new GcPdfDocument();
 using (var fs = new FileStream(Path.Combine("read-tags-to-outlines.pdf"), FileMode.Open, FileAccess.Read))
 {
     doc.Load(fs);
     // Get the LogicalStructure of the doc
     LogicalStructure ls = doc.GetLogicalStructure();
     if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
     {
         // No structure tags found:
         Console.WriteLine("No structure tags were found in the source document.");
         return;
     }

     // Element holds a reference of the logical structure
     Element root = ls.Elements[0];
     // Find all the H1 tags
     var find = root.Children.ToList().FindAll(e_ => e_.StructElement.Type == "H1");
     //  Loop through all found H1 tags for specific text  
     foreach (Element e in find)
     {
         var color = Color.FromArgb(64, Color.Red);
         if (e.HasContentItems)
         {
             // Get headers text  
             var text = e.GetText();
             foreach (var i in e.ContentItems)
             {
                 // Search for title with text "C1Olap"   
                 if (text.Contains("C1Olap", StringComparison.OrdinalIgnoreCase))
                 {
                     if (i is ContentItem ci)
                     {
                         var p = ci.GetParagraph();
                         if (p != null)
                         {
                             // Draw an outline around the found H1 using the paragraph's coordinates
                             ci.Page.Graphics.DrawPolygon(p.GetCoords(), color, 1, null);
                         }
                     }
                 }
             }
         }
         else
             Console.WriteLine("No Text Found");
     }
     doc.Save("4 - Search for Text in a PDF Based on Structure Tags.pdf");
     Console.WriteLine("PDF saved");
 }
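
The code above checks only the direct children of the root element. If headings are nested deeper in the structure tree, a recursive walk may be needed. The sketch below shows the idea on a stand-in tree built with .NET's XElement rather than DsPdf's Element type. FindTagged is a hypothetical helper, and the sample tree is invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper: depth-first search that yields every node whose
// tag name matches, at any depth below (and including) the given node.
static IEnumerable<XElement> FindTagged(XElement node, string tag)
{
    if (node.Name.LocalName == tag)
        yield return node;
    foreach (var child in node.Elements())
        foreach (var hit in FindTagged(child, tag))
            yield return hit;
}

// Stand-in structure tree: one H1 directly under the root, one nested in a section
var root = new XElement("Document",
    new XElement("H1", "Introduction"),
    new XElement("Sect",
        new XElement("H1", "C1Olap Overview"),
        new XElement("P", "Body text")));

foreach (var h1 in FindTagged(root, "H1"))
    if (h1.Value.Contains("C1Olap", StringComparison.OrdinalIgnoreCase))
        Console.WriteLine(h1.Value);  // C1Olap Overview
```

A similar recursion could be written over the Children collection that the example above already uses, but that adaptation is untested here.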

To learn more about reading PDF structure tags using C#, check out the online Read Structure Tags Demo.

Find and Markup Graphically Transformed Text in PDFs

PDFs can contain graphically transformed text, that is, text drawn on top of an existing page using page graphics. This is typical when adding a logo or watermark to a PDF. DsPdf can search for graphically transformed text and highlight the found items.

To accomplish this, use DsPdf’s FindText method to search for the wanted text.

Then, loop through each page containing the searched text and create a content stream using DsPdf’s ContentStreams property. With this stream, get the graphics on the page using the GetGraphics method and apply the highlighting to the bounds of the found text from the returned graphics.

The code below searches a PDF document for the text “LOGO”, which appears as graphically transformed text acting as a logo watermark, and then highlights the found instances with blue rectangles.

        // Initialize the DsPdf document instance
        var doc = new GcPdfDocument();

        using (var fs = new FileStream(Path.Combine("Transformed Text.pdf"), FileMode.Open, FileAccess.Read))
        {
            // Load an existing document from file stream  
            doc.Load(fs);
            // Find all text items 'LOGO', using case-sensitive search
            var finds = doc.FindText(new FindTextParams("LOGO", false, true), OutputRange.All);
            // Highlight all finds: first, find all pages where the text was found  
            var pgIndices = finds.Select(f_ => f_.PageIndex).Distinct();
            // Loop through pages with found text  
            foreach (int pgIdx in pgIndices)
            {
                var page = doc.Pages[pgIdx];
                // Insert a new content stream at index 0 of the page's content streams
                PageContentStream pcs = page.ContentStreams.Insert(0);
                // Get the graphics for that content stream
                var g = pcs.GetGraphics(page);
                foreach (var find in finds.Where(f_ => f_.PageIndex == pgIdx))
                {
                    foreach (var ql in find.Bounds)
                    {
                        // Set the color used to fill the polygon/highlight the found text  
                        g.FillPolygon(ql, Color.CadetBlue);
                        g.DrawPolygon(ql, Color.Blue);
                    }
                }
            } 
            doc.Save("5 - Find and Markup Graphically Transformed Text in PDFs.pdf");
        }
        Console.WriteLine("PDF saved");
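
The example above collects the distinct page indices and then filters the finds again for each page. As a design alternative, the finds can be grouped by page in one pass. The sketch below shows this with plain LINQ on stand-in tuples. The PageIndex and Label values are invented for illustration and replace DsPdf's find results, so no DsPdf calls are made.

```csharp
using System;
using System.Linq;

// Stand-in data: (PageIndex, Label) tuples in place of DsPdf's find results
var finds = new[]
{
    (PageIndex: 0, Label: "LOGO #1"),
    (PageIndex: 2, Label: "LOGO #2"),
    (PageIndex: 0, Label: "LOGO #3"),
};

// Group once by page, so any per-page setup runs a single time per group
foreach (var pageGroup in finds.GroupBy(f => f.PageIndex))
{
    Console.WriteLine($"Page index {pageGroup.Key}: {pageGroup.Count()} find(s)");
    foreach (var find in pageGroup)
        Console.WriteLine($"  highlight {find.Label}");
}
```

In the DsPdf version, the per-page setup would be the content stream and graphics creation, and the inner loop would draw the highlight polygons.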

Try our online demo for Finding Transformed Text using a .NET PDF API to see another example.

Learn More About this .NET C# PDF API

This article only scratches the surface of Document Solutions for PDF. Learn how to extract, modify, redact, apply signatures, and more with this .NET C# PDF API. Document Solutions offers a full-fledged PDF solution, including a client-side JavaScript PDF viewer control, which is showcased throughout this piece. To learn more about the .NET C# API and its JavaScript PDF viewer, check out our demos and documentation:

Document Solutions for PDF, .NET C# PDF API

Online Demo Explorer | Documentation

Document Solutions PDF Viewer, JavaScript PDF viewer control