Read text from PDF using .NET Core and Tesseract OCR

Published in IceApple Tech Talks

Mar 10, 2024

Recently, a client asked us to extract the text from PDF files and store it in Elasticsearch using .NET Core. Our initial search turned up many solutions, but none of them were open source. After a couple of days of struggle, we came up with the solution below.

PDF-to-text conversion involves two stages:

Convert each PDF page to an image (we used DtronixPdf)
Extract the text from each image using Tesseract OCR

Step 1: Convert the PDF to Image using DtronixPdf

DtronixPdf did not work for us out of the box. After a long search, we found the repository below, which renders PDF pages as images.

Download this DtronixPdf repository and add it as a project reference in your .NET Core API project.

https://github.com/inlineHamed/DtronixPdf/tree/master

Step 2: Extract the text from Image using Tesseract OCR

Install the Tesseract NuGet package. Tesseract OCR requires trained language data. The English data file can be downloaded from https://github.com/tesseract-ocr/tessdata/blob/main/eng.traineddata

NuGet\Install-Package Tesseract -Version 5.2.0

After installing the package, create a folder called tessdata (the name the sample code expects) and move the eng.traineddata file downloaded from GitHub into it.
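The sample method later in this post resolves the tessdata folder relative to AppContext.BaseDirectory, so a missing or misplaced file only shows up when the engine is constructed. The following is a minimal sketch of a fail-fast check, assuming a hypothetical helper named TessdataCheck (it is not part of the Tesseract NuGet package and uses only System.IO):

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the Tesseract NuGet package.
public static class TessdataCheck
{
    // Returns the tessdata directory if <baseDirectory>/tessdata/<language>.traineddata exists.
    // Otherwise throws, naming the full path that was expected.
    public static string Resolve(string baseDirectory, string language = "eng")
    {
        string tessdataDir = Path.Combine(baseDirectory, "tessdata");
        string trainedData = Path.Combine(tessdataDir, language + ".traineddata");

        if (!File.Exists(trainedData))
        {
            throw new FileNotFoundException(
                "Tesseract trained data not found. Expected it at: " + trainedData,
                trainedData);
        }

        return tessdataDir;
    }
}
```

The returned path could then be passed to the engine in place of the hand-built one, for example `string path = TessdataCheck.Resolve(AppContext.BaseDirectory);`.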

Now create an API endpoint that takes a file as input. The response contains the page-wise file content as an array of strings. I have attached a sample method for your reference.

        [HttpPost]
        public async Task<ActionResult<List<string>>> ExtractTextFromPDF(IFormFile file)
        {
            List<string> pdfContent = new List<string>();

            string path = Path.Combine(AppContext.BaseDirectory, "tessdata");

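            // Use only the file-name part of the client-supplied name so it cannot point outside the temp folder.
            string filePath = Path.Combine(Path.GetTempPath(), Path.GetFileName(file.FileName));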

            if (file.Length > 0)
            {
                using (Stream fileStream = new FileStream(filePath, FileMode.Create))
                {
                    await file.CopyToAsync(fileStream);
                }
            }

            if (System.IO.File.Exists(filePath))
            {
                using (var engine = new TesseractEngine(path, "eng", EngineMode.Default))
                {
                    PdfDocument pdfDocument = new PdfDocument();
                    await pdfDocument.Load(filePath, null);
                    if (pdfDocument.Pages > 0)
                    {
                        try
                        {

                            for (int i = 0; i < pdfDocument.Pages; i++)
                            {
                                System.Text.StringBuilder stringBuilder = new System.Text.StringBuilder();

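                                // Write the rendered page to a uniquely named PNG in the temp folder.
                                string imageFileName = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".png");

                                await using var page = await pdfDocument.GetPage(i);

                                await using var result = await page.Render(RenderFlags.RenderAnnotations, scale: 2, DispatcherPriority.Normal);

                                // Save as PNG so the format matches the file extension; PNG is lossless, which suits OCR.
                                result.ToBitmap().Save(imageFileName, System.Drawing.Imaging.ImageFormat.Png);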

                                using (var img = Pix.LoadFromFile(imageFileName))
                                {
                                    using (var pageTest = engine.Process(img))
                                    {
                                        var text = pageTest.GetText();

                                        if (!string.IsNullOrEmpty(text))
                                        {
                                            stringBuilder.AppendLine("Page :" + (i + 1).ToString());
                                            stringBuilder.AppendLine(text);
                                        }
                                    }
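                                }

                                // Remove the temporary page image once it has been processed.
                                System.IO.File.Delete(imageFileName);

                                pdfContent.Add(stringBuilder.ToString());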

                            }


                        }
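                        catch (Exception ex)
                        {
                            // Do not swallow rendering or OCR failures silently; report them to the caller.
                            return StatusCode(StatusCodes.Status500InternalServerError, "Failed to extract text: " + ex.Message);
                        }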
                        finally
                        {
                            if (pdfDocument != null)
                            {
                                await pdfDocument.DisposeAsync();
                                pdfDocument = null;
                            }

                            // Remove the uploaded PDF from the temp folder.
                            System.IO.File.Delete(filePath);
                        }
                    }

                }

            }

            return pdfContent;
        }
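
To try the endpoint from another .NET application, the PDF has to be sent as multipart/form-data, and the form field must be named file so that model binding maps it to the IFormFile file parameter. The following is a rough sketch only: PdfUploadClient is a hypothetical helper, and the endpoint URL is whatever route your controller exposes (the post does not specify one).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Threading.Tasks;

// Hypothetical client helper for illustration.
public static class PdfUploadClient
{
    // Builds the multipart/form-data body. The part name "file" matches
    // the `IFormFile file` parameter of the sample action.
    public static MultipartFormDataContent BuildContent(Stream pdfStream, string fileName)
    {
        var fileContent = new StreamContent(pdfStream);
        fileContent.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");

        var form = new MultipartFormDataContent();
        form.Add(fileContent, "file", fileName);
        return form;
    }

    // Posts the PDF and reads the JSON array of page texts from the response.
    // endpointUrl is an assumption: pass the route your own controller uses.
    public static async Task<List<string>> ExtractAsync(HttpClient client, string endpointUrl, string pdfPath)
    {
        await using var stream = File.OpenRead(pdfPath);
        using var form = BuildContent(stream, Path.GetFileName(pdfPath));
        using var response = await client.PostAsync(endpointUrl, form);
        response.EnsureSuccessStatusCode();

        var pages = await response.Content.ReadFromJsonAsync<List<string>>();
        return pages ?? new List<string>();
    }
}
```

Each returned string holds one page, starting with the "Page :N" line that the sample method writes, so the caller can index pages individually in Elasticsearch if needed.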


Written by Ramesh Angamuthu