Extract Pdf text using MailboxProcessor in Suave Web app

Jul 16, 2017

In this article, we will look at using MailboxProcessor to extract text from a Pdf file in a Suave web app. We will use PdfBox to extract the text.

We begin by creating a new console app using the following command:

dotnet new console -lang F#

This will create a new console app. Next, we will add Suave to this project using the following command:

dotnet add package Suave

I am using VSCode with the Ionide plugin to write code, but you can use any IDE of your choice. Let’s open this project in VSCode. At the command line, enter the following command:

code .

This should open the project in VSCode.

If you open and look at the .fsproj file you will see that Suave has been added as a reference to this file.

.
.
.
<PackageReference Include="Suave" Version="2.1.1" />
.
.
.

Run the following command to restore and build the app:

dotnet restore && dotnet build

Run the following command to run the app:

dotnet run

This should output Hello World from F#!

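Let’s download PdfBox and place the pdfbox-app-2.0.6.jar file in a directory called pdfBox under the project folder. The directory name has to match the pdfBox value used in the code, which matters on case-sensitive file systems.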

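Let’s add a file to the project and name it PdfUtils.fs. Because F# compiles files in the order they are listed, PdfUtils.fs must appear before Program.fs in the .fsproj file. To this file we will add the following lines of code.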

module PdfUtils

open System
open System.IO
open System.Diagnostics

let pathSeparator = Path.DirectorySeparatorChar.ToString()
let appDir = Directory.GetCurrentDirectory() + pathSeparator
let pdfInputDir = appDir + "pdfInput" + pathSeparator
let textOutputDir = appDir + "textOutput" + pathSeparator
let pdfBoxDir = appDir + "pdfBox" + pathSeparator

let createAppDirs() =
  [pdfInputDir;textOutputDir;]
  |> Seq.iter(fun x -> if not (Directory.Exists x) then Directory.CreateDirectory(x) |> ignore)

Here we have some values for folders and a function to create pdfInputDir and textOutputDir.
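As an alternative to joining paths with string concatenation, the folder values could also be built with Path.Combine. The sketch below is only an illustration: it works in a throwaway folder under the temp directory rather than in the project folder, and the names root and sampleFile are made up for the example. It also relies on Directory.CreateDirectory doing nothing when the folder already exists.

```fsharp
open System
open System.IO

// Illustration only: a throwaway root under the temp folder instead of the project folder.
let root = Path.Combine(Path.GetTempPath(), "pdfutils-" + Guid.NewGuid().ToString())
let pdfInputDir = Path.Combine(root, "pdfInput")
let textOutputDir = Path.Combine(root, "textOutput")

// CreateDirectory creates missing parent folders and is a no-op for existing ones,
// so calling it more than once is safe.
let createAppDirs() =
    [ pdfInputDir; textOutputDir ]
    |> List.iter (fun dir -> Directory.CreateDirectory(dir) |> ignore)

createAppDirs()
createAppDirs()

// Path.Combine inserts the separator, so the folder values need no trailing separator.
let sampleFile = Path.Combine(pdfInputDir, "sample.pdf")
printfn "%s" (Path.GetFileName(sampleFile))   // prints sample.pdf

// Clean up the throwaway folders.
Directory.Delete(root, true)
```

With this approach the folder values carry no trailing separator, so any code that appends a file name would use Path.Combine as well.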

Change the Program.fs file as shown below:

open System
open PdfUtils

let initApp() =
  createAppDirs()

[<EntryPoint>]
let main argv =
  initApp()
  0

If we run the app now, we should see two folders, pdfInput and textOutput, created under the project folder.

We will add a function to call PdfBox and extract the text from a Pdf document and save it to a file in textOutputDir. Add the following code to PdfUtils.fs file.

let private quoteStr stringValue =
    let quoteChar = @""""
    quoteChar + stringValue + quoteChar

let private extractPdfText (pdfFile: string) =
    try
        let pdfFileName = Path.GetFileName pdfFile
        let inputFile = pdfInputDir + pdfFileName
        use javaProcess = new Process()
        javaProcess.StartInfo.FileName <- "java"
        // Quote every path so that folders containing spaces are passed as one argument.
        javaProcess.StartInfo.Arguments <-
            "-jar " + quoteStr(pdfBoxDir + "pdfbox-app-2.0.6.jar")
            + " ExtractText "
            + quoteStr(inputFile) + " "
            + quoteStr(textOutputDir + pdfFileName + ".txt")
        javaProcess.Start() |> ignore
        javaProcess.WaitForExit()
        // Delete file after processing.
        File.Delete(inputFile)
    with
        | ex -> printfn "%A" ex

Here we have added two functions. The first one, quoteStr, adds quotes around a given string. It is used for the file paths on the command line, so that paths containing spaces are passed as single arguments. The extractPdfText function starts the PdfBox app with the given parameters to extract the text from a Pdf file and waits for it to finish. Once the process has exited, the input file is deleted. Any exception is printed to the console. More details about PdfBox usage for extracting text can be found in the PdfBox documentation.
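Starting an external process this way does not tell us whether the tool succeeded. As a rough sketch, and not part of the original code, a helper like the hypothetical runTool below could return the exit code and the error output so the caller can decide what to do. The example call assumes the dotnet CLI is on the PATH, which it is if you have followed the steps so far.

```fsharp
open System.Diagnostics

// Hypothetical helper, not part of the article's code: runs a command and returns
// its exit code together with whatever it wrote to standard error.
let runTool (fileName: string) (arguments: string) =
    let startInfo = ProcessStartInfo(fileName, arguments)
    startInfo.UseShellExecute <- false        // required for stream redirection
    startInfo.RedirectStandardError <- true
    use toolProcess = new Process(StartInfo = startInfo)
    toolProcess.Start() |> ignore
    // Read stderr before waiting so that a full pipe cannot block the child process.
    let errors = toolProcess.StandardError.ReadToEnd()
    toolProcess.WaitForExit()
    toolProcess.ExitCode, errors

// Example call with a command that should be available on the machine.
let exitCode, errors = runTool "dotnet" "--version"
if exitCode <> 0 then
    printfn "Tool failed with exit code %d: %s" exitCode errors
```

Inside extractPdfText, a helper along these lines could replace the Start and WaitForExit calls, so that the input file is only deleted when the exit code is 0.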

Let’s add MailboxProcessor code below to PdfUtils.fs

let private extractPdfTextAgent = MailboxProcessor.Start(fun inbox ->
    let rec messageLoop() = async {
        let! msg = inbox.Receive()
        extractPdfText msg
        return! messageLoop()
    }
    messageLoop()
)

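Here we have declared a MailboxProcessor, and all it does is call the extractPdfText function, passing in the pdf filename as a parameter. Because the agent handles one message at a time and extractPdfText waits for the Java process to exit, only one extraction runs at any moment.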
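To see the queueing behaviour of a MailboxProcessor in isolation, here is a small self-contained sketch. The AgentMessage type and the Report message are made up for the illustration and are not part of the app; the Extract case simply records the filename where the real agent would call extractPdfText.

```fsharp
// Hypothetical message type for the illustration: either a file to handle,
// or a request to report which files have been handled so far.
type AgentMessage =
    | Extract of string
    | Report of AsyncReplyChannel<string list>

let agent = MailboxProcessor<AgentMessage>.Start(fun inbox ->
    // The handled files are carried as loop state, newest first.
    let rec messageLoop handled = async {
        let! msg = inbox.Receive()
        match msg with
        | Extract file ->
            // Stand-in for the call to extractPdfText.
            return! messageLoop (file :: handled)
        | Report reply ->
            reply.Reply(List.rev handled)
            return! messageLoop handled
    }
    messageLoop [])

// Post returns immediately; the agent works through its queue one message at a time.
[ "a.pdf"; "b.pdf"; "c.pdf" ] |> List.iter (fun file -> agent.Post(Extract file))

// PostAndReply blocks until the agent reaches this message, after the three Extract messages.
let handledInOrder = agent.PostAndReply(fun reply -> Report reply)
printfn "%A" handledInOrder   // prints ["a.pdf"; "b.pdf"; "c.pdf"]
```

The messages come out in the order they were posted, which is why the web api can post a filename and return straight away while the extraction happens later.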

Before we add the web apis, we will add code to handle files which are present in the pdfInput folder and have not been processed, because the web server was restarted or for some other reason. Let’s add the following code to handle this situation.

let processExistingPdfFiles() =
    Directory.GetFiles pdfInputDir
    |> Seq.iter extractPdfTextAgent.Post

We will call this function from the initApp function in the Program.fs file as shown below.

let initApp() =
    createAppDirs()
    processExistingPdfFiles()

Let’s set up Suave and add two apis to it. One will be for submitting the pdf file from which text will be extracted, and the other will return the text extracted from the file.

Let’s add a function for the api which will accept a pdf file and call the function to extract text from it, as shown below. It goes in PdfUtils.fs, since it posts to the private extractPdfTextAgent.

open Suave

let extractText (ctx: HttpContext) = async {
    // Pair every uploaded file with a new guid.
    let guidAndFiles =
        ctx.request.files
        |> Seq.map(fun x -> (Guid.NewGuid().ToString(), x))
        |> Seq.toArray

    guidAndFiles
    |> Seq.iter(fun (fileGuid, x) ->
        let tempFileName = Path.GetFileName x.tempFilePath
        let fileToProcess = pdfInputDir + fileGuid
        File.Copy(x.tempFilePath, pdfInputDir + tempFileName)
        File.Move(pdfInputDir + tempFileName, fileToProcess)

        extractPdfTextAgent.Post fileToProcess
    )

    // Join the guids into a comma separated string.
    let guidsOnly =
        guidAndFiles
        |> Seq.map(fun (fileGuid, x) -> fileGuid)
        |> Seq.fold(fun acc x -> if (String.IsNullOrEmpty acc) then x else acc + "," + x) ""
    return! Successful.OK (guidsOnly.ToString()) ctx
}

This api handles multiple files. It first pairs each uploaded file (an HttpUpload) with a newly generated guid, giving an array of string and HttpUpload tuples. It then iterates over this array, copying each upload into the pdfInput folder under its guid and posting that filename to extractPdfTextAgent. The function returns the guids as a comma separated string, which the client has to retain in order to call the get api for the extracted text.
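The fold that builds the comma separated string can be tried on its own. The sketch below uses made-up ids in place of real guids and compares the fold with the built-in String.concat, which gives the same result for non-empty ids.

```fsharp
open System

// Made-up sample ids standing in for the generated guid strings.
let ids = [ "id-1"; "id-2"; "id-3" ]

// The same fold as in extractText.
let viaFold =
    ids
    |> Seq.fold (fun acc x -> if String.IsNullOrEmpty acc then x else acc + "," + x) ""

// The built-in equivalent.
let viaConcat = String.concat "," ids

printfn "%s" viaFold                  // prints id-1,id-2,id-3
printfn "%b" (viaFold = viaConcat)    // prints true
```

A client can split the response on commas to get the individual ids back.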

Let’s add a function to return the text that was extracted from a file. If the text was successfully extracted, a file with the .txt extension exists in the textOutput folder. If the file is not present, then either the conversion has not happened yet or it failed, so we return not found in both cases. The code for this function is shown below.

let getFileText fileName ctx = async {
    let outputFile = textOutputDir + fileName + ".txt"
    if File.Exists outputFile then
        let fileText = File.ReadAllText outputFile
        return! Successful.OK fileText ctx
    else
        return! RequestErrors.NOT_FOUND "File not found" ctx
}

We will change Program.fs to expose these 2 functions as web apis as shown below.

// Learn more about F# at http://fsharp.org

open System
open Suave
open Suave.Filters
open Suave.Operators
open Suave.Successful
open PdfUtils

let app =
    choose
        [ GET >=> choose
            [
                path "/" >=> OK "Index"
                pathScan "/api/text/%s" getFileText
            ]
          POST >=> choose
            [
                path "/api/text/" >=> extractText
            ]
        ]

let initApp() =
    createAppDirs()
    processExistingPdfFiles()

[<EntryPoint>]
let main argv =
    initApp()
    let newConfig =
        { defaultConfig with
            bindings = [ HttpBinding.createSimple HTTP "127.0.0.1" 8085 ]
        }
    startWebServer newConfig app
    0

Now if we run the app using the command dotnet run, open a browser and go to the url http://localhost:8085/, we should see the text “Index” in the browser. This means our web server is running. Let’s try and call the apis. You can use any api client you want; I will use the RestLet extension that I have installed in the Chrome browser.

Open the RestLet extension in Chrome, set up a POST request to http://localhost:8085/api/text/ with the pdf attached as a multipart form file, and press send. This will send the file to the web server and return a guid or comma separated guids.

Using the guid returned above, call the get api at http://localhost:8085/api/text/ followed by the guid to see what the extracted text looks like.
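If you prefer code to a browser extension, the same two calls can be made from an F# script. This is a rough sketch that assumes the server from this article is running on port 8085. The function names and the form field name "file" are made up for the example, and nothing here contacts the server until the functions are called.

```fsharp
open System.IO
open System.Net.Http

// Assumed base address, taken from the binding in Program.fs.
let baseUrl = "http://localhost:8085"

// Builds the GET url for one id returned by the POST call.
let textUrl (fileGuid: string) = sprintf "%s/api/text/%s" baseUrl fileGuid

// Uploads one pdf as multipart form data and returns the ids from the response.
let submitPdf (client: HttpClient) (pdfPath: string) =
    use form = new MultipartFormDataContent()
    let filePart = new ByteArrayContent(File.ReadAllBytes(pdfPath))
    form.Add(filePart, "file", Path.GetFileName(pdfPath))
    let response = client.PostAsync(baseUrl + "/api/text/", form).Result
    response.Content.ReadAsStringAsync().Result.Split(',')

// Returns Some text once the text file exists, None while the server answers not found.
let tryGetText (client: HttpClient) (fileGuid: string) =
    let response = client.GetAsync(textUrl fileGuid).Result
    if response.IsSuccessStatusCode then
        Some (response.Content.ReadAsStringAsync().Result)
    else
        None
```

With the server running, submitPdf should return the ids for an uploaded file, and tryGetText can be called with each id until it returns the extracted text.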

The call to extract text using PdfBox can be replaced with a call to the Google Vision api or the Microsoft Computer Vision api.

The source code can be found on GitHub.

Written by Sandeep Chandra