Extract PDF text using MailboxProcessor in a Suave web app

In this article, we will look at using MailboxProcessor to extract text from a PDF file in a Suave web app. We will use PdfBox to extract the text.

We begin by creating a new console app using the following command:

dotnet new console -lang F#

This will create a new console app. Next, we will add Suave to this project using the following command:

dotnet add package Suave

I am using VSCode with the Ionide plugin to write code, but you can use any IDE of your choice. Let’s open this project in VSCode. On the command line, enter the following command:

code .

This should open the project in VSCode.

If you open and look at the .fsproj file you will see that Suave has been added as a reference to this file.

.
.
.
<PackageReference Include="Suave" Version="2.1.1" />
.
.
.

Run the following command to restore and build the app.

dotnet restore && dotnet build

Run the following command to run the app.

dotnet run

This should output Hello World from F#.

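Let’s download PdfBox and place the pdfbox-app-2.0.6.jar file in a directory called pdfBox under the project folder. The code below refers to the directory by this name, which matters on case-sensitive file systems.

Let’s add a file to the project and name it PdfUtils.fs. It must be listed before Program.fs in the .fsproj file, because F# compiles files in the order they are listed. To this file we will add the following lines of code.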

module PdfUtils

open System
open System.Diagnostics
open System.IO
open Suave // needed by the web API functions added later in this file

let pathSeparator = Path.DirectorySeparatorChar.ToString()
let appDir = Directory.GetCurrentDirectory() + pathSeparator
let pdfInputDir = appDir + "pdfInput" + pathSeparator
let textOutputDir = appDir + "textOutput" + pathSeparator
let pdfBoxDir = appDir + "pdfBox" + pathSeparator

// Create the input and output folders if they do not exist yet.
let createAppDirs() =
    [pdfInputDir; textOutputDir]
    |> Seq.iter (fun x -> if not (Directory.Exists x) then Directory.CreateDirectory x |> ignore)

Here we have some values for folders and a function to create pdfInputDir and textOutputDir.
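As a side note, the folder paths could also be built with Path.Combine instead of manual separator handling. The following is only a sketch of that idea: ensureDirs and the temp-folder location are made up for illustration and are not part of the article's code.

```fsharp
open System.IO

// Hypothetical alternative to the string concatenation above.
// `ensureDirs` and the temp-folder root are made up for this sketch.
let ensureDirs (root: string) (names: string list) =
    names
    |> List.map (fun name ->
        // CreateDirectory does nothing if the directory already exists.
        Directory.CreateDirectory(Path.Combine(root, name)).FullName)

let demoRoot = Path.Combine(Path.GetTempPath(), "pdfTextDemo")
let created = ensureDirs demoRoot [ "pdfInput"; "textOutput" ]
created |> List.iter (printfn "%s")
```

Because Directory.CreateDirectory is safe to call on an existing folder, the sketch can be run repeatedly without an explicit Directory.Exists check.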

Change Program.fs as shown below:

open System
open PdfUtils

let initApp() =
    createAppDirs()

[<EntryPoint>]
let main argv =
    initApp()
    0

If we run the app now, we should see two folders, pdfInput and textOutput, created under the project folder.

We will add a function to call PdfBox, extract the text from a PDF document and save it to a file in textOutputDir. Add the following code to the PdfUtils.fs file.

// Wrap a string in double quotes so paths with spaces survive the command line.
let private quoteStr stringValue =
    let quoteChar = @""""
    quoteChar + stringValue + quoteChar

// Run PdfBox's ExtractText command on a file in pdfInputDir and
// write the result to textOutputDir as <file name>.txt.
let private extractPdfText (pdfFile: string) =
    try
        let pdfFileName = Path.GetFileName pdfFile
        let inputFile = pdfInputDir + pdfFileName
        use javaProcess = new Process()
        javaProcess.StartInfo.FileName <- "java"
        javaProcess.StartInfo.Arguments <-
            String.concat " "
                [ "-jar"
                  quoteStr (pdfBoxDir + "pdfbox-app-2.0.6.jar")
                  "ExtractText"
                  quoteStr inputFile
                  quoteStr (textOutputDir + pdfFileName + ".txt") ]
        javaProcess.Start() |> ignore
        javaProcess.WaitForExit()
        // Delete file after processing.
        File.Delete inputFile
    with
    | ex -> printfn "%A" ex

Here we have added two functions. The first one, quoteStr, adds quotes around a given string; it is used for the file names passed on the command line. The extractPdfText function starts the PdfBox app (which requires java to be available on the PATH) with the parameters needed to extract the text from a PDF file. After extracting the text from the file, we delete it. More details about using PdfBox to extract text can be found in the PdfBox documentation.

Let’s add the MailboxProcessor code below to PdfUtils.fs.
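To see what command line PdfBox receives, the argument string can be built by a small pure function and printed. This is only a sketch: quote and buildExtractArgs are hypothetical helpers that are not part of PdfUtils.fs, the paths in the example are made up, and the sketch quotes all three paths.

```fsharp
// Hypothetical helpers, not part of PdfUtils.fs: they build the argument
// string passed to `java` so it can be inspected without starting a process.
let quote (s: string) = "\"" + s + "\""

let buildExtractArgs (jarPath: string) (inputFile: string) (outputFile: string) =
    String.concat " "
        [ "-jar"; quote jarPath; "ExtractText"; quote inputFile; quote outputFile ]

// Example with made-up relative paths.
let args =
    buildExtractArgs "pdfBox/pdfbox-app-2.0.6.jar" "pdfInput/abc" "textOutput/abc.txt"

printfn "%s" args
```

Keeping the string construction separate from Process.Start makes it easy to check the quoting when a path contains spaces.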

let private extractPdfTextAgent = MailboxProcessor.Start(fun inbox ->
    let rec messageLoop() = async {
        let! msg = inbox.Receive()
        extractPdfText msg
        return! messageLoop()
    }
    messageLoop()
)

Here we have declared a MailboxProcessor. All it does is call the extractPdfText function, passing in the PDF file name it receives as a message. Because the agent handles one message at a time and extractPdfText waits for the Java process to exit, only one PdfBox process runs at any moment.

Before we add the web APIs, we will add code to handle files that are still in the pdfInput folder and have not been processed, for example because the web server was restarted. Let’s add the following code to handle this situation.
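The queueing behaviour of a MailboxProcessor can be tried out in isolation. The sketch below is a stand-alone illustration, not the article's agent: the Message type, the processed list and the Flush message are assumptions added so that the handled messages can be read back.

```fsharp
// Stand-alone sketch of the agent pattern. The message type and the
// `processed` list are illustrative and not part of the article's code.
type Message =
    | Work of string
    | Flush of AsyncReplyChannel<string list>

let agent = MailboxProcessor.Start(fun inbox ->
    let rec loop processed = async {
        let! msg = inbox.Receive()
        match msg with
        | Work name ->
            // Stand-in for extractPdfText: record the file name instead.
            return! loop (name :: processed)
        | Flush reply ->
            reply.Reply(List.rev processed)
            return! loop processed
    }
    loop [])

[ "a.pdf"; "b.pdf"; "c.pdf" ] |> List.iter (fun f -> agent.Post(Work f))

// The Flush message is queued behind the three Work messages, so by the
// time it is answered all of them have been handled, in posting order.
let handled = agent.PostAndReply(fun channel -> Flush channel)
printfn "%A" handled
```

Post returns immediately, which is why the web request in this article does not have to wait for PdfBox; PostAndReply is used here only to observe the result.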

let processExistingPdfFiles() =
    Directory.GetFiles pdfInputDir
    |> Seq.iter extractPdfTextAgent.Post

We will call this function from the initApp function in the Program.fs file as shown below.

let initApp() =
    createAppDirs()
    processExistingPdfFiles()

Let’s set up Suave and add two APIs to it. One will be for submitting the PDF file from which text will be extracted, and the other will return the text extracted from the file.

Let’s add a function for the API that accepts a PDF file and calls the function to extract text from it, as shown below.

// Requires `open System`, `open System.IO` and `open Suave` at the top of PdfUtils.fs.
let extractText (ctx: HttpContext) = async {
    // Pair every uploaded file with a fresh GUID.
    let guidAndFiles =
        ctx.request.files
        |> Seq.map (fun x -> (Guid.NewGuid().ToString(), x))
        |> Seq.toArray

    guidAndFiles
    |> Seq.iter (fun (fileGuid, x) ->
        let tempFileName = Path.GetFileName x.tempFilePath
        let fileToProcess = pdfInputDir + fileGuid
        // Copy the upload into pdfInput, then rename it to its GUID.
        File.Copy(x.tempFilePath, pdfInputDir + tempFileName)
        File.Move(pdfInputDir + tempFileName, fileToProcess)
        extractPdfTextAgent.Post fileToProcess)

    // Return the GUIDs as a comma-separated string.
    let guidsOnly =
        guidAndFiles
        |> Seq.map fst
        |> String.concat ","
    return! Successful.OK guidsOnly ctx
}

This API handles multiple files. It pairs each uploaded file (an HttpUpload) with a new GUID, producing an array of (string, HttpUpload) tuples. It then iterates over these tuples, copying each file into pdfInput under its GUID and posting that file name to extractPdfTextAgent. The function returns a comma-separated list of GUIDs, which the client has to retain in order to call the GET API and fetch the extracted text.

Let’s add a function to return the text that was extracted from a file. If the text was extracted successfully, a file with a .txt extension exists in the textOutput folder. If that file is not present, either the conversion has not happened yet or it failed, so we return Not Found in both cases. The code for this function is shown below.

// Return the extracted text for a GUID, or 404 if no output file exists (yet).
let getFileText fileName ctx = async {
    let outputFile = textOutputDir + fileName + ".txt"
    if File.Exists outputFile then
        let fileText = File.ReadAllText outputFile
        return! Successful.OK fileText ctx
    else
        return! RequestErrors.NOT_FOUND "File not found" ctx
}

We will change Program.fs to expose these 2 functions as web apis as shown below.

// Learn more about F# at http://fsharp.org
open System
open Suave
open Suave.Filters
open Suave.Operators
open Suave.Successful
open PdfUtils

let app =
    choose
        [ GET >=> choose
                    [ path "/" >=> OK "Index"
                      pathScan "/api/text/%s" getFileText ]
          POST >=> choose
                    [ path "/api/text/" >=> extractText ] ]

let initApp() =
    createAppDirs()
    processExistingPdfFiles()

[<EntryPoint>]
let main argv =
    initApp()
    let newConfig =
        { defaultConfig with
            bindings = [ HttpBinding.createSimple HTTP "127.0.0.1" 8085 ] }
    startWebServer newConfig app
    0

Now if we run the app using dotnet run and open http://localhost:8085/ in a browser, we should see the text “Index”. This means our web server is running. Let’s try calling the APIs. You can use any API client you want; I will use the RestLet extension that I have installed in Chrome.

Open the RestLet extension in Chrome, set up a POST request to http://localhost:8085/api/text/ with the PDF file attached as a multipart form upload, and press send. This sends the file to the web server, which returns a GUID or a comma-separated list of GUIDs.

Using the GUID returned above, call the GET API at http://localhost:8085/api/text/<guid> to see what the extracted text looks like.

The call to extract text using PdfBox can be replaced with a call to the Google Vision API or the Microsoft Computer Vision API.

The source code can be found on GitHub.