How to Parse Millions of PDF Documents Asynchronously with Apache Tika

Antonia Langfelder
Published in Wellcome Data
Mar 22, 2023


Over the years, the Wellcome Trust has received a huge number of grant applications and funded thousands of research projects. As a result, millions of files have accumulated documenting Wellcome’s communication with its grant applicants and grant holders. These documents range from application forms to award letters and reports, and they come in a variety of file formats. As part of a wider NLP task to build a classifier to tag these documents according to their categories, we were faced with the challenge of having to extract all text contained within these documents. This mainly required dealing with PDFs, which are notorious for their varying levels of machine-readability. Nowadays there are several tools available to data scientists who need to extract text from PDFs, so how did we decide which one to use for our project? We needed a solution that was:

(a) scalable to millions of documents

(b) able to deal with a mixture of different file formats in addition to PDF

(c) able to apply optical character recognition (OCR) to scanned documents

(d) straightforward to set up

We decided to use Apache Tika, which covers most of our requirements… perhaps apart from (d), which is what I attempt to address in this blog post.

Introducing Apache Tika

Apache Tika is an open source Java framework for file type detection and parsing, with an impressive collection of ~75 parsers (see here for more information on the available parsers). Using these parsers, Tika can extract text and metadata from >100 file formats (the full list of supported formats can be found here). In addition, Tesseract is integrated into Tika, making it possible to apply OCR to documents containing scans and other images.

Tika already comes with API bindings in Java, and many developers are actively creating bindings in other languages. In addition, Tika can be accessed as a RESTful API via the tika-server module. Pre-built docker images for tika-server are available on Dockerhub, making it easy to use the standard server endpoints for document parsing.

Introducing tika-pipes

It’s relatively straightforward to set up Tika Server to parse the contents and metadata of individual files (made even easier for data scientists thanks to the tika-python library), but what if you have thousands of files to parse? Sequential parsing can be tedious, especially if you have lots of PDFs containing images and don’t want to wait around for Tesseract to finish processing before moving on to the next file.
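For a single document, sequential parsing with tika-python is only a few lines. Here is a minimal sketch (the file name is a placeholder, and it assumes a tika-server instance is already running on localhost:9998):

# Sequential parsing of one file with tika-python
# (file name and server URL are placeholders for your own setup)
from tika import parser

parsed = parser.from_file(
    "example.pdf",
    serverEndpoint="http://localhost:9998",
)
print(parsed["metadata"])  # detected metadata (content type, author, ...)
print(parsed["content"])   # extracted plain text

This works nicely for a handful of files, but it does not scale to millions of documents.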

This is where tika-pipes comes in. First added in Tika 2.0 (released in 2021), this module offers, among other benefits, the ability to parse files asynchronously. For a more detailed overview of tika-pipes, check out this talk by Tika developer Tim Allison at ApacheCon 2021:

https://www.youtube.com/watch?v=BeYVRpWbCfQ

There is some technical documentation on tika-pipes available here, but it is sparse in places, so it can be tricky to figure out how to set everything up properly.

Setting up tika-pipes

If you look at the official documentation for tika-pipes, the first thing you’ll notice is a security warning about potential access vulnerabilities; the developers therefore strongly suggest running the tika-pipes modules for tika-server only in tightly controlled networks. To use tika-pipes (and access the server’s /async endpoint) with Docker, some tweaks are required, which I will outline below. In short, you’ll need a custom Dockerfile, and you’ll need to run your Docker container with a custom Tika config.

Firstly, in your Dockerfile, you’ll need to fetch the additional artifacts from Maven Central (the repository for Java packages) that tika-pipes requires. In my case, the raw documents were stored on S3, so I added the S3 fetcher and emitter. If you want to parse files in your local filesystem, just swap these extra dependencies for the filesystem emitter (tika-emitter-fs). Note that the filesystem fetcher is already included as standard, so you’ll only need to download the emitter in this case. As suggested in the official documentation, I based my Dockerfile on https://github.com/LogicalSpark/docker-tikaserver/tree/2.0.0 and added my modifications from there. This is my modified Dockerfile:

FROM ubuntu:focal as base
RUN apt-get update

ENV TIKA_VERSION 2.6.0
ENV TIKA_SERVER_JAR tika-server-standard

FROM base as dependencies

RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install gdal-bin tesseract-ocr \
tesseract-ocr-eng curl gnupg

# Set this environment variable if you need to run OCR
ENV OMP_THREAD_LIMIT=1

RUN echo ttf-mscorefonts-installer msttcorefonts/accepted-mscorefonts-eula select true | debconf-set-selections \
&& DEBIAN_FRONTEND=noninteractive apt-get install -y xfonts-utils fonts-freefont-ttf fonts-liberation ttf-mscorefonts-installer wget cabextract

RUN wget -O adoptium-public.key https://packages.adoptium.net/artifactory/api/gpg/key/public && \
apt-key add adoptium-public.key && \
echo "deb https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" > /etc/apt/sources.list.d/adoptium.list && \
apt-get update && apt-get -y install temurin-17-jdk


FROM dependencies as fetch_tika

ENV NEAREST_TIKA_SERVER_URL="https://www.apache.org/dyn/closer.cgi/tika/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar?filename=tika/${TIKA_VERSION}/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar&action=download" \
ARCHIVE_TIKA_SERVER_URL="https://archive.apache.org/dist/tika/${TIKA_VERSION}/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar" \
DEFAULT_TIKA_SERVER_ASC_URL="https://downloads.apache.org/tika/${TIKA_VERSION}/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc" \
ARCHIVE_TIKA_SERVER_ASC_URL="https://archive.apache.org/dist/tika/${TIKA_VERSION}/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc" \
TIKA_VERSION=$TIKA_VERSION

RUN DEBIAN_FRONTEND=noninteractive apt-get -y install gnupg2 \
&& wget -t 10 --max-redirect 1 --retry-connrefused -qO- https://downloads.apache.org/tika/KEYS | gpg --import \
&& wget -t 10 --max-redirect 1 --retry-connrefused $NEAREST_TIKA_SERVER_URL -O /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar || rm /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar \
&& sh -c "[ -f /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar ]" || wget $ARCHIVE_TIKA_SERVER_URL -O /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar || rm /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar \
&& sh -c "[ -f /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar ]" || exit 1 \
&& wget -t 10 --max-redirect 1 --retry-connrefused $DEFAULT_TIKA_SERVER_ASC_URL -O /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc || rm /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc \
&& sh -c "[ -f /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc ]" || wget $ARCHIVE_TIKA_SERVER_ASC_URL -O /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc || rm /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc \
&& sh -c "[ -f /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc ]" || exit 1 \
&& gpg --verify /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar.asc /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar

# This is where we get the extra dependencies
RUN wget -t 10 --max-redirect 1 --retry-connrefused https://repo1.maven.org/maven2/org/apache/tika/tika-fetcher-s3/${TIKA_VERSION}/tika-fetcher-s3-${TIKA_VERSION}.jar -O /tika-fetcher-s3-${TIKA_VERSION}.jar \
&& wget -t 10 --max-redirect 1 --retry-connrefused https://repo1.maven.org/maven2/org/apache/tika/tika-fetcher-s3/${TIKA_VERSION}/tika-fetcher-s3-${TIKA_VERSION}.jar.asc -O /tika-fetcher-s3-${TIKA_VERSION}.jar.asc \
&& gpg --verify /tika-fetcher-s3-${TIKA_VERSION}.jar.asc /tika-fetcher-s3-${TIKA_VERSION}.jar \
&& wget -t 10 --max-redirect 1 --retry-connrefused https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-s3/${TIKA_VERSION}/tika-emitter-s3-${TIKA_VERSION}.jar -O /tika-emitter-s3-${TIKA_VERSION}.jar \
&& wget -t 10 --max-redirect 1 --retry-connrefused https://repo1.maven.org/maven2/org/apache/tika/tika-emitter-s3/${TIKA_VERSION}/tika-emitter-s3-${TIKA_VERSION}.jar.asc -O /tika-emitter-s3-${TIKA_VERSION}.jar.asc \
&& gpg --verify /tika-emitter-s3-${TIKA_VERSION}.jar.asc /tika-emitter-s3-${TIKA_VERSION}.jar

FROM dependencies as runtime
RUN apt-get clean -y && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
ENV TIKA_VERSION=$TIKA_VERSION
RUN mkdir /tika-bin
COPY --from=fetch_tika /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar /tika-bin/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar
# The extra dependencies need to be added into tika-bin together with the tika-server jar
COPY --from=fetch_tika /tika-fetcher-s3-${TIKA_VERSION}.jar /tika-bin/tika-fetcher-s3-${TIKA_VERSION}.jar
COPY --from=fetch_tika /tika-emitter-s3-${TIKA_VERSION}.jar /tika-bin/tika-emitter-s3-${TIKA_VERSION}.jar

EXPOSE 9998
ENTRYPOINT [ "/bin/sh", "-c", "exec java -cp \"/tika-bin/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $0 $@"]

You’ll also need to create a custom tika-config.xml file to set up your fetcher and emitter. Each fetcher and emitter needs to include a name parameter. This can be anything you like, as long as it is unique for each fetcher and emitter. In addition, the S3 fetcher and emitter parameters need to include the S3 bucket name and region, as well as an AWS credentialsProvider. If the environment you’re running the Docker container in (in my case, an EC2 instance) has an IAM role with access to the S3 bucket(s) you are fetching from/emitting to, you can set the credentials provider to ‘instance’ and everything should work from there. If you instead need to supply your own AWS credentials to access the bucket, you can set the credentials provider to ‘profile’ and provide the profile name in the profile parameter (more on this later).

Note that you also need to set the server parameter enableUnsecureFeatures to true in order for tika-pipes endpoints to work.

This covers all the minimum requirements for tika-pipes to work, but it is worth keeping in mind that there are lots of ways in which Tika can be further customised via the config. For example, I’ve included a timeout parameter in my config, but check out the official documentation on configuring Tika or the server config overview for more options. For anything more specific you may also need to consult the API documentation.

Here is a sample config file (just delete the AWS credentials provider you aren’t using):

<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
  </parsers>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
      <params>
        <name>s3f</name> <!-- this can be any name you like -->
        <region>eu-west-1</region>
        <bucket>my-bucket</bucket> <!-- bucket containing raw data -->
        <credentialsProvider>instance</credentialsProvider>
        <!-- or use a profile -->
        <credentialsProvider>profile</credentialsProvider>
        <profile>default</profile>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
      <params>
        <name>s3e</name> <!-- this can be any name you like -->
        <region>eu-west-1</region>
        <bucket>my-bucket</bucket> <!-- destination bucket -->
        <credentialsProvider>instance</credentialsProvider>
        <!-- or use a profile -->
        <credentialsProvider>profile</credentialsProvider>
        <profile>default</profile>
      </params>
    </emitter>
  </emitters>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
  <pipes>
    <params>
      <tikaConfig>./config/tika-config.xml</tikaConfig>
    </params>
  </pipes>
  <async>
    <params>
      <timeoutMillis>1000000</timeoutMillis>
    </params>
  </async>
</properties>

Now I’ll explain how to try this out locally. Let’s say you have both these files in a directory tika_dir. First you need to build your Docker image:

docker build -t tikapipes tika_dir

You can then start the server in a Docker container with your custom config file. If you’ve set the credentials provider for your S3 fetcher/emitter to ‘profile’, you’ll also need to make your AWS credentials available inside the container; I did this by attaching them as a read-only volume (drop that line if you’re using the ‘instance’ provider):

docker run -d \
  --name tika_container \
  -v ~/.aws/:/root/.aws:ro \
  -v "$(pwd)/tika_dir":/config \
  -p 9998:9998 tikapipes:latest \
  -c ./config/tika-config.xml

You can now parse your files by submitting a request containing a list of FetchEmitTuples to the server’s /async endpoint (see here for the official documentation on FetchEmitTuples). In essence, FetchEmitTuples look like dictionaries which need to include: (1) the name of a fetcher, in my case the S3 fetcher as specified in my custom Tika config (s3f), (2) the path to the file you want to parse (fetchKey), (3) the name of the emitter, and (4) the path, including the file name, that you want the emitted JSON file to have (emitKey). Note that this works in the same way if you’re using a filesystem fetcher to parse files in your local filesystem. Here is an example, where the source documents my_input_file1.pdf and my_input_file2.pdf are located in my-bucket/path1, and the output files my_output_file1.json and my_output_file2.json will be written to my-bucket/path2:

curl -X POST -H "Content-Type: application/json" \
-d '[{"fetcher":"s3f","fetchKey":"path1/my_input_file1.pdf","emitter":"s3e","emitKey":"path2/my_output_file1"},
{"fetcher":"s3f","fetchKey":"path1/my_input_file2.pdf","emitter":"s3e","emitKey":"path2/my_output_file2"}]' \
http://localhost:9998/async

If all is well, you should end up with JSON files in your output path containing the parsed contents of the input PDFs.
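To sanity-check one of these output files, you can pull it back down and look at the extracted text. Here is a minimal sketch with boto3 (the bucket and key match the example above; I’m assuming the emitted JSON follows the same layout as Tika’s /rmeta output, i.e. a list of metadata objects with the text under X-TIKA:content):

# Inspect one emitted file (assumes an /rmeta-style JSON layout)
import json
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="path2/my_output_file1.json")
records = json.loads(obj["Body"].read())

# The first element describes the container document;
# embedded attachments, if any, follow as further elements.
print(records[0].get("X-TIKA:content", "")[:500])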

So far, so good.

If you’ve looked at the Dockerfile in more detail, you may have noticed a curious environment variable which warrants a brief explanation: OMP_THREAD_LIMIT=1. If you are new to Tesseract, you may not be aware that, by default, a single Tesseract process will start multiple threads. This means that if you apply OCR to several documents using tika-pipes in async mode, you end up with multiple Tesseract instances running in parallel, each of which in turn starts multiple threads. As a result, parse times become significantly longer (see here for a more detailed discussion of this issue). In my case, it slowed down processing to the extent that async was much slower than parsing each document sequentially! Simply disabling Tesseract multithreading in your environment (as above) will do the trick.
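If you’d rather not bake this variable into the image, it can also be passed when starting the container, for example:

# Disable Tesseract multithreading at runtime instead of in the Dockerfile
docker run -d -e OMP_THREAD_LIMIT=1 \
  --name tika_container \
  -v "$(pwd)/tika_dir":/config \
  -p 9998:9998 tikapipes:latest \
  -c ./config/tika-config.xml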

How fast is it?

To get a better estimate of expected parse times, I ran some experiments comparing async with sequential parsing. I used a sample of PDF files where each document was about 10–20 pages long and ~20% of documents required OCR. I wrote my parsers in Python using the requests module and calculated parse times by reading S3 object metadata with boto3.

For reference, here are some code snippets from my async parser:

As described earlier, the filenames need to be passed as a JSON array of FetchEmitTuples (formatted as a string), which can be created like this:

from pathlib import Path

def format_data(docs, output_dir):
    # Build the request body: a JSON array of FetchEmitTuples, one per document,
    # using the fetcher/emitter names defined in tika-config.xml
    data = []
    for doc in docs:
        fetchkey = str(doc)
        emitkey = str(Path(output_dir) / Path(doc).stem)
        data.append(f'{{"fetcher":"s3f","fetchKey":"{fetchkey}","emitter":"s3e","emitKey":"{emitkey}"}}')
    return f'[{",".join(data)}]'

And here is a function to parse several files with the /async endpoint:

import requests

def parse(docs, output_dir):
    # Submit all documents to the /async endpoint in a single request
    res = requests.post(
        url='http://localhost:9998/async',
        data=format_data(docs, output_dir),
        headers={'Accept': 'application/json'}
    )
    if res.status_code == 200:
        print('status: ok')
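For completeness, here is a rough sketch of how the parse times themselves can be derived from S3 object metadata with boto3 (bucket and prefix names are placeholders; it simply compares the LastModified timestamp of the last emitted object with the time the request was submitted):

import time
import boto3

s3 = boto3.client("s3")

def elapsed_seconds(bucket, prefix, start_time):
    # Time from submitting the request to the last emitted object being written
    # (list_objects_v2 returns up to 1,000 keys, plenty for these experiments)
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    last_written = max(obj["LastModified"].timestamp() for obj in response["Contents"])
    return last_written - start_time

docs = ["path1/my_input_file1.pdf", "path1/my_input_file2.pdf"]  # placeholder keys
start = time.time()
parse(docs, "path2")  # submit the /async request as defined above
# ...once all emitted files have appeared:
print(elapsed_seconds("my-bucket", "path2/", start))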

Finally, these are the results of my parse time experiments:

Actual parse times will of course depend on the properties of your own mix of raw documents, but my results should give you an idea of what to expect in relative terms. If you only need to parse a few documents at a time, you won’t gain much from using async tika-pipes and might as well process the documents sequentially using one of the convenience Docker images published by the Tika Dev team. However, async mode will pay off as you get towards large numbers of documents. In addition, when it’s necessary to apply OCR and run Tesseract, choosing an instance with more compute is likely to speed things up even more (just make sure to switch off Tesseract’s multithreading!).
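For the sequential route, running one of those convenience images and parsing a file through the standard /tika endpoint can look like this (the image tag is an assumption; check Docker Hub for current versions):

# Run the pre-built Tika server image (tag is an assumption)
docker run -d -p 9998:9998 apache/tika:latest-full

# Parse a single file against the standard /tika endpoint
curl -T my_input_file1.pdf http://localhost:9998/tika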
