Building poppler-utils for CentOS 6.5 (really)
For a number of reasons, one of our production machines is firmly fixed at CentOS 6.5.
This same server also handles big scanned PDFs for us, processing them to individual files and working some magic to log data from each page. It’s a fun setup that I won’t go into here.
Recently, the package we were using to convert PDFs to JPEGs (sejda-console, which aside from this is a really nice tool) started producing some strange results: all images were a bizarre white-on-black edge detection version (kinda like this). This had something to do with the original PDF, but we couldn’t figure out a common thread.
Rather than re-process these big PDFs to maybe prevent this, we switched to another package: poppler. Specifically, the poppler-utils command pdftoppm.
The switch to poppler was great…until we tried to deploy to production.
Time to Troubleshoot
When something works on one machine and not another, the first thing to compare is the installed version of the package.
As expected, we had a problem:
Version 0.12.4 is from 2010, and is the latest version served from yum. That’s no good. And, as far as we could figure out, there’s not a more recent version built for CentOS 6.5.
Looking at the poppler changelog, we don’t really need to have version 0.41.0 in production, just the one that does the job for us: pdftoppm with JPEG output support. That was added in 0.13.0, so we picked version 0.13.4 as our build target.
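If you want a script to make the same "is this version new enough?" call, here is a minimal sketch. The helper name version_ge is made up for this example, and it assumes GNU sort with the -V (version sort) option is available.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is at least version B.
# Hypothetical helper; relies on GNU sort's -V (version sort) option.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# First release with JPEG output in pdftoppm, per the changelog.
required="0.13.0"

# Compare the yum version and our build target against the requirement.
for candidate in 0.12.4 0.13.4; do
    if version_ge "$candidate" "$required"; then
        echo "$candidate: new enough"
    else
        echo "$candidate: too old"
    fi
done
```

A plain string comparison would get this wrong for versions like 0.9 and 0.13, which is why the sketch leans on version sort instead.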
Building from source
If you’re trying to build poppler-utils for CentOS, or something similar, these are the steps we took to get it working. (I’m writing this as our production server churns through 39 backlogged PDFs and the few thousand JPEG files created by this build, so it’s definitely working.)
Step 1: Get the source
From the command line, let’s start by retrieving the source files. You can put them anywhere; they’re only needed during the build, and can be safely deleted when we’re done.
$ cd ~
$ wget https://poppler.freedesktop.org/poppler-0.13.4.tar.gz --no-check-certificate
Note: the --no-check-certificate flag is yucky, but wget won’t work without it as of this writing. If you’d like, you can use another method to get the tarball to your machine, such as curl.
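For the curl route, something like the sketch below might work. The function name fetch_tarball is hypothetical, and we haven’t checked whether curl on an old CentOS 6.5 box runs into the same certificate complaint as wget; if it does, curl’s -k flag skips verification the same way, with the same yuckiness.

```shell
# fetch_tarball URL: download URL into the current directory.
# Hypothetical helper; tries curl first, then falls back to wget.
fetch_tarball() {
    url=$1
    [ -n "$url" ] || { echo "usage: fetch_tarball URL" >&2; return 2; }
    if command -v curl >/dev/null 2>&1; then
        # -f: fail on HTTP errors, -L: follow redirects, -O: keep the remote file name
        curl -fL -O "$url"
    elif command -v wget >/dev/null 2>&1; then
        wget "$url"
    else
        echo "neither curl nor wget found" >&2
        return 1
    fi
}
```

You would call it as fetch_tarball https://poppler.freedesktop.org/poppler-0.13.4.tar.gz from whichever directory you want the tarball in.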
Step 2: Unpack the source
$ tar xf poppler-0.13.4.tar.gz
$ cd poppler-0.13.4
Step 3: Prepare & configure with libjpeg
As I said earlier, we were most interested in the pdftoppm command creating a bunch of JPEG files for us. This requires us to include a few extra steps when installing.
You’ll need to install libjpeg and libjpeg-devel from yum:
$ sudo yum install -y libjpeg libjpeg-devel
Once that’s done, we can configure the installation of our package:
$ ./configure --enable-libjpeg
Flags of the form --enable-foo tell the configure script which optional features to include (--disable-foo leaves a feature out), and there are many other options. Read through the INSTALL file to learn more, or dive right into the configure file with vim configure to see the source.
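A quicker way to see what’s on offer is the script’s own help output, which autoconf-generated configure scripts provide via --help. The sketch below wraps that in a small filter; the function name show_configure_options is made up for this example, and it has to be run from inside the unpacked source directory.

```shell
# show_configure_options PATTERN: list configure options matching PATTERN.
# Hypothetical helper; must be run from the unpacked source directory.
show_configure_options() {
    # --help prints every supported option; grep -i narrows the list,
    # and -- stops grep from treating a pattern like "--enable" as a flag.
    ./configure --help | grep -i -- "$1"
}
```

For example, show_configure_options jpeg should list whatever JPEG-related switches this version of poppler understands.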
Step 4: Make and install
Once configured, there are two more commands to run:
$ make
...
$ make install
If you get an error installing, try again using sudo make install instead.
Step 5: Did it work?
Before we’re done, check to make sure everything worked as expected.
The commands should be installed to /usr/local/bin/, and we can check them by running:
$ /usr/local/bin/pdftoppm -v
pdftoppm version 0.13.4
Copyright 2005-2010 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
Step 6: Use it!
In the time it took to write this up, our server has processed half of our PDF backlog. This tool will now handle around 200,000 PDF-to-JPEG conversions per year, all thanks to building from source.
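For anyone wondering what "using it" looks like, here is a rough sketch of a wrapper around the new binary. It is not our production script: the function name pdf_to_jpegs, the PDFTOPPM variable, and the 150 DPI resolution are all placeholders, and it assumes the JPEG switch is spelled -jpeg as in current releases (pdftoppm -h will tell you for sure).

```shell
# Path to the freshly built binary; override PDFTOPPM to point elsewhere.
PDFTOPPM=${PDFTOPPM:-/usr/local/bin/pdftoppm}

# pdf_to_jpegs PDF OUTDIR: write one JPEG per page into OUTDIR.
# Hypothetical wrapper; the 150 DPI resolution is an arbitrary example value.
pdf_to_jpegs() {
    pdf=$1
    outdir=$2
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 1; }
    mkdir -p "$outdir" || return 1
    # -jpeg selects JPEG output and -r sets the resolution in DPI;
    # the last argument is the prefix for the generated file names.
    "$PDFTOPPM" -jpeg -r 150 "$pdf" "$outdir/page"
}
```

Calling pdf_to_jpegs scan.pdf ./pages (with scan.pdf standing in for one of your own files) should leave one image per page in ./pages, each with a name starting with page.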
If you run across any issues with the steps above, or have suggestions on how we might have done this a little easier (besides upgrading from CentOS 6.5), you can find me on Twitter @jakebathman.