If you conduct academic research, you may have run into trouble indexing and storing documents in a citation manager when the original content is not available as a PDF. For example, if you want to read the latest OECD Economic Outlook on a tablet, or annotate it in research management software, the method below lets you work around the restrictions imposed by the content delivery networks (CDNs) that serve such documents, and generate a PDF from the content served to your browser.
These CDNs operate in a remarkably simple way. As you scroll, a jQuery event handler fires and your browser requests the image of each page from the CDN as it comes into view. By opening the Chrome Developer Tools (or your browser’s equivalent), you can see those image files being loaded under the “Network” tab. Conveniently, these images are numbered sequentially; fetching them in bulk will be a snap.
On a Linux-based machine, you can use curl to download many files at once from the remote server. In this instance, assume the remote folder holds over 250 sequentially numbered JPEG images, at the URL used in the curl command below.
Create a folder for your downloaded files (mkdir ~/Downloads/Document && cd ~/Downloads/Document). You can then use the following command to download files 1.jpg through 250.jpg. Note the use of the brackets [1-250] to signal that we want to make multiple requests, and of #1 in the output name, which curl replaces with the current number from the range.
curl "http://www.fictionalcdn.com/assetnumber/somerandomstring/somefolder/large/[1-250].jpg" -o "#1.jpg"
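The same download can also be sketched as a loop that stops at the first missing page and saves zero-padded names (001.jpg, 002.jpg, ...), which keep their page order in a plain directory listing. This is only a sketch under assumptions: it reuses the fictional URL from the command above, the helper names padded_name and fetch_pages are made up for illustration, and the three-digit width assumes fewer than 1000 pages.

```shell
#!/bin/sh
# Fictional base URL from the example above; replace it with the real folder.
BASE_URL="http://www.fictionalcdn.com/assetnumber/somerandomstring/somefolder/large"

# padded_name N: print a three-digit, zero-padded file name (hypothetical helper).
padded_name() {
    printf '%03d.jpg' "$1"
}

# fetch_pages FIRST LAST: download FIRST.jpg through LAST.jpg (hypothetical helper).
# Returns non-zero as soon as one request fails.
fetch_pages() {
    i=$1
    while [ "$i" -le "$2" ]; do
        # --fail makes curl exit non-zero on HTTP errors such as 404
        curl --fail --silent --show-error "$BASE_URL/$i.jpg" \
            -o "$(padded_name "$i")" || return 1
        i=$((i + 1))
    done
}

# Usage (not executed here): fetch_pages 1 250
```

Stopping at the first failure is a design choice: it avoids leaving HTML error pages saved under .jpg names when the document has fewer pages than expected.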
If you want a PDF, you can use ImageMagick to convert the image files.
Before you can do that, you have to make sure the image files are passed in page order. Your number format will be determined by the source, but in this case even a sequence 1, 2, 3...151 is not acceptable: the shell expands *.jpg in lexical order, so 10.jpg comes before 2.jpg unless every number has the same count of digits. You can verify this is an issue by listing the directory contents.
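To see the ordering problem without downloading anything, here is a small sketch that sorts three sample file names both ways. The names and the helper functions are illustrative only; LC_ALL=C is set so the lexical result does not depend on your locale.

```shell
#!/bin/sh
# Three sample names standing in for downloaded pages (illustrative only).
sample_names() {
    printf '%s\n' 2.jpg 10.jpg 1.jpg
}

# Byte-wise (lexical) order, as a glob such as *.jpg yields in the C locale.
lexical_order() {
    sample_names | LC_ALL=C sort
}

# Numeric order: sort -n compares the leading number of each name.
numeric_order() {
    sample_names | sort -n
}

lexical_order | paste -sd' ' -   # prints: 1.jpg 10.jpg 2.jpg
numeric_order | paste -sd' ' -   # prints: 1.jpg 2.jpg 10.jpg
```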
You could rename the files, or you can bypass the problem by using sort: ls -1 *.jpg | sort -n lists the filenames in numeric order. You can pass the sorted output directly to ImageMagick.
convert `ls -1 *.jpg | sort -n` +compress OECD2017Preliminary.pdf
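If you would rather take the renaming route mentioned above, it can be sketched as follows. The function name pad_jpgs is hypothetical, and the three-digit width assumes fewer than 1000 pages; with padded names, a plain *.jpg glob already expands in page order.

```shell
#!/bin/sh
# pad_jpgs: rename purely numeric N.jpg files in the current directory to
# three-digit, zero-padded names (hypothetical helper).
pad_jpgs() {
    for f in [0-9]*.jpg; do
        [ -e "$f" ] || continue                 # the glob matched nothing
        n=${f%.jpg}
        case $n in *[!0-9]*) continue ;; esac   # skip names like 12a.jpg
        # Strip leading zeros so printf does not read the number as octal.
        while [ "${n#0}" != "$n" ] && [ "${#n}" -gt 1 ]; do
            n=${n#0}
        done
        new=$(printf '%03d.jpg' "$n")
        # Rename only when the name changes and the target is not taken.
        if [ "$f" != "$new" ] && [ ! -e "$new" ]; then
            mv -- "$f" "$new"
        fi
    done
}

# Usage (not executed here): cd ~/Downloads/Document && pad_jpgs
```

Because leading zeros are stripped before padding, running the function a second time leaves the already padded names unchanged.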
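In this case, because the source images were of a high resolution, my document turned out to be ~250 MB. Note that the ImageMagick documentation describes +compress as storing image data uncompressed, so that option does not shrink the output; a setting such as -compress jpeg may give a smaller file. I then used Adobe Acrobat to perform OCR and reduce the file size as needed.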