Saving web pages as PDFs in 2019, a real challenge
--
Today’s web technology doesn’t make it easy to export HTML content as PDF. This is a basic piece of functionality that one would expect to work seamlessly, but unfortunately it does not. This article was written in December 2019.
I regularly save web pages and articles that I find interesting to my own machine, for archiving purposes. I want any content that I find valuable to be available offline, so I can still read it later should it disappear from the web or be moved behind a paywall.
Saving pages as HTML is not ideal because a) you get an HTML file plus a folder, which is not very practical if you want to retrieve them later, and b) you never know how that page is going to render in future versions of your browser. So the easiest, pain-free option would be to export the HTML content as a PDF file. I recently came to realise that on a Mac, Firefox cannot always print pages properly, and the result is often a poorly formatted document; the generated PDF might, for example, contain the title and a few images, followed by a blank page. Chrome is somewhat better because at least it shows a preview, but it can also fail to print an entire article properly. After some very limited testing, Safari seems to generate the best PDF export, though not in every case.
On top of that, overlays often contribute to poor results. Take the obnoxious cookie banner pop-ups that are now proliferating and infesting web pages worldwide, making them unusable (*). If left open, the overlay appears on top of the exported content, hiding a good 2–3 lines of text.
I am not sure whether it’s web developers, QAs, browser vendors or the W3C who are to blame for that, though I am inclined to say the last: creating custom CSS for print used to be a relatively simple task, and making a page printable was once regarded as a best practice. If this simple rule is now disregarded even by major news websites, I suspect there must be technical challenges in implementing it correctly. Whatever the reason, we can clearly observe that after nearly 30 years of the World Wide Web, one of the most basic pieces of functionality in the most commonly used browsers is broken. Of course there are add-ons and hacks, but it’s tiresome and a waste of time to have to resort to third-party tools just to get something done that should be a no-brainer.
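For what it’s worth, a print rule that hides an overlay can be very short. The sketch below is only an illustration in Python: it adds a print-only style rule to the HTML of a saved page, so the banner stays visible on screen but is left out when printing. The selectors and function names are hypothetical, since every site names its banner differently, and a page that builds its overlay with scripts may need more than this.

```python
import re

# Hypothetical selectors, for illustration only: real sites use their own names.
OVERLAY_SELECTORS = ["#cookie-banner", ".cookie-consent", ".modal-overlay"]


def build_print_css(selectors):
    """Return a <style> element that hides the given selectors when printing."""
    rule = ", ".join(selectors)
    return (
        "<style>\n"
        "@media print {\n"
        f"  {rule} {{ display: none !important; }}\n"
        "}\n"
        "</style>\n"
    )


def inject_print_css(html, selectors=OVERLAY_SELECTORS):
    """Insert the print rule just before </head>, or prepend it if there is no head."""
    style = build_print_css(selectors)
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        return style + html
    return html[:match.start()] + style + html[match.start():]


if __name__ == "__main__":
    page = "<html><head><title>Demo</title></head><body><div id='cookie-banner'>Accept?</div></body></html>"
    print(inject_print_css(page))
```

The rule sits inside a print media query, so it only takes effect in the print dialogue and in the exported PDF.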
As a test case, here is the URL of a page from the MIT Technology Review website:
https://www.technologyreview.com/s/614605/sorryorganic-farming-is-actually-worse-for-climate-change
It gets even worse with a Reddit page like this one: https://www.reddit.com/r/musictheory/comments/bklgdd/any_recommended_music_theory_books/
(*) A topic for another rant. EU citizens have to say thank you to the bureaucrats in the EU Commission who know nothing about technology, and are happy to let lawyers decide what web content should look like. The result is a mechanism that is appalling in terms of user experience, and addresses the problem it rightly wants to address in the worst possible manner. What’s worse, that doesn’t seem to have generated any outcry in the design community, and most UX experts seem to be totally fine with it, judging from the silence around this topic.
Practical solutions
Beyond the fact that this issue should be addressed from the ground up, so that standards are observed more widely and supported by browser technology, here are some solutions that I’ve found thanks to people who contacted me privately and to comments I received on Hacker News.
Reader mode (Firefox only, as of the end of 2019)
Firefox lets you turn a graphics-heavy page into a much tidier, content-focused version by clicking on the icon at the right edge of the address bar, next to the zoom icon and the bookmarks icon. The icon doesn’t always appear, only if the page code allows reader mode to work nicely. If it does, you can then open the print dialogue, choose Open in Preview, and from the File menu save the PDF to the folder of your choice. I wonder why Firefox doesn’t have an easier way to get a print preview. Chrome used to have a similar “Distill” feature, but apparently it has not been available since the v75 update.
Not sure about Safari. I don’t use Safari, even though it’s the most performant browser on a Mac. I find the user experience mediocre, and I cannot find all the add-ons that I need. I used Vivaldi and thought highly of it, but eventually gave it up for similar reasons.
Add-ons to export as PDF
I value the PDF format because it offers longevity in terms of support: once a PDF is generated, it looks the way it looks, and that’s not going to change significantly in the future. The content is also searchable, and it is all contained within a simple file that can be easily previewed and sits nicely in the file manager as a single entity. There are add-ons that help make the PDF export tidier; the best ones I’ve tried are Print Friendly & PDF (Chrome and Firefox) and Printable - The Print Doctor (Firefox only).
Software to easily generate PDF documents
There is software that makes it easier to generate PDF documents from HTML pages. One example is doPDF by Softland, a freeware product that I haven’t tested myself; here is the recommendation that was passed on to me.
“It installs like a printer. Then you use the Menu > Print command in your browser to “print” the current web page to the doPDF “printer”. The result is a PDF file, which should be a faithful representation of the appearance of any web page. doPDF even opens up the file for you automatically, as an option, in your favorite PDF reader (Foxit is a good reader, and can also print Web pages). doPDF has been available for many years, as has Foxit. Both are considerably better in many ways than the standard Acrobat “DC” reader, yet they are compatible”.
IPFS distributed file system
IPFS stands for InterPlanetary File System, and it’s a protocol designed to make web content decentralised, safer and faster to retrieve. 2read is a browser add-on, compatible with both Firefox and Chrome, that lets you clean up the page much like the add-ons mentioned above. The export is cached on a server, but you can also rely on the emerging IPFS technology to “pin” that content locally. From what I’ve read about it, it looks like a very promising technology, but it gets too technical for the problem I am trying to solve. For those interested, here are a couple of discussions on Hacker News:
- Convert article in current tab to readable form and upload it to IPFS
- The Interplanetary File System, Simply Explained
Skimming through the comments, it seems like the technology is still in its infancy, despite significant funding, so it may take a while before it becomes more usable and bug-free. It is certainly one to keep an eye on, as what it promises is enticing. At the time of writing, it can only serve static pages. This may change in the future, should the usage of this technology ramp up.
Save as Web ARChive (WARC)
It’s exciting to learn that an open format exists for archiving high-fidelity, dynamic web content. WARC does just that. There is also a relatively simple piece of software built around it, called Webrecorder. I’ve tried it out, and while the technology seems quite powerful and effective, the user experience still needs a lot of improvement. Beyond that, I just need an easy way to save content from a web page to my computer, whereas this is software that I have to open and copy the URL into, and it then saves the archived content as a collection of obscurely labelled files that can only be managed using the same software, or a leaner version of it (called Webrecorder Player). That is too impractical for me, and who knows what’s going to happen in the future. On the positive side, you really do get a faithful, interactive copy of what you want to save for later retrieval, and “WARC is now recognised by most national library systems as the standard to follow for web archival.”
Polar app
Polar is a document manager that lets you, among other things, capture, annotate and highlight web pages, and save them locally so they are available offline. It uses a proprietary format, but according to its makers, contents can be exported to PDF. Check out the documentation page. I find it a very promising product; unfortunately, after installing the desktop app and trying it out briefly, I had the impression that it is barely usable and not mature yet. It works as a web application on both Firefox and Chrome. There is also an extension for the latter, but I did not manage to log in because of some technical issues. I will check again a few months from now.
There are likely a bunch of other apps out there that you might want to look into. Evernote, for example, offers a web clipping extension. I’ve never used it, despite making heavy use of Evernote for archiving purposes: I just don’t want the whole content of the pages I am saving to get mixed up with my notes and saturate them with content that is likely to show up in a search query.
Web bundles
There’s a promising technology called Web bundles that makes it possible to share websites as a single file over Bluetooth and run them offline. As interesting as it sounds, this technology has not been widely adopted yet.
Aggregated HTML documents
When exporting a page to PDF is not an option, because the markup doesn’t allow it, or if I just want to save the page as it is (hoping it won’t change too much in the future), I sometimes just save it as HTML. There are tools out there that encapsulate HTML content so it’s stored as a single file, rather than as an HTML file and an attached folder of resources. This is what MHTML does, but unlike the Web bundles mentioned above, it does not enable executable JS. Browser support is also quite poor.
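To give an idea of what such a single file looks like inside: MHTML is built on MIME multipart/related, the same container that e-mail uses for messages with attachments. The sketch below uses Python’s standard email package to bundle an HTML string and its images into one such message. The function names and example URLs are my own assumptions, the caller is expected to have downloaded the images already, and I have not verified how browsers handle a file produced this way, so treat it as an illustration of the structure and not as a working exporter.

```python
import email
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, images):
    """Bundle an HTML string and its images into one multipart/related message.

    `images` maps each image URL to a (bytes, subtype) pair, e.g. (data, "png").
    """
    bundle = MIMEMultipart("related", type="text/html")

    # The first part is the page itself, labelled with its original URL.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    bundle.attach(root)

    # Each resource becomes its own part, labelled with the URL it came from.
    for url, (data, subtype) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        bundle.attach(part)

    return bundle.as_bytes()


def list_parts(raw):
    """Parse a bundle and return (content type, original URL) for each part."""
    message = email.message_from_bytes(raw)
    return [(part.get_content_type(), part["Content-Location"])
            for part in message.get_payload()]


if __name__ == "__main__":
    raw = build_mhtml(
        "https://example.com/article",
        "<html><body><img src='https://example.com/pic.png'></body></html>",
        {"https://example.com/pic.png": (b"placeholder bytes", "png")},
    )
    print(list_parts(raw))
```

Writing the returned bytes to a file gives a single archive with the page and its resources stored side by side, each labelled with the URL it came from.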
The SingleFile add-on (also available as a Chrome extension) lets you save HTML content as a single file. I haven’t tested it because I am not sure how the technology behind it works, or whether it’s reliable for future retrieval.