StreetLib Write: New Content Sanitizer

Fabio Nicotra
StreetLib
9 min read · May 3, 2017

--

At StreetLib, we noticed that authors and publishers needed an easy and convenient solution for building high-quality eBooks. This is why we developed our online application with a WYSIWYG text editor.
And however amazing this system may seem from the outside, it is on the inside that the magic truly happens.

See, the most common problem with poorly made ePubs (the standard eBook file format) is the mess of HTML behind the words. Even though you may see a beautiful paragraph, with a well integrated image and text aligned to perfection, that doesn’t mean that all is well behind the curtains. And this may give a completely different, and much uglier, reading experience to others.

From the outside, it may look like just a hammer nailing down a nail, but what goes on isn’t that simple or efficient.

To fix this, the rest of the tech team and I built an HTML neat freak, cleaning up every ounce of the text file to make it as sleek as possible, and thus more efficient. This way, readers get the book they deserve.

We recently released a new version of this HTML cleaner and I want to tell you about it. And for the less tech-savvy amongst you, my colleague AC de Fombelle will explain what I mean without all the jargon (in italics).

The original HTML sanitization

In order to generate valid EPUB documents, the StreetLib Write backend has always performed a series of tasks to sanitize the HTML contained in the users’ books.
The first basic form of sanitizing has always been performed by the generator itself. The generator is the component that actually “makes” the final EPUB file: it takes all the book contents from the database and assembles the final EPUB file (note that the EPUB is also the base used for other formats such as MOBI and PDF).
So, from time to time, as new errors were found and reported by our users, new checks were introduced inside the generator.

Of course, this strategy had too many limits. We therefore needed to create a separate sanitizer that could perform the validation and sanitization based on a set of “grammar rules” that could be easily defined, and another set of “actions” to be performed in order to fix the HTML errors found in the user contents. The set of rules was loosely taken from the W3C specifications for XHTML1.1.

Consequently, a few years ago we introduced the HTMLSanitizer class as a new component within StreetLib Write. Users can disable the feature via a checkbox in the download dialog of StreetLib Write’s User Interface.

Of course, we always advise our users to leave this box checked!

AC’s translation:

ePub files need to be as clean as possible to guarantee smooth and comfortable reading on any device. So, our clever tech team decided “let’s just have our ePub builder clean up the file”.
At first, it was the same thing that generated the actual ePub file that did the cleaning. However, this became too complex. So they made a separate thing that is dedicated to cleaning up the file. If our users prefer a messy ePub, they can simply disable this feature.
Despite a lot of imperfections, the HTMLSanitizer did a pretty good job. It is launched by the generator during the EPUB generation, but remains a separate component.

A rigid sanitizer that needed replacing

Unfortunately, this HTMLSanitizer was also limited: the grammar rules were defined directly inside the Ruby code, so it was not very flexible and many kinds of HTML errors simply got ignored until we defined a specific rule for them.

A few months ago, a user report made us aware that we needed to generate a reflowable EPUB3: the user’s book contained a lot of HTML5, and since Write supports reflowable EPUB3, we wanted to take advantage of this feature.
But the sanitizer did not support HTML5 rules, so we had to disable it (and manually fix most of the errors ourselves).

Adding HTML5 rules to the HTMLSanitizer required a big effort, because we would have had to manually define them inside the sanitizer code.

This incident highlighted all the old HTMLSanitizer’s limits. It also gave us the chance to write a new one, especially after discovering the validation capabilities of the Nokogiri library that we were already using inside StreetLib Write.

Nokogiri is a powerful Ruby library that allows us to manipulate XML/HTML code, and we were already using it for many tasks (including the sanitization). It’s quite fast and performs well since its engine is native and written in C.
It also has a validation feature that can validate an XML/HTML document using a Schema definition file (XSD). Schema definition files are actually XML files containing all the grammar rules used for validation, and they are directly provided by the W3C.

This gave us a huge opportunity to overcome the limits of the old HTMLSanitizer.

AC’s translation:

The cleaning component as it was did the job but was still limited. Each cleaning rule had to be set manually, so it was a pain to keep updated, leaving many HTML errors ignored. The fact that the cleaner did not support new HTML5 rules further highlighted its limits. Fortunately, Nokogiri — a technical tool already used inside StreetLib Write — turned out to have great validation capabilities based on grammar rules provided directly by the W3C. What is W3C, you ask? Well it’s the World Wide Web Consortium: “the main international standards organization for the World Wide Web”. In other words, the group that defines standards for the technical languages used online. Which makes them the ultimate reference to get a good, clean HTML file.

N.B: I tried to figure out what Nokogiri was, meaning I would have had to figure out what Ruby is, and why it was good that it was native and written in C… I figured it basically meant it was a great solution to make our tech team’s work easier. I’ll leave it at that.

Below are a few details on the restrictions of the old cleaning tool and advantages of now using Nokogiri. The joy of going from broom to hoover :)

This witch also made the switch from broom to hoover

The limits of the old HTMLSanitizer

Let’s take a look at the main flaws of the old HTMLSanitizer:

  • The HTML grammar rules were manually defined inside the Ruby code so it was hard to rapidly define new rules for new formats (e.g. HTML5)
  • The defined rules did not cover all the possible validation errors, so many errors were simply missed and remained unfixed, requiring us to manually add new rule definitions after discovering those errors (mainly after user reports)
  • The “validation and fix” task was slow and heavy, because the sanitizer iterated over every single HTML element of the document (even if the element had no errors), checking it against defined rules and fixing it if necessary. Furthermore, after every fix, the element was checked again recursively in order to prevent the introduction of new errors through the act of fixing.
  • Only HTML errors were checked, leaving CSS errors untouched.
  • It didn’t track the changes made into the HTML code, so the users could not review (or approve) the changes made to the content of their book.

The features introduced by the new Sanitizer

The new sanitizer (HTMLSchemaSanitizer) pre-validates the HTML code before trying to fix it, using the validation features of the Nokogiri library that rely on XSD schema definitions provided by the W3C.

These are the main advantages of the new sanitizer:

  • It’s very easy to add new definitions for different formats, since we only have to add the XSD file usually provided by external entities (i.e. the W3C). The new HTMLSchemaSanitizer already supports both XHTML1.1 (used for EPUB2) and XHTML5 (used for EPUB3).
    It can even validate SVG and MathML code inside HTML5 documents!
  • Since it’s using XSD validation files, almost every possible error is detected and fixed, reducing the number of HTML errors to almost zero!
  • It’s much faster and performs way better than the old sanitizer, because the HTML code is pre-validated by the Nokogiri XSD native validator. This means we don’t need to iterate over every HTML element of the document, and only have to fix the errors reported by the XSD validator. If the content has no errors, sanitization doesn’t even start.
  • It also supports CSS validation and sanitization thanks to a validation tool provided by W3C.
  • It tracks all the changes made to the HTML code. This is very useful because we can log every error found and every action that was performed. In the future, we will also be able to give users a preview of the changes to be made (who, in turn, could choose to refuse them). This feature also allows us to use the sanitizer at various steps of the book production workflow.

AC’s translation:
Like a good Daft Punk tune, this new solution to clean HTML is better, faster and stronger. It also supports CSS and tracks the changes made. It will also eventually allow us to show StreetLib Write users what the tool plans to do, and offer the possibility to accept or refuse those changes.

Future ideas: implementing the sanitizer on different production steps

Up until now, we’ve used the sanitization of the HTML content in just one production step, the final EPUB generation. The sanitizer is currently invoked by the generator during the assembly process of the final EPUB file.
The introduction of the new HTMLSchemaSanitizer won’t change this behaviour, for now.

But the capabilities of the new HTMLSchemaSanitizer — i.e. allowing the user to review and revert changes made during the sanitization — give us the opportunity to use it at other steps of the production workflow as well.
We must consider that the actions performed to fix the HTML errors could be, to some extent, invasive. The content produced could be unacceptable for the user, even if technically correct.

The most important phase (other than generation) that could benefit from the sanitization is also the most critical: the book importer.

To date, most of the validation errors come from the bad HTML code generated during the conversion of the imported document (mostly from MS Word or OpenOffice documents).

Sanitization currently happens only in the final phase. This means there’s often a large difference between the HTML code of the imported document that the user can see and edit inside Write’s editor and the HTML code of the final EPUB.

By importing already sanitized content, we could reduce this gap.

The new sanitizer could also be introduced in the authoring phase. It would be nice if the users were notified of the errors while editing the contents inside StreetLib Write’s editor.

The sanitizer could asynchronously check the edited content and notify the user of every possible error using the StreetLib Write editor UI.

Thanks to the error tracking feature of the new sanitizer, our users could also review the errors and all the possible fixes. For example, by clicking on the notification, they could see a list of all the validation errors and all the possible fixes, with a preview of what their contents will look like after being fixed. This would allow our users to accept or refuse the automated fix provided by the sanitizer and, in the latter case, try to fix the error manually.

AC’s translation:

Actually this paragraph was pretty self explanatory. Basically, instead of just cleaning the HTML at the end, before the ePub is generated, we can use the cleaning tool to improve the eBook production process. Firstly, by importing a cleaned HTML and then by running scans for errors while the eBook is being edited on StreetLib Write, so that the user can see automated or manual fixes to do on the spot. This means a shorter and sleeker editing process for the user, and an even cleaner eBook as a result!

If you want a very simple summary of everything, here it is: StreetLib Write turned a broom into a hoover, and now this hoover can clean every little thing each day instead of doing one huge spring clean at the end.
