A History of HTML Parsing at Cloudflare: Part 2

Cloudflare
Cloudflare
Published in
3 min readNov 29, 2019

The second blog post in the series on HTML rewriters picks up the story in 2017 after the launch of the Cloudflare edge compute platform Cloudflare Workers. It became clear that the developers using workers wanted the same HTML rewriting capabilities that we used internally, but accessible via a JavaScript API.

This blog post describes the building of a streaming HTML rewriter/parser with a CSS-selector based API in Rust. It is used as the back-end for the Cloudflare Workers HTMLRewriter. We have open-sourced the library () as it can also be used as a stand-alone HTML rewriting/parsing library.

The major change compared to LazyHTML, the previous rewriter, is the dual-parser architecture required to overcome the additional performance overhead of wrapping/unwrapping each token when propagating tokens to the workers runtime. The remainder of the post describes a CSS selector matching engine inspired by a Virtual Machine approach to regular expression matching.

v2 : Give it to everyone and make it faster

In 2017, Cloudflare introduced an edge compute platform — Cloudflare Workers. It was no surprise that customers quickly required the same HTML rewriting capabilities that we were using internally. Our team was impressed with the platform and decided to migrate some of our features to Workers. The goal was to improve our developer experience working with modern JavaScript rather than statically linked NGINX modules implemented in C with a Lua API.

It is possible to rewrite HTML in Workers, though for that you needed a third party JavaScript package (such as Cheerio). These packages are not designed for HTML rewriting on the edge due to the latency, speed and memory considerations described in the previous post.

JavaScript is really fast but it still can’t always produce performance comparable to native code for some tasks — parsing being one of those. Customers typically needed to buffer the whole content of the page to do the rewriting resulting in considerable output latency and memory consumption that often exceeded the memory limits enforced by the Workers runtime.

We started to think about how we could reuse the technology in Workers. LazyHTML was a perfect fit in terms of parsing performance, but it had two issues:

  1. API ergonomics: LazyHTML produces a stream of HTML tokens. This is sufficient for our internal needs. However, for an average user, it is not as convenient as the jQuery-like API of Cheerio.
  2. Performance: Even though LazyHTML is tremendously fast, integration with the Workers runtime adds even more limitations. LazyHTML operates as a simple parse-modify-serialize pipeline, which means that it produces tokens for the whole content of the page. All of these tokens then have to be propagated to the Workers runtime and wrapped inside a JavaScript object and then unwrapped and fed back to LazyHTML for serialization. This is an extremely expensive operation which would nullify the performance benefit of LazyHTML.

LOL HTML

We needed something new, designed with Workers requirements in mind, using a language with the native speed and safety guarantees (it’s incredibly easy to shoot yourself in the foot doing parsing). Rust was the obvious choice as it provides the native speed and the best guarantee of memory safety which minimises the attack surface of untrusted input. Wherever possible the Low Output Latency HTML rewriter (LOL HTML) uses all the previous optimizations developed for LazyHTML such as tag name hashing.

Find the rest of this post, as well as part 1, over on the official Cloudflare blog.

Originally published at https://blog.cloudflare.com on November 29, 2019.

--

--