Anatomy Of A Chrome Crash

(For the layman! No programming knowledge needed!)

This is the final part of a four-part article. Here are parts one, two, and three.

Return To The Scene Of The Crime

I lied. I didn’t wash my hands of the bug. My conscious brain did; my subconscious brain decided it would be great to keep plugging away. The next morning it barged in on me to let me know that my last idea hadn’t been as bad as I’d thought. We could detect nested escapes in one pass, without arbitrary lookahead or recursion. It would only take looking at a few extra characters.

Escape sequences always look like “%XY”, where X and Y are hex characters. Thus there are only three ways to create a new escape sequence while unescaping: (1) unescape X, (2) unescape Y, or (3) unescape both. In all three cases, we only needed to worry when unescaping produced a valid hex character. Any other character couldn’t be part of an escape sequence. Hex characters were ordinary letters and numbers, so they weren’t normally escaped anyway. Cases where we needed to test further should be rare.

If we did need to test, we could catch case (1) by checking whether the previous character was “%” and the next was hex; if both held, we had a new escape sequence. This check wouldn’t catch case (3), since we wouldn’t have unescaped Y yet; the next character would still be a percent sign instead of a hex value. But this was OK; after unescaping X, case (3) would become case (2). And we could detect case (2) by looking at the previous two characters for a “%” and a hex character.
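For the programmers in the audience, the two checks can be written down in a few lines. This is a rough Python sketch of the idea, not Chrome’s actual C++ code; the function and parameter names are my own invention for illustration:

```python
import string

HEX = set(string.hexdigits)  # 0-9, a-f, A-F

def creates_nested_escape(output_so_far, next_input_char):
    """Called only when an unescape just produced a hex digit: decide
    whether that digit would complete a brand-new "%XY" escape.

    output_so_far   -- the canonical output built so far, before
                       appending the newly unescaped digit
    next_input_char -- the input character after the current escape,
                       or "" at the end of the string
    """
    # Case (1): the digit becomes X. The previous output character is
    # "%" and the next input character is hex, so "%" + digit + next
    # forms "%XY".
    if output_so_far.endswith("%") and next_input_char in HEX:
        return True
    # Case (2): the digit becomes Y. The previous two output characters
    # are "%" and a hex digit, so "%" + hex + digit forms "%XY".
    if (len(output_so_far) >= 2 and output_so_far[-2] == "%"
            and output_so_far[-1] in HEX):
        return True
    # Case (3) isn't checked directly: once X has been unescaped, it
    # shows up as case (2) on the next pass through the input.
    return False
```

Note that case (3) never gets its own check; as described above, it turns into case (2) on its own.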

This would only detect one level’s worth of nesting. Checking “/%%%3333” would tell us “%%333” was a problem, but not the outer characters. This was OK too. As soon as we found a nested escape, we could mark the GURL in question as invalid and undo our most recent unescape. Future canonicalization attempts would detect the same nested escape and fail the same way. Canonicalization would be idempotent, so it would be safe. It wouldn’t use recursion and become O(n²). In fact it would only scan up to three extra characters per unescaped hex character. This should make it fast enough.
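Here is what that first solution looks like as a self-contained Python sketch. Again, this is not the real Chromium code; the names are invented, and plain ASCII letters and digits stand in for the “unreserved” characters that GURL actually unescapes:

```python
import string

HEX = set(string.hexdigits)  # 0-9, a-f, A-F

def should_unescape(c):
    # GURL only unescapes "unreserved" characters; plain ASCII letters
    # and digits stand in for that set in this sketch.
    return c.isascii() and c.isalnum()

def canonicalize(path):
    """Return (canonical_path, is_valid). On finding a nested escape,
    mark the URL invalid and leave the current escape sequence as-is
    (the "undo" described above)."""
    out = []          # canonical output, one character per entry
    valid = True
    i = 0
    while i < len(path):
        if (path[i] == "%" and i + 2 < len(path)
                and path[i + 1] in HEX and path[i + 2] in HEX):
            c = chr(int(path[i + 1:i + 3], 16))
            nxt = path[i + 3] if i + 3 < len(path) else ""
            if should_unescape(c):
                nested = c in HEX and (
                    # Case (1): c is the X of a new "%XY" escape.
                    (out and out[-1] == "%" and nxt in HEX) or
                    # Case (2): c is the Y; "%" and X are already output.
                    (len(out) >= 2 and out[-2] == "%" and out[-1] in HEX))
                if nested:
                    valid = False
                    out.extend(path[i:i + 3])  # keep the escape untouched
                else:
                    out.append(c)
            else:
                out.extend(path[i:i + 3])      # leave it escaped
            i += 3
        else:
            out.append(path[i])
            i += 1
    return "".join(out), valid
```

Running this on the crashing input gives `canonicalize("/%%30%30")` → `("/%0%30", False)`, and feeding that output back in returns the same result, which is the idempotence property described above.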

I implemented this idea and tried the cases that had triggered DCHECKs before. I was elated when the browser didn’t crash or behave erratically at all. It just told me the URLs in question were invalid. Fixing the canonicalization code avoided the need to change all the callsites I’d found. I posted my solution for code review and walked away patting myself on the back. (More on code review in a minute.)

There was one nagging problem, and it had to do with case (3), and how we wouldn’t catch it until it became case (2). The problematic sequence on this bug, “/%%30%30”, was in case (3). My code would unescape the first “%30” and get “/%0%30” before seeing the second “%30” and bailing out. This meant we would have unescaped one “%30” and not the other. This felt arbitrary. GURL’s rules say that the canonical form of “%30” is “0”; why obey that with one “%30” and not the other? And what if we sent this URL to a program with different rules, and it unescaped the other “%30”? This would create a URL with “%00”. Could we still cause bugs in other programs?

At first I decided this was an acceptable side effect. I started to rethink this after talking with my code reviewer the next day, when I was back in the office.

At Google, it’s not enough for a programmer to write a fix for a problem. Before that fix becomes part of a product, it must go through a code review process. One or more other programmers look at the bug and read the fix. They work with the author to make the code elegant, performant, and well commented. Above all, they make sure the change uses the right approach to begin with. It’s a bit like having someone help rewrite and edit your writing before publishing.

In my case, my reviewer was Brett Wilson, the original author of the GURL library. He had some of the same concerns I did. To save time, instead of posting comments on the review site, he walked over to my office. We discussed the different possible fixes and their downsides. By the end of that talk, I’d decided my original solution wasn’t quite right. It was a good start, but I needed to avoid the inconsistency in handling adjacent escapes. Assuming that was possible.

That’s All, Folks

It was possible, and the idea for how to do it came from reading one of the comments on the bug. It said, “Because the percent (‘%’) character serves as the indicator for percent-encoded octets, it must be percent-encoded as ‘%25’…”. In other words, if you encode “%” as “%25”, programs won’t treat it as the start of an escape sequence anymore. They’ll leave it escaped, since unescaping could turn subsequent hex characters into a new escape sequence. This was closely related to the problem I was solving. By making use of this rule, I could ensure all clients would treat the URL in a safe and consistent way.

Doing this meant modifying my previous solution a bit. The three cases I mentioned before, and the ways to detect them, stayed the same. The difference was how I’d react when detecting a problem. I would no longer undo the most recent unescape and flag the URL as invalid. Instead, I could just reach back to the initial “%” in the new escape sequence and convert it to “%25”. “/%%30%30” would first turn into “/%0%30”, then “/%00”, then “/%2500”. That would leave a valid URL, with no nested escape sequences. Its behavior would be more consistent than the first solution. Arguably, the result was a more accurate way to represent the intent of the original input. And despite the differences in their parsers, other browsers would all treat it the same way.
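To make the change concrete, here is the earlier sketch updated for the final approach. As before, this is illustrative Python with invented names, not Chrome’s real C++, and ASCII letters and digits stand in for GURL’s “unreserved” set. The detection logic is unchanged; only the reaction differs, converting the offending “%” to “%25” instead of flagging the URL invalid:

```python
import string

HEX = set(string.hexdigits)  # 0-9, a-f, A-F

def should_unescape(c):
    # GURL only unescapes "unreserved" characters; plain ASCII letters
    # and digits stand in for that set in this sketch.
    return c.isascii() and c.isalnum()

def canonicalize(path):
    out = []  # canonical output, one character per entry
    i = 0
    while i < len(path):
        if (path[i] == "%" and i + 2 < len(path)
                and path[i + 1] in HEX and path[i + 2] in HEX):
            c = chr(int(path[i + 1:i + 3], 16))
            nxt = path[i + 3] if i + 3 < len(path) else ""
            if should_unescape(c):
                if c in HEX and out and out[-1] == "%" and nxt in HEX:
                    # Case (1): c would be the X of a new "%XY" escape.
                    out[-1:] = list("%25")    # re-escape that "%" as "%25"
                elif (c in HEX and len(out) >= 2
                        and out[-2] == "%" and out[-1] in HEX):
                    # Case (2): c would be the Y of a new "%XY" escape.
                    out[-2:-1] = list("%25")  # re-escape that "%" as "%25"
                out.append(c)
            else:
                out.extend(path[i:i + 3])     # leave it escaped
            i += 3
        else:
            out.append(path[i])
            i += 1
    return "".join(out)
```

With this version, `canonicalize("/%%30%30")` produces “/%2500”, and running the result through again leaves it unchanged: every URL comes out valid, with no nested escapes and no half-unescaped leftovers.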

This became the kernel of my final solution. The change as checked in contains some more tweaks for better performance. For example, it avoids some of the character scanning by tracking problematic “%”s as they’re encountered. But the basic functionality is what I’ve described above.

Brett finished reviewing the change late Monday afternoon. That evening I landed it (meaning the change was integrated into the master copy of Chrome). The total time between first seeing the bug and landing the fix was about two and a half days, mostly over a weekend.

So that’s it. I learned about code I wasn’t familiar with and solved an interesting problem. Those sorts of things are why I do what I do. Having it be a high-profile public bug just added to the satisfaction. If you like thinking and learning as well, then I hope this has been a fun journey. Maybe you know a little more about programming, URLs, or this particular bug than you did before.

Until next time!

Thanks for reading! Let me know what you thought in the comments, and feel free to follow me on G+.