How We Measure Standards (and why it’s sort of a problem)

I’ve been working on the web platform since 2010. I moved over to the Internet Explorer team from SharePoint Designer to help with IE9, and I’ve been here ever since. At that time, Chrome had only barely shipped. Chrome version 4.0 (still based on WebKit at the time) and Firefox version 3.6 (running Gecko) were both released right around the day I started on IE. But to be clear: we were not much concerned with Chrome. It was certainly cute and seemed fast, but the real competition was Firefox. Mozilla strongly argued that they were the “Standard-Bearer” of Standards (competing directly with Opera for that role), and we decided to compete against them in the standards arena.

During IE8 development, the team started experimenting with “standards mode,” which would allow Microsoft to ship a mode that supported “standards” as well as a mode that supported compatibility with legacy systems. To make a standards mode possible, Microsoft engineers wrote a lot of tests. A lot. And we made some of them public by donating them to the W3C. These submissions predate the “Web Platform Tests” project. We submitted the tests to the individual working groups’ test repos, and then we reported the results of those submitted tests. We also participated in many “Test the Web Forward” events as a way to help those who were interested submit their own tests. I would argue that those events didn’t go well (they were fun, sure, and we did get some tests from them, but many had to be rejected due to poor quality).

It turns out that it really does require experts in the code to write good tests of that code. I think one day I’ll write about that more specifically.

So anyway: we were actively engaging in testing standards and we were actively competing against Firefox, who were perceived to be the Standard-Bearer of Standards. So, how best to compete against that perception? Work to change it…

When we submitted tests to the W3C, we were ridiculed by some for only submitting a subset of the tests we actually wrote. This was true: we did not submit all of our tests. Others assumed we only wanted to submit tests we passed and keep all the failures secret. That was only partially true. We really did pass almost 100% of the tests we wrote, even the internal ones, because it’s weird to check in a test that is failing. We also submitted some failing tests along the way (flexbox tests, for example) because that allowed us to more easily show progress.

So why is this a problem?

Well, as stated above: writing tests requires expertise. It is also very time-consuming. It is therefore very expensive. For all new features coming through the W3C, Microsoft was submitting tests showing that we passed nearly 100% of them, while Firefox passed only 90% or so (interestingly, and in hindsight a really, really important fact: Chrome was also at about the 100% mark on new features, and when I tried to demonstrate that they were not following standards it was very difficult to do). So most of the tests that were publicly consumed were written by Microsoft, and they were mostly concerned with new features only. For legacy features there were some tests, and there were the Acid Tests, of course, but once you pass them, you move on. They were never meant to continue to grow in size and scope the way the Web Platform Tests would.

What I learned from all of this is that he who controls the tests begins to control the perception. And our game theory was correct: Firefox had to dedicate engineers to fix public test failures. They had to dedicate engineers to write tests so that Microsoft would not control the entire narrative. The opportunity cost of doing so meant they had to be slower at creating new features. However, it turns out that this gamesmanship did not work against Google. While we were focused on Mozilla and their limited engineering resources, Google was choosing to throw as many bodies as necessary at the problem. And they did a really (really, (really)) good job at implementing features, writing tests, and demonstrating standards in their (now forked) engine: Blink (ironically named after a non-standard feature from Netscape Navigator).

In 2017, Chrome enabled two-way sync between the Chromium source and Web Platform Tests. This sounds like goodness: it means that Chromium is constantly being tested for standards and that regressions in standards will cause changes to be backed out of source. Of course, tests that currently fail will continue to fail, but at least new tests will be required to pass. And that is a good thing. It is…

But do you see where I’m going?

New features in Chromium must include tests. If those tests fail, the merge will likely be rejected. Those tests then sync automatically with the Web Platform Tests. Adding any tests that Chromium might fail requires another expert in the feature. One of those experts used to be Microsoft with EdgeHTML, but not anymore. Maybe Opera engineers could do it, but they are now based on Chromium as well, and Presto is no more. And so it falls again to Mozilla.

You can see from the current dashboard (wpt.fyi, written primarily by Google) that Chrome passes more tests than any other engine. And they will continue to submit tests for new features that will also pass 100% of the tests. And the only way that Mozilla will change that is if they submit tests at the same rate as Chrome. You may be thinking, “Whoa John, hold on. Firefox also has two-way sync, so aren’t they just doing the same thing?” Well, they could, except they have to spend so much time fixing the failures shown on wpt.fyi that they really cannot play this game. It is also the case that developers treat Chrome as the de facto standard and don’t pay a lot of attention to test results other than their own anyway.

OK, so what’s my actual point?

The industry measures standards compliance via test suites that do not necessarily test the things web developers care about, that may exercise features in ways real code never will, and that do not show a complete picture of implementation status. I learned from Microsoft’s approach that we were not really improving standards across the different browser engines. We were improving the perception of standards in our engine and (hopefully) driving other engines to match our results. What that does not show us is all the areas where we fail to match standards in meaningful ways with all these new features. When we finally pass all of the tests in HTML, for example, does that mean we have interop? No. It means we pass all of the tests that (primarily) Chromium submitted for that feature. I have to admit that as I’ve been learning the Chromium codebase I’ve been quite impressed by how good it really is, but I am a bit worried about interoperability longer term. Who is going to be testing all of the cases that Chromium fails but no one knows about due to a lack of tests? Who is testing the margins where two features interact? Who is testing the interoperability of end-to-end scenarios? And who is helping prioritize those bugs to the highest benefit of web developers?

I guess it’s up to all of us. If you care about this kind of stuff enough to have gotten to this point in the post, I think you should take some time to learn the tests in a feature you care about, learn to write tests for that feature, and then test the crap out of it. Maybe Chromium will pass all your tests. Maybe you won’t quite know if your tests are valid (how would you validate them other than running them in Chrome, by the way?). Maybe you’ll find an important gap in standards that would otherwise go unseen and maybe (just maybe) you’ll help test the web forward.

Caveat Caveat Caveat

  1. I have helped with wpt.fyi quite a bit, and my team is working to enable Azure Pipelines for the automation runs, enable the new Edge browser to report results, and help with the design of the reporting. I actually really appreciate a dashboard view like that and I think that Philip Jägenstedt is doing a great job with it. It’s not the dashboard’s fault that the test suite is incomplete.
  2. Some tests really don’t matter to devs at all. For example, a spec might say that a value MUST be a positive integer and that if anything else is passed in, the implementation MUST error by doing ‘foo’. We write a test that passes in a string. When we execute the test, we find that one browser errors with ‘foo’ (PASS), but another (maybe EVERY other one) errors with ‘bar’ (FAIL). Does the dev really care? Even though these tests are used to show devs which features they can or cannot use on their sites, these tests are fundamentally for the browser vendors who are writing browsers.
  3. Sometimes a simple fix to a browser will fix hundreds of failures. For example, if the rule above is tested with different strings, negative integers, zero (which one browser may treat as positive and another may not), empty values, and so on, all going through the same code path with a single simple bug, then hundreds of failures do not necessarily mean the browser has hundreds of bugs.
  4. You may be asking by now, “why hasn’t John just submitted a bunch of tests showing interop failures?” Same reason others haven’t as well, I guess: Time. But I do hope to get better at this.