When Test Pilot was launched in May, Universal Search was among the first cohort of experiments. It is ambitious in scope: an attempt both to gain a better understanding of how users interact with the address bar and to make content recommendations for in-flight searches while still respecting users’ privacy. It has been one of the more popular experiments: it was the first to reach 100,000 downloads, has the lowest uninstall rate, has sustained over 20,000 daily users over the last month, and was the subject of a beautiful laptop sticker:
On November 30th, Universal Search will break further ground as the first Test Pilot experiment to be retired. Since it depends on a server with substantial maintenance costs, we will be pushing a self-uninstalling update to the extension and decommissioning the server on that date.
This essay looks back at the project, providing an overview of how it worked, what we learned, and where future study might focus.
How it worked
Like all Test Pilot experiments, Universal Search was an extension — a Firefox add-on that modified the behavior of the browser. When installed, it sent the user’s address bar keystrokes to a recommendation server. That server guessed what the user was trying to type and tried to provide a content recommendation. A user seeking their social media fix might type f and be recommended Facebook.
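A toy sketch of that guess-and-recommend step might look like the following Python. The function name and static site table here are invented for illustration; the real server queried search-engine APIs rather than a hard-coded list.

```python
from typing import Optional

# Hypothetical static table standing in for the real server's data sources.
POPULAR_SITES = {
    "facebook": "https://www.facebook.com/",
    "wikipedia": "https://www.wikipedia.org/",
    "youtube": "https://www.youtube.com/",
}

def recommend(prefix: str) -> Optional[str]:
    """Guess the destination for a partial address bar query."""
    prefix = prefix.strip().lower()
    if not prefix:
        return None
    # Recommend the first known site whose name starts with the typed prefix.
    for name, url in POPULAR_SITES.items():
        if name.startswith(prefix):
            return url
    return None
```

Typing f matches facebook before the user finishes the word, so a recommendation can be offered on the very first keystroke.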
In some cases, the server went a step further and identified the nature of the content being recommended, allowing it to enhance the recommendation with metadata specific to that type. A user looking for Michael Jordan’s cinematic opus might have been recommended SpaceX’s website when typing space, but should have been presented with the IMDb page for Space Jam — augmented with release date, plot summary, and rating — when extending the query to space jam.
What we learned
At the outset, we were looking to answer a few broad questions:
- Will users engage with browser-provided recommendations?
- When do users choose recommendations?
- Do users who choose recommendations come back for more?
- What are the characteristics of a successful recommendation?
- How do users interact with the address bar?
A focus on privacy
This isn’t a completely novel concept; other browsers have previously offered features like this. But often this means sending sensitive information about your browsing habits to untrusted third parties like advertising networks. We wanted to do this in a Mozilla sort of way that respected your privacy while still trying to help you get to your information more quickly.
Our data collection and privacy policies were clear and concise, and we made efforts to scrub potentially-identifying data. Since the server’s code is open source, you can verify that we are only collecting what we claim to. All requests to third-parties were proxied by our recommendation server, so no data providers had insight into your browsing habits.
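The scrubbing of potentially-identifying data might look something like this sketch (a hypothetical helper; the server’s actual rules live in its open-source repository):

```python
from urllib.parse import urlsplit, urlunsplit

def scrub_url(url: str) -> str:
    """Drop query strings and fragments, which often carry session or user identifiers."""
    parts = urlsplit(url)
    # Keep only scheme, host, and path; discard query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```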
The second hard problem
As the joke goes, there are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.
The Universal Search name was an homage to a Google project that added context to search results; prior art for a complex problem. It seemed to the team that the name struck the perfect balance between suitable and whimsical.
Upon installation, the extension removed the separate search box for which Firefox is somewhat notorious. This was done to preserve the integrity of our experiment; if users didn’t break their habits and start using the address bar to search, our measurements of user behavior would have been unreliable.
This was the most obvious UI change that Universal Search made. The extension also appeared to unify the address bar and search box, and that sounds a lot like the word universal. Not only was removing the search box an unpopular design decision, but exit survey feedback showed that users commonly thought the removal itself was the experiment. Better naming — perhaps something akin to “Smart Search” — may have reduced experiment attrition and provided better results.
Experimenting with XUL
Our original approach to the extension involved replacing the entire address bar dropdown with an <iframe> element, using a pubsub broker to communicate with the browser. It was an interesting approach, with many benefits:
- We could quickly iterate with Service Workers, downloading small interface updates in the background without requiring an add-on update.
- We could easily manipulate results: grouping similar items, prioritizing secondary results (bookmarks, history items, etc), and adding additional metadata for specific types of content.
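The pubsub pattern between the iframe and the browser chrome can be sketched with a minimal broker. This is a Python toy of the messaging shape only; the real extension passed messages between chrome and content code.

```python
from collections import defaultdict
from typing import Any, Callable

class Broker:
    """Minimal topic-based pubsub: chrome and iframe code never call each other directly."""

    def __init__(self) -> None:
        self._subs: "dict[str, list[Callable[[Any], None]]]" = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subs[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        # Deliver the payload to every handler registered for this topic.
        for handler in self._subs[topic]:
            handler(payload)
```

The browser side might publish on a topic such as "urlbar:keystroke" (a made-up name) while the iframe subscribes to it, keeping the two sides decoupled.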
That all sounds great, but there was one major problem: introducing a vastly different experience would make it challenging to identify the source of any behavioral changes. They might have been attributable to the new design, to differences in responsiveness, or to any of a number of new features we considered.
Ultimately, we abandoned that approach in favor of a simpler XUL-based one that inserted a single recommendation above the top of the existing results. Though we lost some of the benefits of the <iframe>, and our extension was likely to break with any upstream changes to the address bar, we were much better able to test our core research question: do users engage with browser-provided recommendations?
Injecting the recommendation with XUL proved to be challenging, but we now have a much better understanding of what is required to experiment with search results, and have made concrete plans to make the address bar more extensible for future experimentation. Through Embedded WebExtensions and WebExtension Experiments, these capabilities will also be available to WebExtensions when they land.
A common thread among feedback was concern over the quality of the recommendations; they sometimes felt like ads and often looked biased. Since we relied on search engines for relevancy weighting, we were heavily influenced by website SEO. A search for linux resulted in a link to Linux Mint, which scores highly in Bing’s autocomplete engine. This could have been avoided by ignoring the long tail of domains on the internet and recommending only from a curated subset of higher-relevancy ones.
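One way to implement that curation is a domain allowlist, sketched below. The domains listed are illustrative, not an actual curated set.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of higher-relevancy domains.
ALLOWED_DOMAINS = {"wikipedia.org", "imdb.com", "mozilla.org"}

def is_recommendable(url: str) -> bool:
    """Only allow recommendations whose host is (or is under) an allowed domain."""
    host = urlsplit(url).netloc.lower()
    # Match the registered domain, including subdomains like en.wikipedia.org.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```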
The recommendation server also suffered from a different sort of bias: ethnocentricity. Search engines use information about their users to provide more relevant results, customizing them to what they guess are your age, gender, interests, location, and language. Yahoo’s BOSS API — another of our data sources — allows consumers like Universal Search to explicitly specify region and language. We should have taken advantage of this by passing along the user’s information. Instead, Yahoo inferred the region from where the recommendation server was querying (the United States), then chose the default language for that region (English). This resulted in poor recommendation quality for users outside of the United States and for speakers of languages other than English.
For example, the acceptance criteria for the first version of the recommendation server included the search query f recommending Facebook. Similarly, users in Russia might expect v to take them to VKontakte, a popular social networking site in that country. Instead, it recommended Verizon Wireless — a company that doesn’t operate in Russia.
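Fixing this would have meant forwarding the user’s locale with each upstream request, along these lines (the function and parameter names are invented for illustration; they are not BOSS’s actual API parameters):

```python
def build_search_params(query: str, region: str = "US", lang: str = "en") -> dict:
    """Attach the user's region and language instead of the server's defaults."""
    return {"q": query, "region": region.upper(), "lang": lang.lower()}

# A Russian-locale request would carry the user's region, not the server's:
# build_search_params("v", region="RU", lang="ru")
```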
Power user bias
Test Pilot’s audience has not been well-studied, but one might reason that its users are more technical and more engaged with Mozilla than a general-population audience: participation in Test Pilot is voluntary, and users are recruited through channels that may be more likely to reach technical users, including e-mail newsletters, the Test Pilot Discourse forum, about:home snippets, and promotion in public meetings.
As Table 1 shows, the most common queries support that notion, with common searches including programming languages and Linux distributions. This reflects a clear bias in our audience that was not accounted for in collection or analysis.
Navigation and discovery
Perhaps the most important finding differentiates between two possible uses of search in the address bar: navigation, where the user knows their ultimate destination before they begin typing, and discovery, where the user isn’t sure where to find what they’re looking for. Though Universal Search was explicitly attempting to improve discovery, we were surprised that the dominant use case appeared to be navigational.
We stumbled upon this accidentally while adding a seemingly obvious UX feature. We didn’t want the Universal Search recommendation to duplicate information, so it was omitted if its URL matched one already present in the results (e.g. an existing tab, bookmark, or history item). We expected this feature to improve engagement, but it actually reduced the clickthrough rate (CTR; see Table 2), indicating that users wanted to revisit past pages and that the recommendation was much less useful when those were excluded.
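The de-duplication rule itself is simple to sketch; the normalization below is a minimal stand-in for whatever URL matching the extension actually performed.

```python
from typing import Optional

def _normalize(url: str) -> str:
    # Minimal normalization: ignore trailing slashes and letter case.
    return url.rstrip("/").lower()

def dedupe_recommendation(recommendation: str, existing_urls: list) -> Optional[str]:
    """Suppress the recommendation if it duplicates a tab, bookmark, or history item."""
    seen = {_normalize(u) for u in existing_urls}
    return None if _normalize(recommendation) in seen else recommendation
```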
Further support can be found by examining the clickthrough rates of the types of recommendations Universal Search offered. There were three distinct types: TLDs, where an entire website is recommended; Wikipedia articles, where a specific Wikipedia article is recommended; and movies, where the IMDb page for a movie or television show is presented.
Since TLD results are more general in nature, it stands to reason that they would be a more common endpoint for navigational searches, while Wikipedia and movie results may be more likely to result from a discovery search. This was supported by the data (see Table 3): clickthrough rates were significantly higher for TLD results than for either movies or Wikipedia articles. This implies that the address bar is commonly used as an interface to frecent results, and that the most effective recommendations were the ones that supported that use.
Another indicator that users lean on the address bar for navigation is the degree to which users preferred the earliest-position items in the result set (see Chart 1). That users would favor early results is expected, but over half of all address bar interactions resulted in the top item being chosen.
One way the navigation-versus-discovery hypothesis was tested was by introducing the movie result type. We detected that a recommendation was for a movie or television series and augmented it with contextually-specific information: the release year, run time, genres, IMDb rating, and an image of the promotional poster. This additional information would be useful to a user attempting to discover information, but would provide little benefit to one simply navigating to a known destination.
Movie cards were shown (or not) as a randomly-sampled A/B test. As Table 4 shows, overall CTR was lower in the population that was shown movie cards. Thanks to a large sample size, this small effect was statistically significant, indicating that the contextual information may harm the perceived utility of the recommendation, as measured by clickthrough rate.
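The significance check behind an A/B comparison like this can be sketched as a two-proportion z-test on clickthrough counts; the counts in the test below are illustrative, not the experiment’s actual data.

```python
import math

def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """z-statistic for the difference between two clickthrough rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pool the populations to estimate the standard error under the null hypothesis.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

With enough sessions per arm, even a half-point CTR gap clears the conventional |z| > 1.96 threshold, which is how a small effect can still be decisive.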
This finding warrants further study. Though it seems clear that Universal Search’s users were using the address bar for navigation more frequently than for discovery, this could be explained as an intersection of two factors:
- The previously-discussed power user bias could correlate with faster and more predetermined operations, making it less likely that they see and parse the recommendation.
- Habit. Firefox’s address bar has largely remained unchanged since Firefox 3, nearly a decade ago, and users may just be trained to use it in a specific way: for navigation.
A longitudinal Shield study that investigates changes in user behavior as they become habituated to the new functionality could eliminate both of these biases and provide clearer insight into these findings.
What comes next?
Building on these findings, Mozilla has launched a new initiative: Context Graph, a grander vision of a privacy-respecting recommendation engine for the web. Early efforts include two projects: Miracle, a project to gather data to train an experimental recommendation engine, and Heatmap, which annotates a user’s history with the ways they interact with a page. Combined, these could form the basis of a better recommendation engine than the one offered by Universal Search.
Early efforts on Universal Search looked at ways to uncover and store data about websites you visit. This spirit has been continued with two projects: Fathom, a framework for extracting meaning from the DOM, and page-metadata-parser, a Fathom implementation in use by the Activity Stream Test Pilot experiment.
Though the experiment will be ending, we’d still love to hear about your experiences. A survey has been set up for structured feedback, but we’d also love to hear your freeform thoughts on the Test Pilot Discourse, where you can discuss the experiment with the project team.
This project wouldn’t have been possible without the experience and expertise of a massive group of people. A surely-incomplete list:
- Jared Hirsch, whose engineering prowess and adeptness with the depths of Firefox’s codebase made this experiment possible.
- Nick Chapman and Javaun Moradi, Universal Search’s visionaries and product managers.
- Bryan Bell, for his excellent design work and deep understanding of the problem space.
- Wil Clouser, John Gruen, Cory Price, Les Orchard, Sharon Bautista, Winston Bowden, Elvin Lee, Brian Smith, and the rest of the Test Pilot team for providing a platform for this sort of experimentation.
- Benson Wong and Daniel Thorn, who kept the recommendation server fast and stable.
- Ilana Segall, Robert Rayborn, Matt Grimes, Mark Reid, Rebecca Weiss, and all of the Strategy and Insights and Data Pipeline teams, who offered vast expertise and insight as we collected and analyzed an immense amount of data.
- Peter deHaan, Krupa Raj, and the entire SoftVision team for their thorough, relentless quality assurance.
- Panos Astithas, Marco Bonardo, Drew Willcoxon, Florian Quèze, and the rest of the Firefox Search team for their insight on a particularly tricky part of Firefox.
- Andy McKay, Kris Maglione, Andrew Swan, Mark Striemer, and the rest of the Firefox Add-ons team.
- Toby Elliott and his team of magicians: Erik Rose, Hanno Schlichting, K Lars Lohn, and Ryan Tilder.
- Each user and community member who participated in the experiment, contributed code, reported bugs, and provided feedback. We do it for you, but couldn’t do it without you.