Why it took a long time to build that tiny link preview on Wikipedia
Yesterday, my team completed the launch of “page previews” — a feature now deployed to hundreds of language editions of Wikipedia. We are seeing up to half a million hits every minute to our API to serve those cards that show when you hover over any link.
On the surface it looks quite simple. It’s something many websites have already. It has an image and some text and shows when you hover over a link. Hardly groundbreaking stuff … or so it may seem.
The original idea was conceived four years ago, based on an idea from a volunteer/editor many years before that.
It’s thus taken a few years for us to get this out to everyone. That might seem strange, but like an iceberg, once you start looking below it, it all makes sense.
We had to choose a thumbnail
We have several millions of pages, all stored as raw wikitext. We couldn’t expect every single article to be edited to designate a thumbnail.
As with all algorithms, it wasn’t perfect and since it wasn’t designed for the use case of page previews, it also required tweaking.
We had to make updates to limit the image to the first section of the article. Working with algorithms is hard, but for this purpose essential.
We had to generate a summary
We have several millions of pages, all stored as raw wikitext. How to summarise that without asking our editors to go in and painstakingly do that for every article?
The busy Max Semenik who helped us with the thumbnails, also wrote an extension to generate extracts of articles. It was originally written primarily for plain text summaries. Our initial versions of page previews used this, but we realised it wasn’t the best fit for what we were trying to do.
So we stopped using this.
We realised HTML was very important. For instance chemistry articles featured chemical formulae which needs subscript which requires HTML.
Many of our articles, begin with location information and pronunciation information. A lot of this content didn’t belong in our summaries and for other content it was less clear what belonged. The summaries had a lot of design input and we identified which content shouldn’t appear inside a preview. We wrote up a specification capturing the required behaviour.
In the end we decided to build on top of an API that was originally built for our native Android and iOS apps. We created a new API specifically for this purpose.
We now generate the summary from the entire article HTML. We parse it just like a browser and per the specification identify the first “non-empty” lead paragraph of every article.
One of the big challenges here, was the decision to strip content inside parentheses. Since we support over 300 languages this had to be localised (not everyone uses the same character set!).
Not only that, but of course some parentheses are vital… edge cases are everywhere. We had to think about all their potential usage and how best to do them.
Stripping parenthetical elements from user-generated HTML also turned out to be quite difficult. Whereas doing so in plain text requires a simple regular expression, things get more complicated when you consider nested HTML.
It was important to ensure the content continued to make sense with the content inside parentheses stripped and that we that we didn’t add any security vulnerabilities.
Thanks go to our Infrastructure team for helping us build out this API.
We work with our community
Our editing community cares a lot about our product. That’s why they write articles for you in their spare time for no monetary gain.
We include them in every part of the process, working with them tirelessly to fix every edge case (whether it be broken summaries, or inappropriate images) and to reassure them we know what we are doing, the impact of why we are doing and why we are continuing to do it.
Our initial version wasn’t good enough. Our community asked us not to go ahead with it. We answered by listening to them and making it better.
Thank you community and our community liaisons who helped facilitate those conversations!
Design design design
We did a lot of it. Our designer, Nirzar did a great write up so I won’t bother sharing any more here, but design was at every step of the process whether it was the initial prototypes (thanks Prateek Saxena!); discussing performance of the feature with our performance team; perfecting thumbnails and summaries; or talking to our community.
Thank you design team!
We had to instrument it
This was a big change to how people interact with our content. We care a lot about privacy in Wikimedia. We are likely to be one of the few (only?!) major websites who don’t install third party scripts to track you.
We don’t use vendors to A/B test or analyse our user’s behaviour.
Yet, despite all this, we don’t cut corners.
We don’t want to make dumb risky changes.
Every time we build something major we have to build out the infrastructure to evaluate it. We construct hypothesizes and tests to test those hypothesizes. We build it. We test it. We listen to the data. We adapt. We test again.
This means we juggle being a product team and an analytics team. Our development team wear multiple hats. Given the scale we work at have to deal with bugs. Occasionally, we find major bugs in browser vendors.
Our final A/B test by Tilman Bayer gave us many answers. It’s a great read!
Given the findings of that A/B test we also decided to start measuring “page preview views” as an additional metric to page views. That metric is firing 1000 events per second and our analytics team is kicking ass dealing with that sort of scale.
Thank you analysts, thank you analytics team!
We had to scale our API to support you
We are seeing 0.5 million hits to our API a minute.
We are seeing 0.5 million hits to our API a minute.
I say that twice because that’s a high volume of traffic.
Our traditional APIs were originally built for bots to help clean up your edits. They weren’t designed for readers.
The Wikimedia services team has been vital to the success of this project by providing the infrastructure to deal with, dealing with a lot of the caching (we rely heavily on Varnish) and making sure when content gets edited new summaries get generated. It’s well known that cache invalidation is one of the harder Computer Science problems.
Thank you services team!!
Thank you thank you thank you
Shipping something always feels good. I hope the “simple” preview my team with the help of many teams across the Wikimedia Foundation has shipped enriches your experience.
Lot’s of us have obviously been involved and we’re proud of what we’ve put out.
We’re not done. Software is never done.
We have code to clean up and new ideas for how to grow this small little feature.
Some might say we’re only seeing the tip of the iceberg.
The Wikimedia Foundation is a non-profit and relies on donations from people like you to keep the site up and running in such a way that it lives up to our values. To help us continue to build features like this, please consider donating to us.
Jon Robson, Senior Software Engineer, Desktop & Mobile Web