Building a country and language detection pipeline
Jason Jia | Pinterest engineer, Content
A Pin saved to Pinterest is much more than just an image. Each of the 75 billion Pins saved is a visual bookmark with dozens of signals, such as the website it originated from, as well as its country and language of origin. As more than half of all Pinners are outside the U.S., we rely heavily on language and country detection to help Pinners discover and engage with local ideas. Here we’ll share how we detect and use language and country signals across Pinterest — in search, home feed, Related Pins and topics — to improve the experience for Pinners and drive engagement internationally.
Detecting a link’s language and country
We define a URL’s country as the primary audience a website is meant to reach. Similarly, we define a URL’s language as the primary language of that page. We detect a link’s language and country through multiple methods, and we weight each method’s result by its precision.
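One way to picture weighting methods by precision is a precision-weighted vote. The following is a minimal sketch, not the pipeline's actual logic: the method names, the precision values and the `combine_predictions` function are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical per-method precision estimates (illustrative values only,
# not measured figures).
METHOD_PRECISION = {
    "whitelist": 0.95,
    "url_locale": 0.90,
    "html_content": 0.85,
    "engagement": 0.70,
    "ip_geolocation": 0.50,
}


def combine_predictions(predictions):
    """Return the label with the highest precision-weighted vote.

    predictions maps a method name to the label it predicted, or to None
    when the method could not classify the link.
    """
    scores = defaultdict(float)
    for method, label in predictions.items():
        if label is not None:
            # Each method votes with a weight equal to its assumed precision.
            scores[label] += METHOD_PRECISION.get(method, 0.0)
    if not scores:
        return None
    return max(scores, key=scores.get)
```

With these assumed weights, a URL-based prediction of "de" outvotes an IP-based prediction of "us", while two weaker methods that agree can still outvote a single stronger one.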
Whitelisting for language detection
Our local teams on the ground in Brazil, Japan, France, Germany and the UK use their expertise to give insight into which domains or links are popular in their respective countries. We work with them to compile whitelists of the most popular domains for each country to help us move faster on detecting language. These whitelists are generally accurate but require a large amount of effort to manually curate and verify, making this approach difficult to scale.
URLs carrying locale signals
The country code top-level domain (ccTLD), subdirectory and subdomain of a link can also offer useful information about the country and language of a landing page. Coverage is limited, since many links don’t contain this information, but when we can parse a ccTLD, subdirectory or subdomain from the URL, this method generally works well.
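The URL signals described above can be sketched with Python's standard `urllib.parse` module. The lookup tables and the `locale_hints_from_url` function below are hypothetical and deliberately tiny; a real pipeline would need far more complete mappings and more careful handling of ambiguous codes.

```python
from urllib.parse import urlparse

# Small illustrative lookup tables (assumptions, not an exhaustive list).
CCTLD_TO_COUNTRY = {"fr": "FR", "de": "DE", "jp": "JP", "br": "BR", "uk": "GB"}
LANGUAGE_CODES = {"en", "fr", "de", "ja", "pt", "es"}


def locale_hints_from_url(url):
    """Extract country and language hints from a URL, or None for each."""
    parsed = urlparse(url)
    labels = (parsed.hostname or "").lower().split(".")
    hints = {"country": None, "language": None}

    # ccTLD: the last label of the hostname, e.g. "fr" in www.example.fr.
    if labels[-1] in CCTLD_TO_COUNTRY:
        hints["country"] = CCTLD_TO_COUNTRY[labels[-1]]

    # Subdomain: a leading language label, e.g. "de" in de.example.com.
    if len(labels) > 2 and labels[0] in LANGUAGE_CODES:
        hints["language"] = labels[0]

    # Subdirectory: a first path segment such as /de/ or /pt-br/.
    segments = [s for s in parsed.path.lower().split("/") if s]
    if hints["language"] is None and segments:
        first = segments[0]
        primary = first.split("-")[0]
        if primary in LANGUAGE_CODES and len(first) in (2, 5):
            hints["language"] = primary

    return hints
```

For example, `locale_hints_from_url("https://example.com/pt-br/ideas")` yields a language hint of "pt" and no country hint, which reflects the limited coverage noted above: a link often carries only one of the two signals, or neither.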
Pinner engagement signals
While we try to optimize for objective signals when determining the language and country of a website, subjective signals sometimes yield useful information too. For instance, if a link is mostly interacted with by Pinners from a specific country or language, we can infer the likely country and language of the Pin.
Two subjective signals we take into consideration are when a user saves a Pin from a website or clicks through a Pin to reach its landing page. Since these are subjective signals, they don’t have stellar precision rates, but nonetheless help us classify many URLs for which we otherwise had little information.
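A simple way to turn saves and click-throughs into a label is to require both a minimum volume of events and a dominant share from one country. This is a minimal sketch under that assumption; the `infer_from_engagement` name and both thresholds are hypothetical, not values from the pipeline.

```python
from collections import Counter


def infer_from_engagement(event_countries, min_events=50, min_share=0.8):
    """Infer a link's country from the Pinners who engaged with it.

    event_countries holds one country code per save or click-through.
    Returns the dominant country, or None when there are too few events
    or no single country accounts for at least min_share of them.
    """
    counts = Counter(event_countries)
    total = sum(counts.values())
    if total < min_events:
        return None
    country, top_count = counts.most_common(1)[0]
    return country if top_count / total >= min_share else None
```

Raising `min_share` trades coverage for precision, which matters here because engagement is a noisy signal.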
The actual HTML of a page contains a wealth of information as long as it can be parsed properly. Our in-house HTML parsers take raw HTML from landing pages and extract a wide range of metadata, such as HTML language tags, descriptions, titles and headers. We then pass this text into our in-house text processing library to try to detect the language of the page. Almost all web pages contain some HTML, so this method gives us a decent level of precision and coverage.
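The in-house parsers are not public, so the following is only a rough stand-in built on Python's standard `html.parser` module. It pulls out the `lang` attribute, the title and the meta description; the class and function names are assumptions, and the language detection step on the extracted text is left out.

```python
from html.parser import HTMLParser


class LanguageMetadataParser(HTMLParser):
    """Collect the <html lang> attribute, <title> text and meta description."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.html_lang = attrs["lang"].lower()
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is accumulated here.
        if self._in_title:
            self.title += data


def extract_language_metadata(html):
    """Return the language-related metadata found in an HTML string."""
    parser = LanguageMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "html_lang": parser.html_lang,
        "title": parser.title.strip(),
        "description": parser.description,
    }
```

The declared `lang` value can be used directly as a hint, while the title and description would be passed to a text-based language detector.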
Even when a page itself provides us with very little usable information, we may still be able to determine its language or country. We can do cluster inference, a method similar to the k-nearest neighbors algorithm. If we can classify the language or country of other links that share the same domain or path as an unclassified link, and if the labels for these classified links mostly agree, then we can assume the link we know nothing about is in that language or from that country.
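Cluster inference as described above can be sketched as a majority vote over already-classified links on the same domain. This simplified version groups by hostname only, not by path, and the `infer_from_cluster` name and both thresholds are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def infer_from_cluster(url, labeled_links, min_neighbors=5, min_agreement=0.9):
    """Infer a label for url from classified links on the same domain.

    labeled_links maps an already-classified URL to its label. Returns the
    majority label, or None when there are too few neighbors or they do
    not agree strongly enough.
    """
    domain = urlparse(url).hostname
    neighbor_labels = [
        label
        for other, label in labeled_links.items()
        if other != url and urlparse(other).hostname == domain
    ]
    if len(neighbor_labels) < min_neighbors:
        return None
    label, votes = Counter(neighbor_labels).most_common(1)[0]
    return label if votes / len(neighbor_labels) >= min_agreement else None
```

Requiring strong agreement keeps the method from guessing on multilingual domains, where neighboring links legitimately carry different labels.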
The IP address of a web page also contains useful information about the potential country of a link. We use a third-party IP geolocation lookup service, MaxMind, to determine the region or country of a link’s IP address. Every website has an IP address, but websites can be hosted almost anywhere. This approach therefore has 100 percent coverage but low accuracy, so it acts as a catch-all method for when we can’t classify a link using anything else.
The most straightforward measure of success in terms of language and country detection is our precision and recall rates. Precision is simple to calculate. All we have to do is take a sample of links from each country we want to know more about and find out the portion that we got right. We generally do this by human evaluation through Amazon Mechanical Turk or CrowdFlower.
Recall is harder to measure because international links make up a much smaller proportion of all links than links from U.S. websites. Our solution was to first estimate the number of links we correctly found for a language or country by taking the total number of links labeled with it and multiplying that by the precision rate for that language or country. Then, we estimate the percentage of links we missed by taking a sample of all other links and running human evaluation. Multiplying this percentage by the number of links not labeled gives us an estimate of how many links we missed. Recall is then the number of links we found divided by the sum of the links we found and the links we missed.
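The precision and recall estimates described above reduce to a few lines of arithmetic. This is a minimal sketch; the function and parameter names are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def estimate_precision(judgments):
    """Share of sampled labeled links that human evaluators marked correct.

    judgments is a list of booleans, one per sampled link.
    """
    if not judgments:
        return None
    return sum(judgments) / len(judgments)


def estimate_recall(num_labeled, precision, num_unlabeled, missed_rate):
    """Estimate recall for one language or country.

    num_labeled:   links the pipeline labeled with this language or country
    precision:     estimated precision for that label
    num_unlabeled: all other links
    missed_rate:   share of sampled other links that evaluators say
                   actually belong to this language or country
    """
    found = num_labeled * precision        # estimated correct labels
    missed = num_unlabeled * missed_rate   # estimated links we missed
    if found + missed == 0:
        return None
    return found / (found + missed)
```

For instance, with 1,000 labeled links at 90 percent precision and 100,000 other links of which 0.1 percent were missed, the estimate is roughly 900 found and 100 missed, for a recall of about 0.9.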
When trying to build a robust language and country detection pipeline, we found the best approach was to use many different technologies, ranging from machine learning to text processing, and a wide range of objective and subjective signals. Together, these elements work much better than any one of them alone. Through these different approaches, we’ve achieved high levels of precision and recall, which have been key in driving local engagement and international growth.
There’s still a lot of work to be done to improve our country and language detection pipelines, including expanding language and country detection to all countries and languages. Other areas of focus could include adding local entity detection to our HTML parser to identify a web page’s country of origin based on public information — such as an address from a “Contact Me” page, or the currency and landmarks that may appear on a page. We could also potentially add support for sites with multiple languages and those that detect a user’s language or country and redirect them to an appropriate version of the page. If any of these problems sound interesting, join us!
Acknowledgements: Thanks to Kuai Xu, Anna Markowska, and John Milinovich for their guidance and invaluable contributions to the landing page country and language detection flow.