Improving Language Detection

Published in Foursquare, Aug 26, 2015

At Foursquare, we attempt to personalize as much of the product as we can. In order to understand the more than 70 million tips and 1.3 billion shouts our users have left at venues, each of those pieces of text must be run through our natural language processing pipeline. The very foundation of this pipeline is our ability to identify the language of a given piece of text.

Traditional language detectors are typically implemented using a character trigram model or a dictionary-based n-gram model. The accuracy of these approaches depends heavily on the length of the text being classified: the longer the text, the more reliable the result. For short pieces of text like tips in Foursquare or shouts in Swarm (see examples below), however, the efficacy of these solutions begins to break down. For example, if a user writes only a single word like “taco!” or an equally ambiguous statement like “mmmm strudel,” a generic character- or word-based solution cannot make a strong language classification. Unfortunately, given the nature of the Foursquare products, these sorts of short strings are very common, and we needed a better way to accurately classify the languages in which they are written.

To this end, we decided to rethink the generic language identification algorithms and build our own identification system, making use of some of the more distinctive aspects of Foursquare data: the location where the text was created and the ability to aggregate all texts by their writer. While there are many multilingual users on our platform, the average Foursquare user only ever writes tips or shouts in a single language. Given that fact, it seemed inefficient to apply a generic language classification model to all of the text that a single user creates. If we have 49 data points that strongly point to a user writing in English, and that user’s 50th data point is an ambiguous text that a generic language model thinks could be German or English (with 40% and 38% confidence respectively), chances are that the string should be tagged as English and not German, even if the text contains German loanwords. Our solution to this problem was to build a custom language model for every user who leaves tips or shouts, and then to allow those user language models to influence the result of the generic language detection algorithm.

The first step in this process is to run generic language detection on every tip and shout in the database. Each tip and shout is associated with a venue that has an explicit lat/long. We then reverse geocode that lat/long to the country in which the venue is located, which tells us the country that the user was in when they wrote the text. Next, we couple the generic language detection results with this country data to create a language model for every country. While this per-country language distribution model may not match the real-life language distribution of a given country, it does model the language behavior of the users who share text via Foursquare and Swarm in those countries.

Example of top 5 languages and weights calculated in the country language models:

US - United States of America
    Top Tip Langs               Top Shout Langs
    en - 0.80096                en - 0.5092
    es - 0.00850                de - 0.0139
    it - 0.00804                es - 0.0102
    de - 0.00559                it - 0.0096
    fr - 0.00459                nl - 0.0088

RU - Russian Federation
    Top Tip Langs               Top Shout Langs
    ru - 0.77396                ru - 0.40054
    bg - 0.02990                uk - 0.04615
    uk - 0.02049                bg - 0.04446
    sr - 0.01458                sr - 0.03221
    en - 0.01450                be - 0.02420

TH - Thailand
    Top Tip Langs               Top Shout Langs
    th - 0.67228                th - 0.60632
    en - 0.17507                en - 0.10340
    ru - 0.01969                zh - 0.00488
    it - 0.00327                ja - 0.00478
    de - 0.00298                de - 0.00467
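
As a rough illustration of how per-country models like these might be accumulated, here is a minimal Python sketch. It assumes each text has already been reverse geocoded to a country code and scored by a generic detector. The function name, the record layout, and the normalization to weights that sum to 1 are all assumptions for illustration, not a description of the actual pipeline.

```python
from collections import defaultdict


def build_country_models(records):
    """Aggregate generic detector scores into one distribution per country.

    `records` is an iterable of (country_code, {lang: score}) pairs.
    This layout is hypothetical and chosen only for the sketch.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for country, lang_scores in records:
        for lang, score in lang_scores.items():
            totals[country][lang] += score

    models = {}
    for country, langs in totals.items():
        norm = sum(langs.values())
        # Normalize so that each country's weights sum to 1 (an assumption).
        models[country] = (
            {lang: s / norm for lang, s in langs.items()} if norm > 0 else {}
        )
    return models


# Toy input: two texts written in the US and one written in Thailand.
records = [
    ("US", {"en": 0.9, "es": 0.1}),
    ("US", {"en": 0.6, "de": 0.4}),
    ("TH", {"th": 1.0}),
]
country_models = build_country_models(records)
```

Summing the detector's soft scores rather than counting only its top guess keeps some signal from ambiguous texts. Whether the production system does this is not stated above.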

With country models in hand, we then do a separate grouping of strings by user and calculate a language distribution on a per-user basis. However, one problem with this approach is that not every user has enough data to create a reliable user model. A new multilingual user, for example, will cause classification problems early on because there is too little data to produce a reliable model. To solve this problem we use the language model of the user’s dominant country as a baseline. When a user has little to no data for their user language model, we merge the country model into the low-information user model. As more data becomes available for a given user, we gradually weight the user model higher than the dominant country model, until the user model becomes the dominant of the two.
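
One simple way to express this kind of gradual hand-off is a count-based interpolation between the two distributions. The sketch below is only an assumption about how such a blend could look. The `pivot` parameter (the number of texts at which both models get equal weight) and the formula `n / (n + pivot)` are hypothetical choices, not values from our system.

```python
def blend_models(user_model, country_model, n_user_texts, pivot=20.0):
    """Interpolate between a user model and a country model.

    The user model's weight is n / (n + pivot): 0 with no data,
    0.5 at `pivot` texts, and approaching 1 as data accumulates.
    `pivot` is a hypothetical tuning knob.
    """
    w_user = n_user_texts / (n_user_texts + pivot)
    langs = set(user_model) | set(country_model)
    return {
        lang: w_user * user_model.get(lang, 0.0)
        + (1.0 - w_user) * country_model.get(lang, 0.0)
        for lang in langs
    }


country = {"en": 0.8, "es": 0.2}
user = {"de": 1.0}

new_user = blend_models(user, country, n_user_texts=0)     # country model only
mid_user = blend_models(user, country, n_user_texts=20)    # equal weighting
heavy_user = blend_models(user, country, n_user_texts=980)  # mostly user model
```

With no user data the blend falls back entirely to the country model, and a prolific user's own history dominates.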

Finally, we create per-country orthographic feature models using the strings that are grouped by country. For this model, we have a set of 13 orthographic features. When a string triggers one of them, the string’s generic language identification results are added to the results of the other strings that triggered that feature in the same country. This allows us to have a feature such as “containsHanScript” whose language distribution in China is completely different from the one calculated for Japan, even though both Chinese and Japanese contain characters from the Han script. Other examples are Arabic vs. Farsi with the “containsArabicScript” feature, Russian vs. Ukrainian vs. Bulgarian with the “containsCyrillicScript” feature, and all Romance languages with the “containsLatinScript” feature.
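
To make the idea concrete, here is a minimal Python sketch covering only the four script features named above. Detecting a script by the prefix of each character's Unicode name is a simplification chosen for this example, and the function names and record layout are hypothetical. The remaining features in the set of 13 are not shown.

```python
import unicodedata
from collections import defaultdict

# Feature name -> Unicode character-name prefix (a simplifying assumption).
SCRIPT_FEATURES = {
    "containsHanScript": "CJK UNIFIED IDEOGRAPH",
    "containsArabicScript": "ARABIC",
    "containsCyrillicScript": "CYRILLIC",
    "containsLatinScript": "LATIN",
}


def triggered_features(text):
    """Return the set of script features that `text` triggers."""
    names = [unicodedata.name(ch, "") for ch in text]
    return {
        feature
        for feature, prefix in SCRIPT_FEATURES.items()
        if any(name.startswith(prefix) for name in names)
    }


def build_feature_models(records):
    """Aggregate detector scores per (country, feature) pair.

    `records` is an iterable of (country_code, text, {lang: score}).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for country, text, lang_scores in records:
        for feature in triggered_features(text):
            for lang, score in lang_scores.items():
                totals[(country, feature)][lang] += score

    models = {}
    for key, langs in totals.items():
        norm = sum(langs.values())
        if norm > 0:
            models[key] = {lang: s / norm for lang, s in langs.items()}
    return models


# Toy detector scores; the numbers are made up for illustration.
records = [
    ("CN", "你好", {"zh": 0.7, "ja": 0.3}),
    ("JP", "東京", {"zh": 0.4, "ja": 0.6}),
    ("JP", "寿司", {"ja": 1.0}),
]
feature_models = build_feature_models(records)
```

In this toy data the same feature yields a Chinese-leaning distribution for `("CN", "containsHanScript")` and a Japanese-leaning one for `("JP", "containsHanScript")`, which is the effect described above.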

With the user models and the orthographic feature models in place, we then rerun language identification on all of our tips and shouts. For each string, we use the appropriate user’s language model and apply any orthographic feature models that the string triggers. We then merge those two results with the generic language detector’s results for the string, which leaves us with a higher quality language classification. On preliminary analysis, we were able to correctly tag an additional ~3M tips and ~250M shouts using this method.
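
The exact merge rule is not spelled out here, so the following Python sketch shows just one plausible option: multiply the generic detector's scores by the user and feature distributions, with a small floor so that an unseen language is never zeroed out, then renormalize. The function name, the `floor` value, and the multiplicative rule itself are assumptions. The input numbers reuse the earlier English vs. German example, with `nl` added as a made-up third candidate.

```python
def merge_scores(generic, user_model, feature_model=None, floor=0.01):
    """Combine detector scores with user and feature distributions.

    Each language's score is the product of its generic score and its
    weight in each model, floored at `floor` (a hypothetical smoothing
    constant), then renormalized to sum to 1.
    """
    langs = set(generic) | set(user_model)
    if feature_model is not None:
        langs |= set(feature_model)

    merged = {}
    for lang in langs:
        score = max(generic.get(lang, 0.0), floor)
        score *= max(user_model.get(lang, 0.0), floor)
        if feature_model is not None:
            score *= max(feature_model.get(lang, 0.0), floor)
        merged[lang] = score

    total = sum(merged.values())
    return {lang: s / total for lang, s in merged.items()} if total > 0 else {}


# The generic detector slightly prefers German for an ambiguous string...
generic = {"de": 0.40, "en": 0.38, "nl": 0.22}
# ...but 49 of the user's 50 texts point to English.
user_model = {"en": 0.98, "de": 0.02}

merged = merge_scores(generic, user_model)
best = max(merged, key=merged.get)
```

Under these assumptions the user's history outweighs the detector's narrow preference for German, so `best` comes out as English.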

Examples of corrected language identification:

"Place has good tacos tortas and licuados yum"
   Spanish -> English
   US user writing a tip in Chicago

"Хороше фірмове пиво!!!"
   Serbian -> Ukrainian
   Ukrainian user writing a tip in Ivano-Frankivsk

"Zastavte se na točenou Kofolu!"
   Slovene -> Czech
   Czech user writing a tip in Prague

If these kinds of language problems interest you, why not check out our current openings!

-- Maryam Aly (@maryamaaly), Kris Concepcion (@kjc9), Max Sklar (@maxsklar)

Written by 4SQ Eng