How apps like Yelp, Foursquare or TripAdvisor validate their user-generated data

Nikhil Dandekar
Published in Startup Grind · 3 min read · Sep 15, 2015

By Nikhil Dandekar, Engineering Manager, Quora

User-generated data inevitably contains bad content that needs to be validated and cleaned. Bad content can result from accidental mistakes or from users acting maliciously to serve their own interests. Companies like Foursquare or TripAdvisor that rely on user-generated data use a combination of algorithms and human curation to keep their data correct.
Simple errors, or errors that follow common patterns, can be caught by algorithms. Here are a few examples of data validation and cleaning that algorithms can perform:

  • A restaurant closes down. A Foursquare algorithm can check the number of Swarm check-ins the place gets every day, and if that suddenly drops to zero or a very low value, Foursquare can infer that the place is closed.
  • A user repeatedly promotes their website in every TripAdvisor review they write. This can be caught by a simple rule-based algorithm that checks, for every user, the percentage of posts that contain the same website. You can also train more sophisticated machine learning models to solve these kinds of problems; spam detection, for example, is by now a very well-researched topic across various domains.
  • A user falsely checks in to many places in a day to obtain a lot of Swarm mayorships. An algorithm can detect cheating of this sort, e.g. if a user checks in to too many places in a day, or checks in to places that are far away from their actual physical location.
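
The three checks above can be sketched in a few lines of Python. This is only an illustration of the kind of rule described, not how Foursquare or TripAdvisor actually implement it: the function names, the input shapes, and every threshold (the 14-day window, the 10% drop ratio, 20 check-ins per day, 1 km) are assumptions picked for the example.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt
import re

# Captures the host part of any http(s) link in a piece of text.
URL_RE = re.compile(r"https?://([\w.-]+)", re.IGNORECASE)


def looks_closed(daily_checkins, recent_days=14, drop_ratio=0.1):
    """Flag a venue whose recent check-ins fall far below its earlier baseline.

    daily_checkins is a list of per-day counts, oldest first (assumed shape).
    """
    if len(daily_checkins) <= recent_days:
        return False  # not enough history to compare against
    baseline = daily_checkins[:-recent_days]
    recent = daily_checkins[-recent_days:]
    baseline_avg = sum(baseline) / len(baseline)
    recent_avg = sum(recent) / len(recent)
    return baseline_avg > 0 and recent_avg <= drop_ratio * baseline_avg


def top_domain_share(reviews):
    """Return (domain, share of reviews mentioning it) for the most repeated domain."""
    if not reviews:
        return None, 0.0
    counts = Counter()
    for text in reviews:
        # Count each domain at most once per review.
        counts.update({d.lower() for d in URL_RE.findall(text)})
    if not counts:
        return None, 0.0
    domain, n = counts.most_common(1)[0]
    return domain, n / len(reviews)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def suspicious_checkins(checkins, max_per_day=20, max_km=1.0):
    """Check one user's check-ins for a single day.

    checkins is a list of (venue_location, device_location) pairs (assumed shape).
    Returns (too_many, indexes of check-ins made too far from the venue).
    """
    too_many = len(checkins) > max_per_day
    too_far = [
        i for i, (venue, device) in enumerate(checkins)
        if haversine_km(venue, device) > max_km
    ]
    return too_many, too_far
```

A reviewer could, for instance, flag any user whose `top_domain_share` exceeds some cutoff such as 0.5 for a closer look. Rules like these are cheap to run, but the thresholds have to be tuned on real data.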

Algorithms are great at capturing the simple stuff, but they are not perfect and can introduce errors of their own. Also, a sufficiently smart malicious user can outwit simple algorithms. This is where human curation comes in.

Human curation comes in two types: organic and paid.

Organic human curation is generally done by the more passionate users of the website or app. For example, Foursquare has a vibrant Superuser community that regularly fixes any mistakes they see in the data. Quora has a very active community that cares a lot about maintaining question and answer quality and ensuring correctness. For any user-generated content website, fostering this sense of community among its users is thus very important. Equally important is giving users the right tools to do the curation. For example, Quora's user interface lets users report bad questions, bad answers, and bad users, and merge duplicate questions.

Finally, if the above avenues still don’t work, companies resort to paying people to clean the data and fill in important missing data. Companies hire contractors or use crowdsourcing websites like Amazon Mechanical Turk or CrowdFlower to do this. Compared to organic human curation, paid curation can lead to better data cleaning, but it’s also expensive and hard to scale. Websites generally use the paid option only for fixing their most popular and sought-after content. Young websites that haven’t yet built a community of passionate users also have to resort to the paid option.

(Answered originally at Quora: What do apps like Foursquare or TripAdvisor do to validate their user-generated data?)

Nikhil Dandekar

Engineering Manager doing Machine Learning @ Google. Previously worked on ML and search at Quora, Foursquare and Bing.