Finding the Tennis Suspects

Deanonymizing BuzzFeed’s Tennis Exposé

Russell Kaplan
Jan 21, 2016 · 5 min read

By Russell Kaplan, Jason Teplitz, and Christina Wadsworth

The tennis world was sent reeling when BuzzFeed News and the BBC jointly published The Tennis Racket, which revealed “evidence of widespread match-fixing by players at the upper level of world tennis”. But BuzzFeed refused to publish the names of those players.

We dove into the data and found the names ourselves.

The Names

Important Disclosure: We are not commenting in any way about the allegations of match-fixing or any other accusations made explicitly or implicitly by BuzzFeed News’ article. Please bear in mind that BuzzFeed News deliberately chose not to name these players because of an insufficient degree of certainty that the suspicious patterns they identified were a result of match-fixing. Our only claim is that after conducting an analysis of BuzzFeed’s anonymized dataset, we believe the names below represent the deanonymized 15 players BuzzFeed found in their published analysis of public betting data, players who according to BuzzFeed “regularly lost matches in which heavily lopsided betting appeared to substantially shift the odds — a red flag for possible match-fixing” (BuzzFeed News, The Tennis Racket).

BuzzFeed anonymized the names of all players in their analysis, which is published on GitHub. Their approach revealed 15 players with multiple suspicious matches where the odds shifted significantly between the start and end of the match. According to the article, this degree of unusual betting activity is often associated with match-fixing.

We found the names of these players by scraping the same public odds data that BuzzFeed used in their analysis. Then we looked for unique, exact matches between a bookmaker’s odds for a match in BuzzFeed’s dataset and a bookmaker’s odds in ours. (See Methodology below for more details.) Using these unique matches, we were able to unambiguously link the scrambled names (called “hashes”) in BuzzFeed’s results to real names.

Among the more interesting names on the list:

  • Lleyton Hewitt, the former world No. 1 and two-time Grand Slam singles champion.

The full list of 15 players, in order of BuzzFeed’s statistical assessment of how unusual their suspicious match results were:

Igor Andreev (0.0096%)
Lukáš Lacko (0.0178%)
Ivan Dodig (0.0195%)
Andrey Golubev (0.041%)
Juan Ignacio Chela (0.2259%)
Lleyton Hewitt (0.2737%)
Jan Hájek (0.5258%)
Albert Montañés (0.5684%)
Daniel Gimeno-Traver (0.5984%)
Janko Tipsarević (1.637%)
Alex Bogomolov Jr. (2.0464%)
Matthew Ebden (2.781%)
Denis Istomin (3.3895%)
Teymuraz Gabashvili (4.2248%)
Michael Russell (4.7099%)

The percentages from BuzzFeed next to each name represent the chance that the player would have lost as many suspicious matches as he did, if the opening odds of those matches were correct. The lower the percentage, the more unusual their suspicious match results (according to opening odds). For a complete interpretation of what the percentages mean, see the table towards the end of BuzzFeed’s original analysis.
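
To make that kind of percentage concrete, here is a minimal sketch of one way to compute "the chance of losing at least this many matches" from per-match win probabilities. It is an illustration only, not BuzzFeed's actual computation: it assumes the matches are independent, the function name is hypothetical, and the win probabilities in the example are made up rather than taken from any real odds.

```python
from typing import Sequence


def prob_at_least_losses(win_probs: Sequence[float], min_losses: int) -> float:
    """Chance of losing at least `min_losses` matches, assuming each match
    is independent and is won with the corresponding probability."""
    # dist[k] = probability of exactly k losses among the matches seen so far
    dist = [1.0]
    for p_win in win_probs:
        p_loss = 1.0 - p_win
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * p_win       # won this match: loss count unchanged
            nxt[k + 1] += prob * p_loss  # lost this match: one more loss
        dist = nxt
    return sum(dist[min_losses:])


# Made-up opening-odds win probabilities for five hypothetical matches.
example_probs = [0.70, 0.65, 0.80, 0.60, 0.75]
chance = prob_at_least_losses(example_probs, 4)
print(f"Chance of losing at least 4 of 5: {chance:.4%}")
```

A small result means that, if the opening odds were right, losing that often would be rare.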

Methodology

Layman’s Overview

Think of BuzzFeed’s anonymized dataset like a phonebook, where the names are replaced with unique jumbles of characters (e.g. “ffe23c8b”). We collected a few chapters of the same phonebook from public data online, with real names instead of jumbles. Then all we had to do was ask: which phone numbers do we have that BuzzFeed also has? Phone numbers are unique, so once we found a match, we could unambiguously link BuzzFeed’s name jumble to the real name of a player. We found several of these unambiguous links for each of the 15 players above.

In this analogy, the phone numbers are odds sets: 4 distinct numbers from a bookmaker that quantify each player’s odds of winning at the start of the match and by the end (plus some extra information — see Details below). The “chapters from the same phonebook” we collected are odds from bookmakers that we scraped from OddsPortal.com, the same source that BuzzFeed used.
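
Since each player was linked several times, one natural sanity check is that every link for a given jumble points to the same real name. The sketch below illustrates that idea. The function name, the `min_support` threshold, and the sample data are all assumptions made for illustration, not details from our published code.

```python
from collections import Counter, defaultdict


def resolve_links(links, min_support=2):
    """Map each scrambled name to a real name, but only when all of its
    links agree and there are at least `min_support` of them."""
    votes = defaultdict(Counter)
    for hashed, name in links:
        votes[hashed][name] += 1

    resolved = {}
    for hashed, counter in votes.items():
        if len(counter) != 1:
            continue  # links disagree: leave this jumble unresolved
        name, count = next(iter(counter.items()))
        if count >= min_support:
            resolved[hashed] = name
    return resolved


# Hypothetical (jumble, name) links found by matching "phone numbers".
links = [
    ("ffe23c8b", "Player A"),
    ("ffe23c8b", "Player A"),
    ("a91d04ce", "Player B"),  # only one link: below min_support
    ("0c77e1f2", "Player C"),
    ("0c77e1f2", "Player D"),  # conflicting links: rejected
]
resolved = resolve_links(links)
print(resolved)  # {'ffe23c8b': 'Player A'}
```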

Details

BuzzFeed anonymized the bookmakers they used in addition to the player names. However, BuzzFeed revealed that all of their data is from “seven large, independent bookmakers whose odds are available on OddsPortal.com.” Our first step was to identify some of the seven, to make sure our scraped dataset didn’t include odds from bookmakers that BuzzFeed never used. We identified four of the seven bookmakers with very high confidence (> 95%). We then discarded all our scraped data that weren’t from one of these four bookmakers.

Next we sought to find the players. Unlike numbers in a phonebook, odds sets aren’t necessarily unique. Indeed, BuzzFeed’s ~130,000-item dataset had several entries with identical odds sets. So, to reduce the number of duplicates, we considered a few more pieces of information alongside the odds sets: the name of the bookmaker, the year of the match, and whether the match ended normally or abnormally (e.g. a player retiring due to injury).
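
As a rough sketch, the combined "fingerprint" for an entry could be modeled like this. The field names (`p1_open`, `ended_normally`, and so on) are hypothetical and do not come from BuzzFeed's dataset or our scraper. Odds are kept as strings here so that "exact match" means exactly the same published figure, with no floating-point rounding involved.

```python
from typing import NamedTuple, Tuple


class MatchKey(NamedTuple):
    """Everything we compare when deciding whether two entries match."""
    bookmaker: str
    year: int
    ended_normally: bool
    odds: Tuple[str, str, str, str]  # opening and closing odds for both players


def make_key(row: dict) -> MatchKey:
    # `row` is a hypothetical record with illustrative field names.
    return MatchKey(
        bookmaker=row["bookmaker"],
        year=int(row["year"]),
        ended_normally=row["ended_normally"],
        odds=(row["p1_open"], row["p2_open"], row["p1_close"], row["p2_close"]),
    )


row_2010 = {
    "bookmaker": "Bookmaker 1", "year": 2010, "ended_normally": True,
    "p1_open": "1.50", "p2_open": "2.60", "p1_close": "1.90", "p2_close": "1.95",
}
row_2012 = dict(row_2010, year=2012)  # same odds set, different year

# Identical odds sets no longer collide once the year is part of the key.
print(make_key(row_2010) == make_key(row_2012))  # False
```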

We then took all the data entries that still weren’t unique among the rest of the entries from the same dataset and discarded them. (We only want unambiguous results, and we had enough data that we could afford to throw some out.) We did this for BuzzFeed’s data and ours.
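
The discard step can be sketched in a few lines. This is a simplified illustration: `keep_unique` is a hypothetical helper, and the keys and labels below are placeholders rather than real data.

```python
from collections import Counter


def keep_unique(entries):
    """Given (key, label) pairs, return {key: label} for keys that occur
    exactly once. Every copy of a duplicated key is dropped."""
    counts = Counter(key for key, _ in entries)
    return {key: label for key, label in entries if counts[key] == 1}


# Placeholder entries: "k1" appears twice, so both copies are discarded.
entries = [("k1", "ffe23c8b"), ("k2", "a91d04ce"), ("k1", "0c77e1f2")]
print(keep_unique(entries))  # {'k2': 'a91d04ce'}
```

Dropping every copy of a duplicated key, rather than keeping the first one, is what keeps the remaining entries unambiguous.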

We’re left with something very useful: thousands of uniquely identifiable data entries with scrambled names (from BuzzFeed) and real names (from our scraping). We then simply cross-referenced the two datasets to unscramble the names of the players.
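
Once both datasets are reduced to unique keys, the cross-reference is essentially a lookup on the keys they share. A minimal sketch, with a hypothetical function name and placeholder keys and names:

```python
def cross_reference(anon_by_key, named_by_key):
    """Pair each scrambled name with the real name that shares its key."""
    shared_keys = anon_by_key.keys() & named_by_key.keys()
    return sorted((anon_by_key[key], named_by_key[key]) for key in shared_keys)


# Placeholder data: each key stands for one uniquely identifiable entry.
anon_by_key = {"k1": "ffe23c8b", "k2": "a91d04ce", "k3": "0c77e1f2"}
named_by_key = {"k1": "Player A", "k2": "Player B", "k9": "Player Z"}

pairs = cross_reference(anon_by_key, named_by_key)
print(pairs)  # [('a91d04ce', 'Player B'), ('ffe23c8b', 'Player A')]
```

Entries that appear in only one dataset (like "k3" and "k9" here) are simply ignored.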

For an even more detailed walkthrough, see our published source code with comments on GitHub.

Ensuring Correctness

We took great pains to verify that all of the names above are the true deanonymizations of the scrambled names in BuzzFeed’s data analysis. Here are some of the steps we took to help ensure correctness:

  • We were very conservative about the data we used — we threw out all ambiguous data entries from BuzzFeed and from our scraping. Furthermore, we only used our scraped data from bookmakers that we were highly confident BuzzFeed also used in its analysis.

Who We Are

We’re three Stanford undergrads studying computer science: Russell Kaplan (Twitter), Jason Teplitz (Twitter), and Christina Wadsworth (Twitter).

If you’d like to get in touch with us, please email:
findingthetennissuspects@gmail.com

Acknowledgements

  • John Templon, the investigative data reporter for BuzzFeed News who did the original analysis.
