Finding the Tennis Suspects

Deanonymizing BuzzFeed’s Tennis Exposé

Russell Kaplan
Jan 21, 2016

By Russell Kaplan, Jason Teplitz, and Christina Wadsworth

The tennis world was sent reeling when BuzzFeed News and the BBC jointly published The Tennis Racket, which revealed “evidence of widespread match-fixing by players at the upper level of world tennis”. But BuzzFeed refused to publish the names of those players.

We dove into the data and found the names ourselves.

The Names

Important Disclosure: We are not commenting in any way about the allegations of match-fixing or any other accusations made explicitly or implicitly by BuzzFeed News’ article. Please bear in mind that BuzzFeed News deliberately chose not to name these players because of an insufficient degree of certainty that the suspicious patterns they identified were a result of match-fixing. Our only claim is that after conducting an analysis of BuzzFeed’s anonymized dataset, we believe the names below represent the deanonymized 15 players BuzzFeed found in their published analysis of public betting data, players who according to BuzzFeed “regularly lost matches in which heavily lopsided betting appeared to substantially shift the odds — a red flag for possible match-fixing” (BuzzFeed News, The Tennis Racket).

BuzzFeed anonymized the names of all players in their analysis, which is published on GitHub. Their approach revealed 15 players with multiple suspicious matches where the odds shifted significantly between the start and end of the match. According to the article, this degree of unusual betting activity is often associated with match-fixing.

We found the names of these players by scraping the same public odds data that BuzzFeed used in their analysis. Then we looked for unique, exact matches between a bookmaker’s odds for a match in BuzzFeed’s dataset and a bookmaker’s odds in ours. (See Methodology below for more details.) Using these unique matches, we were able to unambiguously link the scrambled names (called “hashes”) in BuzzFeed’s results to real names.

Among the more interesting names on the list:

  • Lleyton Hewitt, the former world No. 1 and two-time Grand Slam singles champion.
  • Igor Andreev, the former doubles partner of Nikolay Davydenko, who was involved in a match-fixing scandal referenced in BuzzFeed’s article. According to BuzzFeed’s analysis, there is a mere 0.0096% chance that Andreev would have lost as many suspicious matches as he did (15 in total), if the chances implied by the opening odds were correct.
  • Ivan Dodig, a Grand Slam doubles champion.
  • Janko Tipsarević, a former Top-10 singles player.

The full list of 15 players, in order of BuzzFeed’s statistical assessment of how unusual their suspicious match results were:

Igor Andreev (0.0096%)
Lukáš Lacko (0.0178%)
Ivan Dodig (0.0195%)
Andrey Golubev (0.041%)
Juan Ignacio Chela (0.2259%)
Lleyton Hewitt (0.2737%)
Jan Hájek (0.5258%)
Albert Montañés (0.5684%)
Daniel Gimeno-Traver (0.5984%)
Janko Tipsarević (1.637%)
Alex Bogomolov Jr. (2.0464%)
Matthew Ebden (2.781%)
Denis Istomin (3.3895%)
Teymuraz Gabashvili (4.2248%)
Michael Russell (4.7099%)

The percentages from BuzzFeed next to each name represent the chance that the player would have lost as many suspicious matches as he did, if the opening odds of those matches were correct. The lower the percentage, the more unusual their suspicious match results (according to opening odds). For a complete interpretation of what the percentages mean, see the table towards the end of BuzzFeed’s original analysis.
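
To make the meaning of these percentages concrete, here is a minimal sketch of one way such a figure could be computed. It is an illustration only: it assumes decimal odds, converts them to implied win probabilities by normalizing away the bookmaker's margin, treats matches as independent, and computes an exact tail probability. BuzzFeed's published analysis may well use a different procedure, and the function names and example odds below are hypothetical.

```python
def implied_win_probability(player_odds, opponent_odds):
    """Win probability implied by decimal opening odds (margin removed by normalizing)."""
    raw_player = 1.0 / player_odds
    raw_opponent = 1.0 / opponent_odds
    return raw_player / (raw_player + raw_opponent)


def prob_at_least_k_losses(loss_probs, k):
    """Exact chance of losing at least k of the matches, assuming independence."""
    dist = [1.0]  # dist[j] = probability of exactly j losses so far
    for q in loss_probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, p in enumerate(dist):
            nxt[j] += p * (1.0 - q)   # this match is won
            nxt[j + 1] += p * q       # this match is lost
        dist = nxt
    return sum(dist[k:])


# Made-up opening odds (player, opponent) for three matches.
opening_odds = [(1.50, 2.60), (1.80, 2.00), (2.10, 1.70)]
loss_probs = [1.0 - implied_win_probability(p, o) for p, o in opening_odds]
print(f"{prob_at_least_k_losses(loss_probs, 3):.4%}")
```

A small result here would mean that losing all three matches was unlikely if the opening odds were accurate, which is the sense in which a low percentage is "unusual".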

Methodology

Layman’s Overview

Think of BuzzFeed’s anonymized dataset like a phonebook, where the names are replaced with unique jumbles of characters (e.g. “ffe23c8b”). We collected a few chapters of the same phonebook from public data online, with real names instead of jumbles. Then all we had to do was ask: which phone numbers do we have that BuzzFeed also has? Phone numbers are unique, so once we found a match, we could unambiguously link BuzzFeed’s name jumble to the real name of a player. We found several of these unambiguous links for each of the 15 players above.

In this analogy, the phone numbers are odds sets: 4 distinct numbers from a bookmaker that quantify each player’s odds of winning at the start of the match and by the end (plus some extra information — see Details below). The “chapters from the same phonebook” we collected are odds from bookmakers that we scraped from OddsPortal.com, the same source that BuzzFeed used.

Details

BuzzFeed anonymized the bookmakers they used in addition to the player names. However, BuzzFeed revealed that all of their data is from “seven large, independent bookmakers whose odds are available on OddsPortal.com.” Our first step was to identify some of the seven, to make sure our scraped dataset didn’t include odds from bookmakers that BuzzFeed never used. We identified four of the seven bookmakers with very high confidence (> 95%). We then discarded all our scraped data that weren’t from one of these four bookmakers.
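
The text above does not spell out how the bookmakers were identified, so the following is only a sketch of one plausible approach, not the method actually used: for each scraped bookmaker, measure what fraction of its odds sets also appear under each anonymized bookmaker label, and look for a label with a very high overlap. All names and data here are hypothetical.

```python
def overlap_fraction(scraped_odds, anonymous_odds):
    """Fraction of scraped odds sets that also occur in the anonymized odds sets."""
    if not scraped_odds:
        return 0.0
    anonymous = set(anonymous_odds)
    hits = sum(1 for odds in scraped_odds if odds in anonymous)
    return hits / len(scraped_odds)


def best_candidates(scraped_by_bookmaker, anonymous_by_label):
    """For each real bookmaker, return the anonymized label with the highest overlap."""
    result = {}
    for name, odds in scraped_by_bookmaker.items():
        scores = {
            label: overlap_fraction(odds, anon_odds)
            for label, anon_odds in anonymous_by_label.items()
        }
        best = max(scores, key=scores.get)
        result[name] = (best, scores[best])
    return result


# Toy data: each odds set is (p1_open, p1_close, p2_open, p2_close).
scraped = {"ExampleBook": [(1.5, 1.9, 2.6, 1.95), (1.2, 1.25, 4.5, 4.0)]}
anonymous = {
    "label_1": [(1.5, 1.9, 2.6, 1.95), (1.2, 1.25, 4.5, 4.0)],
    "label_2": [(3.0, 3.2, 1.4, 1.35)],
}
print(best_candidates(scraped, anonymous))
```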

Next we sought to find the players. Unlike numbers in a phonebook, odds sets aren’t necessarily unique. Indeed, BuzzFeed’s ~130,000-item dataset had several entries with identical odds sets. So, to reduce the number of duplicates, we considered a few more pieces of information in addition to the odds sets: the name of the bookmaker, the year of the match, and whether the match ended normally or abnormally (e.g. a player retiring due to injury).
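
A minimal sketch of this extended key, separate from the published source code: the record layout, field names, and placeholder player labels are assumptions made for illustration, and the bookmaker label is assumed to have already been mapped to the same value in both datasets.

```python
from typing import NamedTuple


class OddsEntry(NamedTuple):
    bookmaker: str
    year: int
    ended_normally: bool
    odds: tuple  # (p1_open, p1_close, p2_open, p2_close), decimal odds
    player1: str  # a hash in the anonymized data, a real name in scraped data
    player2: str


def match_key(entry):
    """Everything used to identify a match, excluding the player fields."""
    rounded = tuple(round(o, 2) for o in entry.odds)  # avoid float formatting noise
    return (entry.bookmaker, entry.year, entry.ended_normally, rounded)


anonymized = OddsEntry("bookmaker_A", 2012, True, (1.50, 1.90, 2.60, 1.95), "ffe23c8b", "0a1b2c3d")
scraped = OddsEntry("bookmaker_A", 2012, True, (1.5, 1.9, 2.6, 1.95), "Player One", "Player Two")
print(match_key(anonymized) == match_key(scraped))  # True
```

Because the key leaves out the player fields, an anonymized entry and a scraped entry for the same match produce the same key even though one holds hashes and the other holds names.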

We then took all the data entries that still weren’t unique among the rest of the entries from the same dataset and discarded them. (We only want unambiguous results, and we had enough data that we could afford to throw some out.) We did this for BuzzFeed’s data and ours.
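
The discarding step can be sketched as follows, under the assumption that each entry is a tuple whose first four fields form the identifying key. The `keep_unique` helper and the toy data are hypothetical.

```python
from collections import Counter


def keep_unique(entries, key):
    """Return {key: entry} for entries whose key occurs exactly once."""
    counts = Counter(key(e) for e in entries)
    return {key(e): e for e in entries if counts[key(e)] == 1}


# Toy entries: (bookmaker, year, ended_normally, odds_set, player1, player2)
entries = [
    ("bookmaker_A", 2012, True, (1.50, 1.90, 2.60, 1.95), "hash_1", "hash_2"),
    ("bookmaker_A", 2012, True, (1.50, 1.90, 2.60, 1.95), "hash_3", "hash_4"),
    ("bookmaker_A", 2013, True, (1.20, 1.25, 4.50, 4.00), "hash_1", "hash_5"),
]
unique = keep_unique(entries, key=lambda e: e[:4])
print(len(unique))  # 1
```

The first two entries share a key, so both are dropped rather than one being kept arbitrarily; only the 2013 entry survives.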

That left us with something very useful: thousands of uniquely identifiable data entries with scrambled names (from BuzzFeed) and real names (from our scraping). We then simply cross-referenced the two datasets to unscramble the names of the players.
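
As an illustration of the cross-referencing step (a sketch, not the published code), suppose each dataset has been reduced to a dictionary from a unique match key to a pair of players, and that both datasets list the two players in the same order. Both of those are assumptions, as are the names and keys below.

```python
from collections import Counter, defaultdict


def link_hashes(anonymized, scraped):
    """Count, for each hash, how often each real name appears in the same slot."""
    votes = defaultdict(Counter)
    for key in anonymized.keys() & scraped.keys():  # keys present in both datasets
        for hashed, name in zip(anonymized[key], scraped[key]):
            votes[hashed][name] += 1
    return votes


anonymized = {
    ("bookmaker_A", 2012, (1.5, 1.9, 2.6, 1.95)): ("hash_1", "hash_2"),
    ("bookmaker_A", 2013, (1.2, 1.25, 4.5, 4.0)): ("hash_1", "hash_3"),
    ("bookmaker_B", 2013, (2.0, 2.1, 1.8, 1.75)): ("hash_4", "hash_2"),
}
scraped = {
    ("bookmaker_A", 2012, (1.5, 1.9, 2.6, 1.95)): ("Player One", "Player Two"),
    ("bookmaker_A", 2013, (1.2, 1.25, 4.5, 4.0)): ("Player One", "Player Three"),
}
votes = link_hashes(anonymized, scraped)
print(votes["hash_1"].most_common(1))  # [('Player One', 2)]
```

Keeping a count per hash, rather than a single name, makes it easy to see how many independent matches support each link and to notice any hash that attracts more than one name.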

For an even more detailed walkthrough, see our published source code with comments on GitHub.

Ensuring Correctness

We took great pains to verify that all of the names above are the true deanonymizations of the scrambled names in BuzzFeed’s data analysis. Here are some of the steps we took to help ensure correctness:

  • We were very conservative about the data we used — we threw out all ambiguous data entries from BuzzFeed and from our scraping. Furthermore, we only used our scraped data from bookmakers that we were highly confident BuzzFeed also used in its analysis.
  • After unscrambling the names, we manually looked up random matches on OddsPortal.com for the players we found, found the corresponding match entries in BuzzFeed’s dataset, and looked at the player hashes in those entries. We made sure our prior unscramblings of those hashes were consistent with the names listed for the match on OddsPortal.com. We did this multiple times for each player and each time our unscrambling was consistent.
  • We consulted a Stanford statistics professor about our methodology. After analyzing it, the professor is highly confident that our approach is sound.
  • We had an independent third party do a code review of our data processing and analysis to double-check for mistakes.
  • We went through the anecdotal clues about these players’ identities that BuzzFeed revealed in its report and checked that none of them contradicted the names we found.
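
The manual spot check described in the list above amounts to a simple consistency test, sketched here with hypothetical names and data: each manually looked-up (hash, name) pair is compared against the mapping produced earlier, and any disagreement is reported.

```python
def find_conflicts(mapping, spot_checks):
    """Return spot checks whose observed name disagrees with the existing mapping."""
    conflicts = []
    for hashed, observed_name in spot_checks:
        expected = mapping.get(hashed)
        if expected is not None and expected != observed_name:
            conflicts.append((hashed, expected, observed_name))
    return conflicts


# Hypothetical mapping from the cross-referencing step, plus manual lookups.
mapping = {"hash_1": "Player One", "hash_2": "Player Two"}
spot_checks = [("hash_1", "Player One"), ("hash_2", "Player Two"), ("hash_1", "Player One")]
print(find_conflicts(mapping, spot_checks))  # []
```

An empty list means every manual lookup agreed with the earlier unscrambling; any entry in the list would identify a hash that needs another look.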

Who We Are

We’re three Stanford undergrads studying computer science: Russell Kaplan (Twitter), Jason Teplitz (Twitter), and Christina Wadsworth (Twitter).

If you’d like to get in touch with us, please email:
findingthetennissuspects@gmail.com

Acknowledgements

  • John Templeton, the investigative data reporter for BuzzFeed News who did the original analysis.
  • Brandon Garcia, a graduate student in computer science at Stanford with professional data science experience. Thanks for double checking our code!

