Note on the Datasets

Harman Singh
Seed Data
Published in
3 min readMar 13, 2017

What we’re using, and why

The results I post on this publication are all derived from Jeff Sackmann’s datasets. There are quite a lot of these repos, ranging from the incredibly comprehensive every-shot-of-every-rally logs in tennis_MatchChartingProject to the simple collection of interesting ideas in tennis_Research_Notes. While the MCP is something I want to get around to working with eventually, at the moment the only dataset I’ve really used is tennis_atp.

The tennis_atp dataset is a fairly simple set to work with, and consequently the most popular one by far. It’s mostly just a collection of match results and scores for all matches played in the open era, with only most basic stats for each match. Most matches under tour level, i.e. qualifiers, challengers, and even Davis Cup matches, have no real stats. For reference, I’m giving here a quick breakdown of what the dataset looks like, specifically my fork of it, and what the setup script does to make things easier.

There are three tables in the database, players, rankings, and matches.

Players and Rankings

Schema for players and rankings

Quick examples.

Players from India who are left-handers.

SELECT lastname FROM players WHERE HAND='L' AND COUNTRY='IND';

IDs of all the world #1s.

SELECT DISTINCT player_id FROM rankings WHERE pos=1;

Rather unintuitively, the player_id field isn’t actually unique (although it’s pretty close), so it can’t be the primary key in the Player table. This is an issue with the dataset, and until it is resolved, we’re forced to handle player_id clashes manually if and when they arise.

There is a time-based field in each of the tables, namely birth, date and tourney_date, corresponding to the players’ birthdate, the ranking date, and match date respectively. Because the original data has these values as simple strings, and not something that can be easily queried, the setup script adds a birth_date, a ranking_date, and a match_date field to the tables, in PSQL’s standard date format. This lets me do stuff like

SELECT * FROM players WHERE EXTRACT(YEAR FROM birth_date)=1980;

As opposed to

SELECT * FROM players WHERE SUBSTRING(birth from 0 for 5)='1980';

Matches

Our most useful table is also by far the densest. Apart from the basic result and score fields, we also have some basic stats for each match.

Schema for matches
  • Seeds
  • Rank
  • Tournament name
  • Tournament level
  • Surface
  • Draw size
  • Round
  • Score
  • Duration (in minutes)
  • Aces
  • Double faults
  • Service points
  • 1st serves in
  • 1st serves won
  • 2nd serves won
  • Service games
  • Break points faced
  • Break points saved

Quick example. Score of the grand slam match ever played:

SELECT minutes, score FROM MATCHES WHERE TOURNEY_LEVEL='G' AND minutes IS NOT NULL ORDER BY minutes desc limit 1;

And here’s an example that uses all of them together. What was the longest (known) match won by the Indian player who has been in the top 100?

WITH indian_best_ranks AS (SELECT
p.player_id
FROM
players p
INNER JOIN
rankings r
ON p.player_id=r.player_id
WHERE
country='IND'
GROUP BY
p.player_id
HAVING
MIN(pos)<100)
SELECT
winner_name,
score,
match_date,
minutes
FROM
matches m
INNER JOIN
indian_best_ranks i
ON m.winner_id=i.player_id
WHERE
minutes IS NOT NULL
ORDER BY
minutes DESC LIMIT 1;

Result:

winner_name    |       score       | match_date | minutes------------------+-------------------+------------+---------Somdev Devvarman | 7-6(6) 2-6 7-6(8) | 2011-04-25 |     194(1 row)

--

--