Note on the Datasets

Published in

Seed Data

3 min readMar 13, 2017

What we’re using, and why

The results I post on this publication are all derived from Jeff Sackmann’s datasets. There are quite a lot of these repos, ranging from the incredibly comprehensive every-shot-of-every-rally logs in tennis_MatchChartingProject to the simple collection of interesting ideas in tennis_Research_Notes. While the MCP is something I want to get around to working with eventually, at the moment the only dataset I’ve really used is tennis_atp.

The tennis_atp dataset is a fairly simple set to work with, and consequently the most popular one by far. It’s mostly just a collection of match results and scores for all matches played in the open era, with only most basic stats for each match. Most matches under tour level, i.e. qualifiers, challengers, and even Davis Cup matches, have no real stats. For reference, I’m giving here a quick breakdown of what the dataset looks like, specifically my fork of it, and what the setup script does to make things easier.

There are three tables in the database, players, rankings, and matches.

Players and Rankings

Quick examples.

Players from India who are left-handers.

SELECT lastname FROM players WHERE HAND='L' AND COUNTRY='IND';

IDs of all the world #1s.

SELECT DISTINCT player_id FROM rankings WHERE pos=1;

Rather unintuitively, the player_id field isn’t actually unique (although it’s pretty close), so it can’t be the primary key in the Player table. This is an issue with the dataset, and until it is resolved, we’re forced to handle player_id clashes manually if and when they arise.

There is a time-based field in each of the tables, namely birth, date and tourney_date, corresponding to the players’ birthdate, the ranking date, and match date respectively. Because the original data has these values as simple strings, and not something that can be easily queried, the setup script adds a birth_date, a ranking_date, and a match_date field to the tables, in PSQL’s standard date format. This lets me do stuff like

SELECT * FROM players WHERE EXTRACT(YEAR FROM birth_date)=1980;

As opposed to

SELECT * FROM players WHERE SUBSTRING(birth from 0 for 5)='1980';

Matches

Our most useful table is also by far the densest. Apart from the basic result and score fields, we also have some basic stats for each match.

Seeds
Rank
Tournament name
Tournament level
Surface
Draw size
Round
Score
Duration (in minutes)
Aces
Double faults
Service points
1st serves in
1st serves won
2nd serves won
Service games
Break points faced
Break points saved

Quick example. Score of the grand slam match ever played:

SELECT minutes, score FROM MATCHES WHERE TOURNEY_LEVEL='G' AND minutes IS NOT NULL ORDER BY minutes desc limit 1;

And here’s an example that uses all of them together. What was the longest (known) match won by the Indian player who has been in the top 100?

WITH indian_best_ranks AS (SELECT
        p.player_id 
    FROM
        players p 
    INNER JOIN
        rankings r 
            ON p.player_id=r.player_id 
    WHERE
        country='IND' 
    GROUP BY
        p.player_id 
    HAVING
        MIN(pos)<100) 
SELECT
    winner_name,
    score,
    match_date,
    minutes 
FROM
    matches m 
INNER JOIN
    indian_best_ranks i 
        ON m.winner_id=i.player_id 
WHERE
    minutes IS NOT NULL 
ORDER BY
    minutes DESC LIMIT 1;

Result:

winner_name    |       score       | match_date | minutes------------------+-------------------+------------+---------Somdev Devvarman | 7-6(6) 2-6 7-6(8) | 2011-04-25 |     194(1 row)

Note on the Datasets

Players and Rankings

Matches

Written by Harman Singh