Analyzing 50 Years of Tennis

Nicolas Escobar Jariton
Published in Data Guasu
8 min read · Oct 17, 2018

In this post I use the Python libraries pandas, matplotlib and seaborn to analyze data from ATP tennis competitions from 1968 to 2018, including Grand Slams, Masters Series, Masters Cup and International Series competitions.

The dataset prepared by Jeff Sackmann contains information on every single match played since 1968. It includes details such as match’s date, location, tournament type, surface, winner and loser, score, duration and additional statistics such as players’ rankings, age and height, aces, double faults, break points faced and saved, service points among other helpful stats for both players.

The Jupyter Notebook with the full analysis can be accessed here.

Loading the data

There is one .csv file per year per tournament type. In this analysis I’m only loading atp_matches_YYYY files which contain the ATP matches for that year.

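import glob
import pandas as pd

# Path to the folder holding the atp_matches_YYYY.csv files
path = 'tennis_atp_data'
files = sorted(glob.glob(path + "/*.csv"))
# Read one DataFrame per file and stack them into a single DataFrame
tennis_df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)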
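
The glob pattern above picks up every .csv file in the folder. If the folder also holds other files (for example doubles or qualifying results), a stricter filter may help. The sketch below is one way to do it, assuming the file names follow the atp_matches_YYYY.csv pattern mentioned above. The helper name is_main_tour_file and the sample file names are made up for illustration.

```python
import re
from pathlib import Path

# Matches exactly atp_matches_YYYY.csv (four-digit year) and nothing else
MAIN_TOUR_PATTERN = re.compile(r"^atp_matches_\d{4}\.csv$")

def is_main_tour_file(filename):
    """Return True if the file name looks like atp_matches_YYYY.csv."""
    return MAIN_TOUR_PATTERN.match(Path(filename).name) is not None

# Made-up file names, only to show which ones pass the filter
candidates = [
    "tennis_atp_data/atp_matches_1968.csv",
    "tennis_atp_data/atp_matches_doubles_2018.csv",
    "tennis_atp_data/atp_matches_qual_chall_2018.csv",
    "tennis_atp_data/atp_matches_2018.csv",
]
selected = [f for f in candidates if is_main_tour_file(f)]
print(selected)
```

With this helper, the file list could be built as [f for f in glob.glob(path + "/*.csv") if is_main_tour_file(f)].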

Deriving new columns

First let’s derive two new columns to store year and year-month attributes.

# Extract year and month in YYYYMM format
tennis_df['tourney_yearmonth'] = tennis_df.tourney_date.astype(str).str[:6]
# Extract year in YYYY format
tennis_df['tourney_year'] = tennis_df.tourney_date.astype(str).str[:4]
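
String slicing works here because tourney_date holds the date as a YYYYMMDD value, as the slicing above implies. An alternative, sketched below on three made-up dates, is to parse the column into real datetimes first, which would also allow date arithmetic later on.

```python
import pandas as pd

# Made-up sample; assumes dates are stored as YYYYMMDD integers
sample = pd.DataFrame({"tourney_date": [19680708, 20030623, 20180115]})

# Parse the integers into proper datetimes
parsed = pd.to_datetime(sample["tourney_date"].astype(str), format="%Y%m%d")

# Same two derived columns as above, built from the parsed dates
sample["tourney_yearmonth"] = parsed.dt.strftime("%Y%m")
sample["tourney_year"] = parsed.dt.strftime("%Y")
print(sample)
```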

Exploratory Data Analysis (EDA)

Distribution of most important attributes

We start the EDA by looking at the distribution of some of the key attributes. Histograms are very helpful for showing how numerical data is distributed.

import matplotlib.pyplot as plt

dimensions = ['winner_rank', 'loser_rank', 'winner_age', 'loser_age',
              'winner_ht', 'loser_ht', 'w_svpt', 'l_svpt']
plt.figure(1, figsize=(20, 12))
# One histogram per attribute, laid out on a 2x4 grid
for i, dim in enumerate(dimensions, start=1):
    plt.subplot(2, 4, i)
    tennis_df[dim].plot(kind='hist', title=dim)

In the histograms above we see that attributes like winner_rank and loser_rank are skewed to the right (median lower than the mean). On the other hand, we see attributes like winner_ht and loser_ht which are closer to a normal distribution (bell shaped).
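
The "median lower than the mean" check can also be done numerically. The sketch below uses a handful of made-up rankings rather than the real data, just to show the idea.

```python
import pandas as pd

# Made-up rankings: most values are small, a few are very large
ranks = pd.Series([1, 2, 3, 5, 8, 12, 20, 35, 150, 400])

mean_rank = ranks.mean()
median_rank = ranks.median()
skewness = ranks.skew()  # a positive value indicates a longer right tail

print(f"mean={mean_rank:.1f}, median={median_rank:.1f}, skew={skewness:.2f}")
```

On the real data the same three calls could be applied to tennis_df['winner_rank'] and tennis_df['loser_rank'].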

Evolution of winners’ rankings in Grand Slam finals

In the following graph we are using a scatter plot to represent rankings of players that won Grand Slam finals in each year. The size of the bubble indicates the ranking of the player that lost the match (the smaller the bubble, the better the ranking).

We see slight differences between Grand Slam winners’ rankings. In the US Open, rankings are more dispersed, meaning that more players with lower rankings were able to win the tournament. Also, judging by the size of the bubbles, the losers of those finals had lower rankings too.
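
The code for this plot is not shown in the post. As a rough idea of the data behind it, the sketch below filters a few made-up rows down to Grand Slam finals. The column tourney_name is an assumption about the dataset; the other columns appear elsewhere in this post.

```python
import pandas as pd

# Made-up rows shaped like the dataset (tourney_name is an assumed column)
matches = pd.DataFrame({
    "tourney_year": ["2017", "2017", "2017", "2018"],
    "tourney_name": ["Wimbledon", "Wimbledon", "US Open", "Brisbane"],
    "tourney_level": ["G", "G", "G", "A"],
    "round": ["F", "SF", "F", "F"],
    "winner_rank": [5, 5, 1, 30],
    "loser_rank": [6, 15, 32, 40],
})

# Keep only Grand Slam finals
is_gs_final = (matches["tourney_level"] == "G") & (matches["round"] == "F")
gs_finals = matches.loc[
    is_gs_final, ["tourney_year", "tourney_name", "winner_rank", "loser_rank"]
].reset_index(drop=True)
print(gs_finals)

# To draw the bubbles (not run here), something like:
# sns.scatterplot(data=gs_finals, x="tourney_year", y="winner_rank",
#                 size="loser_rank", hue="tourney_name")
```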

Distribution of aces by surface type

Box plots are also useful for understanding distributions, because they show what is called the five-number summary: minimum, first quartile, median, third quartile and maximum.

import seaborn as sns

# Keep Grand Slam ('G') and Masters ('M') matches that have an ace count
tennis_df_h = tennis_df[tennis_df['w_ace'].notna() & tennis_df['tourney_level'].isin(['G', 'M'])].copy()
g = sns.boxplot(x="surface", y="w_ace", data=tennis_df_h)
g.set(xlabel='Surface', ylabel='Aces')

In this plot I compare the distribution of aces on each surface type. We can see, for example, that the median and maximum number of aces are much higher on grass than on clay courts. This makes sense for us tennis followers, as grass is a faster surface than clay.
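
The five-number summary can also be computed directly, which makes it easier to compare surfaces with exact numbers. This is a minimal sketch on made-up ace counts, not the real data.

```python
import pandas as pd

# Made-up ace counts per match
aces = pd.DataFrame({
    "surface": ["Grass"] * 5 + ["Clay"] * 5,
    "w_ace": [4, 8, 10, 14, 30, 1, 2, 4, 6, 12],
})

# Minimum, quartiles, median and maximum for each surface
five_num = (
    aces.groupby("surface")["w_ace"]
    .quantile([0, 0.25, 0.5, 0.75, 1])
    .unstack()
)
five_num.columns = ["min", "q1", "median", "q3", "max"]
print(five_num)
```

Note that seaborn draws the whiskers at 1.5 times the interquartile range by default and shows points beyond them as outliers, so the whisker ends in the plot are not always the true minimum and maximum.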

Evolution of specific countries based on their players’ wins

If we consider countries of current top players (Nadal, Federer, Del Potro, Djokovic, Isner), how did players of these countries perform over the years?

Argentina, Spain and Switzerland show a huge jump in the number of wins in Grand Slams in the early 2000s that coincides with the appearance of legendary players like Coria, Nalbandian, Gaudio, Del Potro (Argentina), Robredo, Nadal (Spain) and Federer, Wawrinka (Switzerland). The United States, on the other hand, shows a big drop in wins after the 80s. Legends like Sampras, Agassi, Chang and others made it very difficult for the new generation to match their records. The case of Serbia is difficult to analyze for political reasons: the country became independent in the early 2000s (before that it was part of Yugoslavia). However, even taking this into account, we have to acknowledge that the appearance of Djokovic put Serbia in the spotlight.
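
The code for this plot is not shown in the post. One way to get wins per country and year is sketched below on made-up rows. It assumes the winner's country is available as a three-letter code in a column called winner_ioc, which is an assumption about the dataset.

```python
import pandas as pd

# Made-up rows; winner_ioc (three-letter country code) is an assumed column
matches = pd.DataFrame({
    "tourney_year": ["2003", "2003", "2003", "2004", "2004", "2004"],
    "tourney_level": ["G", "G", "A", "G", "G", "G"],
    "winner_ioc": ["SUI", "ARG", "ESP", "SUI", "SUI", "ESP"],
})

countries = ["ARG", "ESP", "SUI"]
gs_wins = matches[
    (matches["tourney_level"] == "G") & matches["winner_ioc"].isin(countries)
]

# Rows are years, columns are countries, values are Grand Slam match wins
wins_by_country = (
    gs_wins.groupby(["tourney_year", "winner_ioc"]).size().unstack(fill_value=0)
)
print(wins_by_country)
```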

Players with most aces and double faults

Aces in tennis are points won by serves that are not touched by the receiver. Double faults, on the other hand, are points lost by the server because of two missed serves. Here, I would like to see the players that hit the most aces in history and the ones with the most double faults. Do we find players in both lists?

# Create dataframe with details on aces by winners of each match
sw = tennis_df.groupby(['winner_name']).agg({'w_ace':'sum'}).fillna(0).sort_values(['w_ace'], ascending=False)
# Create dataframe with details on aces by losers of each match
sl = tennis_df.groupby(['loser_name']).agg({'l_ace':'sum'}).fillna(0).sort_values(['l_ace'], ascending=False)
# Concatenate dataframes
dfs = [sw,sl]
r = pd.concat(dfs).reset_index().fillna(0)
# Derive new column with total number of aces
r['aces'] = r['l_ace']+r['w_ace']
final = r.groupby('index').agg({'aces':'sum'}).sort_values('aces',ascending=False).head(10)
final = final.reset_index()
final.columns = ['Player','Aces']
final = final.sort_values('Aces',ascending=True)
final.plot('Player','Aces', kind='barh', title='Players with Most Aces', legend=False)

Ivanisevic, Sampras, Lopez, Rusedski are players in both lists. Aces are higher risk serves so it makes sense that the incidence of double faults is also higher in those players.
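
Finding the players that appear in both lists can be automated. The sketch below uses made-up players and numbers. The double fault columns w_df and l_df are assumed names (by analogy with w_ace and l_ace), and the helper career_total is hypothetical.

```python
import pandas as pd

# Made-up matches; w_df and l_df (double faults) are assumed column names
matches = pd.DataFrame({
    "winner_name": ["A", "B", "A", "C"],
    "loser_name": ["B", "C", "C", "A"],
    "w_ace": [20, 5, 15, 3],
    "l_ace": [4, 2, 1, 18],
    "w_df": [6, 1, 5, 0],
    "l_df": [2, 3, 1, 7],
})

def career_total(df, stat):
    """Sum a stat over matches won (w_ prefix) and lost (l_ prefix) per player."""
    won = df.groupby("winner_name")["w_" + stat].sum()
    lost = df.groupby("loser_name")["l_" + stat].sum()
    return won.add(lost, fill_value=0).sort_values(ascending=False)

top_n = 2  # the post looks at the top 10
top_aces = set(career_total(matches, "ace").head(top_n).index)
top_dfs = set(career_total(matches, "df").head(top_n).index)

# Players present in both top lists
in_both = sorted(top_aces & top_dfs)
print(in_both)
```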

Players’ win performance over time

How did top players perform over time? Since I’m a big Federer fan, we shall start with the (current) Grand Slam record holder.

plot_history_player('Roger Federer')

The Swiss started winning ATP matches in 1998 and reached his peak in 2003, when he started winning Grand Slam matches as well.

Now, let’s see how Nadal did over the years.

plot_history_player('Rafael Nadal')

We see a similar pattern in the case of Rafa. He reaches a peak in ATP match wins and then starts winning more Grand Slam matches. This happens with both players because, when they start competing, they play a lot of (ATP) tournaments until they get to a point where they focus on the major tournaments (Masters, Grand Slams, a few ATP 500 events).
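
The helper plot_history_player used above is not shown in this post. As a rough idea of the kind of aggregation such a helper might rely on, the sketch below counts a player's wins per year and tournament level on made-up rows. The function name wins_per_year_by_level is hypothetical and not the notebook's code.

```python
import pandas as pd

# Made-up rows with columns used earlier in the post
matches = pd.DataFrame({
    "tourney_year": ["1998", "1998", "2003", "2003", "2003"],
    "tourney_level": ["A", "A", "G", "G", "A"],
    "winner_name": ["Player X", "Player Y", "Player X", "Player X", "Player X"],
})

def wins_per_year_by_level(df, player):
    """Count a player's wins per year, with one column per tourney_level."""
    wins = df[df["winner_name"] == player]
    return (
        wins.groupby(["tourney_year", "tourney_level"]).size().unstack(fill_value=0)
    )

history = wins_per_year_by_level(matches, "Player X")
print(history)
# history.plot(kind="line") would then draw one line per tournament level
```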

Dominance

Unique number of players that won Grand Slam and Masters tournaments (by year)

How many unique players won the biggest tournaments per year since 1968?

# Unique number of players that won GS and Masters finals per year
finals = tennis_df[(tennis_df['round'] == 'F') & (tennis_df['tourney_level'].isin(['M', 'G']))]
s = finals.groupby(['tourney_year']).agg({'winner_name': 'nunique'})
t = s.reset_index()
t.columns = ['Year', 'Unique_Winners']
t.plot('Year', 'Unique_Winners', kind='line', title='Unique # of Players that Won GS and Masters Finals', legend=False)

Unique number of players that won Grand Slam tournaments (by periods)

We know that the last decade in tennis was pretty much dominated by a few players (Federer, Nadal, Djokovic). Let’s look at that in a plot.

Dominance is pretty clear in this graph. There were between 15 and 17 unique winners in the three previous periods and only 7 in the last decade!
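
The code for the period plot is not shown in the post. A minimal sketch of the counting step follows, on made-up finals. It groups by calendar decade, which is an assumption: the post does not state the exact period boundaries it used.

```python
import pandas as pd

# Made-up Grand Slam finals
finals = pd.DataFrame({
    "tourney_year": ["1988", "1989", "1995", "2009", "2012", "2015", "2017"],
    "winner_name": ["P1", "P2", "P3", "P4", "P4", "P5", "P4"],
})

# Assumed grouping: calendar decades (1980, 1990, ...)
finals["decade"] = finals["tourney_year"].astype(int) // 10 * 10

# Number of distinct champions in each decade
unique_winners = finals.groupby("decade")["winner_name"].nunique()
print(unique_winners)
```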

Players’ effectiveness by surface types

What is the effectiveness of top ranked players? Effectiveness is measured by the number of wins over the total matches played.
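
The helper plot_effectiveness used below is not shown in this post. The sketch here only illustrates the definition (wins divided by matches played) on made-up rows; the function name effectiveness is hypothetical.

```python
import pandas as pd

# Made-up matches for a single year
matches = pd.DataFrame({
    "tourney_year": ["2006"] * 5,
    "surface": ["Grass", "Grass", "Clay", "Clay", "Hard"],
    "winner_name": ["Player X", "Player X", "Player X", "Player Y", "Player X"],
    "loser_name": ["Player Y", "Player Z", "Player Z", "Player X", "Player Y"],
})

def effectiveness(df, player):
    """Share of matches won per year and surface (wins / matches played)."""
    played = df[(df["winner_name"] == player) | (df["loser_name"] == player)].copy()
    played["won"] = played["winner_name"] == player
    # The mean of a boolean column is the fraction of True values
    return played.groupby(["tourney_year", "surface"])["won"].mean().unstack()

eff = effectiveness(matches, "Player X")
print(eff)
```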

Let’s start with Roger again.

plot_effectiveness('Roger Federer')

Federer’s effectiveness peaked between 2005 and 2010, reaching up to 100% on grass tournaments, with a good performance on hard courts as well. His effectiveness on clay, on the other hand, was considerably lower.

Now let’s see how Nadal did.

plot_effectiveness('Rafael Nadal')

Nadal shows strong effectiveness on clay tournaments (no surprise here, as he is nicknamed the King of Clay).

Age of Grand Slam champions over time

Are Grand Slam champions getting younger or older as time goes by? In which Grand Slams do we find the youngest and oldest champions?

The first Grand Slam champions were in their 30s, but then in the mid-1970s younger champions emerged. The average age became pretty stable after that until around 2010, when we see a steady increase in the average age of champions. So, what happened here? Legends started getting older but they kept on winning titles: Federer won his 20th Grand Slam at the 2018 Australian Open at 36 years of age!
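
The code for this plot is not shown in the post. The average age per year could be computed along these lines; the rows below are made up, while the column names are the ones used earlier in the post.

```python
import pandas as pd

# Made-up finals
finals = pd.DataFrame({
    "tourney_year": ["1975", "1975", "2017", "2017"],
    "tourney_level": ["G", "G", "G", "G"],
    "round": ["F", "F", "F", "F"],
    "winner_age": [21.0, 23.0, 35.0, 31.0],
})

# Grand Slam finals only, then the mean age of the winners per year
gs_finals = finals[(finals["tourney_level"] == "G") & (finals["round"] == "F")]
avg_age = gs_finals.groupby("tourney_year")["winner_age"].mean()
print(avg_age)
```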

Retirements

What is the evolution of retirements over time? In which tournament do we see most of these retirements?

Is it just that we have more retirements because more matches are played in that particular tournament or on that surface? What if we consider the ratio of retirements to matches played?

This plot, which uses a log scale, shows that there is indeed an upward trend in the retirement ratio, although it is not specific to any surface.
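
The retirement ratio can be sketched as follows. The rows are made up, and the way a retirement is detected (the text "RET" inside the score column) is an assumption about how the dataset records it.

```python
import pandas as pd

# Made-up matches
matches = pd.DataFrame({
    "tourney_year": ["1990", "1990", "1990", "1990", "2015", "2015"],
    "score": ["6-4 6-4", "7-5 6-3", "6-2 6-1", "6-3 2-0 RET",
              "6-4 3-1 RET", "6-4 6-4"],
})

# Assumption: a retirement is marked by "RET" in the score string
matches["retired"] = matches["score"].str.contains("RET", na=False)

# Retirements divided by matches played, per year
ret_ratio = matches.groupby("tourney_year")["retired"].mean()
print(ret_ratio)
```

Adding "surface" to the groupby keys would give the per-surface ratio discussed above.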

Top rivalries by decade

What are the top rivalries in tennis’ history based on the number of matches played between players?

We might think Federer vs. Nadal is the biggest rivalry, but they actually haven’t played the most matches against each other, as we shall see.
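
The code for the rivalry plots is not shown in the post. Counting head-to-head matches needs one extra step: a pair must be counted the same way regardless of who won. The sketch below does this on made-up players and groups by calendar decade, which is an assumption about the grouping.

```python
import pandas as pd

# Made-up matches between three players
matches = pd.DataFrame({
    "tourney_year": ["2011", "2012", "2013", "2013", "2014"],
    "winner_name": ["A", "B", "A", "C", "B"],
    "loser_name": ["B", "A", "B", "A", "C"],
})

# Order the two names alphabetically so A-B and B-A count as the same pair
matches["rivalry"] = matches[["winner_name", "loser_name"]].apply(
    lambda row: " vs. ".join(sorted(row)), axis=1
)
matches["decade"] = matches["tourney_year"].astype(int) // 10 * 10

# Matches played per pair and decade, most frequent first
rivalries = (
    matches.groupby(["decade", "rivalry"]).size().sort_values(ascending=False)
)
print(rivalries.head())
```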

Key insights

To summarize, the key insights that I got from this analysis are:

  • Grass and hard courts show a higher number of aces than clay courts.
  • Many players appear in both rankings of top aces and double faults in history. More aces could mean taking higher risks that could lead to a higher number of double faults.
  • In the last ten years of professional tennis only 7 different players won Grand Slam titles, compared with an average of about 16 in previous decades.
  • The average age of champions increased in the last few years. This seems to be due to dominance from the same players that were getting older each year.
  • There is an upward trend in the retirement ratio in recent years with no clear difference between surface types.
  • In the 80s rivalries were more even. In the 90s there was a rivalry that stood out and in the 2010s the rivalries were shared among a few players with a higher number of encounters.

Conclusion and future work

Well, I hope you have enjoyed this analysis as much as I enjoyed working on it!

There is definitely much more that could be done with this data. Here are a few ideas:

  • Answer additional questions like:
      ◦ Which players came back to win the most matches in Grand Slam tournaments? (e.g. ended up winning after trailing by two sets to zero)
      ◦ Which players have the longest winning and losing streaks?

  • Apply unsupervised ML techniques to create players segmentation based on statistics like aces, scores, retirements, etc.
  • Apply supervised ML techniques to predict the future performance of players.

Nicolas Escobar Jariton
Data Guasu

Data Scientist at Phoenix Games. M.S. in Data Science, Indiana University.