Animating Soccer Transfer Fees with Python and Matplotlib

Stefan Gouyet
Published in Analytics Vidhya
5 min read · Feb 16, 2020

Transfer fees are a never-ending discussion in soccer, with every summer bringing higher prices and more extravagant deals.

I decided to visualize the rise of transfer fees over time, using an animated bar chart, grouped by position. For this, I relied on matplotlib and two very useful Medium articles by Gabriel Berardi and Pratap Vardhan.

The data comes from ewenme’s git repository, which holds data originally scraped from Transfermarkt. To begin, I merged the repository's individual datasets:

import os
import pandas as pd

mytemplist = []

# loop over folders in directory
for folder in os.listdir(os.getcwd()):
    # skip anything with a dot in its name (files, hidden folders)
    if '.' in folder:
        continue

    current_folder = os.path.join(os.getcwd(), folder)
    print(current_folder)

    # loop over files in folder and read each csv
    for file in os.listdir(current_folder):
        df = pd.read_csv(os.path.join(current_folder, file))
        mytemplist.append(df)
        print('done reading file: {}'.format(file))

# stack all the individual datasets into one dataframe
df = pd.concat(mytemplist)
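
For readers who prefer pathlib, the same merge can be sketched as a small function. This is only an illustration, not the code used for the article: it assumes the CSV files sit exactly one folder below the root, the helper name merge_csv_folders and the extra source_folder column are hypothetical, and the tiny directory tree is made up so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_csv_folders(root):
    """Read every CSV one level below `root` and stack them into one frame."""
    frames = []
    for csv_path in sorted(Path(root).glob('*/*.csv')):
        frame = pd.read_csv(csv_path)
        # hypothetical extra column: remember which folder each row came from
        frame['source_folder'] = csv_path.parent.name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# made-up directory tree with two folders, just to exercise the function
with tempfile.TemporaryDirectory() as root:
    for season, fee in [('2018', 10.0), ('2019', 20.0)]:
        folder = Path(root) / season
        folder.mkdir()
        toy = pd.DataFrame({'player_name': ['A', 'B'],
                            'fee_cleaned': [fee, fee + 1]})
        toy.to_csv(folder / 'league.csv', index=False)
    merged = merge_csv_folders(root)

print(merged)
```

Passing ignore_index=True gives the merged frame a clean 0..n-1 index instead of repeating each file's own row numbers.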

The second step was to clean the data: grouping positions into four main categories, removing duplicates (keeping rows where a player joins a team, marked “in”, and dropping rows where a player leaves, marked “out”), and creating a new variable that combines player name and club name (e.g. “Eden Hazard (Real Madrid)”).

import numpy as np

# group positions into attack, midfield, and goalkeeper; the rest become defenders
positions = [
    df['position'].isin(['Centre-Forward', 'Left Winger', 'Right Winger',
                         'Striker', 'Second Striker', 'Forward']),
    df['position'].isin(['Central Midfield', 'Attacking Midfield',
                         'Defensive Midfield', 'Right Midfield', 'Midfielder']),
    df['position'] == 'Goalkeeper',
]

generic_position = ['Attacker', 'Midfielder', 'Goalkeeper']
df['generic_position'] = np.select(positions, generic_position, default='Defender')

# drop players where fee_cleaned is NA, then keep fees above 0.1
df.dropna(subset=['fee_cleaned'], inplace=True)
df = df[df['fee_cleaned'] > 0.1]

# keep only players with transfers "in" to club - not leaving (duplicates otherwise)
df = df[df['transfer_movement'] == 'in']

# make new variable: player_name + club_name
df['player_name_club'] = df.apply(lambda x: str(x['player_name'] + ' (' + x['club_name'] + ')'), axis=1)
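
To see how np.select assigns the position groups, here is a minimal sketch on a made-up four-row frame. The shortened position lists and the toy data are assumptions for illustration only; the point is that any row matching none of the conditions falls through to the default.

```python
import numpy as np
import pandas as pd

# made-up rows, one per outcome
toy = pd.DataFrame({'position': ['Centre-Forward', 'Central Midfield',
                                 'Goalkeeper', 'Left-Back']})

conditions = [
    toy['position'].isin(['Centre-Forward', 'Left Winger']),
    toy['position'].isin(['Central Midfield', 'Defensive Midfield']),
    toy['position'] == 'Goalkeeper',
]
labels = ['Attacker', 'Midfielder', 'Goalkeeper']

# rows matching no condition receive the default label
toy['generic_position'] = np.select(conditions, labels, default='Defender')
print(toy)
```

Because "Left-Back" appears in none of the lists, it ends up labelled as a defender without the defender positions ever being enumerated.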

Third, I created a new dataframe (top_transfers), keeping only the players who ranked in the top 10 by fee for their year-position combination (e.g. 1999 defenders). This produced a smaller dataset that was cleaner and easier to inspect.

#Get top 10 transfers for each year + position; sort by fee_cleaned, group by, and take top 10
top_transfers = df.sort_values(by=['fee_cleaned'],ascending=False).groupby(['year','generic_position']).head(10)
#keep necessary columns only
top_transfers = top_transfers[['player_name_club', 'club_involved_name','fee_cleaned','year','generic_position']]
#Sort by year
top_transfers = top_transfers.sort_values(by=['year'],ascending=True)
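
The sort-then-head pattern can be checked on a small made-up frame. In this sketch the data and the cutoff of 2 (instead of 10) are assumptions chosen to keep the output readable; the column names follow the article.

```python
import pandas as pd

# made-up fees for a single position across two years
toy = pd.DataFrame({
    'year': [2018, 2018, 2018, 2019, 2019],
    'generic_position': ['Attacker'] * 5,
    'player_name_club': ['a', 'b', 'c', 'd', 'e'],
    'fee_cleaned': [5.0, 50.0, 20.0, 7.0, 70.0],
})

# sort by fee first, so that head(2) keeps the two largest fees in each group
top2 = (toy.sort_values(by='fee_cleaned', ascending=False)
           .groupby(['year', 'generic_position'])
           .head(2)
           .sort_values(by=['year', 'fee_cleaned'], ascending=[True, False]))
print(top2)
```

groupby(...).head(n) keeps the first n rows of each group in their current order, which is why the frame has to be sorted by fee before grouping.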

The fourth step was to replicate R’s complete function (from the tidyr package), which doesn’t exist as a stand-alone function in pandas and requires a few lines of code (the Stack Overflow post credited in the code below was very useful). In a nutshell, this step lets a transfer fee (e.g. N’Golo Kante’s transfer to Chelsea in 2016) persist in 2017, 2018, and 2019.

# complete rows to keep player row after year of signing (e.g. player_x signed in 2018, keep player_x in 2019 data as well)
# credit to: https://stackoverflow.com/questions/40093971/pandas-dataframe-insert-fill-missing-rows-from-previous-dates
levels = ['year', 'player_name_club']
full_idx = pd.MultiIndex.from_product([top_transfers[col].unique() for col in levels], names=levels)
top_transfers_complete = top_transfers.set_index(levels).reindex(full_idx)

# carry each player's row forward through later years
top_transfers_complete = top_transfers_complete.groupby(level=['player_name_club']).ffill().reset_index()

# years before the signing have no data; fill them with 0
top_transfers_complete.fillna(0, inplace=True)

Looking at N’Golo Kante in our new dataset shows the following:

top_transfers_complete[top_transfers_complete['player_name_club'].str.contains('Kant')].tail(10)

As you can see above, once the transfer has taken place (in 2016 for this example), our new dataframe includes a row for each year and player_name_club combination.
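
The completion step is easier to follow on a toy example. The sketch below uses three made-up players (the names and fees are assumptions, not real data) and pivots the result into a wide table so the forward-fill is visible at a glance.

```python
import pandas as pd

# made-up signings, one per year
toy = pd.DataFrame({
    'year': [2016, 2017, 2018],
    'player_name_club': ['p (X)', 'q (Y)', 'r (Z)'],
    'fee_cleaned': [30.0, 40.0, 50.0],
})

levels = ['year', 'player_name_club']
full_idx = pd.MultiIndex.from_product(
    [toy[col].unique() for col in levels], names=levels)

# one row per (year, player) pair; pairs absent from the data become NaN
completed = toy.set_index(levels).reindex(full_idx)

# carry each fee forward in time, then zero out the years before the signing
completed = (completed.groupby(level='player_name_club').ffill()
                      .fillna(0)
                      .reset_index())

wide = completed.pivot(index='player_name_club', columns='year',
                       values='fee_cleaned')
print(wide)
```

Each player now has a value for every year: 0 before the signing and the fee from the signing year onward. The forward-fill relies on the years being in ascending order within each player, which holds here and in the article because the frame was sorted by year beforehand.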

For the data visualization part of the project, I wrote two functions that draw the bar charts and split them into subplots by position (Goalkeeper, Defender, Midfielder, Attacker).

The first function, shown below, uses matplotlib to draw an individual bar chart listing the top five players’ names and fees for a given year and position. Most of the code below is purely formatting.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

def draw_indiv_barchart(ax, df, position, year):
    # keep the five highest fees for this position and year
    df = df[df.generic_position == position]
    dff = df[df['year'].eq(year)].sort_values(by='fee_cleaned',
                                              ascending=True).tail(5)
    ax.clear()
    ax.set_xlim([0, 250])
    ax.set_xticks(np.arange(0, 250, 50))
    ax.barh(dff['player_name_club'], dff['fee_cleaned'])
    dx = dff['fee_cleaned'].max() / 200

    # label each bar with the player (club) and the fee
    for i, (value, name) in enumerate(zip(dff['fee_cleaned'], dff['player_name_club'])):
        ax.text(value + dx, i + 0.1, ' ' + name, color='#3b4463', size=16, weight=600,
                ha='left', va='bottom', fontdict={'fontname': 'Georgia'})
        ax.text(value + dx, i - 0.1, ' £' + str(value) + 'mil', size=14, weight=200,
                ha='left', va='center', fontdict={'fontname': 'Georgia'})

    # axis formatting
    ax.text(0, 1.09, ' £ (Millions)', transform=ax.transAxes, size=16, color='#777777')
    ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    ax.xaxis.set_ticks_position('top')
    ax.tick_params(axis='x', colors='#777777', labelsize=14)
    ax.set_yticks([])
    ax.margins(0, 0.01)
    ax.grid(which='major', axis='x', linestyle='-')
    ax.set_axisbelow(True)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    plt.subplots_adjust(left=0.075, right=0.75, top=0.825, bottom=0.05, wspace=0.2, hspace=0.2)
    plt.locator_params(axis='x', nbins=12)
    plt.box(False)

The second function calls the first one four times, once per subplot, by looping over a dictionary, ax_dict, that maps each axes object to a position.

# ax1..ax4 are the four axes created by plt.subplots (see the next snippet),
# so the figure must exist before this dictionary is built
ax_dict = {ax1: "Attacker",
           ax2: "Midfielder",
           ax3: "Defender",
           ax4: "Goalkeeper"}

def draw_subplots(year):
    for ax, position in ax_dict.items():
        draw_indiv_barchart(ax, top_transfers_complete, position, year)
        # show the year as a title above the top subplot only
        if ax == ax1:
            ax.set_title(year, size=42, weight=600, ha='center', fontdict={'fontname': 'Georgia'}, y=1.1)
        ax.set_ylabel(position, size=22, fontdict={'fontname': 'Georgia'})

To visualize one year’s worth of data (2019 in this case), the following code was run:

current_year = 2019

fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, figsize=(22, 16))
fig.suptitle('Most Expensive Soccer Transfers (1992-2019)',
             ha='center', va='center', y=0.9, x=0.4,
             fontsize=48, fontdict={'fontname': 'Georgia'})
ax1.set_ylabel('Attacker')
fig.text(0.65, 0.02, 'Source: Transfermarkt', ha='right', fontsize=30,
         bbox=dict(facecolor='white', alpha=0.8, edgecolor='white'), fontdict={'fontname': 'Georgia'})

draw_subplots(year=current_year)

Producing the following static bar chart:

The final step was to animate the visualization with matplotlib’s FuncAnimation and display it inline with IPython’s HTML function. The gif at the top of the page was generated at fps = 2; the one below runs at fps = 0.7, hence the slow-moving figure:

from IPython.display import HTML
from matplotlib import animation

animator = animation.FuncAnimation(fig, draw_subplots, frames=range(1992, 2020), interval=600)
HTML(animator.to_jshtml())
animator.save('animation.gif', writer='imagemagick', fps=2, bitrate=1800)
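
The mechanics are easier to see in a stripped-down sketch: FuncAnimation calls the drawing function once per frame, passing the next value from frames. Everything below is an assumption made for illustration (the fee numbers, the player labels, and the draw_year helper are made up), and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import animation

# made-up fees per year, purely for illustration
fees_by_year = {2017: [10, 20], 2018: [15, 40], 2019: [30, 80]}
names = ['player a', 'player b']

fig, ax = plt.subplots(figsize=(6, 3))


def draw_year(year):
    """Redraw the whole axes for one frame, mirroring the clear-and-redraw approach."""
    ax.clear()
    ax.set_xlim(0, 100)
    ax.barh(names, fees_by_year[year])
    ax.set_title(str(year))


# each value in `frames` is passed to draw_year in turn
animator = animation.FuncAnimation(fig, draw_year, frames=sorted(fees_by_year),
                                   interval=600, repeat=False)

# PillowWriter is part of matplotlib (it needs the Pillow package) and avoids
# the external ImageMagick dependency; uncomment to write the file.
# animator.save('animation.gif', writer=animation.PillowWriter(fps=2))
```

Clearing and redrawing the axes on every frame is simple, and it is fast enough for a few dozen frames like the 28 years animated here.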

That’s the final visualization, showing a clear upward trend in transfer fees between 1992 and 2019.

Full git repository here: https://github.com/stefangouyet/transfers


This post was republished to my personal blog on August 31, 2020.
