Preprocess FIFA World Cup data with Python

shotin
Analytics Vidhya
Published in
5 min readAug 30, 2020

Overview

A large amount of football play by play data was published by Wyscout in May 2019 on figshare.

These data are JSON format and it is complicated to process, so I tried to preprocess these data to DataFrame format with Python for later analysis. I’m going to use FIFA World Cup 2018 data. Use Google Colab, so you don’t need to build any dev environment on your laptop.

This is not sports analytics article but TIPS for preprocessing with pandas.

Copyright ©1994–2020 FIFA. All rights reserved

After this process, I visualized passing data. Please see also that article.

1. Read Json files

I’m going to use these files as follow:

  • matches_World_Cup.json
  • events_World_Cup.json
  • players.json
  • teams.json

Read these files using “pandas.read_json”.

import pandas as pd
pd.set_option("max_columns", 100) #This allows us to see a bunch of columns
match_origin = pd.read_json("matches_World_Cup.json”.replace(“‘“, ‘“‘))event_origin = pd.read_json(“events_World_Cup.json”.replace(“‘“, ‘“‘))player_origin = pd.read_json(“players.json”.replace(“‘“, ‘“‘))team_origin = pd.read_json(“teams.json”.replace(“‘“, ‘“‘))

See inside of dataframe.

match_origin.head()

Looks good… No, “teamsData” column is nested text data as json format. We need to process this.

Get “teamsData” of each row using loop, you can treat as dict type.

for team_data in match_origin["teamsData"].head():
print(team_data)

2. Convert nested json into dataframe

We cannot use “teamsData” for analysis as it is, so we have to convert “teamsData” into dataframe. We create the two dataframe, one stores match info and another stores match members from “teamsData”.

matchesInfo = pd.DataFrame(
columns=[
“matchId”
,”gameweek”
,"homeAway"
,”homeTeamId”
,”homeScore”
,”homeScoreP”
,”awayTeamId”
,”awayScore”
,”awayScoreP”
]
)
matchesMember = pd.DataFrame(
columns=[
“matchId”
,”teamId”
,”homeAway”
,”playerId”
,”startingF”
,”goals”
,”ownGoals”
,”yellowCards”
,”redCards”
]
)

We get the value from “teamsData” as dict type using loop.

for matchId, gameweek, teams in zip(match_origin[“wyId”], match_origin[“gameweek”], match_origin[“teamsData”]):  for team, info in teams.items():
if str(info[“side”]) == “home”:
homeAway = 1
else:
homeAway = 2
matchesInfo = matchesInfo.append(
pd.Series(
[
matchId
,gameweek
,homeAway
,team
,info["score"]
,info["scoreP"]
]
,index=matchesInfo.columns
)
,ignore_index=True
)
for startingM in info[“formation”][“lineup”]:
matchesMember = matchesMember.append(
pd.Series(
[
matchId
,team
,homeAway
,startingM[“playerId”]
,1
,startingM[“goals”].replace(“null”,”0")
,startingM[“ownGoals”]
,startingM[“yellowCards”]
,startingM[“redCards”]
]
,index=matchesMember.columns
)
,ignore_index=True
)
for benchM in info[“formation”][“bench”]:
matchesMember = matchesMember.append(
pd.Series(
[
matchId
,team
,homeAway
,benchM[“playerId”]
,0
,benchM[“goals”].replace(“null”,”0")
,benchM[“ownGoals”]
,benchM[“yellowCards”]
,benchM[“redCards”]
]
,index=matchesMember.columns
),ignore_index=True
)

Let’s see inside.

matchesInfo.head()
matchesMember.head()

3. Convert json (dictionary) into dataframe

Move on to player and team. Both dataset also have json format in area, role and so on, but these are not nested, just dictionary type.

You can use “json_normalize” to convert json (dictionary) into dataframe.

from pandas.io.json import json_normalizeteams = pd.DataFrame(
data={
"teamId": team_origin["wyId"]
,"name": team_origin["name"]
,"officialName": team_origin["officialName"]
}
).join(
pd.DataFrame(
data={
"areaCode": json_normalize(team_origin["area"])["alpha3code"]
,"areaName": json_normalize(team_origin["area"])["name"]
}
)
)
players = pd.DataFrame(
data={
"playerId": player_origin["wyId"]
,"clubTeamId": player_origin["currentTeamId"]
,"nationalTeamTd":player_origin["currentNationalTeamId"].replace("null","0")
,"playerName": player_origin["shortName"]
,"firstName": player_origin["firstName"]
,"middleName": player_origin["middleName"]
,"lastName": player_origin["lastName"]
}
).join(
pd.DataFrame(
data={
"positionCode": json_normalize(player_origin["role"])["code2"]
,"positionName": json_normalize(player_origin["role"])["name"]
}
)
)

See inside of dataframe.

teams
players

4. Get play by play data

In the end, get play by play data which is the most important.

Let’s see inside of event data.

event_origin.head()

What is “tags”? You can see api document in this link:

1801 is accurate pass and 701 is lost. You definitely need to process this column.

Firstly, we need to define dataframe.

events = pd.DataFrame(
columns=[
“eventId”
,”matchId”
,”matchPeriod”
,”teamId”
,”playerId”
,”beforeEventSec”
,”eventSec”
,”eventId”
,”subEventId”
,”goalF”
,”assistF”
,”keyPass”
,”accurateF”
,”fromX”
,”fromY”
,”toX”
,”toY”
]
)

Insert data into dataframe using loop in event data. (This takes a long, long time…) I chose Goal, Assist, KeyPass and Accurate Pass in tags, but you can choose whatever you want.

beforeSec = 0for index_event,event in event_origin.iterrows():
#initialize variable
fromX = -1
fromY = -1
toX = -1
toY = -1
goal_f = 0
assist_f = 0
keypass_f = 0
accurate_f = 0
for pos in event[‘positions’]:
if (fromX == -1 or fromY == -1):
fromX = pos[“x”]
fromY = pos[“y”]
else:
toX = pos[“x”]
toY = pos[“y”]

for tag in event[‘tags’]:
if int(tag[“id”]) == 101:
goal_f = 1
elif int(tag[“id”]) == 301:
assist_f = 1
elif int(tag[“id”]) == 302:
keypass_f = 1
elif int(tag[“id”]) == 1801:
accurate_f = 1

events = events.append(
pd.Series(
[
event[“id”]
,event[“matchId”]
,event[“matchPeriod”]
,event[“teamId”]
,event[“playerId”]
,beforeSec
,event[“eventSec”]
,event[“eventId”]
,event[“subEventId”]
,goal_f
,assist_f
,keypass_f
,accurate_f
,fromX
,fromY
,toX
,toY
]
,index=events.columns
)
,ignore_index=True
)

beforeSec = event[“eventSec”]
events

There are no event id and sub event id master as json file, so we need to create.

eventKinds = event_origin[[“eventId”,”eventName”]][~event_origin[[“eventId”,”eventName”]].duplicated()]subEventKinds = event_origin[["subEventId","subEventName"]][~event_origin[["subEventId","subEventName"]].duplicated()]
eventKinds.sort_values([“eventId”]).head()
subEventKinds[subEventKinds.subEventId != “”].sort_values([“subEventId”]).head()

In the end, save dataframe to csv.

matchesInfo.to_csv(“csv/matches.csv”,index=False)
matchesEntry.to_csv(“csv/matches_member.csv”,index=False)
events.to_csv(“csv/events.csv”, index=False)
eventKinds.to_csv(“csv/eventKinds.csv”,index=False)
subEventKinds.to_csv(“csv/subEventKinds.csv”,index=False)
players.to_csv(“csv/players.csv”,index=False)
teams.to_csv(“csv/teams.csv”,index=False)

I’m going to visualize data using these csv next time.

Thank you for reading and see you soon!

--

--