Turning wiki content into a graph for football fans
Football Wiki is an extensive database of everything a football fan might be interested in. It describes teams, players, managers, cup finals and important tournaments.
While it is a great source of information about a given football player or team, this wiki (and wikis in general) does not work well when we want:
- to find out which players played for both Liverpool and Manchester United,
- to list all midfielders in Premier League born in Germany,
- to check who played for Liverpool in 2012.
Wikis are “just” content. Interlinked, but still just text (or wikitext, to be precise). To answer the questions above, we need to understand the content and model it properly.
Understanding the content of the wiki
At first glance, article content may seem to be just a blob of text. However, wikitext provides features such as links and templates, which give the content a bit of structure. Links model relations between articles (the “Ole Gunnar Solskjær” article links to “Manchester United” and “Molde FK”). Far more data, however, can be extracted from templates.
Templates as a structured data source
Infoboxes, a specific type of template, provide a set of properties for a given article. In the world of football these can be:
- biodata — date and place of birth, nationality,
- player height and position on the field,
- club foundation date, stadium capacity,
- the list of clubs a given person played for or managed.
All the properties for an infobox are provided in the article’s wikitext.
To avoid error-prone parsing of template invocations in wikitext, we implemented a simple API endpoint that returns a JSON-formatted tree of all templates used by a given article, together with parameter names and values as understood by MediaWiki’s wikitext parser.
Representing the data
Different infoboxes are used to model different types of entities. We assumed the following:
- Infobox Biography template describes a person (either a player or a manager, or even both),
- Infobox Club template describes a team,
- Fs player template is used inside the team’s article content to connect teams with players (i.e. set up a relation for the current team squad).
As we parse the list of parameters passed to templates we assume that:
- parameters with links form a relation (e.g. player played in this club, the club has this person as a manager),
- parameters with plain values are the source of data for properties (club foundation year, birth date, player’s height).
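The two rules above can be sketched in a few lines of Python. This is only an illustration, not the script we ran: the function name split_parameters and the parameter names in the example are hypothetical, and it assumes each parameter value arrives as a raw wikitext string in which links use the usual [[Target|label]] syntax.

```python
import re

# Matches wikitext links such as [[Molde FK]] or
# [[Manchester United F.C.|Manchester United]] and captures the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def split_parameters(params):
    """Split template parameters into relations (linked) and properties (plain)."""
    relations, properties = {}, {}
    for name, raw in params.items():
        value = raw.strip()
        if not value:
            continue  # skip parameters left empty in the article
        targets = [target.strip() for target in WIKILINK.findall(value)]
        if targets:
            relations[name] = targets  # linked articles become relation targets
        else:
            properties[name] = value  # plain values become node properties
    return relations, properties


# Hypothetical parameter names and values, for illustration only.
example = {
    "fullname": "Ole Gunnar Solskjær",
    "height": "1.78",
    "clubs": "[[Molde FK]], [[Manchester United F.C.|Manchester United]]",
    "caps": "",
}
relations, properties = split_parameters(example)
```

Keeping the link target rather than the label matters here, because the target is the article title that identifies the node on the other end of the relation.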
schema.org types
We plan to use more than just a single wiki to build a wiki-powered knowledge database. Hence, we do not want to rely on a specific naming convention used by each of the wikis.
That’s why we model the collected data using industry-standard schema.org types and properties:
- https://schema.org/Person models players and managers
- https://schema.org/SportsTeam models football clubs
This approach will, later on, allow us to expose structured data and links between entities in HTML of rendered wiki articles.
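One common way to embed schema.org data in a page is a JSON-LD document. We have not settled on an encoding, so treat the following as a minimal sketch under that assumption; to_json_ld is a hypothetical helper, and the property values are taken from the Person example shown later in this post.

```python
import json


def to_json_ld(schema_type, properties):
    """Serialize a node's properties as a schema.org JSON-LD document."""
    document = {"@context": "https://schema.org", "@type": schema_type}
    document.update(properties)
    # ensure_ascii=False keeps names such as "Solskjær" readable in the output.
    return json.dumps(document, ensure_ascii=False, indent=2)


person = to_json_ld("Person", {
    "name": "Ole Gunnar Solskjær",
    "birthDate": "1973",
    "nationality": "Norway",
    "height": "1.78",
})
print(person)
```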
Building a graph
We do not only want to store properties of all football players, managers and teams. We also want to have relations between them:
- a player’s career is described as an [:athlete] relation (with since and until properties) to a SportsTeam node,
- a team’s current squad is described as an [:athlete] relation to a Person node. The player’s position on the field and number are stored in the relation’s properties.
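To make the two kinds of [:athlete] relation concrete, here is a sketch that renders them as Cypher statements. It is an illustration rather than our actual import code: the helper names are hypothetical, and it assumes both nodes already exist and can be matched by their name property.

```python
def cypher_literal(value):
    """Render a Python value as a Cypher literal (numbers bare, strings quoted)."""
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def cypher_props(props):
    """Render a dict as a Cypher property map, or nothing when it is empty."""
    inner = ", ".join(f"{key}: {cypher_literal(val)}" for key, val in props.items())
    return f" {{{inner}}}" if inner else ""


def athlete_edge(src_label, src_name, dst_label, dst_name, props):
    """Build a statement that links two existing nodes with an athlete relation."""
    return (
        f"MATCH (a:{src_label} {{name: {cypher_literal(src_name)}}}), "
        f"(b:{dst_label} {{name: {cypher_literal(dst_name)}}}) "
        f"CREATE (a)-[:athlete{cypher_props(props)}]->(b)"
    )


# Career: Person -> SportsTeam, with since/until.
career = athlete_edge("Person", "Ole Gunnar Solskjær",
                      "SportsTeam", "Manchester United F.C.",
                      {"since": 1996, "until": 2007})

# Current squad: SportsTeam -> Person, with number/position.
squad = athlete_edge("SportsTeam", "Manchester United F.C.",
                     "Person", "Paul Pogba",
                     {"number": 6, "position": "MF"})
```

The direction of the relation is what tells the two cases apart: career edges start at the person, squad edges start at the team.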
To store such a data set, we decided to use a graph database. For our project we picked RedisGraph, a Redis module that adds graph support.
Modeling nodes and relations + examples
A Python script was used to do the following:
- get the list of all football players and teams from Football Wiki,
- fetch the JSON tree with the list of templates used by each of these articles,
- using the templates described above, build models that have both bio properties (name, birth date, nationality, and height) and relations (athlete and coach).
Here’s an example for Ole Gunnar Solskjær:
<PersonModel https://schema.org/Person (Ole_Gunnar_Solskjr:Person) name = "Ole Gunnar Solskjær", birthDate = "1973", birthPlace = "Kristiansund", nationality = "Norway", height = "1.78">
--[:athlete {"until": 1994, "since": 1990}]->(Clausenengen_FK:SportsTeam)
--[:athlete {"until": 1996, "since": 1994}]->(Molde_FK:SportsTeam)
--[:athlete {"until": 2007, "since": 1996}]->(Manchester_United_F_C:SportsTeam)
--[:coach {"until": 2011, "since": 2008}]->(Manchester_United_F_C_Reserves_and_Academy:SportsTeam)
--[:coach {"until": 2014, "since": 2011}]->(Molde_FK:SportsTeam)
--[:coach {"until": 2014, "since": 2014}]->(Cardiff_City_F_C:SportsTeam)
--[:coach {"until": 2016, "since": 2014}]->(Clausenengen_FK:SportsTeam)
--[:coach {"until": 2018, "since": 2015}]->(Molde_FK:SportsTeam)
--[:coach {"until": 2019, "since": 2018}]->(Manchester_United_F_C:SportsTeam)
And here’s how we modeled the Manchester United article:
<SportsTeamModel https://schema.org/SportsTeam (Manchester_United_F_C:SportsTeam) name = "Manchester United F.C.", sport = "Football", foundingDate = "1878", ground = "Old Trafford", memberOf = "Premier League", url = "http://www.manutd.com/">
--[:coach ]->(Ole_Gunnar_Solskjr:Person)
--[:athlete {"number": 1, "position": "GK"}]->(David_de_Gea:Person)
--[:athlete {"number": 2, "position": "DF"}]->(Victor_Lindelf:Person)
--[:athlete {"number": 3, "position": "DF"}]->(Eric_Bailly:Person)
--[:athlete {"number": 4, "position": "DF"}]->(Phil_Jones_born_1992:Person)
--[:athlete {"number": 6, "position": "MF"}]->(Paul_Pogba:Person)
--[:athlete {"number": 7, "position": "FW"}]->(Alexis_Snchez:Person)
--[:athlete {"number": 8, "position": "MF"}]->(Juan_Mata:Person)
--[:athlete {"number": 9, "position": "FW"}]->(Romelu_Lukaku:Person)
--[:athlete {"number": 10, "position": "FW"}]->(Marcus_Rashford:Person)
--[:athlete {"number": 11, "position": "FW"}]->(Anthony_Martial:Person)
--[:athlete {"number": 12, "position": "DF"}]->(Chris_Smalling:Person)
--[:athlete {"number": 13, "position": "GK"}]->(Lee_Grant_born_1983:Person)
--[:athlete {"number": 14, "position": "MF"}]->(Jesse_Lingard:Person)
--[:athlete {"number": 15, "position": "MF"}]->(Andreas_Pereira:Person)
--[:athlete {"number": 16, "position": "DF"}]->(Marcos_Rojo:Person)
--[:athlete {"number": 17, "position": "MF"}]->(Fred_born_1993:Person)
--[:athlete {"number": 18, "position": "DF"}]->(Ashley_Young:Person)
--[:athlete {"number": 20, "position": "DF"}]->(Diogo_Dalot:Person)
--[:athlete {"number": 21, "position": "MF"}]->(Ander_Herrera:Person)
--[:athlete {"number": 22, "position": "GK"}]->(Sergio_Romero:Person)
--[:athlete {"number": 23, "position": "DF"}]->(Luke_Shaw:Person)
--[:athlete {"number": 25, "position": "DF"}]->(Antonio_Valencia:Person)
--[:athlete {"number": 27, "position": "MF"}]->(Marouane_Fellaini:Person)
--[:athlete {"number": 31, "position": "MF"}]->(Nemanja_Mati:Person)
--[:athlete {"number": 36, "position": "DF"}]->(Matteo_Darmian:Person)
--[:athlete {"number": 39, "position": "MF"}]->(Scott_McTominay:Person)
--[:athlete {"number": 24, "position": "DF"}]->(Timothy_Fosu_Mensah:Person)
--[:athlete {"number": 38, "position": "DF"}]->(Axel_Tuanzebe:Person)
--[:athlete {"number": 40, "position": "GK"}]->(Joel_Castro_Pereira:Person)
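The node identifiers in the two dumps above (Ole_Gunnar_Solskjr, Manchester_United_F_C, Phil_Jones_born_1992) look like article titles with non-ASCII characters dropped and punctuation collapsed into underscores. The following is only a guess at such a normalization, written to reproduce those identifiers; node_id is a hypothetical name.

```python
import re


def node_id(title):
    """Turn an article title into an ASCII-only identifier."""
    ascii_only = title.encode("ascii", "ignore").decode("ascii")  # drops æ, ö, ć, ...
    collapsed = re.sub(r"[^A-Za-z0-9]+", "_", ascii_only)  # spaces, dots, brackets
    return collapsed.strip("_")


for title in ("Ole Gunnar Solskjær",
              "Manchester United F.C.",
              "Phil Jones (born 1992)",
              "Timothy Fosu-Mensah"):
    print(title, "->", node_id(title))
```

One caveat of this approach: dropping characters means two different titles can end up with the same identifier, so the original title is worth keeping as the name property.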
Querying the database
Now let’s get back to the questions that we wanted to ask of Football Wiki.
RedisGraph can be queried using Cypher, a query language originally designed for the Neo4j database. We ask it for all teams (t nodes) that persons (p nodes) link to via the athlete relation. Teams are limited to members of the Premier League, and we’re only interested in Icelandic players:
127.0.0.1:56379> GRAPH.QUERY football "MATCH (t:SportsTeam)<-[a:athlete]-(p:Person) WHERE t.memberOf = 'Premier League' AND p.nationality = 'Iceland' RETURN t.name,p.name,a.since,a.until"
1) 1) 1) "t.name"
      2) "p.name"
      3) "a.since"
      4) "a.until"
   2) 1) "Arsenal F.C."
      2) "\xc3\x93lafur Ingi Sk\xc3\xbalason"
      3) "2001.000000"
      4) "2005.000000"
   3) 1) "Tottenham Hotspur F.C."
      2) "Gylfi \xc3\x9e\xc3\xb3r Sigur\xc3\xb0sson"
      3) "2012.000000"
      4) "2014.000000"
   4) 1) "Everton F.C."
      2) "Gylfi \xc3\x9e\xc3\xb3r Sigur\xc3\xb0sson"
      3) "2017.000000"
      4) "NULL"
   5) 1) "Burnley F.C."
      2) "J\xc3\xb3hann Berg Gu\xc3\xb0mundsson"
      3) "2016.000000"
      4) "NULL"
2) 1) "Query internal execution time: 11.935366 milliseconds"
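The same query can be issued from Python, where the escaped bytes in the transcript decode back into readable names. The sketch below assumes the reply has the shape shown above (a result set whose first entry is the header, followed by a statistics entry); other RedisGraph releases may shape the reply differently. The client argument is any object with an execute_command method, such as a redis-py connection, and the function names are hypothetical.

```python
def decode(value):
    """Decode a raw reply value to str, mapping the "NULL" marker to None."""
    text = value.decode("utf-8") if isinstance(value, bytes) else str(value)
    return None if text == "NULL" else text


def rows_from_reply(reply):
    """Convert a GRAPH.QUERY reply into a list of dicts keyed by column name."""
    result_set = reply[0]  # reply[1] holds the execution statistics
    header = [decode(column) for column in result_set[0]]
    return [dict(zip(header, (decode(v) for v in row))) for row in result_set[1:]]


def run_query(client, graph, query):
    """Send a Cypher query to the given graph and return the decoded rows."""
    return rows_from_reply(client.execute_command("GRAPH.QUERY", graph, query))


# One row from the transcript above, as raw bytes.
sample_reply = [
    [
        [b"t.name", b"p.name", b"a.since", b"a.until"],
        [b"Everton F.C.", b"Gylfi \xc3\x9e\xc3\xb3r Sigur\xc3\xb0sson",
         b"2017.000000", b"NULL"],
    ],
    [b"Query internal execution time: 11.935366 milliseconds"],
]
rows = rows_from_reply(sample_reply)
```

Mapping "NULL" to None makes it easy to spot spells that are still ongoing, which is what the missing until value means for current players.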
Using a similar query we can get the graph of all German midfielders in the Premier League:
127.0.0.1:56379> GRAPH.QUERY football "MATCH (t:SportsTeam)-[a:athlete]->(p:Person) WHERE t.memberOf = 'Premier League' AND a.position = 'MF' AND p.nationality = 'Germany' RETURN t.name,p.name,a.number"
And last, but not least — who played for both Liverpool and Manchester United?
127.0.0.1:56379> GRAPH.QUERY football "MATCH (t:SportsTeam)<-[:athlete]-(c:Person)-[:athlete]->(t2:SportsTeam) WHERE t.name = 'Liverpool F.C.' and t2.name = 'Manchester United F.C.' RETURN c.name,t.name,t2.name"
1) 1) 1) "c.name"
      2) "t.name"
      3) "t2.name"
   2) 1) "Peter Andrew Beardsley"
      2) "Liverpool F.C."
      3) "Manchester United F.C."
   3) 1) "Paul Emerson Carlyle Ince"
      2) "Liverpool F.C."
      3) "Manchester United F.C."
   4) 1) "Michael James Owen"
      2) "Liverpool F.C."
      3) "Manchester United F.C."
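The remaining question from the introduction, who played for Liverpool in 2012, comes down to filtering career relations by their since and until properties. Below is a sketch of how that could look: the helper names are hypothetical, we have not run this query against the graph, and whether IS NULL is available depends on the RedisGraph version. The spell_covers function shows the same rule in plain Python, in case the filtering has to happen on the client side.

```python
def squad_in_year_query(team_name, year):
    """Build a Cypher query for everyone whose spell at a team covers a year."""
    year = int(year)
    team = team_name.replace("\\", "\\\\").replace("'", "\\'")
    return (
        "MATCH (p:Person)-[a:athlete]->(t:SportsTeam) "
        f"WHERE t.name = '{team}' AND a.since <= {year} "
        # A missing until means the player is still at the club.
        f"AND (a.until >= {year} OR a.until IS NULL) "
        "RETURN p.name, a.since, a.until"
    )


def spell_covers(since, until, year):
    """Return True when a since/until spell includes the given year."""
    return since <= year and (until is None or until >= year)


query = squad_in_year_query("Liverpool F.C.", 2012)
print(query)
```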
Summary
By parsing article content and using infoboxes as a data source, we managed to build a fairly extensive football knowledge graph (with a focus on the Premier League and Serie A) that includes:
- 3133 football teams
- 4760 football players
And yes, we’re even able to get the current team squads.
Next step: use Futhead’s huge database to further improve our graph.