Turning wiki content into a graph for football fans

Football Wiki is an extensive database of everything a football fan might be interested in. It describes teams, players, managers, cup finals and important tournaments.

Published in Fandom Engineering, Feb 15, 2019

While it is a great source of information about a given football player or team, this wiki (like wikis in general) does not work well when we want:

to find out which players played for both Liverpool and Manchester United,
to list all midfielders in Premier League born in Germany,
to check who played for Liverpool in 2012.

Wikis are “just” content: interlinked, but still just text (or wikitext, to be precise). To answer the above questions, we need to understand the content and model it properly.

Graph of all midfielders playing in the Premier League, with transfer data.

Understanding the content of the wiki

At first glance, article content may seem to be just a blob of text. However, wikitext provides features such as links and templates that give the content some structure. Links model relations between articles (the “Ole Gunnar Solskjær” article links to “Manchester United” and “Molde FK”). Far more data, however, can be extracted from templates.

Templates as a structured data source

Infobox for Ole Gunnar “The Baby-faced Assassin” Solskjær

Infoboxes, a specific type of template, provide a set of properties for a given article. In the world of football these can be:

biodata — date and place of birth, nationality,
player height and position on the field,
club foundation date, stadium capacity,
list of clubs a given person played for or managed.

All the properties for an infobox are provided in an article’s wikitext:

To avoid error-prone parsing of template invocations in wikitext, we implemented a simple API end-point that provides us with a JSON-formatted tree of all templates used by a given article, together with parameter names and values as understood by MediaWiki’s wikitext parser:

From raw wikitext via JSON metadata to structured data
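As a rough illustration of how such a tree might be consumed, here is a minimal Python sketch. The response shape (a "templates" list whose items carry "name" and "parameters") and the parameter names are assumptions made for this example; the real end-point's format is not shown here.

```python
import json

# Hypothetical response for one article. Only the values (name, height,
# club) come from the article; keys and layout are assumed for illustration.
SAMPLE = """
{
  "title": "Ole Gunnar Solskjær",
  "templates": [
    {
      "name": "Infobox Biography",
      "parameters": {
        "fullname": "Ole Gunnar Solskjær",
        "height": "1.78",
        "clubs1": "[[Molde FK]]"
      }
    }
  ]
}
"""


def find_template(tree, name):
    """Return the parameters of the first template called `name`, or None."""
    wanted = name.strip().lower()
    for template in tree.get("templates", []):
        if template.get("name", "").strip().lower() == wanted:
            return template.get("parameters", {})
    return None


tree = json.loads(SAMPLE)
params = find_template(tree, "Infobox Biography")
print(params)
```

Matching template names case-insensitively is a design choice of this sketch, meant to tolerate small differences in how editors spell a template name.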

Representing the data

Different infoboxes are used to model different types of entities. We assumed the following:

Infobox Biography template describes a person (either a player or a manager, or even both),
Infobox Club template describes a team,
Fs player template is used inside the team’s article content to connect teams with players (i.e. set up a relation for the current team squad).

As we parse the list of parameters passed to templates we assume that:

parameters with links form a relation (e.g. player played in this club, the club has this person as a manager),
parameters with plain values are the source of data for properties (club foundation year, birth date, player’s height).
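The two assumptions above can be sketched in a few lines of Python. This is only an approximation: the regular expression handles plain and piped wikilinks, and the parameter names are hypothetical.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section|label]];
# only the link target is captured.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def split_parameters(params):
    """Split template parameters into relations (values containing links)
    and properties (plain values)."""
    relations, properties = {}, {}
    for key, value in params.items():
        targets = [target.strip() for target in WIKILINK.findall(value)]
        if targets:
            relations[key] = targets
        else:
            properties[key] = value.strip()
    return relations, properties


# Hypothetical parameter names; the values follow the examples in the text.
relations, properties = split_parameters({
    "clubs1": "[[Molde FK]]",
    "clubs2": "[[Manchester United F.C.|Manchester United]]",
    "height": " 1.78 ",
})
print(relations)
print(properties)
```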

schema.org types

We plan to use more than just a single wiki to build a wiki-powered knowledge database. Hence, we do not want to rely on a specific naming convention used by each of the wikis.

That’s why we model the collected data using industry-standard schema.org types and properties:

https://schema.org/Person models players and managers
https://schema.org/SportsTeam models football clubs

Later on, this approach will allow us to expose structured data and links between entities in the HTML of rendered wiki articles.
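To make the idea concrete, the sketch below maps the infobox templates to schema.org types and serializes one entity as JSON-LD, one common way of embedding schema.org data in HTML. The function name and the choice of JSON-LD are assumptions for illustration, not a description of the actual implementation.

```python
import json

# Template name -> schema.org type, following the assumptions listed earlier.
TEMPLATE_TYPES = {
    "Infobox Biography": "Person",
    "Infobox Club": "SportsTeam",
}


def to_json_ld(template_name, properties):
    """Build a JSON-LD document for an entity described by an infobox."""
    doc = {
        "@context": "https://schema.org",
        "@type": TEMPLATE_TYPES[template_name],
    }
    doc.update(properties)
    return doc


doc = to_json_ld("Infobox Biography", {
    "name": "Ole Gunnar Solskjær",
    "birthDate": "1973",
    "nationality": "Norway",
})
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Keeping the template-to-type mapping in one dictionary means another wiki with differently named infoboxes would only need new entries in that mapping.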

Building a graph

We want to store not only the properties of all football players, managers and teams, but also the relations between them:

a player's career is described as an [:athlete] relation (with since and until properties) from the Person node to a SportsTeam node,
a team’s current squad is described as an [:athlete] relation from the SportsTeam node to a Person node; the player's position on the field and squad number are stored in the relation's properties.

To store such a data set, we decided to use a graph database. For our project, we picked RedisGraph, a Redis module that adds graph support.

RedisGraph, a graph database module for Redis (oss.redislabs.com)
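As an illustration of how a career relation could be written to the graph, the sketch below only builds a Cypher statement as a string. The helper names, the lookup of nodes by name and the string-escaping rules are assumptions; the actual import script may work quite differently.

```python
def literal(value):
    """Render a Python value as a Cypher literal (numbers or quoted strings)."""
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def career_edge_query(person, team, kind, **props):
    """Build a statement linking an existing Person node to an existing
    SportsTeam node, e.g. with an athlete or coach relation."""
    prop_text = ", ".join(
        f"{key}: {literal(value)}" for key, value in sorted(props.items())
    )
    edge = f"[:{kind} {{{prop_text}}}]" if prop_text else f"[:{kind}]"
    return (
        "MATCH (p:Person), (t:SportsTeam) "
        f"WHERE p.name = {literal(person)} AND t.name = {literal(team)} "
        f"CREATE (p)-{edge}->(t)"
    )


query = career_edge_query(
    "Ole Gunnar Solskjær", "Manchester United F.C.", "athlete",
    since=1996, until=2007,
)
print(query)
```

The resulting string could then be sent with the GRAPH.QUERY command. The squad relation points the other way (from SportsTeam to Person) and would need a mirrored helper.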

Modeling nodes and relations + examples

A Python script was used to do the following:

get the list of all football players and teams from Football Wiki,
fetch the JSON tree with the list of templates used by each of the above articles,
using the specific templates described above, build models that have both bio properties (name, birth date, nationality, and height) and relations (athlete and coach).

Here’s the example for Ole Gunnar Solskjær:

<PersonModel https://schema.org/Person (Ole_Gunnar_Solskjr:Person) name = "Ole Gunnar Solskjær", birthDate = "1973", birthPlace = "Kristiansund", nationality = "Norway", height = "1.78">
    --[:athlete {"until": 1994, "since": 1990}]->(Clausenengen_FK:SportsTeam)
    --[:athlete {"until": 1996, "since": 1994}]->(Molde_FK:SportsTeam)
    --[:athlete {"until": 2007, "since": 1996}]->(Manchester_United_F_C:SportsTeam)
    --[:coach {"until": 2011, "since": 2008}]->(Manchester_United_F_C_Reserves_and_Academy:SportsTeam)
    --[:coach {"until": 2014, "since": 2011}]->(Molde_FK:SportsTeam)
    --[:coach {"until": 2014, "since": 2014}]->(Cardiff_City_F_C:SportsTeam)
    --[:coach {"until": 2016, "since": 2014}]->(Clausenengen_FK:SportsTeam)
    --[:coach {"until": 2018, "since": 2015}]->(Molde_FK:SportsTeam)
    --[:coach {"until": 2019, "since": 2018}]->(Manchester_United_F_C:SportsTeam)

And here’s how we modeled the Manchester United article:

<SportsTeamModel https://schema.org/SportsTeam (Manchester_United_F_C:SportsTeam) name = "Manchester United F.C.", sport = "Football", foundingDate = "1878", ground = "Old Trafford", memberOf = "Premier League", url = "http://www.manutd.com/">
	--[:coach ]->(Ole_Gunnar_Solskjr:Person)
	--[:athlete {"number": 1, "position": "GK"}]->(David_de_Gea:Person)
	--[:athlete {"number": 2, "position": "DF"}]->(Victor_Lindelf:Person)
	--[:athlete {"number": 3, "position": "DF"}]->(Eric_Bailly:Person)
	--[:athlete {"number": 4, "position": "DF"}]->(Phil_Jones_born_1992:Person)
	--[:athlete {"number": 6, "position": "MF"}]->(Paul_Pogba:Person)
	--[:athlete {"number": 7, "position": "FW"}]->(Alexis_Snchez:Person)
	--[:athlete {"number": 8, "position": "MF"}]->(Juan_Mata:Person)
	--[:athlete {"number": 9, "position": "FW"}]->(Romelu_Lukaku:Person)
	--[:athlete {"number": 10, "position": "FW"}]->(Marcus_Rashford:Person)
	--[:athlete {"number": 11, "position": "FW"}]->(Anthony_Martial:Person)
	--[:athlete {"number": 12, "position": "DF"}]->(Chris_Smalling:Person)
	--[:athlete {"number": 13, "position": "GK"}]->(Lee_Grant_born_1983:Person)
	--[:athlete {"number": 14, "position": "MF"}]->(Jesse_Lingard:Person)
	--[:athlete {"number": 15, "position": "MF"}]->(Andreas_Pereira:Person)
	--[:athlete {"number": 16, "position": "DF"}]->(Marcos_Rojo:Person)
	--[:athlete {"number": 17, "position": "MF"}]->(Fred_born_1993:Person)
	--[:athlete {"number": 18, "position": "DF"}]->(Ashley_Young:Person)
	--[:athlete {"number": 20, "position": "DF"}]->(Diogo_Dalot:Person)
	--[:athlete {"number": 21, "position": "MF"}]->(Ander_Herrera:Person)
	--[:athlete {"number": 22, "position": "GK"}]->(Sergio_Romero:Person)
	--[:athlete {"number": 23, "position": "DF"}]->(Luke_Shaw:Person)
	--[:athlete {"number": 25, "position": "DF"}]->(Antonio_Valencia:Person)
	--[:athlete {"number": 27, "position": "MF"}]->(Marouane_Fellaini:Person)
	--[:athlete {"number": 31, "position": "MF"}]->(Nemanja_Mati:Person)
	--[:athlete {"number": 36, "position": "DF"}]->(Matteo_Darmian:Person)
	--[:athlete {"number": 39, "position": "MF"}]->(Scott_McTominay:Person)
	--[:athlete {"number": 24, "position": "DF"}]->(Timothy_Fosu_Mensah:Person)
	--[:athlete {"number": 38, "position": "DF"}]->(Axel_Tuanzebe:Person)
	--[:athlete {"number": 40, "position": "GK"}]->(Joel_Castro_Pereira:Person)
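The node identifiers in both dumps look as if they were derived from article titles by dropping non-ASCII characters and replacing punctuation and spaces with underscores. The function below is a guess that reproduces the identifiers shown above; it is not taken from the original script, and the article titles used as input are assumed.

```python
import re


def node_name(title):
    """Derive an ASCII-only node identifier from an article title."""
    # Drop every non-ASCII character ("Solskjær" becomes "Solskjr").
    ascii_only = title.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of anything but letters and digits into one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_")


for title in (
    "Ole Gunnar Solskjær",
    "Manchester United F.C.",
    "Phil Jones (born 1992)",
    "Timothy Fosu-Mensah",
):
    print(title, "->", node_name(title))
```

Dropping characters this way can make two different titles collide, so a real pipeline might prefer transliteration or a stable article ID.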

Querying the database

Now let’s get back to the questions that we wanted to ask of Football Wiki.

RedisGraph can be queried using Cypher, a query language originally designed for the Neo4j database. We ask for all teams (t nodes) that persons (p nodes) link to via the athlete relation, limit the teams to members of the Premier League, and keep only Icelandic players:

127.0.0.1:56379> GRAPH.QUERY football "MATCH (t:SportsTeam)<-[a:athlete]-(p:Person) WHERE t.memberOf = 'Premier League' AND p.nationality = 'Iceland' RETURN t.name,p.name,a.since,a.until"
1) 1) 1) "t.name"
      2) "p.name"
      3) "a.since"
      4) "a.until"
   2) 1) "Arsenal F.C."
      2) "\xc3\x93lafur Ingi Sk\xc3\xbalason"
      3) "2001.000000"
      4) "2005.000000"
   3) 1) "Tottenham Hotspur F.C."
      2) "Gylfi \xc3\x9e\xc3\xb3r Sigur\xc3\xb0sson"
      3) "2012.000000"
      4) "2014.000000"
   4) 1) "Everton F.C."
      2) "Gylfi \xc3\x9e\xc3\xb3r Sigur\xc3\xb0sson"
      3) "2017.000000"
      4) "NULL"
   5) 1) "Burnley F.C."
      2) "J\xc3\xb3hann Berg Gu\xc3\xb0mundsson"
      3) "2016.000000"
      4) "NULL"
2) 1) "Query internal execution time: 11.935366 milliseconds"

Using a similar query, we can get the graph of all German midfielders in the Premier League:

127.0.0.1:56379> GRAPH.QUERY football "MATCH (t:SportsTeam)-[a:athlete]->(p:Person) WHERE t.memberOf = 'Premier League' AND a.position = 'MF' AND p.nationality = 'Germany' RETURN t.name,p.name,a.number"

German midfielders in Premier League (see an interactive graph, powered by Alchemy.js)

And last, but not least — who played for both Liverpool and Manchester United?

127.0.0.1:56379> GRAPH.QUERY football "MATCH (t:SportsTeam)<-[:athlete]-(c:Person)-[:athlete]->(t2:SportsTeam) WHERE t.name = 'Liverpool F.C.' and t2.name = 'Manchester United F.C.'  RETURN c.name,t.name,t2.name"
1) 1) 1) "c.name"
      2) "t.name"
      3) "t2.name"
   2) 1) "Peter Andrew Beardsley"
      2) "Liverpool F.C."
      3) "Manchester United F.C."
   3) 1) "Paul Emerson Carlyle Ince"
      2) "Liverpool F.C."
      3) "Manchester United F.C."
   4) 1) "Michael James Owen"
      2) "Liverpool F.C."
      3) "Manchester United F.C."
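The third question from the introduction (who played for a club in a given year) depends on the since and until properties. The sketch below takes the reply to the Icelandic players query shown earlier, assuming a Python client returns it as nested lists of byte strings, decodes the UTF-8 names and filters the rows by year on the client side. The helper names are hypothetical, and the same filter could probably be expressed directly in Cypher.

```python
def text(value):
    """Decode a bytes cell to str; leave other values untouched."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def year(value):
    """Turn '2012.000000' into 2012 and 'NULL' into None."""
    return None if value == "NULL" else int(float(value))


def rows_from_reply(reply):
    """Convert a reply (header row first, then data rows) into dicts."""
    table = reply[0]
    header = [text(cell) for cell in table[0]]
    return [dict(zip(header, map(text, row))) for row in table[1:]]


def active_in(rows, when):
    """Return (team, player) pairs whose spell at the team covers `when`."""
    hits = []
    for row in rows:
        since, until = year(row["a.since"]), year(row["a.until"])
        if since is not None and since <= when and (until is None or until >= when):
            hits.append((row["t.name"], row["p.name"]))
    return hits


# The reply from the Icelandic players query above, as assumed byte strings.
REPLY = [
    [
        [b"t.name", b"p.name", b"a.since", b"a.until"],
        [b"Arsenal F.C.", b"\xc3\x93lafur Ingi Sk\xc3\xbalason", b"2001.000000", b"2005.000000"],
        [b"Tottenham Hotspur F.C.", b"Gylfi \xc3\x9e\xc3\xb3r Sigur\xc3\xb0sson", b"2012.000000", b"2014.000000"],
        [b"Everton F.C.", b"Gylfi \xc3\x9e\xc3\xb3r Sigur\xc3\xb0sson", b"2017.000000", b"NULL"],
        [b"Burnley F.C.", b"J\xc3\xb3hann Berg Gu\xc3\xb0mundsson", b"2016.000000", b"NULL"],
    ],
    [b"Query internal execution time: 11.935366 milliseconds"],
]

rows = rows_from_reply(REPLY)
print(active_in(rows, 2012))
```

A missing until value is treated here as "still at the club", which matches how the open-ended Everton and Burnley rows appear in the output.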

Summary

By parsing article content and using infoboxes as a data source, we managed to build a fairly extensive football knowledge graph (with a focus on the Premier League and Serie A) that includes: