Turning wiki content into a graph for football fans

Football Wiki is an extensive database of everything a football fan could be interested in. It describes teams, players, managers, cup finals and important tournaments.

Maciej Brencz
Fandom Engineering
6 min read · Feb 15, 2019

While it is a great source of information about a given football player or team, this wiki (like wikis in general) is a poor fit when we want:

  • to find out which players played for both Liverpool and Manchester United,
  • to list all midfielders in the Premier League born in Germany,
  • to check who played for Liverpool in 2012.

Wikis are “just” content. Interlinked, but still just text (or wikitext, to be precise). To answer the above questions, we need to understand the content and model it properly.

Graph of all midfielders playing in the Premier League, with transfer data.

Understanding the content of the wiki

At first glance, article content may seem to be just a blob of text. However, wikitext provides features like links and templates that give the content a bit of structure. Links model relations between articles (the “Ole Gunnar Solskjær” article links to “Manchester United” and “Molde FK”). Far more data, however, can be extracted from templates.

Templates as a structured data source

Infobox for Ole Gunnar “The Baby-faced Assassin” Solskjær

Infoboxes, a specific type of template, provide a set of properties for a given article. In the world of football these can be:

  • biodata — date and place of birth, nationality,
  • player height and position on the field,
  • club foundation date, stadium capacity,
  • list of clubs a given person played for or managed.

All the properties for an infobox are provided in an article’s wikitext:

Wikitext that renders the infobox above

To avoid error-prone parsing of template invocations in raw wikitext, we implemented a simple API end-point that provides a JSON-formatted tree of all templates used by a given article, together with parameter names and values as understood by MediaWiki’s wikitext parser:

From raw wikitext via JSON metadata to structured data
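As a rough illustration of consuming such a response, the sketch below picks out the parameters of a given template. The JSON layout is an assumption: the keys `templates`, `name` and `parameters`, and the parameter names `fullname` and `clubs`, are hypothetical and may differ from what the real end-point returns.

```python
import json

# Hypothetical response shape. The real end-point's JSON layout is not shown
# in this article, so every key and parameter name below is an assumption.
SAMPLE_RESPONSE = json.loads("""
{
  "templates": [
    {
      "name": "Infobox Biography",
      "parameters": {
        "fullname": "Ole Gunnar Solskjær",
        "clubs": "[[Molde FK]], [[Manchester United]]"
      }
    }
  ]
}
""")


def find_templates(tree, template_name):
    """Return the parameter dict of every invocation of `template_name`."""
    return [
        template["parameters"]
        for template in tree.get("templates", [])
        if template["name"] == template_name
    ]


infoboxes = find_templates(SAMPLE_RESPONSE, "Infobox Biography")
print(infoboxes[0]["fullname"])
```

Because the parameter values arrive already split by the wikitext parser, the script never has to deal with pipes, nested braces or whitespace rules itself.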

Representing the data

Different infoboxes are used to model different types of entities. We assumed the following:

  • Infobox Biography template describes a person (either a player or a manager, or even both),
  • Infobox Club template describes a team,
  • Fs player template is used inside the team’s article content to connect teams with players (i.e. set up a relation for the current team squad).

As we parse the list of parameters passed to templates we assume that:

  • parameters with links form a relation (e.g. player played in this club, the club has this person as a manager),
  • parameters with plain values are the source of data for properties (club foundation year, birth date, player’s height).
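The two rules above can be sketched with a small helper that looks for wikilinks in each parameter value. This is a minimal sketch, not the script we actually ran: the parameter names `clubs` and `height` and the sample values are made up for illustration.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section]] wikilinks,
# capturing only the target article title.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def split_parameters(parameters):
    """Split template parameters into relations (linked) and properties (plain)."""
    relations, properties = {}, {}
    for name, value in parameters.items():
        targets = [target.strip() for target in WIKILINK.findall(value)]
        if targets:
            # At least one link: treat the parameter as a relation to other articles.
            relations[name] = targets
        else:
            # No links: keep the raw value as a property of the entity.
            properties[name] = value.strip()
    return relations, properties


relations, properties = split_parameters({
    "clubs": "[[Molde FK]], [[Manchester United|United]]",  # illustrative values
    "height": "1.78 m",
})
print(relations)
print(properties)
```

Note that a parameter mixing links with plain text ends up entirely on the relation side here; a real import would likely need per-parameter handling for such cases.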

schema.org types

We plan to use more than just a single wiki to build a wiki-powered knowledge database. Hence, we do not want to rely on a specific naming convention used by each of the wikis.

That’s why we model the collected data using industry-standard schema.org types and properties:

Later on, this approach will allow us to expose structured data and links between entities in the HTML of rendered wiki articles.
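To make the idea concrete, here is a hedged sketch of such a mapping. The schema.org names on the right (`name`, `birthDate`, `birthPlace`, `nationality`, `height`) are real Person properties; the infobox parameter names on the left are assumptions and will differ between wikis, which is exactly why the mapping lives in one place.

```python
# Left: assumed infobox parameter names (hypothetical, wiki-specific).
# Right: schema.org Person property names.
PERSON_MAPPING = {
    "fullname": "name",
    "dateofbirth": "birthDate",
    "cityofbirth": "birthPlace",
    "nationality": "nationality",
    "height": "height",
}


def to_schema_org(schema_type, parameters, mapping):
    """Build a JSON-LD style dict, keeping only parameters the mapping knows."""
    node = {"@context": "https://schema.org", "@type": schema_type}
    for wiki_name, schema_name in mapping.items():
        if wiki_name in parameters:
            node[schema_name] = parameters[wiki_name]
    return node


person = to_schema_org(
    "Person",
    {"fullname": "Ole Gunnar Solskjær", "nationality": "Norway", "caption": "ignored"},
    PERSON_MAPPING,
)
print(person)
```

Supporting another wiki then means writing a new mapping table rather than touching the code that builds the graph.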

Building a graph

We want to store not only the properties of all football players, managers and teams, but also the relations between them:

  • a player’s career is described as an [:athlete] relation (with since and until properties) to a SportsTeam node,
  • a team’s current squad is described as an [:athlete] relation to a Person node. The player’s position on the field and squad number are stored in the relation’s properties.

To store such a data set, we decided to use a graph database. For this project we picked RedisGraph, a Redis module that adds graph support.
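As an illustration of what a career relation could look like when written to the graph, the sketch below renders one as a Cypher CREATE statement. The helper names are hypothetical and the `name` property is an assumption. A single CREATE like this makes fresh nodes on every run, so a real import would need to reuse existing nodes instead.

```python
def cypher_literal(value):
    """Render a Python value as a Cypher literal (numbers bare, strings quoted)."""
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def cypher_properties(properties):
    """Render a dict as a Cypher property map, e.g. {since: 1996, until: 2007}."""
    inner = ", ".join(f"{key}: {cypher_literal(val)}" for key, val in properties.items())
    return "{" + inner + "}"


def career_statement(person, team, since, until):
    """Build a CREATE statement for Person -[:athlete]-> SportsTeam."""
    return (
        f"CREATE (:Person {cypher_properties({'name': person})})"
        f"-[:athlete {cypher_properties({'since': since, 'until': until})}]->"
        f"(:SportsTeam {cypher_properties({'name': team})})"
    )


# Illustrative input only.
print(career_statement("Ole Gunnar Solskjær", "Manchester United", 1996, 2007))
```

Keeping since and until on the relation rather than on either node is what later lets a query ask who played for a club in a given year.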

Modeling nodes and relations + examples

A Python script was used to do the following:

  • get the list of all football players and teams from Football Wiki,
  • fetch the JSON tree with the list of templates used by each of these articles,
  • using the specific templates described above, build models that have both bio properties (name, birth date, nationality, and height) and relations (athlete and coach).
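The first step could be done through the standard MediaWiki API, for example by listing the members of a category. This is one possible approach rather than a description of our script: the API URL and the category name `Players` are placeholders, and `fetch_json` is a hypothetical callable (any function that takes a URL and returns the decoded JSON).

```python
from urllib.parse import urlencode


def category_members_url(api_url, category, continue_token=None):
    """Build a MediaWiki API URL listing articles in the given category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": "500",
        "format": "json",
    }
    if continue_token:
        # Continuation token returned by the previous page of results.
        params["cmcontinue"] = continue_token
    return f"{api_url}?{urlencode(params)}"


def list_articles(api_url, category, fetch_json):
    """Yield article titles, following cmcontinue until the list is exhausted."""
    token = None
    while True:
        data = fetch_json(category_members_url(api_url, category, token))
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        token = data.get("continue", {}).get("cmcontinue")
        if not token:
            break
```

Passing the fetch function in keeps the paging logic independent of the HTTP client, which also makes it easy to exercise without touching the network.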

Here’s the example for Ole Gunnar Solskjær:

And here’s how we modeled the Manchester United article:

Querying the database

Now let’s get back to the questions we wanted to ask of Football Wiki.

RedisGraph can be queried using Cypher, a query language originally designed for the Neo4j database. We want to ask it for all teams (t nodes) that are linked to persons (p nodes) by the athlete relation. Additionally, teams are limited to members of the Premier League, and we’re only interested in Icelandic players:
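A query along these lines might look like the sketch below, wrapped in a small Python helper that sends it through RedisGraph’s GRAPH.QUERY command. The property names `nationality` and `memberOf`, their values, and the graph name used in the usage note are assumptions about how the graph is laid out.

```python
# Assumed property names: `nationality` on Person nodes and `memberOf` on
# SportsTeam nodes. The actual graph may model league membership differently.
ICELANDERS_IN_PREMIER_LEAGUE = """
MATCH (p:Person)-[a:athlete]->(t:SportsTeam)
WHERE p.nationality = 'Iceland' AND t.memberOf = 'Premier League'
RETURN p.name, t.name, a.since, a.until
""".strip()


def run_query(client, graph_name, query):
    """Send a Cypher query to RedisGraph through its GRAPH.QUERY command.

    `client` is expected to be a redis-py connection (redis.Redis); any object
    with a compatible execute_command() method works as well.
    """
    return client.execute_command("GRAPH.QUERY", graph_name, query)
```

With redis-py installed and a RedisGraph instance running, a call such as `run_query(redis.Redis(), "football", ICELANDERS_IN_PREMIER_LEAGUE)` would return the matching rows, where `football` is a placeholder graph name.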

Using a similar query we can get the graph of all German midfielders in the Premier League:

German midfielders in the Premier League (see an interactive graph, powered by Alchemy.js)

And last, but not least — who played for both Liverpool and Manchester United?
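In graph terms this is a person with athlete relations to two different teams. The sketch below shows one way to phrase that pattern in Cypher, plus a plain-Python equivalent over (player, team) pairs to make the logic explicit. The `name` property is an assumption, and the player names in the sample data are placeholders, not real results.

```python
# Cypher pattern: one Person with athlete relations to both teams.
# The `name` property on SportsTeam and Person nodes is an assumption.
BOTH_TEAMS_QUERY = """
MATCH (t1:SportsTeam)<-[:athlete]-(p:Person)-[:athlete]->(t2:SportsTeam)
WHERE t1.name = 'Liverpool' AND t2.name = 'Manchester United'
RETURN p.name
""".strip()


def played_for_both(career_edges, team_a, team_b):
    """Return players (sorted) whose career edges include both teams."""
    teams_by_player = {}
    for player, team in career_edges:
        teams_by_player.setdefault(player, set()).add(team)
    wanted = {team_a, team_b}
    return sorted(
        player for player, teams in teams_by_player.items() if wanted <= teams
    )


# Placeholder data for illustration only.
edges = [
    ("Player A", "Liverpool"),
    ("Player A", "Manchester United"),
    ("Player B", "Liverpool"),
    ("Player C", "Manchester United"),
    ("Player C", "Molde FK"),
]
print(played_for_both(edges, "Liverpool", "Manchester United"))
```

The graph database does the same set intersection, but by walking relations from the two team nodes instead of scanning every edge.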

Summary

By parsing article content and using infoboxes as a data source, we managed to build a fairly extensive football knowledge graph (with a focus on the Premier League and Serie A) that includes:

  • 3133 football teams
  • 4760 football players

And yes, we’re even able to get the current team squads.

Next step: use Futhead’s huge database to further improve our graph.
