Finding the best tennis players of all time using Weighted PageRank

In the latest release of the Neo4j Graph Algorithms library we added support for the weighted variant to the PageRank algorithm.

I’m a big tennis fan and my colleague Ryan shared a paper titled Who Is the Best Player Ever? A Complex Network Analysis of the History of Professional Tennis which uses a variant of PageRank, so I was keen to take the algorithm for a spin.

I got myself ready to do some screen scraping to get the data, but it turns out that Kevin Lin has done all the hard work and put all the results for matches up until the end of 2017 as CSV files in the atp-world-tour-tennis-data Github repository.

Thanks Kevin!

Import the data

Before we import anything, we’ll create some constraints so that we don’t end accidentally import duplicate players:

CREATE constraint on (p:Player)
ASSERT p.id is unique;
CREATE constraint on (m:Match)
ASSERT m.id is unique;

Now we’re ready to import the data into Neo4j. We need to copy the CSV files that Kevin created into the import folder of our Neo4j installation.

Once we’ve done that we can import the data into Neo4j using Cypher’s LOAD CSV command:

LOAD CSV FROM "file:///match_scores_1968-1990_UNINDEXED.csv" AS row
MERGE (winner:Player {id: row[8]}) 
ON CREATE SET winner.name = row[7]
MERGE (loser:Player {id: row[11]}) 
ON CREATE SET loser.name = row[10]
MERGE (m:Match {id: row[22]})
SET m.score = row[15], m.year = toInteger(split(row[0], "-")[0])
MERGE (m)-[w:WINNER]->(winner) SET w.seed = toInteger(row[13])
MERGE (m)-[l:LOSER]->(loser) SET l.seed = toInteger(row[14]);
LOAD CSV FROM "file:///match_scores_1991-2016_UNINDEXED.csv" AS row
MERGE (winner:Player {id: row[8]})
ON CREATE SET winner.name = row[7]
MERGE (loser:Player {id: row[11]})
ON CREATE SET loser.name = row[10]
MERGE (m:Match {id: row[22]})
SET m.score = row[15], m.year = toInteger(split(row[0], "-")[0])
MERGE (m)-[w:WINNER]->(winner) SET w.seed = toInteger(row[13])
MERGE (m)-[l:LOSER]->(loser) SET l.seed = toInteger(row[14]);
LOAD CSV FROM "file:///match_scores_2017_UNINDEXED.csv" AS row
MERGE (winner:Player {id: row[8]}) 
ON CREATE SET winner.name = row[7]
MERGE (loser:Player {id: row[11]}) 
ON CREATE SET loser.name = row[10]
MERGE (m:Match {id: row[22]})
SET m.score = row[15], m.year = toInteger(split(row[0], "-")[0])
MERGE (m)-[w:WINNER]->(winner) SET w.seed = toInteger(row[13])
MERGE (m)-[l:LOSER]->(loser) SET l.seed = toInteger(row[14]);

Our schema is very simple. We can run the following query to see a visual representation:

CALL db.schema()

And here we go:

Cool, looks good. Before we continue let’s write a query so that we can quickly see what the data looks like in graph form:

MATCH p=()<-[:LOSER]-()-[r:WINNER]->() 
RETURN p
LIMIT 25

Biggest winners

Now let’s write a query to find the players that have the most wins:

MATCH (p:Player)
WITH p,
size((p)<-[:WINNER]-()) AS wins,
size((p)<-[:LOSER]-()) as defeats
RETURN p.name, wins, defeats,
CASE WHEN wins+defeats = 0 THEN 0
ELSE (wins * 100.0) / (wins + defeats) END
AS percentageWins
ORDER BY wins DESC
LIMIT 10

If we run that query we’ll see this output:

If you follow tennis you’ll probably recognise most of the names on that list. Most of them are considered amongst the best players of all time, but just counting the number of wins doesn’t seem very satisfying.

It’s time to try something a bit fancier! Enter the PageRank algorithm…

Building a projected credibility graph

The way that the PageRank algorithm works is that nodes receive credibility from their incoming relationship. For example, in graph of web , one web page gives credibility to another by linking to it. The amount of credibility it gives can be determined by a weight property on that relationship.

In our tennis graph credibility between players can be modelled based on the number of wins that players have in matches between each other. For example the following query finds out how many times Federer and Nadal have beaten each other:

MATCH (p1:Player {name: "Roger Federer"}),
(p2:Player {name: "Rafael Nadal"})
RETURN p1.name, p2.name,
size((p1)<-[:WINNER]-()-[:LOSER]->(p2)) AS p1Wins,
size((p1)<-[:LOSER]-()-[:WINNER]->(p2)) AS p2Wins

If we run that query we’ll see this output:

Our projected graph should have relationships between Federer and Nadal with a weight equal to the number of times they beat each other. The weight of the relationship from Federer to Nadal will be 23 as Federer is giving Nadal that much credibility for those 23 wins. The weight of the relationship from Nadal to Federer will be 15.

We can write the following query to project this graph:

MATCH (p1)<-[:WINNER]-(match)-[:LOSER]->(p2) 
WHERE p1.name IN ["Roger Federer", "Rafael Nadal"]
AND p2.name IN ["Roger Federer", "Rafael Nadal"]
RETURN p2.name AS source, p1.name AS target, count(*) as weight
LIMIT 10

If we run that query we’ll see this output:

Now all we need to do is remove the WHERE clause from the query and we have a generic query that we can use across our whole graph.

Finding the best tennis player ever using weighted PageRank

Now we’re ready to call the PageRank algorithm. We want to use the weighted variant of the algorithm, which means that we’ll need to pass the weightProperty parameter. By default the PageRank algorithm assumes it’s running the non-weighted variant.

The following query runs weighted PageRank over the whole graph:

CALL algo.pageRank.stream(
"MATCH (p:Player) RETURN id(p) AS id",
"MATCH (p1)<-[:WINNER]-(match)-[:LOSER]->(p2)
RETURN id(p2) AS source, id(p1) AS target, count(*) as weight
",
{graph:"cypher", weightProperty: "weight"})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name AS player, score
ORDER BY score DESC
LIMIT 10

If we run that query we’ll see this output:

We see a different order at the top of the rankings than in Filippo Radicchi’s paper. The main difference is that Federer, Nadal, and Djokovic have pushed up into the top 5 of all time. Radicchi’s analysis only covers up to 2010 and these 3 players have been extremely dominant in the 8 years since then, so perhaps it’s not all that surprising.

We can simulate that by only including matches from 2010 and earlier. The following query does this:

CALL algo.pageRank.stream(
"MATCH (p:Player) RETURN id(p) AS id",
"MATCH (p1)<-[:WINNER]-(match)-[:LOSER]->(p2)
WHERE match.year <= $year
RETURN id(p2) AS source, id(p1) AS target, count(*) as weight
",
{graph:"cypher", weightProperty: "weight", params: {year: 2010}})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name AS player, score
ORDER BY score DESC
LIMIT 10

And we’ll get this output:

Notice in this query that we’re passing in the year as a parameter to the Cypher projection query via the params key in our config map.

Our top 2 are now the same as Radicchi although he has Federer down in 7th position rather than in the top 3. Nadal and Djokovic have dropped out of the top 10 in our version.

We can also find the PageRank for matches within one calendar year. The following query calculates the PageRank for 2017:

CALL algo.pageRank.stream(
"MATCH (p:Player) RETURN id(p) AS id",
"MATCH (p1)<-[:WINNER]-(match)-[:LOSER]->(p2)
WHERE match.year = $year
RETURN id(p2) AS source, id(p1) AS target, count(*) as weight
",
{graph:"cypher", weightProperty: "weight", params: {year: 2017}})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name AS player, score
ORDER BY score DESC
LIMIT 10

If we run that query we’ll see the following output:

Below are the ATP World Tour end of year rankings for 2017:

The order of the players isn’t exactly the same, but the PageRank rankings aren’t that different to the official ones. One difference is that the official rankings give different weighting for the performance in different tournaments, whereas with our PageRank approach each win has the same weight.

I look forward to seeing what graph analytics problems you’re now able to solve with the addition of weights for this algorithm. Let me know what you come up with — devrel@neo4j.com

Enjoy!