Jean-Michel Garcia on unsplash

Exploring Power Laws with Neo4j

Nathan Smith
Neo4j Developer Blog
6 min readApr 27, 2020

--

Chapter 18 of Networks, Crowds, and Markets by David Easley and Jon Kleinberg covers Power Laws and Rich-Get-Richer Phenomena. The authors describe how a simple model of individual decision making can lead to a network where a few nodes have many more connections than the rest. The distribution of node degree follows a pattern called a power law that is found many times in the real world, from the populations of cities to the diameters of dust devils on Mars.

To see how a power law distribution might arise, Easley and Kleinberg provide this algorithm based on links among web pages:

  1. Pages are created in order, and named 1, 2, 3, . . . , N.
  2. When page j is created, it produces a link to an earlier Web page according to the following probabilistic rule (which is controlled by a single number p between 0 and 1).
    (a) With probability p, page j chooses a page i uniformly at random from among all earlier pages, and creates a link to this page i.
    (b) With probability 1−p, page j instead chooses a page i uniformly at random from among all earlier pages, and creates a link to the page that i points to.
    (c) This describes the creation of a single link from page j; one can repeat this process to create multiple, independently generated links from page j. (However, to keep things simple, we will suppose that each page creates just one outbound link.)

We can execute this algorithm in Neo4j to see a power law in action.

First, start a free Neo4j sandbox. When you log in, choose “New Project,” and then “Blank Sandbox.” Next click “Launch Sandbox.” After your sandbox starts, choose “Launch Browser.”

We’ll seed our simulation with two pages that point to each other.

CREATE (p1:Page {pageId:1})-[:LINKS_TO]->(p2:Page {pageId:2}),
(p2)-[:LINKS_TO]->(p1)
RETURN *

The algorithm has a parameter called p that represents the probability that a page will link to either a randomly selected node or to the page linked to from a randomly selected node. You can experiment with different values for the p. Execute this code in Neo4j browser to set the value to 0.5.

:param p => 0.5

Now we’ll use a Cypher statement that creates a new page, selects a random previously created page, and links to either the selected random page or the target of the random page’s outbound link.

//find the maximum prior page id
MATCH (p:Page)
WITH max(p.pageId) AS maxId
//create a new node with a page id one higher than the previous max
CREATE (newPage:Page {pageId:maxId + 1})
WITH newPage
//choose one existing page and its outbound target at random
MATCH (source:Page)-[:LINKS_TO]->(target:Page)
WITH newPage, source, target,
RAND() AS randSort,
RAND() AS randProb
ORDER BY randSort
LIMIT 1
//with probability $p link to either source or target
WITH newPage, source, target,
CASE WHEN randProb < $p THEN source ELSE target END AS linkTo, randSort, randProb
MERGE (newPage)-[:LINKS_TO]->(linkTo)
RETURN *

In the code block above, we call Cypher’s RAND function twice to generate two random variables. The first random variable randSort is used to sort the results of our (source:Page)-[:LINKS_TO]->(target:Page) pattern search in random order. The LIMIT 1 clause gives us the first result from this randomly sorted list.

The second random variable randProb is compared with the value of the parameter p. If randProb is less than p, we create a link from the newly created page to the randomly selected existing page (called source in our pattern search). If randProb is greater than or equal to p, we create a link from the newly created page to the target of the randomly selected existing page’s outbound link.

Run the code multiple times to create new page nodes. You will probably notice that page 1 and page 2 are coming up frequently as link targets for your new nodes.

The apoc library has a charmingly-named function called apoc.periodic.rock_n_roll that will allow us to repeat query many times. The query below will run the code 990 times to create new nodes and links.

CALL apoc.periodic.rock_n_roll(
"UNWIND RANGE(1,990) AS times RETURN times",
"MATCH (p:Page)
WITH MAX(p.pageId) AS maxId
CREATE (newPage:Page {pageId:maxId + 1})
WITH newPage
MATCH (source:Page)-[:LINKS_TO]->(target:Page)
WITH newPage, source, target, RAND() AS randSort, RAND() AS randProb
ORDER BY randSort LIMIT 1
WITH newPage, source, target, CASE WHEN randProb < .5 THEN source ELSE target END AS linkTo, randSort, randProb
MERGE (newPage)-[:LINKS_TO]->(linkTo)",
100)

Let’s return the first 100 page nodes created and look at them visually.

MATCH (p:Page) WHERE p.pageId < 100 RETURN p

You can see that one central page has many more incoming links than the others.

Run this query to see the distribution of incoming degrees.

MATCH (target:Page)
OPTIONAL MATCH (source:Page)-[:LINKS_TO]->(target)
WITH target, COUNT(source) AS inLinks
ORDER BY target.pageId
RETURN inLinks,
COUNT(*) AS nodeCount,
COLLECT(target.pageId) AS nodeList
ORDER BY inLinks

In my simulation, one node ended up with 108 inbound links. About 66% of the nodes ended up with no inbound links at all. I exported my results to CSV to plot them.

Here’s a chart showing the number of number of nodes with each in-degree.

It’s pretty hard to see much on those axes. I ran the model for 5,000 nodes and plotted the results with the x and y axes on a log scale. You can see that the result starts to fall along a straight line. That’s what we would expect in a power law distribution. A log-log plot like this is one way to check if our data is approximating a power law distribution.

We can also visualize the data to show the number of nodes having at least a given in-degree. You might have seen a chart like this in discussions of the long tail of online sales.

At this point, you can delete all the nodes or start a new sandbox and run the experiment again with a different value for p. The numbers will change each time you try the experiment, but if you include enough nodes the overall shape of the power law distribution will be recognizable.

--

--

Nathan Smith
Neo4j Developer Blog

Senior Data Scientist at Neo4j. Organizer of the Kansas City Graph Databases Meetup.