The making of a network chart

sweemeng
Neo4j Developer Blog
4 min read · Jun 14, 2018

This post explains how we took the investigative journalism data from the Sinar Project and imported it into Neo4j for further analysis.

Prequel

Note: This work is actually a few years old, but I never showed how we built the graph, how we organized the data, and why our method works for this.

Image export of query from neo4j

The graph is one of the more interesting data structures available to us software developers. Graph theory began with the question of whether a person can cross all the bridges of Königsberg exactly once, which Euler solved in 1736 (by the way, you can’t).

Graphs are used for path-finding problems, which is why shortest-path algorithms are best expressed with a graph. But it turns out that relationships between people can also be represented well by graphs! For example, below are the people directly involved in the company “1MDB”.

Showing relationship between 1mdb and people

The revelation

Enough theory. Back to the present (or actually 3 years back; yes, I procrastinated). I worked on an API for the Sinar Project that stores the connections between people and organizations in Malaysia. It is built on a standard called Popolo.

Here is an example API call for that data.

An example query to the API

As you can see from the query results from our API, the memberships field holds a set of string ids. Those are foreign keys to membership objects.

Query in memberships

Within the membership object there is a foreign key to an organization and another to a person. As you can see, this represents a relationship. In graph theory terms the membership is an edge, and the person and the organization are nodes. You can represent this as a graph!

If you follow each person or organization, it can go on and on, because a person has memberships in multiple organizations and each organization has many members. That’s one reason why representing this data as a graph is very useful.
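To make the foreign-key-to-edge idea concrete, here is a minimal sketch in plain Python. The records are made up, and the field names `person_id` and `organization_id` are taken from the Popolo Membership class; the actual API response may nest or name them differently, so treat this as an illustration rather than the loader's code.

```python
from collections import defaultdict

# Hypothetical membership records. Field names follow the Popolo
# Membership class (person_id, organization_id); the ids are invented.
memberships = [
    {"id": "m1", "person_id": "p1", "organization_id": "o1"},
    {"id": "m2", "person_id": "p1", "organization_id": "o2"},
    {"id": "m3", "person_id": "p2", "organization_id": "o1"},
]


def build_edges(memberships):
    """Turn each membership into a (person, organization) edge."""
    return [(m["person_id"], m["organization_id"]) for m in memberships]


def build_adjacency(edges):
    """Undirected adjacency map: every node points to its neighbours."""
    adjacency = defaultdict(set)
    for person, organization in edges:
        adjacency[person].add(organization)
        adjacency[organization].add(person)
    return adjacency


edges = build_edges(memberships)
adjacency = build_adjacency(edges)

# People who share an organization with p1: hop person -> org -> person.
co_members = {
    other
    for org in adjacency["p1"]
    for other in adjacency[org]
    if other != "p1"
}
print(co_members)  # {'p2'}
```

Following the hops further (person, organization, person, organization, and so on) is exactly the "it can go on and on" traversal described above.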

The hack

My favorite way to work with a graph is via a tool called Neo4j. NetworkX can work too, but we have at least 4552 people, 3475 memberships and 351 organizations, and working with that much data in scripts can get complicated quickly.

One reason I like Neo4j is that it has an interactive query workbench. It uses Cypher, a very expressive query language for graphs. With NetworkX we needed to write different Python scripts or functions for different queries, which is fine if we have fixed questions. A query language is more useful when we are exploring data.
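As a rough illustration of what "exploring with a query language" looks like, here is a sketch that asks which members of one organization also belong to other organizations. The schema is an assumption: the labels `Person` and `Organization`, the relationship type `MEMBER_OF` and the `name` property are hypothetical and may not match what the loader actually creates. The function only relies on a py2neo-style `run(cypher, **params)` method.

```python
# Hypothetical schema: (:Person)-[:MEMBER_OF]->(:Organization), each with a
# `name` property. The real loader may use different labels and types.
CO_MEMBERS_QUERY = """
MATCH (p:Person)-[:MEMBER_OF]->(o:Organization {name: $org_name})
MATCH (p)-[:MEMBER_OF]->(other:Organization)
WHERE other <> o
RETURN p.name AS person, collect(DISTINCT other.name) AS other_orgs
ORDER BY person
"""


def people_with_other_memberships(graph, org_name):
    """Run the query and return {person: [other organizations]}.

    `graph` is anything with a py2neo-style `run(cypher, **params)` method
    whose result iterates over records indexable by column name.
    """
    result = graph.run(CO_MEMBERS_QUERY, org_name=org_name)
    return {record["person"]: list(record["other_orgs"]) for record in result}


# Usage against a local database (connection details are placeholders):
#   from py2neo import Graph
#   graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
#   print(people_with_other_memberships(graph, "1MDB"))
```

Changing the question means editing the Cypher text in the workbench, not writing a new traversal function, which is the advantage when the questions are not known in advance.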

The catch is that our data was not stored in Neo4j to begin with. We needed to massage the data into a form that Neo4j could import. You can get the Python script at https://github.com/sinar/popit_relationship.

We use py2neo because it provides an abstraction that I am comfortable with. Data is fetched using the requests library.

Before we start, here are a few things to know:

  1. There are 4 types of objects stored in popit that are relevant here: Organization, Person, Post and Membership.
  2. All the data is accessible via the HTTP JSON API: https://api.popit.sinarproject.org/docs
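Fetching every object of one type could look roughly like the sketch below. It assumes each list endpoint is paginated and returns a JSON object with a `results` list and a `next` URL; that envelope is an assumption, not something confirmed here, so check the API docs for the real shape and the real endpoint paths. The `get_json` parameter is a hypothetical hook that makes the function easy to try without network access.

```python
def fetch_json(url):
    """Fetch one page as JSON with the requests library (imported lazily)."""
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_all(url, get_json=fetch_json):
    """Yield every object from a paginated list endpoint.

    Assumes each page looks like {"results": [...], "next": <url or None>};
    adjust the keys if the API uses a different envelope.
    """
    while url:
        page = get_json(url)
        for item in page.get("results", []):
            yield item
        url = page.get("next")


# Usage: pass the list URL for persons, organizations, posts or
# memberships as given in the API docs.
#   for membership in fetch_all(memberships_url):
#       ...
```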

The process is essentially:

  1. Person and Organization are each stored as a node.
  2. Post is also stored as a node; a post is a position such as CEO or CFO.
  3. Membership is stored as a relationship with labels.
  4. Most of the objects in popit have a start_date and an end_date. These are stored as timestamps on both nodes and relationships.
  5. Each node and relationship has a popit_id property, which is the id of the object stored in popit. It is useful for debugging.

How data from popit is represented in neo4j
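The mapping above can be sketched as plain data transformations, before anything touches Neo4j. This is not the code from popit_relationship: the helper names are hypothetical, the dates are assumed to be full `YYYY-MM-DD` strings, and deriving the relationship type from the membership's `role` (falling back to `MEMBER_OF`) is only one guess at what "a relationship with labels" means.

```python
from datetime import datetime, timezone


def to_timestamp(date_string):
    """Convert 'YYYY-MM-DD' to a UTC Unix timestamp, or None if absent."""
    if not date_string:
        return None
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def to_node(label, obj):
    """Describe a Person, Organization or Post as a node."""
    return {
        "label": label,
        "properties": {
            "popit_id": obj["id"],  # kept for debugging against popit
            "start_date": to_timestamp(obj.get("start_date")),
            "end_date": to_timestamp(obj.get("end_date")),
        },
    }


def to_relationship(membership):
    """Describe a Membership as a person -> organization relationship."""
    # Assumption: use the role as the relationship type when there is one.
    role = membership.get("role") or "MEMBER_OF"
    return {
        "type": role.upper().replace(" ", "_"),
        "start_popit_id": membership["person_id"],
        "end_popit_id": membership["organization_id"],
        "properties": {
            "popit_id": membership["id"],
            "start_date": to_timestamp(membership.get("start_date")),
            "end_date": to_timestamp(membership.get("end_date")),
        },
    }


example = to_relationship({
    "id": "m1",
    "person_id": "p1",
    "organization_id": "o1",
    "role": "Board Member",
    "start_date": "1970-01-02",
})
print(example["type"], example["properties"]["start_date"])  # BOARD_MEMBER 86400
```

Storing dates as integer timestamps keeps range comparisons in queries simple, at the cost of losing partial dates such as a bare year.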

In the end

There is a point to this exercise. As the research done by the Sinar Project shows, having a way to visualize relationships is very useful when investigating transparency issues. The 1MDB example shows that some directors sit on the boards of multiple organizations involved. That is not a smoking gun for corruption, but it is a strong indicator of it.

What we want to show is:

  1. If you have a foreign key in a database, it can often be modeled as a relationship.
  2. A graph database can be expressive and, more importantly, it can visualize relationships between objects.
  3. Tools like Neo4j can have a strong impact on real-world issues.
  4. Not every problem is similar to government transparency, so pick and choose the parts of what we did here that fit yours.

The things we can do better in the future are:

  1. Standardize on a naming convention for relationships.
  2. Post should be just a label on a relationship.
  3. Querying time information: it would be nice to see the evolution of an issue.

Sources:

Research on data and 1MDB at https://sinarproject.org/transparency/research-notes/uncovering-1mdb-with-popit-open-data

Source code of data loader is at https://github.com/sinar/popit_relationship

The popit “People and Org API” is publicly accessible at https://api.popit.sinarproject.org/docs

The popit source code is at https://github.com/sinar/popit_ng

sweemeng

Codemonkey coding for tea. Startup developer by day, community organizer by night. I also happen to be on Patreon: https://www.patreon.com/sweemeng