Panama Papers meet Knowledge graph

JJ Zhang
Published in Analytics Vidhya
11 min read · Mar 31, 2020

Keywords: knowledge graph, network analysis, panama papers, Python.

This blog is a gentle walk-through of applying a knowledge graph approach to the Panama papers dataset. I start with a brief introduction to the Panama papers and knowledge graphs, then perform some basic analysis of the Panama papers database, followed by centrality and component analysis and a case study of the third largest component. These analyses show that Chinese officers are among Mossack Fonseca’s most important clients. About 80% of the nodes (roughly 455,000) are connected in a single Mossack Fonseca network. The third largest component reveals how Monaco acts as a tax haven whose most prominent clients are in Italy.

If anyone fancies any of my statistical graphs/slides, download and use them as you like, but please cite this blog/my github link. Many thanks and hope you enjoy reading. If you have any questions about the results, please feel free to send me a message via LinkedIn/Twitter.

1. Introduction of Panama papers

The “Panama papers” refer to the leaked documents of the Panamanian law firm Mossack Fonseca, detailing financial and attorney-client information for more than 214,000 offshore entities. To date, at least 140 offshore entities have been connected to politicians or public officials, with direct links found to 12 national leaders, leading to the resignation of Iceland’s prime minister and widespread protests.

For more about the story, there are two good summaries from the Guardian, which has written comprehensive reports on the Panama papers.

2. Introduction of knowledge graph, ontology and network

First, I would like to discuss these terms which are often associated with each other, and sometimes used interchangeably: Knowledge graph, Ontology and Network. I’ll work with the following definitions:

Ontology has two disparate meanings, according to the Oxford dictionary:

In its uncountable form, ontology is the branch of philosophy that deals with existence, i.e., to be or not to be.

In its countable form, an ontology is a list of concepts and categories in a subject area that shows the relationships between them.

Naturally, the latter form is the main focus here. To put it more simply, ontology is a form of the representation of interlinked entities (including abstract concepts).

Next, let’s take a look at knowledge graphs. Below is a timeline of the development of the knowledge graph.

Figure explaining the development of knowledge graph-related concepts, from the 1960s until now.

Knowledge graphs can be envisaged as large networks of entities, their semantic types, as well as properties, and relationships between entities.

Thus, the relationship between these three terms could be summarized as follows:

A network is the representation form of a knowledge graph, while a knowledge graph is the application form of an ontology.

Nonetheless, a wide variety of definitions and taxonomies has been published. Several nice intro-blogs are listed below. For clarity, I will stick with the term knowledge graph.

3. Overview of Panama papers as knowledge graph network

Analysing the Panama papers as a knowledge graph helped reveal how the super-rich avoid tax using offshore entities. ICIJ conducted much of its analysis with Neo4j. I am a Python person, so I use the open-source Python libraries networkx and graph-tool to analyse the Panama papers. To acquire the original database from ICIJ, please click this link.

Data structure of the Panama papers knowledge graph. Circles with filled colors are nodes; the arrows pointing from one node to another are edges. (Source: https://guides.neo4j.com/sandbox/icij-paradise-papers/datashape.html)

An abstract illustration of the data structure used to represent the Panama papers is shown above. Nodes represent related concepts and categories, drawn as circles; edges represent relationships between nodes, drawn as arrows. There are 5 different kinds of nodes (each shown in a unique color), and 33 different kinds of edges between these nodes. Here are the descriptions of the 5 node types, adapted from github.com/REDeLapp/Panama-Papers-Network-Analysis:

3.1 Nodes

  • Entity (offshore)
    Company, trust or fund created in a low-tax, offshore jurisdiction by an agent.
  • Officer
    Person or company who plays a role in an offshore entity.
  • Intermediary
    A go-between for someone seeking an offshore corporation and an offshore service provider — usually a law-firm or a middleman that asks an offshore service provider to create an offshore firm for a client.
  • Address
    A contact postal address as it appears in the original databases obtained by ICIJ.
  • Others
    Unknown/unlabeled nodes.
Fig of overview of nodes. The pie chart on the left side shows the proportion of different kinds of nodes found in the Panama papers. On the right side is a bar chart of the top 10 nodes with the most edges connected to them. The same color code is applied in both charts.

As a basic overview before any fancy network analysis, I plotted a pie chart to understand the proportion of different kinds of nodes. As shown in the figure above, the pie chart on the left reveals that “officer” and “entity” nodes are most common, addresses are about half as frequent, and “intermediary” nodes are about a twentieth as frequent as officers. However, prevalence is not the most important property of these nodes; the number of edges connected to a node better captures its relative importance, i.e., its centrality. Accordingly, I calculated the number of edges attached to each node and listed the top-ranked nodes on the right side along with their edge counts. Surprisingly, all of these are “intermediary” nodes. This indicates the role of intermediaries as hubs in the Panama papers knowledge graph.
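The degree ranking described above can be sketched with networkx roughly as follows. This is a minimal illustration, not the original analysis code: the toy edge list and all node names are made up, and the real graph would be built from the ICIJ database files.

```python
import networkx as nx

# Hypothetical toy edges (source, target); the real data come from the ICIJ files.
edges = [
    ("intermediary_A", "entity_1"),
    ("intermediary_A", "entity_2"),
    ("intermediary_A", "entity_3"),
    ("intermediary_A", "entity_4"),
    ("officer_X", "entity_1"),
    ("officer_X", "address_P"),
    ("entity_1", "address_P"),
]

G = nx.DiGraph()
G.add_edges_from(edges)


def top_nodes_by_degree(graph, k=10):
    """Return the k nodes with the most attached edges (in-degree + out-degree)."""
    return sorted(graph.degree(), key=lambda pair: pair[1], reverse=True)[:k]


top = top_nodes_by_degree(G, k=3)
for node, degree in top:
    print(node, degree)
```

In this toy graph the intermediary comes out on top, mirroring the hub role that intermediaries play in the real data.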

Figure of prevalence of countries found in the Panama papers. Most nodes carry a country annotation; this bar chart shows the countries with the most annotated nodes.

Second, I wanted to know which territories are most represented across all the nodes in the Panama papers dataset. In the bar chart above, Hong Kong ranks first, with China and Switzerland second and third. Jersey, one of the Channel Islands, and Panama come fourth and fifth respectively. Interpreting this in a broader context, we see some of the ‘usual suspects’: Hong Kong is one of the leading financial centres, and its proximity to China helps explain its top position. China has the second largest economy in the world behind the US, and was the country with the most Mossack Fonseca offices, likely owing to its restricted capital market. Switzerland, long considered the “grandfather of bank secrecy”, has been one of the largest offshore financial centres and tax havens in the world since the mid-20th century. Both Jersey and Panama are recognized as leading offshore financial centres, and have an international reputation as tax havens.

Figure of the number of categorized nodes by country and type. Each bar represents the number of nodes of each type found for the top 5 countries.

Based on the previous result, it is worthwhile to know which types of nodes are most prevalent in these top countries. The figure above shows that among the top 5 countries, offshore entities are the prevalent node type everywhere except China. China ranks second through the contributions of addresses and officers.
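The per-country and per-type counts behind these bar charts can be reproduced with a simple tally. The sketch below uses only the standard library; the record layout and the field names `type` and `country` are assumptions for illustration, not the actual column names of the ICIJ database.

```python
from collections import Counter

# Hypothetical node records; field names "type" and "country" are assumptions.
nodes = [
    {"id": 1, "type": "entity", "country": "Hong Kong"},
    {"id": 2, "type": "entity", "country": "Hong Kong"},
    {"id": 3, "type": "officer", "country": "China"},
    {"id": 4, "type": "address", "country": "China"},
    {"id": 5, "type": "entity", "country": "Switzerland"},
    {"id": 6, "type": "officer", "country": None},  # some nodes carry no country
]

# Count nodes per country, skipping nodes without a country annotation.
per_country = Counter(n["country"] for n in nodes if n["country"])

# Count nodes per (country, type) pair for the breakdown by node type.
per_country_type = Counter(
    (n["country"], n["type"]) for n in nodes if n["country"]
)

for country, total in per_country.most_common(5):
    breakdown = {
        kind: count
        for (ctry, kind), count in per_country_type.items()
        if ctry == country
    }
    print(country, total, breakdown)
```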

3.2 Edges

Enough preliminary analysis of the nodes; now let’s look at the edges. Although there are 33 different kinds of edges, 99.9% of all edges belong to just four types: “shareholder of”, “intermediary of”, “registered address” and “beneficiary of”. Details are shown in the bar chart below.

Top 5 edge types found in the Panama papers.
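A statement such as "the top few edge types cover almost all edges" can be checked by counting edge labels and summing the largest counts. The sketch below assumes the edge type is stored in an edge attribute called `rel_type`; that name and the toy edges are illustrative assumptions.

```python
from collections import Counter

import networkx as nx

G = nx.MultiDiGraph()
# Hypothetical edges; the attribute name "rel_type" is an assumption.
G.add_edge("officer_X", "entity_1", rel_type="shareholder of")
G.add_edge("officer_Y", "entity_1", rel_type="shareholder of")
G.add_edge("intermediary_A", "entity_1", rel_type="intermediary of")
G.add_edge("entity_1", "address_P", rel_type="registered address")
G.add_edge("officer_X", "entity_2", rel_type="director of")

# Tally how many edges carry each type label.
type_counts = Counter(rel for _, _, rel in G.edges(data="rel_type"))


def top_share(counts, k):
    """Fraction of all edges that belong to the k most common edge types."""
    total = sum(counts.values())
    return sum(count for _, count in counts.most_common(k)) / total


print(type_counts.most_common())
print(top_share(type_counts, 1))
```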

3.3 Component analysis

After analysing nodes and edges separately, my next step is a joint analysis of how edges connect different nodes: are all nodes connected with each other by these edges? In more technical terms, this calls for a component analysis. Here I use networkx functions to conduct component analysis on the 559,600 unique nodes and 674,102 edges extracted from the Panama papers. The following is a list of what I found.

167 components were found, each was composed of 1 node.
3807 components were found, each was composed of 2 nodes.
1483 components were found, each was composed of 3 nodes.
1158 components were found, each was composed of 4 nodes.
738 components were found, each was composed of 5 nodes.
755 components were found, each was composed of 6 nodes.
435 components were found, each was composed of 7 nodes.
383 components were found, each was composed of 8 nodes.
255 components were found, each was composed of 9 nodes.

In total, 11,043 independent components were identified. The list above gives the number of components for each component size, for sizes below 10 nodes. One of the benefits of component analysis is that it helps to remove uninformative nodes from the database. For example, both the isolated zero-degree nodes (the one-node components, whose nodes have no edges pointing to or from them) and the two-node components (a single edge connecting exactly two degree-one nodes) are not very informative, so they could be excluded from future investigations.
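A component analysis of this kind can be sketched with networkx as below. The toy graph is made up, and the exact calls used in the original analysis are not shown in this post, so treat this as one possible way to do it. For a directed graph, `nx.weakly_connected_components` plays the same role as `nx.connected_components` does here.

```python
from collections import Counter

import networkx as nx

G = nx.Graph()
# Hypothetical toy graph: one 4-node component, one 2-node component, one isolated node.
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")])
G.add_node("g")

# Each component is a set of nodes that are reachable from one another.
components = list(nx.connected_components(G))
size_histogram = Counter(len(comp) for comp in components)

for size in sorted(size_histogram):
    print(f"{size_histogram[size]} component(s) found with {size} node(s) each.")


def drop_small_components(graph, min_size=3):
    """Return a copy of the graph without components smaller than min_size."""
    keep = set()
    for comp in nx.connected_components(graph):
        if len(comp) >= min_size:
            keep |= comp
    return graph.subgraph(keep).copy()


pruned = drop_small_components(G, min_size=3)
print(sorted(pruned.nodes()))
```

With `min_size=3`, the isolated node and the two-node component are removed, which corresponds to discarding the uninformative small components mentioned above.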

Fig of small components composed of no more than 10 nodes. The nodes within each component are shown in a unique color, with the number of nodes drawn next to each component. The legend for nodes and edges is shown at the top of the figure. Country abbreviations are annotated on top of each node, and the type of each edge is annotated along the edge. This figure was created with graph-tool.

Another benefit of component analysis is that it provides a better angle from which to inspect the larger network. I was curious to know how nodes were connected in these small components and whether they were worth investigating further. I randomly chose an example from each of 8 different size categories of components, and drew the components using graph-tool, as shown in the figure above.

Some interesting connection patterns, such as “intermediary” → “entity” → “address” ← “officer”, are observed. Quite often, the first two were connected via the same officer as well. This reveals a simple way to avoid tax: officers hire intermediary middlemen or law firms to establish offshore entities to hide their money, and the link between officers and their offshore entities can be revealed by their links to the same address. Another intriguing phenomenon is the high degree of the “intermediary” nodes. This indicates the importance of intermediary nodes as the hubs of the Panama papers, so one can focus on them when investigating money flows and hidden connections further.

Figure of distribution of 175 kinds of components.

Now let’s take a look at the components which have more than 10 nodes. The figure above shows the size distribution of all components found in the Panama papers. Interestingly, the largest component is composed of 455,479 nodes, which is 81% of all nodes in the Panama papers knowledge graph. The remaining 19% of nodes form components which vary widely in size. We assume that the largest component is of the most general importance. Nonetheless, some interesting results can be revealed by focusing on the smaller components.

3.4 Summary

559,600 unique nodes, 674,102 edges, and 11,043 independent components.

The previous analysis of nodes, edges and components captures the huge operational network organized by Mossack Fonseca to avoid tax, establish offshore entities and so on. The network is so large that it is not possible to extract all of this knowledge from the papers manually. However, by building a knowledge graph of the Panama papers, it is possible to save all the properties of the nodes and the interactions between them for later queries. One could always focus on one officer, one entity, or one intermediary and conduct the analysis accordingly; Neo4j, for example, uses an SQL-like query language (Cypher) to extract nodes and edges with specific properties at the user’s request, and then depicts a small local network.
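In Python, a comparable "local view" around a single node can be obtained with networkx's `ego_graph`, which returns the subgraph within a given number of hops of a chosen node. The node names and the `rel_type` attribute below are hypothetical; this is a sketch of the idea rather than an equivalent of a Neo4j query.

```python
import networkx as nx

G = nx.DiGraph()
# Hypothetical toy edges; the attribute name "rel_type" is an assumption.
G.add_edge("intermediary_A", "entity_1", rel_type="intermediary of")
G.add_edge("officer_X", "entity_1", rel_type="shareholder of")
G.add_edge("entity_1", "address_P", rel_type="registered address")
G.add_edge("officer_Y", "entity_2", rel_type="shareholder of")

# Everything within two hops of officer_X, ignoring edge direction for the search.
local = nx.ego_graph(G, "officer_X", radius=2, undirected=True)

for source, target, rel in local.edges(data="rel_type"):
    print(f"{source} -[{rel}]-> {target}")
```

Setting `undirected=True` matters here: without it, only nodes reachable by following the arrows outward from `officer_X` would be included, and the intermediary would be missed.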

Overall, there are many different ways to depict and inspect the knowledge graph. The most important steps are preprocessing the database, understanding the data structure, and choosing a methodology that suits the question or problem at hand.

4. Something fancy

Figure of the third largest component, which contains 730 nodes. Looks a bit like a virus, right?

The graph shown here is drawn with the Fruchterman-Reingold spring-block layout. This component is the third largest, with 730 nodes. There is only one intermediary node, but it has the highest degree, i.e. the most connections with the remaining nodes. The patterns observed support the result of the component analysis, that intermediaries are the hubs of the Panama papers knowledge graph. Some new patterns were observed as well: clusters of officers are formed by their common link to either the same offshore entity or the same address, which indicates that these officers could be family members or business partners.
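Picking out the third largest component and its highest-degree node can be sketched as follows. The helper names `kth_largest_component` and `hub` are my own, and the toy graph only mimics the structure described above.

```python
import networkx as nx


def kth_largest_component(graph, k):
    """Return the k-th largest connected component (k=1 is the largest) as a subgraph."""
    comps = sorted(nx.connected_components(graph), key=len, reverse=True)
    return graph.subgraph(comps[k - 1]).copy()


def hub(graph):
    """Return the node with the highest degree."""
    return max(graph.degree(), key=lambda pair: pair[1])[0]


G = nx.Graph()
G.add_edges_from([
    # component of 5 nodes
    ("a1", "a2"), ("a2", "a3"), ("a3", "a4"), ("a4", "a5"),
    # component of 4 nodes
    ("b1", "b2"), ("b2", "b3"), ("b3", "b4"),
    # component of 3 nodes: one intermediary linked to two entities
    ("intermediary_M", "entity_1"), ("intermediary_M", "entity_2"),
])

third = kth_largest_component(G, 3)
print(sorted(third.nodes()), hub(third))
```

For drawing, node positions for such a component could then be computed with, for example, `nx.spring_layout`, which implements the Fruchterman-Reingold algorithm; the figures in this post were drawn with graph-tool.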

Nonetheless, the convoluted connections in the Fruchterman-Reingold layout make it hard to evaluate every cluster and node. Thus I tried another layout: a radial tree with the intermediary node in the center:

Figure of the third largest component, which contains 730 nodes, drawn with a radial tree layout.
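The idea behind a radial view is to place the hub at the center and arrange the other nodes on rings according to their hop distance from it. The sketch below is a simplified concentric placement, not graph-tool's radial tree layout: it spaces the nodes of each ring evenly and does not try to keep children close to their parents. All node names are hypothetical.

```python
import math
from collections import defaultdict

import networkx as nx


def radial_levels(graph, root):
    """Group nodes by hop distance from root; each group forms one ring."""
    depth = nx.single_source_shortest_path_length(graph, root)
    rings = defaultdict(list)
    for node, distance in depth.items():
        rings[distance].append(node)
    return dict(rings)


def radial_positions(graph, root):
    """Place root at the origin and ring d on a circle of radius d."""
    positions = {}
    for distance, members in radial_levels(graph, root).items():
        for index, node in enumerate(sorted(members)):
            angle = 2 * math.pi * index / len(members)
            positions[node] = (distance * math.cos(angle), distance * math.sin(angle))
    return positions


G = nx.Graph()
G.add_edges_from([
    ("intermediary_M", "entity_1"),
    ("intermediary_M", "entity_2"),
    ("entity_1", "officer_X"),
    ("entity_1", "officer_Y"),
])

rings = radial_levels(G, "intermediary_M")
pos = radial_positions(G, "intermediary_M")
print(rings)
```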

It seems that the majority of the nodes and connections fit into the following patterns:

Many officers → an address

Many officers → an entity

Many offshore entities → an intermediary

Now we know the pattern of how the super-rich hide their money via intermediaries and offshore entities. It would also be interesting to know where most of these entities are located, and where these super-rich clients come from.

Figure of the connection map between different countries. The same color code was applied here.

In this figure, I drew connections between the nodes based on their “country” labels and projected them onto an orthographic map. From this connection map we can see that Monacan intermediary firms facilitate the establishment of offshore entities in Monaco for Italian, Cypriot, Czech, British, Spanish, Swiss, American and Salvadoran super-rich clients.
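The country-level connections behind such a map can be derived by collapsing every node into its country and counting the links that cross borders. The sketch below assumes each node has a `country` attribute; the attribute name, the helper `country_graph` and the toy data are illustrative assumptions, and the map projection itself is left out.

```python
import networkx as nx

G = nx.DiGraph()
# Hypothetical nodes; the attribute name "country" is an assumption.
G.add_node("intermediary_A", country="Monaco")
G.add_node("entity_1", country="Monaco")
G.add_node("officer_X", country="Italy")
G.add_node("officer_Y", country="Italy")
G.add_node("officer_Z", country="Spain")
G.add_edges_from([
    ("intermediary_A", "entity_1"),
    ("officer_X", "entity_1"),
    ("officer_Y", "entity_1"),
    ("officer_Z", "entity_1"),
])


def country_graph(graph):
    """Collapse nodes into countries; edge weight = number of cross-country links."""
    collapsed = nx.Graph()
    for source, target in graph.edges():
        c_source = graph.nodes[source].get("country")
        c_target = graph.nodes[target].get("country")
        if c_source is None or c_target is None or c_source == c_target:
            continue  # skip unlabeled nodes and links within one country
        if collapsed.has_edge(c_source, c_target):
            collapsed[c_source][c_target]["weight"] += 1
        else:
            collapsed.add_edge(c_source, c_target, weight=1)
    return collapsed


C = country_graph(G)
for a, b, weight in C.edges(data="weight"):
    print(a, b, weight)
```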

Personal suggestions for related work

  • Which term to use: ontology, knowledge graph and network?

I personally believe that the differences between these are rather nuanced in practice. Nevertheless, here are my suggestions if you’re wondering which to use. For a CV or interview, ontology/knowledge graph is recommended; typically this is what recruiters will be looking for, and it captures more directly the standard application of these methods: bringing order to complex data. When searching for technical resources, though, such as open-source programming libraries or research papers, if no satisfactory results come up with the former two keywords, try replacing the keyword with “network” for better luck.

  • How to start a network analysis/knowledge graph study?

For data scientists, pre-processing the database is often more important than choosing the ‘right’ tool or algorithm for later analysis. It is important to understand the data structure of the problem before conducting any analysis. For example, before I started the analysis here, I first read reports about the Panama papers, then collected the database provided by the ICIJ and searched for published analyses of the same database. Third, I decided which aspects of the Panama papers interested me most and focused on those.

Finally, I started my analysis of the Panama papers, focusing on a particular aspect. My usual strategy was performing analysis on the nodes first, then the edges, before a component analysis, depicting and polishing the network. I found this strategy very intuitive, as I try my best to correlate the statistical results with the broader context of these relationships. Helpful visualisations aid communication between data scientists and non-data scientists, and can help to incorporate expert knowledge into the graph analytical workflow by stating the assumptions and findings thus far clearly.

Thank you for reading, hope you enjoyed it. :)
