Part I: Graph databases for journalists

Using Neo4j to explore public contracting data

Investigative reporting such as the stories behind the Panama Papers and Swiss Leaks are complex, involving multiple persons, roles, organizations, countries, and intricate relationships between them.

Understanding knotty connections between elements scattered across thousands or millions of documents can really be like looking for a needle in a haystack. Excel spreadsheets or relational databases are powerful tools, but they were not designed to carry out this type of explorations focused on relationships. Instead, relationships are first-class citizens of graph databases.

The goal of this tutorial is to provide the foundations needed to analyze and visualize open contracting data using a real graph database (Neo4j). The primary audience is journalists, because they are particularly hungry for new tools to explore relationships.

Neo4j

Neo4j is a popular graph database management system with an open-source community edition. Even though this is not the only graph database, Neo4j is well known for the fact that the International Consortium of Investigative Journalists (ICIJ) has made intensive use of this technology to aid investigations like the Panama Papers. Graphs have proven to be very effective to represent these networks in an intuitive way.

Open Contracting Data

Every year, governments devote gigantic amounts of money to public contracts, from papers and pencils to large infrastructure projects such as airports, roads, schools, and hospitals.

The size of public procurement represents a considerable segment of the global economy. In the case of the OECD, the volume of these transactions represents approximately 12% of GDP.

A large amount of resources along with the complex interaction between public and private interests exposes public contracting processes to multiple risks, including collusion and corruption.

The Open Contracting Data Standard (OCDS) is a global, non-proprietary data standard designed to reflect the entire contracting life-cycle. The stages of the procurement process are depicted in figure 1. The OCDS was created to increase contracting transparency and facilitate deeper analysis of contracting data by a wide range of users. This standard is now being implemented by several governments and cities around the world. Check the complete list here.

An OCDS document is made up of a number of sections. These are:release metadata — contextual information about each release of data;parties — information about the organizations and other participants involved in the contracting process;planning — information about the goals, budgets and projects a contracting process relates to;tender — information about how a tender will take place, or has taken place;awards — information on awards made as part of a contracting process;contract — information on contracts signed as part of a contracting process;implementation — information on the progress of each contract towards completion.Source: Open Contracting Data Standard

These sections are represented in a JSON document as follows:

Mexico

The Mexican economy is the 15th largest in the world, and the second in Latin America. At the same time, this developing country is the 138 most corrupt nation out of 180 countries, according to the 2018 Corruption Perceptions Index reported by Transparency International.

The irregularities in public procurement have definitely contributed to this poor rating. Recent cases, such as “La Estafa Maestra” (Master Scam), reveal that the magnitude of fraud schemes in public procurement in Mexico is counted by hundreds of millions of dollars.

Mexico was one of the pioneering countries in the adoption of the OCDS. Currently, approximately 300,000 contracting processes of the federal public administration under this standard can be accessed. A portion of these contracts will be analyzed through our graph.

Data Model

Neo4j follows the property graph model to manage the stored data.

  • The graph is made of nodes for entities, their relationships, and properties.
  • Nodes are represented using circles. A node may have zero to many labels. Labels are used to group nodes into sets.
  • Relationships are represented using arrows connecting nodes. They are directed, which means that each relationship has a start node and an end node. A relationship has only one relationship type (name).
  • Properties are used to add qualities to nodes and relationships.

In any scenario, the analyst is responsible for carefully formulating the best possible model to capture the characteristics of the phenomenon to be studied. The quality of the insights that can be extracted from the graph through queries and visualizations depends on it. The initial data model proposed to analyze open contracting data is depicted in figure 2.

The graph model includes nodes with the following labels and their respective properties mapped to the corresponding OCDS fields:

  1. A contracting process (Contract). Take into account that a contracting process may involve one or more contracts, and some are declared void.
  • ocid: It is a globally unique identifier for a contracting process. This property is matched to the ocid field in the OCDS.
  • title: This field usually provides some idea about the object of the contract. It is matched to tender/title.
  • procurementMethod: There are three types of procurement methods: open, selective and direct. It is matched to tender/procurementMethod.
  • cycle: It refers to the year in which the contracting process took place, and is matched to cycle.
  • region: Although this is a feature initially associated with the procuring entity in the OCDS, we consider it convenient to place it directly in the contracting process to have an idea of where the money is spent without resorting to other nodes. It is matched to tender/procuringEntity/address/region.
  • amount: This property is very important to obtain conclusions. However, its interpretation must be cautious because most contracts do not report this value in the tender phase in our dataset. It is matched to tender/value/amount.

2. A company or person which offers goods or services (Tenderer)

  • rfc: It is a unique tax number in Mexico and is matched to parties/id when the role is tenderer.
  • name: It is matched to parties/name when the role is tenderer.
  • person: A boolean property to distinguish people (yes ) from companies (no). It is inferred from the RFC.

3. A public entity managing the contracting process (ProcuringEntity). It may be different from the buyer who pays for products or services.

name: It is matched to tender/procuringEntity/name.

4. An address (Address)

  • address: It is matched to parties/address/streetAddress when the role is tenderer.

5. A telephone (Telephone)

  • telephone: It is matched to parties/contactPoint/telephone when the role is tenderer.

6. A contact person (ContactPerson)

  • name: It is matched to parties/contactPoint/name when the role is tenderer.

7. An email (Email)

  • email: It is matched to parties/contactPoint/email when the role is tenderer.

There are two possible relationship types to capture the roles companies and persons play in a contracting process. (TENDER) identifies all entities competing in the tender phase, and (AWARD) only those winning contracts in the process.

Other relationships capture the connection between companies and their respective contact information: (ADDRESS), (TELEPHONE), (CONTACTPERSON), and (EMAIL). Finally, (MANAGE) connect procuring entities with their respective contracts.

Installing Neo4j and creating our first graph

Download a copy of Neo4j Desktop from the Neo4j download page. Execute the downloaded .exe file and accept the default configuration. You will see a detailed explanation of the installation of Neo4j after the download.

  1. When opening Neo4j, you will find a screen similar to the following. Click on Add Graph .

2. Click on Create a Local Graph .

3. Choose a graph name and password. Then click on Create .

Populating the graph with data

The first step consists of acquiring the desired open contracting records and parsing them to nodes and relationships according to the corresponding OCDS fields mentioned early.

For doing that, some programming tools that are not of interest in this tutorial are required, and therefore they are not described here. The CSV files with nodes and relationships required for this exercise can be downloaded directly from this link.

This sample only contains 21,428 contracting processes whose procurement method is open (all interested suppliers may submit a tender) or selective (only qualified suppliers are invited to submit a tender) in years 2017 and 2018. Direct contracts ( the procuring entity contacts a number of suppliers of its choice) were excluded this time to simplify the analysis.

1. In the graph, you just created, click on Start and then on Manage.

2. Click on theOpen Folder dropdown.

3. Visit the folder called Import and copy the CSV files there.

4. Go back to Neo4j and click on Open Browser .

The following lines contain the queries required to populate graph. Copy and paste each of them in the Neo4j browser and click on thePlay button. In a later section, we will explain the structure of Cypher queries. For now, the important aspect is to know that the LOAD CSV clause loads the CSV file with the nodes or relationships that you want to create, as well as their respective properties. The MATCH clause identifies graph patterns. And the CREATE clause creates nodes and relationships.

If you have a Neo4j version that has multi-line-editor enabled, you can also paste the whole script in one go into the command line.

Otherwise and if you want to follow more closely, do it one by one.

Create nodes for (Contract)

LOAD CSV WITH HEADERS FROM 'file:///contracts.csv' AS row 
CREATE (contract:Contract{ocid:row.ocid, title:row.title, cycle:row.cycle, region:row.region, procurementMethod:row.procurementMethod, amount:row.amount})

Create nodes for (Tenderer)

LOAD CSV WITH HEADERS FROM 'file:///tenderers.csv' AS row 
CREATE (p:Tenderer {rfc:row.rfc, name:row.name, person:row.person})

Create nodes for (ProcuringEntity)

LOAD CSV WITH HEADERS FROM 'file:///procuringEntities.csv' AS row CREATE (procuringEntity:ProcuringEntity{name:(row.name)})

Create nodes for (Address)

LOAD CSV WITH HEADERS FROM 'file:///addresses.csv' AS row 
CREATE (address:Address{address:(row.address)})

Create nodes for (Telephone)

LOAD CSV WITH HEADERS FROM 'file:///telephones.csv' AS row 
CREATE (telephone:Telephone{telephone:(row.telephone)})

Create nodes for (ContactPerson)

LOAD CSV WITH HEADERS FROM 'file:///contactPersons.csv' AS row CREATE (contactPerson:ContactPerson{name:(row.name)})

Create nodes for (Email)

LOAD CSV WITH HEADERS FROM 'file:///emails.csv' AS row 
CREATE (email:Email{email:(row.email)})

Now, let’s create some indexes. These are data structures that improve the speed of lookups of nodes by property.

Create an index for (Contract:ocid)

CREATE INDEX ON :Contract(ocid)

Create an index for (Tenderer:rfc)

CREATE INDEX ON :Tenderer(rfc)

Create an index for (ProcuringEntity:name)

CREATE INDEX ON :ProcuringEntity(name)

Create an index for (Address:address)

CREATE INDEX ON :Address(address)

Create an index for (Telephone:telephone)

CREATE INDEX ON :Telephone(telephone)

Create an index for (ContactPerson:name)

CREATE INDEX ON :ContactPerson(name)

Create an index for (Email:email)

CREATE INDEX ON :Email(email)

Finally, let’s create relationships.

Create a relationship (TENDER) between (Tenderer) and (Contract)

LOAD CSV WITH HEADERS FROM 'file:///rcc.csv' AS row 
MATCH (tenderer:Tenderer{rfc:(row.rfc)}),(contract:Contract{ocid:(row.ocid)})
CREATE (tenderer)-[:TENDER]->(contract)

Create a relationship (AWARD) between (Tenderer) and (Contract)

LOAD CSV WITH HEADERS FROM 'file:///raw.csv' AS row 
MATCH (tenderer:Tenderer{rfc:(row.rfc)}),(contract:Contract{ocid:(row.ocid)})
CREATE (tenderer)-[:AWARD]->(contract)

Create a relationship (MANAGE) between (ProcuringEntity) and (Contract)

LOAD CSV WITH HEADERS FROM 'file:///rp.csv' AS row 
MATCH (procuringEntity:ProcuringEntity{name:(row.name)}),(contract:Contract{ocid:(row.ocid)})
CREATE (procuringEntity)-[:MANAGES]->(contract)

Create a relationship (ADDRESS) between (Tenderer) and (Address)

LOAD CSV WITH HEADERS FROM 'file:///ra.csv' AS row 
MATCH (tenderer:Tenderer{rfc:(row.rfc)}),(address:Address{address:(row.address)})
CREATE (tenderer)-[:ADDRESS]->(address)

Create a relationship (TELEPHONE) between (Tenderer) and (Telephone)

LOAD CSV WITH HEADERS FROM 'file:///rt.csv' AS row 
MATCH (tenderer:Tenderer { rfc:(row.rfc)}),(telephone:Telephone {telephone:(row.telephone)})
CREATE (tenderer)-[:TELEPHONE]->(telephone)

Create a relationship (CONTACTPERSON) between (Tenderer) and (ContactPerson)

LOAD CSV WITH HEADERS FROM 'file:///rc.csv' AS row 
MATCH (tenderer:Tenderer{rfc:(row.rfc)}),(contactPerson:ContactPerson{name:(row.name)})
CREATE (tenderer)-[:CONTACTPERSON]->(contactPerson)

Create a relationship (EMAIL) between (Tenderer) and (Email)

LOAD CSV WITH HEADERS FROM 'file:///re.csv' AS row 
MATCH (tenderer:Tenderer{rfc:(row.rfc)}),(email:Email{email:(row.email)})
CREATE (tenderer)-[:EMAIL]->(email)

Click on Database to see the nodes, relationships, and properties that have been created. Now, your graph is ready to be analyzed.

An Introduction to Cypher

Cypher is Neo4j’s graph query language. The objective of this section is explaining the basic concepts of Cypher through a series of basic queries. You can find a detailed explanation of this language in the following link. There is also a very useful Cypher Refcard to give an overview over all options.

The simplest queries consist of a MATCH and a RETURN clause. The MATCH clause searches for the pattern specified in it. The RETURN clause defines what would be included in the result set. Patterns are expressed using the following notation for (nodes) and -[relationships]->.

Again, copy and paste each query in the Neo4j browser and click on Play.

a. Retrieve all the tenderers in the graph.

MATCH (a:Tenderer)
RETURN a

If you select a specific node, you will see its properties at the bottom of the query page.

Double-click on a specific node and select Expand/Collapse child relationships to see connected nodes.

Note that the browser is actually displaying just a part of the entire set of nodes, with a default count of 300. You can change the default settings to modify that number. It is also possible to see and download the entire set in text format clicking on Text.

b. Retrieve all tenderers whose name is “acondicionamiento en potencia y comunicaciones sa de cv” (power conditioning and communications).

MATCH (a:Tenderer 
{name:'acondicionamiento en potencia y comunicaciones sa de cv'})
RETURN a

Only one tenderer node satisfies the condition of having a name property with the specified value. It is a company and its RFC is APC0206109QA.

c. Retrieve all tenderers whose name is “acondicionamiento en potencia y comunicaciones sa de cv” and the contracts in which it has participated as a tenderer.

MATCH (a:Tenderer 
{name:'acondicionamiento en potencia y comunicaciones sa de cv'})
-[:TENDER]->(b:Contract)
RETURN a, b

The result shows that two selective contracts , ocds-07smqs-1405253 and ocds-07smqs-1422505, are associated with the tenderer “acondicionamiento en potencia y comunicaciones sa de cv”.

d. Retrieve all tenderers whose name is “acondicionamiento en potencia y comunicaciones sa de cv”, the contracts in which it has participated as a tenderer, and other tenderers competing in these contracts.

MATCH (a:Tenderer {name: 'acondicionamiento en potencia y comunicaciones sa de cv'})-[:TENDER]->(b:Contract)<-[:TENDER]-(c:Tenderer)RETURN a,b,c

The picture shows that the company named “acondicionamiento en potencia y comunicaciones sa de cv” has tendered and won two selective contracts in the year 2017 (ocds-07smqs-1405253 and ocds-07smqs-1422505) against the same competitor, “faragauss del golfo sa de cv”.

Red flags and more complex queries

Red flags of suspicious behavior are abnormalities which require additional investigation to confirm or rule out the presence of corruption or collusion in a contract.

In 2016, Open Contracting Partnership in close collaboration with Development Gateway published the first guide to identify red flags throughout the entire procurement process: Red flags for integrity: Giving the green light to open data solutions.

It is very important to remember that the presence of a red flag is not an indisputable proof that there is an irregularity in the contract, but suggests that it is necessary to continue investigating the case.

Some of these flags involve exploring connections between the entities that participate in a contracting process. For example, business similarities between bidders (common addresses, personnel, phone numbers, etc) may suggest collusion between competitors.

Collusion involves a horizontal relationship between bidders procurement, who conspire to remove the element of competition from the process. In the normal course, independent bidders in a procurement process compete against each other to win the contract, and it is via this mechanism that best value for money for the purchaser is achieved (OCDE, 2011)

The following query retrieves all tenderers who shared contact information while competing for the same contract. The return statement includes all types of nodes. For that, an asterisk is used in the RETURN clause. Note that this query may take a few minutes to run.

MATCH (p1:Tenderer)-[:TENDER]->(p4:Contract)
<-[:TENDER]-(p3:Tenderer)
WHERE p1 <> p3
AND exists ((p1)-[:CONTACTPERSON|TELEPHONE|ADDRESS|EMAIL]->
()<-[:CONTACTPERSON|TELEPHONE|ADDRESS|EMAIL]-(p3))
RETURN *

As you can see, this query retrieves an intricate network of tenderers, contracts, and contacts. Take into account that in many of these contracting processes an official of the procuring entity appears by default as a common contact of all tenderers who participate. This does not imply any irregularity, but it makes it difficult to detect true cases of collusion.

Let’s focus our attention on one of the suspicious contracts detected in the previous query. In the selective contracting process ocds-07smqs-1520579, 3 companies tendered, two of them share the same address. The following query explores the first and second-degree relationships [*1..2] of that specific contract.

MATCH (c:Contract{ocid:'ocds-07smqs-1520579'})-[*1..2]-(p) 
RETURN *

When exploring the resulting network, we discovered that:

  • Three tenderers (“internet movil s de rl de cv”, “tecnologia de diseno integral sa de cv”, and “maysoft global s de rl de cv”) have competed in two other contracts.
  • internet movil s de rl de cv” received two awards in these common contracts.
  • The query detected that “tecnologia de diseno integral” and “maysoft global” share the same address.

But this common address is not the only coincidence among these tenderers. There are other matches that were not detected by the query, but can be identified by visually inspecting the graph.

  • Removing a prefix, “internet movil s de rl de cv” and “maysoft global s de rl de cv” share the same telephone.
  • The domain of the email address of “maysoft global s de rl de cv” is @internetmovil.com.mx, which seems more related to its competitor “internet movil s de rl de cv”.

Again, the presence of a red flag is not proof of an integrity problem, but it is a reason for future investigations.

Using external data sources

Open contracting data is a powerful tool that can be used to tackle corruption at different levels. And its potential can be even greater when it is combined with other critical related datasets to respond to increasingly complex corruption schemas. See some examples in the following link.

In Mexico, the Tax Administration Service (SAT) regularly updates a list of taxpayers which has been formally declared as shell companies.

When the tax authority detects that a company has been issuing receipts without having the assets, personnel, infrastructure or material capacity to provide the services or produce the goods that support such receipts, or that such taxpayers are not located, it will be presumed the inexistence of its operations.

7,889 taxpayers have been mentioned in the definitive list to April 2019. These companies are identified by RFC, the same property included in the Tenderer node.

By crossing the list of shell companies with tenderers, we realize that 30 of these companies participated in contracts during 2017 and 2018. The next query looks for those tenderers and their respective contracts.

MATCH (c:Tenderer)-[r:TENDER]-(p) 
WHERE c.rfc IN
['IBI120525IV9','SSC161011DS3','QAL101119IG2','PAG1207103F3','CUR1405141F4','DCE0908051U8','GVI100428IM8','CPY110413I37','BAS090909A56','ADS101126FL0','GCA110429V28','SAM1401219X5','MCC091121EI3','GCC120614SA8','CIP050715F83','EPJ1011206R1','CEU150403968','CBX150529BJ5','HIN131218B16','FAC130318H39','SMB071126NX9','SAM1411187K9','KGI151019BG4','AGI1512015H0','CAC160129NA3','TDI120709PH7','LBL1505117S1','COC101213AK9','GPS980316TI6','GRE111110I43']
RETURN *

The result of the query allows us to appreciate that some of these shell companies participated in dozens of contracts and even won some bids. There are also a couple of contracting processes (ocds-07smqs-1654909 and ocds-07smqs-1496466) in which several shell companies participated together, although they were not winners.

Another example of an external source of information that can be used is the Public Registry of Commerce. Here you can review for example who are the shareholders of the companies. The only drawback is that entering the system requires registering a username and password.

In the previous section, we saw the case of “internet movil s de rl de cv” and “maysoft global s de rl de cv”. Two companies that seem to have business similarities based on their contact information.

When looking for both companies in the Public Registry of Commerce, we realize that they have common shareholders. Again, these types of findings are not conclusive proofs of collusion, especially if the many exceptions stipulated in Mexican legislation for the assignment of selective contracts are taken into account.

But they are a good starting point to inquire about the criteria used to invite and select the tenderers that participate in the suspicious contract.

Limitations

So far the queries presented suppose exact equalities between the textual terms, which supposes a very limited approach. The use of regular expressions and techniques such as edit distance can expand possible search patterns.

Conclusion & Invitation

This tutorial did not pretend to be an exhaustive review of graph databases and its applications. Instead, we encourage our readers to explore and learn more about this amazing technology.

Through the examples presented here, we can appreciate the enormous potential and flexibility that graph databases have to offer to the world of journalistic research. In the specific case of open contacting data, note that with small adjustments the proposed model can be applied to any of the countries or cities that are publishing data under the Open Contracting Data Standard.

Now it is time to use what we learned to replicate and improve this exercise with similar datasets.

Additional resources

Written by Dagoberto José Herrera Murillo (@djherreram, e-mail: d.j.herrera.murillo @ student.tue.nl).

Neo4j Developer Blog

Developer Content around Graph Databases, Neo4j, Cypher, Data Science, Graph Analytics, GraphQL and more.

Dagoberto José Herrera Murillo

Written by

“En mi país no hay ley de acceso a la información.” Espero algún día eliminar estas palabras de aquí. Entusiasta de los datos abiertos.

Neo4j Developer Blog

Developer Content around Graph Databases, Neo4j, Cypher, Data Science, Graph Analytics, GraphQL and more.