Find Circular Money Flow with Neo4j
How to use the subgraph technique to detect the circular money flow (money laundering) in the Neo4j Database of a finance system
Goal
This article is written for educational purposes to share cool stuff about Neo4j graph data modelling, subgraph technique, and Cypher query building. Our goal is to illustrate a type of money laundering activity called “Circular money flow”, of a very simplified financial system using a graph database. This post provides a conceptual view of how this can be approached using graph techniques. It can be useful study-case material for general purposes for beginners, who are already familiar with basic concepts of the Neo4j Database.
Database Schema
We will use a financial system to represent how money transfer from one business entity to another. Business entities involved in this “money-game” are split into three types: Client, Company, and ATM.
The domain model initially based on the explanations from the article “How to generate a huge financial graph with money laundering patterns?” written by M grin. Thank you, M grin, for a well-explained domain.
Entity Nodes
(:Company)
— a business entity, that can send and receive money.(:Client)
— a business entity, that can send and receive money.(:ATM)
— a business entity, used as an exit point for the money, so ATM cannot be a money sender.
In a real system, nodes would likely contain specific information about entities, such as identifiers and other attributes. However, for the purposes of this example, we are only concerned about the existence of nodes. In this tutorial, we don’t need any properties on our nodes.
The arrows represent money transfers. From the diagram, you may think, that Transactions can be relationships in our graph. Well, it is a possible option, but, I do not recommend you use relationships to represent Transactions in this instance. The Transaction should be a node, and here is why.
Data modelling of Transactions
Transaction — is a noun. A good old definition of a node for data modelling.
From the practical experience, it is also a valid choice. We want to manage them as independent entities. For example, querying all Transaction that were made in the last month is a trivial task on a :Transaction
node label:
MATCH (t:Transaction)
WHERE t.timestamp.month = date().month - 1
RETURN t
Last, but not least: Indexes. The index is a well-known feature of Node’s properties and unfortunately non-trivial story for Relationships.
Transaction Node & Relationships
(:Transaction)
— an entity, that represents an operation of money transfer between two members. It is a self-sufficient fact, that entity “A” sends money to entity “B”. Relationships between (A) — (Transaction) — (B) helps us to understand who is a sender and who is a receiver.
[:OUT]
relationship type — connects the entity of the source node and Transaction
. Relationship means that Entity
is a money sender.[:IN]
relationship type — connects the entity of a destination node and Transaction
. Relationship means that Entity
is the receiver of money.
CALL db.schema.visualization()
Reminder: ATM
used as an exit point for the money, so ATM
cannot be a money sender.
The Transaction
node has 2 properties: the amount
of money and timestamp
of operation. This is important information for our task because we need to know how much money was sent and when it happened.
Visualization of a graph
Now, knowing all of the domain details, we can imagine our graph. Business
entity nodes are connected to each other via Transaction
nodes. It is also worth mentioning that between the same entities there can be many transactions.
Task Requirements
Circular money flows is a type of “money laundering” activity in which the money is sent to different accounts, and sent back to the starting node after a certain number of hops. Our goal is to find all of these circular flows in the graph. We don’t need to find extremely long chains, so we will limit our investigations to 10 transactions in the money flow.
Each participant in the money flow may send a marginally less money in the next transaction compare to the previous one. The total “payment” for the whole travel should not exceed some % loss of the source amount.
The flow of transactions will expect that each “next” transaction happens only after the “previous” one has completed. Transactions should be ordered by the creation timestamp.
Graph Generation
Now, the practical part. First of all, we need an empty Neo4j Database on a local machine. I will use my Neo4j docker boilerplate to build a simple Docker image from the official one, containing the APOC & GDS plugins.
Simple way is to pull ready-to-use blank image from my docker hub profile and build it.
docker pull vladbatushkov/neo4j-apoc-gds:latestdocker run -p 7474:7474 -p 7473:7473 -p 7687:7687 --name=money1 vladbatushkov/neo4j-apoc-gds:latest
Important to setup enough space in the heap and page cache to be able to process a huge amount of data. My image configured for next settings.
ENV NEO4J_dbms_memory_heap_initial__size=4G
ENV NEO4J_dbms_memory_heap_max__size=4G
ENV NEO4J_dbms_memory_pagecache_size=2G
You can also build totally same image and configure it manually doing this.
Dockerfile
Build and run a custom docker container.
docker build . -t=money:dev --no-cachedocker run -p 7474:7474 -p 7473:7473 -p 7687:7687 --name=money1 money:dev
Data Generation
Now let’s talk about how to generate the required graph. The generation of a graph data set is always an interesting task to solve. In this particular case, I will use a well-known APOC Graph Generator procedure.
apoc.generate.er(noNodes, noEdges, label, type)
// generates a random graph according to the Erdos-Renyi model
My approach to building a required graph is based on the following steps:
- Generate N nodes (
Client
) with M random relationships (TRANSFER_TO
) between them— our initial set of nodes and relationships of business entities. - Partition them between
Clients
,Companies
, andATM
. - Break the previously created relationship
TRANSFER_TO
into theTransaction
node structure, as follows:
()-[OUT]->(:Transaction)<-[:IN]-()
The final thing is to clean up the TRANSFER_TO
relationships and add Indexes on Transaction
properties. Do not forget to turn on the multi-statement mode in your browser.
The script is idempotent, you can run it many times with parameters to fit your needs. Parameters I used are:
500 000 Entities of
- 5 000 ATMs
- 50 000 Companies
- 445 000 Clients
5 000 000 Transactions of
- amount of money is a random value up to 1000 conventional units
- timestamp is a random value from today minus up to 1000 minutes earlier
MATCH (n)-[:OUT]->(t:Transaction)<-[:IN]-(m) RETURN n,t,m LIMIT 500
Solution idea
We now need to write a query to find all the chains, starting and ending at the same Client
node. In addition, the amount of money on each Transaction
node within the chain should not lose more than x% of the source amount. The number of transactions involved in the flow can be up to 10, and they should be ordered by time.
The solution we are thinking of will involve building a proper path pattern.
(c1)->(t1)->(e2)->(t2)->(e3)->(t3)->...->(t9)->(e10)->(t10)->(c1)
Here we have start and end in the same Client
(c1), in the middle with some Client
or Company
nodes (e2, e3,…), and in total the longest path can have up to 10 Transactions
(t1, t2, …).
(tj)->(ej)->(tj+1)
Each neighbouring Transactions
should match the following conditions:
t1.timestamp < t2.timestamp
t1.amount > t2.amount
(t1)->...->(tN)
First and last Transactions
also should match these additional conditions:
(t1.amount — tN.amount) / t1.amount < X
- start and ends in the same Client:
(t1)<-[:OUT]-(:Client)-[:IN]->(tN)
As you can see, it would be a hard problem to query with so many conditions and node types involved: Transactions
and Business
entities. We need to think about some smart way to minimize the complexity of the query. We need to think about how actually we can solve this problem.
Subgraph
The main idea behind a subgraph technique — to reshape the graph through adding new relationships. This can help us query the data easier. Our solution query going to work with a subgraph of original graph; with additional structures created to solve our specific problem. This approach is very useful and can be used in many other scenarios.
What is the money flow? It is a chain of Transactions. Let’s forget about business entities for a while and actually find-out, that all we need is actually just Transactions
. We can build a new “network” of only Transactions
, that we didn’t have before (our subgraph). We can then explore that “transaction network”, writing a query to detect circular money flows.
Now let’s build a subgraph by introducing a HOP
relationship.
[:HOP]
Relationship
The [:HOP]
relationship between Transactions
exists for the following conditions:
- Transaction A { timestamp } earlier than Transaction B { timestamp }
- Transaction B { amount } less than Transaction A { amount }
- Transaction B { amount } compare to Transaction A { amount } not less than a defined loss percentage
Without the last condition, the subgraph becomes wide-range acceptable. You can use the same subgraph and apply different loss percentages in your future query. The disadvantage of this approach is that it gives us more relationships, and as a result query performance overhead.
In this generation script, I set the exact value of loss percentage, based on my needs. I define a loss as less than 25%, assuming that in my data 25% loss is still an acceptable value for circular money flow. Why it is so big? I have a random graph, and, assume, that a 10% loss for 10 hops is statistically a jackpot for randomly generated data. With a 25% loss, I’ll have a better chance to find a long-chain result.
MATCH (t1:Transaction)-[:HOP*2..10]->(t2:Transaction)
RETURN t1,t2 LIMIT 500
Solution Query
Now it is time to write an actual query to solve the task. Once again, a reminder of the circular money flow requirements: the same Client
in the beginning and at the end, time-ordered and match less than 25% loss each progressive traversal.
As you can see, the query is really short and clear.
It is worth highlighting the MATCH
pattern used to find a path from first to the last Transaction
via the :HOP
relationships — this is only one valid money flow, that match all chain requirements. Thanks to the :HOP
relationship, we can be sure, that not only first and last Transactions satisfy requirements, but also all Transactions
in between:
path = (t1:Transaction)-[:HOP*3..10]->(t2:Transaction)
The PROFILE
statement is used only to look at the execution plan. Here is the query result and execution plan for 5 449 818 Nodes and 15 996 048 Relationships. Circular money flow result with longest circular money flow of 8 Transactions.
761571154 total db hits in 1041340 ms
It takes around 17 minutes to analyse the whole graph. But do we need to analyze everything?
Use-case example
Depends on your situation you may think about query changes and apply more ideas, how to use this query.
For example, when a new Transaction created, we can run a query against only that Transaction. We will verify, that this money not came from circular money flow, organized by Client, who received money (incoming Transaction).
In this story, I have nothing except id of Node to use as a filter, so, my query change will be looks like this.
PROFILE MATCH path = (t1:Transaction)-[:HOP*3..10]->(t2:Transaction)
WHERE ID(t2) = 2801989 AND ...
...
I catch a potential scammer and it was really fast.
171 total db hits in 94 ms
Conclusion
I hope this story gives you a good understanding that graphs are a very good tool in these types of situations, and a perfect place to apply your creativity. Pick a small problem and try to solve it with a graph. If you need any help — ping me in LinkedIn, I would gladly chat with you about graphs.
Thanks for reading and enjoy Neo4j. Clap-clap-clap to catch all the scammers.