Graph databases: why are they important?
Before I give a high-level description of this topic, it is important to tell my readers what graph databases are and what they are not.
Graph databases are not to be mistaken for GraphQL. GraphQL is a query language for your API and a server-side runtime for executing queries using a type system you define for your data. It is important to note that it is not tied to any specific database or storage engine.
Graph databases have gained a good amount of popularity from social networking models, and as a result a lot of people assume that this is their only use case, which is not accurate. Graph databases are presently applied in many more areas, such as fraud detection, recommendation engines, master data management, artificial intelligence and machine learning.
To define graph databases without making reference to graph theory would be a great disservice, and I am sure that at this point a lot of non-mathematicians are rolling their eyes at the article. Graph theory in its entirety is a really complex topic and thankfully beyond the scope of this article, but for the more inquisitive of you, here is a link where it has been introduced gently. You do not actually need a vast amount of knowledge in this area to understand graph databases, but it is important to at least acknowledge where the underlying theory for constructing graph databases comes from. Graphs are essentially a collection of nodes (vertices) and edges or, in more high-level terms, collections of nodes with relationships that connect them.
The node is the fundamental unit within a graph. A graph can contain many nodes, each holding information about the resource it represents, while the edges are the relationships or lines that connect these nodes to each other. In the graph model used by most graph databases, each edge connects two nodes: a start node and an end node.
So, bringing it all together, graph databases are a type of NoSQL database that utilises the graph model of nodes and edges.
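To make the node-and-edge model concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea, not how any particular graph database stores its data; the class names, the `FRIENDS_WITH` relationship type and the sample people are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    # Information about the resource this node represents.
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    start: str     # id of the node the relationship leaves
    end: str       # id of the node the relationship points to
    rel_type: str  # hypothetical relationship name, e.g. "FRIENDS_WITH"


class Graph:
    def __init__(self):
        self.nodes = {}  # node_id -> Node
        self.edges = []  # list of Edge

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_edge(self, start, end, rel_type):
        # An edge only makes sense between two nodes that already exist.
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(start, end, rel_type))


graph = Graph()
graph.add_node("ada", name="Ada")
graph.add_node("bola", name="Bola")
graph.add_edge("ada", "bola", "FRIENDS_WITH")
```

Note that the relationship is stored as data in its own right, alongside the nodes, rather than being implied by matching values in two tables.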
How Graph databases work
With the introduction of Amazon Neptune and Azure Cosmos DB, there has been a spike in the popularity of graph databases this year. More companies are deciding to utilise them in their production environments.
It was established earlier that graph databases consist of nodes (vertices) and edges. You move around a graph by traversing along specific edge types or across the entire graph. Traversing is the act of journeying through or visiting nodes in the graph. In graph databases, traversing to nodes via their relationships can be compared to JOINs on relational database tables, but for highly connected data these traversals are typically much quicker than the JOINs. This is because the relationships are stored in the graph itself, so a node's neighbours can be reached directly instead of being worked out at query time. When querying data with a traversal, the database only touches the data reachable from the starting nodes, rather than running grouping operations over the entire dataset as a relational database might.
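The traversal idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (a small made-up friendship graph held in a dictionary, and a hypothetical helper called `people_within`), not the algorithm of any specific graph database. Each person points directly at their friends, so the traversal only ever touches people reachable from the starting point.

```python
from collections import deque

# Hypothetical friendship data: each person maps directly to their friends.
friends = {
    "ada": ["bola", "chidi"],
    "bola": ["ada", "dayo"],
    "chidi": ["ada", "dayo"],
    "dayo": ["bola", "chidi", "efe"],
    "efe": ["dayo"],
}


def people_within(graph, start, max_depth):
    """Breadth-first traversal: everyone reachable from start in at most max_depth hops."""
    seen = {start}
    found = set()
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not walk past the requested depth
        for friend in graph.get(person, []):
            if friend not in seen:
                seen.add(friend)
                found.add(friend)
                queue.append((friend, depth + 1))
    return found


direct_friends = people_within(friends, "ada", 1)
up_to_friends_of_friends = people_within(friends, "ada", 2)
```

Going one level deeper only means following one more set of stored relationships from the people already found; the rest of the graph is never examined.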
Relationships in graph databases are treated as first-class entities; we will come to an explanation of why this is a better approach than relational databases in some use cases. This approach means relationships and connections are persisted through every part of the life cycle of the data, so there is no reliance on things like foreign keys for lookups in tables.
Relational databases vs Graph databases
Jonas Partner and Aleksa Vukotic performed an experiment using a social network. They built a query in both MySQL and Neo4j (a popular graph database) against a database of 1,000,000 users. The data comprised each user’s friends and the friends of those friends.
The results show the gap in query time widening sharply as the depth increases. For the simple friends-of-friends query, Neo4j was 60% faster than MySQL. For friends of friends of friends, Neo4j was 180 times faster, and for the depth-four query it was 1,135 times faster. MySQL failed to return a result for the depth-five query.
In relational databases, the problem encountered here comes from foreign keys, which act as references to other tables. The foreign keys act like pointers, meaning relationships between rows are computed at query time using JOIN-like operators. Relational databases do not have the benefit of traversal, so every additional level of depth means another JOIN that has to match rows across the whole table, through an index lookup or a scan. These operations are expensive, and their cost grows rapidly with the number of JOINs in the query.
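To show what "one more JOIN per level of depth" looks like, here is a small sketch using Python's built-in `sqlite3` module. It is not the MySQL setup from the experiment; the `friendship` table, its columns, the sample rows and the `friends_at_depth` helper are all assumptions invented for illustration. The helper follows exactly `depth` friendship rows, so it can also lead back to people seen earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendship (user_id TEXT, friend_id TEXT)")
conn.executemany(
    "INSERT INTO friendship VALUES (?, ?)",
    [
        ("ada", "bola"), ("bola", "ada"),
        ("bola", "dayo"), ("dayo", "bola"),
        ("dayo", "efe"), ("efe", "dayo"),
    ],
)


def friends_at_depth(conn, user_id, depth):
    """Ids reached by following exactly `depth` friendship rows.

    Every level beyond the first adds one more self-join on the table.
    `depth` must be a positive integer chosen by the caller, not user input.
    """
    joins = "".join(
        f" JOIN friendship AS f{i} ON f{i}.user_id = f{i - 1}.friend_id"
        for i in range(2, depth + 1)
    )
    sql = (
        f"SELECT DISTINCT f{depth}.friend_id FROM friendship AS f1{joins}"
        f" WHERE f1.user_id = ? ORDER BY f{depth}.friend_id"
    )
    return [row[0] for row in conn.execute(sql, (user_id,))]


depth_one = friends_at_depth(conn, "ada", 1)    # no join
depth_three = friends_at_depth(conn, "ada", 3)  # two self-joins
```

The query for depth three already joins the same table to itself twice, and each further level adds another join over the full friendship table.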
One of the greatest weaknesses of the relational database model is its inflexibility. Presently, in the year 2018, business needs change constantly, and relational databases are not built to keep up with ever-changing scenarios. You could decide to use nullable columns that can be filled in later, but this isn’t really a good solution to the problem. Graph databases do not have this problem, as they rely on nodes and relationships that can easily be created and changed as requirements evolve.
Now, it is true that you require more storage space for a graph database, as you have to store your relationships. This means that as the graph grows, the storage size grows with it. However, storing the relationships makes for faster traversal: the database engine can walk over the graph by following these relationships, without the need for JOIN operations. This is another reason why relationships are called “first-class entities” in graph databases.
However, this article is not a call to overhaul or scrap the relational databases we have loved through the years, as they are still the best kind of database in a variety of scenarios. Relational databases are still the best to use:
- For data analysis over a huge number of records in a single table.
- When storage is a concern; as mentioned before, they use a lot less storage because relationships do not need to be stored separately.
- With highly structured data, which makes them an awesome tool for reporting.
Uses of Graph databases
I have outlined some of the uses in the introduction above. However, in this article I will only be touching on their use in fraud detection.
To understand how graph databases impact fraud detection systems, it is important that we understand the problems experienced by the systems currently in use. Throughout this article, fraud detection refers to online fraud (banking fraud, insurance fraud, e-commerce fraud). With criminals moving their activities online, it is essential that we continually develop systems to curtail them.
Frequently, fraud happens relatively quickly, and most banking systems cannot detect it fast enough, meaning that in most cases accounts exposed to fraud are not blocked while the fraud is ongoing. In Nigeria especially, there is almost an over-reliance on customers to help detect illegal activity on their bank accounts so that the accounts can be blocked. Where activity goes unreported, stealing can go on for months, even years in some cases, without any sort of detection. Banking software built on relational databases is not designed to look at the fraud patterns of criminals, because these patterns are hard to measure and usually vary across individuals.
With traditional relational database systems, we require a complex series of joins. These queries are often quite complex to build and come with the added cost of running them. Scaling these large queries so that they return real-time data is virtually impossible in most cases, because performance declines as the query grows and more parameters are added. A good example of this is the experiment by Jonas Partner and Aleksa Vukotic discussed above.
Graph databases are an ideal tool for getting over these challenges, as using the relationships to traverse the graph is well suited to this kind of connected data. Real-time queries can be made at the point an odd transaction occurs, giving the financial institution the ability to check various fraud patterns in real time, which can apply to different individuals or groups. This can help the financial institution be proactive in identifying fraud rings before major crimes occur. A way to achieve this could be to combine the fraud detection capabilities already existing or being adopted by the financial institution (bank, insurance company) with a graph database, by running queries on up-to-date data or running a variety of checks during the various stages of the customer and account life cycle.
The stages could comprise:
- At account creation
- During an investigation
- When a suspicious transaction is flagged
- When the credit card threshold is hit
An example scenario we can consider is an online transaction with the following identifiers: phone, user ID, IP address, geolocation, cookie and credit card number. The assumption is that, for a legitimate customer, these identifiers (nodes) would be connected to only a small number of other identifiers through their relationships (edges), but one of these relationships could start to exceed a certain number, e.g. an individual having multiple credit cards with different owners in different geolocations, which could be a case where this individual has robbed the owners of their cards or duplicated them. A query for this pattern can be easily implemented and the individual caught. In addition to this, there is an immediate application for machine learning: building a system that helps us detect complex patterns criminals might share, to make it easier in future to detect individuals or groups whose behaviour matches patterns already in the system.
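The credit card scenario above could be checked with something like the following Python sketch. It is a toy illustration only: the `USED` and `OWNED_BY` relationship names, the node id prefixes, the sample data and the threshold of two owners are all assumptions of mine, and a real system would run an equivalent query inside the graph database instead.

```python
from collections import defaultdict

# Hypothetical edges: (start node, relationship type, end node).
edges = [
    ("user:101", "USED", "card:1111"),
    ("user:101", "USED", "card:2222"),
    ("user:101", "USED", "card:3333"),
    ("user:202", "USED", "card:4444"),
    ("card:1111", "OWNED_BY", "owner:ada"),
    ("card:2222", "OWNED_BY", "owner:bola"),
    ("card:3333", "OWNED_BY", "owner:chidi"),
    ("card:4444", "OWNED_BY", "owner:dayo"),
]


def build_index(edges):
    """Map each node to its outgoing relationships, grouped by type."""
    index = defaultdict(lambda: defaultdict(set))
    for start, rel_type, end in edges:
        index[start][rel_type].add(end)
    return index


def suspicious_users(edges, max_owners=2):
    """Flag users whose cards belong to more than max_owners distinct owners."""
    index = build_index(edges)
    flagged = {}
    for node in list(index):  # copy the keys: lookups below may add empty entries
        if not node.startswith("user:"):
            continue
        owners = set()
        # Two hops: user -USED-> card -OWNED_BY-> owner.
        for card in index[node]["USED"]:
            owners |= index[card]["OWNED_BY"]
        if len(owners) > max_owners:
            flagged[node] = owners
    return flagged


flagged = suspicious_users(edges)
```

With this sample data, only `user:101` is flagged, because the three cards that user has used belong to three different owners. The check is a two-hop traversal from each user, which is exactly the kind of query that could run at the moment a suspicious transaction is flagged.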
In this article, we have learned about graph databases, what they consist of at a high level, and their use in fraud detection. If you are interested in building on a graph database or reading more, I would suggest you start with the Neo4j website, as it has excellent documentation in this area. If you want to read more about fraud detection with graph databases, there is a really good paper by Emil Eifrem of Neo Technology, “Graph databases: the key to foolproof fraud detection?”. It is an excellent read.
- Emil Eifrem, Neo Technology, “Graph databases: the key to foolproof fraud detection?”, 2016
- Graham Cox, “Introduction to Graph Databases”, 2017, https://www.compose.com/articles/introduction-to-graph-databases/
- Bryce Merkl Sasaki, “Graph Databases for Beginners: Why Graph Technology Is the Future”, 2018, https://neo4j.com/blog/why-graph-databases-are-the-future/
- Kenny Bastani, “Bank Fraud Detection”, 2013, https://github.com/neo4j-contrib/gists/blob/master/other/BankFraudDetection.adoc