Why graph databases are the right choice for many data-centric organizations

Steve Sarsfield
Jun 20

Graph databases are becoming more important to analytics, offering the ability to store relationships natively and to perform unique algorithms on them.

Graph databases show relationships, true. But the real power might just be in the difficult analysis they can perform.

In the graph database world, the graph relationship diagram highlights one of the unique values of graph, namely the ability to keep track of connections in the data. Graph visualizations are the first place to start when it comes to understanding the connections in the data and how the puzzle fits together. However, it is just one of the features that makes graph databases potentially valuable for your organization. Let’s look at a couple of examples of that potential and how they come together to empower analytics.

Graph Algorithms

So, why do it in a graph DB at all?

These types of calculations are comparatively simple and very efficient on graph databases. PageRank in SQL is possible, but you’d likely end up with a multi-step iterative process that performs calculations on an entire database, joins two or more tables, and computes aggregates like SUM. It’s a very taxing process on a SQL database, and it would need to be repeated time and time again as new data is introduced. It’s just easier and more efficient in a graph database. I’ve even seen companies use multiple databases and a lot of data transfer to crunch through a PageRank calculation, since it is computationally expensive. With a graph database, you don’t need any of that: you can perform both your normal daily analytics and the tough stuff like PageRank in one place.
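To see why this is naturally iterative, here is a minimal sketch of the PageRank computation in plain Python. The graph, damping factor, and iteration count are all illustrative; a graph database runs this as a built-in, parallelized algorithm rather than hand-rolled code.

```python
# Minimal iterative PageRank over an adjacency list (illustrative only).
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node keeps a base share, then receives rank from in-links.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, targets in graph.items():
            if targets:  # distribute this node's rank to its out-links
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:        # dangling node: spread its rank evenly
                for t in nodes:
                    new_rank[t] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

# Tiny example: three pages linking to each other
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Every pass touches the whole graph, which is exactly the repeated full-table scan and join you would be emulating in SQL.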

Statistical Algorithms on Diverse Linked Data

  • Correlations — Correlation analysis is used when a data scientist wants to establish if there are possible connections between variables/factors. There may be a correlation between the time of day and the amount of energy generated by a solar panel. When establishing connections between variables (or machine learning factors) you can use a correlation algorithm.
  • Profiling — Profiling analysis is used where you’re trying to use several factors/variables to describe a ‘profile’ of a person, place or thing. Your company might have better success by marketing certain products and price points to categories of customers, and profiling is where you’d start. This type of analysis is tangential to machine learning, and profiling analysis in a graph DB can provide superior results.
  • Distributions — Back in grade school, your teacher may have graded you on the classic bell curve. Distributions are a broad category of algorithms; the bell curve is just one of many you can apply to help you understand and sometimes predict outcomes. They can be used to understand housing prices or retail buying patterns. Financial analysts and investors often use a distribution when analyzing the returns of a security or overall market sensitivity and volatility.
  • Entropy — Entropy measures how surprising a new observation is in an ongoing analysis. For example, in IoT you might expect a device to send a consistent value or set of values. When the device varies from the norm, you can capture the degree of variance with an entropy algorithm.
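As a rough illustration of the first and last of these, here is a hand-rolled Pearson correlation and Shannon entropy in Python. The solar-panel and IoT readings are made-up sample data; a graph database would supply these as built-in functions over linked data.

```python
import math

# Pearson correlation: do two variables move together?
def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Shannon entropy: how much variety is in a stream of readings?
def entropy(values):
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

hours = [8, 10, 12, 14, 16]          # time of day (hypothetical)
output = [0.2, 0.6, 1.0, 0.7, 0.3]   # solar panel output in kW (hypothetical)
r = correlation(hours, output)

steady = ["ok"] * 10                 # a device sending a constant value
surprise = entropy(steady)           # zero entropy: nothing remarkable
```

A constant IoT stream scores zero entropy; any deviation from the norm pushes the score up, which is how you flag the remarkable readings.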

The interplay between schemas and inferencing

The following example from the W3C Semantic Web Inference documentation illustrates the inference concept:

  • A data set might include the relationship Flipper isA Dolphin.
  • An ontology might declare that “every Dolphin is also a Mammal.”

A graph database with inferencing understands the notion of “X is also Y” and adds the statement Flipper isA Mammal to the set of relationships, even though it was not specified in the original data. That’s hugely powerful for many uses. One is supply chain, where relationships between parts and substitutions are important. For customer databases, inferencing can give you a better understanding of buyers.
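The rule “every Dolphin is also a Mammal” can be sketched as a tiny forward-chaining loop over triples. The subclass table below is hypothetical; real triple stores apply RDFS/OWL rules like this natively and at scale.

```python
# Forward-chaining inference over (subject, predicate, object) triples.
facts = {("Flipper", "isA", "Dolphin")}
subclass_of = {"Dolphin": "Mammal", "Mammal": "Animal"}  # "every X is also Y"

def infer(facts, subclass_of):
    inferred = set(facts)
    changed = True
    while changed:  # keep applying the rule until no new facts appear
        changed = False
        for s, p, o in list(inferred):
            if p == "isA" and o in subclass_of:
                new = (s, "isA", subclass_of[o])
                if new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred

all_facts = infer(facts, subclass_of)
```

Note that the loop runs to a fixpoint, so chained rules fire too: from one stated fact, the store derives that Flipper is a Mammal and also an Animal.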

Greater intelligence through Inferencing and Ontologies

Graph databases tap into the Web Ontology Language (OWL) and can easily generate new inferred relationships according to OWL rules. You can also set up your own ontologies.

My favorite example has to be the Pizza ontology designed and used by Manchester University and Stanford University for learning about ontologies. It describes pizza and its potential toppings. A named pizza, like a Papa John’s “Garden Fresh”, could be described in an ontology. Although the order entry system will show the sold item as Garden Fresh, I may also want to know that the components of that pizza are crust, sauce, and toppings such as cheese, peppers, onions and olives. The ontology does that for me, and it helps when it comes time to estimate how many olives Papa John’s needs to buy.
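As a sketch of that roll-up, here a plain dictionary stands in for the ontology; the real Pizza ontology expresses this with OWL classes and properties, and the topping lists and order counts below are invented for illustration.

```python
# Hypothetical named-pizza decomposition standing in for the ontology.
toppings = {
    "GardenFresh": ["crust", "sauce", "cheese", "peppers", "onions", "olives"],
}
orders = {"GardenFresh": 120}  # pizzas sold (made-up figure)

# Roll orders on named pizzas down to raw ingredient demand.
needed = {}
for pizza, count in orders.items():
    for ingredient in toppings[pizza]:
        needed[ingredient] = needed.get(ingredient, 0) + count
```

The order system only ever says “Garden Fresh”; the ontology is what lets you turn 120 sold pizzas into 120 portions of olives.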

Ontologies are a standard way of describing things that, when combined with graph databases, offer additional analytical power. There are widely used ontologies for automotive, TV and broadcasting, books and publishing, merchandise from online and retail stores, and much more.

Unlimited Facts

Graph databases, specifically RDF triple stores like AnzoGraph, store data as SUBJECT-PREDICATE-OBJECT statements, also known as triples. The facts are in a format specified by the RDF specification, but essentially you’ll see facts like:

  • John is a person
  • John is married to Sue
  • John buys BMW
  • John resides in New York
  • John is the son of Andrew

In a relational database, a column is a vertical division of facts that must be specified and defined ahead of time. In a graph database, however, you store facts outside the constraints of columns and rows. You don’t have to know anything ahead of time about what you want to store or what type of analytics you want to run. You can add any facts about John at any time; if new data comes along about John on any subject, you store it as another triple, not in a separate table. You don’t need to create separate tables and joins with graph databases. If there is missing information about John, there are no nulls; you just go on the facts that you have. You can get a lot out of triple stores.
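A toy triple store makes the point: facts are just (subject, predicate, object) tuples, and new facts about John can arrive at any time with no schema change. This is illustrative Python, not how AnzoGraph stores triples internally.

```python
# Facts as (subject, predicate, object) triples -- no schema required up front.
facts = set()

def add(s, p, o):
    facts.add((s, p, o))

def query(s=None, p=None, o=None):
    """Match triples, treating None as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in facts
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

add("John", "isA", "person")
add("John", "marriedTo", "Sue")
add("John", "buys", "BMW")
add("John", "residesIn", "New York")
# A new fact about John later? Just add another triple -- no ALTER TABLE.
add("John", "sonOf", "Andrew")

everything_about_john = query(s="John")
```

There is no row for John with empty cells waiting to be filled; a missing fact is simply a triple that was never stored.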

If you need more complexity, you can use labeled property graphs (LPG). Under the newer RDF standards, AnzoGraph and Blazegraph are two examples of triple stores that can support LPG. If you want, you can use properties to identify, for example, when John bought the BMW, or how much he likes the brand.
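One way to picture edge properties is to hang a dictionary of key/values off each fact. This is only an analogy in plain Python for what LPG support in a triple store provides; the property names and values are invented.

```python
# LPG-style edges: each relationship carries its own property bag.
edges = []

def add_edge(s, p, o, **props):
    edges.append({"s": s, "p": p, "o": o, "props": props})

# Hypothetical properties on the "buys" relationship itself.
add_edge("John", "buys", "BMW", when="2019-05-01", brand_affinity=0.9)

bmw_purchases = [e for e in edges if e["p"] == "buys" and e["o"] == "BMW"]
```

The key idea is that the properties belong to the relationship, not to John or to the BMW, so facts about the purchase itself have a natural home.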

Contrast this with more rigid solutions. In an RDBMS, I have to know what I’m going to store about each person. It’s also a good idea to know what kind of analysis I will run so that the queries run fast. Only then can I design a schema and structure the database correctly. As the database ages, data quality issues can creep in as users begin to use non-standard ways to represent data. Rather than ask an admin to add a column to a table, for example, it’s easier to shove important information into a “Notes” field where it doesn’t belong. Graph databases offer you freedom from these fixed schemas.

Where the rubber hits the road on graph databases

  • Scientific Data Discovery in Clinical Trials — Graph databases enable scientists at leading biopharma organizations to accelerate vital new research discoveries. By ingesting gnarly, unstructured data into a three-column table of facts, researchers can freely bring in new data and “pivot” their analytics to ask new, ad hoc questions, without being limited by rigid database schemas.
  • Anti-Fraud & Money Laundering — The relationship side of graph databases can help detect fraudulent trading patterns and transactions in real time. You can semantically identify and understand the intricate relationships between entities and transactions, including the many individuals and organizations involved with those transactions.
  • 360° View of the Customer — With the power of the simplified schema, algorithms and inferencing, you can gain new insight into each customer’s likes and dislikes in relation to other customers with similar location, similar demographics, etc. Discover new correlations between customers with inferencing, for more personalized and engaging customer experiences. Enable highly effective targeted customer marketing offers and loyalty programs geared to increase customer loyalty and “share of wallet”.
  • Genomic Research — Specifically the algorithms and relationship capabilities of graph database are great for storage and querying of conditional relationships between molecular (genetic and epigenetic) events. They can be used to observe the characteristics of certain cancers when compared to healthy patients.
  • Recommendation Engine — Graph databases can leverage profile algorithms that you customize to present relevant items to a particular user during a shopping experience. Other algorithms group the buyers into persona profiles.
  • IT Management — Seeing how the nodes in computer networks connect and interact with each other reveals a world of additional complexity and can be very powerful.
  • Social Network — One of the original use cases for graph databases is keeping track of social networks and understanding influence.
  • Machine Learning — Correlation analysis can be the first step in understanding which factors are important to machine learning.

About the author: Steve Sarsfield, VP of Product at Cambridge Semantics / AnzoGraph, has experience at Talend, Vertica and now Cambridge Semantics. He is also author of the book The Data Governance Imperative. Steve has more than 20 years of experience in databases, analytics, information quality, big data and data governance.