Why graph databases are the right choice for many data-centric organizations
Graph databases are becoming more important to analytics because they store relationships directly and run graph-specific algorithms on them.
In the graph database world, the graph relationship diagram highlights one of the unique values of graph, namely the ability to keep track of connections in the data. Graph visualizations are the first place to start when it comes to understanding the connections in the data and how the puzzle fits together. However, it is just one of the features that makes graph databases potentially valuable for your organization. Let’s look at a couple of examples of that potential and how they come together to empower analytics.
Even though you may not visualize certain algorithms with the traditional node-and-edge graph visualization, graph algorithms including PageRank, shortest path, all paths and many others help you solve data analytics challenges at scale. For example, PageRank uses a mathematical formula to help you understand things like who is most influential in a network, or which network router gets the most traffic. PageRank outputs a positive numeric score for each node, which really doesn’t lend itself to a graph-type visualization.
So, why do it in a graph DB at all?
These types of calculations are comparatively simple on graph databases and very efficient. A PageRank in SQL is possible, but you’d likely end up with a multi-step iterative process that performs calculations on an entire database, joins two or more tables, and calculates aggregates like SUM. It’s a very taxing process on a SQL database, and it needs to be repeated whenever new data is introduced. It’s just easier and more efficient in a graph database. I’ve even seen companies use multiple databases and a lot of data transfer to crunch through a PageRank calculation, since it is computationally expensive. With a graph database, you don’t need that extra machinery, and you can perform both your normal daily analytics and the tough stuff like PageRank in one place.
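To make the iterative nature of PageRank concrete, here is a minimal sketch in plain Python. It is not how any particular graph database implements the algorithm; the function name, the damping factor of 0.85, the fixed iteration count and the sample edges are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every node passes an equal share of its rank along each outgoing link.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical "who links to whom" data, for illustration only.
edges = [("alice", "carol"), ("bob", "carol"), ("carol", "dave"), ("dave", "carol")]
scores = pagerank(edges)
most_influential = max(scores, key=scores.get)
print(most_influential, round(scores[most_influential], 3))
```

Each pass touches every edge once, which is the step that turns into repeated joins and SUM aggregates when the same logic is written in SQL.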
Statistical Algorithms on Diverse Linked Data
Graph databases are a new generation of NoSQL solutions that allow statistical algorithms to be used on diverse linked data. Both the simplified three-column table of graph and the connected information of graph save you from performing costly joins and thereby speed up analytics. In several solutions on the market, you can now perform analysis like:
- Correlations — Correlation analysis is used when a data scientist wants to establish if there are possible connections between variables/factors. There may be a correlation between the time of day and the amount of energy generated by a solar panel. When establishing connections between variables (or machine learning factors) you can use a correlation algorithm.
- Profiling — Profiling analysis is used where you’re trying to use several factors/variables to describe a ‘profile’ of a person, place or thing. Your company might have better success by marketing certain products and price points to categories of customers, and profiling is where you’d start. This type of analysis is tangential to machine learning, and because a graph DB keeps the factors connected, profiling analysis there can produce strong results.
- Distributions — Back in grade school, your teacher may have graded you on the classic bell curve. Distributions are a broad category of algorithms: the bell curve is only one of many, and you can apply various algorithms to help you understand and sometimes predict outcomes. They can be used to understand housing prices or retail buying patterns. Financial analysts and investors often use a distribution when analyzing the returns of a security or overall market sensitivity and volatility.
- Entropy — Entropy measures how remarkable the values in an ongoing analysis are. For example, in IoT, you might expect a device to send a consistent value or set of values. When the device varies from the norm, you can capture the degree of variance with an algorithm.
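Two of the analyses above can be sketched in a few lines of plain Python. This assumes Pearson's coefficient for the correlation and Shannon entropy for the entropy measure; the article does not say which formulas a given product uses, and the solar and sensor readings below are made-up illustration data.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def shannon_entropy(values):
    """Shannon entropy (in bits) of the observed value frequencies."""
    total = len(values)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(values).values()
    )


# Hypothetical morning readings: hour of day vs. solar panel output in kWh.
hours = [6, 7, 8, 9, 10, 11, 12]
output_kwh = [0.1, 0.5, 1.2, 2.0, 2.7, 3.1, 3.3]
r = pearson(hours, output_kwh)

# Hypothetical IoT readings: a steady device and one that strays from the norm.
steady = [20] * 8
noisy = [20, 20, 21, 19, 20, 35, 20, 18]
print(round(r, 3), shannon_entropy(steady), shannon_entropy(noisy))
```

A steady device scores zero entropy because every reading is the same; the more the readings scatter, the higher the score.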
The interplay between schemas and inferencing
If you were able to plan your schema in an RDBMS so that it could store relationships, it might help you achieve certain types of relationship analytics. However, graph databases often include inferencing capabilities that can create new relationships based on the vocabularies or ontologies in the existing data.
The following example from the W3C Semantic Web Inference documentation illustrates the inference concept:
- A data set might include the relationship Flipper isA Dolphin.
- An ontology might declare that “every Dolphin is also a Mammal.”
A graph database with inferencing understands the notion of “X is also Y” and adds the statement Flipper isA Mammal to the set of relationships even though it was not specified in the original data. This is hugely powerful for many uses. One is supply chain, where relationships between parts and substitutions are important. For customer databases, inferencing can give you a better understanding of buyers.
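The Flipper example can be sketched as a tiny forward-chaining loop. This is only an illustration of the idea, not how a triple store or an OWL reasoner is implemented; the predicate name subClassOf and the extra Mammal-to-Animal rule are assumptions added to show that inferred facts can chain.

```python
def infer_types(triples):
    """Apply the rule (x isA C) and (C subClassOf D) => (x isA D) until nothing new appears."""
    facts = set(triples)
    while True:
        new_facts = {
            (s, "isA", d)
            for (s, p, o) in facts if p == "isA"
            for (c, p2, d) in facts if p2 == "subClassOf" and c == o
        } - facts
        if not new_facts:
            return facts
        facts |= new_facts


data = {("Flipper", "isA", "Dolphin")}
ontology = {
    ("Dolphin", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),  # hypothetical extra rule, not in the W3C example
}
facts = infer_types(data | ontology)
print(sorted(f for f in facts if f[0] == "Flipper"))
```

Neither Flipper isA Mammal nor Flipper isA Animal was stored; both fall out of the rule.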
Greater intelligence through Inferencing and Ontologies
Graph databases tap into the Web Ontology Language (called OWL) and can easily generate new inferred relationships according to the OWL rules. You can also set up your own ontologies.
My favorite example has to be the Pizza ontology designed and used by Manchester University and Stanford University for learning about ontologies. It describes pizza and its potential toppings. A named pizza, like a Papa John’s “Garden Fresh,” could be described in an ontology. Although the order entry system will show the sold item as Garden Fresh, I may also want to know that the components of that pizza are crust and sauce, and that the toppings are cheese, peppers, onions, olives and so on. The ontology does that for me, and it helps when it comes time to estimate how many olives Papa John’s needs to buy.
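As a rough sketch of the olive estimate, the snippet below expands each sold item into its components. The dictionary is a hand-written stand-in for an ontology, not the actual Pizza ontology; the “Plain Cheese” pizza, the per-pizza unit counts and the order quantities are all hypothetical.

```python
from collections import Counter

# Hypothetical ontology fragment: named pizza -> {component: units per pizza}.
HAS_COMPONENT = {
    "Garden Fresh": {
        "crust": 1, "sauce": 1, "cheese": 1, "peppers": 1, "onions": 1, "olives": 1,
    },
    "Plain Cheese": {"crust": 1, "sauce": 1, "cheese": 1},
}


def ingredient_demand(orders):
    """Turn {pizza name: quantity sold} into total units needed per component."""
    demand = Counter()
    for pizza, quantity in orders.items():
        for component, units in HAS_COMPONENT[pizza].items():
            demand[component] += quantity * units
    return demand


# The order entry system only knows the names of the items sold.
demand = ingredient_demand({"Garden Fresh": 10, "Plain Cheese": 5})
print(demand["olives"], demand["cheese"])
```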
Ontologies are a standard way of describing things that, when combined with graph databases, offer additional analytical power. There are widely used ontologies for automotive, TV and broadcasting, books and publishing, merchandise from online and other retailers, and much more.
Switching gears a bit, RDF graph databases are called triple stores for a reason: they store data mostly in three columns, subject-predicate-object. The three-column table of an RDF graph database is a powerful asset that allows you to store the facts you know when you design your database as well as any future, unanticipated facts. In a traditional database, handling schemas and slowly changing dimensions can be challenging and time-consuming. By configuring all of your data into triples, you reduce the need to set up rigid schemas, complicated ETL and data transformation, multiple tables and tricky, expensive JOINs.
Graph databases, specifically RDF triple stores like AnzoGraph, deal with data that almost always has the same SUBJECT-PREDICATE-OBJECT shape, also known as triples. The facts are in a format defined by the RDF specification, but essentially, you’ll see facts like:
- John is a person
- John is married to Sue
- John buys BMW
- John resides in New York
- John is the son of Andrew
In a relational database, a column is a vertical division of facts that must be specified and defined ahead of time. In a graph database, however, you store facts outside the constraints of columns and rows. In this system, you don’t have to know anything ahead of time about what you want to store and what type of analytics you want to run. You can add any facts about John at any time. If new data comes along about John on any subject, you can store it in a triple and not a separate table. You don’t need to create separate tables and joins with graph databases. If there is missing information about John, there are no nulls; you just go on the facts that you have. You can get a lot out of triple stores.
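The facts about John can be sketched as a plain set of three-part tuples. This is an in-memory illustration of the triple idea, not an RDF store; the predicate spellings, the match helper and the extra sailboat fact are assumptions made for the example.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that was specified."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


triples = {
    ("John", "isA", "Person"),
    ("John", "marriedTo", "Sue"),
    ("John", "buys", "BMW"),
    ("John", "residesIn", "New York"),
    ("John", "sonOf", "Andrew"),
}

# A new, unanticipated fact needs no new column or table (hypothetical fact).
triples.add(("John", "owns", "Sailboat"))

print(match(triples, predicate="marriedTo"))
print(len(match(triples, subject="John")))
```

Anything not known about John is simply absent from the set, so there is nothing to fill with nulls.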
If you need more complexity, you can use labeled properties, as in a labeled property graph (LPG). Under the newer RDF standards, AnzoGraph and Blazegraph are two examples of triple stores that can support LPG-style properties. If you want, you can use properties to record, for example, when John bought the BMW, or how much he likes the brand.
Contrast this with more rigid solutions. In an RDBMS, I have to know what I’m going to store about each person. It’s also a good idea to know what kind of analysis I will run so that the queries run fast. Only then can I design a schema and factor the database correctly. As the database ages, there is a chance that data quality issues creep in as users begin to use non-standard ways to represent data. For example, rather than ask an admin to add a column to a table, users find it easier to use a “Notes” field to store many forms of important information that shouldn’t be in a notes field. Graph databases offer you freedom from these fixed schemas.
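One way to picture properties on a relationship is a lookup keyed by the triple itself. The sketch below is plain Python rather than the syntax of any RDF or LPG product, and the property names and values (purchase year, brand affinity) are hypothetical.

```python
# Properties attached to a relationship, keyed by the (subject, predicate, object) triple.
edge_properties = {}


def annotate(triple, **properties):
    """Attach or update properties on one relationship."""
    edge_properties.setdefault(triple, {}).update(properties)


purchase = ("John", "buys", "BMW")
annotate(purchase, purchase_year=2019)  # hypothetical value
annotate(purchase, brand_affinity=0.9)  # hypothetical 0-to-1 score

print(edge_properties[purchase])
```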
Where the rubber hits the road on graph databases
I’ve given some specific differences with graph databases that empower more ambitious analytical challenges, but how exactly do these discrete advantages come together to drive the case for graph?
- Scientific Data Discovery in Clinical Trials — Graph databases enable scientists at leading biopharma organizations to accelerate vital new research discoveries. Because gnarly, unstructured data can be ingested into a three-column table of facts, researchers can freely bring in new data and “pivot” their analytics to ask new, ad hoc questions without being limited by rigid database schemas.
- Anti-Fraud & Money Laundering — The relationship side of graph databases can help detect fraudulent trading patterns and transactions in real time. You can semantically identify and understand the intricate relationships between entities and transactions, including the many individuals and organizations involved with those transactions.
- 360° View of the Customer — With the power of the simplified schema, algorithms and inferencing, you can gain new insight into each customer’s likes and dislikes in relation to other customers with similar location, similar demographics, etc. Discover new correlations between customers with inferencing, for more personalized and engaging customer experiences. Enable highly effective targeted marketing offers and loyalty programs geared to increase customer loyalty and “share of wallet.”
- Genomic Research — The algorithms and relationship capabilities of graph databases are well suited to storing and querying conditional relationships between molecular (genetic and epigenetic) events. They can be used to observe the characteristics of certain cancers by comparison with healthy patients.
- Recommendation Engine — Graph databases can leverage profile algorithms that you customize to present relevant items to a particular user during a shopping experience. Other algorithms group the buyers into persona profiles.
- IT Management — Seeing the relationships between nodes in a computer network, including how the nodes connect and interact with each other, can be very powerful.
- Social Network — One of the original use cases for graph databases is keeping track of social networks and understanding influence.
- Machine Learning — Correlation analysis can be the first step in understanding which factors are important to machine learning.
About the author: Steve Sarsfield, VP of Product at Cambridge Semantics / AnzoGraph, has experience at Talend, Vertica and now Cambridge Semantics. He is also author of the book The Data Governance Imperative. Steve has more than 20 years of experience in databases, analytics, information quality, big data and data governance.