# Why graph databases are the right choice for many data-centric organizations

Graph databases are becoming more important to analytics because they store relationships directly and run algorithms that are hard to perform elsewhere.

In the graph database world, the graph relationship diagram highlights one of the unique values of graph, namely the ability to keep track of connections in the data. Graph visualizations are the first place to start when it comes to understanding the connections in the data and how the puzzle fits together. However, visualization is just one of the features that make graph databases potentially valuable for your organization. Let’s look at a couple of examples of that potential and how they come together to empower analytics.

# Graph Algorithms

Even though you may not visualize certain algorithms with the traditional node-and-edge graph diagram, graph algorithms such as PageRank, shortest path, all paths and many others help you solve data analytics challenges at scale. For example, PageRank uses a mathematical formula to help you understand things like who is most influential or which network router gets the most traffic. PageRank outputs a positive numeric score for each node, which really doesn’t lend itself to a graph-type visualization.

So, why do it in a graph DB at all?

These types of calculations are comparatively simple and very efficient on graph databases. A PageRank in SQL is possible, but you’d likely end up with a multi-step iterative process that performs calculations on an entire database, joins two or more tables, and calculates aggregates like SUM. It’s a very taxing process on a SQL database, and it has to be repeated time and again when new data is introduced. It’s just easier and more efficient in a graph database. I’ve even seen companies use multiple databases and a lot of data transfer to crunch through a PageRank calculation, since it is computationally expensive. With a graph database, you don’t need that, and you can perform both your normal daily analytics and the tough stuff like PageRank.
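To make the iterative nature of PageRank concrete, here is a minimal sketch in plain Python. It is not how any particular graph database implements the algorithm; the router names, the damping factor of 0.85 and the fixed 50 iterations are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of outgoing links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every node passes an equal share of its rank along each outgoing link.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-router network; each edge points where traffic is sent.
links = {
    "router_a": ["router_b", "router_c"],
    "router_b": ["router_c"],
    "router_c": ["router_a"],
    "router_d": ["router_c"],
}
ranks = pagerank(links)
busiest = max(ranks, key=ranks.get)
```

Each pass touches every link once, which is the step a SQL version would express as a join plus a SUM aggregate, repeated until the scores settle.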

# Statistical Algorithms on Diverse Linked Data

Graph databases are a new generation of NoSQL solutions that allow statistical algorithms to be used on diverse linked data. Both the simplified three-column table of graph and the connected information of graph save you from performing costly joins and thereby speed up analytics. In several solutions on the market, you can now perform analysis like:

# The interplay between schemas and inferencing

If you were able to plan your schema in an RDBMS so that it could store relationships, it might help you achieve certain types of relationship analytics. However, graph databases often include inferencing capabilities that can create new relationships based on the vocabularies or ontologies in the existing data.

The following example from the W3C Semantic Web Inference documentation illustrates the inference concept:

A graph database with inferencing that understands the notion of “X is also Y” adds the statement Flipper isA Mammal to the set of relationships even though it was not specified in the original data. This is hugely powerful for many uses. One is supply chain, where relationships between parts and substitutions are important. For customer databases, inferencing can give you a better understanding of buyers.
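The “X is also Y” rule can be sketched in a few lines of Python. This is an illustration only, not an OWL reasoner. It assumes the input data states Flipper isA Dolphin and that an ontology declares every Dolphin to be a Mammal; the `subClassOf` predicate name and the `infer_types` function are hypothetical.

```python
def infer_types(triples):
    """Apply 'X is also Y' (subClassOf) rules until no new isA facts appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_rules = [(s, o) for s, p, o in facts if p == "subClassOf"]
        for subject, predicate, obj in list(facts):
            if predicate != "isA":
                continue
            for child, parent in subclass_rules:
                inferred = (subject, "isA", parent)
                if obj == child and inferred not in facts:
                    facts.add(inferred)
                    changed = True
    return facts


triples = {
    ("Flipper", "isA", "Dolphin"),        # assumed input fact
    ("Dolphin", "subClassOf", "Mammal"),  # assumed ontology rule
}
# Keep only the statements that were not in the original data.
inferred = infer_types(triples) - triples
```

Here `inferred` holds the single new statement Flipper isA Mammal. Because the loop repeats until nothing changes, longer chains of rules are followed as well.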

# Greater intelligence through Inferencing and Ontologies

Graph databases tap into the Web Ontology Language (OWL) and can easily generate new inferred relationships according to the OWL rules. You can also set up your own ontologies.

My favorite example has to be the Pizza ontology designed and used by Manchester University and Stanford University for learning about ontologies. It describes pizza and its potential toppings. A named pizza, like a Papa John’s “Garden Fresh”, could be described in an ontology. Although the order entry system will show the sold item as Garden Fresh, I may also want to know that the components of that pizza are crust and sauce, and that the toppings are cheese, peppers, onions, olives and so on. An ontology does that for me, and it helps when it comes time to estimate how many olives Papa John’s needs to buy.
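As a rough sketch of that idea, the snippet below expands sold items into their components and counts how many pizzas need each one. It is not the actual Pizza ontology: the `PIZZA_COMPONENTS` mapping, the plain “Cheese” pizza and the order quantities are all hypothetical.

```python
from collections import Counter

# Hypothetical ontology fragment: named pizza -> its components.
PIZZA_COMPONENTS = {
    "Garden Fresh": ["crust", "sauce", "cheese", "peppers", "onions", "olives"],
    "Cheese": ["crust", "sauce", "cheese"],
}


def component_demand(orders):
    """Expand sold items into components and count how often each is needed."""
    demand = Counter()
    for pizza_name, quantity in orders.items():
        for component in PIZZA_COMPONENTS[pizza_name]:
            demand[component] += quantity
    return demand


# Hypothetical order-entry data: the system only knows the named item.
orders = {"Garden Fresh": 3, "Cheese": 2}
demand = component_demand(orders)
```

The order system never mentions olives, yet `demand["olives"]` reports how many pizzas need them, because the mapping supplies the missing knowledge.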

Ontologies are a standard way of describing things that, when combined with graph databases, offer additional analytical power. There are widely used ontologies for automotive, TV and broadcasting, books and publishing, online and retail merchandise, and much more.

# Unlimited Facts

Switching gears a bit, RDF graph databases are called triple stores for a reason: they store data mostly in three columns, subject-predicate-object. The three-column table of an RDF graph database is a powerful asset that allows you to store the facts you know when you design your database as well as any future and unanticipated facts. Handling schemas and slowly changing dimensions can be challenging and time-consuming. By configuring all of your data into triples, you limit the need to set up rigid schemas, complicated ETL and data transformation, multiple tables and tricky, expensive JOINs.

Graph databases, specifically RDF triple stores like AnzoGraph, deal with data that almost always takes the same SUBJECT-PREDICATE-OBJECT form, also known as triples. Of course, the facts are in a format defined by the RDF specification, but essentially, you’ll see facts like:

In a relational database, a column is a vertical division of facts that must be specified and defined ahead of time. In a graph database, however, you store facts outside the constraints of columns and rows. In this system, you don’t have to know anything ahead of time about what you want to store or what type of analytics you want to run. You can add any facts about John at any time. If new data comes along about John on any subject, you can store it in a triple rather than in a separate table. You don’t need to create separate tables and joins with graph databases. If there is missing information about John, there are no nulls; you just go on the facts that you have. You can get a lot out of triple stores.
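A toy in-memory store shows how this plays out. This is a sketch for illustration, not the API of AnzoGraph or any other product; the `TripleStore` class, the predicate names and the facts themselves are hypothetical.

```python
class TripleStore:
    """Tiny in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def facts_about(self, subject):
        """Return predicate -> set of objects for one subject."""
        result = {}
        for s, p, o in self.triples:
            if s == subject:
                result.setdefault(p, set()).add(o)
        return result


store = TripleStore()
store.add("John", "livesIn", "Boston")  # hypothetical facts
store.add("John", "owns", "BMW")
# A new kind of fact arrives later: no schema change and no new table.
store.add("John", "follows", "Formula1")
store.add("Mary", "owns", "BMW")

john = store.facts_about("John")
mary = store.facts_about("Mary")
```

Nothing is known about where Mary lives, so `mary` simply has no `livesIn` entry. There is no null placeholder to manage.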

If you need more complexity, you can use labeled properties, as in a labeled property graph (LPG). Under the new RDF standards, AnzoGraph and Blazegraph are two examples of triple stores that can support LPG. If you want, you can use properties to record, for example, when John bought the BMW or how much he likes the brand.
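One way to picture properties on a relationship is an edge that carries its own key-value pairs. The sketch below is a generic illustration, not the data model of AnzoGraph or Blazegraph; the `Edge` class, the property names and the values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A subject-predicate-object statement that carries its own properties."""
    subject: str
    predicate: str
    obj: str
    properties: dict = field(default_factory=dict)


# Hypothetical values: when John bought the BMW and how much he likes the brand.
purchase = Edge("John", "bought", "BMW", {"purchaseDate": "2019-06-01"})
opinion = Edge("John", "likes", "BMW", {"rating": 5})


def edges_with_property(edges, name):
    """Return the edges that carry a given property."""
    return [edge for edge in edges if name in edge.properties]


dated = edges_with_property([purchase, opinion], "purchaseDate")
```

The properties describe the relationship itself rather than John or the BMW, which is what lets you ask when the purchase happened.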

Contrast this with more rigid solutions. In an RDBMS, I have to know what I’m going to store about each person. It’s also a good idea to know what kind of analysis I will run so that the queries run fast. Only then can I design a schema and factor the database correctly. As the database ages, there is a chance that data quality issues creep in as users begin to use non-standard ways to represent data. For example, rather than ask an admin to add a column to a table, users find it easier to use a “Notes” field to store many forms of important information that don’t belong there. Graph databases offer you freedom from these fixed schemas.

# Where the rubber hits the road on graph databases

I’ve given some specific differences that let graph databases take on more ambitious analytical challenges, but how exactly do these discrete advantages come together to drive the case for graph?

About the author: Steve Sarsfield, VP of Product for AnzoGraph, has experience at Talend, Vertica and now Cambridge Semantics. He is also the author of the book The Data Governance Imperative. Steve has more than 20 years of experience in databases, analytics, information quality, big data and data governance.
