What the heck is a “graph database”?

…and why they help you be cool.

Timo Klimmer
12 min read · Mar 8, 2022

TLDR: Brief intro to graph databases and real-world use cases. Some technical content but primarily focusing on concepts and examples for newbies.

Graph databases are on the rise. They are not new, but more and more companies are only now discovering them.

Compared to other technologies, they often deliver business value faster, and they facilitate innovation better because they are closer to the mindset of business people.

So it’s no wonder that graph databases have long been on the radar of analysts like Gartner or Forrester. Many strategy consulting companies like the Big Four see value in them, and recently, several graph database companies have received notable funding from investors. Business magazine Forbes has even published an article named “Why Your Next Database Is A Graph”.

But what exactly is a graph database?

A graph database is a database which focuses on relations.

Something — is related to → Something else.

The world is full of relations. Production processes, customers and devices interacting with all sorts of things, financial transactions between different parties, supply chains, energy grids, data from crime investigations, media content networks, social networks, computer networks, and so on and so forth…

There is likely not a single industry which does not deal with relations somehow. Rather, the more relations matter in your business, the more you will benefit from a graph database. If you haven’t dealt with graph databases before, this is your chance!

Experiment

To get started and show you “why graph database”, let’s run an experiment. Imagine you are responsible for a supply chain. Due to the Corona crisis, one of your suppliers is no longer able to deliver certain parts, and your warehouses are empty, too. In particular, you are out of hoses. What does that mean for your existing orders? Which customers are affected? Which orders need to be cancelled?

For the sake of the experiment, let’s assume all required data is stored in two ways: once in a relational database, and once in a graph (see below). Your task is to find all customers and orders that are affected by the outage. Remember: hoses are out of stock.

Tabular representation
Graph representation
  1. The first time, try to answer the questions above using the tables above, and
  2. The second time, do the same using the graph.
  3. Then check which of the two took you less time.

Done?

Ok, let me guess: the graph way was easier and faster. The reason is that graphs are easier to digest because they are more closely aligned with how humans think. We don’t think in tables. We think in objects and their relations, and that’s exactly what a graph database is about.

Computers are similar to humans here. The less data a computer has to join and look up, the more efficiently it works. And the more efficient, the more is enabled, and the more enabled, the more $$$ 🙂.
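To make the experiment concrete, here is a minimal sketch in plain Python of how such a question can be answered by simply walking the graph. All part, product, order and customer names are made up for illustration (they are not the data from the figures above), and the edge direction "is needed by" is an assumption as well.

```python
from collections import deque

# Hypothetical supply-chain graph: an edge A -> B means "A is needed by B".
# All names below are invented for illustration.
edges = {
    "part:hose": ["product:pump"],
    "part:valve": ["product:pump", "product:tank"],
    "product:pump": ["order:1001"],
    "product:tank": ["order:1002"],
    "order:1001": ["customer:Alpha"],
    "order:1002": ["customer:Beta"],
}


def affected_by(start):
    """Return every node reachable from `start` (breadth-first traversal)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in edges.get(node, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


impacted = affected_by("part:hose")
affected_orders = sorted(n for n in impacted if n.startswith("order:"))
affected_customers = sorted(n for n in impacted if n.startswith("customer:"))
print(affected_orders, affected_customers)
```

No joins are needed here: the traversal only follows the relations that start at the missing part, which is the kind of work a graph database is built for.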

Real World Use Cases

There is a broad spectrum of use cases where graph databases are beneficial. Often, graph databases beat pre-existing solutions on cost and complexity because they can handle certain tasks very efficiently. Here is a small selection of use cases.

Process Analytics and Simulation

Production processes can be mega complex due to all the dependencies that usually come with them.

  • How do you keep control and make sure your business is resilient against outages? Where are the bottlenecks in your processes? How can you mitigate risks?
  • How do you determine the impact of changes in your value chain?
  • How do you holistically optimize processes across an entire area instead of making only minor local optimizations?

👉 Exactly. Using a graph!

Personalized Recommendations / Digital Twin

Personalized recommendations are another use case for graphs. Graphs come in especially handy

  • if real-time updates of the model are important,
  • if geo data is involved, and/or
  • if exact explanations from the model are required.

Example: a car driving on a highway needs to find the next best point of interest for its driver and his wife, based on their personal preferences, under the condition that there is also enough battery power left to reach the next charging station.

It is hard, if not impossible, to implement such solutions with classic neural networks or classic tables alone, for example.

Customer 360

“If our company only knew what our company knows.” With the power of graphs, fuzzy or duplicate data from multiple sources can be corrected so that single objects from the real world are indeed single objects in your data again. That process of resolving entities is called entity resolution.
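As a rough illustration of the idea, the following sketch links records that share an identifier and treats every connected group as one real-world customer. The records, the source system names and the matching rule (exact match on email or phone) are assumptions made for this example; real entity resolution usually relies on fuzzy matching as well.

```python
# Hypothetical customer records from several source systems (all values invented).
records = {
    "crm:1": {"email": "j.doe@example.com", "phone": "555-0100"},
    "shop:7": {"email": "j.doe@example.com", "phone": None},
    "support:3": {"email": None, "phone": "555-0100"},
    "crm:2": {"email": "a.smith@example.com", "phone": "555-0199"},
}

# Build an undirected graph: two records are linked if they share an email or phone.
neighbors = {rid: set() for rid in records}
ids = list(records)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        for key in ("email", "phone"):
            value = records[a][key]
            if value is not None and value == records[b][key]:
                neighbors[a].add(b)
                neighbors[b].add(a)


def resolve_entities(graph):
    """Group records into entities: each connected component is one real-world object."""
    seen, entities = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        entities.append(sorted(component))
    return entities


entities = resolve_entities(neighbors)
print(entities)
```

Note that "shop:7" and "support:3" have nothing in common directly; they end up in the same entity only because both are connected to "crm:1". That transitive step is what the graph view adds.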

Combined with enrichments through AI (for example: natural language processing, speech recognition or computer vision), more and more companies gain a precise picture of their customers and of the other things they deal with. Needless to say, clean and complete data is worth far more than a thousand tables.

Crime Reduction

Due to their ability to resolve entities, and due to their ability to bring things into context, graph databases are also used to fight crime in all its forms. Many fraud detection, crime investigation, or cybersecurity solutions are based on graphs. Given the current cybercrime trend, more and more companies are arming themselves with graph databases to build custom, domain-aware fraud prevention systems.

Diving a bit deeper — so what can I technically do with a graph database?

CRUD

Well, obviously, the most basic operations are CRUD operations: creating, reading, updating and deleting nodes, edges and properties. (If you haven’t heard about “nodes” and “edges” before, just think of them as the objects and their relations in the graph.)

For some applications, CRUD is all you need. A web application allowing users to interactively browse through a network of digital twins, or an application that needs to identify objects nearby can be implemented using simple CRUD.
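To show what CRUD on nodes, edges and properties means in practice, here is a toy in-memory property graph in Python. The class `TinyGraph` and its method names are hypothetical and only serve as an illustration; a real graph database exposes these operations through its query language or client library.

```python
class TinyGraph:
    """A toy in-memory property graph; not a real database engine."""

    def __init__(self):
        self.nodes = {}   # node id -> property dict
        self.edges = {}   # (source id, target id) -> property dict

    # Create
    def add_node(self, node_id, **props):
        self.nodes[node_id] = dict(props)

    def add_edge(self, source, target, **props):
        self.edges[(source, target)] = dict(props)

    # Read
    def neighbors(self, node_id):
        return [t for (s, t) in self.edges if s == node_id]

    # Update
    def set_property(self, node_id, key, value):
        self.nodes[node_id][key] = value

    # Delete (also removes every edge that touches the node)
    def remove_node(self, node_id):
        del self.nodes[node_id]
        self.edges = {k: v for k, v in self.edges.items() if node_id not in k}


g = TinyGraph()
g.add_node("sensor-1", kind="sensor", room="A")
g.add_node("machine-1", kind="machine")
g.add_edge("sensor-1", "machine-1", relation="monitors")
g.set_property("sensor-1", "room", "B")
print(g.neighbors("sensor-1"))
g.remove_node("machine-1")
print(g.nodes, g.edges)
```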

Analytics

The next level then is analyzing the graph. That includes all sorts of data aggregation, for example counting nodes that have certain properties or summing up metrics on edges. However, it is more than just that.

For example, graph databases let you find patterns.

Example: Assume you wanted to find all machines in a network that are embedded into the network in the same way as a defective machine you have just detected. By querying the graph, you can quickly identify the machines at risk and proactively prevent issues before they occur. Doing the same with a relational database in the backend makes the task much more complex, if it is possible at all.

Impact assessment through identification of identically embedded machines
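One simple way to think about "embedded in the same way" is to compare the direct surroundings of each node. The sketch below does that in plain Python. The plant layout, the relation names and the restriction to one hop are assumptions for illustration; a graph database would typically express such a pattern in its query language and could match much larger structures.

```python
from collections import Counter

# Hypothetical plant network (all names invented).
node_type = {
    "pump-1": "pump", "pump-2": "pump", "pump-3": "pump",
    "tank-1": "tank", "tank-2": "tank",
    "ctrl-1": "controller", "ctrl-2": "controller",
}
# Directed, labeled edges: (source, relation, target)
edges = [
    ("pump-1", "feeds", "tank-1"),
    ("ctrl-1", "controls", "pump-1"),
    ("pump-2", "feeds", "tank-2"),
    ("ctrl-2", "controls", "pump-2"),
    ("pump-3", "feeds", "tank-2"),   # pump-3 has no controller
]


def signature(node):
    """Describe how a node is embedded: its type plus its labeled 1-hop surroundings."""
    surroundings = Counter()
    for source, relation, target in edges:
        if source == node:
            surroundings[("out", relation, node_type[target])] += 1
        if target == node:
            surroundings[("in", relation, node_type[source])] += 1
    return node_type[node], frozenset(surroundings.items())


def embedded_like(reference):
    """Return all other nodes whose signature equals that of the reference node."""
    wanted = signature(reference)
    return sorted(n for n in node_type if n != reference and signature(n) == wanted)


# Suppose pump-1 was found to be defective: which machines sit in the same context?
at_risk = embedded_like("pump-1")
print(at_risk)
```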

Most graph databases have very powerful analytical features. For example:

  • centrality computation: which of the objects in the network are critical because they sit at a central place in the graph?
  • similarity algorithms: how similar are two objects, not only based on their properties but also based on how they are connected with other objects?
  • path finding algorithms: what is the shortest, cheapest and/or quickest way to something?
  • cycle detection: are there cycles in the graph, and where are they?
  • detection of communities: what communities are there?
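To give a feeling for three of these features, here is a small sketch in plain Python: a very simple degree-based centrality, a shortest path by number of hops, and a check for directed cycles. The example network is invented, and the algorithms are textbook versions chosen for readability; graph database engines ship far more sophisticated and scalable variants.

```python
from collections import deque

# Hypothetical directed network (edges invented for illustration).
graph = {
    "A": ["B", "C", "D"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": ["B"],   # closes the cycle B -> D -> E -> B
}


def degree_centrality(g):
    """Count incoming plus outgoing edges per node (a very simple centrality measure)."""
    degree = {node: len(targets) for node, targets in g.items()}
    for targets in g.values():
        for target in targets:
            degree[target] = degree.get(target, 0) + 1
    return degree


def shortest_path(g, start, goal):
    """Breadth-first search: shortest path by number of hops, or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for target in g.get(node, []):
            if target not in previous:
                previous[target] = node
                queue.append(target)
    return None


def has_cycle(g):
    """Depth-first search with node states to detect a directed cycle."""
    state = {}  # missing = unvisited, 1 = in progress, 2 = done

    def visit(node):
        state[node] = 1
        for target in g.get(node, []):
            if state.get(target) == 1:
                return True
            if state.get(target) is None and visit(target):
                return True
        state[node] = 2
        return False

    return any(state.get(node) is None and visit(node) for node in g)


centrality = degree_centrality(graph)
print(max(centrality, key=centrality.get))
print(shortest_path(graph, "A", "E"))
print(has_cycle(graph))
```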

For the sake of brevity, and so as not to bore you too much, let’s leave it at that for now and head over to the machine learning part.

Machine Learning

The ultimate step is to combine graph data with machine learning. “Graph (database) + ML” can take several forms.

1. Graph Feature Extraction

Graph data as input for an ML model with tabular input data

The main challenge in ML is usually not the algorithms. It is rather the information that we give to the algorithms so they can produce a good prediction model from it. If you feed an algorithm with data that does not contain the right information, you can train the model forever and with unlimited data — it won’t perform.

Some ML models simply can’t deliver the desired quality when they are not fed with data derived from a graph. And graph databases deliver just that data.

Example: imagine you build an ML model that evaluates prospective business partners. Your model will probably consider things like: how long have they been on the market, how large is the company, how bad is their feedback etc.

If you are lucky, the model already performs reasonably well. However, if you want to boost your model and make it even more precise (which ultimately means more $$$ for your business), you also need to feed it with data that gives insights into the business partners’ relations: how closely are they collaborating with your competition, is this company working with bad partners you already know, how well does their portfolio fit your needs?

To understand all this and compile the input data for the classification model, a graph database is exactly what is needed.
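A minimal sketch of such a feature extraction step could look as follows. The business network, the company names and the chosen features are all invented for illustration; the point is only that relations in the graph are turned into flat, numeric columns that a tabular ML model can consume.

```python
# Hypothetical business network: undirected "works with" relations (all names invented).
works_with = {
    "Prospect-1": {"Competitor-X", "Supplier-A"},
    "Prospect-2": {"Supplier-A", "Supplier-B", "BadPartner-Z"},
    "Competitor-X": {"Prospect-1"},
    "Supplier-A": {"Prospect-1", "Prospect-2"},
    "Supplier-B": {"Prospect-2"},
    "BadPartner-Z": {"Prospect-2"},
}
competitors = {"Competitor-X"}
known_bad_partners = {"BadPartner-Z"}
our_suppliers = {"Supplier-A", "Supplier-B"}


def graph_features(company):
    """Turn a company's position in the graph into one flat row of numeric features."""
    partners = works_with.get(company, set())
    return {
        "company": company,
        "partner_count": len(partners),
        "works_with_competitor": int(bool(partners & competitors)),
        "bad_partner_count": len(partners & known_bad_partners),
        "shared_supplier_count": len(partners & our_suppliers),
    }


rows = [graph_features(c) for c in ("Prospect-1", "Prospect-2")]
for row in rows:
    print(row)
```

Each row can then be joined with the classic properties (company age, size, feedback) before it is passed to the classification model.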

2. Embeddings

Modern graph engines train so-called embeddings in-database. An embedding is a reduced representation of an object, expressed in the form of a vector (with a custom predefined length).

Embedding example

The more similar two objects are, the closer their embeddings are in the respective space. With the help of the embeddings, we can relatively easily find clusters, build recommenders etc.

Multiple embeddings v1 … v3: v1 and v2 are close to each other, and so are the persons they represent. v3 is further away, which means that the person represented by v3 differs more from the other two.
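The notion of "close" can be made concrete with a distance function. The following sketch uses three hand-written vectors and the Euclidean distance; the numbers are invented, real embeddings are learned by the engine and are usually much longer, and other measures such as cosine similarity are equally common.

```python
import math

# Hypothetical 3-dimensional embeddings; real ones are learned and usually much longer.
embeddings = {
    "v1": [0.9, 0.1, 0.3],
    "v2": [0.8, 0.2, 0.3],
    "v3": [0.1, 0.9, 0.7],
}


def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(name):
    """Return the other embedding that lies closest to the given one."""
    others = (n for n in embeddings if n != name)
    return min(others, key=lambda n: distance(embeddings[name], embeddings[n]))


print(nearest("v1"))
print(distance(embeddings["v1"], embeddings["v2"]))
print(distance(embeddings["v1"], embeddings["v3"]))
```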

The important point here is that the embeddings trained in a graph database reflect not only the properties of an object, but also how the object is embedded into the graph. That makes embeddings trained from graph data much more powerful than “classic” embeddings based on simple object properties only.

The advantage of training embeddings in-database is that data does not have to be moved to external components — at least in theory. Depending on the graph database engine, additional nodes and replications in the cluster might be required before embeddings can be trained.

3. External Machine Learning

It is also not uncommon to use graph-based ML libraries outside of a graph database, sometimes even without using any graph database at all. The approach here is to load graph data from a graph database or another source into memory, either RAM or even GPU memory, and then apply graph algorithms using special ML frameworks in memory.

In-memory graph-based machine learning

This approach is especially suited for scenarios where custom models are trained, and when the respective ML features are not available in the graph database engine (yet). Especially for training graph neural networks (neural networks based on graph data), this approach is quite common.

Popular graph-based ML tasks are for example:

  • Classify a graph or subgraph
  • Apply a regression to a graph or subgraph (= assign a number to a graph or subgraph)
  • Predict links between nodes
  • Generate graphs with special properties (e.g. a molecule with special properties)
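To illustrate the link prediction task, here is a deliberately simple baseline in plain Python. It is not a trained model and not a graph neural network: it merely scores unconnected pairs by the share of neighbors they have in common (the Jaccard coefficient). The social graph is invented for the example.

```python
from itertools import combinations

# Hypothetical undirected social graph (names invented).
friends = {
    "ann": {"bob", "cid", "dan"},
    "bob": {"ann", "cid"},
    "cid": {"ann", "bob", "dan"},
    "dan": {"ann", "cid"},
    "eve": {"fay"},
    "fay": {"eve"},
}


def jaccard(a, b):
    """Share of common neighbors among all neighbors of a and b."""
    union = friends[a] | friends[b]
    return len(friends[a] & friends[b]) / len(union) if union else 0.0


def predict_links(top_n=1):
    """Score every pair that is not yet connected and return the best candidates."""
    candidates = [
        (jaccard(a, b), a, b)
        for a, b in combinations(sorted(friends), 2)
        if b not in friends[a]
    ]
    return sorted(candidates, reverse=True)[:top_n]


print(predict_links())
```

A learned model would replace the hand-written score with one derived from training data, but the shape of the task stays the same: rank node pairs by how likely an edge between them is.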

Is a graph database engine a graph database engine?

No. Graph database engines differ in various aspects. It’s important to be aware of the intentions, capabilities and differences of the different engines upfront. Otherwise you might end up with a bad architecture and miss all the goodness.

OLTP/OLAP/ML

OLTP graph databases. An OLTP graph database is a database which is optimized to handle single transactions. That type of database is well suited for applications where single nodes or edges are looked up, and some rudimentary navigation is required, but not much more.

OLAP graph databases. The purpose of an OLAP graph database is to provide analytical capabilities to the graph. OLAP graph databases can find paths or patterns, find central nodes, compare nodes, detect communities, etc.

Machine Learning. Finally, there are graph database engines which have built-in machine learning capabilities, for example for training embeddings (see above).

Native vs. Multi-Model

Besides OLTP/OLAP/ML, there are also

  • native graph databases and
  • databases that present themselves as graph databases but are in reality multi-model: they only provide a graph query language on top of a common storage backend. Multi-model databases are usually not optimized for graph workloads, so native graph databases are usually faster.

Query Language

Graph databases also differ in their query language. Essentially, there are two major query languages: Gremlin and Cypher (for the sake of brevity, let’s skip derivatives and other minor languages here).

While Gremlin is essentially about chaining function calls, Cypher has a more natural, declarative syntax and is therefore more popular. Some other declarative languages such as GSQL by TigerGraph even support control flow and variables at node level, which can make writing algorithms much easier.
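To give an impression of the difference in style, the snippet below holds the same hypothetical question ("which orders were placed by the customer named Alice?") once as a Cypher query and once as a Gremlin traversal, both as plain Python strings. The labels `Customer` and `Order`, the relationship `PLACED` and the property names are invented for this example, and nothing is sent to a database here.

```python
# The same hypothetical question in both languages:
# "Which orders were placed by the customer named Alice?"
# Labels, property names and the relationship type are invented for illustration.

# Cypher: describe the pattern you are looking for.
cypher_query = """
MATCH (c:Customer {name: 'Alice'})-[:PLACED]->(o:Order)
RETURN o.id
"""

# Gremlin: chain traversal steps, one function call after the other.
gremlin_query = (
    "g.V().hasLabel('Customer').has('name', 'Alice')"
    ".out('PLACED').hasLabel('Order').values('id')"
)

print(cypher_query.strip())
print(gremlin_query)
```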

Engines and Frameworks

Graph Database Engines

Comparison of graph database engines

Graph-based ML frameworks

DGL and PyTorch Geometric are popular frameworks for training graph neural networks. RAPIDS cuGraph and nvGraph, for example, offer graph computation on GPUs, while NetworkX is a popular library for working with graphs in Python (though clearly not comparable to a graph database engine). All of these frameworks, and nearly any other framework, can be run on Microsoft Azure Machine Learning, which can provide masses of compute power if needed.

Cognitive Capabilities

When working with graphs, you might first need to extract data before you can add it to the graph. There are numerous frameworks out there for extracting information of all kinds. For the sake of brevity, I cannot go into details here, but I recommend taking a look at what Microsoft Azure provides out-of-the-box, especially Azure’s Cognitive Services and Applied AI Services. Both are a good means to, for example, extract custom named entities, classify documents into custom categories, transcribe audio/speech, translate speech/text to other languages, or recognize objects in images. Unless there are good reasons to build these capabilities on your own, it can be faster, cheaper and better to reuse what is already there.

What is a graph database not suited for?

Well, it is pretty obvious, but JSON documents and large tables are better stored and processed somewhere else. The same goes for time series data.

If there is the need or desire to query data via a single interface, GraphQL has become quite popular as a query language standard. Interestingly, GraphQL is only partially related to graph databases. There are GraphQL implementations which retrieve their data from classic relational databases completely without graph databases in the backend. (Some graph database engines do support GraphQL, however.)

Also be aware that there are already very powerful semantic search engines out there. If you build a knowledge graph to implement a search engine that “only” returns documents with highlighted answers, consider using a semantic search engine instead. It will save you from having to build your own NLP models etc.

Furthermore, in some cases, when the scenario requires only a quick computation in memory without any persistence, it might be easier to simply load the data into memory and use a graph library in the programming language of your choice to do the computation.

Alright — that’s all I wanted to write for now.

I hope this article was enlightening and will help you survive your next graph database conversation. If you liked the article, let me know 🙂

Bonus Tracks

Storage and Cost Savings

Besides computational efficiency/performance, it’s not uncommon that graph databases store their data in much smaller data volumes compared to other database types. The reason is simple: if an object is stored “once and only once”, there are no redundant duplicates. That in turn can lead to significant storage (and cost) savings.

Streaming Data Into Graphs

Sometimes, data is directly streamed into graphs. An example could be an online shop. Whenever a new customer registers, the new customer is immediately ingested into the graph. Such streaming enables near real-time analytics.

Temporal Context

As time goes by, graphs change. The customer whose last name was “Doe” yesterday might have married and changed her name to something else. The sensor that was installed in a certain location last week might have been moved to a completely different location yesterday.

The importance of the temporal context differs from use case to use case, but if you get time wrong, you can easily end up in big trouble. To avoid the trouble, make sure your graphs can cope with temporal changes, or make sure that your graph contains only data that was valid at a given point in time.
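One common way to cope with temporal changes is to store a validity period on each edge and to filter by date when querying. The sketch below shows that idea in plain Python. The field names `valid_from` and `valid_to`, the half-open interval convention and the sensor data are assumptions for illustration; engines differ in how, and whether, they support this natively.

```python
from datetime import date

# Hypothetical edges with a validity period; valid_to=None means "still valid".
located_in = [
    {"sensor": "s1", "room": "hall-A",
     "valid_from": date(2022, 1, 1), "valid_to": date(2022, 3, 1)},
    {"sensor": "s1", "room": "hall-B",
     "valid_from": date(2022, 3, 1), "valid_to": None},
]


def location_at(sensor, day):
    """Return the room a sensor was in on the given day, or None if unknown."""
    for edge in located_in:
        started = edge["valid_from"] <= day
        # Half-open interval: the end date itself already belongs to the next period.
        not_ended = edge["valid_to"] is None or day < edge["valid_to"]
        if edge["sensor"] == sensor and started and not_ended:
            return edge["room"]
    return None


print(location_at("s1", date(2022, 2, 15)))
print(location_at("s1", date(2022, 3, 7)))
```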

Relational Database vs. Graph Database

Interestingly, although they are all about relations, graph databases are completely different from so-called “relational databases”. In a relational database, a “relation” is essentially a table, that is, a set of rows that share the same columns, whereas a graph database is about the relations between objects. It is exactly that “relation between objects” which makes graph databases so powerful.

Answer to the Experiment

The answer is: Only order “DXI-2093-S” by customer “CRONUS Intl.” is affected. All other orders are not affected.

Timo Klimmer

AI Global Blackbelt at Microsoft. Collaborates with customers and potential customers to design intelligent solutions that add real value.