Processing highly connected Data using Azure Cosmos DB and Gremlin 🚀

Introduction

In this article, we take a look at processing graph-oriented data using Azure Cosmos DB. A graph allows us to represent connections between entities in a convenient and natural way. In many data processing scenarios, we want to efficiently store and analyze huge amounts of these connections.

For example, think about social networks like Twitter or Facebook: They basically consist of a vast number of people that are connected to each other in various ways. By analyzing the connections between them, we can infer social relationships, optimize content delivery and make meaningful content suggestions to users.

There are a lot of other problems that can be solved using graphs: Route optimization in logistics, automated product recommendation or monitoring of network topologies, to name a few.

Azure Cosmos DB enables us to efficiently store and analyze highly connected data using graph structures. Graphs can be read and manipulated using Gremlin, a popular graph traversal language originating from the Apache TinkerPop project. As Cosmos DB supports seamless scaling and multi-region replication of graphs, it is well-suited for large-scale data processing scenarios.

In this article, we will do the following:

  • We will learn more about graph databases and Cosmos DB
  • We utilize Gremlin queries to construct a graph and store it inside Cosmos DB
  • We take a look at some of Gremlin’s query capabilities
  • Finally, I want to discuss the usage of graph databases alongside other database systems. This is a very important aspect when designing larger systems, as you may want to take advantage of graph database capabilities but still use other databases, e.g. a relational database.

Let’s jump right in. ✨

About Vertices, Edges and Graph Databases

A graph consists of objects and a set of relationships between them. From an abstract, mathematical point of view, the objects are called vertices and the relationships are called edges.

An edge connects a pair of vertices and defines a relationship between them. In an undirected graph, the edges do not have a direction. Thus, when we have a pair of vertices named x and y, the edge (x, y) is identical to the edge (y, x).

In a directed graph, edges do have a direction. Below are two graph visualizations as an example:

A simple, undirected graph
A simple, directed graph visualizing a sociogram

By traversing (or following) the edges of these graphs, we are able to answer the following questions:

  • Is there any path from A to F in the undirected graph?
  • What is the shortest path from C to E?
  • Which person is most popular (is liked most) in the above sociogram?
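As a rough illustration of how such questions are answered by following edges, here is a minimal Python sketch. The figures are not reproduced here, so the edge lists below are made-up stand-ins and not the graphs shown above; the function names are likewise only illustrative.

```python
from collections import Counter, deque

# Hypothetical undirected graph; these edges are an assumption for illustration.
UNDIRECTED_EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "E"), ("E", "F")]

# Hypothetical sociogram: (x, y) means "x likes y".
LIKES = [("Alice", "Bob"), ("Carol", "Bob"), ("Bob", "Alice"), ("Dave", "Bob")]


def neighbours(edges):
    """Build an adjacency map in which every edge can be walked both ways."""
    adjacency = {}
    for x, y in edges:
        adjacency.setdefault(x, set()).add(y)
        adjacency.setdefault(y, set()).add(x)
    return adjacency


def shortest_path(edges, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    adjacency = neighbours(edges)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(adjacency.get(path[-1], ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def most_liked(likes):
    """The person with the most incoming 'likes' edges."""
    return Counter(target for _, target in likes).most_common(1)[0][0]


path_exists = shortest_path(UNDIRECTED_EDGES, "A", "F") is not None
c_to_e = shortest_path(UNDIRECTED_EDGES, "C", "E")
print(path_exists, c_to_e, most_liked(LIKES))
```

A graph database answers the same kind of question without us writing the search by hand, which is what the traversal language is for.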

A graph database is a database that represents and stores data using graphs, that is to say vertices and edges. The graph can be enriched with arbitrary semantic information, stored as a set of properties. Properties are key/value pairs that are attached either to a vertex or to an edge. Just like in a document-oriented database, the properties do not follow a predetermined schema.

Queries against a graph database are expressed in a language which is suitable for traversing graphs. The efficient execution of graph traversals is one big strength of graph databases. By contrast, in a relational database, this is a task that does not scale very well.

If we were to implement a sociogram describing friendships using an RDBMS, we might come up with a table containing people and a table containing friend-relationships:

The representation of a sociogram within an RDBMS

Traversing the graph now basically results in recursive JOINs: Do any friends of Alice, living in Springfield, cultivate a friendship with people living in Shelbyville? To answer this question, we have to fetch all relationship records involving Alice, fetch the records of her friends living in Springfield, fetch relationship records involving them, … and so on.
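To make the JOIN chain concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names (people, friendships, person_id, friend_id) and the sample rows are assumptions, since the table figure is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE friendships (person_id INTEGER, friend_id INTEGER);
    INSERT INTO people VALUES
        (1, 'Alice', 'Capital City'),
        (2, 'Bob', 'Springfield'),
        (3, 'Carol', 'Shelbyville'),
        (4, 'Dave', 'Springfield'),
        (5, 'Erin', 'Springfield');
    INSERT INTO friendships VALUES (1, 2), (1, 4), (2, 3), (4, 5);
""")

# Every hop along a friendship costs two more JOINs: one on the
# relationship table and one on the people table.
rows = conn.execute("""
    SELECT DISTINCT fof.name
    FROM people AS alice
    JOIN friendships AS f1 ON f1.person_id = alice.id
    JOIN people AS friend ON friend.id = f1.friend_id
    JOIN friendships AS f2 ON f2.person_id = friend.id
    JOIN people AS fof ON fof.id = f2.friend_id
    WHERE alice.name = 'Alice'
      AND friend.city = 'Springfield'
      AND fof.city = 'Shelbyville'
""").fetchall()

shelbyville_friends = [name for (name,) in rows]
print(shelbyville_friends)
```

Two hops already need four JOINs; a question with an unknown number of hops would need a recursive query on top of that.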

Query execution time in such scenarios easily becomes unpredictable, and the queries themselves are complex to express. Of course, we could invest some time in optimizing these queries by altering the table structure above. However, we would only be investing that work because relational databases do not offer good tools for handling highly connected data. The above scenario is a typical example of accidental complexity, a term coined by Fred Brooks in his famous “No Silver Bullet” paper. It is the technical complexity that is not inherent to the problem, often resulting from the wrong choice of language and tools.

By contrast, using Gremlin, fetching Shelbyville-based friends of Springfield-based friends of Alice might look like this:

g.V()
.has('name', 'Alice').out()
.has('city', 'Springfield').out()
.has('city', 'Shelbyville')
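Read step by step, the traversal alternates between filtering the current set of vertices (has) and moving along outgoing edges (out). The following Python sketch mimics that on a tiny in-memory graph. The data and the helper names has and out are assumptions for illustration; this is not how Cosmos DB evaluates the query internally.

```python
# Hypothetical sample data: vertex properties and outgoing friendship edges.
VERTICES = {
    "alice": {"name": "Alice", "city": "Capital City"},
    "bob": {"name": "Bob", "city": "Springfield"},
    "carol": {"name": "Carol", "city": "Shelbyville"},
    "dave": {"name": "Dave", "city": "Springfield"},
    "erin": {"name": "Erin", "city": "Springfield"},
}
OUT_EDGES = {"alice": ["bob", "dave"], "bob": ["carol"], "dave": ["erin"]}


def has(vertex_ids, key, value):
    """Keep only the vertices whose property `key` equals `value`."""
    return [v for v in vertex_ids if VERTICES[v].get(key) == value]


def out(vertex_ids):
    """Move from each vertex to the targets of its outgoing edges."""
    return [target for v in vertex_ids for target in OUT_EDGES.get(v, [])]


step = list(VERTICES)                          # g.V()
step = out(has(step, "name", "Alice"))         # .has('name', 'Alice').out()
step = out(has(step, "city", "Springfield"))   # .has('city', 'Springfield').out()
result = has(step, "city", "Shelbyville")      # .has('city', 'Shelbyville')

names = [VERTICES[v]["name"] for v in result]
print(names)
```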

First Steps with Cosmos DB and Gremlin

Microsoft Azure Cosmos DB is a globally distributed, multi-model database platform that supports storage of documents, key-value collections and graphs. It offers elastic scaling for storage and throughput as well as multi-region replication.

In this section, we will create a Cosmos DB graph database and use the Data Explorer from the Azure Portal to run some basic Gremlin commands.

Creating a Graph Database

First, start by creating a new Cosmos DB account within the Azure Portal. Do not forget to select the Gremlin API as the instance API:

Creating a new Cosmos DB Account

Switch to your newly created Cosmos DB account and navigate to the Data Explorer. The Data Explorer gives you an overview of existing databases and graphs. It can also be used to visualize graphs and even manipulate vertices, edges and their properties.

Add a new graph as shown below:

Creating a new Cosmos DB Graph Database

After creating the database, you can switch to the “Keys” section and grab both the URI and primary key values. We will need both properties in the next section to establish a connection via C#.

Getting the Cosmos DB Read-Write Key needed by the C# SDK

First Steps with Gremlin

Gremlin is a query language specifically designed for handling graph data structures. We can get familiar with some basic queries using the Cosmos DB Data Explorer. Go to the Data Explorer, select your previously created graph database and open the associated graph. You will see an input box that allows you to run Gremlin queries.

Run the query g.addV('server').property('id', 'API Gateway'), as shown in the screenshot below:

Running Gremlin queries in the Azure Cosmos DB Data Explorer

The query will add a new vertex to the graph. The parameter of addV specifies the label of the vertex, which denotes its type. Let’s add some more vertices by executing the following queries:

g.addV('server').property('id', 'Storage Service')
g.addV('client').property('id', 'Mobile Device')
g.addV('client').property('id', 'Web App')

By executing g.V(), we can retrieve all the vertices of the graph. The Data Explorer allows us to explore the query result either as JSON or as a graph visualization. We can see the following JSON result:

A query result shown by the Data Explorer

Of course, we can query vertices by label and property values. The following query retrieves all server vertices:

g.V().hasLabel('server')

And the following two queries both retrieve the vertex that has the ID “Web App”:

g.V().has('id', 'Web App')
g.V('Web App')

By adding some edges, we can make our graph more interesting:

g.V('Web App').addE('connects').to(g.V('API Gateway'))
g.V('Mobile Device').addE('connects').to(g.V('API Gateway'))
g.V('API Gateway').addE('connects').to(g.V('Storage Service'))

This will connect both client vertices to the API Gateway and the API Gateway to the Storage Service. By executing g.V('API Gateway'), we can examine the result:

Resulting Graph

Just like the vertices, edges also have a label, denoting their type. In our case, all edges are labeled “connects”. We can retrieve these edges by executing g.E().hasLabel('connects').

Exercise 1:
Experiment with the following queries using the Data Explorer and describe the result:

A) g.V('API Gateway').inE()
B) g.V('API Gateway').inE().outV()
C) g.V('API Gateway').bothE()
D) g.V('API Gateway').inE().hasLabel('connects').count()
E) g.V('Web App').outE().inV().id()

Answer:

A) Retrieves incoming edges of the API Gateway
B) Retrieves the source vertices of the incoming edges of the API Gateway
C) Retrieves both incoming and outgoing edges of the API Gateway
D) Counts the incoming ‘connects’-edges of the API Gateway
E) Traverses from the Web App to the target vertices of its outgoing edges and gets their IDs
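To check these answers without a database at hand, the edge steps can be mimicked in a few lines of Python. The three edges are the ones created above; the helper names in_e, out_e, both_e, out_v and in_v are my own shorthand for the corresponding Gremlin steps.

```python
# The three 'connects' edges created above, as (source, label, target).
EDGES = [
    ("Web App", "connects", "API Gateway"),
    ("Mobile Device", "connects", "API Gateway"),
    ("API Gateway", "connects", "Storage Service"),
]


def in_e(vertex):
    """inE(): edges that point to the vertex."""
    return [e for e in EDGES if e[2] == vertex]


def out_e(vertex):
    """outE(): edges that start at the vertex."""
    return [e for e in EDGES if e[0] == vertex]


def both_e(vertex):
    """bothE(): incoming and outgoing edges."""
    return in_e(vertex) + out_e(vertex)


def out_v(edges):
    """outV(): the source vertex of each edge."""
    return [e[0] for e in edges]


def in_v(edges):
    """inV(): the target vertex of each edge."""
    return [e[2] for e in edges]


print(in_e("API Gateway"))                                          # A
print(out_v(in_e("API Gateway")))                                   # B
print(both_e("API Gateway"))                                        # C
print(len([e for e in in_e("API Gateway") if e[1] == "connects"]))  # D
print(in_v(out_e("Web App")))                                       # E
```

Note the naming: outV() returns the vertex an edge comes out of (its source), and inV() the vertex it goes into (its target).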

After the next section, we will take a look at some more advanced Gremlin queries. But first, we need some more complex data to play with.

Graph Construction using Gremlin.Net

Gremlin.Net is an implementation of the Gremlin language based on .NET Standard. It offers both an API for executing Gremlin query strings and a fluent API that allows a type-safe approach to graph queries. However, at the time of writing, Cosmos DB does not yet support the fluent API due to a missing bytecode stream implementation. There is a related discussion on GitHub: https://github.com/Azure/azure-cosmosdb-dotnet/issues/439

There is an open source project, Gremlin.Net.CosmosDb, that aims to provide fluent API queries by serializing them into a Gremlin string representation. You can get more information on the GitHub project page: https://github.com/evo-terren/Gremlin.Net.CosmosDb

For starters, we will send some plain Gremlin query strings to build and upload a graph. In the following example, we will construct a sociogram and upload it to Azure Cosmos DB. The resulting graph will look like this:

Vertices and Edges visualized by the Cosmos DB Data Explorer

You can grab the example on GitHub: https://github.com/marco-bue/azure-cosmos-gremlin

Create a new .NET Core 2.0 console application and add the Gremlin.Net package as a dependency.

We start by adding a simple class that allows us to store a Gremlin query alongside a query description:

We introduce two methods to create the necessary Gremlin statements:
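The class and the two helper methods described above might look roughly like the following. This is a sketch in Python rather than the article's C#, and every name in it (GremlinQuery, add_person, add_knows, the 'knows' edge label) is an assumption. Only the 'person' vertex label is taken from the query results shown later in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GremlinQuery:
    """A Gremlin statement together with a human-readable description."""
    description: str
    query: str


def add_person(name: str) -> GremlinQuery:
    """Statement that adds a 'person' vertex whose id is the person's name."""
    # Assumes the name contains no single quotes; real code should
    # validate or escape values before splicing them into a query string.
    return GremlinQuery(
        description=f"Add person {name}",
        query=f"g.addV('person').property('id', '{name}')",
    )


def add_knows(source: str, target: str) -> GremlinQuery:
    """Statement that adds a directed edge from `source` to `target`."""
    # The edge label 'knows' is an assumption made for this sketch.
    return GremlinQuery(
        description=f"{source} knows {target}",
        query=f"g.V('{source}').addE('knows').to(g.V('{target}'))",
    )


print(add_person("Celia").query)
print(add_knows("Celia", "Benjamin").query)
```

Keeping the description next to the query string makes it easy to log what a statement is meant to do while it runs.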

The following method takes an array of names and returns a list of Gremlin statements constructing the sociogram:
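A builder of this kind could be sketched as follows, again in Python. The function name build_sociogram, the friends_per_person parameter, the seed parameter and the 'knows' edge label are all assumptions. The only detail taken from this article is that the edges come from a random permutation of vertex indices.

```python
import random


def build_sociogram(names, friends_per_person=2, seed=None):
    """Return Gremlin statements: one addV per name, then random edges."""
    rng = random.Random(seed)
    # Assumes names contain no single quotes (see the escaping caveat above).
    statements = [f"g.addV('person').property('id', '{name}')" for name in names]
    for index, name in enumerate(names):
        # Shuffle the indices of all *other* people and connect to the first few,
        # so nobody ends up with an edge to themselves.
        others = [i for i in range(len(names)) if i != index]
        rng.shuffle(others)
        for target in others[:friends_per_person]:
            statements.append(
                f"g.V('{name}').addE('knows').to(g.V('{names[target]}'))"
            )
    return statements


queries = build_sociogram(["Celia", "Benjamin", "Ron", "Ada"], seed=42)
for query in queries:
    print(query)
```

All vertex statements come before the edge statements, because an edge can only be added once both of its vertices exist.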

Now that we got a list of queries that represents our graph, we need a way to run these queries. Thus, we add an async method to execute a list of Gremlin queries on a given Gremlin server connection. The connection is represented by a GremlinServer instance.
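The shape of such a method can be sketched without a real server by passing in the function that submits a query. In the Python sketch below, submit plays the role that SubmitAsync plays in the article's program. The names run_queries and fake_submit are assumptions, and fake_submit only echoes the query instead of talking to Cosmos DB.

```python
import asyncio


async def run_queries(submit, queries):
    """Submit (description, query) pairs one after another and collect the results."""
    results = []
    for description, query in queries:
        print(f"Running: {description}")
        # Awaiting each call keeps the order: vertices exist before edges use them.
        results.append(await submit(query))
    return results


async def fake_submit(query):
    """Stand-in for a real client call; it never contacts a server."""
    await asyncio.sleep(0)
    return {"query": query, "status": "ok"}


results = asyncio.run(run_queries(fake_submit, [
    ("Add Celia", "g.addV('person').property('id', 'Celia')"),
    ("Add Ron", "g.addV('person').property('id', 'Ron')"),
    ("Celia knows Ron", "g.V('Celia').addE('knows').to(g.V('Ron'))"),
]))
print(results)
```

Running the statements sequentially is the simplest choice; batching or parallel submission would be an optimization with its own ordering concerns.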

The dynamic returned by SubmitAsync corresponds to the JSON responses that we have already seen in the previous chapter. Thus, SubmitAsync can also be used to read query results:

Observing a Gremlin response while debugging

Within our Main method, we finally wire everything together:

Please note that async Main methods require C# 7.1 or above. After setting the constants above to the URI and key of your Cosmos DB account, you can run the program and examine the resulting graph within the Data Explorer. As the edge construction is based on a random permutation of vertex indices, the results may vary.

A small subset of the resulting Sociogram

Gremlin revisited

Now that we have created a bigger graph, we can take a look at some more advanced Gremlin queries. Again, we will do so by executing them within the Cosmos DB Data Explorer.

Traversing to adjacent Vertices

First, let’s see how many people Celia knows:

g.V('Celia').out().count()

out steps from a source vertex to all adjacent target vertices. Two vertices are called adjacent if they are connected by an edge. In the previous Gremlin chapter, we already queried an adjacent vertex by explicitly stepping over the edge, using outE().inV(). Of course we can also query adjacent source vertices:

g.V('Celia').in()

or all adjacent vertices:

g.V('Celia').both()

Let’s traverse one step further and list the names of all the people known to the people Celia knows:

g.V('Celia').out().out().id()

It is very likely that the result contains some duplicates. If Celia knows two people a, b and both a, b know person c, then c will be listed twice. We can use dedup to de-duplicate them:

g.V('Celia').out().out().dedup().id()
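The effect of dedup can be reproduced with a small Python sketch. The edges below encode exactly the situation just described (Celia knows a and b, and both know c); the helper names out and dedup are stand-ins for the Gremlin steps of the same name.

```python
# Hypothetical 'knows' edges matching the example in the text.
KNOWS = {
    "Celia": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
}


def out(vertices):
    """Follow every outgoing edge of every vertex; duplicates are kept."""
    return [target for v in vertices for target in KNOWS.get(v, [])]


def dedup(items):
    """Drop repeated items while keeping the order of first appearance."""
    return list(dict.fromkeys(items))


two_hops = out(out(["Celia"]))
unique = dedup(two_hops)
print(two_hops)
print(unique)
```

c shows up once per route that leads to it, which is why the undeduplicated result is a list of traversers rather than a set of vertices.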

Using Paths to query a Traversal

Until now, all our queries returned a certain subset of our graph, be it a set of vertices, an edge or the property of a vertex. Using Gremlin, we can also retrieve information about the walk that we have taken through a graph.

A path denotes a walk through the graph by returning all objects that were visited during a particular traversal. Let’s take a look at an example:

g.V('Celia').out().path().limit(1)

This will give us the following result:

[
  {
    "labels": [
      [],
      []
    ],
    "objects": [
      {
        "id": "Celia",
        "label": "person",
        "type": "vertex"
      },
      {
        "id": "Benjamin",
        "label": "person",
        "type": "vertex"
      }
    ]
  }
]

The path object consists of two properties:

  • A list of labels of the steps we have made. The first step was the retrieval of Celia using g.V('Celia') and as a second step, we moved to an adjacent vertex using out(). Until now, we did not assign any labels to the steps, hence the list of empty labels.
  • A list of objects that have been visited on our traversal.

Of course, using g.V('Celia').out(), we did not only traverse to Benjamin but to all other people Celia is connected to by an outgoing edge. We can see all these other paths when removing the limit(1) step from our query.
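One way to picture path() is to let every traverser carry the list of objects it has visited so far. The Python sketch below does that for g.V('Celia').out(). The edges are hypothetical, the helper names are my own, and the order in which a real database returns paths is not something this sketch can predict.

```python
# Hypothetical outgoing edges; the real sociogram is generated randomly.
KNOWS = {"Celia": ["Benjamin", "Ron"], "Benjamin": ["Ron"]}


def start(vertex_id):
    """g.V(id): a single traverser whose path contains only the start vertex."""
    return [[vertex_id]]


def out(paths):
    """out(): extend every path by each vertex reachable from its last element."""
    return [path + [target] for path in paths for target in KNOWS.get(path[-1], [])]


all_paths = out(start("Celia"))   # like g.V('Celia').out().path()
first_path = all_paths[:1]        # like adding .limit(1)

print(all_paths)
print(first_path)
```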

Exercise 2:
Experiment with the following queries using the Data Explorer and describe the result:

A) g.V('Benjamin').as('start').out().out().as('move').path()
.limit(1)
B) g.V('Ron').in().outE().path()

Answer:

A) This time, the path consists of three steps. The path object contains the labels 'start' and 'move', as we have assigned these labels to the first and third step of our traversal.
Using as(x), we can assign the label x to the previous step.
B) The path object contains two vertices followed by an edge: Ron, a vertex that has an edge pointing to Ron, and one of that vertex's outgoing edges.

Walk the Loop

So far, we explicitly expressed every single step within a traversal. Using repeat, we can enrich queries with various loops. For example, we have already seen a two-step query like the following:

g.V('Ron').out().out().id()

We get the same result by repeating the out statement using a loop:

g.V('Ron').repeat(out()).times(2).id()

Basically, there are two kinds of loops we can utilize:

  • The repeat…times loop executes a sequence of steps a given number of times. The repeated steps are specified within the parameter of the repeat statement.
  • The repeat…until loop executes a sequence of steps until a certain condition has been met. The condition is specified within the parameter of the until statement.
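Both loop forms can be modelled in a few lines of Python. The sketch assumes that repeat(...).until(...) behaves like a do-while loop, meaning the step is applied first and the condition is checked afterwards. The data is hypothetical, and max_rounds is a safety guard I added for the sketch; Gremlin itself has no such default.

```python
# Hypothetical outgoing edges for illustration.
KNOWS = {"Ron": ["Ada", "Ben"], "Ada": ["Cy"], "Ben": ["Cy", "Dee"]}


def out(vertices):
    """Follow every outgoing edge of every vertex."""
    return [target for v in vertices for target in KNOWS.get(v, [])]


def repeat_times(vertices, step, times):
    """repeat(step).times(n): apply the step exactly n times."""
    for _ in range(times):
        vertices = step(vertices)
    return vertices


def repeat_until(vertices, step, condition, max_rounds=10):
    """repeat(step).until(condition): emit traversers once they satisfy the condition."""
    matched = []
    for _ in range(max_rounds):
        vertices = step(vertices)
        matched += [v for v in vertices if condition(v)]
        vertices = [v for v in vertices if not condition(v)]
        if not vertices:
            break
    return matched


looped = repeat_times(["Ron"], out, 2)
explicit = out(out(["Ron"]))
found = repeat_until(["Ron"], out, lambda v: v == "Dee")
print(looped, explicit, found)
```

The times variant always takes a fixed number of steps, while the until variant keeps walking for as long as unmatched traversers remain.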

Let’s take a look at another example. Before we express the query, we assign a unique property to one of the friends of Celia’s friends:

g.V('Celia').out().out().limit(1).property('city', 'Speyer')

(Speyer is one of Germany’s oldest cities and is definitely worth a visit! :) )

The next query searches for a path from the Celia vertex to another vertex that has the property “city = Speyer”. The query above ensures that we find a match:

g.V('Celia').repeat(out()).until(has('city', 'Speyer'))
.path().limit(1)

Of course, there are plenty of paths that lie between both vertices. By increasing the limit parameter, you can take a look at them. The first path will be the shortest.
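The following Python sketch shows why shorter paths come first if the loop is evaluated level by level, that is, all one-step paths before all two-step paths. That evaluation order is an assumption of the sketch. The graph, the function name paths_until and the max_rounds guard are made up as well.

```python
# Hypothetical outgoing edges and one city property.
KNOWS = {"Celia": ["Ada", "Ben"], "Ada": ["Eve"], "Ben": ["Cy"], "Cy": ["Eve"]}
CITY = {"Eve": "Speyer"}


def paths_until(start, condition, limit, max_rounds=10):
    """Like repeat(out()).until(condition).path().limit(n), expanded one level per round."""
    frontier = [[start]]
    found = []
    for _ in range(max_rounds):
        # Extend every unfinished path by one outgoing edge.
        frontier = [p + [t] for p in frontier for t in KNOWS.get(p[-1], [])]
        found += [p for p in frontier if condition(p[-1])]
        frontier = [p for p in frontier if not condition(p[-1])]
        if len(found) >= limit or not frontier:
            break
    return found[:limit]


def in_speyer(vertex):
    return CITY.get(vertex) == "Speyer"


one = paths_until("Celia", in_speyer, limit=1)
two = paths_until("Celia", in_speyer, limit=2)
print(one)
print(two)
```

On a graph with cycles, the unbounded version of this loop could run forever if no vertex matches, which is why the sketch caps the number of rounds.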

This section has just shown some very basic query capabilities of Gremlin. By visiting the official TinkerPop page, you can learn about more advanced query scenarios: http://tinkerpop.apache.org/docs/current/recipes/.

Using Graph Databases alongside other Databases

In larger systems, we usually face a variety of different data storage problems. Some components might need good transaction support, while other components might need to efficiently store and query graphs, unstructured documents or blobs. In this case, we may want to take advantage of Polyglot Persistence.

Basically, Polyglot Persistence is the notion that a system can utilize multiple data storage technologies to solve different problems. It allows us to use the right tool for the right use case. In the context of this article, it means that we can use a graph database like Azure Cosmos DB without using it for all our persistence concerns.

Of course, Polyglot Persistence comes at a price: Both the system architecture and the technology stack get more complex. Developers need to integrate multiple data access SDKs, there are multiple data models (which might also need to be synchronized) and of course, there are multiple database systems that need to be deployed and managed. Still, the cost of accidental complexity, resulting from pushing everything into one database, might be significantly higher. Ultimately, integrating a new database system is an architectural decision that must always be considered carefully.

Let’s assume we have decided to use a graph database alongside a relational database.

In a simple scenario, the graph data model and the relational data model are independent of each other. Each database manages write and read operations of its own set of entities. The only references between both data models are references to entity ids. The application, be it a monolith or a set of services, talks to one database depending on the use case:

A simple system that uses multiple databases

In a more complex scenario, both database models may contain the same entities and offer different representations in different use cases.

For example, think about a large enterprise organization model involving departments, groups and people, enriched by a complex set of roles and permissions at different levels. In some cases, you may want to filter and aggregate certain statistics based on all employees or groups. A relational database is well-suited for this task. In another use case, you may want to query certain rights of a single person who belongs to certain groups, projects and a hierarchy of departments. To get this information, we basically need to traverse the organizational graph. In this case, a graph database model would be desirable.

In this case, we want to represent the enterprise organization both as a relational and graph-based data model.

Which software design could efficiently enable the above scenario? If both data models contain the same entities, they need to be in sync. However, applying every write statement to both databases is not an option that scales very well: To prevent inconsistencies, we would need to utilize heavyweight distributed transactions.

Instead, it is a good idea to declare one database (e.g. the relational db) as the source of truth: All write operations are exclusively applied to that database. The other database (e.g. the graph db) is used as a read store: The application logic may read from that store, but it never modifies its state. The read store is exclusively modified by a dedicated synchronization process, which runs asynchronously. The asynchronous update of the read store comes with some advantages: If the system is under heavy load, we can defer the update. If the update process fails, it can be retried.

By using the above asynchronous data synchronization, we can increase the throughput of write operations in our system. However, we also relax the consistency between both data models: The read store lags behind the source of truth by a finite amount of time, so reads from it may briefly return stale data. That architectural tradeoff between throughput and consistency has to be carefully aligned with the non-functional requirements of our system.
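The source-of-truth and read-store split can be sketched with two in-memory stand-ins and a queue of pending changes. Everything in the Python sketch below is hypothetical (the class names, the outbox queue, the 'member_of' label); it only illustrates the flow of writes, deferred updates and retries.

```python
from collections import deque


class SourceOfTruth:
    """Stand-in for the relational database: all writes go here."""

    def __init__(self):
        self.memberships = set()   # (person, group) rows
        self.outbox = deque()      # changes not yet applied to the read store

    def add_membership(self, person, group):
        self.memberships.add((person, group))
        self.outbox.append(("add_edge", person, "member_of", group))


class GraphReadStore:
    """Stand-in for the graph database: only the sync process writes to it."""

    def __init__(self):
        self.edges = set()

    def apply(self, event):
        kind, source, label, target = event
        if kind == "add_edge":
            # Adding to a set is idempotent, so a retried event does no harm.
            self.edges.add((source, label, target))


def synchronize(source, read_store, max_attempts=3):
    """Drain the outbox; an event that keeps failing stays queued for a later run."""
    while source.outbox:
        event = source.outbox[0]
        for _ in range(max_attempts):
            try:
                read_store.apply(event)
                break
            except Exception:
                continue
        else:
            return False   # give up for now, the event stays at the head of the queue
        source.outbox.popleft()
    return True


db = SourceOfTruth()
graph = GraphReadStore()
db.add_membership("Alice", "Engineering")
stale = len(graph.edges) == 0   # the read store has not caught up yet
synced = synchronize(db, graph)
print(stale, synced, graph.edges)
```

Between the write and the call to synchronize, the read store is stale. That window is the consistency we give up in exchange for not coupling every write to both databases.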

Implementing a read store synchronization would go far beyond the scope of this article and there are plenty of options. For example, the Event Sourcing pattern enables an efficient implementation that scales very well. However, that exercise will be left to another article.

That’s all for now. If you have any questions or feedback regarding Cosmos DB or graph databases in general, I would be happy to read your comments or messages.

— Marco