How are graphs distributed with Azure Cosmos DB’s Gremlin API, and how can you use that information when designing a scale-out graph application?

In this article, we will look at one such graph database platform, Azure Cosmos DB’s Gremlin API — a distributed, geo-replicated, fully managed graph database — and see how understanding its operational philosophy can help us make informed design choices while building a scalable graph application.

What we look to cover: We will start with an overview of Cosmos DB partitioned containers, followed by how graphs are distributed and how that distribution affects the cost of various graph operations. Finally, we will look at some important design considerations curated from a set of practical use cases.

What we will not attempt to cover: To keep the article concise and self-contained, we won’t directly get into how to model data as graphs, or how to pick a partition key for a graph data set. The goal of this article is to make developers and architects aware of the various decision points and design alternatives so that they can make the best choices for their application scenarios. …

Azure Cosmos DB’s Graph API supports turn-key geo-distribution, which means that with the click of a button, one can replicate the entire graph to another Azure region. Moreover, since Azure Cosmos DB is a ring-zero service, one can pick from a large and ever-growing number of Azure regions. In a world where “apps” are becoming increasingly globalized, geo-replication provides excellent benefits such as:

  1. Read scalability by distributing the reads to a number of read regions.
  2. Low latency by directing the users to the nearest read region.
  3. Easy roll-out of a product/service to a new geographic user base, with the local data privacy laws taken care of by Azure. …

It is often required that multiple writes via the Gremlin API execute as an atomic unit. A common example is adding a new vertex along with an edge connecting the new vertex to an existing vertex.

To make the example a bit more explicit, let’s say the task at hand is: add a ‘tweet’ vertex and create an edge between the tweet and the ‘user’ vertex that created it. The application semantics here are that a tweet vertex must always be connected to the user who created it.

The standard way to write this in Gremlin is the…
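As an illustration (all ids, labels, and property names below are hypothetical), the two writes can be combined into a single Gremlin traversal so that they travel to the server as one request rather than two independent ones:

```gremlin
// Hypothetical ids/labels: 'u1' is an existing user vertex.
// Combining both writes into ONE traversal sends them as a
// single request instead of two independent round trips.
g.V('u1').as('u').
  addV('tweet').
    property('id', 't1').
    property('text', 'hello world').
  addE('createdBy').to('u')
```

After `addV`, the traverser holds the newly created tweet vertex, so `addE('createdBy').to('u')` draws the edge from the tweet to the user captured earlier with `as('u')`.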

Why is pagination a hard problem for TinkerPop graph databases? What can we do as application developers? What else can we achieve as a by-product?

If you have looked for efficient solutions to paginate the results of your Gremlin queries, you may have stumbled upon this post on Stack Overflow, perhaps only to realize that pagination is a difficult problem to solve for most TinkerPop-enabled graph databases. Here, I am referring to the response by Stephen, one of the most prolific contributors to the graph technology community. Now, while there doesn’t appear to be a silver bullet for the problem, as application developers we can do much better by shouldering some of the responsibility ourselves.

TL;DR: Supporting efficient pagination on generic graph queries is an almost impractical problem for graph database providers to solve. In fact, even if pagination were supported in graph databases, it might not have provided client applications with much benefit, at least with respect to latency and cost. The reason is that the state a database must maintain to efficiently retrieve a subsequent page can be arbitrarily complex, and that state itself can be very expensive to derive. As users of graph databases, though, we can devise relatively efficient client-side pagination. The trick is to bucketize the traversal scope (or exploration scope) of a Gremlin query, and then run the traversal for each of the buckets to generate the pages. A minor downside is that, depending on the complexity of a query, the size of the pages generated by these buckets may not be precisely known ahead of time. On the other hand, a major upside is somewhat automatic rate-limiting, specifically in the context of throughput-provisioned cloud databases. …
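To make the bucketing idea concrete, here is a minimal client-side sketch in Python. The `partitionKey` property, the `tweeted` edge label, and the bucket size are illustrative assumptions; a real application would submit each generated query through a Gremlin client (e.g., gremlinpython) rather than just building the strings:

```python
from typing import Iterable, Iterator, List

def bucketize(keys: Iterable[str], bucket_size: int) -> Iterator[List[str]]:
    """Split the traversal scope (here, partition-key values) into
    fixed-size buckets; each bucket becomes one 'page' query."""
    bucket: List[str] = []
    for key in keys:
        bucket.append(key)
        if len(bucket) == bucket_size:
            yield bucket
            bucket = []
    if bucket:  # emit the final, possibly smaller, bucket
        yield bucket

def page_query(bucket: List[str]) -> str:
    """Build the Gremlin query for one bucket. Property and label
    names ('partitionKey', 'tweeted') are hypothetical."""
    ids = ", ".join(f"'{k}'" for k in bucket)
    return (f"g.V().has('user', 'partitionKey', within({ids}))"
            ".out('tweeted').values('text')")

# Five users paged in buckets of two -> three queries, each touching
# a bounded slice of the graph, which also caps the RU cost per call.
users = ["u1", "u2", "u3", "u4", "u5"]
queries = [page_query(b) for b in bucketize(users, 2)]
```

Because each query touches only one bucket’s worth of vertices, the per-request cost stays bounded, which is exactly the “automatic rate-limiting” by-product mentioned above.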

Gremlin is one of the most popular query languages for exploring and analyzing data modeled as property graphs. Many graph database vendors support Gremlin as their query language, but in this article we will focus on Azure Cosmos DB, one of the world’s first fully managed, geo-distributed, multi-master-capable graph databases.

To set expectations, this article is not aimed at teaching Gremlin; rather, it should be seen as a self-help article. …

Prerequisites: Cosmos DB Partitioning, Bulk Executor Library Overview, Sample: Bulk Importing Documents, Sample: Bulk Importing Graphs

Before addressing the primary topic, it is worth mentioning that bulk-loading data into a distributed, fault-tolerant, auto-scaled, and auto-indexed database poses a completely different set of challenges than uploading data to a centralized database. The fact that each ingested data point needs to be stored, replicated four ways, indexed, and automatically load-balanced makes it a lot more complicated than loading data into a centralized system where index builds can be deferred. …


Jayanta Mondal

These opinions are my own and not the views of my employer (Microsoft).
