Graph Cache: Caching data in N-dimensional structures

Ritesh Shergill
11 min read · Oct 13, 2020

I have been working in the IT industry for about 14 years now. From web applications to cloud infrastructure and even machine learning, I have seen it all. You may think of me as a jack of all trades but master of none.

If all those years of experience have taught me anything, it is this.

It is all about the data and how it is stored.

Repositories and databases have always fascinated me, as it is fairly intuitive that any program is only as good as the data behind it. Most of the time, anyway.

In my quest I have come across two types of databases: SQL and NoSQL. Most of you are well aware of these two flavors. Even within NoSQL there are several types of databases:

— Document databases

— Key-Value stores

— Column Oriented

— Graph Databases … and so on

The ones that fascinated me the most are Graph Databases. And let me tell you why:

  1. It's about how the data is related, which is always important to know
  2. They are great for generating insights as the relational metadata gives the necessary context to connect the dots.
  3. You start thinking in terms of Vertices and Edges which are extra dimensions to your data as opposed to just storing Key Value pairs or rows of data mapped by a primary key.
  4. A lot of data redundancy is eliminated intuitively

So it's a whole new way of looking at data stored in a repository. You could call it a dimensional, relationship-first way of looking at data rather than a flat, row-by-row one.

Walmart uses it. UBS uses it. Daimler and Toyota use it. And if so many established organizations are using Graph databases to drive recommendations, promotions and streamline logistics then there must be something to it!

Having waxed eloquent about Graph Databases, though, we must talk about their shortcomings as well. Why not Graph Databases?

  1. Query optimization options are limited although this challenge is certainly being addressed in newer versions.
  2. Data partitioning is a tough proposition so horizontal scaling becomes challenging.
  3. Computationally intensive or costly queries — Most real-world graphs are highly dynamic and often generate large volumes of data at a very rapid rate. One challenge here is how to store the historical trace compactly while still enabling efficient execution of point queries and global or neighborhood-centric analysis tasks.
  4. Queries spanning multiple hierarchies can be time consuming and not suitable for real-time querying.
  5. Design is only as good as the implementation and this is where the human element comes in. Not everyone can think in terms of graphs.

These are some of the most challenging aspects of Graph databases. But what if I were to tell you that most of these concerns can be alleviated? The options, of course, are limited, but nonetheless there are options.

Enter Graph Cache: an in-memory caching solution designed to store data with relationships… or without.

The Background

Graph datasets are proliferating nowadays due to their ability to capture, and allow for the analysis of, complex relations among objects, by modelling entities with nodes and their relations/interactions with edges. Graphs have thus been used with great success in a wide variety of application areas, from chemical and bioinformatics datasets to social networks. Central to graph analytics is the ability to locate patterns in dataset graphs. Informally, given a query (pattern) graph g, the system is asked to return the set of dataset graphs that contain g (subgraph query) or are contained in g (supergraph query), aptly named the answer set of g. Unfortunately, these operations can be very costly, as they entail the NP-complete problem of subgraph isomorphism, and even the popular algorithms are known to be computationally expensive.

In this day and age when memory and processing power can be constraints for web applications with high load, fast data retrieval in almost real time is the need of the hour.

Given the availability of sophisticated algorithms for graph traversal, and of software that enables developers to build intelligent querying around data using open-source APIs such as GraphQL, the need of the hour is a way to index the data contained in graphs as well as the relationships between their nodes.

As an example, suppose we have the following graph:

What would be the complexity to traverse from Node 1 to Node 8?

BFS: It would be O(V + E), where V represents the vertices/nodes and E the edges of the graph.

DFS: It would be O(V²) in the worst case (when the graph is represented as an adjacency matrix).

For large graphs, the traversal cost grows rapidly as we increase the number of relationships between vertices and introduce bidirectional relationships, since traversal can proceed in either direction of a relationship.
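To make the traversal cost concrete, here is a minimal BFS sketch in Java; the node numbering and adjacency list are hypothetical stand-ins for the figure above, not taken from any real dataset. Every vertex is enqueued at most once and every edge is inspected at most once, which is where the O(V + E) bound comes from.

```java
import java.util.*;

public class BfsTraversal {

    public static void main(String[] args) {
        // Hypothetical adjacency list standing in for the example graph (nodes 1..8)
        Map<Integer, List<Integer>> graph = Map.of(
                1, List.of(2, 3),
                2, List.of(4, 5),
                3, List.of(6),
                4, List.of(7),
                5, List.of(8),
                6, List.of(8),
                7, List.of(),
                8, List.of());

        System.out.println(bfsPath(graph, 1, 8)); // prints [1, 2, 5, 8]
    }

    // Classic BFS: every vertex is enqueued at most once and every edge is
    // inspected at most once, which is where the O(V + E) bound comes from.
    static List<Integer> bfsPath(Map<Integer, List<Integer>> graph, int start, int target) {
        Map<Integer, Integer> parent = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        parent.put(start, start);

        while (!queue.isEmpty()) {
            int current = queue.poll();
            if (current == target) {
                // Walk the parent pointers back to the start to rebuild the path
                LinkedList<Integer> path = new LinkedList<>();
                for (int n = target; n != start; n = parent.get(n)) {
                    path.addFirst(n);
                }
                path.addFirst(start);
                return path;
            }
            for (int next : graph.getOrDefault(current, List.of())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return List.of(); // no path found
    }
}
```

Good enough for a single lookup, but every query pays the traversal cost all over again, and that is exactly what a cache can avoid.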

With a caching mechanism, we seek to use the benefit of indexes and to have a data structure in place that allows almost O(1) retrieval of data. As an example, if we were to start creating indexes for caching the data contained in the graph above, it would look something like this:

Indexing by edges/relationships:

So given that you have the index/hash value associated with an edge, you can retrieve the subtree in O(1) time by providing that index/hash value to the caching data structure, as sketched below.

We can make similar cases for caching

  • Node to Edges relationships
  • Caching entire subtrees against a unique index value

and so on.
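As a rough illustration of such an index (the class and method names here are hypothetical, not the actual GraphCache API), a plain hash map from an edge key to the subtree hanging off that edge already gives the O(1) lookup described above; what you store against each key, a node, an edge, or an entire subtree, is up to the application.

```java
import java.util.*;

// Hypothetical node model; not the actual GraphCache API.
record CacheNode(String id, Map<String, Object> data, List<CacheNode> children) {}

public class EdgeIndexCache {

    // Index/hash value of an edge -> the subtree reachable through that edge.
    private final Map<String, CacheNode> edgeIndex = new HashMap<>();

    public void cacheSubtree(String edgeKey, CacheNode subtreeRoot) {
        edgeIndex.put(edgeKey, subtreeRoot);
    }

    // O(1) average-case lookup: no traversal of the underlying graph is needed.
    public Optional<CacheNode> getSubtree(String edgeKey) {
        return Optional.ofNullable(edgeIndex.get(edgeKey));
    }

    public static void main(String[] args) {
        EdgeIndexCache cache = new EdgeIndexCache();
        CacheNode node8 = new CacheNode("8", Map.of(), List.of());
        CacheNode node5 = new CacheNode("5", Map.of(), List.of(node8));

        // Index the edge 1 -> 5 against the subtree rooted at node 5.
        cache.cacheSubtree("edge:1->5", node5);
        cache.getSubtree("edge:1->5")
             .ifPresent(n -> System.out.println("Found subtree rooted at node " + n.id()));
    }
}
```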

The idea behind the graph caching structure is to give application developers the flexibility to store data in graphs, and then store the graphs (or information around the graphs) in any manner they see fit, allowing them to retrieve subtree data from the caching structure in O(1), i.e. near real time.

Also, a graph structure allows applications to maintain temporal aspects of the data. This is especially useful for keeping recent historical information as snapshots in the vertices of the graph, so that non-archived recent history is readily available in memory and can be retrieved in real time.
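As a small, hypothetical sketch of that temporal idea (again, not the actual GraphCache API, and the timestamps and values below are invented), a vertex can keep a bounded, time-ordered window of snapshots so that recent history stays in memory while older entries age out to archival storage.

```java
import java.time.Instant;
import java.util.*;

// Hypothetical vertex that keeps a bounded window of recent snapshots in memory.
public class TemporalVertex {

    private final String id;
    private final int maxSnapshots;
    // Newest-first ordering by timestamp.
    private final NavigableMap<Instant, Map<String, Object>> snapshots =
            new TreeMap<>(Comparator.reverseOrder());

    public TemporalVertex(String id, int maxSnapshots) {
        this.id = id;
        this.maxSnapshots = maxSnapshots;
    }

    public void addSnapshot(Instant at, Map<String, Object> state) {
        snapshots.put(at, state);
        // Age out the oldest snapshot once the window is full (it would be archived elsewhere).
        if (snapshots.size() > maxSnapshots) {
            snapshots.pollLastEntry();
        }
    }

    // Most recent snapshot taken at or before the requested time.
    public Optional<Map<String, Object>> snapshotAt(Instant at) {
        return Optional.ofNullable(snapshots.ceilingEntry(at)).map(Map.Entry::getValue);
    }

    public static void main(String[] args) {
        TemporalVertex se1 = new TemporalVertex("SE1", 7); // keep one week of snapshots
        se1.addSnapshot(Instant.parse("2020-10-12T16:00:00Z"), Map.of("tradedValueINR", 1_250_000));
        se1.addSnapshot(Instant.parse("2020-10-13T16:00:00Z"), Map.of("tradedValueINR", 1_310_000));
        System.out.println(se1.id + " latest state: " + se1.snapshotAt(Instant.now()).orElse(Map.of()));
    }
}
```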

Example Number One:

As an example, let's assume that you are maintaining a week of stock market trades for a stock, and that you maintain this data across the multiple stock exchanges where it is listed.

Fig. 1

Let's see how a graph might look for this data:

Fig. 2

Although the representation in Fig. 2 might look confusing, let me explain.

→Stock XYZ is the root node.

→SE1, SE2, and SE3 are stock exchanges 1, 2 and 3 respectively.

→Green arrows are the edges that represent the traded values in INR daywise.

→Maroon ones are the USD values daywise.

→Blue ones are the EUR traded values daywise.

So when I query the following:

give me the target node for Edge INR of stock XYZ.

→That gives me SE1.

Then I query:

give me the Traded value for this on Day 2.

→So I get the edge connecting SE1 and Day 2 and extract the INR value from that.

As you can see from this representation,

  1. I have avoided repeating the values of the stock name and the days traded.
  2. I have also given meaning to the relationship by adding metadata to the edges.
  3. I now have temporal data spanning 7 days (I only show 4 in the image for the sake of simplicity) and can query my graph to get data for any day I require.

Suppose I had data for 1,000 days and wanted the traded value in INR for the first day. Great, I have that in my graph, and the fetch is fast enough, going from

XYZ → SE1 → Day 1
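Here is a rough sketch of how that two-step lookup could be modelled; the node and edge names mirror Fig. 2, the traded values are made up for illustration, and none of this is the actual GraphCache API. Because the edges carry the metadata (currency, day, traded value), resolving XYZ → SE1 → Day 2 is just two map lookups rather than a graph traversal.

```java
import java.util.*;

// Hypothetical model of Fig. 2: the edges carry the metadata (currency / day / traded value).
public class StockGraphExample {

    record Edge(String label, String targetNode, Map<String, Object> metadata) {}

    public static void main(String[] args) {
        // Outgoing edges per node, indexed by edge label.
        Map<String, Map<String, Edge>> outgoing = new HashMap<>();

        // Stock XYZ -> its exchanges, one edge per currency.
        outgoing.put("XYZ", Map.of(
                "INR", new Edge("INR", "SE1", Map.of()),
                "USD", new Edge("USD", "SE2", Map.of()),
                "EUR", new Edge("EUR", "SE3", Map.of())));

        // SE1 -> the day nodes; the traded value lives on the edge itself (values invented).
        outgoing.put("SE1", Map.of(
                "Day1", new Edge("Day1", "Day1", Map.of("currency", "INR", "tradedValue", 1_250_000)),
                "Day2", new Edge("Day2", "Day2", Map.of("currency", "INR", "tradedValue", 1_310_000))));

        // Query 1: "give me the target node for edge INR of stock XYZ" -> SE1
        String exchange = outgoing.get("XYZ").get("INR").targetNode();

        // Query 2: "give me the traded value for this on Day 2" -> read the edge metadata
        Object value = outgoing.get(exchange).get("Day2").metadata().get("tradedValue");

        System.out.println(exchange + " traded value on Day 2: " + value);
    }
}
```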

This is great for shallow relationships. Now how about this:

Continent → Country → State → Stock Exchange 1 … Stock Exchange N

Just the mapping from a continent to the stock exchange is deep. A query spanning this kind of a dataset is likely to be computationally and time intensive.

Now what if I were to cache what I am looking for up front?

Let's assume I need the traded value on Day 10 for stock XYZ from Stock Exchange 15.

I already know which stock exchange it is, and I have the relationship from stock exchange to traded day storing the traded value in INR. So I cache just that, as follows:

Fig. 3

The alphanumeric value 2323H423 is a unique key in the Graph Cache that represents the information for Stock Exchange 15. From Stock Exchange 15, I can point relationships towards the days, and the relationships themselves can store the traded value along with the currency.
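In code, that caching step might look roughly like this (hypothetical classes, not the actual GraphCache implementation, and the traded values are invented): the unique key points straight at the Stock Exchange 15 subtree, so the deep Continent → Country → State traversal is replaced by a single hash lookup plus one hop to the day edge.

```java
import java.util.*;

public class SubtreeKeyCache {

    // Hypothetical subtree: an exchange node whose day-edges hold traded value + currency.
    record DayEdge(String currency, double tradedValue) {}
    record ExchangeSubtree(String exchangeName, Map<String, DayEdge> days) {}

    public static void main(String[] args) {
        // Unique key -> cached subtree: one hash lookup instead of a deep traversal.
        Map<String, ExchangeSubtree> cache = new HashMap<>();

        cache.put("2323H423", new ExchangeSubtree("Stock Exchange 15", Map.of(
                "Day9",  new DayEdge("INR", 980_000),
                "Day10", new DayEdge("INR", 1_040_000))));

        // "Traded value on Day 10 for stock XYZ from Stock Exchange 15"
        DayEdge day10 = cache.get("2323H423").days().get("Day10");
        System.out.println(day10.tradedValue() + " " + day10.currency());
    }
}
```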

You can even create partitions on this data as follows:

So I am actually partitioning on days as I store the data in the graph.
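A rough sketch of partitioning on days at write time follows; the week-sized buckets and the key format are assumptions made for the example, not part of the actual implementation. The partition key is derived from the day, so each partition holds a bounded slice of the data and could live on its own cache instance.

```java
import java.util.*;

public class DayPartitionedCache {

    // Partition key -> (data key -> cached value). Each partition could live on its
    // own Graph Cache instance; a map per partition keeps the example small.
    private final Map<String, Map<String, Object>> partitions = new HashMap<>();

    // Assumption for the sketch: partition by week, i.e. days 1-7, 8-14, and so on.
    private static String partitionKeyFor(int day) {
        int bucketStart = ((day - 1) / 7) * 7 + 1;
        return "days-" + bucketStart + "-" + (bucketStart + 6);
    }

    public void put(int day, String dataKey, Object value) {
        partitions.computeIfAbsent(partitionKeyFor(day), k -> new HashMap<>())
                  .put(dataKey, value);
    }

    public Object get(int day, String dataKey) {
        return partitions.getOrDefault(partitionKeyFor(day), Map.of()).get(dataKey);
    }

    public static void main(String[] args) {
        DayPartitionedCache cache = new DayPartitionedCache();
        cache.put(10, "SE15:XYZ:INR", 1_040_000);
        // Day 10 lands in partition "days-8-14"; the lookup only ever touches that partition.
        System.out.println(cache.get(10, "SE15:XYZ:INR"));
    }
}
```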

The final state that we see above takes care of the following shortcomings of a Graph Database:

  1. Flexibility in querying, and doing it intuitively
  2. Data partitioning
  3. No longer computationally intensive, as unique keys now access slices of the deeper relationships in the larger graph
  4. Faster, more performant querying

As we can see from this example, we have solved some of the shortcomings of a graph database.

Example Number Two:

Another example involves caching the states of parent-child containers in a single-page application:

As is evident, this is essentially a graph structure. Page 1 has a link that, when clicked, takes the user via

— Route 1 to ChildPage1, and provides data in JSON format to ChildPage1.

— ChildPage1 has a link that, when clicked, takes the user via Route 3 to ChildPage11 and provides data in JSON format to ChildPage11.

And so on.

Given that our graph caching system sits server side, instead of maintaining the JSON information for the pages on the client, we maintain it server side in our Graph Cache. The graph structure in the cache mirrors the structure above, with routes represented as edges of the graph and the nodes containing the JSON data pertaining to the respective web pages.

On the server side, the cache has indexes on

  • Route to Node/Page mapping
  • Index to Node/Page mapping

So given a scenario where the user has clicked on a link in ParentPage1, we asynchronously, in a non-blocking manner, fetch data for ChildPage1 and ChildPage11 and store it in the graph, in the subtree pertaining to ChildPage1 and ChildPage11. As the user navigates to ChildPage1, we have already retrieved and stored the data for ChildPage11 in the Graph Cache, so that when the user eventually navigates to ChildPage11, that data is already loaded and available in the cache.
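A simplified, server-side sketch of that prefetching flow follows; the page names mirror the example above, and loadPageJson is a hypothetical placeholder for whatever upstream call the application actually makes. When the user clicks a link, the pages reachable through the route edges are fetched asynchronously and dropped into the cache before they are needed.

```java
import java.util.*;
import java.util.concurrent.*;

public class PagePrefetchCache {

    // Route edges (page -> reachable child pages) and the cached JSON per page.
    private final Map<String, List<String>> routeIndex = new ConcurrentHashMap<>();
    private final Map<String, String> pageJsonCache = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    public PagePrefetchCache() {
        routeIndex.put("ParentPage1", List.of("ChildPage1"));
        routeIndex.put("ChildPage1", List.of("ChildPage11"));
    }

    // Called when the user clicks a link: prefetch the next two levels without blocking.
    public void onNavigate(String fromPage) {
        for (String child : routeIndex.getOrDefault(fromPage, List.of())) {
            prefetch(child);
            routeIndex.getOrDefault(child, List.of()).forEach(this::prefetch);
        }
    }

    private void prefetch(String page) {
        CompletableFuture.supplyAsync(() -> loadPageJson(page), executor)
                         .thenAccept(json -> pageJsonCache.put(page, json));
    }

    // Placeholder for the real upstream call (database, REST service, ...).
    private String loadPageJson(String page) {
        return "{\"page\":\"" + page + "\"}";
    }

    public Optional<String> cachedJson(String page) {
        return Optional.ofNullable(pageJsonCache.get(page));
    }

    public static void main(String[] args) throws InterruptedException {
        PagePrefetchCache cache = new PagePrefetchCache();
        cache.onNavigate("ParentPage1");   // user clicks the link on ParentPage1
        Thread.sleep(200);                 // give the async prefetch time to complete
        System.out.println(cache.cachedJson("ChildPage11").orElse("not cached yet"));
        cache.executor.shutdown();
    }
}
```

By the time the user reaches ChildPage11, its JSON is already sitting in the cache.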

So for such HTML applications Graph Cache provides the following benefits:

  1. Fast retrieval of page navigation routes from the server side instead of maintaining this information in the JS code. This makes routing configurable and changeable even while the application is running, allowing dynamic restructuring of an entire website at runtime, which can provide a unique experience to the user even as they navigate through the site.
  2. Maintaining static and recent historical data for the site in the Graph Cache, in server-side memory, lends itself to near real-time retrieval of website data. The JavaScript code no longer has to worry about timeouts or build intelligence around recovering from failure scenarios: when data is not available from an upstream system but is already present in the cache, the website does not suffer from a lack of data to display.
  3. Intelligently creating indexes around the data stored in the web pages lets application and website developers decide which graphs need to be cached and to what depth, so that the way information is stored in and fetched from the cache is customizable, flexible and open to extension by development teams.

This is another example of how a Graph Cache can be used to create a website on steroids.

Several other use cases exist such as:

  1. Caching application dependencies when your microservice depends on several upstream and downstream applications and you want to maintain health and other metrics information in the graph.
  2. Storing data for significantly faster retrieval when training an ML algorithm that relies heavily on the relationships between data points stored in nodes.
  3. Storing JSON schemas as relationships when talking to databases like MongoDB.

The possibilities are indeed limitless.

Some best practices:

  1. Keep the cache in memory for the best performance
  2. Avoid caching very deep graphs, as that hurts performance
  3. Store the data as intuitively as possible
  4. Ensure data is partitioned for scale at the time of storage

Sharding and Distributed In-Memory Caching

This can be achieved by creating an API gateway that maintains a pointer into the instances that contain the shards of cached data. Doesn't make sense yet? Let's use the following diagram to make sense of it:

The Data Aggregator service is responsible for fetching the data across sharded instances using HTTP GET calls to a REST API. The Data Aggregator service is always aware of the service names that represent the actual data it requires.

The service name is provided to a central Shard Service Provider. The Shard Service Provider is like a service registry that is cognizant of the actual API GET addresses that provide the data pertaining to the service name that was provided. In addition, the Shard Service Provider is also responsible for maintaining the shard key into the Graph Cache and the actual data key that is stored as a child of the shard key.

Using the service address, and by providing the shard key and the data key, the Data Aggregator service performs a POST request to the Graph Cache instance that maintains Shard N's information. Using the provided shard key and data key, the Shard N instance of Graph Cache returns the appropriate data to the Data Aggregator service.

So essentially this is a programmatic way of creating a distributed cache and achieving sharding using multiple Graph Cache instance containers.
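A bare-bones sketch of that flow follows; the service names, addresses, and keys are all invented, and the in-memory maps stand in for what would really be HTTP calls between the Data Aggregator, the Shard Service Provider, and the Graph Cache instances.

```java
import java.util.*;

public class ShardedLookupExample {

    // What the Shard Service Provider hands back for a service name.
    record ShardRoute(String instanceAddress, String shardKey) {}

    // Registry role: service name -> which Graph Cache instance holds the shard, and under which key.
    static final Map<String, ShardRoute> SHARD_REGISTRY = Map.of(
            "stock-trades", new ShardRoute("http://graphcache-shard-2:8080", "shard-2"));

    // Stand-in for the Graph Cache instances: instance address -> shard key -> data key -> value.
    static final Map<String, Map<String, Map<String, Object>>> INSTANCES = Map.of(
            "http://graphcache-shard-2:8080", Map.of(
                    "shard-2", Map.of("SE15:XYZ:Day10", 1_040_000)));

    // Data Aggregator role: resolve the route, then ask that instance for the data key.
    static Object fetch(String serviceName, String dataKey) {
        ShardRoute route = SHARD_REGISTRY.get(serviceName);
        // In the real setup this would be an HTTP call to route.instanceAddress();
        // here we simply index into the in-memory stand-in.
        return INSTANCES.get(route.instanceAddress())
                        .get(route.shardKey())
                        .get(dataKey);
    }

    public static void main(String[] args) {
        System.out.println(fetch("stock-trades", "SE15:XYZ:Day10")); // prints 1040000
    }
}
```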

Summary

To summarize, we saw how Graph Caches can be used to change the way we cache data, although from a performance perspective I wouldn't make tall claims that it is faster or superior to any other in-memory caching solution.

No, the one problem that Graph Cache is not trying to solve is being faster than other caching solutions.

The main USP of the Graph Cache is the way you store data in it. As I mentioned before, how the data is related is always important. Relationships between data points give us valuable insights into how we want to treat the data and how we draw inferences from it.

You can find the implementation at:

https://github.com/riteshshergill/graphcache

Go check it out today.

Cache a Parent node. Cache a child node. Cache an edge. Cache the whole graph.

Go Crazy!

References:

  1. https://graphql.org/
  2. https://openproceedings.org/2017/conf/edbt/paper-119.pdf
  3. https://neo4j.com

