Stargate: Towards DynamoDB Compatibility for Cassandra

Amazon DynamoDB and Apache Cassandra® are two popular NoSQL databases. They share several similarities, but unlike DynamoDB, Cassandra is free and open source. This open-source ecosystem project for the Stargate data gateway examines what DynamoDB API compatibility for Apache Cassandra®-based systems could look like. Read on to learn more.

To fulfill my capstone project requirement for the Carnegie Mellon University (CMU) Master of Computational Data Science (MCDS) program, two of my classmates, Ziyan Zhang and Xiang Yue, and I collaborated with DataStax to develop a new module in the Stargate system that brings Amazon DynamoDB compatibility to Apache Cassandra®.

In this post, we provide some background on the project and describe the overall design of our system. Then, we discuss some interesting challenges we encountered and how we solved them.

Figure 1. How Stargate can provide DynamoDB compatibility for Cassandra.

Cassandra vs. DynamoDB

Cassandra and DynamoDB are two popular NoSQL databases inspired by Google’s BigTable and Amazon’s Dynamo papers. They have many similarities, but I think it’s more useful to look at some of their biggest differences:

  1. Cassandra is completely free, while DynamoDB is commercial. As a free product, Cassandra can be deployed either on-premises or in the cloud (private, public, or hybrid). For enterprise users, companies like DataStax have cloud offerings and enterprise support for Cassandra.

    In contrast, despite having a free-tier service, DynamoDB is a commercial and proprietary product, meaning that you have a vendor lock-in problem once you decide to use it. That is, you can only use DynamoDB in AWS but not in your private cloud or any other public cloud.

    You don’t have much choice if you begin to feel unsatisfied with the pricing or service because the migration cost would be too high.
  2. Cassandra is open-source, while Amazon DynamoDB is closed-source. The advantages of open-source products have been widely discussed so I’ll save you some time here. From my personal experience, the biggest advantage of open-source is the ability to make tailor-made changes, and the biggest disadvantage of closed-source is the black-box nature of system behavior — there are always things that are not documented.
  3. Cassandra enforces a schema, while DynamoDB is schemaless. Schemaless might be convenient and flexible for developers, but developers often still need some sort of schema on the application side for software engineering reasons.
  4. Both databases have their own query languages. Cassandra uses Cassandra Query Language (CQL), which is a variant of SQL, while DynamoDB (low-level API) uses JSON as a request payload.

There are many more differences, but the first two illustrate why it might be a good idea to use Cassandra instead of DynamoDB, and the last two point to the potential difficulties in switching to Cassandra if you decide to use DynamoDB and later regret that choice.

For users that are already using DynamoDB or already have expertise in using DynamoDB, switching from DynamoDB to Cassandra might be too costly.

How Stargate provides DynamoDB compatibility

It’s difficult for users that are already using and/or are familiar with DynamoDB to switch to Cassandra, so why don’t we make Cassandra compatible with DynamoDB?

Wouldn’t it be nice for users to be able to switch from DynamoDB to Cassandra without having to change a single line of their existing codebase if they want to?

Bingo! That’s what our project is for. Basically, we leverage a third framework called “Stargate” to build a middleware for Cassandra that’s compatible with DynamoDB.

Stargate is an open-source data gateway: a middleware that sits between your app and your database, e.g. Apache Cassandra. It abstracts Cassandra-specific concepts entirely from app developers and supports different API options, lowering the barrier to entry for new software developers.

Right now, Stargate supports REST, Document, gRPC, and GraphQL APIs. These API options are pluggable and can be installed when needed.

Stargate Architecture

The figure below shows the Stargate (v2) architecture. As described in this blog post, Stargate (v2) is highly modular. There are already many services that provide different kinds of APIs. Our goal was to create a new service that provides a DynamoDB API. We wanted this API to be able to understand DynamoDB queries and transform them into Cassandra queries and for users to be able to continue using their existing DynamoDB client code to interact seamlessly with Cassandra.

Figure 2. The highly modular architecture of Stargate (v2).

What does a query workflow look like?

We didn’t want users to have to change a single line of code when switching to Cassandra. But wait a minute… how is that ever possible given that Cassandra and DynamoDB have different client libraries? The answer is simple: DynamoDB clients talk to DynamoDB servers over HTTP.

By implementing a web service on top of Cassandra that behaves in the same way as the DynamoDB server, DynamoDB clients can continue to work without knowing they’re actually talking to Cassandra. We implemented such a service as a new module in Stargate: the Dynamo API Service. A typical workflow is shown in the following diagram.

Figure 3. Query workflow of implementing a web service on top of Cassandra in Stargate.

Sequence diagram for PutItem API

The sequence diagram above shows the workflow for the DynamoDB PutItem API. Let’s ignore the first component, AuthResource, for now; all we need to know is that it helps with authentication. When a user calls the DynamoDB client to put an item into the database, the client sends an HTTP request to the configured DynamoDB server endpoint. To use our system, users just need to change their endpoint from AWS to our service. It’s just a one-line configuration change! Everything else is handled by our service and Cassandra.
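To make the endpoint swap concrete, here is a minimal sketch of the HTTP request a DynamoDB client issues for PutItem, built with Python’s standard library and never actually sent. The `X-Amz-Target` and `Content-Type` values are the ones the DynamoDB low-level API uses; the table name, the item, and especially the Stargate URL are assumptions made up for illustration, and the request signing that real SDKs perform is left out.

```python
import json
import urllib.request


def build_put_item_request(endpoint: str, table: str, item: dict) -> urllib.request.Request:
    """Build (but do not send) a PutItem request in the DynamoDB low-level API format."""
    body = json.dumps({"TableName": table, "Item": item}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/x-amz-json-1.0",
            # The operation name travels in this header, not in the URL path.
            "X-Amz-Target": "DynamoDB_20120810.PutItem",
        },
    )


# Hypothetical table and item, in DynamoDB's typed JSON notation.
item = {"Name": {"S": "Alice"}, "Debt": {"N": "100"}}

# Only the endpoint differs between the two requests.
aws_request = build_put_item_request("https://dynamodb.us-east-1.amazonaws.com/", "Users", item)
# Hypothetical Stargate URL; the real host, port, and path depend on the deployment.
stargate_request = build_put_item_request("http://localhost:8082/v2", "Users", item)
```

Because the payload and headers are identical in both cases, a service that understands this request format can stand in for the DynamoDB server.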

Let’s get back to the sequence diagram. After receiving an HTTP request, DynamoResource, our REST API controller, will recognize the type of request, deserialize the parameters, and then dispatch them to an appropriate Proxy class, in this case, ItemProxy. The ItemProxy component takes the main responsibility of handling the request. Specifically, it needs to parse the request and transform it into a Stargate intermediate representation. You may ask, why an intermediate representation and not a Cassandra query directly? As we mentioned earlier, Stargate is a middleware that sits on top of your database. Although the database we are using is Cassandra, it could be any other database as long as Stargate supports it.
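The dispatch step can be sketched roughly as follows. This is not the actual DynamoResource code (which lives in Stargate’s Java codebase); it is a simplified Python illustration that reads the operation name from the `X-Amz-Target` header and routes the deserialized parameters to a handler. The handler functions, their return values, and the name `TableProxy` are hypothetical.

```python
import json


def handle_put_item(params: dict) -> dict:
    # Placeholder for the work an item-level proxy would do.
    return {"handled_by": "ItemProxy", "table": params["TableName"]}


def handle_create_table(params: dict) -> dict:
    # Placeholder for a hypothetical table-level proxy.
    return {"handled_by": "TableProxy", "table": params["TableName"]}


# Maps DynamoDB operation names to the handler responsible for them.
HANDLERS = {
    "PutItem": handle_put_item,
    "CreateTable": handle_create_table,
}


def dispatch(headers: dict, body: bytes) -> dict:
    """Recognize the request type, deserialize the parameters, and dispatch them."""
    target = headers.get("X-Amz-Target", "")
    # The header looks like "DynamoDB_20120810.PutItem"; keep the part after the dot.
    _, _, operation = target.partition(".")
    handler = HANDLERS.get(operation)
    if handler is None:
        raise ValueError(f"Unsupported operation: {operation!r}")
    return handler(json.loads(body))


result = dispatch(
    {"X-Amz-Target": "DynamoDB_20120810.PutItem"},
    b'{"TableName": "Users", "Item": {"Name": {"S": "Alice"}}}',
)
```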

By transforming the DynamoDB request into a Stargate intermediate representation, we leverage the Cassandra adapter that Stargate already implements. After the transformation, ItemProxy sends the intermediate representation to the Stargate coordinator via StargateBridgeClient, which is essentially a gRPC client. The Stargate coordinator then talks to the Cassandra nodes and returns the results.

Note that some requests are straightforward and only need one round trip between the Proxy, the Stargate coordinator, and the Cassandra cluster. Other requests are more complicated and may need multiple round trips. In the PutItem example shown in the sequence diagram, at most three round trips are needed. Why do we need three round trips for a single write operation? That’s due to the schema difference between Cassandra and DynamoDB. Remember that, in the beginning, we said DynamoDB is schemaless while Cassandra is not? That means you can insert an item with new columns into DynamoDB without pre-defining the schema (actually, you cannot define a schema in DynamoDB), while you can’t do the same in Cassandra.

In Cassandra, if a write operation contains columns that are unknown, the request fails. Therefore, ItemProxy first needs to check whether the schema needs to be updated and, if so, update the schema before actually persisting the data. This sounds very slow, doesn’t it? Luckily, Stargate has a caching mechanism, so most of the time the schema is cached and the overhead is small unless new columns appear frequently.
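The check-then-write logic can be sketched as below. This is a simplified illustration rather than the ItemProxy implementation: the cached schema is modeled as a plain set of known column names, the output is a list of CQL strings instead of Stargate’s intermediate representation, and the mapping from DynamoDB type tags to CQL types is an assumption (nested collections fall back to `blob`).

```python
from typing import Dict, List, Set

# Assumed mapping from DynamoDB type tags to CQL types; anything else becomes a blob.
CQL_TYPES: Dict[str, str] = {"S": "text", "N": "double", "BOOL": "boolean", "B": "blob"}


def plan_put_item(keyspace: str, table: str, item: dict, known_columns: Set[str]) -> List[str]:
    """Return the CQL statements needed to persist `item`, schema changes first."""
    statements = []
    for name, typed_value in item.items():
        if name not in known_columns:
            # A DynamoDB attribute value is a single-entry dict such as {"S": "Alice"}.
            type_tag = next(iter(typed_value))
            cql_type = CQL_TYPES.get(type_tag, "blob")
            statements.append(f'ALTER TABLE {keyspace}.{table} ADD "{name}" {cql_type}')
    columns = ", ".join(f'"{name}"' for name in item)
    placeholders = ", ".join("?" for _ in item)
    statements.append(f"INSERT INTO {keyspace}.{table} ({columns}) VALUES ({placeholders})")
    return statements


item = {"Name": {"S": "Alice"}, "Debt": {"N": "100"}}

# "Debt" is new, so the schema must be altered before the insert.
with_new_column = plan_put_item("ks", "users", item, known_columns={"Name"})
# Both columns are already known, so a single insert is enough.
all_known = plan_put_item("ks", "users", item, known_columns={"Name", "Debt"})
```

When every column is already in the cached schema, the plan collapses to a single insert, which is the common case described above.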

How to authenticate?

This is the first problem we encountered. In DynamoDB (and many other AWS products), there are multiple ways to authenticate; a common one is to set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

Usually, you don’t need to worry about authentication after you have your DynamoDB credentials set up. Stargate, on the other hand, requires you to provide a token in every request you make.

We could let users follow the authentication guide and fetch their token either manually or programmatically, but the problem was how to make the DynamoDB client aware of this token and carry it every time it makes an HTTP request. Of course, we could rewrite the DynamoDB client ourselves, but we wanted to avoid that if possible so that users don’t need to change their client library.

Luckily, we found a trick to tackle this problem. We found out that the DynamoDB client always puts an authorization header that contains an unencrypted AWS_ACCESS_KEY_ID in every HTTP request it makes. This makes sense because a DynamoDB client has to use the HTTP protocol to authenticate itself with the DynamoDB server.

Now, in the Dynamo API Service for Stargate, we can easily read the token from this AWS_ACCESS_KEY_ID field. Problem solved. All the user needs to do is to put the Stargate token into their AWS_ACCESS_KEY_ID environment variable, and then they don’t need to worry about authentication while making requests!
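As a rough sketch of the idea, the function below pulls the access key ID out of an `Authorization` header in the AWS Signature Version 4 format, where it appears as the first segment of the `Credential` field. The header value is a made-up example (the token and signature are placeholders), and the function name is hypothetical; the sketch also assumes the token contains no `/` or `,` characters.

```python
import re

# The access key ID is the part of the Credential field before the first "/".
_CREDENTIAL = re.compile(r"Credential=([^/,\s]+)/")


def extract_access_key_id(authorization_header: str) -> str:
    """Return the access key ID (here: the Stargate token) from a SigV4 header."""
    match = _CREDENTIAL.search(authorization_header)
    if match is None:
        raise ValueError("No Credential field in Authorization header")
    return match.group(1)


# Made-up header for illustration; a real one is generated by the DynamoDB client.
header = (
    "AWS4-HMAC-SHA256 "
    "Credential=my-stargate-token/20220601/us-east-1/dynamodb/aws4_request, "
    "SignedHeaders=host;x-amz-date;x-amz-target, "
    "Signature=abc123"
)

token = extract_access_key_id(header)
```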

The discrepancy in data types

Cassandra and DynamoDB have similar data models but they are not exactly the same. There is one-to-one mapping for basic types but not for the map, list, and set data types in DynamoDB. In DynamoDB, maps, lists, and sets can be deeply nested, meaning that you can have a list of maps of sets or even more complicated data structures. For example, you could insert an item whose goods attribute contains:
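The value below is a hypothetical illustration of such an attribute (the field values are invented), written as a Python dictionary in DynamoDB’s typed JSON notation, where `L`, `M`, and `S` tag lists, maps, and strings:

```python
# Hypothetical "goods" attribute: a list of maps whose "images" field
# is a list in the first map and a plain string in the second.
goods = {
    "L": [
        {
            "M": {
                "name": {"S": "laptop"},
                "images": {"L": [{"S": "front.png"}, {"S": "back.png"}]},
            }
        },
        {
            "M": {
                "name": {"S": "phone"},
                "images": {"S": "phone.png"},
            }
        },
    ]
}

first_images_type = next(iter(goods["L"][0]["M"]["images"]))
second_images_type = next(iter(goods["L"][1]["M"]["images"]))
```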

As we can see, the above data structure is a list of maps. Furthermore, the first map in the list contains images as a list, while the second map in the list contains images as a string. This heterogeneity might not be common in the real world but it’s undoubtedly allowed as DynamoDB is schemaless. Cassandra, despite its support for nested collections, enforces a schema. For example, if you create a column goods with:

goods list<frozen<map<text,text>>>

Then every value in the inner map must be of the text data type. Cassandra’s native collection support cannot handle this use case.

We don’t really know how nested collections are stored in DynamoDB (recall it’s not open-sourced!). But one (good) thing we know is that you can only create indices for top-level attributes with basic data types.

In the previous example, you cannot index the name field because it’s a nested attribute under goods which is a top-level attribute. What does this mean for us? This means we can treat the whole nested collection as a BLOB (binary format) without sacrificing the ability to index.

In all, what we do is quite simple: whenever the user writes a collection data entry, we serialize it into a sequence of bytes and store it in Cassandra. Whenever the user needs to read it, we deserialize the collection from the sequence of bytes stored in Cassandra. And it works just fine!

Right now we are using the Kryo library for serialization and deserialization, but we might write our own methods for better performance in the future.
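The round trip can be illustrated with a few lines of Python. Note the substitution: this sketch uses JSON encoding from the standard library as a stand-in for Kryo, purely to show the serialize-on-write, deserialize-on-read pattern; the function names and the sample value are hypothetical.

```python
import json


def collection_to_blob(value) -> bytes:
    """Serialize a nested collection into bytes suitable for a Cassandra blob column."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def blob_to_collection(blob: bytes):
    """Restore the nested collection from the bytes read back from Cassandra."""
    return json.loads(blob.decode("utf-8"))


# Hypothetical heterogeneous value: "images" is a list in one map and a string in the other.
goods = [
    {"name": "laptop", "images": ["front.png", "back.png"]},
    {"name": "phone", "images": "phone.png"},
]

blob = collection_to_blob(goods)
restored = blob_to_collection(blob)
```

Because the database only ever sees an opaque byte sequence, the heterogeneous inner types never have to fit a CQL collection type.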

Parser is all you need

If you are building a database, then you almost certainly need to write parsers to parse the queries.

DynamoDB, at first glance, seemed to be an exception because it uses JSON as a request payload — there are so many JSON libraries that can help us with the parsing (deserialization). This seems to suggest we don’t need to worry about writing parsers. Unfortunately, this is not actually the case. DynamoDB queries have fields like FilterExpression that allow users to define certain conditions in plain text format. For example, in a query, you can have a filter expression like the following:

(Debt = :debt OR Deposit <> :deposit) AND Sex = :s

In this example, results that don’t satisfy the expression are filtered out. A FilterExpression supports different comparison operators and can be nested. You might think about using a regular expression to match the above text, but that isn’t feasible because regular expressions generally cannot handle nested expressions. We actually tried writing regular expressions for simpler cases, but the code quickly became obscure and we had to give up.

This is where ANTLR comes into play. ANTLR is a popular and powerful parser generator. From the grammars we define, ANTLR generates Java code that parses the expressions into parse trees. We then write code that visits the parse tree and evaluates the expression. With the help of ANTLR, we keep our code concise and easy to maintain. Writing a clean grammar might be a bit challenging at first if you are not familiar with compilers, but it pays off!
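To show why a real parser handles nesting where regular expressions struggle, here is a small hand-written recursive-descent evaluator in Python. It is a stand-in for the ANTLR-generated parser and visitor, not the project’s grammar: it covers only comparisons, parentheses, AND, and OR, and every name in it is hypothetical.

```python
import operator
import re

TOKEN = re.compile(r"\s*(<>|<=|>=|[()=<>]|:\w+|\w+)")
OPS = {"=": operator.eq, "<>": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"Unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class FilterEvaluator:
    """Grammar: or := and (OR and)* ; and := factor (AND factor)* ;
    factor := '(' or ')' | NAME OP :PLACEHOLDER"""

    def __init__(self, expression, item, values):
        self.tokens, self.pos = tokenize(expression), 0
        self.item, self.values = item, values

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        if token is None:
            raise ValueError("Unexpected end of expression")
        self.pos += 1
        return token

    def evaluate(self):
        result = self.parse_or()
        if self.peek() is not None:
            raise ValueError(f"Unexpected token {self.peek()!r}")
        return result

    def parse_or(self):
        result = self.parse_and()
        while self.peek() == "OR":
            self.take()
            right = self.parse_and()  # always parse the right side, then combine
            result = result or right
        return result

    def parse_and(self):
        result = self.parse_factor()
        while self.peek() == "AND":
            self.take()
            right = self.parse_factor()
            result = result and right
        return result

    def parse_factor(self):
        token = self.take()
        if token == "(":
            result = self.parse_or()  # recursion is what handles nesting
            if self.take() != ")":
                raise ValueError("Expected ')'")
            return result
        # Otherwise the token is an attribute name followed by an operator and a placeholder.
        compare = OPS[self.take()]
        return compare(self.item[token], self.values[self.take()])


def matches(expression, item, values):
    return FilterEvaluator(expression, item, values).evaluate()


expression = "(Debt = :debt OR Deposit <> :deposit) AND Sex = :s"
values = {":debt": 100, ":deposit": 0, ":s": "F"}
kept = matches(expression, {"Debt": 100, "Deposit": 0, "Sex": "F"}, values)
dropped = matches(expression, {"Debt": 50, "Deposit": 0, "Sex": "F"}, values)
```

The recursive call in `parse_factor` is the part a regular expression cannot express: each opening parenthesis starts a fresh sub-expression of arbitrary depth.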

Conclusion

This concludes our journey toward completion of our capstone project for our Spring semester in the Master of Computational Data Science (MCDS) program at Carnegie Mellon University (CMU). I’d like to thank DataStax for the help and guidance throughout this journey. We’ll continue working on the project in the Fall semester to complete the rest of the APIs and do a thorough performance benchmark. Our hope is to deliver a complete product in the end!

Follow the DataStax Tech Blog for more developer stories. Check out our YouTube channel for free tutorials and follow DataStax Developers on Twitter for the latest news in our developer community.


Special thanks to all the members of the Stargate community who supported this effort including Prabhat Jha, Sebastian Estevez, Tatu Saloranta, and Jeff Carpenter.
