Journey of Cassandra

Nur Erkartal
Trendyol Tech
Published in
6 min readJun 5, 2020

Once upon a time, there was a fast-growing tech company called Trendyol. And this company needs a new application for its happy customers. In this company needed to develop a new application, research and studies can be done before its release. After the investigation of the application, requirements should be prioritized and the application is developed by matching the requirements. Hence, our path came across with Cassandra. (If you get curious about the name of the application, you need to keep reading :)

In this blog, I am going to share some of the pitfalls to avoid, suggest a few good use cases and offer just a few data advice.

Apache Cassandra is an open-source, distributed, and decentralized storage system (database), for managing very large amounts of structured data spread out across the world. It provides a highly available service with no single point of failure.

It is better to keep in mind that Cassandra is a NoSQL database.

When You Should Think About Using Cassandra?

If you have below requirements on your system, you should think about using Cassandra for the right use cases;

  • To provide a distributed system that store data across multiple nodes.
  • Write operation exceeds read operation with a large margin.
  • The data should be updated rarely and when updates are made if they are idempotent.
  • Read operations should be performed by primary keys.
  • There is no need for joins or aggregates.
  • Ability to linearly scale the database by adding nodes. No need for more hardware on existing nodes.
  • To prefer partition tolerance and high availability over consistency (CAP theorem).

Cassandra Data Modeling

Cassandra data model consists of the keyspaces, column families, primary key, partition key, cluster key, composite key, order by and frozen.

Keyspaces — are the containers of data, similar to the schema or database in a relational database. Keyspaces contain many tables.

Strategy: There are two types of strategy declaration in Cassandra syntax:

* Simple Strategy: Simple strategy is used in the case of one data center. In this strategy, the first replica is placed on the selected node and the remaining nodes are placed in a clockwise direction in the ring without considering rack or node location.

* Network Topology Strategy: This strategy is used in the case of more than one data center. In this strategy, you have to provide a replication factor for each data center separately.

The replication factor is the number of replicas of data placed on different nodes. More than two replication factors are good to attain no single point of failure. So, 3 is a good replication factor.

> CREATE KEYSPACE IF NOT EXISTS example WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3 };> CREATE KEYSPACE IF NOT EXISTS example WITH replication = {'class':'NetworkTopologyStrategy', 'datacenter1' : 3 };

Column Family — is a container of a collection of rows. This is equivalent to a table in a relational database.

  • Row — Each row in Cassandra is identified by a unique key and each row can have different columns.
  • Column– Each Column in Cassandra is a construct that has a name, a value, and a user-defined timestamp with it.

A primary key — uniquely identifies a row. The simple primary key consists of only the partition key.

  • Simple primary key — Cassandra uses one column name as the partition key. The primary key consists of only the partition key in this case.
CREATE TABLE example.user
(
id varchar,
timestamp bigint,
name varchar,
PRIMARY KEY ( id )
);

In this expression, “id” is called as both partition key and simple primary key.

A partition key — is used to determine the specific node in a cluster to store the data by using its hash value.

A clustering key — is used to sort the data in each of the partitions

A compound Primary Key — Clustering keys are optional in a Primary Key. If they are not mentioned, it is stated as a simple primary key. When clustering keys are mentioned, it’s called a Compound primary key.

CREATE TABLE example.user
(
id varchar,
timestamp bigint,
name varchar,
PRIMARY KEY ( id , timestamp)
);

In this expression, “id” is partition key, “timestamp” is a cluster key. (“id” “timestamp” ) is compound primary key.

A composite Partition Key — is stated as using more than one column as the partition key.

CREATE TABLE example.user
(
id varchar,
timestamp bigint,
name varchar,
PRIMARY KEY ( (id, timestamp), name)
);

In this expression, (“id”, “timestamp”) call as the composite partition key, “name” is a cluster key. (“id”, “timestamp” ,“name”) is compound primary key.

Order By — to achieve ordering of data in Cassandra. The default setting of CLUSTERING ORDER on a table consists of your clustering keys in the “ASCending” sort direction.

CREATE TABLE example.user
(
id varchar,
timestamp bigint,
name varchar,
PRIMARY KEY ( id, timestamp )
) WITH CLUSTERING ORDER BY (timestamp DESC);

In this expression, “id” is partition key, “timestamp” is cluster key. The users are stored in the Cassandra descending ordered by “timestamp”.

Frozen — use on a set, map, or list to serialize multiple components into a single value. Non-frozen types allow updates to individual fields, but values in a frozen collection are treated like blobs, any upsert overwrites the entire value.

CREATE TABLE example.users
(
id varchar,
name varchar,
addresses set<frozen<address>>,
PRIMARY KEY ( id )
);

frozen<> is only allowed on collections, tuples and user-defined types

Our Data Modeling

In Trendyol, we want to provide a new service called “Audit Application” (belongs to Order Management System Team — OMS) to increase customer satisfaction and reliability of the system. In this application, Kafka events that are PackageCancelled, InvoiceAddressChanged, and ShipmentAddressChanged will be consumed and audit data will be stored in Cassandra. First of all, this application has the ability to retrieve audit logs by orderId. Secondly, this application must be idempotent which means the same Kafka events should not be consumed more than once.

In case of changing invoice/shipment address or canceling any item in the order by the customer, this application will create audit logs. For this reason, there are many write operations in Cassandra. As it is seen, there is no update operation on the audit data.

According to the requirements for retrieving audit logs by orderId, we select orderId as a partition key since Cassandra suggests that reading operations should be performed with primary keys.

As a requirement for audit logs, audit logs should be displayed in descending order by the date of operation. We have great news! Cassandra has the ability to insert data with the desired order by cluster key. The only thing that you can do is adding timestamp information in your data model and defining timestamp as a cluster key.

audit_example keyspace definition
User Defined Type Representation in Cassandra
Audit Table Definitions

The most important part of Cassandra Modeling is a sequence. Please be aware of the sequence of the primary key definition while integrating it into your system.

There is a challenge that we face as an issue to achieve idempotency in our application. Let me try to explain the detail of the issue. Firstly, before processing any events we need to check the event that is processed before. That is why we need to store each processed event. At the same time, we need to store audit logs. Therefore, we need to write audit logs and event-specific information at the same time. The major challenge is similar data insertion into two related tables in an atomic way. Then, we have realized that Batch operations can save our world.

Batch operations can be used to execute multiple modification statements (insert, update, delete) simultaneously. Batch statements can be used single and multiple partitions which ensure atomicity.

Good reasons for batching operations in Apache Cassandra are:

Inserts, updates or deletes to a single partition when atomicity and isolation is a requirement. Atomicity ensures that either all or nothing is written. Isolation ensures that partial insertion or updates are not accessed until all operations are complete.

Ensuring atomicity for small inserts or updates to multiple partitions when inconsistency cannot occur.

I would like to explain our “Batch” usage in detail with one of the significant use case example. Whenever customer changes invoice address, “InvoiceAddressChanged” event is received. The system checks the event that is already processed or not with the below query. P.S. we are storing all eventid’s with related orderIds in single table which is audit_by_event_id to have less complex system.

SELECT * FROM audit_example.audit_by_event_id WHERE eventid= 'd29c3d6e-301f-4df8-8165-5e5894b4d445';

If the event is processed before, the event will be ignored. Otherwise, invoice address audit information that comes from the event and event-specific information which are eventId and orderId will be stored in the Cassandra in an atomic way using Batch. Batch sample CQL can be found below.

Batch CQL for invoice address audit

Finally the application we’ve crafted is not only capable of storing audits idempotantly and swiftly, it also has excellent read performance with primary key.

Last Chapter

The first part of the Cassandra journey has been finished. We are open to your feedback. We kindly request your contributions to improve our journey. And #cometotrendyol ;)

See you soon on the next series.

--

--