Moving from Cassandra to DynamoDB — Part 1

Natraj T
Published in hubbleconnected
6 min read · Jul 31, 2018

Introduction

Hubble Connected is an IoT company. We have an entire platform with end-to-end connectivity between firmware, cloud, and apps.

Over the years, we used Cassandra to store our time-series data. But we faced a few problems with deletes, updates, compaction, and maintenance in Cassandra. The regular operational procedures became complicated over time, and the stack became difficult to manage. We optimized the stack to reduce the operational complexity, but things did not work as expected.

So we moved from Cassandra to DynamoDB, which solved most of our problems. This article explains the problems we faced with Cassandra and the things to take care of while using it.

Cassandra Summary

The CAP Theorem

The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it was successful or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

According to the theorem, a distributed system cannot satisfy all three of these guarantees at the same time.

Cassandra and CAP

Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered more important than consistency. But Cassandra can be tuned, via the replication factor and the per-query consistency level, to provide stronger consistency as well.
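For example, here is a minimal sketch using the DataStax Java driver (3.x); the keyspace, table, and column names are hypothetical. Reading and writing at QUORUM (so that R + W > RF) gives strongly consistent results at the cost of some availability:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumReadExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("events_ks")) { // hypothetical keyspace
            // Require a majority of replicas to answer this read.
            SimpleStatement stmt = new SimpleStatement(
                    "SELECT * FROM event_master WHERE parentid = ?",
                    java.util.UUID.fromString("215eb74e-6dd5-4a1d-81e8-a01c5fbc429b"));
            stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
            ResultSet rs = session.execute(stmt);
            rs.forEach(System.out::println);
        }
    }
}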

Compaction strategy in Cassandra

Multiple compaction strategies are included with Cassandra:

  • SizeTieredCompactionStrategy (STCS)
  • LeveledCompactionStrategy (LCS)
  • DateTieredCompactionStrategy (DTCS)
  • TimeWindowCompactionStrategy (TWCS)

This article is mainly concerned with TWCS, so let's go through how TWCS works.

The Time Window Compaction Strategy is designed for time-series data. It compacts SSTables within a configured time window, using STCS to perform these compactions. At the end of each time window, all of its SSTables are compacted into a single SSTable, so there is one SSTable per time window. TWCS also purges data efficiently when configured with a time-to-live (TTL), by dropping complete SSTables after the TTL expires.

TWCS (TimeWindowCompactionStrategy)
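As a concrete illustration, here is a minimal sketch (DataStax Java driver 3.x; the keyspace, table, and column names are hypothetical, and the events_ks keyspace is assumed to exist) of a time-series table configured with daily TWCS windows and a default TTL, so that a whole window's SSTable can be dropped once its data expires:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TwcsTableExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // One compaction window per day; at the end of a window all of its
            // SSTables are merged into one. With a table-level TTL, an entire
            // SSTable can be dropped once every row in it has expired.
            session.execute(
                "CREATE TABLE IF NOT EXISTS events_ks.sensor_events (" +
                "  deviceid uuid, eventtime timestamp, value double," +
                "  PRIMARY KEY (deviceid, eventtime)" +
                ") WITH compaction = {" +
                "  'class': 'TimeWindowCompactionStrategy'," +
                "  'compaction_window_unit': 'DAYS'," +
                "  'compaction_window_size': '1' }" +
                " AND default_time_to_live = 604800"); // 7 days, in seconds
        }
    }
}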

Cassandra Architecture

Why did we choose Cassandra to begin with?

We had two tables, EventMaster and EventDetails, along with four secondary indexes. Each table held close to 40 GB of data, the incoming traffic was 16k writes and 5k reads, and we were storing time-series data.

Below are a few of the reasons why we chose Cassandra:

Linear scale performance: Users can increase performance by adding new nodes without any downtime or interruption to applications. There is no single point of failure.

Write performance: Cassandra is a great fit for very write-heavy systems and time-series data, and we had traffic of 16k writes.

Data consistency: We can tune the consistency level per query.

What were the problems with Cassandra?

  1. We used to delete 60k to 100k (1 lakh) rows at consistency QUORUM (CL.QUORUM). But in Cassandra, every delete or update is a new insert that marks the old data as deleted. Cassandra stores its data in immutable files on disk (SSTables), and it processes data in several stages on the write path, starting with the immediate logging of a write and ending with a write of data to disk:
  • Logging data in the commit log
  • Writing data to the memtable
  • Flushing data from the memtable
  • Storing data on disk in SSTables

In such a system, to record the fact that a delete happened, a special value called a “tombstone” needs to be written as an indicator that previous values are to be considered deleted.

A delete does nothing more than insert a tombstone. When Cassandra reads data, it merges all the fragments of the requested rows from the memtable and the SSTables, then applies a last-write-wins (LWW) algorithm to choose the correct data, regardless of whether it is a regular value or a tombstone.

Sometimes SSTables consist of 95% tombstones and still do not trigger any compaction, because of the compaction options and overlapping SSTables.
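To make this concrete, here is a minimal sketch (DataStax Java driver 3.x; the keyspace and the fruits table are hypothetical, with fruit as the partition key and day as the clustering key, and are assumed to exist). The SELECT returns no row because the tombstone's newer timestamp wins the LWW merge, even though the original cells may still sit on disk until compaction:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TombstoneDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("events_ks")) { // hypothetical keyspace
            session.execute("INSERT INTO fruits (fruit, day) VALUES ('apple', '20160617')");
            // The delete does not remove the row in place; it inserts a
            // tombstone whose write timestamp is newer than the INSERT above.
            session.execute("DELETE FROM fruits WHERE fruit = 'apple' AND day = '20160617'");
            // The LWW merge of memtable and SSTable fragments picks the
            // tombstone, so the row is reported as gone.
            Row row = session.execute(
                "SELECT * FROM fruits WHERE fruit = 'apple' AND day = '20160617'").one();
            System.out.println(row == null ? "row is deleted" : row.toString());
        }
    }
}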

Below is how the deleted row looks in the SSTable (sstabledump output):

[
  {
    "partition" : {
      "key" : [ "apple" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 19,
        "clustering" : [ "20160617" ],
        "deletion_info" : { "marked_deleted" : "2016-06-16T19:31:41.142454Z", "local_delete_time" : "2016-06-16T19:31:41Z" },
        "cells" : [ ]
      }
    ]
  }
]

Below are the options we can try to fix the above problem:

a) Run nodetool garbagecollect (CASSANDRA-7019); see the example after this list.

b) Single-SSTable compactions, or running compaction on all the nodes. This in turn has a side effect: it can start creating SSTables of huge size.
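For option (a), assuming Cassandra 3.10 or later (where nodetool garbagecollect is available), and with hypothetical keyspace and table names:

nodetool garbagecollect events_ks event_details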

2. Storing records with different TTLs, such as 3, 7, and 100 days, in the same table. Doing this will block the SSTable from being dropped completely when compaction runs. It is always better to design tables around TTL ranges such as 1–7 days, 7–30 days, etc., as sketched below.
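Here is a minimal sketch of such bucketing (hypothetical table names, reusing the connection setup from the TWCS example above): rows with similar TTLs land in the same table, so whole SSTables expire around the same time and can be dropped outright.

import com.datastax.driver.core.Session;

public class TtlBuckets {
    static void createTtlBucketedTables(Session session) {
        // 1-7 day data: everything in this table expires within 7 days,
        // so its SSTables can be dropped whole soon after they expire.
        session.execute(
            "CREATE TABLE IF NOT EXISTS events_ks.events_1_to_7d (" +
            "  deviceid uuid, eventtime timestamp, value double," +
            "  PRIMARY KEY (deviceid, eventtime)" +
            ") WITH default_time_to_live = 604800");  // 7 days
        // 7-30 day data: kept separate so long-lived rows cannot block
        // the short-lived SSTables from being dropped.
        session.execute(
            "CREATE TABLE IF NOT EXISTS events_ks.events_7_to_30d (" +
            "  deviceid uuid, eventtime timestamp, value double," +
            "  PRIMARY KEY (deviceid, eventtime)" +
            ") WITH default_time_to_live = 2592000"); // 30 days
    }
}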

3. We use the Time Window Compaction Strategy (TWCS), which runs compaction based on compaction_window_unit. But if you set the window to 6 hours, it creates 4 × 30 = 120 SSTables per month. When we read a specific row, the more SSTables we have to consult for row fragments, the slower the read becomes; it is therefore necessary to merge those fragments through compaction in order to maintain a low read latency. However, when TWCS is used, it is better to avoid manual compaction and instead set the right compaction_window_unit:
{'compaction_window_size': '1',
'compaction_window_unit': 'MINUTES',
'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'}

4. Running manual compaction. This created more problems than it solved. It started creating huge SSTables of 15-20 GB, which take part in compaction only once some other SSTable matches their size; so the big SSTables participate in compaction rarely, which indirectly increases tombstones. We can use sstablesplit to split a big SSTable to the required size so that it participates in compaction again.

sstablesplit [options] <filename> [<filename>]*
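For instance (run with the node stopped; the data path below is hypothetical, and the size is in MB):

sstablesplit --size 50 /var/lib/cassandra/data/events_ks/event_details-*/mc-*-big-Data.db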

5. Tuning G1 GC was also very challenging; there is a very high probability of running out of memory. Below are the parameters that need to be tuned if RAM is high (a worked example follows):

-XX:ParallelGCThreads=n

Sets the number of stop-the-world (STW) worker threads. Set n to the number of logical processors, up to a value of 8. If there are more than eight logical processors, set n to approximately 5/8 of the logical processors; this works in most cases, except for larger SPARC systems where n can be approximately 5/16 of the logical processors.

-XX:ConcGCThreads=n

Sets the number of parallel marking threads. Set n to approximately 1/4 of the number of parallel garbage collection threads (ParallelGCThreads).

-XX:InitiatingHeapOccupancyPercent=45
Sets the heap occupancy at which a concurrent GC cycle starts. If the value is low, GC becomes aggressive and runs frequently; if it is too high, the concurrent cycle may start too late and there is a chance of running out of memory. So we need to play around with it a bit to find the right percentage.
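For example, on a hypothetical node with 16 logical processors, the rules above work out roughly as follows (a jvm.options-style fragment):

# 5/8 of 16 logical processors
-XX:ParallelGCThreads=10
# roughly 1/4 of ParallelGCThreads
-XX:ConcGCThreads=3
# start concurrent GC cycles at 45% heap occupancy
-XX:InitiatingHeapOccupancyPercent=45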

6. Setting column values to null creates cell tombstones, which increases the storage size as well as the tombstone count.
An upsert operation can generate a tombstone as well. Why? Because Cassandra doesn't store null values: null means the absence of data, and Cassandra returns null when there is no value for a field. Therefore, when a field is set to null, Cassandra needs to delete the existing data.

Before the fix (sstabledump output):

f5f33642-e35f-47d2-914b-6388244221c7@0 Row[info=ts=1523434027944000 ttl=306839552, data=<tombstone> ts=1523434027944000 ldt=1523434027, eventcode=<tombstone> ts=1523434027944000 ldt=1523434027, eventcounter=211853284 ts=1523434027944000 ttl=306839552 ldt=1830273579, [eventtime=2018-05-02 01:07+0530 ts=1523434027944000 ttl=306839552 ldt=1830273579], filesize=<tombstone> ts=1523434027944000 ldt=1523434027, id=5e9d7ae7-7ef0-4d9a-bf50-49978410ef31 ts=1523434027944000 ttl=306839552 ldt=1830273579, parentid=215eb74e-6dd5-4a1d-81e8-a01c5fbc429b ts=1523434027944000 ttl=306839552 ldt=1830273579, profileid=<tombstone> ts=1523434027944000 ldt=1523434027, storagemode=0 ts=1523434027944000 ttl=306839552 ldt=1830273579, unit=<tombstone> ts=1523434027944000 ldt=1523434027, value=4 ts=1523434027944000 ttl=306839552 ldt=1830273579]

After the fix (sstabledump output):

f5f33642-e35f-47d2-914b-6388244221c7@0 Row[info=ts=1523433105693000 eventcounter=960770214 ts=1523433105693000 ttl=306839552 ldt=1830272657, [eventtime=2018-05-02 01:07+0530 ts=1523433105693000 ttl=306839552 ldt=1830272657], id=fe353497-f59d-42eb-9945-d55b8ac4d984 ts=1523433105693000 ttl=306839552 ldt=1830272657, parentid=215eb74e-6dd5-4a1d-81e8-a01c5fbc429b ts=1523433105693000 ttl=306839552 ldt=1830272657, storagemode=0 ts=1523433105693000 ttl=306839552 ldt=1830272657, value=4 ts=1523433105693000 ttl=306839552 ldt=1830272657]

Fix: set the saveNullFields option to false on the mapper:

mapper.setDefaultSaveOptions(Mapper.Option.saveNullFields(false));
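In context, here is a minimal sketch with the DataStax Java driver 3.x object mapper, where Event stands for a hypothetical @Table-annotated entity class:

import com.datastax.driver.core.Session;
import com.datastax.driver.mapping.Mapper;
import com.datastax.driver.mapping.MappingManager;

public class MapperSetup {
    // Event is a hypothetical @Table-annotated entity class.
    static Mapper<Event> buildMapper(Session session) {
        Mapper<Event> mapper = new MappingManager(session).mapper(Event.class);
        // Without this option, every null field of a saved entity is written
        // as a cell tombstone; with it, null fields are simply left out of
        // the generated INSERT.
        mapper.setDefaultSaveOptions(Mapper.Option.saveNullFields(false));
        return mapper;
    }
}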

7. Maintenance / monitoring: Monitoring a Cassandra cluster is a very tough task. The maintenance team should have good knowledge of Cassandra and the JVM, so that they can tune the JVM parameters. Practically, Cassandra is down if the replication factor, streaming between the nodes, and compactions are not properly controlled with tested values.

What next?

In the next part, I will explain how moving to DynamoDB solved our problems.
