How do you pick the right database for Big Data Architecture?

Cengiz Kayay (CK)
6 min read · Jun 8, 2018


There are many different types of databases, and at least as much confusion around them. Here are some guidelines for picking the right tool for the right job.

Let’s start with the basics. You have an online application that needs solid transaction support (ACID compliance) and concurrency control. The application has balanced reads/writes/updates and the data size is up to a few terabytes. Use an RDBMS such as MySQL, Oracle, or Postgres:

  • RDBMS systems use a B-Tree structure as their storage engine. Data and indexes are organized as B-Trees, so reads and writes always take logarithmic time. For 1 million records, a binary search takes about 20 comparisons to locate the required data/pointer in the index; because each B-Tree node packs many keys per disk page, this comes down to roughly 3 comparisons/disk accesses (a back-of-envelope sketch follows after this list).
  • Inserts can be even more expensive, since the system has to find space for the new data and may need to make room at the required location. The same goes for indexes: on every insert/update/delete the indexes have to be updated, which is a disk-heavy operation. (Although there are RDBMS storage engines that buffer this heavy work in RAM, such as MySQL InnoDB.)
  • Each disk access takes around 5 ms depending on the disk type, so a single indexed lookup over 1 million records already costs roughly 15 ms. The tree only gets a level or two deeper at 100 million or 1 billion records, but complex joins multiply these random disk accesses, so query times easily climb into minutes. To prevent this, the “shard” concept is applied, where data is split into several shards based on a specific hash key (a simple hash-sharding sketch follows after this list). This can be done manually, or automatically by some MPP (Massively Parallel Processing) databases such as Teradata and Exadata, which are quite expensive. Yet these solutions require rigid schemas and have trouble handling big data the way NoSQL systems can.
  • As the data size grows, write and query performance will suffer and become the bottleneck of the application. This is a direct consequence of how these systems store data in B-Trees on disk while guaranteeing consistency and concurrency control.
  • We can relax the requirements for ACID transactions and strong consistency and go with NoSQL databases to improve insert/read performance.
  • MongoDB is the next logical move from an RDBMS if you accept denormalizing your tables and using documents, where all related nested structures live in one document to maintain consistency. (This is also more aligned with the classes in your application, e.g. Account -> Orders -> Order Items can be nested in one document; a short sketch follows after this list.) MongoDB uses B-Tree-like indexes, but the file mappings are kept in memory, which makes data access fast. Writes are also applied in memory first and flushed to disk periodically; reads pick the data up from memory and/or disk. All NoSQL systems use memory heavily compared to an RDBMS. MongoDB’s query capabilities are rich. So if reads and writes are balanced and you do not know all of your queries up front, MongoDB is a choice that can handle terabytes of data and will be more performant for online applications that require consistency and scalability.
  • Cassandra is meant for fast writes and queries that are known upfront. (Writes are about 3 times faster than MongoDB and similar to HBase.) Queries are less flexible and less performant, which makes it suitable for time-series data. Its storage engine uses a Log-Structured Merge-Tree, where writes again go to memory first and are then flushed to disk; more explanation is provided in a previous post on Cassandra internals. Since it is an append-only system, writes are much faster than in an RDBMS and some other NoSQL systems: it never needs to seek space to insert data, it only appends. On queries, top performance is achieved when the query hits a single node based on the partition key and does not scan several nodes (which is why secondary indexes should be used very sparingly). Cassandra stores values in a columnar format accessed by row key, so data should be sliced and diced into columns before it is saved (a small time-series schema sketch follows after this list).
  • HBase is very similar to Cassandra in concept and has similar performance characteristics. It can be used if there is already an investment in Hadoop. Its main use case is lookups.
  • OpenTSDB is used when more analytics is required on top of HBase/Cassandra. It is mostly suitable for IoT use cases where data arrives at rates of several thousand events per second and aggregated queries are required for dashboards.
  • Druid is an alternative to OpenTSDB that uses real-time roll-ups to calculate aggregates on the dimensions of IoT event data (a toy roll-up example follows after this list).
  • NewSQL databases such as VoltDB and MemSQL are in-memory databases where all data is kept in RAM. NewSQL was built to avoid the drawbacks of the RDBMS while providing the performance of NoSQL (the best of both worlds), so they support ACID transactions, fast writes and reads, and analytics with limited joins. MemSQL leans more toward analytics. VoltDB is also positioned as a CEP engine, since it supports Java stored procedures to run custom logic and can directly consume feeds from Kafka with connectors. In-memory databases are faster than an RDBMS even when the RDBMS has plenty of RAM, because of their fundamentally different design. There are also disk-based NewSQL-style approaches such as Kudu, which addresses HTAP needs (OLTP + OLAP workloads) as a storage engine, roughly combining the capabilities of HBase and Parquet; it does, however, require a query engine such as Impala, which struggles with highly concurrent workloads.
  • HTAP databases are the new generation of in-memory databases that can handle both transactional and analytical workloads. The polyglot-persistence approach, using a different NoSQL database for each data type and workload, required too much data movement from one structure to another to run operational, reporting, and analytical workloads against the same data, and no single technology could address all use cases. The two types of workloads have different processing needs: OLTP databases need rapid response times, handle many concurrent sessions, and target specific records for reads/updates, while OLAP workloads are the exact opposite, with long-running queries, few concurrent sessions, and large scans of whole tables. Traditional databases use locks to manage concurrency, preventing updates to data while it is being read by an analytical workload. With version-based concurrency models, modern databases can let both kinds of processing run without sacrificing consistency (a toy snapshot sketch follows after this list). NewSQL databases and data-grid solutions such as SAP HANA and Apache Ignite are examples of HTAP databases that offer SQL support for queries.
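
Below is a back-of-envelope sketch of the B-Tree and disk-seek numbers mentioned above. The fanout and seek time are assumed values for illustration, not measurements from any particular RDBMS.

```python
import math

rows = 1_000_000

# A plain binary search over 1M sorted keys needs about log2(N) comparisons.
binary_comparisons = math.ceil(math.log2(rows))      # ~20

# A B-Tree node packs many keys per disk page, so with a fanout of ~100
# the tree is only log_100(N) levels deep.
fanout = 100                                          # assumed keys per node
btree_levels = math.ceil(math.log(rows, fanout))      # ~3 disk accesses

seek_ms = 5                                           # assumed cost per disk access
print(f"{binary_comparisons} comparisons, {btree_levels} seeks, "
      f"~{btree_levels * seek_ms} ms per indexed lookup")
```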
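
And a minimal illustration of hash-based sharding, the idea behind splitting data across nodes; the shard names and key format are hypothetical.

```python
import hashlib

SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]   # hypothetical shard names

def shard_for(key: str) -> str:
    """Route a row to a shard by hashing its key, so each node stores and
    searches only its own slice of the data."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("account-42"))
```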
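
A minimal pymongo sketch of the denormalized Account -> Orders -> Order Items document mentioned above, assuming a local MongoDB instance; the database, collection, and field names are made up for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
db = client["shop"]                                 # hypothetical database name

# The whole aggregate lives in one document, so a single read returns it consistently.
db.accounts.insert_one({
    "_id": "account-42",
    "name": "Alice",
    "orders": [
        {"order_id": 1, "items": [{"sku": "A1", "qty": 2}, {"sku": "B5", "qty": 1}]},
    ],
})

# Nested fields can be indexed and queried directly.
db.accounts.create_index("orders.items.sku")
print(db.accounts.find_one({"orders.items.sku": "A1"})["name"])
```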
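
A small Cassandra time-series sketch using the Python driver, assuming a reachable local node; the keyspace, table, and query are illustrative. The partition key keeps each sensor's data on one node, and the write path is append-only.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()          # assumed local Cassandra node

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS iot
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS iot.readings (
        sensor_id text,          -- partition key: a query for one sensor hits one node
        ts        timestamp,     -- clustering key: rows stored sorted by time
        value     double,
        PRIMARY KEY ((sensor_id), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Append-only write path: no in-place updates, just new rows.
session.execute(
    "INSERT INTO iot.readings (sensor_id, ts, value) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)

# The query known upfront: latest readings for a single sensor/partition.
for row in session.execute(
        "SELECT ts, value FROM iot.readings WHERE sensor_id = %s LIMIT 10", ("sensor-1",)):
    print(row.ts, row.value)
```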
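
To make the roll-up idea behind OpenTSDB downsampling and Druid's ingest-time aggregation concrete, here is a toy Python version that buckets raw events into one-minute averages; the event shape is invented for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

# A handful of raw events; real ingestion is thousands of these per second.
events = [
    {"sensor": "s1", "ts": 1528452000, "value": 3.0},
    {"sensor": "s1", "ts": 1528452030, "value": 5.0},
    {"sensor": "s2", "ts": 1528452010, "value": 7.0},
]

# Roll raw points up into (sensor, minute) buckets so dashboards query
# small pre-aggregated rows instead of every raw event.
rollup = defaultdict(lambda: {"count": 0, "sum": 0.0})
for e in events:
    bucket = e["ts"] - e["ts"] % 60                 # align to the minute
    agg = rollup[(e["sensor"], bucket)]
    agg["count"] += 1
    agg["sum"] += e["value"]

for (sensor, bucket), agg in sorted(rollup.items()):
    print(sensor, datetime.fromtimestamp(bucket, timezone.utc), agg["sum"] / agg["count"])
```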
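
The version-based concurrency idea can be illustrated with a toy snapshot store in Python. This is purely conceptual; it is not how SAP HANA or Apache Ignite actually implement their concurrency control.

```python
# A toy multi-version store: writers append new versions, readers pin a
# snapshot, so a long analytical read never blocks updates (and vice versa).
class VersionedStore:
    def __init__(self):
        self.version = 0
        self.data = {0: {}}                      # version -> {key: value}

    def write(self, key, value):
        new = dict(self.data[self.version])      # copy-on-write of the latest snapshot
        new[key] = value
        self.version += 1
        self.data[self.version] = new

    def snapshot(self):
        return self.version, self.data[self.version]

store = VersionedStore()
store.write("balance:acct-42", 100)

snap_ver, snap = store.snapshot()                # long-running "analytical" read
store.write("balance:acct-42", 250)              # concurrent "transactional" update

print(snap["balance:acct-42"])                   # still 100: the snapshot is stable
print(store.snapshot()[1]["balance:acct-42"])    # 250: new readers see the latest version
```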

Conclusion: So far, we have discussed solutions for:

  • Transactional loads (light analytics): RDBMS, MongoDB, Cassandra, HBase
  • Transactional & analytical loads (based on pre-aggregation) on time-series data: Druid, OpenTSDB (good for event data that does not need to be updated too frequently), and Kudu (supports arbitrary primary keys with uniqueness constraints and efficient key lookups by range). Kudu itself is not suitable for fast event aggregations and can be combined with kudu-ts (https://github.com/danburkert/kudu-ts) for time-series data.
  • Transactional & analytical loads on operational data: HTAP and NewSQL

My rule of thumb for the serving layer in a Lambda Architecture:

Start with VoltDB or Apache Ignite and see if it meets your use cases. Then evaluate Kudu. If you have time-series data, see whether these databases can meet the requirements; if not, consider OpenTSDB vs. Druid. And if you also need free-text search with flexible query capabilities, consider Elasticsearch, where every field in a document can be indexed.
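
As a minimal sketch of that last point, the snippet below talks to Elasticsearch's REST API over plain HTTP, assuming a local node on port 9200; the index and documents are hypothetical.

```python
import requests

ES = "http://localhost:9200"   # assumed local Elasticsearch node

# Index a document: by default every field is analyzed and indexed,
# which is what enables flexible free-text queries later.
# refresh=true makes the document immediately searchable for this demo.
requests.put(f"{ES}/products/_doc/1?refresh=true", json={
    "name": "steel water bottle",
    "description": "insulated bottle, keeps drinks cold for 24 hours",
    "price": 19.99,
})

# Full-text search on any field using the query DSL.
resp = requests.post(f"{ES}/products/_search", json={
    "query": {"match": {"description": "insulated cold"}}
})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_score"])
```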

