Into the wilderness of Modern Databases
Answers to What, When and Why of DBMS
Outside the shell of a world running on Relational Databases. 20th century expanded availability of technology to billions of people. Accommodating such scale and aggregate, collect and organise data requires wide spread knowledge of multiple database management systems. This article aims to bring clarity on RDBMS and NoSQL databases and the First steps to plan a database architecture.
Relation Database (RDBMS)
Based on the relational model of data in RDBMS. The data is stored in one or more tables (or relations). With each row having a unique key to identify the record. Relational model ensures organised data by strictly requiring users to state what information the database contains and what information they want from it.
RDBMS requires you to be specific in what you store and what you ask
Relational databases stayed reliable without doubt for a full era owing to its ACID properties —
- Atomicity: There must be no state in a database where a transaction is left partially completed
- Consistency: Database should remain consistent at all cost and scale. Data should never be ambiguous (like Schrödinger’s cat).
- Isolation: On executing transactions simultaneously in parallel, all transactions will take place without affecting each other
- Durable: Database will hold all its latest updates even if the system fails or restarts. A committed transaction will never fail, if the system fails before the data could be written on to the disk, then that data will be updated once the system springs back into action.
Endnote: On the upside, RDBMS provides organized data, low bandwidth, low latency, high throughput, data integrity and transactional processing. And it is the top recommended go-to system for up to 100 million users in most cases. For scaling it provides master-slave approach (read replicas) , but on the downside RDBMS can only scale vertically which raises cost exponentially and there is always risk of hardware failure.
In master-slave approach, two DBMS systems are syncrosied together and maintain consistent data. Notably, Master instance supports both write-read operations while the read instances or read-replicas only support read operations.
Vertically and Horizontally scaling are two methods to make the system available to large users. They are also known as Scale-Up and Scale-Out.
Vertically scaling is achieved by upgrading the system specifications and resources. For example; upgrading a system processor from i5 to i7. Vertically scaling increases cost exponentially on each upgrade.
Horizontally scaling means to distribute a single system’s load to multiple systems. For example; increasing RAM of a 4GB system to 8GB by installing an additional 4GB RAM. Costing of horizontally scaling is linear.
NoSQL Databases
Facebook disrupted with its first 100 million users in 2008 , technology was being used largely by public and it was predicted to reach billions very soon.
Giving in to limitation of RDBMS which will ultimately require scaling up beyond supercomputers. Yahoo, Facebook and other tech giants started developing horizontally scalable, distributed, fault-tolerant, non-relational (schema-free) DBMS. Today known as NoSQL or Distributed DBMS or solution for big data problems.
Cassandra, MongoDB, CouchDB, HBase, Redis, Neo4j, DynamoDB are the popular NoSQL DBMS. And the classification of NoSQL databases even goes further down by their data model —
Types of NoSQL databases
- Key-value
- Extremely simple and fast
- Each key has to be unique
- Anything can be stored in the key
- Slower if querying entirely
- Best suited for volatile operations
- Applications: session variables, chat messages, live location tracking, caching a webpage or API result, real-time data analysis
- Example: Redis, Riak, Memcached, Azure Tablestore - Document oriented
- Gives the advantages of key-value storage
- Stored as documents like JSON, XML, SVG
- Each document is independent for its schema (schema-free)
- Instead of storing the data in different tables and use relations, all the related data can be stored within a single document
- Allows data querying and unlike key-value storage, Document oriented DB allows to return only those parts of the document that you require
- Applications: image blob and its meta-data, CMS, user generated data, catalog data, IoT data logs
- Example: MongoDB, CouchDB, DynamoDB, DocumentDB - Wide-column
- The names and format of the columns can vary from row to row
- Suitable for storing unorganised and somewhat related data
- Faster querying for content oriented data or on-the-go aggregation queries
- Ideal for managing the data that just won’t fit on one computer
- Allows Sparse Database or Sharding
- Applications:
- Example: Cassandra, Google Bigtable, Druid - Graph
- Ideal for node-relationship usage
- Relationship Metadata of a node binds it to other node
- Graph databases help to find relationships between data and extract their true value
- Best suited for complex relations between data and crazy queries (OLAP)
- Applications: finding friends-of-friends up to nth degree, real time recommendations, Analysing data from machine learning models (such as sentiment analysis)
- Example: Neo4j, orientDB - Time series database
- Can collect, store, and serve billions of the data points without any loss of precision
- Data is stored exactly as you give it
- Keep raw data forever
- Scales to millions of writes per second
- Tools like Grafana can be used to visualise data more efficiently
- Applications: Ideal solution for the monitoring system
- Example: OpenTSDB
It’s important to remember that all NoSQL databases are fundamentally based on the following concepts —
- Distributed datastore: A distributed datastore is a system where information is stored on more then one node.
- Fault tolerance: A distributed datastore stores information on multiple nodes. It eliminates risks due to single point of failure and hence, there are less chances of system downtime.
- Database Federation: Splitting database by function and purpose. Example: a separate partition of master data for managing IoT data logs.
- Database Sharding: Cutting down the data tables on the basis of their usability. Example: A sharded partition of master database with Indian Users Data can be used for India instead of using whole database. Sharding should be the secondary step if federation could not be performed.
And also, NoSQL databases are limited by CAP Theorem
CAP Theorem
CAP Theorem states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees
Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response without guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
Considering the CAP tuning options is the very first step in choosing appropriate Database System.
RDBMS provides Consistency and Availability but lacks partition tolerance which makes them vulnerable to faults and machine failures.
While with partition tolerance, on choosing strict consistency you have to give away availability.
Some tuning options in NoSQL Databases
Amazon DynamoDB
Amazon coined the term ‘Eventual Consistency’ for DynamoDB where it allows users to choose between data availability over data consistency.
- Eventually Consistent read fetches the data which might not reflect the result of recently completed write operation.
- Strongly Consistent read fetches the most up-to-date result but the result might not be available in case of network delay.
- Read more: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html
Cassandra
Cassandra provides tuning options for Availability and Partition tolerance by configuring replica placement strategy and consistency level.
- Replication factor is total number of copies of data to be stored in different nodes. Replication factor cannot exceed number of nodes in the cluster. It keeps the system fault tolerant.
- Consistency Level can be set to furthermore to decide the number and location of nodes to which a write must be written. And in case of read, consistency level is used to decide the number and location of nodes whose responses must be compared for fetching the most recent record.
- Read more: https://blog.imaginea.com/consistency-tuning-in-cassandra/
Apache HBase
Apache HBase provides tuning options between Consistency and Availability. Running on a master-slave architecture, by default HBase uses strongly consistent reads.
- Client can request the result directly from master instance where write has taken place and data is up-to-date
- If speed/availability is more important than consistent read, Client can request a faster result from a replica or slave or read-only instance
- Read more: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_hbase_read_replicas.html
Endnote
While using one database alone will have limitations as per CAP theorem. The desired results for a cloud infrastructure are delivered using multiple kinds of databases systems. For example, In an Instagram Clone —
- RDBMS: user credentials, user settings and OLTP features (Online Transactional Processing)
- NoSQL: post comments, post likes, media meta, OLAP (Online Analytical Processing) features such as Recommendation ML model
- In-memory DB for chat messages, session keys, API response caches, sharding specifications of other DBs
- Graph DB: followers and following, identifying relations between users, suggesting for follow request
Choosing an efficient and performant DBMS requires wide-spread as well as deep knowledge.
It is essential to also learn about underlying architecture of different DBMS, query languages and scaling patterns which I will cover in my next articles.Thank you for reading : )
I would love to hear from you, work together and solve complex problems. Reach me on LinkedIn at https://linkedin.com/in/aishwarydhare