Make It Real Elite — Week 7: Databases — MySQL, Cassandra, CouchDB and MongoDB.

Welcome back, peeps :) Today we’ll go through some info in relation to databases. Let’s break down together some of the main types to keep in mind for a trick or two: SQL, and NoSQL.

Here’s something you’d like to know beforehand. If you’re up for using SQL, then would be good to recall that you can do so, if your data can be modeled through clear and concrete relations, and if your data integrity and read operations are a priority, and if you have a clear vision of your data schema for most of your application use cases. Then you are good to go :)

On the other hand, NoSQL database comes in handy primarily when the speed and the scalability are a high priority for your platform due to the fact it has to handle an enormous amount of data. It’s also useful when you don’t have a clear initial definition for your data schema. Further, if you find yourself having unrelated entities, or if you are aware that your data can evolve in the future, then in all these cases, NoSQL databases are the best choice.

Let’s take a deeper look into both scenarios, shall we?

  • SQL(Structured Query Language): SQL databases share a standard for running queries and handle database operations like insertion, fetching, updating and deletion. They follow a relation data model, where you define a previous data schema following the rules to normalize your data and to create the relations between your tables with your primary keys and indexes(in order to make your queries run faster). It will offer you the ability to fetch data from many tables at the same time using joins. Pretty fair deal, I would say. In most of the cases they scale vertically which means adding more hardware (physical resources) to the machine running the database. Of course it will be expensive, and although you can scale them horizontally with sharding or replication techniques, it won’t keep the ACID Compliant(Atomic-Consistent-Isolated-Durable). Thus, you will be in front of distributed systems problem, sadly. Parallel work and SQL databases aren’t good friends, they’re like cats and dogs.
  • NoSQL: On the other hand, NoSQL databases are more flexible in the beginning, where you don’t have to define your schema as a previous step, you can simply declare your data structure along the way. More options are available for choosing a data model using a NoSQL database, they can be document oriented, key-value oriented, column oriented, etc, and if you choose the first type I mentioned before you will store JSON-like field-value pair documents. NoSQL databases scale horizontally which at first place is cheaper but the you will face the risk of data loss or in the worst case disaster recovery. It’s also good to know that NoSQL databases run into a CAP Compliant( Consistency — Availability — Partition Tolerance), and also that since we are talking about a partitioned system, our constraints are focused on availability vs consistency.

Since I mentioned above the CAP, I though of expanding a little more on this concept, so you guys understand what’s the deal with this CAP Theorem.

Back in 2000 Eric Brewer proposed the famous CAP Theorem, and into this proposal he declared that we can only have two of the following guarantees when we are building a distributed system, and those apply for NoSQL databases:

  • Consistency: The state of the entire database(each node into the cluster) will change completely and not partially with each transaction.
  • Availability. The distributed system can give a response to each request confirming a transaction was finished successfully or not.
  • Partition tolerance. If one of the nodes into the cluster fails, dies or is impossible to be reached due to a network issue, the distributed system has the capacity to still operating without any constraint.

There are lot of resources on the internet describing the same I did above. However, this is not the main purpose of this post, but if you want to learn from scratch about distributed systems, I recommend you this amazing talk from Tim Berglund.

Also, this article can help you to understand the constraints when choosing between the consistency and the availability guarantees. And finally, for take a look into the difference between SQL and NoSQL database you can follow this article. Don’t forget you can leave a comment or question any time into the response section I always will be really happy to discuss, help or learn new things, so don’t be shame.

And finally, for taking a further look into the difference between SQL and NoSQL databases, you can go through this article. Don’t forget that you can leave a comment or question any time into the response section, and I will always be really happy to discuss, help or learn new things, so just go for it.

Let’s get back on track, and start analyzing 4 database engines based on the CAP Theorem. We will be considering their main features, properties, advantages and disadvantages. Are you still with me? Right! Let’s do this!

MySQL

MySQL Database Engine

Data Model: Relational DBMS — Table oriented

Implementation Language: C and C++

Protocol: MySQL protocol

Type of Computer Software: Open Source

Cap Compliant: CA — Consistent, Available System

Advantages: Atomic transaction support. JOIN support. Privilege and password security. Optimized for reading. Mature and confident solution. High availability and performance.

Disadvantages: Memory usage is increased, which leads to costs increase through scaling vertically.

Cassandra

Cassandra — High Scalable NoSQL Database

Data Model: NoSQL DBMS — Column oriented

Implementation Language: Java

Protocol: CQL3

Cap Compliant: AP — Available, Partition-Tolerant System

Advantages: Cassandra is decentralized system — There is no single point of failure. Map/reduce possible with Apache Hadoop. If one node fails, data can be retrieved from some of the other nodes (redundancy can be tuned).

Disadvantages: CQL3 is very similar to SQL, but with some limitations that come from the scalability (most notably: no JOINs, no aggregate functions). Different database design, first queries, then data model.

CouchDB

Data Model: NoSQL DBMS — Document oriented

Implementation Language: Erlang

Protocol: HTTP/REST

Cap Compliant: AP — Available, Partition-Tolerant System

Advantages: You can store any JSON data, documents can be in any index any number of times, views(embedded map/reduce), master-master replication, previous versions of documents are available. CouchDB makes use of the ubiquitous HTTP protocol.

Disadvantages: Arbitrary queries running out of a predefined view are really expensive.

MongoDB

Data Model: NoSQL DBMS — Document oriented

Implementation Language: C++

Protocol: Custom binary (BSON)

Cap Compliant: CP — Consistent, Partition-Tolerant System

Advantages: Schema less, ease of scale out, low cost, you can choose the level of consistency. Quick start and fast learning. Replica sets. Auto sharding. Community. JSON support.

Disadvantages: No joins. Map/Reduce performance. Fast evolving product.

Based on the above information, I think there isn’t an absolute solution. You’ll have to analyze many variables before going with one of these databases, but I hope you can have a better perspective about how databases work, what components you have to consider when taking a decision, and what kind of problems you can potentially face when you are on the road with one of them.

See you next time, stay tuned!