NoSQL Architecture: Part III
Categories of NoSQL in detail
There are different categories of NoSQL databases, based on how the data is stored. NoSQL products are optimized for insertion and retrieval operations, because these usually happen at large scale. To keep performance up at that scale, most NoSQL products, as far as I know, scale horizontally.
There are four major storage types in the NoSQL paradigm:
1. Column-oriented
2. Document Store
3. Key Value Store
4. Graph
Column-oriented: Data is stored as columns, whereas a typical RDBMS stores data as rows. You may argue that a relational database also displays data in a two-dimensional table with rows and columns. The difference is in how the data is laid out and processed: when you query a row-oriented RDBMS it processes one row at a time, whereas a column-oriented database keeps the values of each column stored together. An example makes this clearer:
Imagine the following data needs to be stored. Let's compare an RDBMS and a column-oriented NoSQL store:
StudentID  StudentName     Mark1  Mark2  Mark3
12001      Bruce Wayne     55     65     75
12002      Peter Parker    66     77     88
12003      Charles Xavier  44     33     22
Data in an RDBMS will be stored in the following way:
12001,Bruce Wayne,55,65,75
12002,Peter Parker,66,77,88
12003,Charles Xavier,44,33,22
Data in a column-oriented NoSQL store will be stored in the following way:
12001,12002,12003
Bruce Wayne, Peter Parker, Charles Xavier
55,66,44
65,77,33
75,88,22
Note: This is a deliberately trivial example, just to demonstrate a point. One should not take it at face value and argue that insertion will be much more difficult in NoSQL. Both RDBMS and NoSQL systems are far more sophisticated than this and are optimized to handle the data they process. We are just looking at things at a high level.
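The two layouts above can be sketched in a few lines of Python. This is only an illustration using in-memory lists; the names rows, field_names and columns are made up for this example, and real column stores use far more elaborate storage formats.

```python
# Row-oriented layout: one tuple per student, as in the RDBMS example.
rows = [
    (12001, "Bruce Wayne", 55, 65, 75),
    (12002, "Peter Parker", 66, 77, 88),
    (12003, "Charles Xavier", 44, 33, 22),
]

field_names = ["StudentID", "StudentName", "Mark1", "Mark2", "Mark3"]

# Column-oriented layout: one list per field, built by transposing the rows.
columns = {name: list(values) for name, values in zip(field_names, zip(*rows))}

for name in field_names:
    print(name, columns[name])
```

Each entry of columns now holds all the values of one field together, which is the property the column-oriented category relies on.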
The advantage of the column-based approach is that it is computationally faster than a row-based RDBMS for this kind of query. Imagine you would like to find the average, maximum or minimum for a given subject: you don't have to go through each and every row. Instead, you just look at the respective column to determine the value. Likewise, when you query the database, it does not have to scan each full row for matching conditions; only the columns used in the condition or needed in the result are touched, and voila, faster processing. Read this with a billion records in mind, where a query has to consider all of them just to retrieve a few hundred.
Examples are HBase, Cassandra, Google Bigtable, etc. Oracle also introduced this feature quite recently.
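As a rough sketch of the idea, assuming the student data is held as one Python list per column, an aggregate or a filter only needs to read the columns it mentions. The helper names column_stats and select_where are hypothetical and exist only for this illustration.

```python
from statistics import mean

# Hypothetical column-oriented layout of the student data.
columns = {
    "StudentID": [12001, 12002, 12003],
    "StudentName": ["Bruce Wayne", "Peter Parker", "Charles Xavier"],
    "Mark1": [55, 66, 44],
    "Mark2": [65, 77, 33],
    "Mark3": [75, 88, 22],
}


def column_stats(columns, name):
    """Aggregate one column without reading any other column."""
    values = columns[name]
    return {"avg": mean(values), "max": max(values), "min": min(values)}


def select_where(columns, condition_column, predicate, output_column):
    """Filter on one column, then fetch only the matching positions of another."""
    matches = [i for i, v in enumerate(columns[condition_column]) if predicate(v)]
    return [columns[output_column][i] for i in matches]


mark1_stats = column_stats(columns, "Mark1")
high_scorers = select_where(columns, "Mark2", lambda m: m > 60, "StudentName")
print(mark1_stats)
print(high_scorers)
```

Here the statistics for Mark1 never touch the names or the other marks, and the filter on Mark2 reads only two of the five columns.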
Document Store: In the previous category, we looked at structured data, students' records to be precise. Now we look at how to store semi-structured data. When we use the Facebook API to extract posts from a given group, we get them in JSON format, like this:
{ "technology": "big data", "message": "Way of Internet of Things to Internet of Everything", "permalink": "http://www.facebook.com/comsoc/posts/10151526753216486", "actor_id": 130571316485 }
{ "technology": "big data", "message": "Challenges of Big Data Analysis", "permalink": "http://www.facebook.com/comsoc/posts/10151494314921486", "actor_id": 130571316485 }
{ "technology": "big data", "message": "Big Data'nin hayatimiza etkisi", "permalink": "http://www.facebook.com/comsoc/posts/10151490942041486", "actor_id": 130571316485 }
{ "technology": "big data", "message": "Etkinligimiz hazir!! Peki Ya Siz??", "permalink": "http://www.facebook.com/comsoc/posts/10151325074526486", "actor_id": 130571316485 }
{ "technology": "big data", "message": "30 Nisan'da 'girisimci melekler' Istanbul'u kusatiyor. Siz nerdesiniz?", "permalink": "http://www.facebook.com/comsoc/posts/10151318889096486", "actor_id": 130571316485 }
Or even imagine something like this:
{ "StudentID": "12001", "StudentName": "Bruce Wayne", "Location": "Gotham city" }
{ "StudentID": "12002", "StudentName": "James Tiberius Kirk", "Address":
  { "Street": "2nd street", "City": "Edge city", "State": "New York" } }
Imagine a collection where the records/documents do not follow a uniform, constant schema. A traditional RDBMS with a fixed table schema is a poor fit for this. We need something better, where such semi-structured data can be indexed and queried: the document store category of NoSQL. (About the indexing part: the document ID could be the URL from which the data was crawled, or the timestamp of when it was crawled. It is even okay if these records come without a document ID.)
This category of database, with room for changing schemas or schemaless documents, provides flexibility, hence its popularity. An ideal use case is any web-based application where the content has a varying schema, or cases where the data is already available in JSON or XML format.
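To make this concrete, here is a minimal sketch of a document store as a plain Python dictionary, loaded with the two student documents from the example above. It assumes StudentID is used as the document ID, and the helpers get_path and find are invented for this illustration; they are not the API of any real product.

```python
import json

# Two documents with different shapes, taken from the student example.
raw_documents = [
    '{"StudentID": "12001", "StudentName": "Bruce Wayne", "Location": "Gotham city"}',
    '{"StudentID": "12002", "StudentName": "James Tiberius Kirk", '
    '"Address": {"Street": "2nd street", "City": "Edge city", "State": "New York"}}',
]

store = {}  # document ID -> parsed document

for raw in raw_documents:
    doc = json.loads(raw)
    store[doc["StudentID"]] = doc  # assumption: StudentID serves as the document ID


def get_path(doc, path, default=None):
    """Follow a dotted path such as 'Address.City'; return default if a part is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def find(store, path, value):
    """Return the IDs of all documents whose field at `path` equals `value`."""
    return [doc_id for doc_id, doc in store.items() if get_path(doc, path) == value]


print(find(store, "Address.City", "Edge city"))
print(find(store, "Location", "Gotham city"))
```

Neither document has to declare its fields up front, and a query on a field that one document lacks simply skips that document instead of failing.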
Examples are MongoDB, CouchDB, Lotus Notes, etc. Redis (in-memory) is sometimes listed here as well, although it is primarily a key-value store.
More on the remaining categories in future posts.
I wrote this article on my original WordPress account and am now re-posting it on Medium. Visit http://datasciencehacks.wordpress.com/