Considering NoSQL databases

Even though the adoption of NoSQL is at an early stage, it is common to hear the NoSQL hype among start-up developers, open-source communities, and universities. As a seasoned member of London meetup communities, I try to attend events dedicated to sharing thoughts and insights about the latest data storage solutions. Not to my surprise, I often talk to people who have little competence in choosing the right NoSQL solution for their immediate organizational or product growth needs. What strikes me occasionally is the lack of a methodical approach to choosing the right solution for the product, for the naive reason of not having an experienced technical team member.

In fact, the entire process of choosing a database is a science in itself. Relational storage systems bring plenty of benefits, such as ACID guarantees and comprehensive commercial support. At the same time, SQL solutions come with a few tangible drawbacks that make us research alternatives. Knowing those alternatives, with their pros and cons, provides the ground knowledge needed to make better decisions.

When a mid-sized team works on a product hoping to scale and gain traction in the future, the budget and operational limits are visible. The choice of data storage affects a number of important aspects, such as the skills you need in the team, the total cost of ownership, and the ease of maintenance and support. I often read use cases where a team abandons SQL solutions in favor of NoSQL for well-understood reasons.

Here I would like to highlight some of the most common problems development teams encounter while working with SQL databases:

  • Performance penalty of having ACID
  • Lack of native sharding or clustering tools
  • Absence of ad-hoc linear scalability
  • High cost of software, hardware and licences
  • Cost of commits

Needless to say, these points apply to SQL databases in general, so a developer does not have the freedom to pick an SQL database that has only some of the listed drawbacks. Another important point is the rigidity of the data schema in SQL databases. The most common problem teams encounter after running their product for a long time is the inability to change the data schema smoothly and migrate the accumulated data to the new tables. In most cases, it costs the development team considerable time and effort.

On the other hand, NoSQL databases offer relaxed rules on data schema. Developers don’t need to spend long hours with product managers and business analysts to come up with a perfect version of the data schema and avoid unexpected changes later on. Instead, data can be inserted in an unstructured form, without additional rules on consistency or duplication. Depending on the source of the data and the project goals, data is generally stored in a particular structure such as documents, key-value pairs, columns, or graphs. To choose which structure fits the product data best, the team should analyze the use cases and come up with query patterns that describe how the data is queried. While that process helps to shape the data model, it does not add any constraints on data attributes and thus leaves the data model flexible at run time.

Amazon SimpleDB
Riak
Redis

Key-Value Stores: Using a hash table, the store associates keys with values, where a value is a particular data item. Keys are generally unique across the entire data set, and the pairs are often kept in memory for the fastest access. The data model is the simplest one and is most efficient in use cases where the application has to work with large volumes of associative data. However, there are a few disadvantages, such as the inefficiency of updating only part of a value at application run time. Besides that, in-memory key-value stores become expensive to maintain as the data grows continuously.
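The partial-update drawback can be illustrated with a minimal sketch. It assumes a plain Python dict standing in for the store's hash table and values kept as opaque serialized blobs; the names `put`, `get`, and `update_field` are hypothetical and do not belong to Redis, Riak, or any other product.

```python
import json

# A plain dict stands in for the store's hash table (illustrative assumption).
store = {}

def put(key, value):
    """Store the value as an opaque serialized blob under a unique key."""
    store[key] = json.dumps(value)

def get(key):
    """Fetch and deserialize the whole value for a key."""
    return json.loads(store[key])

def update_field(key, field, new_value):
    """Partial update: the whole value is read, changed, and written back."""
    value = get(key)
    value[field] = new_value
    put(key, value)

put("user:42", {"name": "Alice", "city": "London"})
update_field("user:42", "city", "Leeds")
print(get("user:42"))
```

Changing a single field still costs a full read and a full write of the value, which is the inefficiency described above.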

HBase

Column Family Stores: Column family stores are designed to work with very large data sets exceeding terabytes of data. The data is stored in rows, where each row can span thousands or tens of thousands of columns. Each column includes a timestamp, a name, and the actual value. The data model is very well suited for time-series data sets and is therefore an attractive solution for customer behavior analysis, recommendation engines, and, recently, financial technology products in equities, bonds, or FX. Such flexibility in storing large volumes of historical data comes with additional features for tunable consistency, replication, and sharding. Column family stores require better knowledge of data structures and databases, and often present a steeper learning curve for newcomers and developers who have never worked with NoSQL databases before.
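To make the row and column layout more concrete, here is a small in-memory sketch of a wide row holding time-series data. It is only a model of the idea, not the storage format of HBase or Cassandra; the row key, the prices, and the helpers `write` and `read_row` are made up for illustration.

```python
from collections import defaultdict

# row key -> column name -> (timestamp, value); a hypothetical in-memory model
table = defaultdict(dict)

def write(row_key, column, value, timestamp):
    """Keep only the newest version of a column, judged by its timestamp."""
    current = table[row_key].get(column)
    if current is None or timestamp >= current[0]:
        table[row_key][column] = (timestamp, value)

def read_row(row_key):
    """Return the row's columns as (name, timestamp, value), sorted by name."""
    return sorted((name, ts, val) for name, (ts, val) in table[row_key].items())

# One row per instrument and day, one column per tick time (illustrative data).
write("EURUSD:2016-01-04", "09:00:00", 1.0831, timestamp=1)
write("EURUSD:2016-01-04", "09:00:01", 1.0833, timestamp=2)
write("EURUSD:2016-01-04", "09:00:00", 1.0832, timestamp=3)  # newer write wins
write("EURUSD:2016-01-04", "09:00:01", 1.0000, timestamp=0)  # stale write is ignored

for column in read_row("EURUSD:2016-01-04"):
    print(column)
```

A row simply gains a new column for each new observation, which is why this layout suits time-series data.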

CouchDB

Document Databases: Document databases store all the data in document format. A document is typically a JSON object consisting of multiple key-value pairs. Because JSON is a text format, it is easily editable but takes more disk space than column family data sets. Essentially, documents are the next level after key-value pairs, allowing developers to create data objects with nested documents. Document databases are a great fit for use cases where the product has to process large amounts of unstructured textual data, and where the data schema has to be flexible enough to allow data of different formats.
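As a rough sketch of what nested documents and a flexible schema look like, the snippet below keeps two differently shaped documents in one collection and queries them by a dotted path. The collection, its fields, and the `find` helper are hypothetical; real document databases such as CouchDB or MongoDB have their own query interfaces.

```python
_MISSING = object()  # sentinel for "this path does not exist in the document"

# Two documents with different shapes living in the same collection.
products = [
    {"_id": 1, "name": "T-shirt", "attributes": {"size": "M", "color": "blue"}},
    {"_id": 2, "name": "E-book", "attributes": {"format": "epub"}, "tags": ["fiction"]},
]

def lookup(doc, path):
    """Follow a dotted path such as 'attributes.color' through nested documents."""
    node = doc
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node

def find(collection, path, expected):
    """Return the documents whose value at `path` equals `expected`."""
    return [doc for doc in collection if lookup(doc, path) == expected]

print(find(products, "attributes.color", "blue"))
print(find(products, "attributes.format", "epub"))
```

Neither document had to declare its attributes in advance, and a query on a field that one document lacks simply skips that document.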

InfoGrid

Graph Databases: Graph databases provide a solution for storing highly connected data items. In the other types of NoSQL database, data items are not semantically connected. To avoid the complex JOIN operations so commonly used in SQL databases for data aggregation tasks, NoSQL stores data in a disconnected manner. In practice, to perform any cross-document aggregation, one has to use the aggregation frameworks embedded into database engines. Graph data stores preserve the connections between data items and thus act as a solution for social media applications. With a graph database, the team can build products similar to social networks like Facebook, connecting thousands of people, or LinkedIn, for finding ex-colleagues or professional contacts.
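The kind of question a graph store answers naturally, for example "who are the contacts of my contacts?", can be sketched with a simple adjacency structure. This is an illustration of the data model only; the names and the helpers `connect` and `friends_of_friends` are invented and do not reflect the API of InfoGrid or any other graph database.

```python
from collections import defaultdict

# person -> set of directly connected people (undirected graph)
graph = defaultdict(set)

def connect(a, b):
    """Record a mutual connection between two people."""
    graph[a].add(b)
    graph[b].add(a)

def friends_of_friends(person):
    """People exactly two hops away: contacts of contacts, minus direct contacts."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person}

connect("alice", "bob")
connect("bob", "carol")
connect("carol", "dave")

print(friends_of_friends("alice"))
```

Because the connections are stored with the data, the traversal follows edges directly instead of joining separate collections.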

When the team knows what sort of data the product has to work with, it becomes clearer which type of database best addresses the product needs. The next step is to look at operational and administrative aspects. I have tried to summarize them into a few decision criteria.

Elastic

Linear scalability: This aspect addresses how fast your data is going to grow. To answer this question, the team should know whether they will need to add more database instances in the future as the data grows rapidly, while preserving service availability, in other words, without long downtime. For instance, if the objective is to design a search engine on a rarely updated textual data set, then ElasticSearch could be a great solution. It provides a rich API for searching a large corpus of textual documents, but growing it at run time needs planning: nodes can be added to a running cluster, yet the number of primary shards of an index is fixed when the index is created. That means that if the data outgrows the original shard layout, the team has to re-index the data into a new index and wait for that to finish.

MongoDB supports textual documents and provides some search capabilities through its text index. It is not as powerful in search as ElasticSearch, but it scales easily without downtime and thus is a great solution for products with rapidly growing data sets.
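One way to see why adding an instance can be costly is to count how much data has to move. The sketch below assumes the simplest possible placement rule, a stable hash of the key modulo the number of shards; this is an illustrative assumption and not a description of how ElasticSearch or MongoDB place data internally. The key names are made up.

```python
import hashlib

def shard_for(key, shard_count):
    """Map a key to a shard with a stable hash and a modulo (naive placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)

keys = [f"user:{i}" for i in range(10_000)]
print(f"{moved_fraction(keys, 3, 4):.0%} of keys change shard when going from 3 to 4 shards")
```

With this naive rule, roughly three quarters of the keys are expected to land on a different shard after going from three to four shards, so growing the cluster means moving most of the data. Databases that scale smoothly use placement schemes that move far less data when an instance is added.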

Learning curve has always been a crucial factor for engineering managers. When it comes to selecting the next technology to add to the existing stack, the cost of getting familiar with it affects the speed at which team members develop the product. In small teams with limited budgets, managers often decide on databases that are comparatively easy to learn and to build production-ready products with.

As an example, the popularity of MongoDB is growing rapidly. Part of the reason for this trend is its simplicity of use. MongoDB shell commands and the deployment process are not complicated and can be learned by a high school student. Scaling a MongoDB cluster with multiple shards and replicas, or building sophisticated aggregation pipelines, does require a decent amount of time dedicated to learning the low-level details and practicing, though. In practice, MongoDB is used for a lot of e-commerce projects where consistency is not the first priority.

Unlike MongoDB, Cassandra is hard enough to learn that some developers give up and steer away from column family stores. Cassandra has a complex deployment process requiring knowledge of replication, consistency, and indexing. Its data model is not that obvious for people who are used to working only with SQL databases, and it demands some practice, especially when the use cases for Cassandra go beyond retail and e-commerce.

Commercial support is not as important at the initial stages; however, it is a requirement for most 24/7 Software-as-a-Service products where Service Level Agreement (SLA) conditions require the team to guarantee 99.9999% service availability.
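To get a feel for what such an availability figure means, the arithmetic can be worked out in a few lines. The sketch assumes a 365-day year and availability measured over the whole year; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a 365-day year

def allowed_downtime_seconds(availability_percent):
    """Downtime per year that still satisfies the given availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)

for target in (99.9, 99.99, 99.9999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target}% availability allows about {seconds:,.1f} seconds of downtime per year")
```

At 99.9999% the budget is about 31.5 seconds of downtime per year, compared with roughly 8.76 hours at 99.9%, which is why such guarantees are hard to meet without vendor support.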

Past deployment history acts as evidence that the solution has been stress-tested and adopted by others. I doubt there are many teams who want to deploy an absolutely new technology and then witness a great failure at the moment when they gain traction and finally start making real money. Gladly, there are plenty of blogs describing different use cases, with lessons learned and best practices, written by experienced developers at blue-chip companies like Facebook, Google, and Amazon.

Even with so much information available online, it is still a hassle to choose the right NoSQL solution for product needs and future goals. Knowing every bit of every database is impractical and requires far more hours spent reading and researching than actually building and testing working prototypes. The best recommendation is to find the right person who has worked with at least one or two NoSQL databases and turned them into a successful product. Consultants are an invaluable source of knowledge and wisdom when it comes to rapid product prototyping and technology planning.

Strat @ Goldman Sachs; systematic trading, equities, machine learning, big data, self-growth, gym, keto diet
