Comparing Storage for Data Engineering

Ken Hu
Feb 23, 2017

With the rise of big data and data science, many engineering roles are being challenged or expanded. One of them is data engineering — designing and developing pipelines to collect, manipulate, store, and analyze data. Originally, the domain primarily consisted of loading external data sources and designing databases. It has since grown to support the volume and complexity of big data. Data engineering now encapsulates a wide range of skills, from web crawling, data cleansing, and distributed computing to data storage and retrieval. Today, we will focus on data storage and retrieval — a component of the pipeline critical to how the data can be used and analyzed.

The SQL databases with which most engineers are familiar (including PostgreSQL, MSSQL, and MySQL) structure their data in relational tables with row-oriented storage. Given how ubiquitous and well-examined these databases already are, we will not discuss them in this post. Instead, we will explore three types of alternative data stores — search engines, document stores, and columnar stores — within the context of data engineering. Their rise in popularity and their differences from traditional SQL databases make them valuable to have in our toolbox.

  • Search engines excel at text queries. Compared to text matching in SQL databases (e.g. LIKE), search engines offer richer query capabilities and better performance out of the box.
  • Document stores provide better data schema adaptability than traditional databases. By storing their data as individual document objects, often represented as JSON, they do not require predefining the schema.
  • Columnar stores specialize in single column queries and value aggregations. SQL operations such as SUM and AVG are considerably faster in columnar stores, as data of the same column are stored closer together on the hard drive.

In this article, we will explore Elasticsearch, MongoDB, and Amazon Redshift as examples of their respective types. By understanding these alternative data stores, we can better choose the most suitable one for each unique situation.

To compare these technologies, we will examine how they index, shard, and aggregate data. Each indexing strategy improves certain queries while hindering others; knowing which queries are made most often can influence which data store to adopt. Sharding, the method by which a database divides its data into chunks, determines how the infrastructure will grow as more data is ingested. Choosing one that matches our growth plan and budget is critical. Finally, these technologies aggregate their data very differently. When we are dealing with gigabytes or terabytes of data, an ill-suited aggregation strategy can limit the types and performance of the reports we generate. As data engineers, we must consider all three aspects when evaluating data stores.

Search Engine: Elasticsearch

Elasticsearch quickly gained popularity among its peers for its scalability and ease of integration. Built on top of Apache Lucene, it offers powerful text search and indexing functionality out of the box. Aside from traditional search engine tasks (text search and exact value queries), Elasticsearch also offers layered aggregation capabilities.

In many ways, search engines are data stores specialized in indexing text. While other data stores create indices based on the exact values of a field, search engines allow retrieval via a fragment of the (often textual) field. This is done automatically for every field, by default, through analyzers. An analyzer is a module that creates multiple index keys by evaluating the field value and breaking it down into smaller values. For example, a basic analyzer might split “the quick brown fox jumped over the lazy dog” into words, e.g. “the”, “quick”, “brown”, “fox”, etc. This enables users to find the data by searching for the fragments, with the results ranked by how many fragments match the same document. A more sophisticated analyzer could utilize edit distances and n-grams, and filter out stopwords, in order to build a comprehensive index for retrieval.
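
To make this concrete, here is a minimal sketch using the official elasticsearch-py client to run a sentence through the built-in standard analyzer and inspect the terms it would index. The local node URL is an assumption.

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch node; the URL is illustrative.
es = Elasticsearch(["http://localhost:9200"])

# Ask the standard analyzer how it would break the sentence into terms.
result = es.indices.analyze(
    body={
        "analyzer": "standard",
        "text": "the quick brown fox jumped over the lazy dog",
    }
)
print([token["token"] for token in result["tokens"]])
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```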

Elasticsearch was built on top of Lucene to scale horizontally and to be production ready. It does so by creating multiple Lucene instances (shards) and distributing them across multiple nodes (servers) within a cluster. By default, each document is routed to its shard by hashing its _id field. During retrieval, the coordinating node sends each shard a copy of the query before aggregating and ranking the results for output.
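
The shard count is fixed at index creation time. A hedged sketch, assuming a 7.x-or-later Python client and a hypothetical tweets index split into five shards:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Create an index backed by five shards, each with one replica.
es.indices.create(
    index="tweets",
    body={"settings": {"number_of_shards": 5, "number_of_replicas": 1}},
)

# With no explicit routing value, this document lands on the shard
# selected by hashing its _id.
es.index(index="tweets", id="1", body={"user": "ken", "text": "hello world"})
```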

Documents within Elasticsearch can be bucketed by exact values, ranges, or even temporal and geolocation values. These buckets can be grouped into finer granularity through nested aggregations. Metrics, including means and standard deviations, can be calculated for each layer. This provides the ability to compute a hierarchy of analyses within a single query.
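
A hedged sketch of such a layered query, again with elasticsearch-py: bucket documents by a hypothetical country field, then compute follower statistics within each bucket.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

response = es.search(
    index="tweets",
    body={
        "size": 0,  # skip the raw hits; we only want the aggregations
        "aggs": {
            "by_country": {
                "terms": {"field": "country"},  # outer bucket per country
                "aggs": {
                    # nested metric computed inside each bucket
                    "follower_stats": {"stats": {"field": "followers"}}
                },
            }
        },
    },
)
for bucket in response["aggregations"]["by_country"]["buckets"]:
    print(bucket["key"], bucket["follower_stats"]["avg"])
```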

Being a document-based store, Elasticsearch suffers from a limitation on intra-document field comparisons. For example, while it is great at filtering on whether a field followers is greater than 10, it cannot directly check whether followers is greater than another field, following. Scripts can be injected as custom predicates to work around this. The feature is great for one-off analyses, yet suffers performance penalties in production.
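
For reference, that script workaround looks roughly like the following, using a Painless script query; the followers and following fields are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# A script predicate comparing two fields of the same document;
# flexible, but far slower than a native range filter.
query = {
    "query": {
        "bool": {
            "filter": {
                "script": {
                    "script": {
                        "lang": "painless",
                        "source": "doc['followers'].value > doc['following'].value",
                    }
                }
            }
        }
    }
}
response = es.search(index="tweets", body=query)
```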

Document Store: MongoDB

At this point, MongoDB can be considered the go-to NoSQL database. Its ease of use and flexibility quickly earned it popularity. It supports a rich and adaptable query language for digging into complex documents. Frequently queried fields can be sped up further through indexing. For aggregating large chunks of data, MongoDB offers a multi-stage pipeline, which we will explore shortly.

As a generic data store, MongoDB gives us a lot of flexibility when indexing data. Unlike Elasticsearch, it only indexes the _id field by default, and we need to manually create indices for commonly queried fields. As a result, the cost of storing and updating each index is very visible. Compared to Elasticsearch, MongoDB's text analyzer also isn't as powerful. MongoDB does provide a wide range of indexing methods, from compound and geospatial indices for optimal querying, to TTL and sparse indices for reduced storage.
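
Here is a minimal sketch of these index types with pymongo; the database, collection, and field names are all made up for illustration.

```python
from pymongo import MongoClient, ASCENDING, DESCENDING, TEXT

client = MongoClient("mongodb://localhost:27017")
tweets = client.analytics.tweets

# Compound index to support a common query-plus-sort pattern.
tweets.create_index([("user_id", ASCENDING), ("created_at", DESCENDING)])

# Text index for MongoDB's (simpler) full-text search.
tweets.create_index([("body", TEXT)])

# TTL index: documents expire an hour after their created_at time.
client.analytics.sessions.create_index("created_at", expireAfterSeconds=3600)

# Sparse index: only documents that actually contain the field are indexed.
tweets.create_index("retweeted_from", sparse=True)
```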

Within a MongoDB cluster, there are three different types of servers: router, config, and shard. While scaling the router servers allows the cluster to accept more requests, the shard servers are where the heavy lifting is done. Similar to Elasticsearch, MongoDB documents are routed (by default) via _id to their respective shards. At query time, the config servers tell the router which shards to query; the router then distributes the query and aggregates the results.
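
Sharding a collection is typically set up through the router with a few admin commands. A hedged sketch with pymongo, using hypothetical database and collection names:

```python
from pymongo import MongoClient

# Connect to the router (mongos), not directly to a shard.
client = MongoClient("mongodb://localhost:27017")

# Enable sharding for the database, then shard the collection on a
# hashed _id key (mirroring the default routing described above).
client.admin.command("enableSharding", "analytics")
client.admin.command("shardCollection", "analytics.tweets", key={"_id": "hashed"})
```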

The Aggregation Pipeline is powerful and very fast. As the name implies, it operates on the matched data in a stepwise fashion. Each stage can filter, aggregate, or transform documents, introduce new metrics, or unwind previously grouped documents. Because these operations are done stage by stage, memory cost can be minimized by ensuring documents are filtered and reduced to only the necessary fields early on. Compared to Elasticsearch, and even Redshift, the Aggregation Pipeline is an extremely flexible way to view the data.
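
A minimal sketch of such a pipeline with pymongo, assuming a hypothetical tweets collection: filter early, project away unneeded fields, then group.

```python
from pymongo import MongoClient

tweets = MongoClient("mongodb://localhost:27017").analytics.tweets

pipeline = [
    # Filter first, so later stages handle fewer documents.
    {"$match": {"country": "US"}},
    # Keep only the fields the remaining stages need.
    {"$project": {"user_id": 1, "followers": 1}},
    # Aggregate a metric per user.
    {"$group": {"_id": "$user_id", "avg_followers": {"$avg": "$followers"}}},
    # Order the grouped results.
    {"$sort": {"avg_followers": -1}},
]
for doc in tweets.aggregate(pipeline):
    print(doc)
```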

Despite its adaptability, MongoDB suffers the same lack of intra-document field comparison as Elasticsearch. Furthermore, some operations, including $group, require the results to be passed to a single node, and thus do not leverage distributed computing. Those unfamiliar with the pipeline's stagewise calculation will also find certain tasks unintuitive. For example, summing up the number of elements in an array field requires two stages, $unwind and $group, as shown below.
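
That two-stage idiom looks roughly like this, counting the total number of elements across a hypothetical tags array field:

```python
from pymongo import MongoClient

tweets = MongoClient("mongodb://localhost:27017").analytics.tweets

pipeline = [
    {"$unwind": "$tags"},  # emit one document per array element
    {"$group": {"_id": None, "total_tags": {"$sum": 1}}},  # then count them
]
print(list(tweets.aggregate(pipeline)))
```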

Columnar Store: Amazon Redshift

Alongside the growth in NoSQL popularity, columnar databases have also gathered attention, especially for data analytics. By storing data by column, instead of by row as per tradition, aggregation operations can be executed by reading only the relevant part of the disk, greatly increasing performance. A few years ago, Amazon rolled out Redshift, its hosted columnar store service.

Unlike Elasticsearch and MongoDB, or even traditional databases such as PostgreSQL, Amazon Redshift does not support indexing. Instead, it reduces query time by maintaining a consistent sort order on disk. As users, we can configure an ordered set of columns to be a table's sort key. With the data sorted on disk, Redshift can skip an entire block during retrieval if its values fall outside the queried range, heavily boosting performance.
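
Since Redshift speaks the PostgreSQL wire protocol, we can define a sort key from Python via psycopg2. A hedged sketch, with a made-up cluster endpoint and table:

```python
import psycopg2

# Connection details are placeholders for a real Redshift cluster.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    # Rows are kept sorted by viewed_at, so range queries on time
    # can skip whole blocks on disk.
    cur.execute("""
        CREATE TABLE page_views (
            viewed_at  TIMESTAMP,
            url        VARCHAR(2048),
            user_id    BIGINT
        )
        SORTKEY (viewed_at);
    """)
conn.commit()
```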

An Amazon Redshift cluster consists of one leader node and a number of compute nodes. The leader node handles the compilation and distribution of queries, as well as the aggregation of intermediate results. Unlike MongoDB's router servers, the leader node is fixed and cannot be scaled horizontally. While this creates a bottleneck, it also allows efficient caching of compiled execution plans for popular queries.

The benefits of Amazon Redshift cannot be overstated. Frustratingly slow aggregations we hit in MongoDB while analyzing mobile traffic were quickly solved by moving to Amazon Redshift. Because it supports SQL, traditional database engineers will have an easy time migrating their queries onto Redshift. Onboarding time aside, SQL is a proven, scalable, and powerful query language that supports intra-row field comparisons with ease. Amazon Redshift further improves performance by compiling and caching popular queries for execution on the compute nodes.
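
For contrast with the document stores above, here is the kind of query that is awkward in Elasticsearch or MongoDB but trivial in SQL: an aggregation with an intra-row field comparison. The profiles table and its columns are assumptions.

```python
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT country,
               COUNT(*)       AS users,
               AVG(followers) AS avg_followers
        FROM profiles
        WHERE followers > following  -- intra-row field comparison
        GROUP BY country;
    """)
    for country, users, avg_followers in cur.fetchall():
        print(country, users, avg_followers)
```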

Being a relational database, Amazon Redshift does not have the schema flexibility that MongoDB and Elasticsearch offer. Optimized for read operations, it suffers a performance hit during updates and deletes. To maintain the best read time, the rows must be kept sorted, adding extra operational effort. Tailored to those with petabyte-sized problems, it is not cheap and likely not worth the investment unless there are scaling problems with other databases.


Conclusion

In this article, we examined three different technologies — Elasticsearch, MongoDB, and Amazon Redshift — within the context of data engineering. There is no clear winner, as each of these technologies is a front runner in its storage type: search engines, document stores, and columnar stores, respectively. Depending on the use case, one option will be superior to the others.

  • MongoDB is a fantastic starter database. It provides the flexibility we want when the data schema is still to be determined. That said, MongoDB cannot outperform databases specialized for the use cases they target.
  • While Elasticsearch offers a similarly fluid schema to MongoDB, it is optimized for multiple indices and text queries at the expense of write performance and storage size. Thus, we should consider migrating to Elasticsearch when we find ourselves maintaining numerous indices in MongoDB.
  • Redshift requires a predefined data schema, lacking the adaptability that MongoDB provides. In return, it outclasses other databases for queries involving only a single column (or a few). When the budget permits, Amazon Redshift is a great secret weapon when the others cannot handle the data size.

Despite their popularity, we are not restricted to these specific implementations. CouchDB and OrientDB are good alternatives for document stores. Solr is another open-source search engine, while AWS CloudSearch and Microsoft Azure Search are hosted services we can consider. Other columnar stores include Druid, HP Vertica, and Google BigQuery.

I found it rewarding to try out these diverse technologies. Based on the needs of individual products and data, we can decide which data store is best for our situation. What are your recent favorites in data analysis?


A version of this article was originally published on the TopTal Engineering Blog.
