The History of Big Data Management — Part 2

Yaekyum Lee · Published in Kaldea · Oct 24, 2022 · 5 min read

The data history museum (by DALL-E)

Data as a Service (DaaS) in 2016

Before Data as a Service (DaaS) was introduced in 2016, data engineers had to go through a manual process of connecting cloud storage services (Amazon S3, Google Cloud Storage, MinIO, etc.) with Spark and Hadoop. Although Spark and Hadoop reduced the work data engineers had to do, the manual connection process still took a lot of time and effort. To reduce these engineering dependencies even further, services appeared that packaged Spark or Hadoop together with cloud storage: Data as a Service. Examples of DaaS are Google's BigQuery, AWS Athena, and Snowflake.

Image by one of the authors, Jong Hyuk Park

AWS Athena

AWS Athena is based on Facebook's Presto, which was built to overcome the computing power limitation of Hadoop. Hadoop, although it had plenty of data storage space, could not scale its compute independently of that storage, meaning it was not suitable for computing-heavy operations. Presto solved this by introducing a Presto server layer that decouples compute from storage and allows computing units to be added on demand.
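To see what that server layer looks like from a client's point of view, here is a minimal sketch using the trino Python client (Trino is the community fork of Presto). The host, user, catalog, schema, and table name are all hypothetical; the point is that SQL goes to a coordinator, which fans the work out to on-demand workers rather than to the storage nodes themselves.

```python
# A minimal sketch, assuming a reachable Presto/Trino coordinator.
# Host, user, catalog, schema, and table name are all hypothetical.
from trino.dbapi import connect

conn = connect(
    host="presto.example.com",  # the coordinator: Presto's "server layer"
    port=8080,
    user="analyst",
    catalog="hive",             # storage is reached through a connector,
    schema="logs",              # not through the compute nodes themselves
)

cur = conn.cursor()
# The coordinator plans this query and schedules it across worker nodes,
# which can be scaled out independently of where the data lives.
cur.execute("SELECT count(*) FROM page_views")
print(cur.fetchone())
```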

Image by one of the authors, Jong Hyuk Park

The chart above shows cost versus performance for Presto. Presto, as well as AWS Athena (since it is Presto based), has a higher setup cost, but it scales out and eventually surpasses the performance-to-cost ratio of MySQL and other databases.

Another potential advantage of AWS Athena is that it operates on a Data Lake, which can hold unstructured data (like image files) and semi-structured data (like JSON) in addition to structured data. Data Lakes are a good solution for companies that want to store the data first and only later create schemas and analyze it.
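To make that store-first, schema-later pattern concrete, here is a minimal sketch of declaring a schema over JSON logs already sitting in S3, using Athena through boto3. The bucket, database, and column names are hypothetical; in practice this table definition usually lives in AWS Glue, which the next paragraph covers.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Schema-on-read: the JSON logs were written to S3 first; a schema is
# declared only now, as an external table over the existing files.
# The bucket, database, and column names are hypothetical.
# PARTITIONED BY assumes a Hive-style layout such as
# s3://my-log-bucket/app-logs/dt=2022-10-24/.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.app_logs (
  user_id STRING,
  event   STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-log-bucket/app-logs/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-log-bucket/athena-results/"},
)
```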

With all the benefits of AWS Athena, why do some people choose other products? The limitation of AWS Athena is data synchronization. Athena is a query engine that runs over S3, but it needs a metadata management (schema management) solution to store the structure and schema of the data held in S3. This store is commonly referred to as a Catalog. AWS met the need for a Catalog by creating AWS Glue, which “glues” Athena and S3 together by serving as the Catalog. Since these three are separate entities, however, someone must keep them synchronized: when something changes in S3, AWS Glue needs to be updated so that Athena sees the change, and vice versa. Because of this management requirement, setting up and running AWS Athena takes a lot of data engineering work.

Image by one of the authors, Jong Hyuk Park
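One concrete form this synchronization takes: when new partition folders land in S3, Athena cannot see them until the Glue Catalog is told about them. A common remedy is to run MSCK REPAIR TABLE after each load (or to schedule a Glue crawler). The sketch below reuses the hypothetical table from earlier.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# New dt=... folders under s3://my-log-bucket/app-logs/ are invisible to
# Athena until the Glue Catalog registers them. MSCK REPAIR TABLE scans S3
# and adds the missing partitions: the manual sync step described above.
resp = athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE analytics.app_logs",
    ResultConfiguration={"OutputLocation": "s3://my-log-bucket/athena-results/"},
)

# Poll until the sync finishes (simplified; no error handling).
while True:
    state = athena.get_query_execution(
        QueryExecutionId=resp["QueryExecutionId"]
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
```

Every team running Athena ends up owning some version of this loop, or delegating it to scheduled crawlers, which is exactly the engineering overhead described above.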

Google BigQuery and Snowflake

BigQuery and Snowflake are managed Data Warehouses, so that data management work isn't required; the tradeoff is that data must be structured, following a predefined schema, before it can be stored.

That Data Warehouses only allow structured data is a big limitation, because logs are generally JSON (semi-structured data). To expand the market, Google created a system called Dremel, which made it possible to store semi-structured data as tables. The mechanism that converts semi-structured data into tabular form is called a SerDe (serializer/deserializer). Google's BigQuery and Snowflake are two OLAP systems that incorporate the Dremel approach.
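Here is a minimal sketch of what that looks like from the BigQuery side, using the google-cloud-bigquery client: nested and repeated fields (Dremel's representation of semi-structured records) are flattened at query time with UNNEST. The project, dataset, table, and field names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A JSON-shaped event log stored as a table with a repeated RECORD column
# ("events"). Project, dataset, table, and field names are hypothetical.
sql = """
SELECT
  user_id,
  event.name AS event_name,
  event.ts   AS event_ts
FROM `my-project.analytics.app_events`,
  UNNEST(events) AS event  -- flatten the repeated record at query time
"""

for row in client.query(sql).result():
    print(row.user_id, row.event_name, row.event_ts)
```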

The fragmentation of the modern data stack

From this history of the evolution of data management tools, we can see how and why people moved from OLTP databases and Microsoft Excel to OLAP software that provides the same analytical benefits Excel did.

Revisiting the advantages of Microsoft Excel, Excel is a good tool for:

  1. Editing data for analysis (the role SQL editors play today)
  2. Cataloging data through Excel sheets
  3. Visualization

As data volumes increased, Microsoft Excel became less and less viable, and OLAP software started replacing it.

Out of these three categories, historically speaking, emerged the sprawl of tools for analysis, output, and reporting that fragments the current data tool landscape.

Below are some examples of products that took over each of Excel's advantages:

Editor for SQL

  • Apache Zeppelin, and more

Catalogs

  • Amundsen
  • DataHub
  • SelectStar, and more

Visualization

  • Looker
  • Redash
  • Tableau, and more

So as data volumes increased, so did the fragmentation of tools intended to analyze data, aid discovery, and communicate results. And as the number of tools increased, so did the risk of losing the context of data or analysis, along with the number of data silos among tools and the people who use them.

Conclusion: Solving for Lack of Context and Tool Fragmentation

If data storage, performance, and quality were the historical problems in data management that led to the innovative solutions making up our current data stack, then our current moment in data must address data silos, the lack of context around data, and tool fragmentation. Across tools, ad-hoc analyses, and even data teams, context is lost to tribal knowledge, to the inability to find and reuse old work or a colleague's work, and to the unfortunate need to flip between tools, dashboards, Slack, email, screenshots, and more when analyzing or communicating about data. This next moment in data must unify the entire analytics workflow, democratize tribal knowledge, create an easily discoverable archive of analyses, queries, docs, and reports, and make communicating results straightforward.

Kaldea is the unified analytics platform for all your analytics from discovery to reporting. Kaldea automatically indexes how you interact with data, so you’re empowered to democratize tribal knowledge and complete all your analysis faster with a single tool.

Check us out at Kaldea.com and try it for free.
