From Data Warehouse to Data Lakehouse — Part 2

Seckin Dinc
6 min read · May 6, 2024


Photo by Leif Christoph Gottwald on Unsplash

Should I start my article with “Big Data, a marketing masterpiece!” or “Rise and fall of Big Data!”? I couldn’t decide which one to use, but I think you can already guess how I am going to articulate my thoughts around Big Data!

In the late 2000s and early 2010s, the dominant data format started to shift from structured to unstructured data. This was mainly driven by the increase in web and mobile channel usage, fuelled by the explosion of social media applications. It was not only the structure of the data that changed; the frequency and volume changed as well. To sum up that complexity, a new term was coined, the 3 Vs of Big Data: volume, velocity, and variety!

One day we only saw celebrities drinking a cup of coffee in front of a fancy view on TV; the next day, millions of people all over the world started posting how they drink their coffee. In a nutshell, this was the Big Data boom!

In this article, I will walk you through the journey of Big Data and how it helped companies kick off solutions to problems they didn’t have.

What is Big Data?

Big Data refers to extremely large and complex data sets that traditional data processing applications are inadequate to deal with. These data sets typically come from a variety of sources, such as social media, sensors, digital images, and videos. Big Data is characterized by three main attributes:

Volume: Big Data involves vast amounts of data that exceed the capacity of conventional database systems to store and process efficiently.
Velocity: Data is generated and collected at an unprecedented speed, requiring real-time or near-real-time processing to derive meaningful insights.
Variety: Big Data encompasses a wide variety of data types, including structured data (e.g., databases), semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text, images, videos), making it challenging to manage and analyze using traditional methods.

Under the spotlight of the newly arrived 3 Vs, the use of standard on-premise Data Warehouses became a major challenge due to their inflexible schema designs and their cost and scalability problems. As we needed to find a solution to these problems, (some weird and mutant) animals arrived to save the day!

Animals Started to Rule the Data Industry

We needed solutions to fix the challenges of the on-premise data warehouses. They were big, ugly machines living in cold server rooms that everyone was afraid to interact with. We needed more charming, fancy, and friendly solutions that everyone would love to work with. At least, this was the notion communicated at open-source conferences.

Apache Hadoop

Image courtesy https://hadoop.apache.org/

Yes, that is a yellow elephant! Despite the awkward image, Hadoop became hugely popular in the data industry. It enabled software engineers to claim that they were the new data engineers, as Hadoop was a Java-based platform. Let’s take a look at Hadoop from a more technical point of view.

Apache Hadoop is an open-source framework designed for distributed storage and processing of large volumes of data across clusters of commodity hardware. It provides a scalable and fault-tolerant platform for handling Big Data by leveraging a distributed file system, the Hadoop Distributed File System (HDFS), and a distributed processing framework (MapReduce).

MapReduce is a programming model and associated framework within Hadoop, responsible for processing and analyzing large datasets in parallel across multiple nodes in a cluster. It divides jobs into two main phases: the Map phase, where data is processed and transformed into intermediate key-value pairs, and the Reduce phase, where these intermediate results are aggregated and combined to produce the final output. In the classic word-count example, the Map phase emits a (word, 1) pair for every word it reads, and the Reduce phase sums those pairs for each word.

Apache Hive

Image courtesy https://hive.apache.org/

Yes, it is a mutant elephant bee or bee elephant. Apache Hive was originally developed by Facebook in 2007.

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop for querying and analyzing large datasets stored in Hadoop’s distributed file system (HDFS) or other compatible file systems. It provides a high-level interface and query language, called HiveQL (Hive Query Language), which is similar to SQL (Structured Query Language), allowing users to write SQL-like queries to process and analyze data.

Apache Hive became really popular in a short period of time because of two core capabilities; both are illustrated in the short HiveQL sketch after this list:

Schema-on-Read: Unlike traditional relational databases where data schema is enforced at write-time (schema-on-write), Hive follows a schema-on-read approach. This means that data stored in Hadoop can have varying schemas, and Hive applies the schema during query execution, making it more flexible for handling semi-structured and unstructured data.
Tables and Partitions: Hive organizes data into tables, which are logical representations of data stored in HDFS or other file systems. Tables can be partitioned based on one or more columns, allowing for more efficient data retrieval and query processing.
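To make those two capabilities concrete, here is a minimal HiveQL sketch. The table name, columns, and HDFS path are made up for illustration: the raw files already sitting in HDFS are left exactly as they are, the schema is only applied when the table is read, and the dt partition column maps to one subdirectory per day that Hive can skip or scan as needed.

-- Hypothetical external table over raw files already in HDFS.
-- Nothing is copied or converted; the schema is applied at read time.
CREATE EXTERNAL TABLE IF NOT EXISTS coffee_posts (
  user_id BIGINT,
  country STRING,
  message STRING
)
PARTITIONED BY (dt STRING)  -- one subdirectory per day, e.g. .../dt=2010-06-01/
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/coffee_posts';

-- Register a day of data dropped into HDFS by some upstream job.
ALTER TABLE coffee_posts ADD IF NOT EXISTS PARTITION (dt = '2010-06-01');

-- The filter on dt lets Hive read only the matching partition directory
-- instead of scanning the whole table (partition pruning).
SELECT country, COUNT(*) AS posts
FROM coffee_posts
WHERE dt = '2010-06-01'
GROUP BY country;

Because the table is external, dropping it only removes the definition, not the files; if the layout of the raw data turns out to be different than expected, you simply define a new schema over the same files. That is exactly the flexibility schema-on-read was selling.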

Apache Pig

Image courtesy https://pig.apache.org/

Yes, that is a pig in a human-like costume. Apache Pig was developed by researchers at Yahoo! Research around 2006.

Apache Pig is a high-level platform and scripting language built on top of Hadoop for analyzing large datasets. It provides a simple and expressive way to express data processing logic, making it easier for users to work with Big Data without needing to write complex MapReduce programs directly.

Personally, I always found Pig harder to use than Hive. Some companies put their money on Pig and others on Hive; only a really small portion tried to mix them into some hybrid solution. But what were their differences in the first place?

Apache Hive vs Apache Pig

For a very long time there was a big debate about which of the two solutions to use in our Hadoop ecosystems. For most people, Hive was the more popular choice and was implemented far more widely across the globe than Pig. These are the main reasons for that outcome:

Familiarity with SQL

One of the main reasons for Hive's popularity is its similarity to SQL, the standard language for querying and analyzing structured data in relational databases. Many data analysts and SQL developers find it easier to transition to Hive because they can use their existing SQL skills.
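As a quick illustration, using the same hypothetical coffee_posts table from the Hive section above, the query below is the kind of HiveQL an analyst with SQL experience can write on day one. Behind the scenes Hive compiles it into MapReduce jobs, roughly with the scan and filter in the Map phase and the per-country aggregation in the Reduce phase, but none of that machinery shows up in the query itself.

-- Plain, familiar SQL; Hive turns it into MapReduce jobs under the hood.
SELECT country,
       COUNT(*) AS posts,
       COUNT(DISTINCT user_id) AS active_users
FROM coffee_posts
WHERE dt BETWEEN '2010-06-01' AND '2010-06-30'
GROUP BY country
ORDER BY posts DESC
LIMIT 10;

A job that would otherwise mean hand-writing a Java MapReduce program collapses into a single readable statement.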

Ease of Use

Hive abstracts the complexities of distributed data processing by providing a familiar SQL-like interface. This makes it more accessible to a broader audience, including business analysts and data scientists who may not have a background in programming or data engineering. With Hive, users can perform ad-hoc querying, data exploration, and analytics tasks without needing to write complex MapReduce jobs or learn a new scripting language like Pig Latin.

Integration with Existing Tools

Hive integrates seamlessly with existing SQL-based tools and BI (Business Intelligence) platforms, making it easier to incorporate Hadoop-based data analytics into existing workflows and environments. Many popular BI tools and data visualization platforms support Hive as a data source, allowing users to create interactive dashboards and reports directly from Hive queries.

Optimized for Batch Processing

Hive is well-suited for batch-oriented processing and data warehousing scenarios, where large volumes of structured data need to be processed and analyzed in batches. It provides features such as query optimization, data indexing, and partitioning to improve query performance and efficiency.
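A typical batch pattern looks roughly like the sketch below, again with the hypothetical tables from earlier: a nightly job rewrites yesterday’s raw text partition into a compressed, columnar table that reports and dashboards then query. The individual statements are standard Hive features; the overall job layout is just an illustration.

-- Hypothetical nightly batch job: rewrite yesterday's raw text partition
-- into a columnar, compressed table that is cheaper to query.
CREATE TABLE IF NOT EXISTS coffee_posts_orc (
  user_id BIGINT,
  country STRING,
  message STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

INSERT OVERWRITE TABLE coffee_posts_orc PARTITION (dt = '2010-06-01')
SELECT user_id, country, message
FROM coffee_posts
WHERE dt = '2010-06-01';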

Conclusion

Early adopters were promised that the Hadoop ecosystem would be much easier to use, fault tolerant, and easy to scale up whenever needed. It promised Heaven on Earth. Of course, the reality was the opposite!

Facebook, Yahoo, and other giants needed to come up with solutions to their Big Data problems. But it was their problem. Almost 99.99% of companies didn’t have anywhere near enough data to have a Big Data problem.

Also, we didn’t have enough trained data people to handle the software engineering requirements, as the core of the ecosystem was built on Java. This opened a door for software engineers to update their LinkedIn profiles to Data Engineers. And it triggered the next big problem: Data Swamps, I mean Data Lakes!

Thanks a lot for reading 🙏

If you are interested in Analytics Engineering and Data Engineering, don’t forget to check out my profile and subscribe.
