What is a Data Lake?

Ramiz Raza

Let’s start with a brief background on traditional databases to set the scene, and then jump to the Data Lake, as I have seen many people confuse data warehouses with data lakes. Don’t worry, it will be clear in a while (especially with the pictures I have put below :). Please note, I will leave NoSQL databases out to keep this brief.

Traditionally, organizations have used OLTP SQL databases for quick day-to-day transactions, where they access a small set of records and update them, e.g., online payment transactions. Gradually, with the growing amount of data, organizations felt the need to run large analytical queries (for example, to spot business trends). This is when the OLAP data warehouse came into existence, as OLTP databases were not suitable for such large analytical queries. An OLAP data warehouse can be thought of as a giant database holding current as well as historical data, used for running large analytical queries. OLAP databases are columnar in nature (as opposed to OLTP databases, which use row-based storage), which helps speed up analytical queries on given columns, since all values of a column are stored together on disk. There are many articles available online on this, so I will avoid going into too much detail. Organizations might keep both an OLTP database and an OLAP data warehouse side by side for different needs, where the OLTP database keeps only recent records for fast transactions and the OLAP data warehouse retains historical records for analytical queries (the two could be bridged by a CDC tool like Debezium for automated replication).
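To make the row vs. columnar point concrete, here is a minimal sketch (assuming pyarrow is installed; the file name and column are hypothetical, purely for illustration) showing how a columnar format lets an analytical query read only the column it needs instead of scanning whole rows:

```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Columnar storage: an aggregation over one column only needs to read
# that column from disk, not every full row.
# "sales.parquet" and the "amount" column are illustrative names only.
table = pq.read_table("sales.parquet", columns=["amount"])
total = pc.sum(table.column("amount")).as_py()
print(f"Total sales: {total}")
```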

One point to note: both the OLTP database and the OLAP warehouse store highly structured data in the form of tables, which means we need to define a schema and adhere to it. An OLAP warehouse also comes at a high cost. Soon, large organizations started generating massive amounts of data of different kinds, including semi-structured and unstructured data alongside structured data. Data is considered an asset in this generation. Hence there was a need for relatively cheap storage that could hold data of different kinds (a mix of structured, semi-structured and unstructured) and keep it for possible future analytics rather than discarding it. Sometimes organizations also need to retain data for a long time due to audit requirements, so they need to dump their data into some central, cheap storage. This is where the Data Lake was born. The data lake brought the concept of dumping your raw data into central storage (backed by HDFS on-prem, or object storage on the cloud) and figuring out later what to do with it. This data could be text, images, video, etc. A data lake is not only meant for raw data; we can process the raw data and put the refined data back in the lake for another use case. This highlights another beauty of the Data Lake: the raw data (which is the source of truth) is immutable, which is not the case with traditional databases. In case you are wondering what immutability gets you, let me mention it briefly. With immutability we can time travel and query data at any point in its history, and it provides fault tolerance against application bugs that might corrupt data, since we can always re-derive data from the raw data.

A data lake can be seen as a repository of data of various kinds. It decouples storage from compute, so both can scale independently, which is not the case with traditional data warehouses.

(Image: a data warehouse holds structured data, while a Data Lake is a repository of structured, semi-structured & unstructured data.)

Let’s briefly touch upon various points relevant to the Data Lake below.

1. Ingesting data into Data Lake :-

Data Lake ingestion is simpler than writing to a data warehouse, given there is no schema we need to adhere to while ingesting. Data can be ingested in its native format, which could be CSV, JSON, Avro, Parquet, ORC, etc. Parquet is a widely used and recommended storage format, as it has better compression and is good for analytical queries due to its columnar storage.
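As a minimal sketch (assuming a running Spark session with S3 access configured; the bucket and paths are hypothetical), raw CSV files could be landed in the lake as Parquet like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-raw").getOrCreate()

# Hypothetical landing and lake paths, purely for illustration.
raw = spark.read.option("header", True).csv("s3a://landing-bucket/orders/2023-01-26/")

# Land the data in the raw zone of the lake as Parquet, so later
# analytical reads benefit from compression and columnar scans.
raw.write.mode("append").parquet("s3a://data-lake/raw/orders/")
```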

2. Refining data :-

Once raw data has been ingested into the Data Lake, we might need to refine/curate it for a later use case. We can refine the data as many times as we need to cater to different requirements. This refiner can be written in many ways: a Spark job for on-prem HDFS, or AWS EMR or AWS Glue on AWS S3, for example. One example of refinement: if we are ingesting semi-structured JSON raw data, we can flatten it for another use case, as sketched below.
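Here is a minimal sketch of such a refiner (assuming a Spark session; the paths and nested field names are hypothetical), which flattens nested JSON and writes the curated result back into a refined zone of the same lake:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("refine-orders").getOrCreate()

# Hypothetical raw zone path and nested schema, for illustration only.
raw = spark.read.json("s3a://data-lake/raw/orders/")

# Flatten the nested "customer" struct into top-level columns.
refined = raw.select(
    col("order_id"),
    col("amount"),
    col("customer.id").alias("customer_id"),
    col("customer.country").alias("customer_country"),
)

# Write the curated data back into the lake for downstream use cases.
refined.write.mode("overwrite").parquet("s3a://data-lake/refined/orders/")
```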

3. Data catalogue :-

With a huge amount of data in the Data Lake, we need a catalogue to search and refer to it efficiently. We could use Hive Metastore for on-prem HDFS, or the AWS Glue catalogue on the cloud.
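As a minimal sketch (assuming Spark is configured to talk to a Hive Metastore, or to AWS Glue acting as the metastore; the database, table and path are the illustrative ones from above), the refined data could be registered as an external table so it becomes discoverable:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark register tables in the configured
# metastore (Hive Metastore on-prem, or AWS Glue when so configured).
spark = SparkSession.builder.appName("catalog").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS lake")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake.refined_orders (
        order_id STRING,
        amount DOUBLE,
        customer_id STRING,
        customer_country STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://data-lake/refined/orders/'
""")
```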

4. Reading data :-

Once raw data has been ingested into the Data Lake and refined as needed, we can query it using Trino (formerly known as PrestoSQL), an open-source distributed SQL query engine, directly on HDFS or object storage. Running Trino against the lake is cheaper but might not be very performant for joins. If performance is a concern, or we need performant joins, we can export the refined data into a data warehouse.
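A minimal sketch using the trino Python client (assuming a reachable Trino coordinator; the host, user and the catalogue/table names from the earlier examples are illustrative):

```python
import trino

# Hypothetical coordinator host, user and catalogue/schema, for illustration.
conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="lake",
)

# Run an analytical query directly against the data sitting in the lake.
cur = conn.cursor()
cur.execute("""
    SELECT customer_country, sum(amount) AS total_sales
    FROM refined_orders
    GROUP BY customer_country
    ORDER BY total_sales DESC
""")
for row in cur.fetchall():
    print(row)
```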

5. OLTP layer on top of data lake :-

A few products have been developed recently to add transactional, OLTP-like features on top of the data lake: Delta Lake, Apache Iceberg and Apache Hudi. These table formats provide features like ACID transactions, schema enforcement, unified batch and stream processing, time travel, etc. A deeper comparison among them deserves a separate blog.
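As a minimal sketch with Delta Lake (assuming Spark is launched with the Delta Lake packages, e.g. via delta-spark; the paths are the illustrative ones from above), showing an ACID write plus time travel:

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake jars/packages are on the Spark classpath.
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://data-lake/delta/orders/"

# ACID write: readers never see a half-written version of the table.
df = spark.read.parquet("s3a://data-lake/refined/orders/")
df.write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as it looked at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```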

6. Entitlement :-

Apache Ranger could be used to impose entitlement (access-control) restrictions on top of the Data Lake, based on the individual use case.
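As a rough, hedged sketch of what that can look like, a column-level policy could be created through Ranger's public v2 REST API (the host, credentials, service name, group and table names below are all hypothetical, for illustration only):

```python
import requests

# Illustrative policy: allow the "analysts" group to SELECT only two
# columns of the refined orders table registered in the Hive service.
policy = {
    "service": "hive_lake",
    "name": "analysts-read-refined-orders",
    "resources": {
        "database": {"values": ["lake"]},
        "table": {"values": ["refined_orders"]},
        "column": {"values": ["customer_country", "amount"]},
    },
    "policyItems": [
        {
            "accesses": [{"type": "select", "isAllowed": True}],
            "groups": ["analysts"],
        }
    ],
}

resp = requests.post(
    "http://ranger.example.internal:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin-password"),
)
resp.raise_for_status()
```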

7. Data modelling :-

This is a slightly advanced topic, but it is worth exploring Legend from Goldman Sachs to build data models on top of the Data Lake in an intuitive and business-friendly way for end consumers.

I appreciate you and the time you took out of your day to read this! Please watch out for more blogs on big data and other cutting-edge technologies. Cheers!
