Getting Started with Data Lake

Dmitry Anoshin
Published in Rock Your Data
6 min read · May 16, 2019


Lake Bled, Slovenia

I chose Lake Bled for the cover image because it is one of my favorite lakes. But we will talk about a different type of lake: the Data Lake. You have probably heard a lot about it, especially if you work with data. I believe one more definition and article about Data Lakes won’t hurt anyone.

There are a couple of popular quotes about Data Lakes:

“A single data store for all of the raw data that anyone in an organization might need to analyze” by Martin Fowler

If you think of a datamart as a store of bottled water — cleansed and packaged and structured for easy consumption — the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples. by James Dixon

It is clear that a Data Lake is somehow related to analytics: it helps us store data and access it easily.

I like to simplify things, so let’s try to simplify the idea of the data lake. Consider something familiar: the photo gallery on an iPhone.

Screen from my Data Lake presentation

We take a photo and it is saved to cloud file storage (iCloud). Moreover, iCloud collects metadata about the images (files). As a result, we can access the data through a friendly interface and see some statistics. In my example, I can see the number of photos with fire.

Even from this simple example, we can see the advantages of a data lake over a traditional Data Warehouse. Moreover, we can identify the key components:

  1. Ingestion — a key component of a data lake. We can ingest data into the data lake using batch processing or streaming.
  2. Storage — the main component of a data lake. We should be able to access data in a flexible and scalable way, with extremely high durability at a low cost. Object storage such as Amazon S3, or the similar capabilities from Azure and GCP, fits best.
  3. Catalog and Search — in order to avoid data swamp, we should build metadata layer for classification of data and for users to be able to search on various attributes. Often, we can build an API in order to provide a search interface.
  4. Process — this layer is responsible for data transformation. We can transform data into various structures or formats, and we can also analyze the data using this processing power. We can leverage Hadoop or Spark for processing, and Hive/Presto/Athena and other tools for analysis.
  5. Security — we should think about the security of the solution: encryption of data at rest and in transit, and a mechanism for authenticating and authorizing users. Moreover, we should audit all events around the data lake.
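The first three steps above can be sketched in miniature. This is a toy, in-memory illustration (all names, keys, and fields are hypothetical), assuming the lake is keyed object storage like S3 with a metadata catalog beside it:

```python
from datetime import date

# Toy in-memory "object storage" and metadata catalog (names are hypothetical)
lake = {}      # object key -> raw bytes
catalog = {}   # object key -> metadata used for search

def ingest(source: str, payload: bytes, day: date) -> str:
    """Ingest a raw file under a date-partitioned key, then catalog it."""
    key = f"raw/{source}/{day:%Y/%m/%d}/data.json"        # 1. Ingestion
    lake[key] = payload                                   # 2. Storage
    catalog[key] = {"source": source, "day": str(day),    # 3. Catalog
                    "bytes": len(payload)}
    return key

def search(source: str) -> list:
    """Find all objects for a given source via the metadata layer."""
    return [k for k, m in catalog.items() if m["source"] == source]

key = ingest("clickstream", b'{"url": "/home"}', date(2019, 5, 16))
print(search("clickstream"))  # ['raw/clickstream/2019/05/16/data.json']
```

The point of the catalog dictionary is exactly the "avoid the data swamp" idea: users search metadata attributes, not the raw objects themselves.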

In a practical sense, a data lake is characterized by three key attributes:

  • Collect everything — A data lake contains all data, both raw sources over extended periods of time and any processed data.
  • Dive in anywhere — A data lake enables users across multiple business units to refine, explore and enrich data on their own terms.
  • Flexible access — A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory, and other processing engines.

Data Warehouse vs Data Lake

The logical questions are:

  • “What about the Data Warehouse?”
  • “Do we replace the Data Warehouse, or do we extend it?”
  • “Maybe we can use only a Data Warehouse?”

I would say “yes” to all of these questions. For example, here is an article from the Amazon subsidiary Woot — “Our data lake story: How Woot.com built a serverless data lake on AWS”. They replaced their DW with a Data Lake using the AWS technology stack.

On the other hand, Snowflake claims that you don’t need to build a data lake separately because Snowflake provides this functionality by separating compute and storage. And this is true.

And finally, we can complement an existing Data Warehouse solution with a Data Lake, for example when using Google BigQuery, Amazon Redshift, or Azure SQL Data Warehouse.

Let’s look at the traditional Data Warehouse solution:

Traditional Data Warehouse

A pretty straightforward solution: we collect data from sources via ETL/ELT and load it into the Data Warehouse. Then we can access the data with BI tools. The downsides of this approach are:

  • ETL/ELT takes time
  • Storage and compute are expensive
  • Business users see only aggregated and transformed data, with no access to the raw data

It really depends on your use cases. If you are OK with your existing DW and its functionality, then you don’t need a Data Lake. But it is clear that the more data we have, the more value we can extract. That’s why the data lake is popular. Let’s look at the data lake pipeline:

Data Lake Architecture

Using the Data Lake approach, we ingest data into the Data Lake in a batch or streaming fashion, and then we can process and transform that data. The Data Lake contains the raw data, which allows different users to run their own ETL processes and format the data the way they need it.
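Because the lake keeps raw events rather than one pre-aggregated view, each consumer can shape the same data independently. A small sketch of that idea (the event fields and consumers here are made up for illustration):

```python
# Raw clickstream events as they might land in the lake (fields are made up)
raw_events = [
    {"url": "/home", "status": 200, "agent": "curl/7.46.0"},
    {"url": "/buy",  "status": 404, "agent": "Mozilla/5.0"},
    {"url": "/home", "status": 200, "agent": "Googlebot/2.1"},
]

# Consumer 1: site reliability wants error counts per URL
errors = {}
for e in raw_events:
    if e["status"] >= 400:
        errors[e["url"]] = errors.get(e["url"], 0) + 1

# Consumer 2: marketing wants a count of bot traffic
bots = sum(1 for e in raw_events if "bot" in e["agent"].lower())

print(errors)  # {'/buy': 1}
print(bots)    # 1
```

With a traditional warehouse, both consumers would depend on one central ETL team; here each one derives its own view from the same raw events.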

The key goal of a data solution is to serve business users. We should always work backwards from business users and meet their data needs.

Let’s compare the key points of Data Lake and Data Warehouse:

Data Lake vs Data Warehouse

Based on the table above, it is clear that the Data Warehouse doesn’t compete with the Data Lake. They are actually complementary technologies.

Real world example of Data Lake

The role of the Data Lake in an organization should now be clear. There are many use cases for Data Lakes available nowadays. In most cases, we want a Data Lake when the Data Warehouse can’t help us, or when we have a strict SLA for near-real-time streaming data.

One recent case is getting insights from clickstream data (access logs). The data volume could be terabytes per day. Moreover, this type of data is semi-structured, as in the example below:

https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 0.086 0.048 0.037 200 200 0 57
"GET HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
"Root=1-58337281-1d84f3d73c47ec4e58577259" "" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012"
1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-"
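Semi-structured lines like these can still be parsed into records before loading. A minimal sketch against the first line of the sample above — note that the field positions here match this truncated sample, not the full AWS load balancer log format, which carries more fields (client and target addresses, the full request, and so on):

```python
# First line of the sample log above, truncated as in the article
line = ("https 2018-07-02T22:23:00.186641Z "
        "app/my-loadbalancer/50dc6c495c0c9188 0.086 0.048 0.037 200 200 0 57")

fields = line.split()
record = {
    "scheme": fields[0],
    "timestamp": fields[1],
    "elb": fields[2],
    "request_processing_time": float(fields[3]),
    "target_processing_time": float(fields[4]),
    "response_processing_time": float(fields[5]),
    "elb_status_code": int(fields[6]),
    "target_status_code": int(fields[7]),
}
print(record["elb_status_code"])  # 200
```

At TB-per-day volumes, this kind of parsing is exactly the work we push into the lake’s processing layer instead of a traditional ETL tool.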

For example, we might try to use a traditional ETL tool to load around 50 GB of raw data into the Data Warehouse. That is 6,000 log files per day. The traditional approach took ~1 minute per file, so with our volume of data it would take far too long. Moreover, analytics data warehouse storage is quite expensive; in our case, we used Redshift. As a result, we came up with a data lake solution:

Clickstream processing

The solution is simple. We leverage Elastic MapReduce and Spark to produce Parquet files. On top of the Data Lake, Redshift Spectrum provides SQL access to the data. AWS Glue collects metadata about the available data and partitions.
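Parquet is a columnar format, which is why scanning a handful of fields out of wide log records is cheap for engines like Spectrum. A toy illustration of the row-to-column pivot (this is the concept only, not actual Parquet encoding):

```python
# Row-oriented records, as they arrive from log parsing
rows = [
    {"url": "/home", "status": 200, "bytes": 57},
    {"url": "/buy",  "status": 404, "bytes": 0},
]

# Columnar layout: one list per field, the way Parquet lays data out on disk.
# A query that only needs "status" can now read a single contiguous column
# instead of scanning every full record.
columns = {name: [r[name] for r in rows] for name in rows[0]}

print(columns["status"])  # [200, 404]
```

Combined with date partitioning, this lets Spectrum read only the columns and partitions a query actually touches.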

As a result, business users are able to deliver insights for business use cases: analyzing bot traffic, tracking broken URLs, and measuring the performance of the website.

About Rock Your Data

Rock Your Data is a consulting and technology firm that delivers secure and scalable cloud analytics solutions for large and medium-sized enterprises in Canada.

Rock Your Data helps organizations make distinctive, lasting, and substantial improvements in their performance by leveraging their data and cutting-edge technology.