Building a Data Lake with Snowflake

Thank you for reading my earlier blogs on Snowflake designs — designing a warehouse, ELT loads, and data mesh. In case you missed them, you can refer to the blogs here —

Implementing ELT with Snowflake — https://medium.com/snowflake/implementing-elt-with-snowflake-660f9414fb19

Architecting Data warehousing solutions with Snowflake — https://medium.com/snowflake/architecting-data-warehousing-solutions-with-snowflake-1157983d6213

Designing data mesh with Snowflake — https://medium.com/snowflake/designing-data-mesh-with-snowflake-aecb5583f591

This is the next blog in the series; it helps you learn more about data lakes, typical data lake challenges, designing a data lake, and implementing it with Snowflake.

What is a data lake?

A data lake stores and processes large volumes of data. The data can be structured, semi-structured, or unstructured, and is kept in its native format. Data lake architecture has evolved over time to meet the demands of increasingly data-driven enterprises as data volumes continue to rise. Data lakes typically store a massive amount of raw data in its native formats; this data is made available on demand, as needed. When a data lake is queried, a subset of data is selected based on search criteria and presented for analysis.

Traditional Data Lake

Data Lake features —

A data lake is used to store, refine, process, and analyze huge volumes of data. Below are some of its key features —

  • Open to all data, regardless of type or source
  • Data is stored in its original raw, untransformed state
  • Data is transformed only at analysis time, when a query selects it based on matching criteria

Data Lake benefits —

The source- and format-agnostic nature of data stored in a data lake offers several benefits for businesses, including:

  • Flexibility, as data scientists can quickly and easily configure queries
  • Accessibility, as all users can access all data
  • Affordability, as many data lake technologies are open source
  • Compatibility with most data analytics methods
  • Comprehensiveness, as it combines data from all of an enterprise’s data sources, including IoT

Data warehouses and data lakes differ in how data is stored and processed, as well as in the types of data they support. Many data warehouse implementations are now planning to move to a data lake approach, storing raw data and processing it as needed. Implementing a data lake on the cloud gives you a variety of cloud-native services, and data platforms like Snowflake offer everything you need to implement one.

Snowflake Data Lake

Snowflake’s platform provides both the benefits of data lakes and the advantages of data warehousing and cloud storage. With Snowflake as the central data repository, businesses gain best-in-class performance, relational querying, security, and governance. Alternatively, you can keep the data in cloud storage on AWS or Azure.

You can refer to this reference architecture and design the platform considering the various integrations within it —

  1. Source integrations — Data lake sources vary widely, and data is integrated from them as batch loads, streams, APIs, and so on. The type of data also varies by source: structured, semi-structured, and unstructured. Data can be integrated through ETL, ELT, and streaming ingestion. Snowflake supports ETL tools, offers various connectors for databases, native load utilities for batch data, and streaming pipelines via Snowpipe and Snowpipe Streaming. You can also use ETL tools such as Matillion to load data into the raw layer.
  2. Raw layer — Sourced data lands here and is stored in its raw, native format using cloud storage services. You can define external tables over the files or load the data into raw/landing tables, typically as truncate-and-load (see the first sketch after this list).
  3. Intermediate or transform layer — Data from the raw layer is processed and loaded into the transform layer. Transformations can be applied as SQL or through native database objects such as stored procedures and UDFs. You can also use tools like dbt to design models and run pipelines in native SQL, and orchestrate with native tasks. Snowflake offers streams for CDC (see the second sketch after this list); you can read more about CDC here — https://blog.devgenius.io/change-data-capture-using-snowflake-streams-54a58e1839d3
  4. Target layer — Processed data is stored in the target layer for consumers. Typically the data is appended or upserted (update + insert), as in the MERGE in the second sketch below.
  5. Consumers — Data consumers pull data from the platform. Internal consumers can be stakeholders such as data science and data analyst teams, or other business units like marketing and sales; external consumers can be downstream applications or users that extract data from the platform. Snowflake offers data sharing, which enables sharing with consumers securely (see the third sketch after this list). You can read about data sharing here — https://poojakelgaonkar.medium.com/snowflake-data-sharing-149c9b97fce2
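
As a minimal sketch of steps 1 and 2, here is what a batch truncate-and-load into a raw table might look like. All object names (raw_db.landing.orders_raw, my_s3_int, the S3 path) are hypothetical, and a storage integration is assumed to be configured already:

```sql
-- Stage pointing at the cloud storage location; my_s3_int is an assumed,
-- pre-configured storage integration.
CREATE STAGE IF NOT EXISTS raw_db.landing.orders_stage
  URL = 's3://my-bucket/landing/orders/'
  STORAGE_INTEGRATION = my_s3_int
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Truncate-and-load the raw/landing table from the stage.
TRUNCATE TABLE raw_db.landing.orders_raw;

COPY INTO raw_db.landing.orders_raw
  FROM @raw_db.landing.orders_stage
  ON_ERROR = 'ABORT_STATEMENT';

-- For continuous ingestion, Snowpipe can run the same COPY automatically
-- whenever new files land in the bucket.
CREATE PIPE IF NOT EXISTS raw_db.landing.orders_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_db.landing.orders_raw
  FROM @raw_db.landing.orders_stage;
```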
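
For steps 3 and 4, a stream on the raw table can capture changes (CDC), and a scheduled task can merge them into the target layer as an upsert. Again, all object, warehouse, and column names below are hypothetical:

```sql
-- Stream that records changes on the raw table.
CREATE OR REPLACE STREAM raw_db.landing.orders_stream
  ON TABLE raw_db.landing.orders_raw;

-- Task that runs every 15 minutes, but only when the stream has data,
-- and upserts the changes into the target table via MERGE.
CREATE OR REPLACE TASK transform_db.pipeline.merge_orders
  WAREHOUSE = transform_wh
  SCHEDULE  = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw_db.landing.orders_stream')
AS
  MERGE INTO target_db.mart.orders AS t
  USING (
    SELECT order_id, TRY_TO_DATE(order_date) AS order_date, amount
    FROM raw_db.landing.orders_stream
    WHERE METADATA$ACTION = 'INSERT'
  ) AS s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.order_date = s.order_date, t.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, order_date, amount)
    VALUES (s.order_id, s.order_date, s.amount);

-- Tasks are created suspended; resume the task to start the schedule.
ALTER TASK transform_db.pipeline.merge_orders RESUME;
```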
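
For step 5, a secure share might be set up as below; the share name and the consumer account locator (xy12345) are placeholders:

```sql
-- Create a share and grant access to the consumer-facing objects.
CREATE SHARE IF NOT EXISTS orders_share;
GRANT USAGE ON DATABASE target_db TO SHARE orders_share;
GRANT USAGE ON SCHEMA target_db.mart TO SHARE orders_share;
GRANT SELECT ON TABLE target_db.mart.orders TO SHARE orders_share;

-- Add the consumer account (replace with the real account locator).
ALTER SHARE orders_share ADD ACCOUNTS = xy12345;
```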

6. Data protection — Snowflake offers features and extended support to implement data security and governance. A minimal masking-policy sketch follows; for Snowflake's data protection features, refer here — https://medium.com/snowflake/snowflake-data-protection-features-part-ii-3057fccb06e5
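
As one example of data protection, a masking policy can hide a sensitive column from unauthorized roles. This is a minimal sketch with hypothetical policy, table, and role names (masking policies require Enterprise edition or higher):

```sql
-- Return the real value only to the ANALYST role; mask it for everyone else.
CREATE OR REPLACE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYST') THEN val
    ELSE '***MASKED***'
  END;

-- Attach the policy to the column it should protect.
ALTER TABLE target_db.mart.customers
  MODIFY COLUMN email SET MASKING POLICY mask_email;
```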

7. Alerting & error handling — Snowflake offers native features and integrations to connect with cloud services via API integrations and notification integrations, as sketched below. You can refer here — https://poojakelgaonkar.medium.com/setting-up-alerting-for-snowflake-data-platform-8b67863eeb07
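
As a minimal alerting sketch, a notification integration plus a scheduled alert can email the team when a freshness check fails. The integration, warehouse, table, and threshold below are all hypothetical, and ALLOWED_RECIPIENTS must be verified email addresses of users in the account:

```sql
-- Email integration used by the alert.
CREATE NOTIFICATION INTEGRATION IF NOT EXISTS pipeline_email_int
  TYPE = EMAIL
  ENABLED = TRUE
  ALLOWED_RECIPIENTS = ('data-team@example.com');

-- Alert that fires hourly if no rows were loaded in the last 2 hours.
CREATE OR REPLACE ALERT target_db.mart.stale_orders_alert
  WAREHOUSE = alert_wh
  SCHEDULE  = '60 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM target_db.mart.orders
    HAVING MAX(load_ts) < DATEADD('hour', -2, CURRENT_TIMESTAMP())
  ))
  THEN CALL SYSTEM$SEND_EMAIL(
    'pipeline_email_int',
    'data-team@example.com',
    'Snowflake alert: orders table is stale',
    'No new rows loaded into target_db.mart.orders in the last 2 hours.'
  );

-- Alerts are created suspended; resume to activate.
ALTER ALERT target_db.mart.stale_orders_alert RESUME;
```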

I hope this blog helps you understand data lakes, data lake design, architecture components, layers, and implementation with Snowflake.

About Me :

I am one of the Snowflake Data Superheroes 2023 and a SME for the Snowflake SnowPro Core Certification Program. I am a DWBI and Cloud Architect, currently working as a Senior Data Architect on GCP and Snowflake. I have worked with various legacy data warehouses, Big Data implementations, and cloud platforms/migrations. I am a SnowPro Core certified Data Architect as well as a Google certified Professional Cloud Architect. You can reach out to me on LinkedIn if you need any further help with certification, data solutions, and implementations!
