Data Lake: What, Why, and How
Data volumes are growing as more and more devices connect to the internet. As a result, the Internet of Things (IoT) produces a massive amount of data at a pace that traditional systems cannot keep up with. Data engineering is about architecting data pipelines to manage this data, and it is evolving to handle the data's volume and velocity.
Data lakes are a hot topic in data engineering because they solve many of the challenges presented by traditional data warehouses (DWH). Traditionally, data was stored in databases in a structured format, shaped by the needs of the business. Now, however, analytics need to be extracted from many different types of data, which can be unstructured, semi-structured, or even binary.
Over the years, different types of databases have been developed to store various kinds of data; NoSQL databases, for example, are used for non-relational data. Quite a few purpose-built databases also exist for specific needs, such as time-series data (InfluxDB), session storage (Redis), text search and indexing (Elasticsearch), and document storage (MongoDB). Other categories include NewSQL and key-value databases.
The technology has evolved, and the trend is shifting toward storing all types of data in one storage system: a data lake. The name comes from an analogy with a lake, which is fed by water streaming in from various sources. Similarly, data is streamed into the data lake from multiple systems and in multiple formats. The goal is to get quick query results from the data irrespective of how it is stored and formatted, and data lake architecture optimizes both storage and computation.
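To make the idea of one storage system for every format more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: a local temporary directory stands in for the lake's storage, and the `raw/<source>/<name>` layout, the source names, and the helpers `ingest`, `catalog`, and `demo` are all hypothetical, not part of any particular data lake product.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def ingest(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload unchanged under <lake_root>/raw/<source>/<name>.

    The layout is an assumption made for this sketch; the point is that
    the data is kept in its original format, whatever that format is.
    """
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)
    return target


def catalog(lake_root: Path) -> List[str]:
    """List every stored object as a path relative to the lake root."""
    return sorted(
        p.relative_to(lake_root).as_posix()
        for p in lake_root.rglob("*")
        if p.is_file()
    )


def demo() -> List[str]:
    """Ingest semi-structured, structured, and binary data into one lake."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        # Semi-structured: a JSON reading from a hypothetical IoT device.
        reading = json.dumps({"device": "t-1", "temp_c": 21.5}).encode("utf-8")
        ingest(lake, "iot_sensors", "reading.json", reading)
        # Structured: a CSV export from a hypothetical orders database.
        ingest(lake, "orders_db", "orders.csv", b"order_id,amount\n1,9.99\n")
        # Binary: raw bytes, for example an image frame.
        ingest(lake, "camera", "frame.bin", bytes([0, 255, 16, 32]))
        return catalog(lake)


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

Nothing is converted or given a schema at write time in this sketch; each source simply lands in its own folder in its original format, and any structure is left to be applied later, when the data is read.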