Data lakes: what are they, and how do you use them?
Organizations store, manage, and distribute more data than ever before, and that volume grows by the day. Using key innovations such as data lakes and BigLake, you can advance your business with the next generation of data solutions.
Data lakes and data warehouses both store big data, but the terms are not interchangeable; storing data is about the only thing they have in common.
This article will explain what a data lake is and how it differs from a data warehouse. I will also cover BigLake, a new Google Cloud product launched at the Cloud Summit 2022 event, and focus on data lake architecture and the best practices and use cases for data lakes.
What is a data lake?
Like water molecules in a lake, a data lake consists of raw, free-flowing data. A data lake is a large storage repository that holds raw data in its native format. This unstructured character of the data is one of the main differences from a data warehouse, which stores data that has already been processed. The data in a data lake waits there until it’s needed for analytics purposes. A data lake uses a flat architecture, built mainly on files and object storage, and no schema is defined until the data is queried. Each piece of information gets a unique identifier and is tagged with a set of metadata tags. A data lake can then be queried for relevant data, and that smaller set of data can be analyzed to answer business questions.
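To make this concrete, here is a minimal sketch of landing a raw file in a data lake built on Google Cloud Storage, tagged with metadata. The bucket name, paths, and tag names are hypothetical:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
bucket = client.bucket("my-data-lake")  # hypothetical bucket name

# Land the file as-is, in its native format; no schema is imposed here.
blob = bucket.blob("raw/clickstream/2022-06-01/events.json")
blob.metadata = {  # tags that give the object identity and context
    "source": "web-tracker",
    "ingested_at": "2022-06-01T09:00:00Z",
    "format": "json",
}
blob.upload_from_filename("events.json")
```

Nothing about the file’s structure is declared here; interpreting the data is deferred until someone reads it.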
What is a Lakehouse?
Data volumes are growing significantly every day; Gartner predicted that by 2022, 90% of corporate strategies would name analytics as an essential competency. This is why many business leaders have to rethink their strategy for storing and processing data. Many of those businesses have existing data warehouses that offer dependable, structured storage but cannot wrangle raw data for analytics purposes. The solution to this problem is adding a complementary data lake to the existing architecture. Such a hybrid setup is called a lakehouse and makes it possible to run concurrent workloads on the same datasets and reuse the available data for different purposes. At Crystalloids, we see the advantages of lakehouse solutions for some of our clients. The dual implementation allows you to route sensitive data to the warehouse and keep it from entering the data lake, which gives better control over data usage and compliance than a data lake alone. A lakehouse solution is a great way to manage data, serving many different needs.
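As an illustration of that routing idea, here is a deliberately simplified sketch. The field classification is hypothetical; a real implementation would rely on proper data classification tooling:

```python
# Illustrative only: route records at ingestion time so that sensitive
# data lands in the governed warehouse and never enters the data lake.
SENSITIVE_FIELDS = {"email", "phone", "iban"}  # hypothetical classification

def route(record: dict) -> str:
    """Return the destination store for a raw record."""
    if SENSITIVE_FIELDS & record.keys():
        return "warehouse"  # processed, access-controlled storage
    return "lake"           # raw object storage, schema applied on read

print(route({"event": "pageview", "url": "/pricing"}))       # -> lake
print(route({"event": "signup", "email": "a@example.com"}))  # -> warehouse
```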
What is BigLake?
This new Google Cloud Platform (GCP) service allows customers to integrate data lakes and warehouses, manage access on a row and column level, and analyze that data using the GCP-native tool BigQuery or open-source processing engines such as Spark (through the BigQuery Storage Read API).
This extends a decade of BigQuery innovations to data lakes with support for:
- multi-cloud storage
- open formats
- unified security and governance
BigLake is based on BigQuery (BQ) and allows you to examine files in familiar formats (CSV, JSON, Avro, Parquet, and ORC) that may be spread over several cloud storage systems (Google Cloud Storage, Amazon S3, and Azure Blob Storage) from a centralized place. This enables a single source of data “truth” to be shared across numerous cloud platforms without duplicating or copying your data.
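As a sketch of what this looks like in practice, the snippet below creates a BigLake table over Parquet files in Cloud Storage using the BigQuery Python client. The project, dataset, connection, and bucket names are hypothetical, and it assumes a cloud resource connection has already been set up:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Hypothetical project, dataset, connection, and bucket names.
ddl = """
CREATE EXTERNAL TABLE `my-project.lake.orders`
WITH CONNECTION `my-project.eu.lake-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-data-lake/orders/*.parquet']
)
"""
client.query(ddl).result()  # BigQuery now queries the Parquet files in place
```

The files themselves never move; BigQuery reads them where they are, which is what makes the single source of truth possible.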
BigLake extends BigQuery to data lakes, so over time you can expect the same functionality as BigQuery (even through the BigLake storage APIs for external engines). It does not, however, help you transfer data from Snowflake to BigQuery.
Why you should use a data lake
Many data lakes contain large sets of structured, unstructured, and semistructured data. Most data warehouses are built on relational databases, which require a rigid data schema, so those environments can only store structured data and don’t allow varying schemas. Data lakes don’t require any defined schema upfront, which makes them so strong at handling different data types in different formats. Most companies use them as a platform for big data analytics and data science applications that apply advanced analytics methods, such as data mining, machine learning, and predictive analytics.
Schema-on-read and schema-on-write access
Data lakes don’t need a schema. When a user wants to view the data, they apply a schema at that moment. This process is called schema-on-read, and I think it’s a massive advantage of data lakes: defining a schema upfront is time-consuming, so businesses that add new data sources regularly benefit greatly from not having to design one. A data warehouse has a predefined schema, created before the data is loaded into the warehouse. This process is called schema-on-write, and it may prevent the insertion of data that does not conform to the schema. A data warehouse is therefore better suited for cases where a business has a large amount of recurring data that needs to be analyzed to answer predefined business questions.
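Here is a minimal schema-on-read sketch in PySpark, assuming a Spark environment with access to the lake; the bucket path and field names are hypothetical. The JSON files stay raw, and the schema exists only for this read:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The files in the lake remain untouched; this schema applies to this read only.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

orders = spark.read.schema(schema).json("gs://my-data-lake/raw/orders/*.json")
orders.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) AS n, SUM(amount) AS total FROM orders").show()
```

A different team could read the same files tomorrow with a completely different schema, which is exactly the flexibility schema-on-write rules out.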
What a data lake architecture should look like
Building the right data lake architecture is extremely important for turning data into value. The data in your data lake will be useless if you don’t have the architectural features to manage it effectively, so it’s essential to build the correct elements into the data lake architecture. A cloud-optimized architecture will simplify the data lake. A modern cloud data lake should have the following characteristics:
- A multi-clustered and shared-data architecture
- Independent scaling of storage and compute resources
- A well-defined metadata service that is fundamental to an object storage environment
- Architectural components and interactions that all support native data types
- Independently managed data discovery, storage, transformation, and visualization
- A design tailored to the industry, with its unique features and capabilities present for the specific domain
Features that should be available in any data lake include:
Data governance: refers to the processes and standards used to ensure the data can fulfill its intended purpose, and helps to increase data quality and data security. An example of data governance is limiting file sizes to standardize them, since files that are too large can make the data difficult to work with. An example of increasing data quality would be scanning the data for incomplete or unreadable records.
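For illustration, here is a deliberately minimal quality scan of the kind such a policy could mandate; the required fields are hypothetical:

```python
# Illustrative only: flag incomplete records before they are promoted
# out of the raw zone. The required fields are hypothetical.
REQUIRED = {"order_id", "amount"}

def incomplete(records: list[dict]) -> list[dict]:
    """Return the records that are missing required fields."""
    return [r for r in records if not REQUIRED <= r.keys()]

print(incomplete([{"order_id": "A1", "amount": 9.99}, {"order_id": "A2"}]))
# -> [{'order_id': 'A2'}]
```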
Data catalog: the information about the data in the data lake. A data catalog makes it easy to understand the context of the data and enables stakeholders to work faster with it. The types of information in a data catalog vary per use case, but they usually include things like the following (a small sketch of such an entry follows the list):
- The connectors necessary to work with data
- Metadata about the origin of the data and the amount of time it has been stored
- Description of the applications that use the data
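A minimal sketch of what one catalog entry could capture; the field names and values are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CatalogEntry:
    """One asset in the data lake, as a catalog could describe it."""
    path: str                   # where the asset lives
    source_system: str          # origin of the data
    stored_since: datetime      # how long it has been in the lake
    connector: str              # what is needed to work with the data
    consuming_apps: list[str] = field(default_factory=list)

entry = CatalogEntry(
    path="gs://my-data-lake/raw/orders/",
    source_system="webshop",
    stored_since=datetime(2022, 1, 15),
    connector="spark-json",
    consuming_apps=["churn-model", "weekly-report"],
)
```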
Search: Being able to search effectively through the data lake is crucial. The search functionality includes the ability to find data by attributes like size, content, and date of origin. A best practice is to build an index of data assets to facilitate fast searches.
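As a sketch of that best practice, a naive in-memory inverted index over hypothetical asset attributes could look like this; a production system would use a proper search service:

```python
from collections import defaultdict

# Hypothetical assets; in practice these would come from the data catalog.
assets = [
    {"path": "gs://my-data-lake/raw/orders/", "source": "webshop", "size_mb": 120},
    {"path": "gs://my-data-lake/raw/clicks/", "source": "web-tracker", "size_mb": 900},
]

# Build a simple inverted index on the attributes users search by.
index = defaultdict(list)
for asset in assets:
    index[("source", asset["source"])].append(asset["path"])

print(index[("source", "webshop")])  # -> ['gs://my-data-lake/raw/orders/']
```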
Security: To ensure that sensitive data remains private, security measures must be implemented, such as access controls that prevent unauthorized parties from accessing or modifying data, as well as encryption.
The use cases for a data lake
The more unstructured data an organization has, the bigger the need for a data lake solution. An excellent example is the healthcare industry: much of its data is unstructured (think of X-rays, MRI scans, clinical data, medicine info, and analyses), and there is a need for real-time insights. Data lakes suit healthcare because they allow a combination of structured and unstructured data. In transportation, data lakes can help make predictions, and those predictive capabilities can have enormous cost-saving benefits, especially in supply chain management.
Crystalloids uses data lakes to create a lakehouse solution
Building a lakehouse solution is often part of our job when developing data platforms such as Customer Data Platforms. A Customer Data Platform aims to centralize data coming from different sources; combined, this data can provide valuable insights for all kinds of purposes. In the context of a customer data platform, a lakehouse solution is used to gain valuable insights about customer behavior and to automate customer activations in owned, paid, and earned channels.
For example, a customer profile is formed in this central platform based on purchase behavior and loyalty data from various systems. Having this data present in a customer service application can give a sales employee guidance on how to approach a specific type of customer. To create such profiles, it’s important to record history; this is done in a data lake, while other information contributing to a profile can come from the data warehouse. Both streaming data and batch data are processed, and whenever streaming data comes in, it gets written in real time to the multiple databases that need the information.
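As a sketch of that streaming path, the snippet below consumes events from a Pub/Sub subscription and streams them into a BigQuery table. The project, subscription, and table names are hypothetical, and a real setup would fan out to every store that needs the event:

```python
import json
from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project and subscription names.
subscription = subscriber.subscription_path("my-project", "customer-events")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    row = json.loads(message.data)
    # Streaming insert: the row is queryable in BigQuery within seconds.
    errors = bq.insert_rows_json("my-project.cdp.events", [row])
    if not errors:
        message.ack()

# Runs until cancelled; batch data would follow a separate load path.
future = subscriber.subscribe(subscription, callback=callback)
```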
Conclusion
Which solution is best for your organization depends highly on the use case, the type of data, and the existing architecture. If you don’t have a clear purpose for your data yet and your organization has a lot of structured and unstructured information that needs to be used at a later stage, a data lake is a good solution. If your business has an existing warehouse that cannot wrangle raw data, adding a data lake to the existing architecture is the way to go. The result is a lakehouse, which makes it possible to run concurrent workloads on the same datasets and reuse the data for different purposes.