Data Lake
As data drives business, we need a data lake to collect that data and take advantage of it. In this story, we will cover the key insights about the data lake and get to know it better.
With intelligent services and microservice architectures, a system produces millions of bytes of data per second, but if we don’t use that data to drive our business and extract useful insights that help us make crucial decisions, it’s just garbage to us. This is where data lakes come into the picture.
Before going further, let’s see what types of data we have!
What are the types of Data?
So mainly there are two types:
- Structured Data: It’s neat, has a known schema, and fits into fixed fields in a table. A relational database is an example of structured data: tables are linked using unique IDs, and a query language like SQL is used to interact with the data. Structured data is created using a predefined (fixed) schema and is typically organized in tabular format. Think of a table where each cell contains a discrete value; the schema is the blueprint of how the data is organized, the heading row of the table that describes the meaning and format of each column.
- Unstructured Data: It can be found in many forms: web pages, emails, blog and social media posts, etc. Around 80% of the data we have is said to be unstructured. Regardless of the format used to store it, in most cases we are talking about textual documents made of sequences of words, with no schema or structure at all, e.g. images, application logs, etc. The sketch below contrasts the two.
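To make the distinction concrete, here is a tiny Python sketch showing the same facts once as a structured record with a fixed schema and once as unstructured free text; the record and its field names are made up purely for illustration:

```python
# Structured: a fixed schema; every record has the same typed fields,
# so you can query "sum of amount_usd per customer_id" directly.
structured_order = {
    "order_id": 1001,                     # integer key
    "customer_id": 42,                    # link to a customers table
    "amount_usd": 19.99,                  # numeric, known unit
    "placed_at": "2021-03-01T10:15:00Z",  # ISO-8601 timestamp
}

# Unstructured: the same facts buried in free text; there is no schema
# to query against, so extracting them requires parsing or ML.
unstructured_note = (
    "Customer #42 called this morning and placed order 1001 for $19.99."
)
```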
What is Data Lake?
Let’s begin with a brief overview of the data lake: data from various sources, structured or unstructured, gets into the ingestion layer; from there we store the data, organize it, and process and transform it so that it can be applied to various tasks, say training a machine learning model, analyzing trends, building dashboards, generating reports, etc.
Data is one of a business’s most valuable assets, but there are steps we need to take in order to utilize it well for our organization.
You may call this the data processing ladder: first we collect the data, then we organize it properly with indexing and a catalog. Once the data is organized, we can run queries over it, analyze it, and finally extract interesting and useful insights that help drive the business and make crucial decisions.
Coming back to the data lake, here are some key points that summarize its use and role:
- It’s a centralized repository for structured and unstructured data storage.
- Data flows from the streams (the source systems) into the lake. Users have access to the lake to examine it, take samples, or dive in.
- Data is stored at the leaf level in an untransformed or nearly untransformed state.
- It stores raw data as-is, without imposing any structure.
- It stores many types of data, such as files, text, and images.
- Data is transformed and a schema is applied when needed to fulfill the needs of analysis.
- It stores machine learning artifacts, real-time data, and analytics output.
- It’s different from databases and data warehouses.
Now let’s deep-dive into the architecture
There are three main layers when it comes to designing a data lake; time to go through each of them.
Ingestion Engine/layer
This is basically the data consumption or collection layer. Data from multiple sources comes into it, and this layer then sends it onward for storage and processing. Broadly, there are three sources that supply the data.
We can get data from databases. For this we need an SQL service that fetches all the data from the database. If the database is live and still taking writes, we also need to build a replication mechanism: we can’t just fetch the data once, we have to keep ourselves in sync with the live DB.
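Here is a minimal sketch of such a one-shot extract, assuming a Postgres source and an S3 bucket as the lake’s raw zone; the host, table, bucket, and key names are all hypothetical. Keeping in sync with a live DB would additionally need change-data-capture (e.g. a tool like Debezium), which is out of scope here.

```python
# One-shot "fetch from the source DB, land in the lake" sketch.
# Requires psycopg2 and boto3; all names below are placeholders.
import csv
import io

import boto3
import psycopg2

conn = psycopg2.connect(host="orders-db.internal", dbname="shop",
                        user="reader", password="...")
cur = conn.cursor()
cur.execute("SELECT order_id, customer_id, amount_usd, placed_at FROM orders")

# Serialize the result set as CSV in memory.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([col.name for col in cur.description])  # header row
writer.writerows(cur.fetchall())

# Land the raw extract under the lake's ingestion prefix on S3.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-data-lake",
              Key="raw/orders/full_extract.csv",
              Body=buf.getvalue().encode("utf-8"))
```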
Second, we can get the data from our applications. Since most firms now adopt a microservice architecture, there are thousands of services running, and they produce millions of bytes of data, such as logs, that can be used for further analysis. For this we need a streaming mechanism (say Kafka) so that all the data from these services flows into our lake asynchronously and there is no load on the system.
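As a rough sketch of the producing side, here is how a service might push log events onto a Kafka topic with the kafka-python client; the broker address and topic name are assumptions. On the lake side, a consumer (or a sink connector) drains the topic into storage.

```python
# A microservice emitting log events to Kafka asynchronously,
# so the lake's ingestion layer can consume them at its own pace.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "service": "checkout",
    "level": "INFO",
    "message": "order placed",
    "ts": time.time(),
}

# send() is asynchronous: the service isn't blocked while the lake ingests.
producer.send("service-logs", value=event)
producer.flush()  # only needed at shutdown / batch boundaries
```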
And the last one is the local disk/NFS: say I need to get a CSV file into the lake. We require an upload mechanism for this, so that the user can upload the data into the lake. We can provide a self-serve portal where the user uploads the data; this UI serves manual requests, and for automation we can build APIs.
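Such an upload API could be as small as the following Flask sketch; the bucket name and route are hypothetical, and a real portal would add authentication, validation, and size limits.

```python
# Minimal self-serve upload endpoint: accepts a file and lands it
# under the lake's manual-uploads prefix on S3.
import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["file"]  # e.g. a CSV chosen in the portal UI
    key = f"raw/uploads/{f.filename}"
    s3.upload_fileobj(f, "my-data-lake", key)
    return {"status": "stored", "key": key}, 201

if __name__ == "__main__":
    app.run()
```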
So this is all we require to build our ingestion layer.
Process
Once we have all the data from the ingestion layer, we have to store it and work on it.
So we store all the data; you may use AWS S3 for the storage. But storing alone is not enough: we need to get to know the data and process it as fast as possible. So we also need a catalog (you may use AWS Glue for this) and, most importantly, to index our data; proper indexing is needed for fast processing.
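As one concrete (and hypothetical) example, here is how the raw orders extract from earlier could be registered in the AWS Glue Data Catalog via boto3, so that query engines can find and read it. In practice a Glue crawler can infer this for you, and partitioning the S3 prefix (e.g. by date) acts as the “index” that lets queries prune data quickly.

```python
# Register an S3 prefix as a CSV table in the Glue Data Catalog.
# Assumes the Glue database "lake_raw" already exists; database,
# table, column, and bucket names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="lake_raw",
    TableInput={
        "Name": "orders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "customer_id", "Type": "bigint"},
                {"Name": "amount_usd", "Type": "double"},
                {"Name": "placed_at", "Type": "timestamp"},
            ],
            "Location": "s3://my-data-lake/raw/orders/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io."
                            "HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```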
Once we are done with storage, indexing, and the catalog, we need to extract and transform this data for our needs. For this we can have SQL as a service or functions as a service, say AWS Lambda. We can either automate this process or run it manually; that depends on the consumption and how frequently we are getting the data. We may schedule these jobs to transform our data.
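A minimal sketch of the Lambda route, assuming the function is wired to S3 put-event notifications; the bucket layout and the trivial “transform” are placeholders.

```python
# Lambda handler: fires when a new raw object lands in S3, applies a
# transformation, and writes the result to a "processed" prefix.
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # S3 event notifications deliver one record per new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        transformed = body.upper()  # stand-in for real cleaning/parsing

        s3.put_object(Bucket=bucket,
                      Key=key.replace("raw/", "processed/", 1),
                      Body=transformed)
```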
Now there is another key component, governance, which is also an important part of this layer. It is basically responsible for knowing where the data is flowing from (the data lineage) and, secondly, for enforcing policies over the data. Say a government policy in some region states that we can’t store a user’s email ID and contact details for more than one year; in that case this component can run a scheduled job to obfuscate all user info that is more than a year old.
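That scheduled job could look something like the following sketch, shown against a simple SQL store for readability; the table and column names are hypothetical, and the one-way hash keeps records joinable without exposing the raw PII.

```python
# Scheduled governance job: pseudonymize contact details on user
# records older than one year, per the retention policy above.
import hashlib
import sqlite3

def pseudonymize(value: str) -> str:
    # One-way hash: irreversible, but stable for joins/deduplication.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]

conn = sqlite3.connect("lake_metadata.db")
cur = conn.cursor()

cur.execute(
    "SELECT id, email, phone FROM users "
    "WHERE created_at < date('now', '-1 year')"
)
for user_id, email, phone in cur.fetchall():
    cur.execute(
        "UPDATE users SET email = ?, phone = ? WHERE id = ?",
        (pseudonymize(email), pseudonymize(phone), user_id),
    )

conn.commit()
conn.close()
```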
This is the main layer; from here our data is ready for the insights.
Insights
There are a lot of ways we can use this data now: we can consume it to build dashboards, do trend analysis, or generate reports. We can feed it to train machine learning models, for instance to recommend items to a user in the future based on their actions or purchase history. We can hand it to our product folks so they can learn how our users behave, or the data may relate to some A/B test and we need to see how well that test went. Other applications can consume this data in real time too.
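Staying with the AWS flavor used above, here is a hypothetical sketch of pulling an insight out of the lake: running SQL over the cataloged table with Amazon Athena, whose result could then feed a dashboard, a report, or an ML training job.

```python
# Run an analytical SQL query over the lake via Athena.
# Database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="""
        SELECT customer_id,
               COUNT(*)        AS orders,
               SUM(amount_usd) AS spend
        FROM lake_raw.orders
        GROUP BY customer_id
        ORDER BY spend DESC
        LIMIT 100
    """,
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
print("query id:", resp["QueryExecutionId"])  # poll get_query_execution to wait
```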
There can be multiple uses of this data, and it all depends on what you are storing and why you need it in the end.
And with this last layer, I hope you are all set to start building a data lake for your organization and getting data to work for you.
Data Lake: The AWS Way!
AWS does provide various services that can be used to build your own data lake; I recently attended a conference on the same, and the AWS services presented there map one-to-one onto the layers we walked through above. You can refer to the AWS documentation to get more insight into them.
We haven’t talked about security here, but it is important to consider when we are developing almost anything.
PS: I usually prefer AWS technologies when it comes to the cloud, but I’m sure other cloud providers, say Microsoft Azure, offer these services too. The choice is yours: use these service providers, work with open-source tools, or combine both to have your data lake ready.
I hope you enjoyed the article and got an idea, or at least an overview, of how we go about designing a data lake. I know this is brief, just to provide you with a helicopter view of the system; there are a lot more details to go into when it comes to building and running these systems.
Please do provide your valuable feedback, as our brains also act like data lakes: they consume feedback and extract insights from it, which eventually drives improvement.
Thanks… see you soon.