How AWS Lake Formation Helps You Build Data Lakes
About the Product
AWS Lake Formation is a new product in the AWS portfolio that aims to let you build a Data Lake in a matter of days instead of weeks or months.
It was first presented at AWS re:Invent 2018, and I expect it to gain a lot of traction, especially because of its tight integration with other AWS tools such as Glue and Athena. It reached general availability (GA) last August, so I think many organisations are still not familiar with it.
While I am a more regular user of Google Cloud Platform, I decided to take a quick look and see how much of the administrative burden this new tool can really take off our shoulders, and whether it fulfils the promise of a faster data lake setup.
The Standard Approach to Building a Data Lake
The usual approach to building a Data Lake is to use a “Cloud File Storage” solution such as AWS S3 or Google Cloud Storage and organise your data as a set of folders, potentially across different buckets (top-level containers).
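As a concrete (and entirely hypothetical) example of such a layout, many teams encode the zone, dataset and partition into the object key so that downstream tools can discover partitions from the folder structure. A tiny helper might look like this; the zone/dataset/dt names are my own convention, not an AWS requirement:

```python
# Sketch of a key-naming convention for a data lake on object storage.
# Zone, dataset and partition date are encoded in the object key so the
# "folder" layout itself documents where each file belongs.
from datetime import date

def object_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key like 'raw/events/dt=2019-10-01/part-000.csv'."""
    return f"{zone}/{dataset}/dt={day.isoformat()}/{filename}"

key = object_key("raw", "events", date(2019, 10, 1), "part-000.csv")
print(key)  # raw/events/dt=2019-10-01/part-000.csv
```

The value of a convention like this is that both humans and crawlers can infer dataset and partition from the path alone.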
What happens is that, after just a few days, you need to start keeping track of the different files, formats and folders you are putting together. Otherwise you end up with inconsistencies and overlaps that make it harder for your data team to focus on the insights they should be looking for in the first place.
As with many other problems in technology, you can solve this issue either by creating documentation (a data catalog) somewhere else, an approach I like because you can keep such documentation and the associated code together, say in a git repository (a DevOps way), or by relying on enterprise software to handle it.
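To make the "catalog in git" idea concrete, here is a minimal sketch, with all dataset names and fields invented for illustration: the catalog is a plain data structure living next to the pipeline code, plus a small validation check that could run in CI.

```python
# Minimal "data catalog as code" sketch: a plain dictionary describing
# each dataset, plus a check that every entry carries the metadata the
# team agreed on. All names and fields here are hypothetical.
CATALOG = {
    "raw/events": {"format": "csv", "owner": "analytics", "pii": False},
    "raw/orders": {"format": "parquet", "owner": "sales", "pii": True},
}

REQUIRED_FIELDS = {"format", "owner", "pii"}

def validate(catalog: dict) -> list:
    """Return the names of datasets missing required metadata fields."""
    return [name for name, meta in catalog.items()
            if not REQUIRED_FIELDS <= meta.keys()]

print(validate(CATALOG))  # [] -> every dataset is fully documented
```

Because this lives in the same repository as the ingestion code, a pull request that adds a dataset without documenting it can be rejected automatically.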
I am not going into the details here, but if you google “Data Lake Data Catalog” you will probably find a quite comprehensive list of the major players, such as Cloudera, Databricks, Informatica and so on.
This is not essentially different from the data modelling of the past, when “database consolidation” projects were a thing and you had to design complex databases with hundreds of tables.
It is important to note, though, that in your data lake you may be storing much more than typical relational data (coming from CSV and other tabular data files).
While it is still not a reality everywhere, you can store other forms of data, such as web server logs, and more unconventional ones, such as audio files, pictures and graphs, so you can analyse them together with your transactional records.
I have also seen a different approach to “data lake” architecture, which is to use a “big data warehouse” solution such as BigQuery or AWS Redshift. While these analytics databases can handle relational data quite well, they may not accommodate the other varieties of data I have just mentioned, and they may not scale the way a simpler storage mechanism does (from both a technical and a cost perspective).
How Lake Formation Can Help
There are three areas where I think Lake Formation can help you as a Data Architect or Engineer:
1-) Data Catalog
This is 100% connected to what I explained above. It keeps track of the purpose of each folder in your data lake and provides a very simple and familiar database-to-table (1-to-n) hierarchy. It supports multiple S3 buckets, so it is definitely a great help, especially if you are trying to avoid other tools and keep things simple.
2-) Export (Blueprint)
A common scenario is to start your data lake with the key databases and tables you already have in the relational database world. Once they are in your Data Lake, you don’t need to worry about the performance and other restrictions of a standard RDBMS. Lake Formation Blueprints help with exactly that.
Just like the previous feature, it is more of a management layer on top of AWS Glue, but it still makes sense, and it helps you stay organised and consistent.
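To illustrate the kind of work an incremental-import blueprint automates, here is a hand-rolled sketch: remember a “bookmark” (the highest key already loaded) and pull only newer rows on each run. Plain dictionaries stand in for a real JDBC source table; this is an illustration of the concept, not the Blueprint API.

```python
# Sketch of what an incremental-import blueprint automates: track a
# bookmark (highest id already loaded) and export only newer rows.
# Rows are plain dicts standing in for a real source table.
def incremental_load(rows, bookmark):
    """Return rows newer than the bookmark, plus the updated bookmark."""
    new_rows = [r for r in rows if r["id"] > bookmark]
    new_bookmark = max((r["id"] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
batch, bookmark = incremental_load(table, bookmark=1)
print(len(batch), bookmark)  # 2 3
```

Running it again with the updated bookmark returns an empty batch, which is exactly the idempotent behaviour you want from a scheduled import job.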
3-) Permission (Authorisation) Management
For me this is the real advantage of the tool. In Lake Formation you can define permissions on the data at table and even at column level, and these permissions are applied by other products such as AWS Athena and Glue.
This is a feature usually found only in expensive tools, and here it is being made available virtually for free. This is, again, AWS replicating features of other products on its own platform. Of course, it comes with the assumption that you will be using the other AWS products that can leverage this capability.
What I am still missing
Things I expect to be developed or solved in the next product releases:
1-) More complex data hierarchy support
When you are handling more complex data, or data coming from “non-transactional” sources, you need to run multiple processes on it (deduplication, normalisation and so on). At the end of the day, you will have multiple “versions” (stages) of the same entity/table (e.g. for Events: raw-events, dedup-events, norm-events, …).
On “plain” S3 you can organise this set of related data in a folder structure, but inside AWS Glue or Lake Formation you are limited to a simple Database➡️Table structure. It is not the end of the world, but I am looking forward to seeing how they will support this growing demand as data catalogs keep getting bigger.
I know that Databricks’ Delta Lake has a nice concept for that (details here). Let me know how you deal with this issue.
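To make the stages above concrete, here is a minimal sketch of one of them: turning raw-events into dedup-events by keeping only the latest record per event id. The field names (`id`, `ts`, `type`) are made up for illustration.

```python
# Minimal sketch of one pipeline stage: raw-events -> dedup-events,
# keeping only the latest record per event id. Field names are
# invented for illustration.
def dedup_events(raw_events):
    latest = {}
    for ev in raw_events:
        cur = latest.get(ev["id"])
        if cur is None or ev["ts"] > cur["ts"]:
            latest[ev["id"]] = ev
    return sorted(latest.values(), key=lambda e: e["id"])

raw = [
    {"id": 1, "ts": 10, "type": "click"},
    {"id": 1, "ts": 12, "type": "click"},  # duplicate, newer wins
    {"id": 2, "ts": 11, "type": "view"},
]
print(dedup_events(raw))
```

Each such stage naturally wants its own “folder” under the same entity, which is exactly the hierarchy that a flat Database➡️Table catalog struggles to express.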
2-) (Native) Compatibility with more file formats
Some quite common file types are still not supported out of the box:
- Spreadsheet files (they are still omnipresent).
- Native compressed files (e.g. CSV.zip). I understand that Athena would have problems querying those files, but I would love to have that support at least in Glue.
- Improved support for complex JSON files. There are currently few options for handling these files, even though many tools export them.
- Oracle/SQL Server export/dump files: again, they are used in 99% of enterprises. We cannot always use Blueprints for those; not every database is accessible to AWS Glue, so some will still be integrated via file exports/imports.
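Until zipped files are supported natively, one workaround is a small pre-processing step that unpacks the archive before landing the data in the lake. A self-contained sketch using only the Python standard library (the zip is built in memory just so the example runs on its own):

```python
# Workaround sketch: read a CSV that lives inside a zip archive, using
# only the standard library. In a real pipeline this would run before
# landing the unpacked file in S3.
import csv
import io
import zipfile

def rows_from_zipped_csv(zip_bytes: bytes, member: str):
    """Yield rows (as dicts) from a CSV file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as f:
            text = io.TextIOWrapper(f, encoding="utf-8")
            yield from csv.DictReader(text)

# Build a tiny CSV.zip in memory to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "id,name\n1,alice\n2,bob\n")

rows = list(rows_from_zipped_csv(buf.getvalue(), "data.csv"))
print(rows)
```

Because the archive is streamed member by member, the same approach works for large files without extracting everything to disk first.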
If you are starting a new Data Lake on AWS and trying not to rely on other proprietary tools, Lake Formation is, for me, a no-brainer: an inexpensive tool that can help you on your Big Data/Analytics journey.
I really expect other cloud providers such as Google and Azure to build similar layers in the near future, and I expect AWS to keep improving this over the coming year and launch new features by the next re:Invent at the latest.