Fun with LakeFS and Terraform

Bjorn Olsen
Published in cloudandthings.io
21 min read · Apr 8, 2022

We decided to take LakeFS for a spin and see what we think :)

What is LakeFS

The layman’s explanation is that LakeFS is like “Git for data”.

Git is great for managing small text files, like code, and providing version control, branching and merging capability in a decentralised way. It is great for reviewing and collaborating on changes. But Git doesn’t manage large files efficiently.

Some bespoke large-file solutions exist for Git, but a modern cloud-based data lake environment would still be hard to manage with them. We might have petabytes of data across millions of files. We might need to integrate with data processing tools like Spark, Presto (or AWS Athena), or newer lakehouse tech like Delta Lake or Apache Hudi.

LakeFS provides a git-like ability to manage data. LakeFS can create isolated branches for data. Data changes can be persisted and reviewed before being consumed; perhaps the review would be done via automated data quality / integrity testing of the data. Finally, LakeFS can atomically commit and merge data, without data movement, and operate at data-lake scale.

LakeFS also provides a set of S3 compatible APIs, so data could be managed with an AWS CLI or Python boto3 client, even if your LakeFS data was stored in Azure or GCP rather than on Amazon S3.

There are some tradeoffs involved — sorry if you were hoping for a silver bullet! I will discuss some of the pros and cons here.

Why do we need LakeFS

Cloud data stores offer cost-effective and scalable data storage solutions for supporting a wide variety of use cases, such as data lakes.

Managing data is not without its challenges. Data might be “anything” — structured or unstructured, huge flat files or many small ones. Ultimately, we want to analyse our data and expect it to behave a certain way. But these data stores are not databases, and instead they have rather different characteristics.

For example, there might not be transactional / ACID capabilities. An error during writing may result in data corruption and complete data loss, which can’t be rolled back. The data volumes might be extremely large (in either size or count) and moving or processing the data might be error-prone or expensive.

Modern data lakes and lakehouses solve some of these problems, but are often limited. They might offer ACID for a single table but not over multiple tables at once. They might not allow you to make changes to the data in isolation from all other users, test it for data quality, and then revert the changes if the quality is poor or merge it atomically if it passes. These are features where you might employ a data warehouse — at an additional cost. What if we could add them to our data lake?

Lastly, being in the cloud (or anywhere else) isn’t going to prevent mistakes. A data lake that is business critical would need backup and restore, that is quick and effective. Good luck, if your disaster recovery strategy depends on thoughts and prayers, or identifying and copying millions of S3 objects by hand. I’m not sure which is worse.

Rather, imagine you can create a separate branch and make changes to it without copying any data, and without impacting consumers of your main branch who happily consume from main while you’re busy with updates in your own branch. Your changes could be reviewed if you like, perhaps automatically, before they are atomically merged into the main branch, or discarded.

You could inspect a point-in-time view of your entire dataset, at some previous commit and without needing to “unwind”, restore or move any data. You can ensure that this point-in-time view was not affected by subsequent data processing even if you’ve made changes to historical data. And if you need to restore your main branch to a known-good state, you can do so instantaneously — at data lake scale.

These are some of the features which LakeFS offers.

Demo-ing LakeFS

I tried the options below to play with LakeFS. There are more options listed under the quickstart section.

The docker image was very easy to install and use, and to get a feel for the LakeFS UI. Any data is stored locally in the Docker image storage. You can’t easily assess LakeFS in a cloud environment in this way, but it’s handy for a quick demo.

The free online demo environment was handy, but I ran into some problems running our own scripts. I wanted to be able to see logs and fiddle with things under the hood to see how they worked.

In the end I decided to roll my own demo LakeFS environment, using Terraform IaC. With Terraform we can easily spin up and spin down an entire LakeFS demo environment in our own AWS account, whenever we like. If you notice IP addresses changing in the screenshots, this is why :)

If you would like to play with LakeFS and Terraform, or just take a look at what I did, here is the Github repo.

Our goal was to understand what possible use cases would benefit from LakeFS — what are the benefits and what are the tradeoffs. Most of our use cases would be Spark based and relate to S3 data lakes managing structured data. LakeFS is format agnostic and can also work with unstructured data.

We set out to test the following:

+ Create LakeFS infrastructure and learn how LakeFS works

+ Test reading and writing some structured data with LakeFS using Spark (via AWS Glue).

+ Test using the Glue catalog for accessing data (read and write) via Glue, and Athena

+ Test the LakeFS Python client for creating branches, making commits, etc from a Glue job.

Our LakeFS environment

LakeFS server

We spun up an AWS EC2 instance to host our LakeFS service.

LakeFS EC2 with Terraform

We can use the Terraform templatefile function to render variables into a cloud-init script. The cloud-init script is executed automatically on the EC2 instance and it creates a LakeFS user, writes a LakeFS config.yaml, downloads the LakeFS binaries, and starts LakeFS as a service. Below is a short snippet.

Using cloud-init to configure LakeFS — snippet

One “gotcha” is that we had to inject a Postgres connection string including credentials into the cloud-init template, to get the RDS connection to work. We haven’t tested with IAM role-based authentication to RDS (see: Future Work).

In production we could spin up multiple such instances and have a load balancer in front of them, but a single instance is fine for testing.

Our EC2 server for LakeFS in action

LakeFS database

Our EC2 instance was backed by an AWS PostgreSQL RDS instance which LakeFS uses to manage various metadata about the LakeFS filesystem state.

At first I tried using an on-instance PostgreSQL service but ran into an issue that could probably be fixed. Then I realised RDS was better for this demo, because we can easily snapshot it after manually setting up the LakeFS admin user and repos, and restore from there next time.

An RDS for LakeFS

LakeFS setup

After our server and RDS were created and configured via Terraform, I set up LakeFS by manually creating a LakeFS admin user using the LakeFS UI — a once-off step. This isn’t an actual IAM user, but provides an IAM-like access key and secret key. LakeFS clients need this information to connect to LakeFS.

LakeFS user credentials

I didn’t configure federated access to LakeFS, but it seems at this time only LDAP federation is supported. It would be great to have AWS IAM role-based integration.

I then manually recorded the LakeFS admin user credentials in AWS Secrets Manager, so we can retrieve them easily from AWS Glue Spark jobs. We created an AWS IAM user with full S3 access, and stored its credentials in secrets manager. This is used in some of the Glue scripts, which need to talk to S3 directly.

AWS Secrets Manager configuration
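
For reference, retrieving such a secret from inside a Glue job only takes a few lines of boto3. The secret names below are placeholders rather than the exact names used in the repo:

    import json
    import boto3

    def get_secret(secret_name, region_name="eu-west-1"):
        """Fetch a JSON secret from AWS Secrets Manager and return it as a dict."""
        client = boto3.client("secretsmanager", region_name=region_name)
        response = client.get_secret_value(SecretId=secret_name)
        return json.loads(response["SecretString"])

    # Hypothetical secret names: one holds the LakeFS access/secret key pair,
    # the other holds the IAM user's AWS credentials for direct S3 access.
    lakefs_creds = get_secret("lakefs/admin-user")
    aws_creds = get_secret("lakefs/s3-iam-user")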

We’re nearly done. Just log in to the LakeFS UI and create a sample LakeFS repo, main branch, and configure the S3 bucket for storing our data.

I also took some snapshots of the RDS instance. Then I can safely run terraform destroy, and the next time we terraform apply, I can restore from the snapshot. This preserves our LakeFS credentials and configuration, plus our initial repo and branch (if we choose that snapshot).

RDS snapshots to choose from

Glue jobs

We wrote some Glue jobs for evaluating how LakeFS behaves when reading and writing data under different scenarios. Common functionality was placed in utils.py and zipped with each Glue job automatically by Terraform.

1. sample_data_s3.py

This generates random sample data and creates a Glue table, with the data written to S3. LakeFS is not used at all. This script simply tests that the basics work in our environment, and is used as a baseline for comparison.

Create sample dataframe using our utils
Define a normal Glue table over S3
Write data using the Glue APIs
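
The actual script is in the repo; a minimal sketch of the same flow looks roughly like the below. The bucket, database name and the utils helper name are assumptions for illustration:

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    import utils  # shared helpers, zipped alongside the job by Terraform

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Generate a small dataframe of random rows (hypothetical helper name).
    df = utils.create_sample_dataframe(spark, num_rows=100)

    # Write to S3 and create/update a plain Glue catalog table as we go.
    sink = glue_context.getSink(
        connection_type="s3",
        path="s3://our-demo-bucket/data/sample_data_s3/",  # placeholder bucket
        enableUpdateCatalog=True,
        partitionKeys=["yyyy_mm_dd", "hh_mm"],
    )
    sink.setFormat("glueparquet")
    sink.setCatalogInfo(catalogDatabase="lakefs_demo", catalogTableName="sample_data_s3")
    sink.writeFrame(DynamicFrame.fromDF(df, glue_context, "sample_data"))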

Really, not much to see here folks. Moving on.

2. sample_data_s3a_write.py and sample_data_s3a_read.py

The write script generates sample data and writes it to S3 via the LakeFS S3 gateway using the Spark dataframe API. The below code looks pretty bog standard apart from the use of the s3a:// protocol.

Writing data via the S3 gateway using s3a:// protocol
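
In sketch form, using the repo (main-repo) and branch (main) from our demo, the write is nothing more than this (partitioning omitted for brevity):

    # df is the sample dataframe generated by our utils helper, as in the baseline job.
    # utils.configure_s3a(spark, ...) must run first -- see the next snippet.
    output_path = "s3a://main-repo/main/data/sample_data_s3a/"

    df.write.mode("overwrite").parquet(output_path)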

Note we have some pre-configuration to do, which is encapsulated in utils.configure_s3a. This is where we retrieve and use the AWS Secrets Manager secret which we created previously.

S3 gateway configuration
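
Roughly, configure_s3a points the s3a filesystem at the LakeFS S3 gateway instead of AWS itself, using the LakeFS key pair we stored in Secrets Manager. A sketch (the endpoint is a placeholder, and the same settings could equally be passed as spark.hadoop.* Spark conf):

    def configure_s3a(spark, lakefs_endpoint, lakefs_access_key, lakefs_secret_key):
        """Point s3a:// at the LakeFS S3 gateway rather than at AWS S3."""
        hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
        hadoop_conf.set("fs.s3a.endpoint", lakefs_endpoint)      # e.g. http://<lakefs-host>:8000
        hadoop_conf.set("fs.s3a.access.key", lakefs_access_key)  # LakeFS access key, not an AWS key
        hadoop_conf.set("fs.s3a.secret.key", lakefs_secret_key)  # LakeFS secret key
        hadoop_conf.set("fs.s3a.path.style.access", "true")      # the LakeFS repo acts as the "bucket"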

The LakeFS S3 gateway is an endpoint running on our LakeFS server, that partially implements the S3 API. The S3 gateway authenticates our Spark session using LakeFS user credentials. Data is written to S3 using the Storage Adapter, and metadata managed via Graveler.

LakeFS server architecture

From our Spark client it is as simple as using the usual dataframe API to read or write data — df.read or df.write as shown above. In short, the Spark dataframe API sends our generated data to the LakeFS server S3 gateway, which writes it to S3. Neat.

Finally the write script creates a Glue catalog table over the data, which looks as you might expect.

Creating a Glue table over our data
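
A sketch of that table creation, assuming the Glue Data Catalog is configured as the job's Hive metastore. The database name and column schema here are placeholders; lakefs_s3a_table is the table that later shows up in Athena:

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS lakefs_demo.lakefs_s3a_table (
            id BIGINT,
            value STRING
        )
        STORED AS PARQUET
        LOCATION 's3a://main-repo/main/data/sample_data_s3a/'
    """)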

The read script runs as a separate Glue job using the S3 gateway running on our LakeFS server. It also uses the Spark dataframe API to read data, and then queries the same data via the Glue catalog, and compares the result. An exception is raised if the row count is not positive or does not match.

It’s pretty straightforward. Note the use of the s3a:// protocol again.

sample_data_s3a_read.py using s3a:// protocol
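
The core of the read-and-reconcile logic is a sketch like this (database and table names are the same placeholders as above):

    # Read the same data two ways and make sure they agree.
    df_direct = spark.read.parquet("s3a://main-repo/main/data/sample_data_s3a/")
    df_catalog = spark.sql("SELECT * FROM lakefs_demo.lakefs_s3a_table")

    direct_count = df_direct.count()
    catalog_count = df_catalog.count()

    # Fail the job if nothing was read, or if the two counts don't reconcile.
    if direct_count <= 0 or direct_count != catalog_count:
        raise Exception(
            f"Row count mismatch: direct={direct_count}, catalog={catalog_count}"
        )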

3. sample_data_lakefs_write.py and sample_data_lakefs_read.py

The write script generates sample data and writes it to S3 in the LakeFS filesystem format. This allows Glue to talk to S3 directly when writing data, and only sends metadata commands via the LakeFS server. This means we could scale to handle much larger volumes of data than we could before with the S3 gateway service running on EC2.

For this to work, we have to additionally provide an assembly JAR file to Glue so that it can read and write the LakeFS filesystem format.

Glue job details with hadoop-lakefs-assembly JAR

As before, the write script uses the Spark dataframe API to write the data and then create an external Glue table over the data. The only changes are a different configuration call (utils.configure_lakefs) and the use of the lakefs:// protocol instead of s3a://.

LakeFS write script using lakefs:// protocol

Here is what the configuration looks like in utils.configure_lakefs. We use our IAM user which has full S3 access for the s3a settings, and for the lakefs settings we use the LakeFS credentials and API endpoint (rather than the S3 gateway endpoint). We also specify the io.lakefs.LakeFSFileSystem class for writing data to S3.

Configuration for using the lakefs:// protocol
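
In outline, the configuration follows the LakeFS Hadoop filesystem docs; the endpoint and credentials below are placeholders:

    def configure_lakefs(spark, lakefs_api_endpoint, lakefs_access_key, lakefs_secret_key,
                         aws_access_key, aws_secret_key):
        """Use the LakeFS Hadoop filesystem: data goes straight to S3, metadata via the LakeFS API."""
        hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

        # s3a talks directly to S3 here, so it needs real AWS credentials (our IAM user).
        hadoop_conf.set("fs.s3a.access.key", aws_access_key)
        hadoop_conf.set("fs.s3a.secret.key", aws_secret_key)

        # lakefs:// is handled by the LakeFSFileSystem class from the hadoop-lakefs-assembly JAR.
        hadoop_conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
        hadoop_conf.set("fs.lakefs.access.key", lakefs_access_key)
        hadoop_conf.set("fs.lakefs.secret.key", lakefs_secret_key)
        hadoop_conf.set("fs.lakefs.endpoint", lakefs_api_endpoint)  # e.g. http://<lakefs-host>:8000/api/v1

    # After configuring, the read/write code is unchanged apart from the protocol, e.g.
    # df.write.mode("overwrite").parquet("lakefs://main-repo/main/data/sample_data_lakefs/")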

The read script is similar to before. This time we read the data using the LakeFS filesystem, both via the Spark dataframe API and the Glue catalog table, and compare the results.

LakeFS read script using lakefs:// protocol

4. sample_data_lakefs_with_client.py

This script is mainly to test the Python client for LakeFS — it is used to create a new branch, write some data to the branch, and commit the data.

I found the Python client documentation to be really simple to use, thanks to the use of OpenAPI generator (which is awesome if you’ve never seen it before). It was easy to browse the API docs and copy-paste the sample code I needed.

Configure the LakeFS Python client
Create a branch. Most of this was copy-pasted.

After our branch is created, we can generate and write data to it as we did before.

Writing data to the branch
Committing data is pretty straightforward.
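
Stitched together, the client usage looks roughly like this. The host, credentials and commit message are placeholders; the calls follow the generated lakefs_client documentation at the time of writing:

    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient

    configuration = lakefs_client.Configuration()
    configuration.host = "http://<lakefs-host>:8000"
    configuration.username = "<lakefs-access-key>"
    configuration.password = "<lakefs-secret-key>"
    client = LakeFSClient(configuration)

    # Create a branch off main, named after the run date.
    branch_name = "branch_2022-03-25"
    client.branches.create_branch(
        repository="main-repo",
        branch_creation=models.BranchCreation(name=branch_name, source="main"),
    )

    # ... write data to lakefs://main-repo/<branch_name>/... as before ...

    # Commit whatever is staged on the branch.
    client.commits.commit(
        repository="main-repo",
        branch=branch_name,
        commit_creation=models.CommitCreation(message="Add sample data from Glue job"),
    )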

This was pretty easy, though of course I later found an even easier way (using less code) in the LakeFS documentation. Perhaps someone can try it in future :)

That’s it for our LakeFS test environment. On to the results!

Results

Running the Glue jobs

Terraform automatically creates one Glue job per Spark script in our repo, making it easy to test each script:

Our Glue jobs

First, let’s make sure our non-LakeFS environment works by running sample_data_s3.py. Below is the output — our generated data was written to S3. The data on S3 is partitioned by Hive-style partition columns, yyyy_mm_dd and hh_mm. Other services could understand and consume this data directly or via a Glue catalog table.

Here is the Glue catalog entry, and an Athena query showing the data is indeed there.

So far, so good.

Using the S3 Gateway to interact with LakeFS

Let’s run sample_data_s3a_write.py to test what happens when writing data via the LakeFS S3 gateway running on our EC2 instance. Here is the output on S3.

The output path provided to Spark was: s3a://{repo_name}/{branch_name}/data/sample_data_s3a/

In the S3 output we can begin to see the LakeFS filesystem format. Our data was written to a different physical path on S3, namely “lakefs/” as a bunch of extension-less objects rather than Parquet files. It isn’t immediately clear how the data is partitioned or what the format of the data files is.

This is the first “gotcha” with LakeFS — a consuming application would need to interpret this output format, or we’d need to export the data from LakeFS into the usual S3 structure for it to be consumed, requiring data movement.

On the LakeFS UI, we can see our output data written as Uncommitted changes to the main branch. It was written to main_workspace and can be either committed or reverted.

Somewhere in the LakeFS documentation we saw a statement that uncommitted changes are stored in the PostgreSQL database — which is concerning given the potential size or structure of uncommitted data. However, we just saw our output land on S3, so it seems the documentation is outdated. We’ve logged an issue and will update once it’s fixed up.

An interesting thing happens if we now run sample_data_s3a_read.py — an entirely separate Glue Job. We are able to read data from the main branch even though it is not committed. This is a somewhat unintuitive result. It indicates that when using LakeFS, there is a risk that the main branch could be modified unintentionally by any application, and then read by any other application.

Luckily, LakeFS has a feature for branch protection which I have not tested but sounds exactly like what we need.

This seems to be critical to avoid accidents. Perhaps it could be made more obvious to LakeFS users when creating a repo, for example by providing a checkbox to protect the main branch from direct writes from the start.

Let’s review the Glue catalog table we created while using sample_data_s3a_write.py

The location of the table matches the output path we provided to Spark — i.e. not a real physical bucket and prefix, but a LakeFS path containing our repo (main-repo), branch (main) and our prefix (data/sample_data_s3a). This might lead to some confusion if our catalog users aren’t familiar with LakeFS, but this is a minor issue.

The read job sample_data_s3a_read.py succeeds: both read attempts reconcile, with 100 rows read, indicating that we can read the data both via the Spark dataframe API and via a Glue catalog table query.

So we conclude that within the Glue and Spark ecosystem, LakeFS is usable. It will likely work with other Spark-based clients like EMR or Sagemaker without much hassle.

Let’s step outside that eco-system for a moment and look at Athena.

If we query the data via Athena, we immediately hit an issue. At this point, we have not done any configuration for Athena, so it is understandably unable to interact with LakeFS. Strangely, Athena has our lakefs_s3a_table in the list of tables (even though it is currently unusable).

I didn’t spend more time here, but it seems possible to get Athena to talk to LakeFS using symlinks. I’m unsure of the performance implications of this approach, or how simple / difficult it would be to maintain the symlinks as data is changed after each job run.

Lastly, if we rerun sample_data_s3a_write.py again while there are uncommitted changes in our branch, we find the following error. I guess this could be LakeFS’s way of protecting the uncommitted changes from being overwritten.

While on the topic of Athena, it is already apparent from our tests that LakeFS uses a different physical format than regular Spark does. This means that clients of our environment would need to understand this format — we should ensure that all the client applications we want to use are listed on the LakeFS integrations list.

Notably absent from the list is Redshift / Redshift Spectrum. This may be a dealbreaker for those who wish to analyse huge data lakes directly — i.e. without first having to move the data, e.g. using Spark or a LakeFS export.

Using the LakeFS filesystem format to interact with LakeFS

Now let’s run sample_data_lakefs_write.py and sample_data_lakefs_read.py . These interact with S3 directly, rather than using the S3 gateway service running on our LakeFS instance/s.

After running the jobs, we have an even more unusual-looking physical path on S3; it is different to before :)

On the LakeFS UI we see our data as expected, in the main branch as uncommitted changes.

And we are able to read it from the separate reader job, via the Spark dataframe API and the Glue catalog table.

On that note, our Glue catalog table looks like this. Note the lakefs:// protocol, making it more obvious to consumers that this isn’t an S3 path at all.

In Athena, our table doesn’t even show up as an option to query. It is unclear if the symlink approach would fix this, and I saw no mention of Athena being able to use the hadoop-lakefs-assembly JAR which Spark needed for the lakefs:// protocol. This may mean the only option when using Athena is to use the S3 gateway on our LakeFS server/s, which unfortunately implies reading all the data via our server/s rather than directly from S3.

Using the LakeFS API client to manage branches and commits

Our final test was to simulate more Git-like functionality: Create a branch off of our main branch and commit to it. We could expand to do other API actions, like merging, in future.

After running sample_data_lakefs_with_client, we can see in the LakeFS UI that we were able to create a branch (branch_2022-03-25…), write data using one of the previous methods, and finally commit it, without issue.

At that point, there was plenty more to look at but I was happy with what I’d achieved with LakeFS.

Other thoughts

GDPR and PII data

For PII or other sensitive data, we might require physical or logical deletion of specific records. LakeFS fundamentally preserves data from previous commits, so there is a challenge to do this on a per-request basis. This challenge is compounded if PII can be constructed from compound fields or across tables, rather than from specific columns. To solve some of these problems, organisations might try to isolate all PII data to specific tables, which might need to be handled separately to non-PII data.

This is not a challenge unique to LakeFS: for example, what do organisations do when it comes to deleting personal data from backups, or any number of other places where such data is used? At some point, regulations and practical implementations need to meet :)

There is fortunately an awesome document regarding GDPR best practices with LakeFS. In summary, the options are:

  • Decide that PII data should not be in your main branch. Use LakeFS git-like features to create a separate ingest branch, where PII data can be removed (e.g. through obfuscation) and validated, before being merged to the main branch. The branch could be garbage collected automatically. One concern: is it possible to restrict access to the ingest branch?
  • Decide to support the right to be forgotten, and lose reproducibility. Honour the request to delete data by simply deleting all versions of the dataset. We then can’t get a full picture of the data as it was at a previous point in time.
  • Decide to support the right to be forgotten, but keep most reproducibility. If there is a low volume of PII deletion requests, we could re-process all past versions of the dataset and remove PII records from them, effectively re-writing all versions of the dataset. This would be process- and cost-intensive as a tradeoff.

Further reading

This excellent LakeFS page explains in a nutshell some data-lake related problems and how LakeFS can help to solve them.

We’d recommend reading this Medium post which, while now outdated, offers some more insight into the workings of LakeFS. Some key changes since then:

  • Introduction of the lakefs:// protocol to access the S3 data directly rather than requiring the S3 gateway.
  • Introduction of garbage collection for automatically cleaning old data, branches etc.

The following excellent posts show some examples and tradeoffs of using LakeFS to solve real-world problems:

Conclusion

What we did

We managed to get LakeFS to work with Spark jobs on Glue. Overall it was easy to set up and worked really well. Throughout this post we also highlighted some of the gotchas.

Below are a few opinions. Let us know what you think!

Benefits

  • LakeFS works well with Spark. If your consuming applications are Spark-based, LakeFS could be a good fit. Especially in AWS. For example, loading an AWS RDS instance using Glue. Or processing the data using EMR or Sagemaker. You could also use the LakeFS S3 gateway to access files using the S3-compatible API which is supported by a variety of clients including the AWS CLI.
  • The git-like features (creating branches, committing data, etc) are really easy to use via the API and associated client, at least in the case of Python. This would make automation a breeze. The LakeFS UI is also clear and easy to use.
  • There is a strong use case for change isolation — being able to make independent changes to data, verify them using some automated tests, and then atomically merge or roll back the data without impacting anyone reading from your main set of data.
  • LakeFS could be used to provide a git-like experience for managing S3 data that users can edit directly — a sandbox in production. This could be coupled with automated data testing using LakeFS actions, prior to merging the data. And of course there is atomic rollback.
  • When it comes to managing unstructured or semi-structured data, LakeFS can offer ACID capabilities that are normally possible only for structured data in a lake-house or database. This makes it attractive for these use cases.
  • The ability to instantly roll back to any point in time is handy from a disaster recovery / recovery time objective point of view. It’s a lot more attractive than trying to manage rollbacks using S3 object versioning (ouch) or deep-copying the data to/from a backup location.
  • If you would like to update multiple S3 datasets as part of one transaction (e.g. two or more “tables” that you roll back or commit together), LakeFS is a good fit; see the sketch after this list. Current lakehouse technology (e.g. DeltaLake, Hudi, Iceberg) provides ACID capability on a single dataset / table at a time, but I haven’t found tools that can do this over multiple tables apart from LakeFS, or of course a dedicated database of some sort.
  • When investigating production issues, we often want them to be reproducible. Code is relatively easy to roll back to a previous version, but with data things are more challenging. You should always have a data backup, but this might take a while to restore. LakeFS keeps point-in-time views of the data, so it is possible to immediately see what the data looked like at a previous commit. This is not without cost — you pay to keep the data copies around.
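
As a rough sketch of how a multi-table transaction could look with the Python client — the repo is ours, but the branch name, table paths, credentials and message are hypothetical, and the merge call follows the lakefs_client docs:

    import lakefs_client
    from lakefs_client import models
    from lakefs_client.client import LakeFSClient

    configuration = lakefs_client.Configuration()
    configuration.host = "http://<lakefs-host>:8000"
    configuration.username = "<lakefs-access-key>"
    configuration.password = "<lakefs-secret-key>"
    client = LakeFSClient(configuration)

    # Stage both tables on an isolated branch...
    client.branches.create_branch(
        repository="main-repo",
        branch_creation=models.BranchCreation(name="load_2022_04_08", source="main"),
    )
    # ... write lakefs://main-repo/load_2022_04_08/data/table_a/ and .../table_b/ with Spark ...

    # ...then commit and merge both tables back to main as a single atomic unit.
    client.commits.commit(
        repository="main-repo",
        branch="load_2022_04_08",
        commit_creation=models.CommitCreation(message="Load table_a and table_b together"),
    )
    client.refs.merge_into_branch(
        repository="main-repo",
        source_ref="load_2022_04_08",
        destination_branch="main",
    )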

It’s worth highlighting that LakeFS isn’t an all-or-nothing option, either. You can quite comfortably have both LakeFS-enabled areas and non LakeFS-enabled in your data lake, running side by side.

Limitations

  • While LakeFS seems to work well in the Spark world, and supports other integrations also, it is not widely supported yet. For example, say you want to use AWS DMS, or Redshift Spectrum to read structured data on S3. It seems that these tools can’t be configured to read from custom filestores, so you might need to export data from LakeFS first. Perhaps if the data is catalogued and the S3 gateway is used, then this may work — but we haven’t tested it. So while LakeFS might work for current use cases that don’t need these tools, you might need a different approach in future.
  • Cloud migrations are challenging even with provider-native tools. Getting things to work well together is sometimes not as straightforward as it seems from the documentation. Adding LakeFS to the mix, off the bat, might make a migration even more challenging. Especially if engineering skills are in short supply.
  • If you’re using lakehouse tech and only need to update 1 data lake table at a time (a common pattern) then there is reduced benefit to adding LakeFS to the mix compared to the extra complexity, as lakehouse tech would give you ACID on a single table. However, there are other benefits with LakeFS that make it attractive even so.
  • The LakeFS S3 gateway is handy, but it means that your data is transferred via the LakeFS server/s. If this is your primary use case, you could auto-scale a set of LakeFS instances, but you might need to consider exporting the data if you have many consumers and are concerned about scale. Exporting data via the S3 gateway would have the same limitation, so you might prefer to run separate export Spark jobs that use the lakefs:// protocol, at additional cost.
  • LakeFS authentication options seem limited at the moment. For example, in AWS federated login and assuming an AWS role is a common pattern. The AWS role might normally restrict access to S3. It would be great if LakeFS could support AWS role-based authentication and any S3 authorisation, and not rely on a custom “admin user” credential set.

Future work

  • We’re really excited to see where LakeFS goes. When more integrations are supported and existing ones mature, and as federated access becomes possible, I’m sure we’ll see wider adoption of LakeFS.
  • We have no idea how data merge conflicts look, or how they might be resolved in practice. Something to test next time :)
  • IAM role-based authentication to RDS, and to S3 for that matter. It’s not evident in the LakeFS documentation whether this can be done, and whether the temporary credentials are automatically refreshed. Currently it seems a case of providing an access key and secret upfront, rather than supplying an IAM role.
  • We didn’t do testing with KMS encrypted S3 objects but this could be added to our demo repo, should anyone wish to try. It probably works without issue.

Thanks

One key ingredient in this post was support from the LakeFS community. I had a couple of questions and posted them on Slack, and consistently got a response within just a few minutes. Kudos!
