What is lakeFS: A Critical Survey

A critical introduction to lakeFS, a new metadata layering solution that brings Git-like operations and versioning to object storage.

Jonathan Merlevede
datamindedbe
12 min read · May 31, 2021

--

Update September 2021: I wrote this story less than half a year ago, and I’m happy to say that some of the limitations highlighted in the second part of this story no longer apply, and most others are addressed by new features on the project’s roadmap. I wrote this story as a “snapshot view” on lakeFS, and although I will not update it, I encourage you to point out out-of-date information in the comments. lakeFS is developing at a rapid pace. If you’re interested be sure to check out its documentation to get an up-to-date view on its present-day capabilities!

At Data Minded, our drive to best serve our clients requires us to stay on top of technological developments in the data space. In this context, we recently surveyed lakeFS, a new “object storage solution” developed by Treeverse that has been making the rounds (they also have an excellent blog). But what does lakeFS do exactly? Does it address real issues our clients face? Is it ready for production use?

This story starts with a primer on lakeFS, introducing its basic setup, ideas and operation. We then provide a list of things we liked about lakeFS, and a list of things we didn’t like and think impede adoption.

LakeFS promises some big and very welcome improvements in how we handle object storage, so let’s dive right in!

A lakeFS primer

LakeFS is a collection of services that implement an S3-compatible storage layer over one or more object storage systems. It’s not just another overlayFS or UnionFS though, because lakeFS brings Git-like data manipulation operations to the table. If this sounds novel or confusing, that’s probably because it is.

I find it easiest to explain what lakeFS does by working through an example, so I’ll do exactly that by showing how you could store a data product “weather data” on an existing location on AWS S3, on path s3://mybucket/clean/weatherdata.

Setup

Start by setting up lakeFS; ensure it has full S3/AWS IAM access to mybucket. If you just want to play around with lakeFS, you can probably also deploy a local installation, but we did not try this. In the remainder of the article, I’ll assume lakeFS is deployed at lakefs.mydomain.

We then create a lakeFS API key, install lakectl and add appropriate configuration to ~/.lakectl.yaml:

$ cat ~/.lakectl.yaml
credentials:
  access_key_id: <mylakefskeyid>
  secret_access_key: <mylakefskeysecret>
server:
  endpoint_url: https://api.lakefs.mydomain/api/v1

I’ll interact with lakeFS data using rclone, but you can use any S3-compatible tool, including the aws s3 CLI. You can configure a lakeFS remote called lakefs in rclone by running rclone config and going through the motions, or by appending the following to ~/.config/rclone/rclone.conf:

[lakefs]
type = s3
provider = Other
env_auth = false
no_check_bucket = true
access_key_id = <mylakefskeyid>
secret_access_key = <mylakefskeysecret>
endpoint = s3.lakefs.mydomain

I also configured the backend S3 system as the backend rclone remote.
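
For completeness, here is what such a backend remote might look like in ~/.config/rclone/rclone.conf. The remote name backend is what the later rclone tree commands refer to; the AWS credentials and region are placeholders of my choosing:

```
[backend]
type = s3
provider = AWS
env_auth = false
access_key_id = <myawskeyid>
secret_access_key = <myawskeysecret>
region = eu-west-1
```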

Note: I added the somewhat obscure no_check_bucket = true to the rclone configuration file. This is required when using recent (or very old) versions of rclone (1.52+). If this flag is not set, rclone will ensure that the target bucket exists by attempting to create it. However, the create bucket command is not implemented by lakeFS, which will cause certain commands to fail with the error

ERROR : Attempt 3/3 failed with 1 errors and: ERRLakeFSNotSupported: This operation is not supported in LakeFS
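
As a quick illustration of the S3 compatibility mentioned above, the same lakeFS credentials also work with the aws CLI; you just point it at the lakeFS endpoint explicitly. A sketch (the profile name lakefs is my own choice):

```shell
$ aws configure set aws_access_key_id <mylakefskeyid> --profile lakefs
$ aws configure set aws_secret_access_key <mylakefskeysecret> --profile lakefs
$ aws s3 ls --profile lakefs --endpoint-url https://s3.lakefs.mydomain
```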

Setting up a repository

Now that we got that out of the way, initialize a repository for our data product:

$ lakectl repo create \
lakefs://weatherdata s3://mybucket/clean/weatherdata

You can also do this through the lakeFS UI at https://api.lakefs.mydomain/. Creating a repo also creates a default branch called main.

Branching

Like in Git, we generally try to avoid working directly on the main branch. So, we’ll create a branch ingest to upload our data on:

$ lakectl branch create lakefs://weatherdata/ingest \
--source lakefs://weatherdata/main

Branches appear as folders / prefixes on the S3 endpoint:

$ rclone lsd lakefs:weatherdata
-1 2021-05-11 14:12:12 -1 main
-1 2021-05-11 14:28:17 -1 ingest
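
Branches can of course also be listed through the lakeFS API instead of the S3 gateway; with lakectl this is a one-liner:

```shell
$ lakectl branch list lakefs://weatherdata
```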

Uploading data

We can then upload some data to the ingest branch:

$ echo test1 > test1.txt && echo test2 > test2.txt
$ rclone copy test1.txt lakefs:weatherdata/ingest/
$ rclone copy test2.txt lakefs:weatherdata/ingest/
$ rclone ls lakefs:weatherdata/ingest/
6 test1.txt
6 test2.txt

The lakeFS endpoint behaves like a normal S3 endpoint, so nothing unexpected to see here. It’s sort of interesting to see what is going on on the underlying object storage though…

$ rclone tree backend:mybucket/clean/weatherdata
/
├── 10419beef5a940cdb0362095e8a1724f
├── 167fb34cd5ac493ea49b84c20ed48f9a
└── dummy

lakeFS is storing the files on the underlying storage as-is, but file names are scrambled. Metadata of uncommitted changes, including the mapping of filenames to their location on the underlying storage system, is stored (only) in lakeFS’s Postgres database.
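
At this point the changes on ingest are still uncommitted. You can inspect them through lakectl, which lists the staged changes on the branch:

```shell
$ lakectl diff lakefs://weatherdata/ingest
```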

Committing data

Once all data is uploaded or when we’re happy with the weatherdata product, we commit it to the ingest branch:

$ lakectl commit lakefs://weatherdata/ingest -m "Omnomnom"
Branch: lakefs://weatherdata/ingest
Commit for branch "ingest" completed.
ID: a9667b669636d200071db933518fe3a81047702d950e4441bd0845397ad0677e
Message: Omnomnom
Timestamp: 2021-05-16 03:53:58 +0200 CEST
Parents: fb4a31835ab3b0a0feb0e9e198ec350e6876d4c3ecfaf45159a80b2b4137184c

Again, it’s interesting to see what’s happening on the underlying S3 object storage too.

$ rclone tree backend:mybucket/clean/weatherdata
/
├── 10419beef5a940cdb0362095e8a1724f
├── 167fb34cd5ac493ea49b84c20ed48f9a
├── _lakefs
│ ├── 00fb7eff1a97c500eb3ca8d2ef5464131d6b6a5f9512...
│ ├── actions
│ │ └── log
│ │ ...
└── dummy

By committing, the metadata that was in the Postgres database has been committed to the backend S3 system inside of the _lakefs prefix. Each of the files with the _lakefs prefix is a Graveler file, which is an “immutable” SSTable with metadata compatible with RocksDB. The name of the file itself is a function of its content (we say the file is “content-addressable”). This, together with clever tiering and clustering of metadata, is exploited to obtain effective delta encoding of commits, keeping commits roughly constant in size as the number of objects in a lakeFS repo grows.
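
Content addressability simply means an object’s name is derived from a hash of its bytes, so identical content always maps to the same name. A minimal stand-alone illustration of the idea (using SHA-256 here; lakeFS’s actual hashing scheme may differ):

```shell
# Two files with identical content hash to the same "address"...
echo "hello" > a.txt
echo "hello" > b.txt
sha256sum a.txt b.txt

# ...while different content yields a different address.
echo "world" > c.txt
sha256sum c.txt
```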

Merging

When listing the data at the main branch, you’ll still see an empty directory:

$ rclone ls lakefs:weatherdata/main/

To get the data from the ingest branch into the main branch, merge it into the main branch:

$ lakectl merge lakefs://weatherdata/ingest lakefs://weatherdata/main
Source: lakefs://weatherdata/ingest
Destination: lakefs://weatherdata/main
Merged "ingest" into "main" to get "67cb84659116ba0780cfbcab....".
Added: 2
Changed: 0
Removed: 0

That’s it! Now the data is visible from the main branch:

$ rclone ls lakefs:weatherdata/main/
6 test1.txt
6 test2.txt

There are two remarkable things about what happens here:

  • The merge operation is atomic/transactional: there’s not a single point in time where someone listing the main branch could see test1.txt but not test2.txt!
  • Merging happens very quickly and does not actually involve a copy of data. Even though you see the files on lakeFS twice, once with prefix weatherdata/main and once under weatherdata/ingest, objects are stored on the backing filesystem only once.
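
A nice consequence of this model is that commit IDs are valid refs on the S3 gateway, just like branch names: you can list or read the repository exactly as it was at any commit. Reusing the commit ID from the commit step above:

```shell
$ rclone ls lakefs:weatherdata/a9667b669636d200071db933518fe3a81047702d950e4441bd0845397ad0677e/
```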

If you want to, you can now remove the ingest branch:

$ lakectl branch delete lakefs://weatherdata/ingest

What we like

Now that we know what lakeFS is all about, let’s take a step back and look at what we like about it, and what we don’t like.

Let’s start with the obvious functional features: atomicity of data operations and the ability to check out older commits. Why are these nice features?

  • Atomicity means we no longer have to implement a system using success files and timestamping to prevent downstream processes from reading partial or invalid data.
  • Committing data before triggering an ETL job and starting from the same commit when re-running can help ensure true idempotency.
  • The ability to run code on data as it was when an exception occurred helps with reproducing bugs.
  • Taking a more philosophical view on things: application runs are generally impacted by their source code/binaries, application configuration, host environment and… data. In data, we generally like runs to be reproducible, and there are already many tools that help us towards this goal (version control, artefact stores, configuration as code). Being able to “fix” the data input allows us to finally achieve truly reproducible application runs!

LakeFS is very flexible, and allows for different branching strategies tailored to your specific needs. You can use different long-lived branches to keep track of your development and production data, release data sets through tags or release branches, create a single repo for your entire data lake or one repository per data product, …. The only limit is your imagination :).

Being able to “fix” the data input allows us to finally achieve truly reproducible application runs!

There’s also things to like about how lakeFS is implemented. We like the fact that it is API-compatible with AWS S3, the level at which it operates (the object level) and the fact that despite the presence of a database and various operational “endpoint” services, all committed data is stored and can be reconstructed only from the underlying object storage.

  • lakeFS’s API-compatibility with S3 allows straightforward use of all SDKs and tools supporting S3 and custom S3 endpoints, including tools like rclone and the AWS CLI.
  • Because lakeFS works at the object level, it works for unstructured data as well as structured data. This is good — if all we cared about was structured data, we would likely be looking at a modern data warehouse solution instead of a lake approach. Although next-gen metadata storage layers like Iceberg, Hudi and Delta Lake also allow for time travel / checking out old “commits”, they only do this for structured (tabular) data.
  • Metadata of uncommitted objects is stored in Postgres, but committing flushes metadata to object storage, which means that the durability of committed data is not impacted by the use of lakeFS. The Postgres server used to store all metadata, but as we understand it this was changed as part of the “lakeFS on the Rocks” design / roadmap. As part of this roadmap, the Postgres server will eventually be completely phased out!

Lastly, the lakeFS tooling seems high-quality: we didn’t encounter any bugs, and we really like the web UI and the stress-testing command. The fact that lakeFS is free and open source is awesome. We also like that it is possible to partially adopt lakeFS, creating repositories for some data products while working on the raw “foundation” object storage for others.

What we don’t like

We’re not here to sell you on lakeFS, and unfortunately there’s also some things that we did not like.

No deletes

One first, pretty huge, limitation that currently exists in lakeFS is that it seems to be impossible to delete data. This can lead to an explosion in costs, and also seems incompatible with e.g. GDPR compliance. Some of this will likely be “fixed” in future releases (garbage collection / vacuuming), but as we see no mention of this on the lakeFS upcoming features page, we don’t expect this to happen soon.

  • Uploading a file creates a file on the foundation object storage system. Even when deleting the file before ever committing it, the file remains present on the foundation system, although no metadata points to it any more.
  • It is impossible to remove commits. If we remove a branch before ever merging it into a live branch, the commit remains part of the metadata (as a “dangling commit”) and can be checked out, although it is no longer visible from the web UI.
  • It is impossible to remove something from the history, or similarly “squash” a bunch of commits.

It appears to be impossible to really delete data.

Unlike the metadata, data files are not content-addressable; there’s no data deduplication. Uploading the same file multiple times will create identical objects on the foundation system, with different scrambled names.

Unlike the metadata, data files are not content-addressable; there’s no data deduplication.

Dataflows and IAM

We see some problems with how data flows when using lakeFS, and how IAM is dealt with. When accessing data through the lakeFS S3 gateway, all data flows through the gateway; it’s the gateway that enforces lakeFS’s own authentication and authorization system. This has many implications:

  • lakeFS requires broad permissions to access the underlying object storage.
  • Authorization needs to be configured on lakeFS. Although lakeFS has comprehensive support for authorization policies (similar to those offered by AWS), this means yet another permissions system to manage.
  • Likely additional network transfer costs.
  • Availability of your storage system is reduced to the intersection of that of the S3 backend service, the lakeFS S3 endpoint and the lakeFS database backend. All components can be deployed in a highly available way, but you will definitely introduce downtime, e.g. for upgrading lakeFS components.

lakeFS puts itself at the center of your data access control, in much the same way as a system like Immuta. However, unlike Immuta, lakeFS is not really well-equipped to be in this position. Although the lakeFS policies are powerful, it lacks features that we consider necessary for deployment in many environments.

  • LakeFS has no support for federated identities; in many organizations, this just won’t fly. We’d really like a Terraform provider too!
  • Auditing may be problematic. lakeFS itself does not seem to have sophisticated auditing / access log capabilities, while the underlying object storage will only see connections from the lakeFS S3 gateway user.

If you want to support systems that do not allow setting custom endpoints (looking at you, Athena!; also see the topic below), then data access still needs to be set up properly on the underlying system as well.

lakeFS is not really well-equipped to be in its position at the center of data access control

Metastore

When using multiple engines like Spark, Presto or Athena, pipelines often rely heavily on the Hive Metastore or compatible services like Glue. By itself, checking out a new branch does not automatically populate the metastore with references to the data on this new branch. Since not all services have support for custom endpoints (*cough* Athena *cough*), it’s also not unambiguously clear whether the paths in the metastore should refer to paths on the foundation system or the lakeFS layer.

lakeFS presents solutions, but these again come with some cognitive overhead and are, at least in the case of Athena, less than ideal.

  • The metadata store does not understand the concept of branches, so you’ll likely have to / want to create a separate database for every branch that you want to be able to query.
  • For services that do not support custom endpoints, lakeFS offers support through the use of Hive Metastore symlinks. Which means that, no, you can’t “simply” use saveAsTable and expect things to work. However, in this configuration users need direct access to and credentials for the underlying storage system: lakeFS policies are bypassed, granular permissions are not supported and you’ll need to manage IAM policies on the foundation system (at the repo level). (Incidentally, you should be able to use the same mechanism to read directly from the underlying storage system using Spark.)
  • As part of the “lakeFS on the Rocks” project, a Hadoop/Spark JVM driver is under development that will only use a lakeFS metadata server endpoint to retrieve metadata, but that will read data directly from the foundation object storage system. We expect similar IAM-related problems here.
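
To make branch data queryable at all, lakectl ships a metastore subcommand that copies a table definition and repoints it at a branch. A sketch (the schema and table names are hypothetical; check lakectl metastore copy --help for the exact flags in your version):

```shell
$ lakectl metastore copy \
    --from-schema weather --from-table measurements \
    --to-branch ingest
```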

Complexity

Moving away from the gateway and IAM side of things, we also think lakeFS introduces a fairly large amount of extra complexity. We’re not talking about operational complexity (although you should also consider it), but about complexity from the user perspective.

The additional steps required to make your data available (the lakectl commands) add non-negligible complexity that most people will not be familiar with, and that may not actually be easier to understand than systems ensuring atomicity through a success file or that store metadata explicitly in DynamoDB. The additional features offered by lakeFS also require some additional technical decisions (e.g. the age-old “monorepo” versus “polyrepo” one — one repo / data product, the entire data lake as a single repo, or something in between?).
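
For contrast, the success-file convention that lakeFS would replace is itself quite simple: a writer only creates a _SUCCESS marker after all part files are written, and readers ignore directories without one. A minimal local sketch of that convention (plain shell, no lakeFS involved):

```shell
# Writer: produce all part files first, create the marker last.
mkdir -p out
echo "part-0 data" > out/part-00000
echo "part-1 data" > out/part-00001
touch out/_SUCCESS

# Reader: only consume directories that carry the marker.
if [ -f out/_SUCCESS ]; then
  cat out/part-*
fi
```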

Hooks

To be fair, this is not really something we don’t like, as it’s not bothering us, but it’s maybe something we don’t understand.

lakeFS has the ability to define pre-commit and pre-merge hooks, but despite the fact that they’re presented as something revolutionary, it’s not clear to us what the added value of these features is for our data pipelines. If there’s a need for schema validation or format validation, you can easily run these checks as part of your pipelines instead of running them asynchronously through lakeFS and webhooks.
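
For context on how hooks are wired up: they are configured declaratively, by committing an action file under the _lakefs_actions/ prefix of the repository, which registers a webhook to run on certain events. A rough sketch (the URL and IDs are hypothetical; the exact schema is described in the lakeFS docs):

```yaml
name: pre-merge format check
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: format_validator
    type: webhook
    properties:
      url: "http://validator.mydomain/webhooks/format"
```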

Conclusion

lakeFS is a cool project that is likely to excite many. lakeFS brings some nice-to-have features to the table, and allows us to work with data in new ways. However, we feel that it also has some weaknesses that, unfortunately, usually outweigh the advantages, at least in the setting we usually operate in. As discussed in the previous section, lakeFS introduces significant complexity for users, complicates access control and makes it difficult or impossible to address security and legal requirements in some environments.

That’s not to say this will remain the case forever. lakeFS is under very active development, and the project is actively working on features that would address some of the issues presented above. lakeFS is on to something and seems to be moving in the right direction with the “lakeFS on the Rocks” project.

While we’ll be keeping our eye on this one, we likely won’t be introducing lakeFS at our clients for now.
