Should You Use LakeFS?

Dimitri Hector
Published in cisco-fpie
3 min read, Nov 10, 2021

Introduction

Navigating the data world can be a challenge, and with that challenge comes the risk of data loss, poor data quality, and generally bad times. The difficulties include locating all of your data, organizing it so that authorized individuals can access it easily, and extracting meaningful insight from it. Each of these problems has its own solutions and, in a perfect world, one encompassing solution. In this article, I will focus on Treeverse’s fun tool LakeFS and what it brings to the table in the data world.

What Is LakeFS?

LakeFS is a data management tool that provides Git-like repositories on top of object storage. Its version control scales to exabytes of data, it can revert changes made to your data, it supports pre-commit and pre-merge hooks, and more. It also integrates with many modern data frameworks, such as Spark, Hive, AWS Athena, and Presto (1).
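
To make the integration point a little more concrete, here is a minimal sketch of how an object is typically addressed when a framework such as Spark talks to LakeFS through its S3-compatible endpoint: the repository takes the place of the bucket, and the branch (or other ref) is the first part of the key. The helper name `lakefs_s3_uri` and the repository, branch, and file names are hypothetical, and the path convention is my reading of the LakeFS documentation, so check it against the version you run.

```python
def lakefs_s3_uri(repo: str, ref: str, path: str, scheme: str = "s3a") -> str:
    """Build an object URI of the form scheme://<repo>/<ref>/<path>.

    Hypothetical helper for illustration only; it is not part of LakeFS.
    Assumes the S3-compatible addressing where the repository acts as the
    bucket and the branch or commit ref is the leading key segment.
    """
    return f"{scheme}://{repo}/{ref}/{path.lstrip('/')}"


# The same file on two branches differs only in the ref segment.
main_uri = lakefs_s3_uri("example-repo", "main", "patients/list.csv")
test_uri = lakefs_s3_uri("example-repo", "experiment", "patients/list.csv")
print(main_uri)
print(test_uri)
```

Because only the ref segment changes, a job can in principle be pointed at an experimental branch without touching the rest of its paths.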

A screenshot of multiple repositories created in LakeFS

Why Use LakeFS?

LakeFS is another tool for organizing your object storage, with the added bonus that you can roll back to a previous version of that storage if something goes horribly wrong. This helps prevent data loss during updates and new additions. Imagine that you are a research biologist specializing in genetics, and someone has added the wrong patient list, or an outdated one, to your data set. It is now your responsibility to go in and figure out which patients are the correct ones for the task at hand. That can be an extremely daunting and time-consuming process, and it is one that LakeFS can expedite.
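
The patient-list scenario can be sketched with a toy, in-memory model of a versioned object store. This is not the LakeFS API: the `ToyRepo` and `Commit` classes, the file path, and the patient names are all made up for illustration, and the "revert" here simply restores an earlier snapshot as a new commit, which may differ from how LakeFS itself implements reverting.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    message: str
    snapshot: Dict[str, str]  # object path -> object content
    parent: Optional["Commit"] = None


@dataclass
class ToyRepo:
    """Toy stand-in for a versioned object store (NOT the LakeFS API)."""
    head: Optional[Commit] = None
    staged: Dict[str, str] = field(default_factory=dict)

    def put(self, path: str, content: str) -> None:
        # Stage a new or changed object; nothing is versioned until commit().
        self.staged[path] = content

    def commit(self, message: str) -> Commit:
        # A commit is the previous snapshot plus the staged changes.
        snapshot = dict(self.head.snapshot) if self.head else {}
        snapshot.update(self.staged)
        self.head = Commit(message, snapshot, self.head)
        self.staged = {}
        return self.head

    def revert_to(self, target: Commit) -> Commit:
        # Restore an earlier snapshot as a new commit, keeping the history.
        message = f"Revert to: {target.message}"
        self.head = Commit(message, dict(target.snapshot), self.head)
        return self.head

    def log(self) -> List[str]:
        # Commit messages, newest first.
        messages, node = [], self.head
        while node is not None:
            messages.append(node.message)
            node = node.parent
        return messages


repo = ToyRepo()
repo.put("patients/list.csv", "alice,bob")
good = repo.commit("Add current patient list")

repo.put("patients/list.csv", "zed")  # someone uploads an outdated list
repo.commit("Upload patient list")

repo.revert_to(good)  # restore the known-good state
print(repo.head.snapshot["patients/list.csv"])
print(repo.log())
```

The point of the sketch is that the bad upload stays visible in the history, while the current state of the data matches the last known-good commit, so nobody has to reconstruct the correct list by hand.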

A screenshot of the commit history for a repository.

Where Does LakeFS Fall Off?

LakeFS has some downsides when it comes to authentication and auditing, but these are a work in progress. Currently, you can only authenticate with the credentials that LakeFS generates for you. There is no way to create your own credentials or to use another system's credentials to access it, which adds one more thing to work around in the user experience. Beyond authentication, there is also a lack of logging that would let an administrator see who has accessed which repository and when. The commit history shows only the commits and when they were created.

Would I Recommend This Tool?

Absolutely! LakeFS has a lot of power behind it, and the community is very responsive. It was very easy to get in touch with the development team whenever my team members or I had a question, and the overall user experience of the tool was very pleasant. This would be an excellent tool for anyone looking for a storage solution for a new project, or anyone searching for a better way to organize their object storage and data sets.

Resources and Notes

Please note that the opinions in this piece are based on trials of LakeFS version 0.52.2.

Give LakeFS a try! The second link below explains how to run LakeFS locally or try it with their Docker Compose setup.

  1. LakeFS Documentation
  2. LakeFS Installation
