CodeX
Published in

CodeX

Git-Like Data Set Versioning using LakeFS and Ceph S3

Data has become the core business of every modern organization today, whether it’s for decision making, reporting, predicting, analyzing and many other use cases that don’t fall short. Eventually, Data, In its final form will be presented to stakeholders in order for them to understand what happens in the organization, whether it’s internally or externally.

If we take AI as an example, We see that a lot of the assumptions being made by AI are the direct results of the data that the algorithm uses. AI is using the datasets that are provided by Data Engineers in order to train models at scale, And those models will provide business recommendations and predictions that will be used by the stakeholders.

You can surely understand that due to the presented above, The integrity of data is an important thing, It should be reliable enough for the AI model to base its theories on.

We see that AI running on Kubernetes is not a joke, and a lot of workloads are being migrated towards Kubernetes as it treats AI models as they were microservices that are going through a CI/CD pipeline (using tools like MLOps, Kubeflow, etc). The source code is being saved into git repositories, gets packaged into a container image, and eventually being deployed into Kubernetes as part of the data pipeline (using GitOps, CD, or any other deployment strategy).

This empowers the control we have significantly, As we can promote or roll back specific versions of our AI models we have when problems occur (inaccuracy for example).

More than that, We can now even control our dataset versions and treat it as if it was a git repository. By doing that, we control the versions that we hold as our data sets. If for example a new data source was added and should be written to the data lake, we can branch the current dataset, Verify that the new data source isn’t harmful (by running unit/integration tests), and then merge those changes back to the origin and provide the tested data to our scientists or BI engineers.

Today, We’ll see how we could bring this thing to life using LakeFS in order to version our data lake as it was a git repository. We'll use Ceph S3 as a data lake that will be directly attached to LakeFS as an S3 block store.

Prerequisites

  • A Ceph cluster that exposes a RadosgW S3 interface
  • Docker-compose installed on your computer

Installation

Let’s create a radosgw user in order to interact with Ceph’s S3 interface:

radosgw-admin user create --uid=lakefs --display-name="Lakefs User" --access-key=lakefs --secret-key=lakefs

Next, we’ll save Ceph’s S3 configuration in Lakefs’s config file, to make sure Lakefs uses Ceph as its S3 block store:

LAKEFS_CONFIG_FILE=./.lakefs-env
echo "AWS_ACCESS_KEY_ID=lakefs" > $LAKEFS_CONFIG_FILE
echo "AWS_SECRET_ACCESS_KEY=lakefs" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://192.168.1.53:8080" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_TYPE=s3" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true" >> $LAKEFS_CONFIG_FILE

We’ll deploy Lakefs using docker-compose, Which will deploy both Lakefs and Postgres to save data and metadata persistency:

curl https://compose.lakefs.io | docker-compose --env-file $LAKEFS_CONFIG_FILE -f - up

Now that we have the LAkefs stack up and running, Let’s create our S3 bucket, Which will be used by Lakefs to store versioned objects:

aws s3 mb s3://lakefs/ --endpoint-url http://192.168.1.53:8080
make_bucket: lakefs

We now have all the infrastructure requirements ready, And we can create the actual repository:

lakectl repo create lakefs://repo s3://lakefs -d mainRepository: lakefs://repo
Repository 'repo' created:
storage namespace: s3://lakefs
default branch: main
timestamp: 1625919817

As you see, this repository actually points to the created S3 bucket that was given as part of the S3 context in the configuration/ Let’s make sure we can query that repository:

lakectl repo list+------------+-------------------------------+------------------+-------------------+
| REPOSITORY | CREATION DATE | DEFAULT REF NAME | STORAGE NAMESPACE |
+------------+-------------------------------+------------------+-------------------+
| repo | 2021-07-10 15:23:37 +0300 IDT | main | s3://lakefs |
+------------+-------------------------------+------------------+-------------------+

Now, let’s create a simple text file with a string within it, So that we could upload it to LakeFS:

echo "My name is Shon" >> txtfile.txt

Now we can upload the text file from the source given to our target repository, in the main branch:

lakectl fs upload -s txtfile.txt lakefs://repo/main/txtfile.txtPath: txtfile.txt
Modified Time: 2021-07-10 15:24:35 +0300 IDT
Size: 20 bytes
Human Size: 20 B
Physical Address: s3://lakefs/84935fb66e4947338cb934d6318302db
Checksum: ad44412ae72bc202dcb12891111a1864

As you see, LakeFS returns the physical address of the S3 object itself, which means we could access it as any other S3 object with the given credentials. Now let’s read the object’s content to verify that it suits the content that we put in it:

lakectl fs cat lakefs://repo/main/txtfile.txtMy name is Shon 

Like any other git repository, Once we have the files we want, we can commit our changes to the current branch we’re in, So we’ll commit this file to the ‘main’ branch:

lakectl commit -m "added one version of the txt file" lakefs://repo/mainBranch: lakefs://repo/main
Commit for branch "main" completed.
ID: 3a08e3da303b4b722d426cff458264b391c0209dacf9280b6bf5d73d0c92ff07
Message: added one version of the txt file
Timestamp: 2021-07-10 15:25:27 +0300 IDT
Parents: 7263176783ca345a7782d36ec14205b78a0f92b4956a4e93a0e704af32353052

Now that we have a stable version of our file located in Lakefs, Let’s play with it a bit to create a second version: We’ll add a line to the text file to create a second version of it in the next steps:

echo "And my last name is Paz" >> txtfile.txt

Now we’ll create a new branch, used for the second version of the text file:

lakectl branch create lakefs://repo/txtfile_change -s lakefs://repo/mainSource ref: lakefs://repo/main
created branch 'txtfile_change' 3a08e3da303b4b722d426cff458264b391c0209dacf9280b6bf5d73d0c92ff07

Let’s make sure that the second branch was created:

lakectl branch list lakefs://repo+----------------+------------------------------------------------------------------+
| BRANCH | COMMIT ID |
+----------------+------------------------------------------------------------------+
| main | 3a08e3da303b4b722d426cff458264b391c0209dacf9280b6bf5d73d0c92ff07 |
| txtfile_change | 3a08e3da303b4b722d426cff458264b391c0209dacf9280b6bf5d73d0c92ff07 |
+----------------+------------------------------------------------------------------+

Great! we now have two branches under our Lakefs repository. Let’s upload the updated file into our new branch:

lakectl fs upload -s txtfile.txt lakefs://repo/txtfile_change/txtfile.txtPath: txtfile.txt
Modified Time: 2021-07-10 15:26:57 +0300 IDT
Size: 44 bytes
Human Size: 44 B
Physical Address: s3://lakefs/cb9f5d4a125b421ea819c32a71b77e0a
Checksum: b25434f55350323f5f543ee0b92939bf

Next, We’ll read the updated file’s content and verify that we have the updated line:

lakectl fs cat lakefs://repo/txtfile_change/txtfile.txtMy name is Shon 
And my last name is Paz

Awesome! Now we’ll commit the second change into our newly created branch:

lakectl commit -m "added second version of the txt file" lakefs://repo/txtfile_changeBranch: lakefs://repo/txtfile_change
Commit for branch "txtfile_change" completed.
ID: c25f69534a85da7b1afcc38ef293d57724c1e14a79e24e23aa7c8cdbab74336f
Message: added second version of the txt file
Timestamp: 2021-07-10 15:28:11 +0300 IDT
Parents: 3a08e3da303b4b722d426cff458264b391c0209dacf9280b6bf5d73d0c92ff07

Like in every git repository we have, Once we’ve validated our data set we can merge the changes! Let’s merge the updated file from the newly created branch (txtfile_change) into our main branch:

lakectl merge lakefs://repo/txtfile_change lakefs://repo/mainSource: lakefs://repo/txtfile_change
Destination: lakefs://repo/main
Merged "txtfile_change" into "main" to get "a8b251a43e2e306e5dc170c05d0e83e289a0b162ff272d3a660c644e709b96f0".
Added: 0
Changed: 1
Removed: 0

We see that we get informed that the change has been merged and that one file was changed. Let’s re-read the file from the main branch to verify that those changes were indeed merged:

lakectl fs cat lakefs://repo/main/txtfile.txtMy name is Shon 
And my last name is Paz

Conclusion

We saw how we could use LakeFS’s ability to treat Ceph’s S3 block store as it was a git repository. We’ve talked about the benefits this kind of ability could provide to our overall core business. Hope you have enjoyed this demo, See ya next time :)

--

--

--

Everything connected with Tech & Code. Follow to join our 900K+ monthly readers

Recommended from Medium

How WeHelp People Become an Excellent Front-End or Back-End Engineer

Using Capabilities and Value Streams to write your OKRs

The Command Line Part 4

Core Web Vitals and Pagespeed Dashboard by Webpage

Some of the best Minecraft game servers (August 2021)

User Stories — Reach the next maturity level.

Why you shouldn’t use Python

Refactoring — Removing Connascience of Execution Order

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Shon Paz

Shon Paz

Sr. Solution Architect, Red Hat

More from Medium

Kubeflow deployment on Azure (AKS), the Proper Way

Install Minio As Service On Linux

DataStax’s Snowflake Sink Connector for Apache Pulsar

Running Scalable Lightweight Session aware Python Flask application on Kubernetes

Deploying Flask App on Kubernetes