The Missing Piece in Your Software Cycle — An Isolated Testing Data Environment for Spark

Adi Polak
Published in lakeFS
Dec 2, 2022


A typical routine of a data engineer includes writing code, choosing and upgrading the compute infrastructure, and testing new and changed data pipelines. Most of the time, engineers who want to test their changes need to run the modified pipelines in parallel to production.

Every data engineer knows that this complex process involves many moving parts and manual tasks, such as copying data, updating configuration, and creating alternative paths.

You can only imagine how time-consuming and error-prone such a workflow is. If something goes south, it might damage production or pollute the data lake.

You don’t have to worry; I’ve got you covered!

Luckily, as a data practitioner, you can take advantage of solutions that simplify this workflow. lakeFS lets you create a safe and automated development environment for data without having to copy or mock data. You can easily work on production pipelines without any significant engagement of DevOps resources.

What will you learn?

This tutorial demonstrates how to build a development and testing environment for validating your logic against the full volume and variety of production data, working with lakeFS and Spark. You will walk through creating a repository and building a Spark application while using lakeFS capabilities. You will learn how to commit data changes, revert them in case of mistakes or other hiccups, and finally merge separate branches to reflect data changes from the isolated environments.

Ready to jump in? Just lakeFS it!

Table of contents:

  • Multiple environments: What is their purpose?
  • Why not build an isolated data testing environment?
  • Building a testing environment for data with Spark and lakeFS
  • Step 1: Creating a repository
  • Step 2: Upload and commit data
  • Step 3: Develop a Spark application on the isolated branch
  • Step 4: Rolling back or committing changes
  • Step 5: Merge and iterate in the development workflow

Multiple environments: What is their purpose?

If you run ETL jobs directly in production, expect data issues to start flowing into dashboards, ML models, and other data consumers sooner or later.

When working with a data lake, it’s a good idea to have replicas of your production environment. They let you test and understand changes to your data without impacting the consumers of production data.

The most common approach that allows engineers to avoid making changes directly in production is creating and maintaining a second data environment (dev/test environment) where you first implement updates.

One problem with this approach is that maintaining a separate environment is both time- and resource-consuming. For larger teams, doing that might force multiple people to share a single environment, which requires significant coordination among team members. This takes a lot of time as well.

But there’s a way out.

Why not build an isolated data testing environment?

You can create an isolated testing environment using lakeFS. That way, you don’t have to spend another minute on maintaining multiple environments. On top of that, you can create as many environments as you need instantly — without copying one KB of data.

How does all that work? Let’s take a quick look at the lakeFS repository structure:

In a lakeFS repository, you will always find data located on a branch.

  • Each branch works like a separate environment.
  • Branches are fully isolated, meaning that data changes on one branch do not affect other branches.
  • Objects that don’t change between two branches are never copied but shared by these branches via metadata pointers (shallow copy), which lakeFS manages.
  • If you apply a change on one branch and want it to be reflected on another, you can carry out a merge operation to update one branch with changes from another.

When using lakeFS, you can create an isolated data environment moments before testing a change. Once the new data is merged into production, you simply delete the branch to remove the environment you no longer need.

This approach differs from building a long-living test environment that engineers use as a staging area for all data updates. With lakeFS, you can create a new branch for every single change to production that you want to make, and you can test multiple changes at the same time, as the sketch below illustrates.
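As a quick, purely illustrative sketch (the repository is the one we create in Step 1 below, and the branch names are hypothetical), two unrelated changes would each get their own short-lived branch created from main:

# Hypothetical example: one isolated branch per change, both created from main
$ lakectl branch create lakefs://example-repo/fix-schema -s lakefs://example-repo/main
$ lakectl branch create lakefs://example-repo/new-aggregation -s lakefs://example-repo/main

Each branch can then be tested, merged, and deleted on its own schedule, without either change seeing the other's uncommitted data.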

Now that we've covered the theory, let's see what building a test environment with lakeFS looks like in practice.

Building a testing environment for data with Spark and lakeFS

To get started, we need to create a repository and build a small Spark application.

Prerequisites

Tip: The lakeFS docs explain how to run lakeFS locally against your own storage in less than 3 minutes.

Step 1: Creating a repository

Our first step is to create a new repository called `example-repo` that will hold our data. In the code below, we’re using an S3 bucket `example-bucket` as the underlying storage.

$ lakectl repo create lakefs://example-repo s3://example-bucket/example-repo -d main
Repository: lakefs://example-repo
Repository 'example-repo' created:
storage namespace: s3://example-bucket/example-repo
default branch: main
timestamp: 1660458426

Step 2: Upload and commit data

We can download a sample data file and upload it directly into our newly created repository using the following commands.

# Download data file - credit to Gutenberg EBook of Alice's Adventures in Wonderland
# Or you can use any text file you like
$ curl -o alice.txt https://www.gutenberg.org/files/11/11-0.txt
# Upload to our repository
$ lakectl fs upload -s alice.txt lakefs://example-repo/main/alice.txt
Path: alice.txt
Modified Time: 2022-08-14 09:35:16 +0300 IDT
Size: 174313 bytes
Human Size: 174.3 kB
Physical Address: s3://example-bucket/example-repo/ec827db52c044d3881cdf426b99819f3
Checksum: 7e56f48cb7671ceba58e48bf232a379c
Content-Type: application/octet-stream

This is the starting point of our development project. Committing the data gives us a point in time that we can always roll back to, reference, or branch from.

$ lakectl commit -m "alice in wonderland" lakefs://example-repo/main
Branch: lakefs://example-repo/main
Commit for branch "main" completed.
ID: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Message: alice in wonderland
Timestamp: 2022-08-14 09:36:08 +0300 IDT
Parents: fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc

At this point, we recommend updating the development environment's main branch regularly; this is how we keep the development environment on par with production.

Since all branches are isolated, you can choose when to pull those updates into your own branch, while newly created branches will start from the updated main branch.
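For example, once a working branch exists (we create word-count1 in Step 3 below), one way to pull the latest production data into it is to merge main into that branch. This is only a sketch of the idea; the command shape mirrors the merge we run in Step 5:

# Sketch: bring the latest committed production data into an existing working branch
$ lakectl merge lakefs://example-repo/main lakefs://example-repo/word-count1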

Step 3: Develop a Spark application on the isolated branch

We need to start with a new branch to capture the state of our repository while working on our Spark application.

$ lakectl branch create lakefs://example-repo/word-count1 -s lakefs://example-repo/main
Source ref: lakefs://example-repo/main
created branch 'word-count1' bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620

By creating a branch, you get an isolated environment that contains a snapshot of your repository. Note that this operation is lightweight — it doesn’t copy any data and still gives us an isolated space guaranteed not to change.

At this point, we can use the lakectl command to list the content of our development branch and see that it contains the same data as our main branch:

$ lakectl fs ls lakefs://example-repo/word-count1/
object 2022-08-14 09:35:16 +0300 IDT 174.3 kB alice.txt
$ lakectl fs ls lakefs://example-repo/main/
object 2022-08-14 09:35:16 +0300 IDT 174.3 kB alice.txt

Now it’s time to run a simple word count using Spark to generate a report based on our story. The first step is to set up the lakeFS Spark integration:

sc.hadoopConfiguration.set("fs.s3a.access.key",<lakeFS access key ID>)
sc.hadoopConfiguration.set("fs.s3a.secret.key",<lakeFS secret access key>)
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.local.lakefs.io:8000")
val branch = "s3a://example-repo/word-count1/"
val textFile = sc.textFile(branch + "alice.txt")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile(branch + "wc.report")

Access to our repository is enabled via the S3 interface — all we need to do is set the s3a endpoint and credentials based on our lakeFS installation.

Running the code above generates the word count report. To see what changed on the branch, we can use lakectl diff:

$ lakectl diff lakefs://example-repo/word-count1
Ref: lakefs://example-repo/word-count1
+ added wc.report/_SUCCESS
+ added wc.report/part-00000
+ added wc.report/part-00001

If there’s any problem with the generated data, we can always discard all the uncommitted changes with a single command that resets the branch:

$ lakectl branch reset lakefs://example-repo/word-count1 --yes
Branch: lakefs://example-repo/word-count1

Step 4: Rolling back or committing changes

Once testing is completed, we're ready to commit our changes. We can attach additional metadata fields to our lakeFS commit; for example, by recording the application's git commit hash, we can reference the exact code that was used to build the data.

$ lakectl commit lakefs://example-repo/word-count1 -m "word count" --meta git_commit_hash=cc313c5
Branch: lakefs://example-repo/word-count1
Commit for branch "word-count1" completed.
ID: 068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df
Message: word count
Timestamp: 2022-08-14 11:36:16 +0300 IDT
Parents: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620

By using the `meta` flag, we can easily store multiple metadata key/value pairs to label our commit. Later on, we can take a look at the log and use the referenced data.
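For instance, assuming the --meta flag can be repeated (and using a hypothetical spark_job key purely for illustration), a commit could carry both the git hash and the job name:

$ lakectl commit lakefs://example-repo/word-count1 -m "word count" --meta git_commit_hash=cc313c5 --meta spark_job=word-count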

$ lakectl log lakefs://example-repo/word-count1
ID: 068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df
Author: barak
Date: 2022-08-14 11:36:16 +0300 IDT
word count
git_commit_hash = cc313c5

ID: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Author: barak
Date: 2022-08-14 09:36:08 +0300 IDT
alice in wonderland

ID: fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc
Date: 2022-08-14 09:27:06 +0300 IDT
Repository created

The commit ID (for instance, 068ddaf6b4919e5b) can be used by our CLI to address the data at that exact commit.

$ lakectl fs cat lakefs://example-repo/068ddaf6b4919e5b/wc.report/part-00000
("Found,2)
(someone,1)
(plan.",1)
(bone,1)
(roses.,1)

Or by our Spark application, as an s3a address:
s3a://example-repo/068ddaf6b4919e5b/wc.report
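Here is a minimal sketch of reading that report from the same spark-shell session by commit ID, reusing the S3A configuration we set earlier:

// Read the committed report by commit ID instead of a branch name
val byCommit = "s3a://example-repo/068ddaf6b4919e5b/wc.report"
sc.textFile(byCommit).take(5).foreach(println)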

Step 5: Merge and iterate in the development workflow

Once both the code and data are committed, we can review them together before deciding to merge our new data into the main branch.

$ lakectl merge lakefs://example-repo/word-count1 lakefs://example-repo/main
Source: lakefs://example-repo/word-count1
Destination: lakefs://example-repo/main
Merged "word-count1" into "main" to get "50d593e52cba687f2af7894189b381ebafdf7495f9e65a70ba2fc67e4e8c0677".

The merge operation will generate a commit in the target branch with all the changes we have applied.

Merging is a quick and atomic metadata operation; lakeFS doesn't copy any data during the process, and each object is still stored only once. If we merge changes to multiple objects, we are guaranteed that they all show up in the destination branch at the same time.
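To see this for ourselves, we could list the main branch after the merge and confirm the report objects now appear there (same lakectl command we used earlier; output omitted):

$ lakectl fs ls lakefs://example-repo/main/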

Optionally, we can delete the feature branch once we no longer need it using this command:

$ lakectl branch delete lakefs://example-repo/word-count1 --yes
Branch: lakefs://example-repo/word-count1

Let’s sum it up!

Building an isolated development and testing environment for a data lake can be easy. I showed you how to do it with lakeFS and hope you feel empowered to take this open-source solution for a spin.

Note that everything above is just a suggested workflow; what matters is the concept and the methodology.

💡 Curious to learn more?

Check out lakeFS docs for more guides and detailed information on how to get started.

Join the awesome lakeFS data practitioners’ Slack community to meet like-minded people, ask questions, and share your wicked smart insights!


Adi Polak
lakeFS

👩‍💻 Software Engineer 📚 Author of Scaling Machine Learning with Spark (O'Reilly) 🗣️ Keynote Speaker 💫 Databricks ambassador