The Docker Everything Bagel™ — Spin Up A Local Data Stack

Use docker compose to create local replicas of a modern data stack with one command.

Paul Singman
Whispering Data
5 min read · Oct 11, 2021

Introduction

An important part of developing an open source project is assisting and advising users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally.

This means recreating the environment, running the same code, and raising the same error.

In complex, modern data stacks this is easier said than done. Drawing on our experience over the past year, we have built a setup that helps us in this pursuit. We affectionately call it the Everything Bagel.

What is the Everything Bagel?

The Everything Bagel is a multi-container Docker environment that spins up locally with a single command. It contains many of the technologies we see lakeFS commonly deployed with, including:

  • Spark
  • Hive
  • Trino
  • MinIO

The best part — it’s publicly available right in the lakeFS GitHub repo!

In this post I’ll cover how to get the Docker Everything Bagel up and running on your own laptop. In the process, I’ll also cover how it works and some of the cool things you can do with your very own Everything Bagel.

Spin It Up!

The only prerequisite is to have Docker installed on your machine. Once it is installed, the steps to get the Everything Bagel running are as follows:

  1. Clone the lakeFS repo: git clone https://github.com/treeverse/lakeFS.git
  2. Navigate to the deployments/compose directory inside the cloned repo: cd lakeFS/deployments/compose
  3. Run: docker compose up -d

That’s it! The different containers will start to spin up. Once this completes (it will take a few minutes the first time), you can check the status of the resulting containers by running docker compose ps. In Terminal, you should see each container listed along with its current state.

A Quick Note on Memory

When creating the 10+ Everything Bagel containers, it’s important to make sure the Docker application has enough memory allocated to it.

To adjust this setting, go to Docker’s preferences page. From there, go to the Resources tab and make sure the Memory setting is set to at least 4 GB.

How It Works

The key to understanding how the Docker Everything Bagel works is to look at the docker-compose.yml file, which the docker compose up -d command references by default. (The -d flag runs the containers in “detached” mode, meaning in the background.)

Although this article won’t be a comprehensive review of Docker Compose files, let’s take a peek at a few of the service sections to understand what’s happening.

Let’s look at part of the config: the sections that create containers running Postgres (a lakeFS dependency), MinIO (the underlying object store), and lakeFS.
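As a rough illustration of the overall shape of such service sections, here is a minimal sketch. It is not the file from the lakeFS repo: the image tags and port mappings are assumptions chosen for illustration, and most keys are left out, so consult deployments/compose/docker-compose.yml for the real values.

```yaml
# Illustrative sketch only; not the actual docker-compose.yml from the lakeFS repo.
# Only the image and ports keys are shown; other settings are omitted for brevity.
services:
  postgres:
    image: postgres:11              # assumed tag

  minio:
    image: minio/minio
    command: server /data           # start MinIO serving the /data directory
    ports:
      - "9000:9000"                 # host port : container port

  lakefs:
    image: treeverse/lakefs:latest
    ports:
      - "8000:8000"
```

Each top-level entry under services becomes one container, and the service name doubles as its hostname on the Compose network.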

Docker Compose Specifics

Within a service’s section, perhaps the most important setting is the image.

This specifies the image the container is created from. In most cases, you can find an official image for a service on Docker Hub that installs all the packages you need to run the service. For example, the lakeFS image on Docker Hub is updated automatically by a GitHub Actions workflow every time there’s a new official release of lakeFS, and a container created from it starts running lakeFS as soon as it launches.

Next, the depends_on: key is used to control the order in which containers are created. Before starting the lakeFS service we make sure a container running Postgres is up first, as well as a service that starts MinIO and creates a bucket. If you aren’t familiar with MinIO, it’s an open-source object store that maintains compatibility with the S3 API. This makes it convenient for simulating S3 in local environments as we are doing here.
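A minimal sketch of this ordering follows. The service name minio-setup is a hypothetical stand-in for the service that creates the bucket, the images are assumptions, and only the keys relevant to ordering are shown. Keep in mind that the short list form of depends_on only orders container startup; it does not wait for a dependency to be ready to accept connections.

```yaml
# Sketch of start ordering; service names and images are illustrative assumptions.
services:
  postgres:
    image: postgres:11

  minio:
    image: minio/minio
    command: server /data

  minio-setup:                  # hypothetical one-off service that creates a bucket
    image: minio/mc
    depends_on:
      - minio                   # started only after the minio container

  lakefs:
    image: treeverse/lakefs:latest
    depends_on:                 # Compose starts both of these before lakefs
      - postgres
      - minio-setup
```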

Any environment variables can be listed under the environment: key, which all three of the services above use. Similarly, we can keep entire config files on the host and mount them into the container as a volume. The spark service contains an example of this.
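To make both ideas concrete, here is a small sketch. The credentials are placeholders, the Spark image is an assumption, and both file paths are hypothetical; the variable names an image accepts can also differ between versions, so check the image’s documentation.

```yaml
# Sketch: environment variables and a mounted config file; all values are placeholders.
services:
  minio:
    image: minio/minio
    command: server /data
    environment:
      MINIO_ROOT_USER: minioadmin           # placeholder credentials
      MINIO_ROOT_PASSWORD: minioadmin

  spark:
    image: bitnami/spark:3                  # assumed image
    volumes:
      # host path : container path (both hypothetical)
      - ./etc/hive-site.xml:/opt/bitnami/spark/conf/hive-site.xml
```

Because the file is mounted rather than baked into the image, editing it on the host and restarting the container is enough to pick up the change.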

The last thing I’d like to cover is the entrypoint: key. This is the command or commands we want to run inside the container. Usually, this simply starts the relevant service. In the lakeFS-setup service, however, we perform the additional steps of creating a lakeFS user and repository (via the lakectl command line tool) so that these steps don’t need to be performed manually each time.
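The pattern can be sketched roughly as follows. This is an assumption-laden illustration, not the setup service from the repo: the credentials are placeholders, the repository and bucket names are made up, the user-creation step is left out, and it assumes the image ships both a shell and lakectl.

```yaml
# Sketch of a one-off setup service; the exact commands in the repo may differ.
services:
  lakefs-setup:
    image: treeverse/lakefs:latest
    depends_on:
      - lakefs
    environment:
      # lakectl reads its configuration from LAKECTL_-prefixed variables
      LAKECTL_SERVER_ENDPOINT_URL: http://lakefs:8000
      LAKECTL_CREDENTIALS_ACCESS_KEY_ID: "EXAMPLE_ACCESS_KEY_ID"        # placeholder
      LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY: "EXAMPLE_SECRET_KEY"       # placeholder
    entrypoint:
      - sh
      - -c
      # create a repository named "example" backed by a bucket named "example"
      - lakectl repo create lakefs://example s3://example
```

Overriding entrypoint replaces whatever the image would normally run, which is why this container performs the setup step and then exits instead of starting another lakeFS server.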

After all, there is little value in a repository-less lakeFS instance.

Configuring these settings correctly lets us run pretty much any service we want inside an isolated Docker environment. Pretty cool!

Using the Everything Bagel

Once you have the Everything Bagel up and running, you can do a variety of things, such as…

  1. Connecting to the Hive and Trino clients [docker compose --profile client run --rm trino-client] and creating tables or querying data
  2. Hopping into the spark container [docker compose exec spark bash] and testing out spark-submit jobs
  3. Logging into the lakeFS (http://localhost:8000) and MinIO (http://localhost:9000) UIs to see how Spark, Hive, and Trino’s operations are reflected.

Try different things out and let us know how it goes!

Looking Ahead

This is an introduction to the Bagel. I hope you learned a bit about Docker and how it can be used to easily create complex environments locally. Next time we’ll dive into a more advanced use case showing the Everything Bagel in action!

We’re also continually improving it, making the Docker environment simpler and adding other relevant technologies.

This article was originally published by Paul Singman on the lakeFS blog.
