Published in Whispering Data

The Docker Everything Bagel™ — Spin Up A Local Data Stack

Use docker compose to create local replicas of a modern data stack with one command.

Introduction

An important part of developing an open source project is assisting and advising users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally.

This means recreating the environment, running the same code, and raising the same error.

In complex, modern data stacks this is easier said than done. Over the past year we have developed a setup that helps us in this pursuit. We affectionately call it the Everything Bagel.

What is the Everything Bagel?

The Everything Bagel is a multi-container Docker environment that spins up locally with a single command. It contains many of the technologies we see lakeFS commonly deployed with, including:

  • Spark
  • Hive
  • Trino
  • MinIO

The best part — it’s publicly available right in the lakeFS GitHub repo!

In this post I’ll cover how to get the Docker Everything Bagel up and running on your own laptop. In the process, I’ll also cover how it works and some of the cool things you can do with your very own Everything Bagel.

Spin It Up!

The only prerequisite is to have Docker installed on your machine. Once it is installed, the steps to get the Everything Bagel running are as follows:

  1. Clone the lakeFS repo: git clone https://github.com/treeverse/lakeFS.git
  2. Navigate to the deployments/compose directory inside the cloned repo: cd lakeFS/deployments/compose
  3. Run: docker compose up -d

That’s it! The different containers will start to spin up. Once the command completes (it will take a few minutes the first time), you can check the status of the resulting containers by running docker compose ps. You should see output in your terminal like:

NAME                     COMMAND                    SERVICE          STATUS       PORTS
compose_spark-worker_1   "/opt/bitnami/script…"     spark-worker     running      0.0.0.0:53129->8081/tcp
compose_spark-worker_2   "/opt/bitnami/script…"     spark-worker     running      0.0.0.0:53126->8081/tcp
compose_spark-worker_3   "/opt/bitnami/script…"     spark-worker     running      0.0.0.0:53128->8081/tcp
compose_spark_1          "/opt/bitnami/script…"     spark            running      0.0.0.0:18080->8080/tcp, :::18080->8080/tcp
hive                     "/bin/sh -c \"/entryp…"    hive-metastore   running      0.0.0.0:9083->9083/tcp, :::9083->9083/tcp
hiveserver2              "hive --service hive…"     hive-server      running
lakefs                   "/app/wait-for postg…"     lakefs           running      0.0.0.0:8000->8000/tcp, :::8000->8000/tcp
lakefs-setup             "/app/wait-for postg…"     lakefs-setup     exited (0)
mariadb                  "docker-entrypoint.s…"     mariadb          running      3306/tcp
minio                    "minio server /data …"     minio            running      0.0.0.0:9000->9000/tcp, :::9000->9000/tcp, 0.0.0.0:9001->9001/tcp, :::9001->9001/tcp
minio-setup              "mc mb lakefs/example"     minio-setup      exited (0)
postgres                 "docker-entrypoint.s…"     postgres         running      5432/tcp
trino                    "/usr/lib/trino/bin/…"     trino            running      8080/tcp

A Quick Note on Memory

When creating the 10+ Everything Bagel containers, it’s important to make sure the Docker application has enough memory allocated to it.

To adjust this setting, go to Docker’s preferences page. From there, go to the Resources tab and make sure the Memory setting is set to at least 4 GB.

How It Works

The key to understanding how the Docker Everything Bagel works is the docker-compose.yml file, which the docker compose up -d command reads by default (the -d flag runs the containers in “detached” mode, in the background).

Although this article won’t be a comprehensive review of Docker Compose files, let’s take a peek at a few of the service sections to understand what’s happening.

Let’s look at part of the config: the sections that create containers running Postgres (a lakeFS dependency), MinIO (the underlying object store), and lakeFS.
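To give a feel for the shape of these sections, here is a pared-down sketch. It is not a copy of the file in the repo: the image tags, credentials, and lakeFS environment variable values are illustrative assumptions, and the LAKEFS_* variable names depend on the lakeFS version you run. The port mappings come from the docker compose ps output above.

```yaml
# Illustrative sketch only; the real file lives at deployments/compose/docker-compose.yml.
services:
  postgres:
    image: postgres:11                 # assumed tag
    container_name: postgres
    environment:
      POSTGRES_USER: lakefs            # assumed credentials
      POSTGRES_PASSWORD: lakefs

  minio:
    image: minio/minio                 # no tag given, so Docker pulls "latest"
    container_name: minio
    ports:
      - "9000:9000"                    # S3-compatible API
      - "9001:9001"                    # web console
    environment:
      MINIO_ROOT_USER: minioadmin      # assumed credentials
      MINIO_ROOT_PASSWORD: minioadmin
    # Assumed invocation: serve /data and pin the console to port 9001
    command: ["server", "/data", "--console-address", ":9001"]

  lakefs:
    image: treeverse/lakefs:latest
    container_name: lakefs
    ports:
      - "8000:8000"
    depends_on:
      - postgres
      - minio
    environment:
      # Names and values below are assumptions; check the lakeFS
      # configuration reference for your version.
      LAKEFS_AUTH_ENCRYPT_SECRET_KEY: "some random secret string"
      LAKEFS_DATABASE_CONNECTION_STRING: "postgres://lakefs:lakefs@postgres/postgres?sslmode=disable"
      LAKEFS_BLOCKSTORE_TYPE: s3
      LAKEFS_BLOCKSTORE_S3_ENDPOINT: "http://minio:9000"
      LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE: "true"
      LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID: minioadmin
      LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY: minioadmin
```

Inside the Compose network, containers reach each other by service name, which is why the lakeFS settings point at postgres and minio rather than localhost.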

Docker Compose Specifics

Within a service’s section, perhaps the most important setting is the image.

This specifies the image the container is created from. In most cases, you can find an official image for a service on Docker Hub that installs all the packages you need to run the service. For example, the lakeFS image on Docker Hub is updated automatically by a GitHub Action workflow every time there’s a new official release of lakeFS, and a container created from it starts running lakeFS as soon as it starts.

Next, the depends_on: key is used to control the order in which containers are created. Before starting the lakeFS service we make sure a container running Postgres is up first, as well as a service that starts MinIO and creates a bucket. If you aren’t familiar with MinIO, it’s an open-source object store that maintains compatibility with the S3 API. This makes it convenient for simulating S3 in local environments as we are doing here.
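As a sketch of that ordering, the fragment below puts a one-shot setup service between MinIO and lakeFS. It is meant to be merged into a file that already defines the postgres and minio services. The image tags and credentials are assumptions, while the bucket-creation command mirrors the mc mb lakefs/example entry in the docker compose ps output.

```yaml
# Fragment: assumes `postgres` and `minio` services are defined elsewhere in the file.
services:
  minio-setup:
    image: minio/mc                    # MinIO client image; its entrypoint is `mc`
    container_name: minio-setup
    depends_on:
      - minio                          # start MinIO before trying to create the bucket
    environment:
      # Defines an mc alias named "lakefs" that points at the MinIO container
      # (credentials are assumed defaults).
      MC_HOST_lakefs: "http://minioadmin:minioadmin@minio:9000"
    command: ["mb", "lakefs/example"]  # runs `mc mb lakefs/example`, then exits

  lakefs:
    image: treeverse/lakefs:latest
    depends_on:
      - postgres                       # database first
      - minio-setup                    # then the bucket-creating service
```

One caveat: the list form of depends_on only controls start order. It does not wait for a dependency to be ready or, for a one-shot container, to finish. That is likely why the lakefs command in the output above begins with /app/wait-for.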

Any environment variables can be listed under the environment: key, which the Postgres, MinIO, and lakeFS services all use. Similarly, we can keep entire config files on the host and mount them into the container as a volume. The spark service contains an example of this.
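Here is a sketch of what mounting a config file can look like. The bitnami/spark image is inferred from the /opt/bitnami path in the docker compose ps output; the host path ./etc/spark-defaults.conf is a hypothetical name, and SPARK_MODE is the setting the Bitnami image documents for choosing the node role.

```yaml
# Fragment: illustrative values, not the repo's actual spark service.
services:
  spark:
    image: bitnami/spark               # assumed image; tag omitted
    ports:
      - "18080:8080"                   # Spark master UI on host port 18080
    environment:
      SPARK_MODE: master               # Bitnami-specific setting: run as the master node
    volumes:
      # host path (hypothetical) : path inside the container
      - ./etc/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf
```

Because the file is mounted rather than baked into the image, editing it on the host and restarting the container is enough to pick up a change.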

The last thing I’d like to cover is the entrypoint: key. This is the command or commands we want to run inside the container. Usually, this simply starts the relevant service. In the lakefs-setup service, however, we perform the additional steps of creating a lakeFS user and repository (via the lakectl command line tool) so that these steps don’t need to be performed manually each time.

After all, there is little value in a repository-less lakeFS instance.
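A rough sketch of such a setup service follows. The user name, keys (the placeholder pair from the AWS documentation), repository name, and the arguments after /app/wait-for are all assumptions; only the /app/wait-for prefix itself comes from the docker compose ps output. It also assumes the lakefs and lakectl binaries are on the image's PATH.

```yaml
# Fragment: hypothetical names and credentials throughout.
services:
  lakefs-setup:
    image: treeverse/lakefs:latest
    container_name: lakefs-setup
    depends_on:
      - lakefs
    environment:
      # Same database settings as the lakefs service (assumed names and values)
      LAKEFS_AUTH_ENCRYPT_SECRET_KEY: "some random secret string"
      LAKEFS_DATABASE_CONNECTION_STRING: "postgres://lakefs:lakefs@postgres/postgres?sslmode=disable"
      # Tell lakectl where the server is and which keys to use
      LAKECTL_SERVER_ENDPOINT_URL: "http://lakefs:8000"
      LAKECTL_CREDENTIALS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
      LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    # Wait for Postgres and lakeFS, create the first user, then create a
    # repository backed by the "example" bucket.
    entrypoint:
      - /app/wait-for
      - "postgres:5432"
      - "--"
      - /app/wait-for
      - "lakefs:8000"
      - "--"
      - sh
      - "-c"
      - >-
        lakefs setup --user-name docker
        --access-key-id AKIAIOSFODNN7EXAMPLE
        --secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
        && lakectl repo create lakefs://example s3://example
```

A container like this runs its commands once and exits, which matches the exited (0) status shown for lakefs-setup in the output above.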

Getting these settings right lets us run pretty much any service we want inside an isolated Docker environment. Pretty cool!

Using the Everything Bagel

Once you have the Everything Bagel up and running, you can do a variety of things, such as:

  1. Connecting to the Hive and Trino clients [docker compose --profile client run --rm trino-client] and creating tables or querying data
  2. Hopping into the spark container [docker compose exec spark bash] and testing out spark-submit jobs
  3. Logging into the lakeFS (http://localhost:8000) and MinIO (http://localhost:9000) UIs to see how Spark, Hive, and Trino’s operations are reflected.

Try different things out and let us know how it goes!

Looking Ahead

This is an introduction to the Bagel. I hope you learned a bit about Docker and how it can be used to easily create complex environments locally. Next time we’ll dive into a more advanced use case showing the Everything Bagel in action!

We’re also continually improving it, making the Docker environment simpler and adding other relevant technologies.

This article was originally published by Paul Singman on the lakeFS blog.
