Apache Superset on Nomad by HashiCorp

Meetvasu
5 min read · Feb 11, 2023


Apache Superset is an open-source, lightweight, and powerful data visualization tool. It started as a hack-a-thon project by Maxime Beauchemin at Airbnb in 2017 and has since become a top-level project at the Apache Software Foundation with tremendous community support. Superset is highly scalable, designed to work in a distributed environment, and supports running inside containers. The official Superset documentation has plenty of material on deploying Superset with Docker Compose as well as with Helm charts; in this article, however, we are going to explore deploying a highly scalable, containerized Apache Superset on HashiCorp Nomad. The resulting Nomad job will replicate the Docker Compose implementation of Superset.

We assume you have a running Nomad cluster with Consul for service discovery and Connect enabled. This Medium article does a great job of explaining a realistic multi-tier application running on Nomad and Consul.

Superset needs the following services to run in multi-tier mode:

  1. Webservers — to serve HTTP requests from clients; typically more than one, so they can be scaled up or down as needed.
  2. Metadata database — Superset needs a backend to store its dashboards, charts, and other user information. Out of the box Superset uses a SQLite database, which is not recommended for production. Since we will be deploying Superset in a distributed environment, we need something more robust, such as a PostgreSQL or MySQL database. See the database configuration suggestions for supported versions.
  3. Cache — for Celery workers to store their results and to cache other Superset objects: dashboard filter state, Explore chart form data, metadata, and charting data queried from datasets. The cache improves the performance of SQL queries and, subsequently, of visualizations.
  4. Celery workers — to support asynchronous, long-running queries that execute beyond the typical web request's timeout; typically more than one instance, scaled up or down as needed.
  5. Celery broker — acts as a message queue/scheduler for the Celery workers. There should be no more than one broker in the entire setup.

Note — Superset does not store the data presented in a dashboard, chart, or SQL query in its metastore; that data is queried directly from the database where it originally resides, or retrieved from the cache (if enabled and available), when a user requests it via a dashboard or SQL query.

Nomad job overview

In this setup we use PostgreSQL as our metadata database and Redis for our cache and results backend (where Celery workers store the results of their execution). Each service gets its own group so that we can scale them up or down independently as needed. However, there can only be one instance each of the PostgreSQL database, the Redis cache, and the Celery beat scheduler in this setup. They could share a group, but in this example we give each of them its own group as well.

[Figure: Apache Superset distributed setup architecture — Gunicorn webservers, PostgreSQL metadata database, Redis cache, and Celery scheduler/workers]

Job outline

Job: Superset

  Group: Metastore
    Task: postgresdb

  Group: Cache
    Task: redis

  Group: Webservers
    Task: webserver

  Group: Scheduler
    Task: celerybeat

  Group: Workers
    Task: celeryworkers

Since we are replicating the Docker Compose implementation, we will use the helper startup scripts and superset_config.py available in the docker subdirectory of the Apache Superset GitHub repository. These scripts initialize the Superset metadata database, set up the admin user, and start the Gunicorn web server, the Celery scheduler, and the Celery workers. In this Nomad job, each task downloads the subdirectory contents directly from GitHub; to improve job startup time, you may want to extend the Docker image and bundle these files within it. The docker subdirectory also includes a sample superset_config.py which picks up the environment variables we set here; you may want to extend it or include your own config file for further customization.

The Job

Database credentials & task environment variables

We will start by declaring some local values that define the database credentials, add the superset_config.py path to PYTHONPATH, set the database dialect, and set the sidecar proxy host & port for the database & cache.
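
A minimal sketch of such a locals block; the credentials, names, and paths below are illustrative placeholders, so adjust them for your environment:

```hcl
locals {
  # Metadata database credentials and dialect (placeholders; change these).
  database_dialect  = "postgresql"
  database_db       = "superset"
  database_user     = "superset"
  database_password = "superset"

  # Sidecar proxy addresses: upstream services are bound on localhost
  # inside each task's network namespace.
  database_host = "127.0.0.1"
  database_port = 5432
  redis_host    = "127.0.0.1"
  redis_port    = 6379

  # Make the sample superset_config.py from the docker subdirectory importable.
  pythonpath = "/app/pythonpath:/app/docker/pythonpath_dev"
}
```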

We modify the Python path in the container to include the sample superset_config.py. We will create sidecar Connect proxies for the PostgreSQL database and Redis cache to make them reachable by the other services. We also need to create upstream services on ports 5432 and 6379 in the webserver, celery beat, and celery worker task containers; the default host address for an upstream service is 127.0.0.1. For all the tasks that need to access the PostgreSQL database and Redis cache, i.e. the webserver, celery beat, and celery workers, we add the following service stanza.
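
A sketch of that stanza at the group level; the service names are illustrative and must match the sidecar services registered by the metastore and cache groups below:

```hcl
# Consul Connect requires bridge networking at the group level.
network {
  mode = "bridge"
}

service {
  # e.g. "superset-beat" / "superset-worker" in the scheduler and worker groups
  name = "superset-webserver"
  port = "8088"

  connect {
    sidecar_service {
      proxy {
        # Bind the Postgres and Redis upstreams on localhost inside the group.
        upstreams {
          destination_name = "superset-postgres"
          local_bind_port  = 5432
        }
        upstreams {
          destination_name = "superset-redis"
          local_bind_port  = 6379
        }
      }
    }
  }
}
```

Because the upstreams bind on 127.0.0.1, the locals declared above simply point the database and Redis hosts at localhost.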

Metastore

We will create one PostgreSQL database that Superset uses as its metadata store. Although volumes are not used in this example, a persistent volume is strongly suggested to persist the state of the database. The service stanza creates a sidecar service so other tasks can reach port 5432 on this container.
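
A hedged sketch of that group; the image tag, service name, and resource sizes are illustrative:

```hcl
group "metastore" {
  network {
    mode = "bridge"
  }

  # Register a Connect sidecar so other groups can reach Postgres on 5432.
  service {
    name = "superset-postgres"
    port = "5432"

    connect {
      sidecar_service {}
    }
  }

  task "postgresdb" {
    driver = "docker"

    config {
      image = "postgres:14"
    }

    # Credentials come from the locals declared earlier.
    env {
      POSTGRES_DB       = local.database_db
      POSTGRES_USER     = local.database_user
      POSTGRES_PASSWORD = local.database_password
    }

    resources {
      cpu    = 500
      memory = 512
    }
  }
}
```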

Redis Cache

The following creates a Redis instance that serves as the default cache for all Superset objects and as the results backend for the Celery workers, and can back any other cache Superset needs. The service stanza creates a sidecar service so other tasks can reach port 6379 on this container.
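
A hedged sketch of the cache group, again with an illustrative image tag and service name:

```hcl
group "cache" {
  network {
    mode = "bridge"
  }

  # Register a Connect sidecar so other groups can reach Redis on 6379.
  service {
    name = "superset-redis"
    port = "6379"

    connect {
      sidecar_service {}
    }
  }

  task "redis" {
    driver = "docker"

    config {
      image = "redis:7"
    }

    resources {
      cpu    = 500
      memory = 256
    }
  }
}
```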

Superset Webservers

Here we create a group of webservers that we can scale independently of the other tasks in this job. We bind the service on this group's containers to Superset's default port 8088. We also create two sidecar upstream services so tasks in this group can reach the PostgreSQL database and Redis cache.

Inside the task, we download the helper scripts from the GitHub repository and run the Gunicorn server using the docker-bootstrap.sh script. The template stanza exposes the environment variables we declared as locals earlier.
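
Putting the group together, a hedged sketch might look like the following; the artifact source, image tag, and environment variable names follow the docker subdirectory conventions but should be treated as illustrative rather than exact:

```hcl
group "webservers" {
  count = 2

  network {
    mode = "bridge"
  }

  # The Connect-enabled service stanza with the Postgres and Redis upstreams
  # from the previous section goes here.

  task "webserver" {
    driver = "docker"

    # Pull the helper scripts and sample superset_config.py from the docker
    # subdirectory of the Superset repository. Cloning the full repo can be
    # slow; bundling these files into a custom image is faster.
    artifact {
      source      = "github.com/apache/superset//docker"
      destination = "local/docker"
    }

    config {
      image   = "apache/superset:latest"
      command = "/app/docker/docker-bootstrap.sh"
      args    = ["app-gunicorn"]

      volumes = [
        "local/docker:/app/docker",
      ]
    }

    # Environment variables consumed by docker-bootstrap.sh and the sample
    # superset_config.py, filled in from the locals declared earlier.
    template {
      destination = "local/superset.env"
      env         = true
      data        = <<-EOF
        DATABASE_DIALECT=${local.database_dialect}
        DATABASE_HOST=${local.database_host}
        DATABASE_PORT=${local.database_port}
        DATABASE_DB=${local.database_db}
        DATABASE_USER=${local.database_user}
        DATABASE_PASSWORD=${local.database_password}
        REDIS_HOST=${local.redis_host}
        REDIS_PORT=${local.redis_port}
        PYTHONPATH=${local.pythonpath}
      EOF
    }

    resources {
      cpu    = 1000
      memory = 1024
    }
  }
}
```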

Celery scheduler & workers

The scheduler and worker groups look just like the webservers group, other than the arguments we pass to the bootstrap script when starting the container.

Scheduler config block:
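
A hedged sketch, assuming the same apache/superset image and downloaded docker subdirectory as the webserver task:

```hcl
# Celery beat scheduler: same image, artifact, and template as the webserver
# task; only the bootstrap argument changes.
config {
  image   = "apache/superset:latest"
  command = "/app/docker/docker-bootstrap.sh"
  args    = ["beat"]

  volumes = [
    "local/docker:/app/docker",
  ]
}
```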

Workers config block:
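
And the equivalent sketch for the workers:

```hcl
# Celery worker: the "worker" argument makes docker-bootstrap.sh start a
# celery worker instead of the web server.
config {
  image   = "apache/superset:latest"
  command = "/app/docker/docker-bootstrap.sh"
  args    = ["worker"]

  volumes = [
    "local/docker:/app/docker",
  ]
}
```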

Bonus: Initialize Superset

To initialize the database, create an admin user, and initialize the app, we need to run the following:
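
A minimal sketch of the init task's config stanza, assuming the repository's docker-init.sh wrapper (which runs the database upgrade, admin-user creation, and superset init steps):

```hcl
# One-shot init task: docker-init.sh upgrades the metadata database,
# creates the admin user, and runs superset init. If you choose to load
# examples, give this task substantially more CPU and memory.
config {
  image   = "apache/superset:latest"
  command = "/app/docker/docker-init.sh"

  volumes = [
    "local/docker:/app/docker",
  ]
}
```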

An initialization script (docker-init.sh) is included in the docker subdirectory we download from the Superset GitHub repository. You can choose to run this as a prestart task for one of the other tasks, but keep in mind that if you are loading examples, the task container will need substantial CPU and memory allocated to it. The init script only needs to run once, so you cannot place it in groups that run multiple instances. The resulting initialization group looks very similar to the webserver group, other than a different command to start the container.

You now have Superset running in a distributed environment on Nomad. It is highly scalable, because you can tweak the number of tasks per group to meet demand.

Here is a link to the complete Nomad job.
