Apache Superset: a cost-effective alternative to QuickSight for a data lake

Nilesh Bhosale
7 min read · Dec 3, 2019


A tale of gluing together Apache Superset, AWS Athena, headless Firefox, GeckoDriver, and Selenium to get the amazing experience of one of the best open-source BI (Business Intelligence) tools.

Making sense of data with the right tools. Photo by Stephen Dawson on Unsplash.

At CreditVidya, a data lake is used to empower teams to enhance our products and services. We started building a data lake in late 2018 to leverage a low-cost tech stack: S3, Glue, Athena, Lambda, and QuickSight. Soon enough, we realized that building a production data lake is far more difficult than the marketed reality.

Recently, AWS launched a data lake solution, an umbrella for organizing the data lake tech stack, in a few regions. But the underlying challenges of building a data lake remain unaddressed.

In the sections below, we focus on one such challenge: the interface for data querying, visualization, and data governance.

Challenge 101: Data querying interface & visualization

  • The primary reason this became a challenge is the unavailability of QuickSight in the Mumbai region.
  • We don't want business users to access the AWS Athena console to run SQL queries, since the console requires an IAM user (i.e. an AWS console user) with permission to read S3 in addition to Athena, which is a recipe for disaster.

Due to the absence of QuickSight, we eagerly searched for an alternative that could address this challenge.

Open-source versus commercial BI tools: Which one to choose?

Frankly speaking, while designing the data lake, our first preference was AWS QuickSight, a commercial BI tool. The reason was quite simple: it is a managed service and the cheapest among commercial BI tools. We considered open-source only because of the unavailability of QuickSight in the ap-south-1 region (a blessing in disguise).

Open-source BI tools: A true underdog

  • After using Redash, we realized its true potential and how effectively it covered all our use cases.
  • Budget is a key metric for any project, and choosing an open-source BI tool directly gave us at least a 10x saving over a commercial alternative.
  • Based on our experience, we recommend that startups or organizations looking to start a data lake project consider open-source alternatives before going commercial.

Redash.io as an alternative

Redash is a new-generation open-source BI tool with strong community support. Redash connectors cover most of the databases in the big data world.

We started with redash.io to empower business users to build interactive dashboards and the analytics team to run analyses. But soon enough, teams started complaining about frequent query timeouts and the inability to join tables across different databases, even though the Athena console allows it.

First Complaint: Frequent query timeout

The Redash Athena connector is JDBC-based, and it starts to fail on long-running queries. To overcome this limitation, we started pushing business logic into database views. This resolved the query timeout issue and, with proper usage of the WITH clause, helped us build testable and maintainable views.

PrestoDB WITH clause for writing complex analytical queries

  • The WITH clause is used to define a temporary named relation for use within a query (definition from the PrestoDB docs).
  • We converted complex business logic into stepwise transformations and used the WITH clause to glue them together.
  • Testing and maintaining views becomes far easier because each transformation can be inspected on its own, as the sketch after this list shows.
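To illustrate the pattern, here is a minimal sketch of such a view in Presto/Athena SQL; the database, table, and column names (analytics, raw_db.transactions, user_id, amount) are hypothetical:

CREATE OR REPLACE VIEW analytics.monthly_spend AS
WITH cleaned AS (
    -- step 1: filter out bad records
    SELECT user_id, amount, created_at
    FROM raw_db.transactions
    WHERE amount > 0
),
monthly AS (
    -- step 2: aggregate per user per month
    SELECT user_id,
           date_trunc('month', created_at) AS month,
           sum(amount) AS spend
    FROM cleaned
    GROUP BY user_id, date_trunc('month', created_at)
)
-- step 3: expose the final relation
SELECT * FROM monthly

Each intermediate relation (cleaned, monthly) can be tested in isolation with a plain SELECT, which is what makes these views maintainable.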

Second Complaint: Inability to join tables across different databases

For every database in Athena, we created an equivalent data source in Redash. Since Redash allows access control at the data source level, this indirectly let us restrict access to each database in Athena.

The problem started when a user with access to multiple data sources in Redash could not join tables across those data sources; this feature is currently not available.

A non-scalable workaround is to create a new database in Athena combining the tables required by a user and map it as a data source in Redash, as sketched below.
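A minimal sketch of that workaround, assuming two existing Athena databases sales_db and crm_db; the combined database and view names are hypothetical:

-- create the combined database for one user/team
CREATE DATABASE IF NOT EXISTS team_combined;

-- expose only the required tables from other databases as views
CREATE OR REPLACE VIEW team_combined.orders AS
SELECT * FROM sales_db.orders;

CREATE OR REPLACE VIEW team_combined.customers AS
SELECT * FROM crm_db.customers;

team_combined is then mapped as a single data source in Redash. You can see why this does not scale: every new combination of tables needs a new database.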

Apache Superset to the rescue

Superset supports both JDBC- and REST-based Athena access, and it provides interactive dashboards and scheduled email reporting. With the REST-based connector, it performs similarly to running a query on the Athena console. Superset amazed us, but the journey to a production deployment with scheduled email reporting was not quite straightforward.
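For reference, the REST-based access goes through the PyAthena driver. A Superset database connection URI looks roughly like the following; the values in braces, the region, and the staging bucket are placeholders:

awsathena+rest://{aws_access_key_id}:{aws_secret_access_key}@athena.ap-south-1.amazonaws.com:443/{schema_name}?s3_staging_dir=s3://your-athena-query-results-bucket/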

Superset uses Selenium to render a dashboard in headless Firefox and export it from the browser as an image. The recipe works wonders if you have mutually compatible versions of Firefox, Selenium, and GeckoDriver.
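Conceptually, the report rendering boils down to something like the sketch below. This is a simplified illustration, not Superset's internal code; the dashboard URL and output path are placeholders:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Firefox is driven through GeckoDriver; --headless avoids the need for a real display
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)

try:
    # load the dashboard the way the reporting user would see it
    driver.get("http://0.0.0.0:8088/superset/dashboard/1/")
    # window size affects how the dashboard is laid out (see WEBDRIVER_WINDOW below)
    driver.set_window_size(1600, 2000)
    # export the rendered page to a PNG, which gets attached to the report email
    driver.save_screenshot("/tmp/dashboard.png")
finally:
    driver.quit()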

In the section below, we'll go through the steps for successfully deploying Apache Superset with the dependencies needed for the email reporting feature.

Steps to deploy Superset in production

Don't install from the master branch (add this to your checklist)

After cloning the project, always check out the latest release tag, as shown below. We observed that using the master branch can cause unexpected behavior, and even a release branch can introduce new issues. It is always better to pick a stable commit and test it before deploying to production.
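For example (the tag name below is illustrative; pick the latest stable release at the time you deploy):

git clone https://github.com/apache/incubator-superset.git
cd incubator-superset
# list release tags, newest first, and check out a stable one
git tag --sort=-creatordate | head
git checkout tags/0.35.1 -b deploy-0.35.1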

Prefer Docker deployment

# install docker
sudo yum install docker -y
sudo usermod -a -G docker ec2-user
sudo service docker start
# install docker-compose (sudo is needed to write to /usr/local/bin)
sudo curl -L https://github.com/docker/compose/releases/download/1.25.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
# if root needs access to docker-compose
sudo cp /usr/local/bin/docker-compose /usr/bin/

Changes in the docker-compose.yml file

version: '3.5'
services:
  redis:
    image: redis:3.2
    restart: unless-stopped
    ports:
      - "127.0.0.1:6379:6379"
    volumes:
      - redis:/data
  postgres:
    image: postgres:10
    restart: unless-stopped
    environment:
      POSTGRES_DB: superset
      POSTGRES_PASSWORD: superset
      POSTGRES_USER: superset
    ports:
      - "127.0.0.1:5432:5432"
    volumes:
      - postgres:/var/lib/postgresql/data
  superset:
    build:
      context: ../../
      dockerfile: contrib/docker/Dockerfile
      # <-- change: added shm_size due to headless browser driver failures
      shm_size: '2gb'
    restart: unless-stopped
    environment:
      POSTGRES_DB: superset
      POSTGRES_USER: superset
      POSTGRES_PASSWORD: superset
      POSTGRES_HOST: postgres
      POSTGRES_PORT: 5432
      REDIS_HOST: redis
      REDIS_PORT: 6379
      # <-- change: production instead of development; if using production,
      # comment out the development volume below
      SUPERSET_ENV: production
      # SUPERSET_ENV: development
    user: root:root
    ports:
      - 8088:8088
    depends_on:
      - postgres
      - redis
    volumes:
      # this is needed to communicate with the postgres and redis services
      - ./superset_config.py:/home/superset/superset/superset_config.py
      # this is needed for development only; remove when SUPERSET_ENV=production
      # - ../../superset:/home/superset/superset <-- commented out
volumes:
  postgres:
    external: false
  redis:
    external: false

Append the below configuration to the superset_config.py file

ENABLE_SCHEDULED_EMAIL_REPORTS = True
EMAIL_NOTIFICATIONS = True
# smtp server configuration
SMTP_HOST = "email-smtp.ap-south-1.amazonaws.com"
SMTP_STARTTLS = False
SMTP_SSL = True
SMTP_USER = ""
SMTP_PORT = 465
SMTP_PASSWORD = ""
SMTP_MAIL_FROM = "no-reply@company.com"
# Email reports - minimum time resolution (in minutes) for the crontab
EMAIL_REPORTS_CRON_RESOLUTION = 15
# Email report configuration
# From address in emails
EMAIL_REPORT_FROM_ADDRESS = "no-reply@company.com"
# Send bcc of all reports to this address. Set to None to disable.
# This is useful for maintaining an audit trail of all email deliveries.
EMAIL_REPORT_BCC_ADDRESS = None
# User credentials to use for generating reports
# This user should have permissions to browse all the dashboards and
# slices.
# TODO: In the future, login as the owner of the item to generate reports
EMAIL_REPORTS_USER = "admin"
EMAIL_REPORTS_SUBJECT_PREFIX = "[Superset Report] "
EMAIL_REPORTS_WEBDRIVER = "firefox"
# Window size - this will impact the rendering of the data
WEBDRIVER_WINDOW = {"dashboard": (1600, 2000), "slice": (3000, 1200)}
# The base URL to query for accessing the user interface
WEBDRIVER_BASEURL = "http://0.0.0.0:8088/"

Update the Dockerfile to include xvfb, firefox-esr (version 68), PyAthena, ChromeDriver, and GeckoDriver.

FROM python:3.6-jessie

RUN useradd --user-group --create-home --no-log-init --shell /bin/bash superset

# Configure environment
ENV LANG=C.UTF-8 \
    LC_ALL=C.UTF-8

RUN apt-get update -y

# Install dependencies to fix `curl https support error` and `delaying package configuration warning`
RUN apt-get install -y apt-transport-https apt-utils
# Install superset dependencies
# https://superset.incubator.apache.org/installation.html#os-dependencies
RUN apt-get install -y build-essential libssl-dev \
libffi-dev python3-dev libsasl2-dev libldap2-dev libxi-dev \
default-jre libgtk-3-0 xvfb firefox-esr
# Install extra useful tool for development
RUN apt-get install -y vim less postgresql-client redis-tools
# Install nodejs for custom build
# https://superset.incubator.apache.org/installation.html#making-your-own-build
# https://nodejs.org/en/download/package-manager/
RUN curl -sL https://deb.nodesource.com/setup_10.x | bash - \
&& apt-get install -y nodejs
WORKDIR /home/superset

COPY requirements.txt .
COPY requirements-dev.txt .
COPY contrib/docker/requirements-extra.txt .
RUN pip --default-timeout=120 install --upgrade setuptools pip \
&& pip --default-timeout=120 install -r requirements.txt -r requirements-dev.txt -r requirements-extra.txt \
&& pip install "PyAthena>1.2.0" \
&& rm -rf /root/.cache/pip
RUN wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
RUN tar -x geckodriver -zf geckodriver-v0.24.0-linux64.tar.gz -O > /usr/bin/geckodriver
RUN chmod +x /usr/bin/geckodriver
RUN rm geckodriver-v0.24.0-linux64.tar.gz
RUN wget -q "https://chromedriver.storage.googleapis.com/79.0.3945.36/chromedriver_linux64.zip" -O /tmp/chromedriver.zip \
&& unzip /tmp/chromedriver.zip -d /usr/bin/ \
&& rm /tmp/chromedriver.zip
COPY --chown=superset:superset superset superset

ENV PATH=/home/superset/superset/bin:$PATH \
    PYTHONPATH=/home/superset/superset/:$PYTHONPATH

USER superset

RUN cd superset/assets \
&& npm ci \
&& npm run build \
&& rm -rf node_modules
# NOTE: a backgrounded process and an exported variable do not survive their RUN layer,
# so set DISPLAY here and start Xvfb at container runtime (see the verification steps below)
ENV DISPLAY=:10
COPY contrib/docker/docker-init.sh .
COPY contrib/docker/docker-entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
HEALTHCHECK CMD ["curl", "-f", "http://localhost:8088/health"]

EXPOSE 8088

Build and run the local Docker image

# Build local image
docker-compose run --rm superset ./docker-init.sh
# first-time run: start docker-compose without detached mode
docker-compose up
# the above cmd will take some time; stop the containers after it completes (Ctrl+C)
# run docker-compose again in detached mode
docker-compose up -d
# check that Xvfb is running and the DISPLAY env variable is set
docker exec -it superset_superset_1 bash
# if not, run the below commands
root@98d29760875a:/home/superset# Xvfb :10 -ac &
root@98d29760875a:/home/superset# export DISPLAY=:10
# check that the Selenium, Firefox, and GeckoDriver setup is working
root@98d29760875a:/home/superset# python3
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://seleniumhq.org/')
browser.title
# Superset with the dashboard emailing feature should now work without any issues

Summary

  • QuickSight is a great, cost-effective tool in the enterprise BI class. It would be unfair to compare it with open-source BI tools.
  • For startups or organizations just introducing a data lake, it would be wise to go with open-source BI tools in the initial days.
  • Through this article, I intended to make it easy to deploy Superset with scheduled email reporting and the dashboard/slice emailing feature.
  • Superset provides excellent granular access control over UI components and database objects. With role-based user management, it becomes an effective BI tool.
  • A request to the Superset community: start shipping a Docker image with the email reporting dependencies pre-loaded, instead of making users go through an article like this to get it working :)

If you are evaluating BI tools, do consider modern open-source alternatives, e.g. redash.io, metabase.io, and Apache Superset.

Found this article useful? Then hit that 👏 button below to spread it around even more. Also, please comment to help improve the article!


Nilesh Bhosale

An entrepreneur by nature | Problem Solver | Data Architect | Fin-tech