Scaling out Tableau Extracts — Building a distributed Tableau Hyper Cluster

Tamas Foldi
Aug 11, 2019

Tableau Hyper Database (“Extract”) is a great engine; it’s one of the reasons people are obsessed with Tableau analytics. However, being a single-node database server, it has its limits in performance and data size, and those are exactly the limits we are going to break in this blog post. By adding multi-node capacity, partitioning and distributed query processing, Hyper can act like a real enterprise MPP data warehouse running on modern infrastructure like kubernetes.

By the end of this post, you will know how to build an MPP database from Hyper workers, and how to achieve 10–20x faster query times in your Tableau dashboards.

Background and Fundamentals

Tableau’s Hyper database was built from scratch with de facto standard features (LLVM code generation, in-memory capabilities, a columnar data store, etc.) and with Postgres network protocol and SQL dialect compatibility. It is a pretty neat database: fast, convenient and standard. With the new Extract API you can issue a full set of Postgres-like SQL statements, including COPY, to move data in and out in bulk.

No doubt, with minor tweaking we should be able to reach the Hyper database from libpq-based applications like psql and the Postgres ODBC driver, tapping into the pure power of this engine.

MPP, Shared-nothing

What is missing here is horizontal scalability: leveraging multiple worker nodes to speed up single queries and to ingest data in parallel. Traditionally, this is what the MPP (Massively Parallel Processing) architecture provides: multiple database worker nodes, each with its own dedicated, partial dataset, plus aggregator/master node(s) that combine the results from the workers.

Theoretically, by adding twice as many nodes you get twice the performance. Imagine a webshop where all of your transactions are stored in a single extract file. To calculate overall customer value, you process the transactions belonging to each customer, then visualize the results for all customers in your Tableau report. But what if you have ten computers to calculate this customer value? If every customer’s transactions are located on the same server (“co-locating related data on the same node”), then your algorithm can work independently on each server, for the customers stored on that node. Ten nodes, ten times the performance.
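
As a sketch of what co-location means in practice (the table names and the modulo-based shard key below are hypothetical, not from the original post), each worker only receives the rows whose shard key maps to it:

-- executed against worker 3 of 10: load only the rows whose shard key
-- maps to this node, so every transaction of a given customer lands on
-- the same worker and per-customer work never needs another node
INSERT INTO transactions
SELECT *
FROM staging_transactions
WHERE customer_id % 10 = 3;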

Imagine the following SQL statement:

SELECT sum(profit), state from SALES GROUP BY state

In a shared-nothing world, the query first hits the aggregator node that dispatches the query as subqueries to its worker nodes.

SELECT sum(profit), state from SALES GROUP BY state -- aggregator executes this query on worker1, which holds the East Coast states
--
SELECT sum(profit), state from SALES GROUP BY state -- aggregator executes this query on worker2, which holds the West Coast states

Workers execute the queries on their partial sets of data and return the aggregated results to the master node. Then, the master node combines the aggregated values in an additional step. Again, most of the work happens on the workers while the master node merely aggregates the aggregates. Pretty simple. This is how MemSQL, Teradata, Yellowbrick, Vertica, Hadoop Hive, Spark, Redshift and Greenplum work.
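
To make that combination step concrete, here is a small sketch (not taken from the original post; worker1_results and worker2_results are hypothetical names for the partial result sets shipped back by the workers). A distributive aggregate like sum can simply be summed again on the master; something like avg would have to travel as a sum and a count and be recombined:

-- partial results, one row per state, already aggregated on the workers:
-- worker1_results: ('NY', 120.0), ('MA', 80.0)
-- worker2_results: ('CA', 200.0), ('WA', 95.0)
SELECT state, sum(partial_profit) AS profit
FROM (
  SELECT state, partial_profit FROM worker1_results
  UNION ALL
  SELECT state, partial_profit FROM worker2_results
) AS all_workers
GROUP BY state; -- the master only "aggregates the aggregates"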

But what would it take to convert Hyper Database to this “MPP” architecture?

How are we going to do this?

We need a few things to make this conversion happen:

1. Build worker nodes that are generic Hyper Database services running independently from each other

  • Create a docker image from Hyper Database that is remotely accessible,
  • Deploy it on kubernetes as a Service to manage its elasticity.

2. Build a master node that will act as an aggregator. Postgres 11 has database link-like functionality that can push down queries to other Postgres databases (and hyper acts like a Postgres too!).

  • Deploy Postgres 11 on kubernetes,
  • Set up foreign data wrapper for Hyper,
  • Import and synchronize metadata across Hyper workers and Postgres master nodes.

3. Validate if aggregation happens on the shared-nothing worker nodes.

Building Hyper Docker image

Hyper Database is part of all Tableau products: Desktop, Server, Online, Prep and the standalone Extract API package. The easiest way is to start from a plain docker base image like centos, download the Extract API, initialize a new hyper database with hyperd configure and start the service with hyperd start. The final Dockerfile looks like this:

FROM centos:latest
MAINTAINER "Tamas Foldi" <tfoldi@starschema.net>
ADD https://downloads.tableau.com/tssoftware/extractapi-linux-x86_64-2019-2-2.tar.gz /tmp
RUN mkdir /hyper && \
    tar xvzf /tmp/extractapi-linux-x86_64-2019-2-2.tar.gz --strip 1 -C /hyper && \
    /hyper/bin/hyper/hyperd configure --log-dir /hyper/ -d /hyper/hyper.hyper --no-password --no-ssl --init-user tableau_internal_user
EXPOSE 7483
CMD /hyper/bin/hyper/hyperd start --log-dir /hyper/ -d /hyper/hyper.hyper --skip-license --no-password --no-ssl --listen-connection tab.tcp://0.0.0.0:7483

The image is published to Docker Hub, so you can simply pull it without building it yourself. After starting a container from the image, we can easily connect to it:

$ docker run -d -t tfoldi/tableau-hyperd-docker:release
490ea4faea9704683c35a78809b964737d3fca4156ffda3838b1fa54443b5d64
$ docker inspect 490ea4faea9704683c35a78809b964737d3fca4156ffda3838b1fa54443b5d64 | grep IPAddress
"SecondaryIPAddresses": null,
"IPAddress": "172.17.0.2",
"IPAddress": "172.17.0.2",
$ psql --host 172.17.0.2 --port 7483 --user tableau_internal_user
psql (9.2.24, server 9.1.0)
WARNING: psql version 9.2, server version 9.1.
Some psql features might not work.
Type "help" for help.
tableau_internal_user=# create table foo (bar integer);
CREATE TABLE

We can easily start a new Hyper daemon and connect to its standard port. Now let’s deploy it on kubernetes with the following Service and Deployment configuration:

apiVersion: v1
kind: Service
metadata:
  name: hyperd-mpp
  labels:
    app: hyperd-mpp
    role: worker
spec:
  ports:
  - port: 7483
    protocol: TCP
  selector:
    app: hyperd-mpp
    role: worker

Deployment/Pod definition. The replicas key defines how many hyperd workers should run at a given time.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hyperd-mpp-deployment
  labels:
    app: hyperd-mpp
    role: worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hyperd-mpp
      role: worker
  template:
    metadata:
      labels:
        app: hyperd-mpp
        role: worker
    spec:
      containers:
      - name: hyperd-worker
        image: tfoldi/tableau-hyperd-docker
        ports:
        - containerPort: 7483

If all looks good, we can apply these and have our Hyper workers up and running.
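
A minimal sketch of applying them, assuming the two manifests above are saved as hyperd-service.yaml and hyperd-deployment.yaml (the file names are my own, not from the original post):

$ kubectl apply -f hyperd-service.yaml
service/hyperd-mpp created
$ kubectl apply -f hyperd-deployment.yaml
deployment.apps/hyperd-mpp-deployment created

After a few seconds everything should be up: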

[tfoldi@kompi]% kubectl get all -l app=hyperd-mpp,role=worker
NAME READY STATUS RESTARTS AGE
pod/hyperd-mpp-deployment-c5fd49756-ccg9p 1/1 Running 0 20h
pod/hyperd-mpp-deployment-c5fd49756-x6q77 1/1 Running 0 20h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/hyperd-mpp ClusterIP 10.100.200.35 <none> 7483/TCP 20h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/hyperd-mpp-deployment 2/2 2 2 20h
NAME DESIRED CURRENT READY AGE
replicaset.apps/hyperd-mpp-deployment-c5fd49756 2 2 2 20h

Lovely. Now let’s move on to the more interesting part: how we are going to add an aggregator on top of this.

The Aggregator / Master node

The first step is to quickly deploy a Postgres 11 container to our kubernetes cluster with its own Service and Deployment definition.
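
The original post linked the exact Service and Deployment files; as a rough sketch of what they contain (my own simplification, assuming the stock postgres:11 image, a throwaway password and a LoadBalancer Service), they look something like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-master
  labels:
    app: hyperd-mpp
    role: master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hyperd-mpp
      role: master
  template:
    metadata:
      labels:
        app: hyperd-mpp
        role: master
    spec:
      containers:
      - name: postgres-master
        image: postgres:11
        env:
        - name: POSTGRES_PASSWORD
          value: "postgres"   # throwaway password for the demo, replace it in real deployments
        ports:
        - containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-master
  labels:
    app: hyperd-mpp
    role: master
spec:
  type: LoadBalancer
  ports:
  - port: 5432
    protocol: TCP
  selector:
    app: hyperd-mpp
    role: master

If all looks good, we should have a similar topology in our kubernetes cluster: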

$ kubectl get all -l app=hyperd-mpp
NAME READY STATUS RESTARTS AGE
pod/hyperd-mpp-deployment-c5fd49756-ccg9p 1/1 Running 0 20h
pod/hyperd-mpp-deployment-c5fd49756-x6q77 1/1 Running 0 20h
pod/postgres-master-69f8d75f9c-bltwx 1/1 Running 0 20h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/hyperd-mpp ClusterIP 10.100.200.35 <none> 7483/TCP 20h
service/postgres-master LoadBalancer 10.100.43.171 a52bd342.kubeschema 5432:31679/TCP 19h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/hyperd-mpp-deployment 2/2 2 2 20h
deployment.apps/postgres-master 1/1 1 1 20h
NAME DESIRED CURRENT READY AGE
replicaset.apps/hyperd-mpp-deployment-c5fd49756 2 2 2 20h
replicaset.apps/postgres-master-69f8d75f9c 1 1 1 20h

Testing the system

Our excellent MPP infrastructure is up; we have two Hyper workers and a Postgres master. The next step is to configure and test it. Let’s add some test data to the Hyper nodes first. For a very basic first test, I will distribute the data by date (the shard key): worker1 will have the data for 2016, while worker2 will have the data for 2017.

Creating the tables

To create a table on one of the nodes, connect to its console, open psql and issue a CREATE TABLE statement for 2017.

$ kubectl exec -it hyperd-mpp-deployment-c5fd49756-ccg9p -- /bin/sh
sh-4.2# psql --host localhost --port 7483 --user tableau_internal_user
tableau_internal_user=# CREATE TABLE temperatures_2017( at date,
city text,
mintemp integer,
maxtemp integer
);
CREATE TABLE

Then do the same for 2016 on the other worker:
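
A minimal sketch of the same step on the second worker (the pod name below is the second one from my kubectl output; yours will differ):

$ kubectl exec -it hyperd-mpp-deployment-c5fd49756-x6q77 -- /bin/sh
sh-4.2# psql --host localhost --port 7483 --user tableau_internal_user
tableau_internal_user=# CREATE TABLE temperatures_2016( at date,
city text,
mintemp integer,
maxtemp integer
);
CREATE TABLE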

We can insert the data directly from here, or later through the master node. A typical MPP data ingestion flow is to extract the data in parallel from the source systems and load it onto the workers independently. As you can imagine, running 10–20 COPY statements in parallel on different workers will speed up the extract creation process by 10–20x.
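
As an illustration, a per-worker load might look like the sketch below. This assumes the data is already split into per-year CSV files and that COPY over the psql connection behaves the same against hyperd as it does against stock Postgres; both are my assumptions, not something shown in the original post:

$ psql --host <worker1-ip> --port 7483 --user tableau_internal_user \
    -c "\copy temperatures_2016 from 'temperatures_2016.csv' with (format csv, header)" &
$ psql --host <worker2-ip> --port 7483 --user tableau_internal_user \
    -c "\copy temperatures_2017 from 'temperatures_2017.csv' with (format csv, header)" &
$ wait   # the two loads run in parallel, one per worker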

Linking workers with the master/aggregator

Now it’s time to tell our aggregator where its Hyper workers are. After logging on to our Postgres 11 master, we add the default foreign data wrapper to the system, define the remote servers and set up user mappings. The whole process is described in this fantastic document, albeit with another Postgres database instead of Hyper. However, as Hyper acts like Postgres, we can do exactly the same.

% psql --host 100.10.10.23 -U postgres --password
postgres=# CREATE EXTENSION postgres_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER hyper1 FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '192.168.143.173', port '7483', dbname '/hyper/hyper.hyper');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR postgres SERVER hyper1 OPTIONS (user 'tableau_internal_user');
CREATE USER MAPPING
-- repeat for hyper2 database
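
For completeness, the second worker is registered the same way; the sketch below uses a hypothetical pod IP, so substitute the address of your second hyperd pod:

postgres=# CREATE SERVER hyper2 FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host '192.168.143.174', port '7483', dbname '/hyper/hyper.hyper');
CREATE SERVER
postgres=# CREATE USER MAPPING FOR postgres SERVER hyper2 OPTIONS (user 'tableau_internal_user');
CREATE USER MAPPING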

The next step is to create the table for external consumers (i.e., Tableau dashboards). Let’s call our table temperatures and add temperatures_2016 and temperatures_2017 from the Hyper workers as partitions of this table.

postgres=# CREATE TABLE temperatures (
at date,
city text,
mintemp integer,
maxtemp integer
)
PARTITION BY RANGE (at);
CREATE TABLE
postgres=# CREATE FOREIGN TABLE temperatures_2016
PARTITION OF temperatures
FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
SERVER hyper1;
CREATE FOREIGN TABLE
postgres=# CREATE FOREIGN TABLE temperatures_2017
PARTITION OF temperatures
FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
SERVER hyper2;
CREATE FOREIGN TABLE

Everything seems fine, time to add a few records:

postgres=# INSERT INTO temperatures (at, city, mintemp, maxtemp) VALUES ('2016-08-03', 'London', 63, 73);
INSERT 0 1
postgres=# INSERT INTO temperatures VALUES ('2017-08-03', 'London', 63, 90);
INSERT 0 1

Time to test whether we can issue one single SELECT statement that collects the records from the underlying Hyper workers:

postgres=# select * from temperatures;
at | city | mintemp | maxtemp
------------+--------+---------+---------
2016-08-03 | London | 63 | 73
2017-08-03 | London | 63 | 90
(2 rows)
postgres=# explain (verbose, costs off) select * from temperatures;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Append
-> Foreign Scan on public.temperatures_2016
Output: temperatures_2016.at, temperatures_2016.city, temperatures_2016.mintemp, temperatures_2016.maxtemp
Remote SQL: SELECT at, city, mintemp, maxtemp FROM public.temperatures_2016
-> Foreign Scan on public.temperatures_2017
Output: temperatures_2017.at, temperatures_2017.city, temperatures_2017.mintemp, temperatures_2017.maxtemp
Remote SQL: SELECT at, city, mintemp, maxtemp FROM public.temperatures_2017
(7 rows)

As we can see, Postgres split the query into two separate foreign scans, then combined them into a single result set.

However, we have some issues here with the aggregation push down:

postgres=# explain (verbose, costs off) select avg(maxtemp),city from temperatures group by 2;
QUERY PLAN
------------------------------------------------------------------------------
HashAggregate
Output: avg(temperatures_2016.maxtemp), temperatures_2016.city
Group Key: temperatures_2016.city
-> Append
-> Foreign Scan on public.temperatures_2016
Output: temperatures_2016.city, temperatures_2016.maxtemp
Remote SQL: SELECT city, maxtemp FROM public.temperatures_2016
-> Foreign Scan on public.temperatures_2017
Output: temperatures_2017.city, temperatures_2017.maxtemp
Remote SQL: SELECT city, maxtemp FROM public.temperatures_2017
(10 rows)

As we can see, when issuing aggregate queries on a partitioned table backed by foreign tables, the aggregation is not executed on the workers. This is a showstopper for us: given the use case, we cannot do all the aggregation on the master node. We expect billions of rows on each worker, and streaming them to the master would eliminate all the performance gains we are hoping for. There is hope, though: if we execute aggregate statements directly against the foreign tables, Postgres correctly pushes them down to the workers:

postgres=# explain (verbose, costs off) select avg(maxtemp),city from temperatures_2017 group by 2;
QUERY PLAN
----------------------------------------------------------------------------------
Foreign Scan
Output: (avg(maxtemp)), city
Relations: Aggregate on (public.temperatures_2017)
Remote SQL: SELECT avg(maxtemp), city FROM public.temperatures_2017 GROUP BY 2
(4 rows)

Tableau Union

It seems we should change strategy here. Instead of using partitioned tables, we might need to create a union of the foreign tables directly in Tableau.

However, foreign tables are not visible in Tableau Desktop by default. This is a product issue: Tableau’s schema inspection queries do not include foreign tables. A quick workaround is to create views on top of these foreign tables:

postgres=# create view v_temperatures_2017 as select * from temperatures_2017;
CREATE VIEW
postgres=# create view v_temperatures_2016 as select * from temperatures_2016;
CREATE VIEW

It seems the workaround works (this is why it is called a workaround).

After quickly building the union, we can see what queries Tableau generates and how our Hyper MPP cluster executes them:

postgres=# explain verbose SELECT AVG("t0"."maxtemp") AS "avg:maxtemp:ok",
"t0"."city" AS "city"
FROM (
SELECT "t1"."city" AS "city", "t1"."maxtemp" AS "maxtemp"
FROM (
SELECT CAST("v_temperatures_2016"."city" AS TEXT) AS "city",
"v_temperatures_2016"."maxtemp" AS "maxtemp"
FROM "public"."v_temperatures_2016" "v_temperatures_2016"
) "t1"
UNION ALL
SELECT "t2"."city" AS "city", "t2"."maxtemp" AS "maxtemp"
FROM (
SELECT CAST("v_temperatures_2017"."city" AS TEXT) AS "city",
"v_temperatures_2017"."maxtemp" AS "maxtemp"
FROM "public"."v_temperatures_2017" "v_temperatures_2017"
) "t2"
) "t0"
GROUP BY 2;
QUERY PLAN
------------------------------------------------------------------------------------------------
HashAggregate (cost=356.50..359.00 rows=200 width=64)
Output: avg(temperatures_2016.maxtemp), temperatures_2016.city
Group Key: temperatures_2016.city
-> Append (cost=100.00..342.85 rows=2730 width=36)
-> Foreign Scan on public.temperatures_2016 (cost=100.00..150.95 rows=1365 width=36)
Output: temperatures_2016.maxtemp, temperatures_2016.city
Remote SQL: SELECT city, maxtemp FROM public.temperatures_2016
-> Foreign Scan on public.temperatures_2017 (cost=100.00..150.95 rows=1365 width=36)
Output: temperatures_2017.maxtemp, temperatures_2017.city
Remote SQL: SELECT city, maxtemp FROM public.temperatures_2017
(10 rows)

Still not what we want. Tableau does not push the aggregation down to the level of the individual tables.

Tableau Custom SQL

With Tableau Custom SQL we can control exactly which queries get executed. No surprise here: with a proper custom SQL, our MPP cluster runs the right queries with the right pushdown.

postgres=# explain verbose SELECT CAST("Custom SQL Query"."city" AS TEXT) AS "city",
SUM("Custom SQL Query"."maxtemp") AS "sum:maxtemp:ok"
FROM (
select sum(maxtemp) maxtemp, sum(mintemp) mintemp,
count(maxtemp) maxtemp_count, count(mintemp) mintemp_count,
city, at from temperatures_2016 group by 5,6
union all
select sum(maxtemp) maxtemp, sum(mintemp) mintemp,
count(maxtemp) maxtemp_count, count(mintemp) mintemp_count,
city, at from temperatures_2017 group by 5,6
) "Custom SQL Query"
GROUP BY 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=284.80..287.30 rows=200 width=64)
Output: "*SELECT* 1".city, sum("*SELECT* 1".maxtemp)
Group Key: "*SELECT* 1".city
-> Append (cost=109.60..282.80 rows=400 width=40)
-> Subquery Scan on "*SELECT* 1" (cost=109.60..140.40 rows=200 width=40)
Output: "*SELECT* 1".city, "*SELECT* 1".maxtemp
-> Foreign Scan (cost=109.60..138.40 rows=200 width=68)
Output: (sum(temperatures_2016.maxtemp)), NULL::bigint, NULL::bigint, NULL::bigint, temperatures_2016.city, temperatures_2016.at
Relations: Aggregate on (public.temperatures_2016)
Remote SQL: SELECT sum(maxtemp), NULL::bigint, city, at FROM public.temperatures_2016 GROUP BY 3, 4
-> Subquery Scan on "*SELECT* 2" (cost=109.60..140.40 rows=200 width=40)
Output: "*SELECT* 2".city, "*SELECT* 2".maxtemp
-> Foreign Scan (cost=109.60..138.40 rows=200 width=68)
Output: (sum(temperatures_2017.maxtemp)), NULL::bigint, NULL::bigint, NULL::bigint, temperatures_2017.city, temperatures_2017.at
Relations: Aggregate on (public.temperatures_2017)
Remote SQL: SELECT sum(maxtemp), NULL::bigint, city, at FROM public.temperatures_2017 GROUP BY 3, 4
(16 rows)

This is some sort of victory (a partial one, maybe). With a combination of parameters, even dynamic parameters, we can tackle complex use cases on large data sets. But we have to pre-define the aggregate calculations (sum, count) in advance, add every attribute to the SELECT's field list and define parameters for all quick filters. Doable, but not convenient. Still, sacrificing convenience for extreme performance in critical use cases - that's what we do all the time.
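
To illustrate what "parameters for all quick filters" means in practice, here is a sketch of how the custom SQL above could be parameterized. The <Parameters.City> placeholder stands for a hypothetical Tableau parameter inserted from the custom SQL editor; it is my example, not something from the original post:

select sum(maxtemp) maxtemp, sum(mintemp) mintemp,
       count(maxtemp) maxtemp_count, count(mintemp) mintemp_count,
       city, at
from temperatures_2016
where city = <Parameters.City>
group by 5, 6
union all
select sum(maxtemp) maxtemp, sum(mintemp) mintemp,
       count(maxtemp) maxtemp_count, count(mintemp) mintemp_count,
       city, at
from temperatures_2017
where city = <Parameters.City>
group by 5, 6

This way both the aggregation and the filter predicate are pushed down to the workers, and the parameter control in the dashboard takes the place of the quick filter.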

Conclusion

After spending a few hours on it, I was able to build a distributed, shared-nothing MPP database cluster powered by Hyper. It supports parallel ingestion and horizontal and vertical scaling, spreading queries across servers in a kubernetes cluster. The only drawback is the custom SQL based data source definition required in Tableau, since Postgres has some limitations on aggregate pushdown for partitioned tables.

In the next article, I will run a few performance measurements to see how this system scales with the number of workers, using real-life use cases.

All sources are uploaded to: https://github.com/tfoldi/tableau-hyperd-docker

If you have any questions feel free to drop a comment, I am happy to answer.


Originally published at https://databoss.starschema.net on August 11, 2019.
