Connecting Druid to Tableau

Tanushree Rohera
9 min read · Sep 10, 2021


If you use Apache Druid and want to bring its data into Tableau, this post may be of interest to you. It summarizes the learnings my colleagues and I gained while helping a few customers establish this connection. We put this together because, as Druid's and Tableau's popularity keeps growing, there is an increasing need to integrate the two. The technologies, however, work quite differently, so there are nuances in getting them to work together. This write-up gives you an idea of what Druid is and how Tableau connects to it.

Disclaimers:

This is not meant to serve as a best-practice guide, and we encourage you to complement it with your own learnings, knowledge, and understanding of Druid and the data environment you are dealing with. In fact, if you are passionate about either of these technologies and want to develop and refine this document, feel free to reach out. This is also a collection of personal learnings that has not been officially vetted by the Tableau development team, and therefore isn't meant to serve as an official help article. As of the date of this blog post, there isn't an officially supported Tableau-Druid connector.

Acknowledgments:

The learnings for this post come from detailed tests run by my customers (whom I keep learning from) and my colleagues, Prashant Keshri and Prashant Sharma.

What is Apache Druid?

Apache Druid is an open-source real-time analytics database designed for fast slice-and-dice analytics. It is increasingly becoming a key architectural component for teams that need to query real-time data sources.

Because of its query speed, its aggregation optimization capabilities, and how well it works with event data, it has become increasingly popular with new-age start-ups, e-commerce companies, and marketing agencies. You can read more about the technology and its use-cases here. If you're wondering who has been using Druid, you may want to read this post by Airbnb or watch this video presented by Netflix.

…So how do Tableau and Druid fit in together?

Given that Druid is an event-oriented engine focused on 'real-time' insights, you may not expect to see it in the same data pipeline as Tableau. In the talks shared above, Airbnb and Netflix pair Druid with a separate open-source reporting platform such as Apache Superset. That said, here are some of the reasons why my customers have considered pairing it with Tableau:

  1. Ease-of-use: Open-source reporting tools are generally quite SQL-based. For example, one alternative is an open-source tool called 'Redash', where a user needs to write out a SQL statement, bring the results into Redash's in-memory layer, and then build a visualization on top. This is where Tableau's native user-friendliness becomes really important. Druid often captures key event and order data: data that needs to be tracked and used across the organization. It therefore needs to be accessible to all analysts, regardless of their comfort with scripting.
  2. Visual Governance Capabilities: As an enterprise analytics tool, Tableau has a well-developed governance framework. Through the use of sites, projects, user roles, and row-level security, there are multiple levels of content authorization and authentication, and setting them up is a very visual process. While a tool such as Apache Superset may be an easier-to-use open-source alternative, and (as of its later releases) has stronger governance capabilities, it does require some development expertise.
  3. Mobile: A lot of these dashboards tend to capture the most important data points for end business users and having the metrics accessible on any device is very important. The open-source alternatives may or may not be built for mobile compatibility.
  4. Volume or Velocity of Data: Some of the BI alternatives have limitations on the volume of data they can analyze on their platform. And yes, Tableau isn't built to stream in real time, but it is flexible enough to be tweaked to your requirements.

How can Druid data be brought into Tableau?

Ways of bringing Druid data into Tableau: through a supported connector such as Hive/Presto, or via the Apache Calcite JDBC driver

To bring Druid data into Tableau there are two major choices:

  1. Bring the data in through an intermediary database (the ones we've come across are Hortonworks Hadoop Hive and Presto).
  2. Leverage the JDBC connector provided by Apache Calcite.

When deciding between the two, you may want to weigh the following:

  1. Supported vs. Unsupported Connectors: The advantage of using Hive or Presto is, of course, that they are both supported connectors, and Tableau's support team would be able to troubleshoot any driver-related concerns that come up. The Apache Calcite JDBC driver, on the other hand, is open-source and unsupported, so if there are any concerns with the driver, you would need Druid expertise in-house to troubleshoot the issue.
  2. Openness to having a new element in your data architecture: Although recommended, going down the supported-connector route could mean adding an element to the data pipeline, and sometimes that may not be the most feasible option for your platform team. Especially if you have been using Druid for a while, you are probably used to pairing the driver with another reporting layer, and if it works fine there, you would assume it would work fine in Tableau. Right? (Sort of. We'll get into what you need to keep in mind in the next section.)
  3. Time to set up: Finally, if you are looking to get something up and running quickly, especially to test, the JDBC driver is fast to set up. It just needs to be installed in the right location and set up with the right permissions, and it is ready to use. You can read more on that here, and a minimal connection sketch follows this list.
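To make that concrete, here is a minimal sketch of what the connection typically looks like through Tableau's 'Other Databases (JDBC)' connector. The host, port, and folder below are assumptions for illustration; check your own Druid deployment (Druid's SQL-over-Avatica endpoint is commonly exposed on the Broker or Router):

    Driver JAR     : the Avatica client JAR, placed in Tableau's driver folder
                     (e.g. C:\Program Files\Tableau\Drivers on Windows)
    Driver class   : org.apache.calcite.avatica.remote.Driver
    Connection URL : jdbc:avatica:remote:url=http://<broker-host>:8082/druid/v2/sql/avatica/
    Dialect        : SQL92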

I would like to go down the JDBC Route, what should I know?

Alright! If you’re going to go down the JDBC Route (you have been warned..), we would definitely recommend having both a Druid expert and a Tableau expert with you. Let’s go over why:

  1. SQL is not Druid's native query language: Druid's native query language is JSON-based. Tableau generates SQL. Any conversion from Tableau's SQL to JSON is done by the driver, Apache Calcite. There are times when things get lost in translation, and the error messages are not always the most useful for troubleshooting the concern. This is where having a Druid expert on hand to read and understand the Druid logs is very important. They would be able to identify whether there's a timeout on Druid, whether Druid has lower-than-required specifications, and so on. If you're relatively new to Druid, it's probably best to go down the supported-connector route.
A snapshot of a driver-related error message in Tableau
  2. Tableau allows you to generate SQL statements of varying levels of complexity without you realizing it: Tableau was built to let analysts go really deep with their analysis without ever writing a single line of code. The flip side of this is that you may generate SQL that the driver cannot translate, or that is too complicated for the data source (in this case, Druid) to handle. So having a Tableau expert who understands the SQL Tableau generates, what counts as a complicated query, and the workarounds definitely helps. If you are a Tableau expert and have not come across the Performance Recorder, consider using it to capture the SQL statements Tableau generates; a sketch of the kind of SQL Tableau can emit follows below.
A snapshot of a sample Performance Recording output, taken from Anirudh Khanna’s blog.
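For a sense of what to expect in those recordings, here is a hedged sketch of the kind of SQL Tableau typically emits for a simple aggregated view. The datasource and column names are made up for illustration, and the alias style only mirrors what Tableau tends to generate:

    -- A simple aggregated view over a hypothetical 'orders' datasource.
    -- Calcite must translate this into a native JSON groupBy query; the
    -- more nesting and functions Tableau adds, the more can get lost.
    SELECT "orders"."channel" AS "channel",
           SUM("orders"."revenue") AS "sum:revenue:ok"
    FROM "druid"."orders" "orders"
    GROUP BY 1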

Were there any places you saw these driver concerns come up?

Although multiple JDBC errors occurred because Druid's resources were insufficient, the two examples below were related to driver translation and complex calculations:

  • Example 1: The one where we used a date filter on the data source: To reduce the query load on the database, we added a relative data-source filter on a date, to include only the last two days' worth of data. As you can see in the image below, when you apply such a filter in Tableau, it generates a SQL statement containing a 'date' calculation. In this case, because we had also converted the date field into a date-time field, it created a "Normalize_DATETIME" calculation. If you look at Druid's list of supported functions, it doesn't support such calculations. The error message for this was just 'Generic JDBC Error', so we had to iterate through a few steps to isolate the calculation causing the error. (A Druid-friendly way to express the same filter is sketched after this list.)
Query screenshot from the Tableau Performance Recording workbook.
  • Example 2: The one where we used an LOD: If you have used Tableau before, you have probably used 'Level-of-Detail' (LOD) calculations. They are powerful analytical expressions that allow you to do all kinds of analysis in Tableau. But you have probably also realized that they are expensive operations. A typical level-of-detail operation, as you can see in the screenshot below (highlighted in green), tends to generate a fairly detailed sub-query as part of the original query. At the time of testing, sub-queries were not supported by Druid, and therefore operations like this would also time out. (The general shape of such a query is sketched after this list.)
Image from Priyatham Pamu and Lingyan Yin Tableau Conference 2019 Session
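To illustrate both failure modes without the screenshots, here are two hedged sketches. All datasource and column names ("orders", "channel", "revenue", "customer_id") are hypothetical.

First, a Druid-SQL-friendly way to express the 'last two days' filter from Example 1, sidestepping the date-normalization calculation Tableau generated:

    -- Filtering on Druid's built-in __time column keeps the WHERE
    -- clause inside what Druid SQL supports (illustrative names):
    SELECT "channel",
           SUM("revenue") AS "revenue"
    FROM "druid"."orders"
    WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '2' DAY
    GROUP BY "channel"

Second, roughly the shape of SQL a FIXED LOD from Example 2 can produce. Something like { FIXED [Customer Id] : SUM([Revenue]) } becomes an inner aggregate joined back to the outer query, and that inner sub-query is what Druid rejected at the time:

    -- The inner SELECT is the LOD's sub-query (illustrative names):
    SELECT "t0"."customer_id",
           "t1"."customer_revenue"
    FROM "druid"."orders" "t0"
    INNER JOIN (
        SELECT "customer_id",
               SUM("revenue") AS "customer_revenue"
        FROM "druid"."orders"
        GROUP BY "customer_id"
    ) "t1"
      ON "t0"."customer_id" = "t1"."customer_id"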

Alright, so how do I work around these concerns?

The Showrunner: Custom SQL!

  • Custom SQL, usually a Tableau performance guru's nightmare, saved us when it came to Tableau's connection with Druid. Unless you are really, really good at isolating all the permutations and combinations that could make Tableau's calculations conflict with Druid's native language, Custom SQL is a great way to control the SQL being sent to Druid. This resulted in queries that Druid could accept and reduced the frequency of timeout concerns. (A hypothetical example follows below.)
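As an illustration, here is a hypothetical Custom SQL query in that spirit. The idea is to pre-filter to recent data and pre-aggregate to the grain the dashboard needs, so the statement Tableau wraps around it stays simple; again, all names are made up:

    -- Pre-filter and pre-aggregate inside Druid SQL so Tableau's
    -- outer query has little left to add (illustrative names):
    SELECT FLOOR("__time" TO HOUR) AS "order_hour",
           "channel",
           SUM("revenue") AS "revenue",
           COUNT(*) AS "order_count"
    FROM "druid"."orders"
    WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '2' DAY
    GROUP BY 1, 2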

Supporting cast:

  • Table Calculations: Generally, we would not recommend combining Custom SQL with a live connection. But one of the highlights of Druid is its ability to work with real-time data, and live connections in Tableau give you the flexibility of near-real-time data access. So, for some of the more crucial dashboards, you could probably test it out. If you do use Custom SQL on a live dashboard, though, you will want to be very disciplined about the end data volume and the complexity of the transformations you do in Tableau. For things that require a bit of data transformation, consider using the super-versatile Table Calculations.
  • Extracts: If the complexity of calculations goes beyond what Table Calculations can solve, you may need to use extracts. You can set a short refresh schedule (the lowest available on a default schedule is 15 minutes, but refreshes can be triggered more frequently by scripting tabcmd refreshextracts). While this loses the 'near-real-time' benefit Druid is famous for, if your users are comfortable with a few minutes' lag on certain dashboards, you can consider this route. It still brings the benefits of governance, security, and mobile accessibility mentioned earlier.
  • Published Data Sources: Finally, to let more analysts use the Druid data source, publish it to Tableau Server so that they can connect to it directly without having to write Custom SQL themselves.

Any Final Points?

Apache Druid is a powerful analytical engine, Tableau is a powerful analytics and governance platform, and integrating the two seems natural. However, because they query data so differently, you need to understand both while bringing them together. The good news is that, like Tableau, Druid has a strong community of very passionate users, which means there tend to be plenty of community-written articles to support your troubleshooting.

If you’d like a direct connector to Tableau, be sure to add an idea to Tableau’s ideas forum — the Tableau dev team sources most new features from here!

And of course, reach out if you want to develop this article further :)
