Refresh Tableau data sources with Airflow using Personal Access Tokens

Baruch Shteken · HeyJobs Tech · Mar 30, 2023

The struggle

We had a working DAG that refreshed our Tableau data sources using a username and password. However, Tableau then forced all our users to use two-step authentication. This meant we could no longer use the username-and-password method, and the only other option available was Personal Access Tokens (PATs).

What are PATs?

PATs are tokens that can be used to retrieve a credential token; the credential token, in turn, can be used to send API requests to Tableau.

For example, a request to get a credential token:

curl "https://eu-west-1a.online.tableau.com/api/3.17/auth/signin" -X POST -d "<tsRequest><credentials personalAccessTokenName='token_name'  personalAccessTokenSecret='token_value'><site contentUrl='site_id'/></credentials></tsRequest>"

token_name and token_value are the name and value of the token you created in Tableau; site_id is the name of your Tableau Cloud site.

Then, a request to get all data sources in Tableau:

curl "https://eu-west-1a.online.tableau.com/api/3.17/sites/site_id/datasources" -X GET -H "X-Tableau-Auth:credential_token"

Here, site_id and credential_token are taken from the response to the previous request.
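
The same two calls can be made from Python. Here is a minimal sketch using the requests library, with the same token_name, token_value, and site_id placeholders as above:

import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://eu-west-1a.online.tableau.com/api/3.17"
NS = {"t": "http://tableau.com/api"}  # namespace of Tableau REST responses

# Sign in with a PAT to obtain a credential token and the site id.
signin_body = (
    "<tsRequest><credentials personalAccessTokenName='token_name' "
    "personalAccessTokenSecret='token_value'>"
    "<site contentUrl='site_id'/></credentials></tsRequest>"
)
resp = requests.post(f"{BASE_URL}/auth/signin", data=signin_body)
resp.raise_for_status()
credentials = ET.fromstring(resp.content).find("t:credentials", NS)
credential_token = credentials.get("token")
site_id = credentials.find("t:site", NS).get("id")

# Use the credential token to list all data sources on the site.
resp = requests.get(
    f"{BASE_URL}/sites/{site_id}/datasources",
    headers={"X-Tableau-Auth": credential_token},
)
resp.raise_for_status()
print(resp.text)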

Important points for PAT usage:

  1. Each PAT is valid for 1 year.
  2. Each credential token is valid for 4 hours.
  3. Once a new credential token is generated, the previous one from the same PAT becomes invalid.

Initial architecture

We have multiple domains in our company, and each domain has its own DAG. We also have a dedicated DAG for Tableau data source refreshes (the Tableau Refresh DAG). In the Tableau Refresh DAG, each domain has its own sensor; when the respective domain DAG finishes, the sensor is marked as successful and the data sources related to that domain start to refresh. The tasks responsible for the refreshes are marked as successful only when the data source refreshes finish in Tableau Cloud. The DAG finishes when all refreshes finish.

The following diagram illustrates the Tableau Refresh DAG:

Depiction of Tableau Refresh DAG

This way, data sources belonging to domains whose DAGs finish quickly are refreshed before the other data sources.

In this architecture, the API requests used a username and password, and the authentication details were stored in a Tableau connection in Airflow.
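
A minimal sketch of this layout, with hypothetical domain and data source names (the sensors assume the domain DAGs run on the same schedule as the Tableau Refresh DAG):

from datetime import datetime

from airflow import DAG
from airflow.providers.tableau.operators.tableau import TableauOperator
from airflow.sensors.external_task import ExternalTaskSensor

# Hypothetical mapping of domains to their Tableau data sources.
DOMAINS = {
    "sales": ["sales_datasource"],
    "marketing": ["marketing_datasource"],
}

with DAG(
    dag_id="tableau_refresh",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
) as dag:
    for domain, datasources in DOMAINS.items():
        # Marked as successful once the respective domain DAG finishes.
        wait_for_domain = ExternalTaskSensor(
            task_id=f"wait_for_{domain}_dag",
            external_dag_id=f"{domain}_dag",
        )
        for ds in datasources:
            # With blocking_refresh=True the task succeeds only when
            # the refresh finishes in Tableau Cloud.
            refresh = TableauOperator(
                task_id=f"refresh_{ds}",
                resource="datasources",
                method="refresh",
                find=ds,
                match_with="name",
                blocking_refresh=True,
                tableau_conn_id="tableau_default",
            )
            wait_for_domain >> refresh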

Airflow design for PAT

The current Airflow code base supports the usage of PATs (however, this support is deprecated). The reason for the deprecation is that the current Airflow code generates a new credential token for each request, which invalidates the previous credential tokens. We usually want to wait for a data source refresh to finish, and that involves requesting data from Tableau: we send a request for the refresh status every couple of minutes until Tableau replies that the refresh has finished. If the credential token changes in between, the status request fails with a sign-in error.
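
The failure mode can be reproduced outside Airflow with two sign-ins on the same PAT. A minimal sketch, reusing the placeholders from the earlier requests (the 401 status is our reading of Tableau's sign-in error; treat it as an assumption):

import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://eu-west-1a.online.tableau.com/api/3.17"
NS = {"t": "http://tableau.com/api"}

def sign_in() -> tuple[str, str]:
    """Trade the PAT for a fresh credential token and the site id."""
    body = (
        "<tsRequest><credentials personalAccessTokenName='token_name' "
        "personalAccessTokenSecret='token_value'>"
        "<site contentUrl='site_id'/></credentials></tsRequest>"
    )
    resp = requests.post(f"{BASE_URL}/auth/signin", data=body)
    resp.raise_for_status()
    creds = ET.fromstring(resp.content).find("t:credentials", NS)
    return creds.get("token"), creds.find("t:site", NS).get("id")

first_token, site_id = sign_in()
second_token, _ = sign_in()  # invalidates first_token (same PAT)

# Polling a status endpoint with the stale token now fails with a
# sign-in error, which is what happened inside our Airflow tasks.
resp = requests.get(
    f"{BASE_URL}/sites/{site_id}/datasources",
    headers={"X-Tableau-Auth": first_token},
)
print(resp.status_code)  # expected: 401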

If we want to continue using the Airflow code, we have 2 options:

  1. Change the code so it reuses the current credential token instead of generating a new one for each request.
  2. Instantiate a pool of PATs and use them with the current Airflow code.

We decided to go with the second option because we needed a fix urgently and we didn't know how long it would take to change the Airflow code and contribute it to the Airflow code base.

Refactored architecture

The new architecture includes the following new components:

  • Tableau PATs
  • AWS Parameter Store
  • Airflow integration with PATs

Tableau PATs

We created 10 PATs in our master Tableau account.

AWS Parameter Store

In AWS Parameter Store, we keep 10 variables containing connection strings to Tableau, each with a different PAT. These variables are injected into Airflow as environment variables, so they can’t be seen on the connections page in Airflow. Airflow treats these environment variables as connections because their names all start with the AIRFLOW_CONN_ prefix.

E.g., setting AIRFLOW_CONN_AIRFLOW_TABLEAU_CONN_1 = xyz generates a connection named AIRFLOW_TABLEAU_CONN_1 (the part after the AIRFLOW_CONN_ prefix) with the value xyz.
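
For illustration, one of the parameters might look like the following. This is a sketch only: the URI shape and the extra-field names (token_name, personal_access_token, site_id) are assumptions based on the Tableau provider's connection format and may differ between provider versions:

import os

# Airflow turns any environment variable that starts with AIRFLOW_CONN_
# into a connection; the conn id is everything after the prefix
# (here: AIRFLOW_TABLEAU_CONN_1). The https:// of the server URL is
# percent-encoded so it survives URI parsing.
os.environ["AIRFLOW_CONN_AIRFLOW_TABLEAU_CONN_1"] = (
    "tableau://https%3A%2F%2Feu-west-1a.online.tableau.com"
    "?token_name=token_1"
    "&personal_access_token=token_value_1"
    "&site_id=site_id"
)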

Airflow integration with PATs

As depicted in the following diagram, we added one task at the beginning of the DAG that generates an Airflow Variable called tableau_conns_availability:

Depiction of Tableau Refresh DAG after implementation of PATs

The value of the variable is the following:

{
"AIRFLOW_TABLEAU_CONN_1": "free",
"AIRFLOW_TABLEAU_CONN_2": "free",
"AIRFLOW_TABLEAU_CONN_3": "free",
"AIRFLOW_TABLEAU_CONN_4": "free",
"AIRFLOW_TABLEAU_CONN_5": "free",
"AIRFLOW_TABLEAU_CONN_6": "free",
"AIRFLOW_TABLEAU_CONN_7": "free",
"AIRFLOW_TABLEAU_CONN_8": "free",
"AIRFLOW_TABLEAU_CONN_9": "free",
"AIRFLOW_TABLEAU_CONN_10": "free"
}
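
A sketch of the task that could generate this Variable at the start of each run (the Variable name is from our setup; the task itself is illustrative):

from airflow.decorators import task
from airflow.models import Variable

@task
def init_tableau_conns_availability(pool_size: int = 10) -> None:
    # Mark every connection in the PAT pool as free for this run.
    availability = {
        f"AIRFLOW_TABLEAU_CONN_{i}": "free" for i in range(1, pool_size + 1)
    }
    Variable.set("tableau_conns_availability", availability, serialize_json=True)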

The following steps occur when an Airflow task wants to send a request to Tableau:

  1. Airflow gives the task an available connection from tableau_conns_availability.
  2. Airflow marks the connection as busy in tableau_conns_availability.
  3. The task sends requests to Tableau (a refresh can take a while to finish).
  4. The task is marked as successful or failed.
  5. Airflow marks the connection as available in tableau_conns_availability.

If there are no available connections, the task waits 10 minutes and tries again. If there is still no available connection, it waits another 10 minutes. This cycle can repeat up to 10 times; after 10 retries with no available connection, the task fails. A sketch of this acquire/release cycle follows the diagram below.

The following diagram illustrates the PAT implementation in Airflow (step 1 in the previous list):

Flow of an Airflow task requesting a Tableau connection
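
A minimal sketch of steps 1, 2, and 5, with the 10-minute/10-retry policy described above (function names are hypothetical; note that the read-modify-write on the Variable has no locking, which, as described next, is exactly what bit us):

import time

from airflow.models import Variable

RETRY_DELAY_SECONDS = 600  # wait 10 minutes between attempts
MAX_RETRIES = 10

def acquire_tableau_conn() -> str:
    """Return the id of a free connection and mark it busy (steps 1-2)."""
    for _ in range(MAX_RETRIES):
        availability = Variable.get("tableau_conns_availability", deserialize_json=True)
        for conn_id, state in availability.items():
            if state == "free":
                availability[conn_id] = "busy"
                Variable.set("tableau_conns_availability", availability, serialize_json=True)
                return conn_id
        time.sleep(RETRY_DELAY_SECONDS)  # no free connection; try again later
    raise RuntimeError("No Tableau connection became available after 10 retries")

def release_tableau_conn(conn_id: str) -> None:
    """Mark the connection as free again when the task finishes (step 5)."""
    availability = Variable.get("tableau_conns_availability", deserialize_json=True)
    availability[conn_id] = "free"
    Variable.set("tableau_conns_availability", availability, serialize_json=True)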

We encountered one drawback with this architecture after implementing it. Since each domain DAG that finishes kicks off refreshes of multiple data sources, several tasks sometimes received the same Tableau connection id. We presume that because multiple tasks requested a connection at the same time, Airflow did not manage to mark some connections as busy, so some connections were used twice. This resulted in failed tasks: the credential token changed while some tasks were still using the old one.

The solution we chose was simple. When we create the tasks in Airflow, each task gets a number (0, 2, 4, …), and when the task is about to start, it waits that many seconds. This way, no two tasks request a connection at the same time, and Airflow has enough time to mark a connection as busy before handling the next task that requests one.
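
A sketch of the stagger (the 2-second spacing is from our setup; wiring it into the connection-acquisition step is illustrative):

import time

def acquire_with_stagger(task_index: int) -> str:
    # Task 0 waits 0s, task 1 waits 2s, task 2 waits 4s, and so on, so
    # no two tasks read the availability Variable at the same moment.
    time.sleep(task_index * 2)
    return acquire_tableau_conn()  # from the sketch above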

Improvements for the future

Although the process described above works, it is tailored to HeyJobs. Our aim is to use the expertise we have gathered with the Airflow-Tableau integration to improve the Airflow source code itself. The change could be, for example, reusing the current credential token instead of generating a new one for each request, or perhaps another solution. We have already raised an issue. Maybe you have an idea for how to solve this problem?

Interested in joining our team? Browse our open positions or check out what we do at HeyJobs.
