Streamlining Tableau Server Deployment, Monitoring, and Maintenance on AWS

Published in Storm Reply · 9 min read · Dec 18, 2023

Tableau Server is one of the most popular tools for data visualization and business intelligence (BI), adept at transforming complex data sets into clear visualizations. It enables rapid and responsive data processing, which facilitates efficient data analysis. Users can apply data filters and view datasets from various perspectives. Its use extends across numerous industries, including healthcare, research, and business.

Despite its efficient data processing and diverse analytical features, Tableau Server lacks inherent support for high availability. As Tableau’s official documentation notes, even with a multi-node high-availability setup, the server may experience functional interruptions if the primary node fails. Further details on this issue are available in the Tableau documentation under the topic “If an initial node fails.”

To address this limitation, we have developed an automated custom solution that greatly reduces downtime during Tableau node failures. It detects a failed Tableau node automatically and replaces it with a functioning one, without manual intervention.

Our solution automatically installs and maintains a multi-node Tableau Server on the AWS cloud using our custom stack. Deploying this stack produces a fully configured multi-node Tableau Server on AWS with minimal manual effort.

From a maintenance standpoint, our solution is fully automated, eliminating the need for manual oversight. This not only removes the necessity for managed services but also reduces operational costs. As a result, the downtime experienced by Tableau during node failures has been drastically cut from hours to mere minutes, thanks to the fully automated recovery process.

Additionally, we have integrated Amazon CloudWatch for real-time server health monitoring, enabling quick failure detection. We have also configured AWS WAF (Web Application Firewall) for the Tableau Server to bolster security.

Moreover, we have implemented an automatic daily backup system for both application and data files. This ensures that the latest backup is always available as a fallback in the event of a total failure of the Tableau Servers.

Challenges with the Traditional Manual Installation of Tableau Server

Given its operational importance, the installation, maintenance, and availability of Tableau Server deserve careful attention. When self-hosting Tableau Server, two installation modes are available: Single-node Installation and Distributed Installation.

Single-node Installation: In this mode, all Tableau Server processes run on the same machine. Tableau’s official documentation states that this type of installation should be restricted to environments that can afford occasional downtime. Its weak point is that there is little redundancy if one of the server processes fails.

Distributed Installation: This is the more stable mode, as it distributes Tableau Server processes across multiple server nodes, so the processes can draw on additional computing power when required. However, this type of installation does not guarantee that Tableau Server will continue to function if the initial server node fails.

Tableau’s official documentation lists the steps required to install a multi-node Tableau Server under the topic “Creating a distributed Tableau Server installation.”

These steps must be performed in sequence and take considerable time when done manually. The process is as follows: Tableau Server must first be installed on the initial node, after which the bootstrap file is generated. This bootstrap file is then used to initialize the additional nodes of the Tableau Server. Next, the Tableau Server processes are distributed among the nodes to enhance availability and reliability. Additionally, it is recommended to deploy the coordination service ensemble when configuring an environment with three or more nodes.
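The manual sequence above can be sketched as an ordered list of commands. This is a minimal sketch and not our installation script: it assumes the initial node is already installed and uses Tableau's default node IDs (node1, node2, ...). The bootstrap path, the admin user name, and the choice of the gateway process as the one to distribute are placeholders chosen for illustration.

```python
from typing import List, Tuple


def distributed_install_steps(
    additional_nodes: List[str],
    bootstrap_path: str = "/tmp/bootstrap.json",  # hypothetical path
    admin_user: str = "tsmadmin",                 # hypothetical TSM admin user
) -> List[Tuple[str, str]]:
    """Return (node, command) pairs in the order they have to run.

    Assumes Tableau Server is already installed and running on node1.
    """
    steps = [
        # 1. Generate the bootstrap file on the initial node.
        ("node1", f"tsm topology nodes get-bootstrap-file --file {bootstrap_path}"),
    ]
    # 2. Initialize every additional node with the bootstrap file.
    for node in additional_nodes:
        steps.append(
            (node, f"sudo ./initialize-tsm -b {bootstrap_path} -u {admin_user} --accepteula")
        )
    # 3. Distribute processes (only the gateway is shown, as an example).
    for node in additional_nodes:
        steps.append(("node1", f"tsm topology set-process -n {node} -pr gateway -c 1"))
    steps.append(("node1", "tsm pending-changes apply"))
    # 4. With three or more nodes, deploy a coordination service ensemble.
    all_nodes = ["node1"] + additional_nodes
    if len(all_nodes) >= 3:
        steps.append(("node1", "tsm stop"))
        steps.append(
            ("node1", f"tsm topology deploy-coordination-service -n {','.join(all_nodes)}")
        )
        steps.append(("node1", "tsm start"))
    return steps


for node, command in distributed_install_steps(["node2", "node3"]):
    print(f"{node}: {command}")
```

Even in this reduced form, every step depends on the one before it, which is why the manual procedure is slow and error-prone.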

The execution of all the above-mentioned installation steps requires specialized IT skills, as it involves the understanding of server architecture, network configuration, security protocols, and related skills. In the next section, we will discuss how we have automated these processes, rendering manual execution unnecessary. With our solution, installing the Tableau server is now a simple and hassle-free process.

Automated Deployment of Multi-Node Tableau Server on AWS

In this section, we introduce our automated approach for deploying a multi-node Tableau Server on the AWS cloud. This approach simplifies the deployment process and eliminates the requirement for specialized knowledge. Our implementation leverages AWS CDK (Cloud Development Kit) to orchestrate the deployment of this application. AWS CDK allows you to define and manage your infrastructure through code, making it easier to manage and version control the infrastructure. To ensure modularity and streamline administration, we have organized the entire application into separate CDK stacks. The details of the stacks are as follows:

Network Infrastructure Stack: This stack is responsible for provisioning the fundamental networking components essential for our application’s operation. It includes the setup of a Virtual Private Cloud (VPC), the creation of both public and private subnets, the deployment of Internet gateways, NAT gateways, and the fine-tuning of route tables and security group rules.

Tableau Stack: This stack deploys and configures the multi-node distributed Tableau Server environment on the AWS cloud. The AWS services used in this solution are Amazon S3, AWS Systems Manager (Parameter Store and Run Command), Amazon EC2 with Auto Scaling groups, Amazon CloudWatch, Amazon EventBridge, and load balancers. As part of this implementation, we have developed installation and configuration scripts that are automatically stored in an S3 bucket during the deployment of the Tableau CDK stack.

The Auto Scaling groups are configured to spin up the EC2 instances. The scripts are then executed on the instances to install all the dependencies required for the Tableau Server installation. Subsequently, each instance is configured as the initial node, the application node, or a data node within the Tableau Server environment.

The sequence of this setup is as follows:

The first instance is designated as the initial node for Tableau Server.
The second instance assumes the role of the application node for Tableau Server.
The remaining instances are designated as data nodes within the Tableau Server configuration.

To implement this logic effectively, we have utilized the SSM Parameter Store to store flags that keep track of the sequence in which these instances are configured.
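The role assignment described above can be illustrated with a small sketch. It is not our production script: an in-memory class stands in for the SSM Parameter Store so that the example runs without AWS access, and the parameter names under /tableau/role/ are hypothetical. With boto3, the create-only-if-absent behaviour could plausibly be obtained by calling put_parameter with Overwrite=False and treating the ParameterAlreadyExists error as "role already taken".

```python
from typing import Dict


class InMemoryParameterStore:
    """Stand-in for SSM Parameter Store (assumption, for illustration only)."""

    def __init__(self) -> None:
        self._params: Dict[str, str] = {}

    def put_if_absent(self, name: str, value: str) -> bool:
        """Create the parameter; return False if it already exists."""
        if name in self._params:
            return False
        self._params[name] = value
        return True


def claim_role(store: InMemoryParameterStore, instance_id: str) -> str:
    """Claim the first free role in the order initial, application, data."""
    for role in ("initial", "application"):
        # The flag records which instance took the role.
        if store.put_if_absent(f"/tableau/role/{role}", instance_id):
            return role
    # Every instance after the first two becomes a data node.
    return "data"


store = InMemoryParameterStore()
roles = [claim_role(store, f"i-{n}") for n in range(4)]
print(roles)  # ['initial', 'application', 'data', 'data']
```

Because each flag can be created only once, instances that boot at roughly the same time still end up with distinct initial and application roles.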

After assigning the instances to their respective nodes, all of them proceed to install the latest version of Tableau Server. The nodes then complete all the required configurations.

Once all the other nodes have completed their configuration, they notify the initial node, which then distributes server processes among the nodes to ensure redundancy and optimize performance. One example of such a distribution of server processes is illustrated in Figure 1.

The resulting architecture comprises one instance designated as the initial node, one instance serving as the application node, and the remaining instances functioning as data nodes.

Monitoring Stack: This stack deploys monitoring for our Tableau Server through Amazon CloudWatch. CloudWatch simplifies infrastructure and application maintenance by visualizing metrics and event data in real time.

Within this stack, CloudWatch metrics are configured to monitor CPU utilization, RAM usage, and disk space of the EC2 instances hosting the Tableau distributed servers. The stack also creates CloudWatch alarms with predefined thresholds, enabling us to closely monitor the health of our nodes. When an alarm fires, server operators are notified by email.
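As an illustration of the alarms described above, the following sketch builds one alarm definition per metric. The thresholds, periods, and alarm names are illustrative and not the values we use. The sketch also assumes that memory and disk metrics come from the CloudWatch agent (EC2 does not publish them by default) and that email delivery goes through an SNS topic. Each dictionary uses parameter names accepted by the boto3 CloudWatch put_metric_alarm call.

```python
from typing import Any, Dict, List

# (namespace, metric name, threshold in percent). Thresholds are illustrative.
METRICS = [
    ("AWS/EC2", "CPUUtilization", 80.0),
    ("CWAgent", "mem_used_percent", 85.0),   # published by the CloudWatch agent
    ("CWAgent", "disk_used_percent", 90.0),  # published by the CloudWatch agent
]


def build_alarms(instance_id: str, sns_topic_arn: str) -> List[Dict[str, Any]]:
    """Build one alarm definition per monitored metric for a single instance."""
    alarms = []
    for namespace, metric, threshold in METRICS:
        alarms.append({
            "AlarmName": f"tableau-{instance_id}-{metric}",
            "Namespace": namespace,
            "MetricName": metric,
            # Agent metrics often carry extra dimensions (for example the disk
            # path); the dimensions must match what the agent publishes.
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Average",
            "Period": 300,            # seconds per evaluation window
            "EvaluationPeriods": 2,   # breach two windows in a row
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [sns_topic_arn],  # SNS topic with an email subscription
        })
    return alarms


alarms = build_alarms(
    "i-0123456789abcdef0",
    "arn:aws:sns:eu-central-1:123456789012:tableau-alerts",  # hypothetical ARN
)
print([a["AlarmName"] for a in alarms])
```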

Security Stack: The security stack configures the AWS Web Application Firewall (WAF) to provide a critical layer of protection, shielding the servers from potential attacks. This includes the fine-tuning of Web ACL (Access Control List) rules, which specify the criteria for allowing or blocking web requests. The WAF configuration is designed to ensure the integrity and resilience of our system against potential threats and vulnerabilities. To allow real-time reporting, we have implemented alert notifications via email to promptly inform the concerned team of any suspicious activity or attacks. This proactive approach ensures that our infrastructure remains secure and well-protected.
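Splitting the application into stacks implies a deployment order. The sketch below derives one from a dependency map using Python's standard graphlib module. The dependencies shown are an assumption made for illustration (the network first, then Tableau, then monitoring and security), and the stack names are shorthand. In a CDK app, such relationships can be declared with a stack's add_dependency method.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# stack -> stacks it depends on (assumed, not taken from our CDK code)
STACK_DEPENDENCIES: Dict[str, Set[str]] = {
    "NetworkStack": set(),
    "TableauStack": {"NetworkStack"},
    "MonitoringStack": {"TableauStack"},
    "SecurityStack": {"TableauStack"},
}


def deployment_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the stacks in an order that respects their dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = deployment_order(STACK_DEPENDENCIES)
print(order[:2])  # ['NetworkStack', 'TableauStack']
```

The monitoring and security stacks do not depend on each other in this sketch, so they can follow the Tableau stack in either order.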

Challenges in Tableau Server Maintenance

After the Tableau distributed environment is operational, there are some challenges and limitations pertaining to the maintenance of the services. Three crucial server processes run only on the initial node: the License Service (License Manager), the Activation Service, and the TSM Controller (Administration Controller). So even the distributed environment does not guarantee continuity of service if the initial node fails. It is imperative to have a recovery plan in place, which involves migrating these processes to another active server node. More information on recovering from an initial node failure can be found here: Link

Furthermore, the failure of nodes other than the initial one can lead to performance degradation and a poor user experience. Hence, the recovery plan should also include a method for promptly replacing any failed nodes. The steps for addressing this scenario are outlined here: Link

We have automated all the above steps as part of our solution. Consequently, there is no longer a requirement for manual execution of recovery procedures, resulting in significant time and effort savings, as well as the removal of the necessity for managed services.

More will follow in the next section.

Automated Replacement of Failed Nodes, Leveraging AWS Capabilities

When Tableau Server is operational, there’s always a chance that one of its server nodes might become inoperative. Our objective is to automate the substitution of any failed node without the need for manual intervention.

To detect node failures automatically, we use Amazon EC2 Auto Scaling lifecycle hooks. These hooks are activated when an instance within the Auto Scaling group is set to terminate due to an internal node issue. We’ve also set up an Amazon EventBridge rule to capture events from these lifecycle hooks. EventBridge then triggers a Lambda function.
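The detection path can be sketched as an EventBridge event pattern plus the first lines of a Lambda handler. This is a minimal sketch, not our function: the Auto Scaling group name is hypothetical, and the handler only extracts the fields that a later call to complete the lifecycle action would need.

```python
from typing import Any, Dict, Optional

ASG_NAME = "tableau-asg"  # hypothetical Auto Scaling group name

# Pattern for an EventBridge rule that matches terminate lifecycle actions
# emitted by the Auto Scaling group above.
EVENT_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
    "detail": {"AutoScalingGroupName": [ASG_NAME]},
}


def handler(event: Dict[str, Any], context: Optional[Any] = None) -> Dict[str, str]:
    """Extract the identifiers of the terminating instance from the event."""
    detail = event["detail"]
    return {
        "instance_id": detail["EC2InstanceId"],
        "asg_name": detail["AutoScalingGroupName"],
        "hook_name": detail["LifecycleHookName"],
        # The token identifies this specific lifecycle action.
        "action_token": detail["LifecycleActionToken"],
    }


sample_event = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "EC2InstanceId": "i-0123456789abcdef0",
        "AutoScalingGroupName": ASG_NAME,
        "LifecycleHookName": "tableau-terminate-hook",  # hypothetical
        "LifecycleActionToken": "example-token",
    },
}
print(handler(sample_event)["instance_id"])  # i-0123456789abcdef0
```

While the hook holds the instance in its terminating wait state, the function has time to run the recovery steps before the instance disappears.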

In situations where the initial node fails, the Lambda function first transfers essential server processes — like the License Service (License Manager), Activation Service, and TSM Controller (Administration Controller) — to a healthy existing node, designating it as the new initial node. This transition involves updating configurations and settings so the new node can take over the initial node’s duties. A brief period of downtime is expected during this transition. The function then deactivates the server licenses on the failed node before it’s terminated.

Subsequently, the new initial node creates an updated bootstrap file and oversees the replacement of the failed node with a new instance.

If an application or data node fails, the initial node regenerates the bootstrap file. Concurrently, a new instance is created, which installs the Tableau Server and configures itself using this bootstrap file. This method is designed to minimize downtime by swiftly and efficiently replacing the failed node.
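The two recovery paths described above can be summarised as a small planning function. This is a sketch of the decision logic only: the step descriptions are plain text taken from the description above, and the function name and role labels are chosen for illustration.

```python
from typing import List


def recovery_plan(failed_role: str) -> List[str]:
    """Return the ordered recovery steps for a failed node of the given role."""
    if failed_role == "initial":
        return [
            "move License Service, Activation Service and TSM Controller "
            "to a healthy node, which becomes the new initial node",
            "deactivate the server licenses on the failed node",
            "generate an updated bootstrap file on the new initial node",
            "launch a replacement instance and configure it with the bootstrap file",
        ]
    if failed_role in ("application", "data"):
        return [
            "regenerate the bootstrap file on the initial node",
            "launch a replacement instance and install Tableau Server",
            "configure the new instance with the bootstrap file",
        ]
    raise ValueError(f"unknown node role: {failed_role!r}")


for step in recovery_plan("initial"):
    print("-", step)
```

The initial-node path is longer because the processes that exist only on that node must be moved before a replacement can join the cluster.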

Scheduled Backups and Maintenance in Tableau Server

In our efforts to enhance the efficiency of Tableau Server, we have not only streamlined the initial setup and automated the recovery of failed nodes but also integrated a system for regular daily maintenance and data backups.

To achieve this, we use Linux cron jobs that run daily. These jobs save the day’s backups in an S3 bucket, providing a reliable source for disaster recovery should the automatic recovery process run into complications.

Additionally, we utilize Cron jobs for conducting regular maintenance tasks on the Tableau Server’s TSM. Scheduled on a weekly basis, these maintenance jobs are crucial for removing temporary and log files. This systematic cleaning helps optimize storage space on the instances and boosts the overall efficiency of the system.
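The two schedules can be illustrated by generating the corresponding crontab lines. This is a sketch under stated assumptions: the run times, the backup file name, the bucket name, and the backup directory are placeholders, and the upload uses the AWS CLI. The tsm maintenance backup and tsm maintenance cleanup commands are part of Tableau's TSM command line.

```python
from typing import List


def crontab_entries(bucket: str, backup_dir: str) -> List[str]:
    """Return crontab lines for the daily backup and the weekly cleanup."""
    # Daily at 02:00: create a dated backup (-d appends the date to the file
    # name), then copy the backup directory to S3.
    daily_backup = (
        "0 2 * * * "
        "tsm maintenance backup -f tableau-backup -d && "
        f"aws s3 cp {backup_dir} s3://{bucket}/backups/ --recursive"
    )
    # Weekly on Sunday at 03:00: remove old log files (-l) and temp files (-t).
    weekly_cleanup = "0 3 * * 0 tsm maintenance cleanup -l -t"
    return [daily_backup, weekly_cleanup]


for line in crontab_entries(
    bucket="example-tableau-backups",        # hypothetical bucket name
    backup_dir="/path/to/tableau/backups",   # placeholder for the TSM backup location
):
    print(line)
```

Cron runs jobs with a minimal environment, so in practice the entries usually need the full path to tsm or a wrapper script that loads the Tableau environment first.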

Conclusion

In conclusion, the solution presented in this article offers a comprehensive approach to addressing the high availability challenges of Tableau Server. By integrating this solution into the AWS cloud, organizations can efficiently manage their Tableau infrastructure without the need for extensive technical expertise. Our solution automates both the installation of a multi-node Tableau Server and the recovery procedure for failed nodes, keeping downtime to minutes in the event of a Tableau node failure. Our automated stack not only ensures swift recovery from node failures, but also provides real-time monitoring and security enhancements through the integration of Amazon CloudWatch and AWS WAF.

The incorporation of daily scheduled backups and weekly maintenance tasks further enhances the reliability and availability of Tableau Servers. It ensures that a recent backup is always available in case of emergencies.

With this approach, organizations can confidently leverage Tableau for their data visualization and business intelligence needs, knowing that their infrastructure is resilient, secure, and well-maintained.

By reducing the complexity and IT skills required for Tableau Server management, we empower organizations to focus on deriving insights from their data rather than managing the underlying infrastructure. This holistic solution represents a significant step forward in optimizing Tableau Server operations on the AWS cloud.

This article was written by Ali Jaffar and edited by Sven Seiler