Looker — self-hosted installation on GCP
Looker is Google Cloud’s cloud-native Enterprise BI Platform enabling access to near real-time data when and where you need it.
You can choose between a Looker-hosted solution, where Looker manages all the components, and a customer-hosted solution, where you own your environment. The choice between the two depends on your requirements. For example, your security requirements may not allow you to use the Looker-hosted option. With the Looker-hosted solution you do not have to worry about installation, upgrades, and scaling, and issues are resolved faster. With a customer-hosted solution, you have complete control over your infrastructure.
In this article I will describe the steps necessary to install Looker on GCP.
It is possible to install Looker on a single VM, but this is acceptable only for small workloads. A better solution for a production environment is a cluster of VMs, as shown in the picture below.
Running Looker as a cluster of instances across multiple VMs is a flexible pattern that benefits from service failover and redundancy. Horizontal scalability affords increased throughput without running into excessive garbage collection costs.
Cluster Considerations
- OS and Distribution
Looker runs on the most common versions of Linux: Red Hat, SUSE, and Debian/Ubuntu. GCP distributions of Linux are compatible with Looker. Debian/Ubuntu is the most heavily used Linux variety. Looker runs in the Java virtual machine (JVM), so when choosing a distribution, check that its OpenJDK 8 packages are up to date.
- CPU and Memory
For production, use 16x64 nodes (16 CPUs and 64 GB of RAM), a good balance between price and performance. Configuring more than 64 GB of RAM can hurt performance, because garbage collection events are single-threaded and halt all other threads. For configurations with up to 50 users, Looker recommends running a single server.
- Disk storage
100 GB of disk space is typically sufficient for a production system.
- Capacity
When more capacity is required you should add 16x64 nodes to the cluster rather than increase the size of the nodes beyond 16x64.
- File System
Looker nodes need to share certain parts of the file system (the LookML models, the developers' copies of the models, and Git server connectivity). The shared file system must be POSIX compliant.
- Database
Looker’s metadata needs to be centralized, so its internal database must be migrated to Cloud SQL (MySQL).
Looker needs a Git service to provide version management of the LookML files. GitHub, GitLab, BitBucket and others are supported.
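The capacity rule above (add 16x64 nodes instead of making nodes bigger) is simple arithmetic. The sketch below is only an illustration: the function name `nodes_needed` and the idea of starting from a total CPU and RAM estimate are my assumptions, not something Looker prescribes.

```python
import math

NODE_CPUS = 16     # CPUs per node, from the 16x64 recommendation
NODE_RAM_GB = 64   # GB of RAM per node, from the same recommendation


def nodes_needed(total_cpus: int, total_ram_gb: int) -> int:
    """Return how many 16x64 nodes cover the requested totals (at least one)."""
    by_cpu = math.ceil(total_cpus / NODE_CPUS)
    by_ram = math.ceil(total_ram_gb / NODE_RAM_GB)
    # The scarcer resource decides the node count; node size never grows.
    return max(1, by_cpu, by_ram)


# 40 CPUs need 3 nodes and 128 GB need 2 nodes, so the cluster gets 3 nodes.
print(nodes_needed(40, 128))
```

The point of the sketch is that the node shape stays fixed and only the count changes.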
Scalability
You can specify the number of nodes in the cluster in the instance group configuration. For the moment, there is no clean way for Looker to terminate a node gracefully.
You can autoscale up, but scaling down should be done manually and very carefully:
- Remove the node from the Load Balancer directory
- Wait until users have started new sessions on the remaining nodes (typically 15 to 20 minutes)
- Take the node offline
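The manual scale-down steps above can be sketched as a small routine. This is a minimal sketch, not a Looker or GCP API: `remove_from_lb` and `take_offline` are hypothetical callables that you would implement with your own tooling, and the 20-minute default wait is taken from the upper end of the range mentioned above.

```python
import time
from typing import Callable, List


def drain_and_remove(
    node: str,
    remove_from_lb: Callable[[str], None],   # hypothetical: detach node from the load balancer
    take_offline: Callable[[str], None],     # hypothetical: stop or delete the VM
    drain_seconds: float = 20 * 60,          # wait time, upper end of "15 to 20 minutes"
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Remove one node from the cluster, following the three steps in order."""
    log = []
    remove_from_lb(node)      # step 1: no new traffic reaches the node
    log.append(f"removed {node} from load balancer")
    sleep(drain_seconds)      # step 2: give users time to start new sessions elsewhere
    log.append(f"waited {drain_seconds:.0f}s for sessions to move")
    take_offline(node)        # step 3: only now shut the node down
    log.append(f"took {node} offline")
    return log
```

Passing `sleep` as a parameter makes it possible to dry-run the procedure without actually waiting.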
Upgrading
- Create a new Looker image
- Always keep in mind the most important rule of upgrading VMs: you can never have two versions of VM-based Looker connected to the same database. This will corrupt the database and render it unusable.
- The quickest way to proceed safely is to scale the instance group to 0 nodes. Make the change in the Edit modal for your instance group.
- Back up your database
- Recreate the instance group with the new image
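The upgrade sequence above can be expressed as a guarded routine. This is a sketch under assumptions: all four callables are hypothetical placeholders for your own tooling (console, CLI, or Terraform), and the new image is assumed to exist already. The only logic it adds is a check that no old node is still running before the database is touched, which is the rule about never mixing two versions.

```python
from typing import Callable


def upgrade_cluster(
    new_image: str,
    target_size: int,
    resize_group: Callable[[int], None],         # hypothetical: set instance group size
    running_nodes: Callable[[], int],            # hypothetical: count nodes still running
    backup_database: Callable[[], None],         # hypothetical: back up the Looker database
    recreate_group: Callable[[str, int], None],  # hypothetical: recreate group from an image
) -> None:
    """Upgrade by stopping every old node before any new node starts."""
    resize_group(0)                  # scale the instance group to 0 nodes
    if running_nodes() != 0:
        # Two Looker versions on one database would corrupt it, so stop here.
        raise RuntimeError("old nodes are still running; refusing to continue")
    backup_database()                # back up while nothing is connected
    recreate_group(new_image, target_size)   # bring the cluster back on the new image
```

In practice `running_nodes` would probably need to poll until the group has really drained; that loop is left out to keep the ordering visible.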
Network
Looker listens for HTTPS requests on port 9999. Looker uses a self-signed certificate with a common name of self-signed.looker.com.
The Looker API listens on port 19999.
Internal database connection
Private service access must be enabled in order to connect to Cloud SQL from a Compute Engine instance using private IP. Your VM instance must be in the same region as your Cloud SQL instance.
External services
Looker’s telemetry and license servers are available on the public internet via HTTPS. Traffic from a Looker node to ping.looker.com:443 and license.looker.com:443 must be allowed.
SMTP services
By default, Looker sends outgoing mail via SendGrid. That may require adding smtp.sendgrid.net:587 to an allowlist.
The cluster nodes communicate with each other through a message broker service on ports 1551 and 61616, so both ports must be open between cluster nodes.
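A quick way to verify the firewall rules described in this section is to try a TCP connection to each endpoint from a Looker node. The sketch below uses only the Python standard library; the host names and ports are the ones listed above, while the helper names `can_connect` and `report` are mine. It checks reachability only, not that the service behind the port is healthy.

```python
import socket
from typing import Dict, Iterable, Tuple

# Endpoints named in this section.
EXTERNAL_ENDPOINTS = [
    ("ping.looker.com", 443),
    ("license.looker.com", 443),
    ("smtp.sendgrid.net", 587),
]
CLUSTER_PORTS = [1551, 61616]   # message broker ports between nodes
LOOKER_PORTS = [9999, 19999]    # web UI and API


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def report(targets: Iterable[Tuple[str, int]]) -> Dict[str, bool]:
    """Check every (host, port) pair and return a result per endpoint."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in targets}


# Example (run on a Looker node): print(report(EXTERNAL_ENDPOINTS))
# For a peer node: report((peer_ip, p) for p in CLUSTER_PORTS)
```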
Database
The recommendation is to use a remote MySQL database (in Google Cloud, use Cloud SQL).
MONITORING
It is possible to use Cloud Monitoring or JMX.
JMX
JMX is not enabled by default. To enable it, modify the startup script as described in the Looker documentation:
https://docs.looker.com/setup-and-management/on-prem-install/monitoring-instance
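For orientation, remote JMX on any JVM is switched on with the standard `com.sun.management.jmxremote` system properties. The sketch below only assembles those flags; the port 9910 is an arbitrary example, and where exactly the flags go in Looker's startup script is described on the page linked above, not here.

```python
from typing import List


def jmx_java_args(port: int, authenticate: bool = True, ssl: bool = False) -> List[str]:
    """Build the standard JVM flags that enable remote JMX on the given port."""
    return [
        "-Dcom.sun.management.jmxremote",
        f"-Dcom.sun.management.jmxremote.port={port}",
        f"-Dcom.sun.management.jmxremote.authenticate={str(authenticate).lower()}",
        f"-Dcom.sun.management.jmxremote.ssl={str(ssl).lower()}",
    ]


# 9910 is a hypothetical port chosen for illustration.
print(" ".join(jmx_java_args(9910)))
```

With authentication enabled, the JVM also expects password and access files; that part of the setup is not covered by this sketch.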
Cloud Monitoring
It is suggested to collect, graph and alert on at least the following performance metrics:
- CPU Utilization: load and percent CPU utilized
- Memory Utilization: total used and swap used
- Disk Usage
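To make the list above concrete, here is a minimal sketch of how these numbers can be read on a Linux node with the Python standard library. It is not the Cloud Monitoring agent and does not send anything anywhere; the function names are mine, percent CPU is left out because it requires sampling over time, and `/proc/meminfo` is assumed to be available.

```python
import os
import shutil
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}, values mostly in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_usage(meminfo: Dict[str, int]) -> Dict[str, int]:
    """Return total memory used and swap used, both in kB."""
    available = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    swap_used = meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
    return {"mem_used_kb": meminfo["MemTotal"] - available, "swap_used_kb": swap_used}


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the file system holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def collect(path: str = "/") -> Dict[str, float]:
    """Gather load, memory, swap, and disk usage for the local node (Linux only)."""
    with open("/proc/meminfo") as f:
        mem = memory_usage(parse_meminfo(f.read()))
    load_1m, _, _ = os.getloadavg()
    return {"load_1m": load_1m, "cpus": os.cpu_count() or 1, **mem,
            "disk_used_percent": disk_used_percent(path)}
```

Calling `collect()` on a node returns a dictionary that can be logged or compared against alert thresholds of your choosing.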
LOGGING
Logging options such as where the log files are stored, the log level, and the log format can be configured in the startup script.
SECURITY
- Looker uses a secure connection to query Cloud SQL and Cloud Filestore.
- Administrators can set granular permissions by user or group and can restrict access.
- All data is encrypted at rest.
- Allow your MySQL user account to connect only from the IP address used by your Looker server.
IAC
You can use Terraform to install Looker on GCP. The following modules are needed:
- Database module
- Filestore module
- DNS module
- Hosted zone module
- Instance group module
- Load balancer module
- Secrets module
- SSL certificate module
References