Log Management and Guidewire Cloud Platform Observability

Guidewire Engineering Team
Guidewire Engineering Blog
11 min read · Aug 2, 2023

By: Premchand Bellamkonda (Senior Technical Product Manager), Amit Chaudhari (Senior Software Engineer), Muralicharan Gurumoorthy (Senior Director Of Engineering)


For any cloud platform, observability refers to the ability to gain insight into the internal state and behavior of the platform’s infrastructure components, services, and applications. Observability is a growing focus area in the industry because it helps operators and developers detect issues and make faster decisions that lead to improved platform performance, reliability, security, and customer experience.

While successful platform observability requires a cohesive approach, most observability tools comprise three core components:

  • Logs, which are a record of operational events that have occurred within the platform
  • Metrics, which are quantitative measurements of the system’s performance or behavior
  • Traces, which provide visibility into the flow of requests and transactions throughout the platform

In a previous blog post, Guidewire Cloud: Why Hybrid Tenancy is the Right Choice, we explained how Guidewire Cloud Platform (GWCP) is powered by ATMOS, a platform-as-a-service (PaaS) layer specifically optimized for the insurance industry. In any distributed computing model like the one enabled by GWCP and ATMOS, providing a high degree of observability is crucial to maintaining a robust and reliable cloud platform.

In this blog, we will take a deep dive into the logging aspects of cloud platform observability and explain how they were enabled in GWCP.

GWCP Logging Requirements by Persona

As a distributed cloud platform serving a worldwide customer base, GWCP is used by a number of development and operational personas who rely on essential logging capabilities to provide the observability they need. These include:

  • Customer Developer: Leverages logging to troubleshoot customized InsuranceSuite and other applications that run on GWCP
  • Customer Site Reliability Engineer / Operator: Monitors the operational performance of their applications to ensure key performance metrics are being met, including Service Level Objectives (SLO), Service Level Agreements (SLA), and Application Response Time (ART)
  • Guidewire Developer: Develops applications, tools, and services that run on ATMOS, accessing logs to perform debugging and troubleshooting as part of Guidewire’s standard development lifecycle
  • Guidewire Site Reliability Engineer / Operator: Manages and maintains cloud operations across all tenants, leveraging a single, centralized log management and diagnostics interface

Multitenant Logging on GWCP — Conceptual Architecture

The goal of log management (LM) in a cloud platform is to establish a robust and efficient process for collecting, storing, analyzing, and utilizing log data. In GWCP this required integrating and orchestrating interactions between various components and services, giving the personas described earlier the log access they need for their day-to-day work. The diagram below shows how this was implemented on GWCP.

Architecture diagram: the Guidewire Cluster (Node 1 and Node 2) and the Customer Cluster (Node 1 and Node 2) each interact with GWCP Internal LM. Multiple tenants feed into the Customer Cluster. Both clusters also feed into the SIEM, along with InfoSec / Audit users.
GWCP Log Management Architectural Overview

The Guidewire Cluster shown on the left-hand side is a multitenant cluster powered by ATMOS that houses all cloud-native microservices as well as Guidewire Cloud Console (GCC), a unified GWCP tool that supports a wide range of operational and development activities through a “single pane of glass.”

The Customer Cluster in the middle of the diagram is also a multi-tenant ATMOS cluster that enables application, process, and data isolation through assignment of namespaces by tenant. Each node within this ATMOS cluster can run applications from multiple customers.

Within each cluster the logs collected in GWCP are of two major types:

  1. Application logs: App logs relate to activities and events captured within InsuranceSuite (IS) applications (e.g., PolicyCenter or ClaimCenter) and are essential for enabling debugging and troubleshooting efforts by both Guidewire and customer developers.
  2. Infrastructure logs: Infra logs relate to activities and events captured within ATMOS and other cloud infrastructure components. These include logs from Kubernetes as well as cloud provider services (e.g., AWS system, network, and database logs) and are typically used by Guidewire developers, site reliability engineers, and operators.

Multitenancy Logging

Digging a little deeper, multitenancy logging is required at two levels:

  1. Customer/IS Applications: GWCP provisions multiple Kubernetes namespaces for each tenant/customer in a Customer Cluster, defining logical boundaries for application resources. As applications run within each customer-allocated namespace, they emit logs to stdout (standard output) and stderr (standard error) streams, following standard container logging practice (see the sketch after this list). These logs are accessed by customer personnel for their day-to-day development of IS and related applications, as well as by Guidewire SREs and operators as needed to maintain successful platform operations.
  2. Cloud-native services: These are provisioned in a separate Guidewire cluster which hosts technical cloud platform services, business or function-specific services, as well as the control plane services used by GCC. These services are all multi-tenanted and therefore are shared across all Guidewire Cloud customers. These logs are typically used by Guidewire internal developers to monitor and troubleshoot issues.
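
To make the container-logging pattern concrete, here is a minimal sketch of how an application running in a tenant namespace might emit structured JSON logs to stdout for the node-level log agent to collect. The logger name and field names are illustrative assumptions, not GWCP’s actual log schema:

```python
import json
import logging
import sys

# Minimal structured-log formatter: each record becomes one JSON line on
# stdout, where the node-level log agent picks it up. Field names are
# illustrative, not GWCP's actual schema.
class JsonLineFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # containers log to stdout/stderr
handler.setFormatter(JsonLineFormatter())
log = logging.getLogger("policy-center")     # hypothetical application logger
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("Quote issued for policy %s", "PC-1234")
```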

Log Collecting and Forwarding

The Log Agents shown in the diagram are lightweight, highly efficient components that are responsible for collecting and forwarding logs from containers running within a Kubernetes cluster to a centralized log management system through a Log Router. The agent is usually installed as a DaemonSet, which ensures that there is one instance of the agent running on every node in the cluster.

Each ATMOS cluster has a single, highly available logical router. The primary purpose of the router is to aggregate, filter, and then forward logs. The Log Router follows a namespace-based naming convention to route logs to tenant-specific log management solutions as required.
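As a rough illustration of what a namespace-based routing convention can look like, the sketch below assumes namespaces are named `<tenant>-<environment>` (e.g., `acme-prod`) and fans each record out to the internal Guidewire LM plus any tenant-specific destination. The naming scheme and destination map are assumptions for illustration; GWCP’s actual convention is internal:

```python
# Sketch of namespace-based log routing, assuming namespaces follow a
# "<tenant>-<environment>" convention (e.g., "acme-prod"). The destination
# map and URLs below are purely illustrative.
TENANT_DESTINATIONS = {
    "acme": "https://logs.acme.example.com/ingest",   # hypothetical tenant LM
}
GUIDEWIRE_LM = "https://lm.gwcp.example.com/ingest"   # hypothetical internal LM

def route(log_record: dict) -> list[str]:
    namespace = log_record["kubernetes"]["namespace"]  # added by the log agent
    tenant = namespace.split("-")[0]
    destinations = [GUIDEWIRE_LM]                      # internal LM sees all logs
    if tenant in TENANT_DESTINATIONS:
        destinations.append(TENANT_DESTINATIONS[tenant])
    return destinations
```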

Other key aspects of log management are worth examining further:

  • Filtering: Log patterns at the App or Infra level that are considered noise are filtered out via pattern matching or custom rules (a minimal filter sketch follows this list). Critical log monitoring alerts can also be incorporated to watch for specific scenarios such as microservice performance issues or extreme events like a PII (Personally Identifiable Information) leak.
  • Forwarding: GWCP provides an option to forward logs to the customer’s choice of supported log aggregators, which currently include Sumo Logic and Logstash. Support for additional log aggregators such as Datadog and Splunk, as well as generic HTTP endpoints, is planned.
  • Tenant LM: This persists customer/tenant-specific InsuranceSuite application logs that are only available to those customers. To facilitate this, Guidewire has implemented IdP (Identity Provider) federation through the GWCP Okta Hub. Once configured with SSO through the customer’s IdP, users can access these logs directly through Guidewire Cloud Console (GCC).
  • Guidewire LM: This captures logs generated across GWCP for all tenants and all services. Guidewire internal users including SREs and Operators typically access these logs to diagnose and respond to application or infrastructure-related incidents.
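
As promised above, here is a minimal noise-filtering sketch: lines matching known-noisy patterns are dropped before forwarding. The patterns themselves are examples only, not the rules GWCP actually uses:

```python
import re

# Illustrative noise filters: drop log lines matching known-noisy patterns
# before they are forwarded to the log management system.
NOISE_PATTERNS = [
    re.compile(r'GET /healthz HTTP/1\.1" 200'),  # health-check chatter
    re.compile(r"DEBUG .* heartbeat"),           # verbose heartbeat debug lines
]

def should_forward(line: str) -> bool:
    return not any(p.search(line) for p in NOISE_PATTERNS)
```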

Log Cost Optimization

As described above, logging is an essential component of observability and an invaluable tool for identifying and responding to operational issues. But while log management is generally easy to configure, run, and scale, the cost of logging is often among the hardest things to monitor, control, and optimize.

To mitigate the difficulty of tracking logging cost, Guidewire implemented the practices described below to gain better cost visibility and control.

Log Indexes

It is not unusual for cloud platforms and applications to generate logs at a rate of millions of entries per minute. But because not all logs are equally valuable across all periods of time, implementing log indexes can help control cost in several ways. With log indexes in place, you can implement fine-grained control over your LM budget by segmenting data into value groups with differing retention periods, quotas, usage monitoring, and billing (a sketch of this segmentation follows the list below). While a single log index can work in some situations, using multiple indexes is recommended because it provides advantages such as:

  • Reduced Storage Costs: Organizations can allocate different retention policies and storage options to different types of logs based on their importance or compliance requirements. For example, frequently accessed logs can be stored in more expensive, high-performance storage, while less frequently accessed logs can be stored in less expensive, archive storage.
  • Reduced Retention Costs: Organizations can set different retention policies for each index based on their specific needs. This can help reduce data retention costs by allowing less important or less frequently accessed logs to be deleted after a short period of time while retaining important logs for longer periods.
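
The sketch below shows one way logs might be segmented into value-based indexes with different retention windows and quotas. The index names, retention periods, and routing rule are all illustrative assumptions, not GWCP’s actual configuration:

```python
# Sketch of segmenting logs into value-based indexes with different
# retention periods and daily quotas. All values here are illustrative.
INDEXES = {
    "audit": {"retention_days": 365, "daily_quota_gb": 10},
    "app":   {"retention_days": 30,  "daily_quota_gb": 200},
    "debug": {"retention_days": 3,   "daily_quota_gb": 500},
}

def choose_index(record: dict) -> str:
    if record.get("category") == "audit":
        return "audit"    # long retention for compliance-relevant logs
    if record.get("level") == "DEBUG":
        return "debug"    # cheap, short-lived index for low-value logs
    return "app"          # default application index
```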

It is important to note that log indexes themselves require storage resources, but the benefits of faster log retrieval, selective data storage, efficient log analysis, and proactive monitoring will usually outweigh the costs of maintaining them.

Log Archival

While the multiple-index model certainly helps optimize log storage and retention costs, log volumes may still grow even as the need to access older logs diminishes over time. Accordingly, Guidewire implemented archiving for older logs that are not used for real-time platform and application monitoring. Archived logs can be compressed and stored in less expensive storage systems (such as AWS S3), and when access is needed they can easily be rehydrated and made available without impacting user productivity. The log archive period should be balanced to avoid having to rehydrate logs frequently.
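
As a minimal sketch of the compress-and-archive step, assuming AWS S3 as the cold store, an aged log file might be gzipped and uploaded under an archival storage class. The bucket name and key layout are hypothetical:

```python
import gzip
import shutil

import boto3

# Sketch of archiving an aged log file: compress it, then upload to S3
# under a cold storage class. Bucket name and key layout are illustrative.
def archive_log(path: str, bucket: str = "example-log-archive") -> None:
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)              # compress before upload
    boto3.client("s3").upload_file(
        gz_path, bucket, f"archive/{gz_path}",
        ExtraArgs={"StorageClass": "GLACIER"},    # cheaper archival tier
    )
```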

Many industries have regulatory requirements for data retention, with non-compliance potentially resulting in significant fines and related legal costs. Archiving logs can help organizations ensure compliance with these regulations while avoiding both cost and any negative impacts to the business as a whole.

Log-Based Metrics

Despite the cost-saving advantages of effective log indexing and archival, increased platform usage will continuously drive up log volumes and the associated costs of ingestion and storage. One additional approach to mitigating this is to store log-based metrics generated from the content of log entries. They provide a condensed representation of log data by aggregating and summarizing relevant information instead of logging every single event.

Because you are capturing key metrics that represent overall behavior, performance, or trends within the platform, log-based metrics can be stored in a more efficient and cost-effective manner. For example, instead of storing thousands of log messages for a specific time period, a metric might represent a single value that summarizes the overall trend or pattern for that same time period.
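As a minimal sketch of this idea, the function below collapses individual error log entries into a per-minute error count, a single metric point per minute instead of thousands of stored lines. The timestamp and level field names are assumptions:

```python
from collections import Counter

# Sketch of deriving a log-based metric: instead of storing every error
# line, keep only a per-minute error count. Field names are assumed.
def error_counts_per_minute(records: list[dict]) -> Counter:
    counts = Counter()
    for r in records:
        if r["level"] == "ERROR":
            minute = r["ts"][:16]    # e.g., "2023-08-02T14:37"
            counts[minute] += 1
    return counts
```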

Log Security

Given that most modern logging solutions operate on cloud-based platforms like GWCP, it is important to understand key security requirements for log management. The following sections describe some of these considerations.

Role-Based Access Control

As mentioned previously, Guidewire provides role-based access control (RBAC) to manage access and ensure the security and integrity of logging information. This allows us to assign specific roles and permissions to individuals or groups based on their responsibilities, such as defining different access levels for viewing, analyzing, and managing logs.

With this RBAC architecture, organizations can apply access controls at all relevant layers to prevent unauthorized access to or tampering with logging data. As a simple example, in a multi-project setup a user assigned to work on Project A can be limited to accessing only Project A’s log data. This improves security and reduces the risk of data breaches, which can be costly in terms of both financial and reputational damage.
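
The Project A example might be expressed as a simple role-to-index grant check like the sketch below. The role names, index names, and wildcard convention are illustrative, not GWCP’s actual RBAC model:

```python
# Sketch of an RBAC check limiting users to the log indexes of their own
# projects. Role and index names are illustrative assumptions.
ROLE_GRANTS = {
    "project-a-developer": {"project-a-app", "project-a-debug"},
    "platform-sre":        {"*"},    # wildcard: access to all indexes
}

def can_read(role: str, index: str) -> bool:
    grants = ROLE_GRANTS.get(role, set())
    return "*" in grants or index in grants
```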

Protecting PII

Preventing Personally Identifiable Information (PII) from being written to logs is essential to protect sensitive data and maintain compliance with data privacy regulations. As such, Guidewire has made it a top priority to put preventative measures in place to safeguard the PII of our customers in the P&C insurance industry. We describe some of these precautionary measures in the sections that follow.

First, there needs to be a precise classification of any and all restricted information that could end up in logs. Guidewire’s InsuranceSuite application platform includes a Logging Wrapper that is designed to use these classifications to control and prevent PII from being written into the log. This provides a protective PII layer at the source, which is ideal.
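The general shape of such a wrapper is a classification-aware redaction pass before anything is written. The sketch below is an assumption-laden illustration; the actual InsuranceSuite Logging Wrapper API and classification table are not public:

```python
# Sketch of a classification-driven logging wrapper: fields classified as
# restricted are redacted before the record is written. The field list and
# wrapper shape are hypothetical, not InsuranceSuite's actual implementation.
RESTRICTED_FIELDS = {"ssn", "date_of_birth", "driver_license"}

def safe_log_fields(fields: dict) -> dict:
    return {k: ("[REDACTED]" if k in RESTRICTED_FIELDS else v)
            for k, v in fields.items()}
```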

Since GWCP consists of many multitenant apps and microservices across its distributed architecture, it is difficult to design a single platform logging wrapper that fully protects against all potential PII leaks. As such, we have designed the Log Router described previously to filter out sensitive PII data based on the same classification criteria used in the logging wrapper approach.

For additional protection, restricting or masking of classified PII data can be applied directly within the log management solution. Most leading LM platforms offer capabilities such as using regex (regular expressions) to identify and extract log text that matches PII patterns, preventing such data from being indexed and stored in the log repository.
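
For illustration, regex masking of two common PII shapes (US Social Security numbers and email addresses) might look like the sketch below; a real deployment would lean on the LM platform’s built-in masking rules rather than hand-rolled patterns:

```python
import re

# Illustrative regex masking of common PII shapes before indexing.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def mask_pii(line: str) -> str:
    for pattern, replacement in PII_PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```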

In the unlikely event that PII has already been indexed into logs, that data will need to be purged using tools made available by your log management vendor. This can be a costly and time-consuming process: first mass-scrubbing PII from the indexed data, then re-indexing any affected logs.

Relationship to SIEM

SIEM (Security Information and Event Management) is closely related to log management as it utilizes log data for security monitoring, threat detection, and incident response. SIEM systems are designed to collect, analyze, and correlate log data from various sources, including system logs, network logs, application logs, and security logs.

Given that SIEM has a different focus and objective, identifying security incidents to provide a holistic view of an organization’s security posture, this blog post will not cover SIEM in depth. That said, some log management solutions address both cloud platform and SIEM use cases because the two require many of the same capabilities, including log collection, aggregation, retention and compliance, as well as archiving and storage optimization.

Learnings and Caveats

  • Log Duplication: In the current architecture, application logs are duplicated between internal GWCP and customer/tenant-specific logging dashboards. We assumed from the start that maximizing operational and development visibility required GWCP logging to be a superset of all logs across the platform. However, providing application log data for each customer does result in higher costs due to data duplication.
  • Cost Optimization: The cost optimization controls described previously were applied over a period of time. While in retrospect we could have established some of these controls earlier, it is never too late to apply controls that fulfill your cost optimization objectives.
  • Vendor Relationship: The chosen provider of log management and other observability tools has a significant role to play in maximizing the effective use of their platform. Observability is a complex discipline, and even the most common aspects of logging, metrics, and traces often require vendor assistance to implement effectively. This makes it essential to spend quality time choosing the right vendor, and then actively working with them to put their platform to best use. It is also important to note that every vendor will have feature gaps that might be on their roadmap but are not immediately available. Guidewire had several cases where required observability features kept being delayed, so be prepared to adjust your plans accordingly.

Closing Comments

At Guidewire we believe providing great platform observability begins with effective logging. Building a robust and fully optimized log management solution is key to providing the essential operational and development visibility needed by multiple stakeholders that rely on log data to do their jobs.

Consider one very common operational metric, Mean Time to Recover/Restore (MTTR): the time it takes to access, trace, and correlate log information back to the source of a problem really matters. Meeting and exceeding these and other service levels is why Guidewire has invested in delivering world-class logging capabilities in GWCP.

Stay tuned for future blog posts describing how we have enabled other aspects of observability on the Guidewire Cloud Platform.

If you are interested in working on our Engineering teams building cutting-edge cloud technologies that make Guidewire the cloud leader in P&C insurance, please apply at https://careers.guidewire.com.
