Building Reliable Infrastructure in Google Cloud

Published in

Google Cloud - Community

11 min read · Jul 16, 2023

What does reliability mean?

Reliability means trust. When you trust someone, you have confidence in their abilities, honesty, and dependability, which makes them reliable in your eyes. In the context of a system, reliability refers to the extent to which customers can trust and depend on it to perform consistently. Reliability is the cornerstone of trust in a system: it reassures customers that the system will consistently deliver accurate results, maintain stable performance, and fulfill their expectations without frequent failures or disruptions. So trust and reliability are closely connected.

Conversely, a system that is not reliable can damage trust. If a system fails or malfunctions frequently, users will lose confidence in the system and may be reluctant to use it. This can have a negative impact on the system’s performance and its ability to meet its users’ needs.

Therefore, it is important for systems to be reliable in order to build trust with their users.

Overview of Reliability in System Design:
An application or workload is reliable when it meets your current objectives for availability and resilience to failures. Availability and reliability are related but different concepts when it comes to assessing a system’s performance. While availability refers to the system being accessible and operational, reliability specifically focuses on the system’s ability to function as expected and deliver the desired results.

Let’s consider the example of an online shopping (e-commerce) website.

Availability: The website is designed to be available 24/7, which means that customers can access it at any time. However, there may be occasions when a server goes down for maintenance, making the website temporarily unavailable. Despite these brief interruptions, the website is generally accessible and has a high availability rate.

Reliability: A reliable online shopping website ensures that customers can browse products, add them to their cart, proceed to checkout, and complete their purchase without encountering significant errors or disruptions. In this case, a reliable system would accurately process customers’ orders, maintain their account information securely, and provide a seamless shopping experience without unexpected crashes or technical glitches.

Therefore, while availability ensures the website is accessible, reliability focuses on whether the website functions correctly and consistently, delivering a good experience for its customers.

Depending on the purpose of the application, the indicators of reliability can vary. Here are two examples of indicators that can be used to assess the reliability of different types of applications:

For applications that serve content, availability, latency, and throughput are important reliability indicators.
For databases and storage systems, latency, throughput, availability, and durability are important reliability indicators.

Several important factors affect application reliability

The internal design of the application: A well-designed application with proper error handling, fault tolerance, and redundancy measures is more likely to be reliable.
Dependencies on secondary applications or components: If an application relies on other services or APIs, the reliability of those dependencies can affect the overall reliability of the application.
Google Cloud infrastructure resources: The reliability of the underlying Google Cloud infrastructure, including computing, networking, storage, databases, and security services, plays a crucial role. Google Cloud provides highly reliable infrastructure, but it’s important to configure and utilize these resources correctly to ensure application reliability.
Infrastructure capacity and quota: The capacity and quota that you provision for your deployment must ensure that enough capacity is always available when your application scales during peak hours or as your business grows.
DevOps processes and tools: A robust and well-implemented DevOps process, including CI/CD, and automated testing can contribute to a more reliable application.

The reliability of an application that’s deployed in Google Cloud depends on multiple factors.

The building blocks of reliability in Google Cloud

Regions and Zones: Google Cloud offers multiple regions and zones across the globe (zones are what some other cloud providers call availability zones). Regions are geographically separate areas that contain one or more zones. Zones are isolated locations within a region that have their own power, cooling, and networking infrastructure. By deploying resources across multiple zones and regions, you can achieve high availability and resilience against zonal and regional failures.

Load Balancing: Google Cloud provides load-balancing services that distribute incoming traffic across multiple instances or services to ensure efficient utilization and high availability.

Autoscaling: Autoscaling allows you to automatically adjust the number of instances or resources based on demand. With autoscaling policies in place, your infrastructure can scale up during peak traffic and scale down during low-demand periods.

Managed Databases: Google Cloud offers managed database services, such as Cloud SQL and Cloud Spanner, which provide built-in replication, backups, and automated failover capabilities. These features contribute to the reliability of your databases by ensuring data durability, availability, and quick recovery in case of failures.

Monitoring and Logging: Google Cloud provides monitoring and logging tools like Cloud Monitoring and Cloud Logging, which enable you to collect and analyze performance metrics, logs, and other relevant data.

Disaster Recovery and Business Continuity: Google Cloud offers services and features, such as Cloud Storage, Datastore, and Storage Transfer Service, which facilitate data replication, backup, and disaster recovery. These tools help you create robust disaster recovery plans and ensure business continuity in the event of disruptions or failures.

Security and Compliance: Google Cloud has extensive security measures and compliance certifications to protect your applications and data.
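
To make the autoscaling building block above more concrete, here is a minimal sketch of a target-utilization scaling rule. This is an illustration only, not the actual algorithm of the Compute Engine autoscaler: the function name, its parameters, and the default limits are all hypothetical, and utilization is expressed as whole percentages to keep the arithmetic exact.

```python
def desired_instances(current_instances, current_utilization_pct,
                      target_utilization_pct, min_instances=1, max_instances=10):
    """Return how many instances would bring utilization close to the target.

    Hypothetical helper for illustration; all names and defaults are assumptions.
    """
    if current_instances <= 0 or target_utilization_pct <= 0:
        raise ValueError("instance count and target utilization must be positive")

    # Total load is proportional to instances * utilization. Dividing it by the
    # target utilization gives the instance count needed; round up so the group
    # is never under-provisioned.
    total_load = current_instances * current_utilization_pct
    wanted = -(-total_load // target_utilization_pct)  # integer ceiling division

    # Keep the result inside the configured bounds.
    return max(min_instances, min(max_instances, wanted))


print(desired_instances(4, 90, 60))  # peak traffic: the group grows
print(desired_instances(4, 30, 60))  # quiet period: the group shrinks
```

The minimum and maximum bounds matter as much as the formula: the minimum keeps some redundancy during quiet periods, and the maximum caps cost and keeps the group within quota.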

Assess the reliability requirements for your cloud workloads:

Here are some things to consider when assessing the reliability requirements for your cloud workloads:

The criticality of the workload: How important is the workload to your business? Is it critical or non-critical?
The impact of a failure: What would happen if the workload fails? Would it impact your customers?
The frequency of use: How often is the workload used? If the workload is used frequently, then you will need to have a higher level of reliability.
The availability requirements: How much uptime do you need for the workload? Do you need 99.9% uptime? 99.99% uptime? Remember, the more nines you target, the higher the expected cost.
The budget: How much are you willing to spend on reliability? Again, this depends on how many nines you want to target and what your workload is expected to do.

For example, an application that provides ATM services for a bank might need 5-nines availability. A website that supports an online trading platform might need 5-nines availability and a fast response time. A batch process that writes banking transactions to an accounting ledger at the end of every day might have a data-freshness target of eight hours.
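
To get a feel for what each extra nine means, the availability targets above can be converted into a downtime budget. The sketch below is a simple calculation, assuming a 365-day year; the function and constant names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, for an availability target."""
    return period_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Three nines leaves a budget of roughly 8.8 hours per year, four nines about 53 minutes, and five nines only about 5 minutes, which is why each additional nine tends to cost more to achieve.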

Workload-specific requirements:

Uptime requirements.
Recovery time objectives (RTOs).
Recovery point objectives (RPOs).
Fault tolerance requirements: The workload must be able to tolerate a certain number of failures without impacting availability.
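
As a rough illustration of how RTO and RPO from the list above can be checked against a recovery plan, here is a small sketch. It is deliberately simplified: it assumes the worst-case data loss equals the interval between backups and the worst-case downtime equals the restore duration, and it ignores detection time and backup duration. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    backup_interval_minutes: int   # time between two consecutive backups
    restore_duration_minutes: int  # time needed to bring the workload back


def meets_objectives(plan, rpo_minutes, rto_minutes):
    """Compare a plan's worst-case data loss and downtime with the RPO and RTO."""
    # A failure just before the next backup loses one full interval of data.
    worst_case_data_loss = plan.backup_interval_minutes
    # Downtime is approximated by how long the restore takes.
    worst_case_downtime = plan.restore_duration_minutes
    return {
        "rpo_met": worst_case_data_loss <= rpo_minutes,
        "rto_met": worst_case_downtime <= rto_minutes,
    }


plan = RecoveryPlan(backup_interval_minutes=60, restore_duration_minutes=30)
print(meets_objectives(plan, rpo_minutes=15, rto_minutes=60))
```

In this example the hourly backups miss a 15-minute RPO even though the 30-minute restore fits inside a 60-minute RTO, so the plan would need more frequent backups or continuous replication.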

Check out some additional official resources that you may find helpful:

Google Cloud Platform Reliability Guide: https://cloud.google.com/architecture/infra-reliability-guide
AWS Well-Architected Reliability Pillar: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/
Azure Reliability Framework: https://docs.microsoft.com/en-us/azure/architecture/resiliency/

Design Reliable Infrastructure for your workloads in Google Cloud

Google Cloud infrastructure is designed to support a target availability of 99.9% for a workload that’s deployed in a single zone. The target availability is 99.99% for a multi-zone deployment and 99.999% for a multi-region deployment.
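
The targets above can be turned into a simple lookup that suggests the least complex deployment scope for a required availability. This is only a sketch: the availability figures come from the paragraph above, while the function name and the idea of ordering the options from simplest to most complex are assumptions made for this example.

```python
# Target availability per deployment scope, as quoted in the text,
# ordered from the simplest option to the most complex one.
TARGET_AVAILABILITY = [
    ("single-zone", 99.9),
    ("multi-zone", 99.99),
    ("multi-region", 99.999),
]


def simplest_architecture(required_availability_percent):
    """Return the first deployment scope whose target meets the requirement."""
    for name, target in TARGET_AVAILABILITY:
        if target >= required_availability_percent:
            return name
    return None  # no listed option meets the requirement


print(simplest_architecture(99.95))
```

A real decision also has to weigh cost, latency, and operational complexity, so treat the result as a starting point rather than an answer.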

1. Avoid single points of failure: Application components are typically grouped into tiers based on the function that they perform and their relationship with the other components, so the failure of any one component may affect the overall stack. This example architecture contains a single load balancer, two web servers, a single app server, and a single database. The load balancer, app server, and database in this example are single points of failure (SPOFs). A failure of any of these components can cause user requests to the application to fail. To remove the SPOFs in your application stack, distribute resources across locations and deploy redundant resources.
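
The SPOF check described above can be sketched in a few lines. The example below models the stack as a mapping from tier name to instance count, which is an assumption made for illustration; the function name is hypothetical.

```python
def find_spofs(tiers):
    """Return the names of tiers that run fewer than two instances."""
    return [name for name, count in tiers.items() if count < 2]


# The example architecture from the text: only the web tier is redundant.
example_stack = {
    "load balancer": 1,
    "web server": 2,
    "app server": 1,
    "database": 1,
}

print(find_spofs(example_stack))  # ['load balancer', 'app server', 'database']
```

Counting instances is only a first pass. Two instances in the same zone still share a failure domain, so a fuller check would also look at where each instance runs.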

To remove the single points of failure, distribute your application infrastructure redundantly across multiple zones or regions. Which deployment model fits depends on the workload:

Multi-zone: Workloads that need resilience against zone outages but can tolerate some downtime caused by region outages.
Multi-region: Workloads that are business-critical and where high availability is essential, such as retail and social media applications.
Single-zone: Workloads that can tolerate downtime or can be deployed at another location when necessary with minimal effort.

2. Cost, latency, and operational considerations

When designing a distributed architecture with redundant resources, it’s crucial to consider not only the availability requirements of the application but also the impact on operational complexity, latency, and cost. For example, a business-critical application needs a distributed architecture to achieve high availability, which often increases operational complexity and cost. In contrast, a less critical application such as batch processing, which requires low-latency, high-bandwidth connections between its VMs, typically fits a single-zone architecture. In that case, availability through redundancy is less of a concern than latency and bandwidth.

3. Deployment architectures: This is a very important section of this article as this provides the foundation for designing resilient, efficient, and reliable systems in the cloud environment.

In Google Cloud Platform (GCP), there are various deployment architectures available to suit different application needs. Here are a few common deployment architectures in GCP:

Single-zone deployment: This is a single-zone application architecture that contains redundancy in every tier. If any component in any tier fails, the application can still process requests. But what if the zone itself fails? A zonal outage affects the entire stack except for the load balancer, which is a regional resource. However, the load balancer can’t distribute traffic because its backends are unavailable, so your application is still affected. Since there is no redundancy outside the zone, you must wait for Google to resolve the outage, unless you have backups from which you can restore the entire stack elsewhere.

A single-zone architecture is a single point of failure during a zonal outage

Multi-zone deployment: As the name says, a multi-zone deployment spans two or more zones, which improves the resilience of your application against a single-zone outage. This architecture gives your application redundancy across zones. If any component fails in one of the zones, the application can still process requests using the resources in the other zone, as you can see in the diagram. If both zones in this architecture have an outage, then the application is unavailable. If a multi-zone outage or region outage occurs, you must wait for Google to resolve the outage, and then verify that the application works as expected.
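
A back-of-the-envelope calculation shows why spanning zones helps. The sketch below uses the standard formulas for redundant (parallel) and dependent (serial) components. It assumes that failures are fully independent, which real zones are not, so the numbers are illustrative and are not how Google derives its availability targets. The function names are made up for this example.

```python
def parallel_availability(single, copies):
    """Availability of `copies` redundant replicas, each available `single` of the time.

    The group is down only when every replica is down at the same moment.
    """
    return 1 - (1 - single) ** copies


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


one_zone = 0.999  # illustrative figure, expressed as a fraction
print(f"one zone:  {parallel_availability(one_zone, 1):.6f}")
print(f"two zones: {parallel_availability(one_zone, 2):.6f}")
print(f"two tiers in series, one zone each: {serial_availability([one_zone, one_zone]):.6f}")
```

Redundancy multiplies the failure probabilities together, while stacking tiers multiplies the availabilities together. That is why every tier needs its own redundancy: a single non-redundant tier caps the availability of the whole stack.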

Multi-Zone Deployment Architecture before (Left Image) and after an outage (Right Image)

Multi-region deployment with regional load balancing: If you want to protect against region outages, distribute resources across multiple regions in Google Cloud. Keep in mind that this deployment strategy uses a regional load balancer in each region, which avoids making a global load balancer a single point of failure. In this architecture, if any one region goes down, the application is still available and continues to serve user traffic, because an independent application stack is deployed in each region (of course, the failover region must have enough capacity to serve the load). The DNS zone steers user requests to the region that’s not affected by the outage. The Cloud Spanner instance in this architecture uses a multi-region configuration, which is resilient to region outages as well as zone outages. If any two of the regions in this architecture have outages, then the application is unavailable, and you must wait for Google to resolve the outages.
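
The DNS steering described above boils down to a simple decision: send users to their preferred region if it is healthy, otherwise fail over to a healthy one. The sketch below models only that decision. In practice it is made by DNS routing and health checks rather than by application code, and the function name, the health map, and the region names used here are illustrative assumptions.

```python
def pick_region(region_health, preferred):
    """Return the preferred region if it is healthy, else the first healthy alternative.

    `region_health` maps a region name to True (healthy) or False (outage).
    """
    if region_health.get(preferred, False):
        return preferred
    for name, healthy in region_health.items():
        if healthy:
            return name
    return None  # every region is down, so the application is unavailable


# Illustrative region names; the preferred region is in an outage.
health = {"us-east1": False, "europe-west1": True}
print(pick_region(health, preferred="us-east1"))  # europe-west1
```

The failover only helps if the surviving region has enough spare capacity and quota to absorb the redirected traffic, which is why capacity planning is part of the design.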

Note: You can also use Cloud SQL or SQL Server running on a VM, but neither lets you serve user requests in parallel from each region. The best practice is to keep an asynchronous SQL Server replica as a hot standby in another region for disaster recovery, but be mindful that the entire architecture will be different in this case. I will cover this in a separate article.

Multi-Region Deployment Architecture with **Regional LB** before (Left Image) and after an outage (Right Image)

Multi-region deployment with global load balancing: The following diagram shows an alternative multi-region deployment that uses a global load balancer instead of regional load balancers. This architecture uses a global external HTTP(S) load balancer (with Cloud CDN enabled) to receive and respond to user requests. Each forwarding rule of the load balancer uses a single external IP address, so you don’t need to configure a separate DNS record for each region. The load balancer routes requests to the region that’s closest to the users.

Google Cloud HTTP(S) load balancing is implemented at the edge of Google’s network in Google’s points of presence (PoPs) around the world. User traffic directed to an HTTP(S) load balancer enters the PoP closest to the user and is then load-balanced over Google’s global network to the closest backend that has sufficient capacity available. The HTTP(S) load balancer uses an anycast IP address, so each request enters Google’s network at the location closest to the user.
What is Anycast?
In anycast, a collection of servers shares the same IP address, and data from a source computer is sent to the server that is topologically closest. This helps cut down on latency and bandwidth costs, improves load time for users, and improves availability. It is important to remember that topologically closer does not inherently mean geographically closer, though this is often the case.
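
The routing behaviour described above, "the closest backend that has sufficient capacity", can be sketched as a small selection function. This is a toy model, not Google’s actual load-balancing algorithm: it uses latency as a stand-in for topological distance, and the `Backend` fields, region names, and numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    region: str
    latency_ms: float  # network distance from the point where the user entered
    capacity: int      # requests per second the backend can serve
    load: int          # requests per second it is serving right now


def choose_backend(backends):
    """Return the lowest-latency backend that still has spare capacity."""
    candidates = [b for b in backends if b.load < b.capacity]
    if not candidates:
        return None  # nothing can take more traffic
    return min(candidates, key=lambda b: b.latency_ms)


backends = [
    Backend("europe-west1", latency_ms=12.0, capacity=100, load=100),  # closest, but full
    Backend("us-east1", latency_ms=85.0, capacity=100, load=40),
    Backend("asia-east1", latency_ms=210.0, capacity=100, load=10),
]

print(choose_backend(backends).region)  # us-east1
```

Because the closest backend is at capacity, traffic spills over to the next closest one instead of being dropped, which is the availability benefit of fronting several regions with one global entry point.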

Benefits of Having Global Load Balancer:

A single load balancer, which simplifies operations.
A single anycast IP address (Google advertises this IP from each region in the background, as explained above).
Resilience to region outages.
Support for features like Cloud CDN, storage buckets as backends, and Cloud Armor.

Risks With Global Load Balancer:

Single point of failure: An incorrect configuration change to the global load balancer might make the application unavailable to users. For example, if the frontend or a forwarding rule is accidentally deleted, the load balancer stops receiving requests. The effect of this risk is lower in the multi-region architecture with regional load balancers described in the previous design.
An infrastructure outage that affects global resources might make the global load balancer unavailable.

Multi-Region Deployment Architecture with **Global HTTPS LB** before (Left) and after an outage (Right)

To mitigate these risks, you must manage changes to the global load balancer carefully, and consider using defense-in-depth fallbacks where possible.

Conclusion: Reliability is the bedrock of trust in any system. Whether it’s a person or a technological infrastructure, reliability ensures confidence by consistently delivering expected results and maintaining stable performance. We have explored the significance of reliability in system design, understanding different factors and deployment patterns. A reliable system not only ensures uninterrupted availability but also consistently functions as expected, meeting the demands and expectations of its users.

I hope that this article has given you valuable insights and added to your knowledge of reliability. If you like it, please share it with others, and feel free to leave a comment with any doubts or questions. I am happy to assist.

Written by Sumit K