SAS/COM: An architect’s guide to system design

Charles
Google Cloud - Community
May 20, 2019

I spend a lot of time thinking through different architecture designs, and some time ago I developed a set of design principles that I call SAS/COM for short: Scalability, Availability, Security, Cost, Operations and Maintenance. I'll make the claim that any architecture has to consider these principles during design. In this post, I'll describe why each of them is important. Since I spend most of my productive time working with Google Cloud, I'll share examples related to that product set.

Scalability

Understanding the volume, velocity and variety of user traffic helps you grasp the sheer size of the problem. It's a simple place to start your design.

Knowing the number of queries per second (qps), the latency requirements, the variability in usage patterns, the size of the requests and their geographical origins all helps you understand how to structure an app and its infrastructure. On top of that, understanding how these requirements may change over time is also critical to designing a system that continues to work as traffic scales. These requirements impact the selection of the platform, the caching strategy, the storage technologies and the overall app architecture.

For example, if you have highly variable traffic, then a PaaS product like App Engine that can automatically scale up and down based on traffic is a good option. Going deeper, you may want to leverage App Engine's warm-up requests so instances are ready to serve as soon as demand spikes. For low-latency apps, you'll want to cache information at both the browser and the app, and leverage a CDN like Cloud CDN for static content. For app state, you'll want to choose a low-latency database. You'll also want to think about separating the app into a frontend and a backend and building the service using microservices or RPCs. Compute options for microservices on GCP range from Google Kubernetes Engine to App Engine or even Cloud Functions. For apps with global user bases, you'll want to locate your frontend as close to the user as possible to reduce latency while ensuring that frontend-to-backend communication is robust. Google Cloud's global load balancer provides an easy way to efficiently route user traffic.
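To make the sizing exercise behind these choices concrete, here is a minimal back-of-the-envelope sketch; the traffic figures and per-instance capacity are made-up assumptions, not measurements from any real system:

```python
# Rough capacity estimate for a highly variable workload.
# Every number below is an illustrative assumption.
import math

AVG_QPS = 2_000          # typical steady-state traffic
PEAK_MULTIPLIER = 8      # e.g. a daily spike or marketing event
QPS_PER_INSTANCE = 150   # what one instance handles within its latency budget
HEADROOM = 1.3           # spare capacity so a single failure doesn't overload the rest

def instances_needed(qps: float) -> int:
    return math.ceil(qps * HEADROOM / QPS_PER_INSTANCE)

print(f"Instances at steady state: {instances_needed(AVG_QPS)}")
print(f"Instances at peak:         {instances_needed(AVG_QPS * PEAK_MULTIPLIER)}")
# The gap between these two numbers is what an autoscaling platform
# such as App Engine absorbs for you.
```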

Availability

Why does availability matter? Because you'll need to design your system to offer the right level of availability, and no more. Google's Site Reliability Engineering (SRE) books and blog posts have a lot to say about designing scalable systems with the appropriate level of availability and are worth a read if you haven't read them already. Understanding the availability requirements allows you to design a system with the appropriate amount of uptime, but not more. Increasing uptime can be costly, so it's important to understand the availability requirements from the business. A high-traffic, global website or service has very different availability requirements than your internal analytics pipeline; both are important, but the public-facing service will usually demand more uptime.

When you develop an app, it's a very good idea to formally capture the availability requirements by defining a Service Level Objective (SLO), even if you don't have a formal Service Level Agreement (SLA) with users or customers. SLOs help drive the design of the app, for example by requiring storage with specific availability or performance characteristics. They also help define the monitoring plan, which will be important once the app is operational.
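To make an SLO concrete, it helps to translate the availability target into an error budget. Here is a minimal sketch; the 99.9% target is just an example, not a recommendation:

```python
# Translate an availability SLO into a monthly error budget.
# The 99.9% target is an illustrative assumption.
SLO_TARGET = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60

allowed_downtime = (1 - SLO_TARGET) * MINUTES_PER_MONTH
print(f"Error budget at {SLO_TARGET:.1%}: ~{allowed_downtime:.0f} minutes per 30-day month")
# Roughly 43 minutes/month at 99.9%; at 99.99% it shrinks to about 4 minutes,
# which is why each extra "nine" tends to require a more redundant, more expensive design.
```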

Cloud service providers, including Google Cloud, offer SLAs for their services. Building your app on top of cloud infrastructure means that you need to consider the underlying cloud infrastructure SLAs.

For highly available systems, you need to select compute, storage and network designs that achieve high availability through redundancy. For example, you'll want to include features such as Google Cloud's global load balancer, multi-regional Cloud Storage and redundant Cloud Interconnect links for any cloud-to-on-prem communication. Using a load balancer means that you can spread traffic across a group of instances or a set of services on a GKE cluster. Using the multi-region option for Cloud Storage or Cloud Datastore means that your content is automatically replicated to multiple locations, which provides a higher availability guarantee. Redundancy in a cloud-to-on-prem channel can be achieved by setting up two Cloud Interconnect connections to your on-prem network. All of these design choices are driven by the availability requirements.
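As one concrete illustration, here is a minimal sketch that provisions a multi-region Cloud Storage bucket using the google-cloud-storage Python client; the project ID and bucket name are placeholders, and you need application default credentials configured for it to run:

```python
# Create a multi-region Cloud Storage bucket so content is replicated across
# locations for higher availability. Requires: pip install google-cloud-storage.
# The project ID and bucket name are placeholders.
from google.cloud import storage

client = storage.Client(project="my-example-project")
bucket = client.bucket("my-example-ha-assets")
bucket.storage_class = "STANDARD"

# "US" is a multi-region location; a single region such as "us-central1"
# would trade some availability for lower cost and latency.
new_bucket = client.create_bucket(bucket, location="US")
print(f"Created {new_bucket.name} in {new_bucket.location} ({new_bucket.storage_class})")
```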

Security

Security and privacy are critical components of any system design, regardless of where the system is deployed. You may have very specific requirements based on your organization's policies, compliance or regulatory requirements. Keep in mind that security is a very broad term that can include broad requirements such as "all data must be encrypted at rest and in transit with keys that you control" or very specific requirements such as "all TLS connections must use the TLS_RSA_WITH_AES_256_CBC_SHA cipher suite". Or you may have policies that govern how PII is used in your apps. Your organization may also have a team that certifies cloud services for usage, which can be an important consideration for app design. Taken as a whole, security and privacy are broad, deeply technical topics that you'll need to work through during app design.

For example, one common design requirement is that Kubernetes clusters or Compute Engine VM instances should not be exposed through public IPs. In Google Cloud, this policy is implemented via an organization policy constraint. Another common requirement is to connect to on-prem networks with private IP space. In Google Cloud, using VPC Service Controls, Private Google Access and Cloud Interconnect together allows you to keep using private IP space while making use of Google Cloud services. Requirements related to where data is stored may also impact the way that you choose to architect the app. Many of the services in Google Cloud are zonal or regional, which lets you be specific about where the app runs and stores data. Other apps may have compliance requirements; in Google Cloud, you can select from the broad list of BAA-covered services while designing your app. Understanding these and all security-related requirements can dramatically influence the design of your infrastructure and app.
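As a small code-level illustration of the "keys that you control" requirement above, here is a sketch that sets a customer-managed encryption key (CMEK) as a bucket's default key with the google-cloud-storage client; the project, bucket and key names are placeholders, and the Cloud Storage service account must be granted encrypt/decrypt permission on the key:

```python
# Set a customer-managed encryption key (CMEK) as the default for a bucket,
# so new objects are encrypted at rest with a key you control.
# Requires: pip install google-cloud-storage. All resource names are placeholders.
from google.cloud import storage

KMS_KEY = (
    "projects/my-example-project/locations/us/"
    "keyRings/my-example-keyring/cryptoKeys/my-example-bucket-key"
)

client = storage.Client(project="my-example-project")
bucket = client.get_bucket("my-example-sensitive-data")
bucket.default_kms_key_name = KMS_KEY
bucket.patch()  # persist the metadata change on the bucket

print(f"Default CMEK for {bucket.name}: {bucket.default_kms_key_name}")
```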

Cost

The efficiency of a design can have a material impact on the cost of developing and running an app. Inefficient designs may include inefficient database calls, unnecessary service calls and non-optimal caching, each of which may incur additional cost. Since most cloud services are billed based on usage, the more you use, the more you pay. This can definitely offer advantages for workloads that scale up and down over time, though it can also make the overall cost more difficult to estimate before you build your app.

Some important items to consider are the differences in cost between storage options, egress costs across regions and compute costs. Other items to consider are the per-operation costs of individual components such as load balancing, storage operations and logging volumes. In fact, cost is one of the reasons that serverless technologies are increasingly popular.

For example, if you deploy a microservices app that makes service calls across Google Cloud regions, you will incur egress traffic charges. Picking the storage class can also have a dramatic impact on cost, depending on the current and projected future volumes. As an example, US multi-regional coldline storage costs $0.007/GB while US multi-regional standard storage costs $0.026/GB (at the time of writing). That's more than a 3X difference and can become material quickly as storage sizes grow over time. As a last example, picking a Compute Engine instance or a GKE cluster has a different cost profile than a serverless option such as Cloud Run for workloads that scale up and down. Considering cost is a critical component of your system design.
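To see how quickly the storage-class choice compounds, here is a minimal sketch using the per-GB prices quoted above; the prices will drift over time, and coldline retrieval and early-deletion fees are deliberately ignored:

```python
# Compare monthly at-rest cost for the two storage classes mentioned above.
# Prices are the per-GB figures quoted in this post; retrieval and
# early-deletion fees for coldline are ignored in this sketch.
STANDARD_PER_GB = 0.026   # USD/GB/month, US multi-region standard
COLDLINE_PER_GB = 0.007   # USD/GB/month, US multi-region coldline

for terabytes in (1, 10, 100):
    gb = terabytes * 1024
    standard = gb * STANDARD_PER_GB
    coldline = gb * COLDLINE_PER_GB
    print(f"{terabytes:>4} TB: standard ${standard:,.2f}/mo, "
          f"coldline ${coldline:,.2f}/mo, saving ${standard - coldline:,.2f}/mo")
```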

Operations

Observability is a key design consideration for your app. This area focuses on running the app, which is commonly done by the DevOps team or, as this function is known at Google, the Site Reliability Engineering (SRE) team. Thinking about important metrics such as Service Level Indicators (SLIs) and your app's Service Level Objectives (SLOs) during the design phase helps you include the required instrumentation in your design. The SRE/DevOps team will be very interested in how your app can be monitored and what the dashboards and alert thresholds mean for identifying and resolving errors.

For example, Stackdriver Monitoring provides infrastructure and app-level monitoring for Google Cloud services. In Stackdriver, you can use the provided GCP infrastructure metrics or write your own custom metrics to track your app's Service Level Indicators. In fact, for App Engine and Istio-enabled apps, Stackdriver's Service Monitoring will automatically track your SLIs and report against your defined SLOs. Stackdriver alerting and dashboards provide SREs with detailed insights into the specific metrics of interest for your app.
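As a minimal, monitoring-system-agnostic sketch of what an availability SLI looks like, here is the underlying arithmetic; in practice the request counts would come from your metrics (for example, Stackdriver time series) rather than hard-coded values:

```python
# Availability SLI: the ratio of "good" requests to all requests, compared
# against the SLO target. The counts below are illustrative placeholders.
SLO_TARGET = 0.999

total_requests = 1_250_000
failed_requests = 900            # e.g. responses with a 5xx status

sli = (total_requests - failed_requests) / total_requests
budget_consumed = failed_requests / (total_requests * (1 - SLO_TARGET))

print(f"Availability SLI: {sli:.5f} (target {SLO_TARGET})")
print(f"Error budget consumed in this window: {budget_consumed:.0%}")
if sli < SLO_TARGET:
    print("SLO at risk: this is the condition an alerting policy should fire on.")
```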

For debugging and troubleshooting, it's important to consider the logging strategy and the use of application performance management (APM) tools. APM tools usually require some instrumentation at the code level and should be considered as part of the software development lifecycle.

For example, Stackdriver's APM tools include Profiler, Debugger and Trace, and they are integrated with OpenCensus. Each provides useful information for the SRE or development team to use when troubleshooting or analyzing an app's performance.
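For a flavor of what code-level instrumentation looks like, here is a minimal sketch using the OpenCensus Python library; the span names and the simulated work are placeholders, and exporting spans to Stackdriver Trace would additionally require configuring the Stackdriver exporter, which is omitted here:

```python
# Minimal tracing instrumentation with OpenCensus (pip install opencensus).
# Spans are printed locally by the default exporter; a Stackdriver Trace
# exporter can be plugged into the Tracer to send them to Google Cloud instead.
import time

from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.tracer import Tracer

tracer = Tracer(sampler=AlwaysOnSampler())

def handle_request():
    with tracer.span(name="handle_request"):
        with tracer.span(name="query_database"):
            time.sleep(0.05)   # stand-in for a database call
        with tracer.span(name="render_response"):
            time.sleep(0.01)   # stand-in for rendering the response

handle_request()
```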

Maintenance

The design of an app can have a very significant impact on deployment time and cost. A design that meets current requirements, scales with varying traffic loads and is easy to modify as new requirements surface makes the app easier to maintain over time.

For example, if you are using Cloud Bigtable for your app storage, then carefully designing the row key will mean that the app meets current requirements, scales with higher traffic volumes and is easy to change as the app's requirements evolve over time.
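To make that concrete, here is a minimal sketch of one common row-key pattern; the field ordering and the short hash prefix are illustrative choices for this example, not the only valid design:

```python
# Sketch of a Bigtable row-key design that avoids write hotspots.
# Bigtable sorts rows lexicographically by key, so keys that start with a
# monotonically increasing timestamp concentrate writes on one node.
# Leading with a value that distributes well (here, a device ID plus a short
# hash prefix) spreads the load while keeping one device's rows contiguous
# so range scans per device stay efficient.
import hashlib

def row_key(device_id: str, event_timestamp_ms: int) -> bytes:
    prefix = hashlib.sha1(device_id.encode()).hexdigest()[:4]
    return f"{prefix}#{device_id}#{event_timestamp_ms:013d}".encode()

print(row_key("sensor-042", 1558300000000))
```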

A common way to help maintain development velocity while reducing deployment time and cost is to use CI/CD tooling. CI/CD tooling means that development teams can design their tests and apps around a standard process, which also makes the apps more maintainable because changes are easier to deploy. Getting code to production faster, with fewer errors and less human time, makes developers more productive and reduces costs.

For example, you may set up a CI/CD pipeline using Jenkins with Google Kubernetes Engine. There are many different ways to set up a pipeline, but the end result should provide an automated way to build, test and deploy your code. From a design perspective, it's important to think about your code being released through the CI/CD pipeline and to develop automated testing that takes advantage of the automation.
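As a small example of the kind of automated check a pipeline stage runs on every commit, here is a pytest-style sketch; the function under test is a placeholder standing in for your own application code:

```python
# Minimal pytest-style tests of the sort a CI stage runs on every commit.
# apply_discount is a placeholder for real application code.
import pytest

def apply_discount(price_cents: int, percent: int) -> int:
    """Example application function under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100

def test_apply_discount_happy_path():
    assert apply_discount(1000, 25) == 750

def test_apply_discount_rejects_bad_percent():
    with pytest.raises(ValueError):
        apply_discount(1000, 150)
```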

Conclusion

Evaluating your requirements against a rubric of Scalability, Availability, Security, Cost, Operations and Maintenance helps you design systems that scale appropriately, meet user SLOs, protect sensitive data, optimize cost, are well instrumented and take advantage of deployment automation.
