Computer System Architecture Part 8 — Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism

Abde Manaaf Ghadiali
12 min read · Nov 29, 2023


Reading about Warehouse-Scale Computers is like getting a backstage pass to the ‘Super Bowl’ of computing — where servers are the real MVPs, and the only tailgate party is for mitigating ‘Tail Latency’!

Contents

  1. Introduction
  2. Programming Models and Workloads for Warehouse-Scale Computers
  3. Computer Architecture of Warehouse-Scale Computers
  4. The Efficiency and Cost of Warehouse-Scale Computers
  5. Cloud Computing: The Return of Utility Computing
  6. Cross-Cutting Issues
  7. Putting It All Together: A Google Warehouse-Scale Computer
  8. Conclusion
  9. Appendix

Introduction

The warehouse-scale computer (WSC) is the fundamental infrastructure behind the Internet services that billions of people use daily, such as search engines, social networks, online maps, video sharing platforms, online shopping websites, and email services. The design of WSCs is a natural extension of computer architecture, but operating at such a large scale necessitates innovations in power distribution, cooling, monitoring, and operations. WSCs can be considered the modern successors to supercomputers, with Seymour Cray recognized as a pioneer whose influence reaches today's WSC architects.

WSC architects share common objectives and requirements with server architects, including:

  1. Cost-performance
  2. Energy efficiency
  3. Dependability through redundancy
  4. Network I/O
  5. Handling both interactive and batch processing workloads (Note: MapReduce jobs are utilized to convert web-crawled pages into search indices)

However, there are characteristics unique to WSC architecture:

  1. Abundance of parallelism (Note: Request-level parallelism, where independent tasks can proceed simultaneously with minimal communication or synchronization)
  2. Consideration of operational costs and location
  3. Efficient computing at low utilization
  4. Addressing scale-related opportunities and challenges

Interactive Internet service applications, also known as Software as a Service (SaaS), can leverage the independence of millions of users. In SaaS, reads and writes are typically not interdependent, minimizing the need for synchronization.

The predecessors of WSCs are computer clusters — collections of independent computers connected via local area networks (LANs) and switches. Clusters, especially for workloads with limited communication needs, offered more cost-effective computing than shared-memory multiprocessors.

Comparing WSCs to conventional data centers, traditional data centers usually aggregate machines and third-party software from various sources, centrally running services for different entities. Their primary focus is consolidating numerous services onto fewer machines, with isolation to protect sensitive information. Virtual machines are increasingly crucial in data centers.

Similarly, virtual machines play a significant role in WSCs but with a distinct purpose. They provide isolation between different customers and segment hardware resources into varying shares, available for rent at different price points.

Programming Models and Workloads for Warehouse-Scale Computers

Aside from hosting well-known public Internet services like search engines, video sharing platforms, and social networking sites, Warehouse-Scale Computers (WSCs) also handle batch applications. These applications include tasks such as video format conversion and generating search indexes from web crawls.

A widely used framework for batch processing in WSCs is MapReduce, along with its open-source counterpart Hadoop. MapReduce takes its name from the Lisp functions map and reduce: Map first applies a programmer-supplied function to each logical input record. The Map operation runs on hundreds of computers, producing key-value pairs as an intermediate result. The Reduce operation then collects and collapses that output using another programmer-defined function. Provided the Reduce function is commutative and associative, the reduction can be performed in logarithmic time, for example as a tree of partial reductions.
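
As a rough illustration of the pattern (not Google's or Hadoop's implementation), the sketch below mimics a MapReduce-style word count in plain Python: `map_fn` emits key-value pairs and `reduce_fn` collapses all values that share a key. The function names and the toy driver are assumptions for this example.

```python
from collections import defaultdict

def map_fn(document):
    # Emit a (word, 1) pair for every word in one logical input record.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Commutative and associative, so partial reductions can be combined
    # in any order (e.g., as a tree of logarithmic depth).
    return (word, sum(counts))

def mapreduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:                 # Map phase (parallel across servers in a real WSC)
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    return [reduce_fn(k, v) for k, v in intermediate.items()]  # Reduce phase

print(mapreduce(["the cat sat", "the dog sat"]))
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]
```

In a real WSC the Map and Reduce phases run on hundreds of machines and the intermediate pairs are shuffled over the network, but the programming contract is the same as in this toy version.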

Conceptually, MapReduce can be seen as a generalization of the Single Instruction Stream, Multiple Data Streams (SIMD) operation, except that a function to be applied is passed to the data, followed by a function used to reduce the output of the Map tasks. To cope with performance variability across hundreds of computers, the MapReduce scheduler assigns new tasks based on how quickly nodes complete prior ones. Even so, a single slow task (a straggler) can delay the completion of a large MapReduce job, a problem known as tail latency.

Dependability is integral to MapReduce. Each node in a MapReduce job is required to periodically report completed tasks and updated status to the master node. If a node fails to report by the deadline, the master node considers the node dead and reassigns its work to other nodes.
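
A minimal sketch of that failure-detection idea, assuming a single master that tracks per-node heartbeat timestamps; the class, method names, and timeout value are illustrative, not Google's:

```python
import time

HEARTBEAT_DEADLINE_S = 60.0   # illustrative timeout, not an actual MapReduce setting

class Master:
    def __init__(self):
        self.last_heartbeat = {}   # node id -> time of last status report
        self.assignments = {}      # node id -> set of task ids still in flight

    def report(self, node, completed_tasks):
        # Called when a worker reports completed tasks and updated status.
        self.last_heartbeat[node] = time.time()
        self.assignments.setdefault(node, set()).difference_update(completed_tasks)

    def reap_dead_nodes(self, reassign):
        now = time.time()
        for node, last in list(self.last_heartbeat.items()):
            if now - last > HEARTBEAT_DEADLINE_S:
                # Node missed its deadline: declare it dead and hand its
                # remaining tasks back to the scheduler for reassignment.
                for task in self.assignments.pop(node, set()):
                    reassign(task)
                del self.last_heartbeat[node]
```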

Programming frameworks like MapReduce for batch processing and external SaaS applications like Search rely on internal software services for success. For instance, MapReduce depends on the Google File System (GFS) or Colossus to provide files to any computer, enabling task scheduling anywhere. To enhance read performance and availability, these systems often create complete replicas of data instead of relying on RAID storage servers. Replicas, strategically placed, can mitigate various system failures.

WSC storage software typically opts for relaxed consistency over adhering to all the ACID (Atomicity, Consistency, Isolation, and Durability) requirements of traditional database systems. For example, eventual consistency is acceptable for video sharing, making storage systems more scalable — a crucial necessity for WSCs.
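
A toy sketch of what eventual consistency means in practice, assuming asynchronous replication between a primary and a single secondary replica. This is purely illustrative; real WSC stores such as Colossus or Dynamo are far more sophisticated.

```python
class Replica:
    def __init__(self):
        self.data = {}

primary, secondary = Replica(), Replica()
pending = []                         # replication log, applied asynchronously

def write(key, value):
    primary.data[key] = value        # acknowledged to the client immediately
    pending.append((key, value))     # propagated to the secondary later

def sync():
    while pending:
        key, value = pending.pop(0)
        secondary.data[key] = value

write("video123_views", 42)
print(secondary.data.get("video123_views"))  # None: a stale read is acceptable here
sync()
print(secondary.data.get("video123_views"))  # 42: replicas eventually converge
```

For a view counter on a video-sharing site, serving the stale value for a short time is harmless, and dropping the synchronous-update requirement is what lets the store scale across an array.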

The hardware and software of WSCs must manage load variability based on user demand and cope with performance and dependability fluctuations inherent in hardware at this scale.

Computer Architecture of Warehouse-Scale Computers

In Warehouse-Scale Computers (WSCs), networks play a crucial role in connecting 50,000–100,000 servers, forming a hierarchical structure. Servers are housed in units known as racks. Rack widths vary, some the classic 19 inches and others two or three times wider, but the height typically does not exceed 6–7 ft so that personnel can service the equipment. Each rack contains approximately 40–80 servers. The switch located at the top of the rack, where network cables are often connected, is commonly referred to as a Top of Rack (ToR) switch.

A rack offers high bandwidth internally, so it matters less to software where within the rack the sender and receiver are placed. This flexibility is advantageous from a software perspective.

ToR switches typically have 4–16 uplinks, connecting to the next higher switch in the network hierarchy. Consequently, the bandwidth leaving the rack is 6–24 times smaller than the bandwidth within the rack, a ratio known as oversubscription. The switch linking an array of racks is more expensive than the ToR switch, owing to higher connectivity and the need for greater bandwidth to address the oversubscription issue.
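
To make the oversubscription ratio concrete, here is a back-of-the-envelope calculation. The port counts and link speeds below are made up for illustration; real installations differ.

```python
def oversubscription(servers, server_gbps, uplinks, uplink_gbps):
    # Ratio of aggregate bandwidth within the rack to bandwidth leaving it.
    intra_rack = servers * server_gbps
    leaving_rack = uplinks * uplink_gbps
    return intra_rack / leaving_rack

# Illustrative numbers: 80 servers with 1 Gbit/s links and 8 uplinks at 1 Gbit/s
# give a 10:1 oversubscription, inside the 6-24x range mentioned above.
print(oversubscription(80, 1, 8, 1))   # 10.0
```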

Storage

A practical approach to design involves filling a rack with servers, leaving space for necessary switches. From a hardware standpoint, a straightforward solution is to incorporate disks within the rack and utilize Ethernet connectivity for accessing information on remote servers’ disks. An alternative, albeit more costly, option is to employ network-attached storage (NAS), potentially over a storage network like InfiniBand. Present-day system software often implements RAID-like error correction codes to reduce storage costs while maintaining dependability.

Note — It’s important to recognize that a Warehouse-Scale Computer (WSC) is essentially an exceptionally large cluster. To avoid confusion, we use the term “array” to refer to a large collection of racks organized in rows, preserving the original definition of the word “cluster” to represent anything from a group of networked computers within a rack to an entire warehouse filled with networked computers.

WSC Memory Hierarchy

Figure: the latency, bandwidth, and capacity of the memory hierarchy inside a WSC.

Network overhead significantly increases latency between local DRAM and Flash, rack DRAM and Flash, or array DRAM and Flash. However, all these latencies remain more than 10 times better than accessing the local disk. The network effectively narrows the bandwidth difference between rack DRAM, Flash, and disk, as well as between array DRAM, Flash, and disk.

In a Warehouse-Scale Computer (WSC), most applications fit into a single array. For those requiring more than one array, sharding or partitioning is employed. This involves splitting the dataset into independent pieces and distributing them to different arrays.
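
A minimal sketch of hash-based sharding, assuming the dataset is keyed and the number of arrays is fixed; the array names and key format are illustrative:

```python
import hashlib

ARRAYS = ["array-0", "array-1", "array-2", "array-3"]   # illustrative array names

def shard_for(key: str) -> str:
    # Hash the key so records spread roughly evenly across arrays and the
    # same key always lands in the same array.
    digest = hashlib.md5(key.encode()).hexdigest()
    return ARRAYS[int(digest, 16) % len(ARRAYS)]

print(shard_for("user:12345"))   # deterministic mapping of a record to one array
```

Because each shard is independent, a request only needs to touch the array that owns its keys, which keeps traffic off the slower inter-array links.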

Note — When it comes to block transfers beyond a single server, whether the data is in memory or on disk becomes irrelevant because the bottleneck is the rack switch and array switch. These performance limitations influence the design of WSC software and underscore the necessity for higher-performance switches.

The Efficiency and Cost of Warehouse-Scale Computers

The majority of construction costs for a Warehouse-Scale Computer (WSC) are attributed to infrastructure expenses for power distribution and cooling. Cooling is achieved using a computer room air-conditioning (CRAC) unit, which cools the server room air using chilled water, similar to how a refrigerator removes heat by releasing it outside.

Measuring Efficiency of a WSC

A widely used metric for evaluating the efficiency of a data center or WSC is power utilization effectiveness (PUE), calculated as the total power drawn by the facility divided by the power delivered to the IT equipment:

PUE = Total facility power / IT equipment power

PUE must be greater than or equal to 1, with a higher PUE indicating lower efficiency in the WSC.
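
For example, with made-up numbers, a facility drawing 12 MW in total while its IT equipment draws 8 MW has a PUE of 1.5; the remaining 4 MW goes to cooling, power distribution losses, and other overhead:

```python
def pue(total_facility_power_w, it_equipment_power_w):
    # PUE >= 1; lower is better.
    return total_facility_power_w / it_equipment_power_w

print(pue(12_000_000, 8_000_000))   # 1.5
```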

In a WSC, the DRAM bandwidth within a server is 200 times greater than within a rack, which is, in turn, 10 times greater than within an array.

Due to the paramount concern for user satisfaction in Internet services, performance goals are often defined in terms of service level objectives (SLOs), specifying a threshold for latency that a high percentage of requests should meet, rather than just targeting average latency.
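
A small sketch of checking such an SLO, assuming we have a sample of request latencies and a target like "99% of requests under 200 ms"; the threshold, percentile, and latency values are illustrative, and the nearest-rank percentile is just one simple way to compute it:

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: smallest value with at least p% of samples <= it.
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 90, 350]   # illustrative sample
p99 = percentile(latencies_ms, 99)
print(f"p99 = {p99} ms ->", "SLO met" if p99 <= 200 else "SLO violated")
# p99 = 350 ms -> SLO violated, even though the mean is only ~67 ms:
# the tail, not the average, decides whether the goal is met.
```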

The term “tail tolerant” describes systems designed to meet goals despite variability and occasional latency spikes. Instead of trying to eliminate variability, WSCs employ tail-tolerant techniques such as fine-grained load balancing to mitigate queuing delays caused by contention for shared resources, variable microprocessor performance, software garbage collection, and more.

WSC designers carefully evaluate both operational expenditures (OPEX), the costs incurred during operation, and capital expenditures (CAPEX), which cover the initial construction expenses of the WSC. To simplify the financial analysis, CAPEX can be converted into OPEX through a cost-of-capital conversion, assuming a 5% borrowing cost: the CAPEX amount is spread uniformly over each month of the equipment's effective lifespan (amortized CAPEX). The fully burdened cost of a watt per year in a WSC, which includes amortizing the power and cooling infrastructure on top of the electricity itself, can then be estimated as sketched below.
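
The original formula is not reproduced here, but the idea can be sketched under simple assumptions: straight-line amortization of the power and cooling infrastructure plus the electricity bill for one watt of IT load over a year (8.76 kWh), scaled by PUE. All input numbers below are made up, and the interest treatment is deliberately simplified.

```python
def amortized_capex_per_month(capex_dollars, lifetime_years, annual_cost_of_capital=0.05):
    # Spread CAPEX evenly over the equipment's lifetime, plus a simple
    # interest charge for the borrowed capital (illustrative, not exact finance).
    months = lifetime_years * 12
    return capex_dollars * (1 + annual_cost_of_capital * lifetime_years) / months

def burdened_cost_per_watt_year(infra_capex_per_watt, infra_lifetime_years,
                                price_per_kwh, pue):
    infra = amortized_capex_per_month(infra_capex_per_watt, infra_lifetime_years) * 12
    energy = price_per_kwh * 8.76 * pue   # 8760 h/year / 1000 = 8.76 kWh per watt-year
    return infra + energy

# Made-up inputs: $10/W of power and cooling infrastructure amortized over 10 years,
# $0.07 per kWh, and a PUE of 1.5.
print(round(burdened_cost_per_watt_year(10, 10, 0.07, 1.5), 2))   # ~2.42 $/watt-year
```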

At a rate of about $0.10 per server per hour, WSCs provide a cost advantage over many companies with smaller conventional data centers, leading large Internet companies to offer computing as a utility, known today as cloud computing, where users pay only for what they use.

Cloud Computing: The Return of Utility Computing

Driven by the growing user demand, major Internet companies like Amazon, Google, and Microsoft have constructed larger warehouse-scale computers using standard components. This surge in demand prompted advancements in systems software, leading to innovations like BigTable, Colossus, Dynamo, GFS, and MapReduce. Meeting this demand also required improvements in operational techniques to ensure a service uptime of at least 99.99%, despite component failures and security threats. The scale of procurement in warehouse-scale computers results in volume discount prices for various components, contributing to economies of scale in both purchasing and operational costs.

Amazon Web Services

Various attempts since the timesharing era have aimed to provide pay-as-you-go computing, but many fell short. In 2006, Amazon pioneered modern utility computing with Amazon Simple Storage Service (Amazon S3) and, shortly after, Amazon Elastic Compute Cloud (Amazon EC2). Noteworthy decisions made by Amazon include:

  1. Virtual Machines
  2. Very Low Cost
  3. (Initial) Reliance on Open-Source Software
  4. No (Initial) Guarantee of Service
  5. No Contract Required

In 2014, AWS introduced Lambda, a service that shifts from managing virtual machines to allowing users to supply source code (e.g., Python). AWS automatically manages the resources required by the code, scaling with input size and ensuring high availability. This evolution is termed Serverless Computing, where users are relieved from managing servers (even though these functions are executed on servers). Serverless Computing can be conceptualized as a set of processes running in parallel across the entire WSC, sharing data through a disaggregated storage service such as AWS S3.
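
For flavor, a minimal Python Lambda handler looks roughly like the sketch below. The `lambda_handler(event, context)` entry point is the standard Python signature; the event field and response shape are illustrative, and Lambda itself provisions and scales the underlying servers.

```python
import json

def lambda_handler(event, context):
    # AWS invokes this function on demand; there is no server for the user to manage.
    # 'event' carries the input (its shape depends on the trigger),
    # 'context' carries runtime metadata.
    name = event.get("name", "world")          # illustrative input field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```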

Cross-Cutting Issues

Ensuring the effective performance and cost management of the Warehouse-Scale Computer (WSC) network requires careful attention. The potential for the WSC network to become a bottleneck poses challenges for data placement and complicates WSC software. Given that this software is a highly valuable asset for a WSC company, the additional complexity comes at a significant cost.

While Power Utilization Effectiveness (PUE) assesses WSC efficiency, it doesn’t address internal workings within the IT equipment. Another source of electrical inefficiency is the power supply within the server, responsible for converting high-voltage input to the voltages used by chips and disks. Computer motherboards also feature voltage regulator modules (VRMs), which can have relatively low efficiency. The overarching goal for the entire server is energy proportionality, meaning energy consumption should align with the amount of work performed. Systems software, designed to maximize performance without necessarily considering energy implications, tends to utilize all available resources.
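
To see why energy proportionality matters, here is a toy server power model with illustrative idle and peak wattages: a server that draws half its peak power while idle does very little work per joule at low utilization, which is exactly where WSC servers often operate.

```python
def server_power_w(utilization, idle_w=150.0, peak_w=300.0):
    # Toy linear model with made-up numbers; real servers are not perfectly
    # linear, but the shape of the problem is the same.
    return idle_w + (peak_w - idle_w) * utilization

for u in (0.1, 0.5, 1.0):
    actual_w = server_power_w(u)
    ideal_w = u * server_power_w(1.0)      # a perfectly energy-proportional server
    print(f"utilization {u:.0%}: draws {actual_w:.0f} W, "
          f"an energy-proportional server would draw {ideal_w:.0f} W")
```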

Putting It All Together: A Google Warehouse-Scale Computer

Power distribution in a Warehouse-Scale Computer (WSC) follows a hierarchy, with each level corresponding to a distinct failure and maintenance unit: the entire WSC, arrays, rows, and racks. Software is designed to understand this hierarchy, strategically spreading work and storage to enhance dependability.

After delivering power from utility poles to the WSC floor, the challenge becomes efficiently managing the generated heat. One straightforward approach to improve energy efficiency is to operate the IT equipment at higher temperatures, reducing the need for extensive air cooling.

Now let’s delve into the rack. A WSC comprises multiple arrays (referred to as clusters by Google). While array sizes vary, some consist of one to two dozen rows, with each row holding two to three dozen racks.

The Google WSC network employs a Clos topology, a multistage network built from low-port-count ("low radix") switches. The topology's inherent path redundancy provides fault tolerance, so any single link failure has little impact on overall network capacity, and the network grows in both scale and bisection bandwidth by adding stages to the multistage network.

Now, having covered power, cooling, and communication, let’s explore the computers responsible for the WSC’s actual work. The example server presented features two sockets, each housing an 18-core Intel Haswell processor running at 2.3 GHz. This server typically deploys with 256 GB total DDR3–1600 DRAM across 16 DIMMs. The Haswell memory hierarchy includes two 32 KiB L1 caches, a 256 KiB L2 cache, and a 2.5 MiB L3 cache per core, resulting in a 45 MiB L3 cache. Local memory bandwidth is 44 GB/s with a latency of 70 ns, and intersocket bandwidth is 31 GB/s with a latency of 140 ns to remote memory.
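
Using the latency figures above, a quick estimate shows why software tries to keep data on the local socket; the access mixes are made up for illustration:

```python
LOCAL_NS, REMOTE_NS = 70, 140          # latencies quoted above for the example server

def average_latency_ns(fraction_remote):
    # Weighted average over local vs. remote (other-socket) DRAM accesses.
    return (1 - fraction_remote) * LOCAL_NS + fraction_remote * REMOTE_NS

for remote in (0.0, 0.25, 0.5):        # illustrative access mixes
    print(f"{remote:.0%} remote accesses -> {average_latency_ns(remote):.0f} ns average")
```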

PUE improves as the facility operates closer to its full designed capacity. The resulting higher utilization also reduces the demand for new servers and new WSCs.

Conclusion

The Warehouse-Scale Computer (WSC) has unlocked economies of scale, turning the long-desired vision of computing as a utility into reality. Cloud computing allows individuals and businesses worldwide to harness thousands of servers on demand to realize their ideas. Its appeal also lies in the economic incentive to conserve: a transparent, usage-based pricing scheme encourages efficient use of computation, communication, and storage.

The era of Moore’s Law, coupled with advancements in WSC design and operation, led to a continuous improvement in the performance-cost-energy curve of WSCs. However, as this era concludes and major inefficiencies in WSCs are addressed, the field will likely need to explore innovations in computer architecture for the chips within WSCs to sustain ongoing improvement.

Appendix

Service Availability

Service availability can be estimated by calculating the outage time due to failures of each component, summing, over every component type, the number of failures per year multiplied by the downtime per failure:

Hours of outage per year = Σ (failures of component i × hours of downtime per failure of component i)

With 365 × 24, or 8760, hours in a year, availability is then determined as follows:

Availability = (8760 − Hours of outage) / 8760
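
A small sketch of the calculation under made-up failure counts and repair times; the component list and numbers are illustrative, not from the textbook example:

```python
HOURS_PER_YEAR = 365 * 24   # 8760

# Illustrative annual failure counts and downtime per failure, in hours.
component_outages = {
    "server crashes (reboot)": (250, 5 / 60),   # 250 reboots, about 5 minutes each
    "disk failures":           (25, 1.0),       # operator swaps the disk
    "rack/switch failures":    (2, 1.0),
}

outage_hours = sum(count * hours for count, hours in component_outages.values())
availability = (HOURS_PER_YEAR - outage_hours) / HOURS_PER_YEAR
print(f"outage: {outage_hours:.1f} h/year, availability: {availability:.4%}")
```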

Amdahl’s Law Application in Warehouse-Scale Computers: Impact on Latency and Processor Selection

Amdahl’s Law remains relevant to Warehouse-Scale Computers (WSC). Serial work for each request can impact request latency, especially if it runs on a slow server. If serial work increases latency, the cost of using a less powerful processor must account for software development costs to optimize the code and return to lower latency. The larger number of threads on many slow servers can also pose challenges in scheduling and load balancing, leading to variability in thread performance and potentially longer latencies.
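
As a reminder of the arithmetic, Amdahl's Law bounds the speedup from spreading a request across many slower servers; with even 1% serial work, 64 workers deliver far less than a 64x speedup. The serial fraction and worker counts below are illustrative.

```python
def amdahl_speedup(serial_fraction, n_workers):
    # Speedup = 1 / (s + (1 - s) / N): the serial part is never sped up.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

for n in (8, 64, 1000):
    print(f"{n} workers, 1% serial work -> speedup {amdahl_speedup(0.01, n):.1f}x")
```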
