Why does HPC need liquid cooling?

Parallel computing / Facility efficiency / Scalability Management

wen tsen liao
Wen’s writing blog
7 min read · Jul 20, 2024


High-Performance Computing (HPC) refers to using supercomputers and computer clusters to solve complex problems. These problems span various domains, including science, engineering, and business, and standard computers may struggle to handle them. HPC systems differ from general-purpose computers in that they employ parallel processing, where multiple processors work together simultaneously to address these challenges. This contrasts with standard systems, which typically use a single processor for sequential tasks. For instance, HPC systems often utilize low-latency, high-bandwidth networks to achieve fast data transfer between nodes and storage systems. Additionally, they may incorporate GPUs to leverage their strengths in mathematical computation, machine learning, and graphics-intensive workloads.

HPC uses parallel computing to perform multiple tasks simultaneously on multiple computer servers or processors. Massively parallel computing is a form of parallel computing that uses tens to millions of processors or processor cores. Each computer in the cluster is called a node, and each node is responsible for a different part of the processing. HPC differs in scale and performance from PCs or rack servers. A PC is a general-purpose computer usually used by a single person for daily tasks. HPC, on the other hand, combines the power of multiple computers (often thousands) to solve complex problems that are too large or time-consuming for a single PC or even a rack of servers. HPC clusters enhance performance by connecting multiple HPC nodes into a cluster or supercomputer with parallel data-processing capabilities. This enables them to run large-scale simulations, artificial intelligence inference, and data analysis that may not be possible on a single system.
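The limits of simply adding processors can be illustrated with Amdahl's law, which bounds the speedup of a program by its serial fraction. A minimal sketch in Python (the 95% parallel fraction is an assumed, illustrative value, not a measurement from any real workload):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Upper bound on speedup when only part of a program parallelizes.

    Speedup = 1 / (serial_fraction + parallel_fraction / n_processors)
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

for n in (1, 10, 1000):
    # Even with 1000 processors, a 95%-parallel program stays below 20x speedup.
    print(n, round(amdahl_speedup(0.95, n), 2))
```

This is why HPC clusters invest so heavily in low-latency interconnects and parallel algorithms: the serial and communication fractions, not raw node count, ultimately cap performance.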

There are two main ways in which HPC clusters enhance performance:

Scale-up performance: This leverages hardware and software architecture to spread computation across the resources of a single server. It provides performance gains but is limited by the capabilities of one system.

Scale-out performance: When multiple systems are configured to act as one, the resulting HPC cluster has scale-out performance, achieved by distributing computation across more nodes in parallel.

HPC clusters typically consist of CPUs, accelerators, high-performance communication fabrics, and complex memory and storage hierarchies. These components work together across nodes to prevent bottlenecks and deliver optimal performance.
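The scale-out idea above can be sketched with Python's standard multiprocessing module, treating each worker process as a stand-in for a cluster node. The square-sum kernel is purely illustrative; a real HPC job would run a simulation or training step per chunk:

```python
from multiprocessing import Pool

def simulate_chunk(chunk):
    # Placeholder for a compute-heavy kernel (e.g., one tile of a simulation).
    return sum(x * x for x in chunk)

def run_parallel(data, workers):
    # Split the problem into independent chunks, one per worker ("node").
    chunks = [data[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(simulate_chunk, chunks)
    # Combine partial results, as an HPC job would reduce across nodes.
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(run_parallel(data, workers=4))
```

The same split/compute/reduce pattern underlies MPI-style HPC codes, just with message passing across physical nodes instead of processes on one machine.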

Due to the need for horizontal scaling, control and cooling across nodes become crucial. Traditional air cooling, which relies on chillers and Computer Room Air Conditioning (CRAC) units, is approaching its limits. Once rack density reaches the 5–10 kW range, air cooling efficiency drops, and attempts to compensate with more airflow only complicate matters because of low energy efficiency. As a result, systems become both heat-constrained and wasteful in terms of energy.

Liquid cooling provides a more efficient alternative, and the transition to it represents a significant advancement in heat management. Direct Liquid Cooling (DLC) systems circulate coolant directly over heat-generating components such as CPUs and GPUs. Not only do these systems support higher rack densities (15–30 kW), but they also improve energy efficiency, making them an excellent replacement for air cooling. Immersion cooling goes further, submerging hardware entirely in non-conductive liquid and eliminating the need for air cooling infrastructure. This approach can handle rack densities exceeding 50 kW while balancing system complexity and performance. Within a single cabinet, chip temperature differences are less than five degrees, and temperature differentials between individual nodes are also minimal. This significantly improves quality control and efficiency when managing horizontally scaled nodes.
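The efficiency gap between air and liquid follows directly from the fluids' heat capacities. A back-of-the-envelope sketch using Q = ṁ·c_p·ΔT, where the 30 kW rack load and 10 K temperature rise are assumed illustrative values and the fluid properties are textbook approximations:

```python
def coolant_flow(load_w, delta_t_k, cp, density):
    """Volumetric flow (m^3/s) needed to carry away load_w watts.

    From Q = m_dot * c_p * dT, the mass flow is m_dot = Q / (c_p * dT).
    """
    mass_flow = load_w / (cp * delta_t_k)   # kg/s
    return mass_flow / density              # m^3/s

# Approximate properties near room temperature.
AIR = dict(cp=1005.0, density=1.2)      # J/(kg*K), kg/m^3
WATER = dict(cp=4186.0, density=998.0)

rack_load = 30_000   # 30 kW rack, in the DLC range cited above
delta_t = 10.0       # allowed coolant temperature rise, K

air_flow = coolant_flow(rack_load, delta_t, **AIR)
water_flow = coolant_flow(rack_load, delta_t, **WATER)

# Water carries the same heat with a tiny fraction of the volumetric flow.
print(f"Air:   {air_flow:.2f} m^3/s")
print(f"Water: {water_flow * 1000:.2f} L/s")
```

The result, roughly 2.5 m³/s of air versus under one liter per second of water for the same rack, is the physical reason air cooling stalls around 5–10 kW per rack while liquid comfortably reaches 30 kW and beyond.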

In the world of Bitcoin mining, the importance of cooling technology cannot be overlooked. As mining hardware performs intensive computational tasks, it generates significant heat, so effective cooling is crucial for maintaining hardware performance and longevity. Cooling not only impacts mining efficiency and energy consumption but also directly affects the economics of mining, including overclocking headroom and failure rates.

The most common cooling method is air cooling, which uses fans to expel hot air from the mining rig. However, air cooling efficiency is limited by ambient temperature and is less effective in hot environments; fan noise can also be problematic. Immersion cooling, on the other hand, is an emerging technique that fully submerges mining rigs in non-conductive liquid. This liquid has far better thermal conductivity than air, allowing it to dissipate heat from the equipment rapidly and evenly. The benefits of immersion cooling include improved mining efficiency, reduced energy consumption, decreased hardware wear, extended equipment lifespan, lower noise levels, and, in specific cases, the ability to recover waste heat. However, immersion cooling also presents challenges, such as upfront investment costs, liquid-cooling system maintenance, and decisions regarding coolant selection and handling. Miners must weigh these factors to determine the most suitable cooling method for their operations. Innovations like next-generation immersion-cooled rigs offer further opportunities for extreme overclocking performance.

Effective cooling systems are crucial not only for improving mining efficiency but also for protecting hardware from overheating damage, ensuring the sustainability and profitability of mining operations. As Bitcoin mining competition intensifies, adopting advanced cooling technology becomes a key success factor for miners. Now, let's compare how mining rigs and servers differ in their applications.

In terms of hardware setup:

Mining rigs: These use rows of identical ASIC chips laid out uniformly on the board, allowing straightforward cooling designs. For instance, a large extruded-aluminum heatsink can effectively dissipate heat from the ASIC chips.

Servers: In contrast, server hardware involves complex motherboard designs with multiple modules and chips. The high thermal density demands specialized, independent cooling solutions, which are not as straightforward as the mining rig's approach.

From an operational perspective:

Mining facilities: Mining farms benefit from easier network management, since many nodes are simply connected in parallel, making them more manageable than data centers.

Data centers: Data centers face challenges related to complex server configurations, ongoing maintenance, and the adoption of novel technologies during facility setup.

In summary, cooling plays a critical role in both Bitcoin mining and data center design, and operators must carefully weigh these factors to optimize their operations.

In the realm of information technology, data centers serve as the lifeblood of the digital age. However, their energy efficiency and operational costs have long been focal points for the industry. From power facilities to end-user equipment, every step of energy conversion and utilization affects overall efficiency: the conversion of high-voltage AC power, the step-down process in transformers, losses in internal power distribution, and the energy consumed by the cooling system are all important aspects of energy management. The cooling system in particular is the main source of overhead energy consumption while keeping equipment at optimal temperatures. In addition, managing load fluctuations and the idle power draw of IT equipment is also crucial to improving energy efficiency.

Infrastructure cooling technology has also made progress. Methods such as hot/cold aisle containment and free-air cooling have improved efficiency and reduced energy consumption when deploying multiple rows of cabinets. The growing adoption of liquid cooling, especially in high-density configurations, further reduces the energy overhead of cooling. These advances, together with measures such as power capping and virtualization, highlight the data center industry's proactive approach to balancing computing power with environmental management, significantly reducing operating costs and environmental impact.

Faced with these challenges, data centers are exploring more efficient transformer designs, improved power transmission methods, innovative cooling technologies, and energy-efficiency improvements in server hardware. Looking ahead, liquid immersion cooling, the application of artificial intelligence in power management, and the integration of renewable energy will be key directions for improving data center energy efficiency, for example by reducing leakage current in immersion systems, improving power delivery, and lowering cooling overhead. Careful evaluation of such energy-saving technologies allows higher power and performance loads to stay online; these advances in management and system reliability are indispensable for digital infrastructure.
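These overheads are commonly summarized by Power Usage Effectiveness (PUE): total facility power divided by IT power, where 1.0 would mean zero overhead. A small sketch with hypothetical, illustrative numbers for an air-cooled versus a liquid-cooled facility (the specific kW figures are assumptions, not measurements):

```python
def pue(it_power_kw, cooling_kw, power_loss_kw):
    """Power Usage Effectiveness: total facility power / IT power."""
    total = it_power_kw + cooling_kw + power_loss_kw
    return total / it_power_kw

# Hypothetical facility with 1 MW of IT load.
air_cooled = pue(1000, cooling_kw=400, power_loss_kw=100)    # -> 1.5
liquid_cooled = pue(1000, cooling_kw=80, power_loss_kw=100)  # -> 1.18

print(f"Air-cooled PUE:    {air_cooled:.2f}")
print(f"Liquid-cooled PUE: {liquid_cooled:.2f}")
```

Because cooling is typically the largest non-IT load, shrinking it is the single most direct lever on PUE, which is why liquid and immersion cooling figure so prominently in the efficiency roadmap above.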
