Watts The Deal With Power? Part I
How We Implement Power Benchmarking In The Billion-Scale Approximate Nearest Neighbor Search Challenge
We announced in May that NeurIPS 2021 will host a unique data and algorithm challenge in billion-scale approximate nearest neighbor search (ANNS). Participating teams will be evaluated across a set of challenging datasets, each with a billion records. We employ search accuracy (measured as recall vs. throughput) as the defining ranking metric for the T1 and T2 competition tracks, which limit RAM to 64GB within a standard server-grade system hosted in Microsoft’s Azure Cloud. However, in the T3 track, we don’t enforce the same hardware restrictions. T3 will also add two leaderboards, one that ranks participants by power usage and one by hardware cost. In this blog, we will discuss the power leaderboard and get into some of the details of how we will collect and compute the power benchmarks.
Why Do We Care About Power Consumption?
Let’s first ask ourselves why we even care about power consumption. For most of us, our direct relationship with a machine’s power consumption becomes most apparent when our laptops warn us the battery power is low. Gamers understand the need to purchase additional external power supplies required for the higher-end GPU cards when building a top-notch PC gaming workstation. Crypto miners who run their own on-prem hardware know all too well the importance of acquiring power efficient hardware to lower their power bill. For those of us who leverage the public cloud for day-to-day work, we are typically far removed (both literally and figuratively) from the power consumption of the cloud services we use and the underlying machines that run them (other than how it might be abstractly factored into the service’s cost).
All that said, awareness surrounding the growing power demands of data centers has increased significantly over the past few years. We all generally agree that indeed “software is eating the world,” and implicitly we expect that more and more machines will be required to power this software-as-a-service industrial revolution. It’s then not too surprising to learn what should be shocking facts about the growing demand for power at data centers:
- Global data centers consumed an estimated 205 terawatt-hours (TWh) in 2018, or 1 percent of global electricity use.
- The amount of energy used by data centers doubles approximately every four years, meaning that data centers have the fastest-growing carbon footprint of any area within the IT sector.
This will likely increase further as workloads become more data-intensive and AI-centric. In one highly recognized paper, researchers measured the cost of training a 200 million parameter transformer-based NLP neural network optimized with neural architecture search. They found that model training and model optimization consumed enough energy to equate to 626K pounds of CO2 emissions. To put that into perspective: round trip air travel between SF and NY produces 2K pounds of CO2 emissions, and the average car produces 126K pounds of CO2 emissions in its lifetime. And that was reported in 2019. The latest NLP transformer models in 2021 are well into the billions of parameters.
What’s Being Done?
Fortunately, the major public cloud providers (Microsoft’s Azure, Amazon’s AWS, and Google’s GCP) have all already started major efforts to improve the power efficiency and to offset the carbon footprint of their data centers:
- Amazon AWS has recently announced they are providing low power ARM-based processors in their EC2 compute instance service. ARM-based chips are the CPUs that power most IoT devices and mobile devices.
- In 2017, Google became the first large-scale public cloud provider to match 100% of its electricity consumption with renewable energy.
Companies like Apple, Facebook, and Uber with similarly large-scale but private compute server footprints also have “green data center” initiatives underway. Industry-wide collaborations like the Open Compute Project promote openly sharing ideas, specifications, and other intellectual property to maximize innovation in this space.
Clearly, new hardware and chipsets specifically designed for power efficiency will continue to be a major component of the design of future data centers. But software also has an equally important role to play. For example, compilers will need to be hardware-aware and will need to be able to produce compiled code that can leverage power efficient hardware features when they are available. Software engineers, data scientists, and algorithm developers must also be aware of their coding decisions and how those decisions can affect not just the typical metrics such as speed and accuracy, but also power consumption.
Creating this awareness was one of the goals of the T3 challenge, and that is why we are maintaining a separate leaderboard that ranks participants based on the measured power consumption of their algorithm. Before we get into those details, let’s take a step back and review some of the basic science around power and power consumption.
The Science Of Power
For many of us, our first encounter with the science of power occurred during our first lessons in physics and chemistry. We are taught very early that the classical notion of “energy” is conserved in any closed system, constantly being converted to and from its kinetic and potential forms. The symmetry of the energy conservation law of the universe is not only intuitively appealing, but is a fundamental underpinning of all the physical sciences and applied engineering fields, from astrophysics to biochemistry to civil engineering.
Often the concepts of energy and power are used synonymously, but there is a subtle and important difference. A good example is lifting a box from the ground: it takes the same amount of energy to lift the box no matter how fast you lift it. Power is another matter. Power defines how fast energy is consumed or transferred; it is energy per unit of time. To lift the box faster, more power is required.
Power is usually reported in watts. To convert from power to energy over a certain period of time, you multiply the power by the length of the time interval (assuming the power was constant throughout that interval). That is why you will typically see power consumption metrics like kilowatt-seconds or kilowatt-hours. Those quantities report the total energy used over a time period.
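As a small worked example of this conversion, here is a minimal Python sketch. The function names and the 400 watt figure are illustrative assumptions, not measurements from our systems.

```python
def energy_watt_seconds(power_watts, seconds):
    """Energy used at a constant power draw: energy = power x time."""
    return power_watts * seconds


def watt_seconds_to_kwh(watt_seconds):
    """1 kWh = 1000 W x 3600 s = 3.6 million watt-seconds."""
    return watt_seconds / 3.6e6


# Hypothetical server drawing a constant 400 W for 2 minutes (120 s).
energy = energy_watt_seconds(400.0, 120.0)
print(energy / 1000.0, "kilowatt-seconds")
print(watt_seconds_to_kwh(energy), "kilowatt-hours")
```

The same amount of energy shows up as a fairly large number of kilowatt-seconds and a very small number of kilowatt-hours, which is why short benchmark runs are easier to read in kilowatt-seconds.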
How Do We Measure Power Consumption In A Server?
If you’ve seen a picture of a modern datacenter, you’ve likely noticed rows and rows of cabinets side-by-side. If you looked inside each cabinet (often called a rack) you would see multiple modules stacked vertically on top of each other. Typically each module is a rectangular chassis full of electronics, such as a motherboard with a CPU or two, add-on PCIe boards, network cards, hard drives, power supplies, and other electronics. The size of a module chassis is typically measured in integral units related to its height, starting with 1U (or 1 unit). Typical sizes are 1U, 2U, and 3U. 4U chassis are becoming more popular; for example, NVIDIA’s DGX system needs a 4U form factor to house all 8 of its GPU boards.
A modern chassis is more than just a metal box that houses the electronics. The chassis itself will contain some light-weight, dedicated electronics that serve very specific purposes. The primary purpose is to support remote power management. Most chassis host a small web server through which you access a simple web app that lets you power the system down and up remotely. Some support KVM (keyboard, video, and mouse), a technology that lets you remotely view the video output of the system and control its mouse and keyboard input. These are tremendously useful tools if you need to manage a system remotely. Sometimes “turning it off, turning it back on” is the only resolution to a problem at a server in a datacenter!
These remote control capabilities are so useful that the server and datacenter industry has come up with a standard called IPMI (Intelligent Platform Management Interface). Chassis manufacturers that support the IPMI standard benefit from all of the existing tools and services that data center engineers already use to manage their fleet of systems.
In addition to remote management, IPMI supports the notion of sensors. There is a broad array of sensors that a chassis manufacturer can build into their systems, and this typically includes power monitoring.
The image below shows a listing of the sensors that are available for a chassis we use from the company called Advantech. This particular model is a 2U chassis system which houses a 2-CPU/56 core Intel Xeon chipset.
The sensors which contain the sub-string “POWER_IN” are the sensors related to all the power supplies for the chassis and all the electronics the chassis contains. Notice there are two sensors related to two separate power supplies. It’s not unusual for a chassis to contain several independent power supplies. The total power to the system is simply the sum of all individual “POWER_IN” sensors.
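The summing step can be sketched in a few lines of Python. This is only an illustration: the pipe-separated listing layout, the sensor names, and the readings below are hypothetical stand-ins, not the actual output of our Advantech chassis, and `total_power_in_watts` is a name we made up for this example.

```python
# Hypothetical sensor listing in a "name | value | unit | status" layout.
# Sensor names and readings are invented for illustration only.
SAMPLE_LISTING = """\
CPU0_TEMP     | 41.000   | degrees C | ok
PSU0_POWER_IN | 152.000  | Watts     | ok
PSU1_POWER_IN | 148.000  | Watts     | ok
FAN0_SPEED    | 5400.000 | RPM       | ok
"""


def total_power_in_watts(listing):
    """Sum the readings of every sensor whose name contains 'POWER_IN'."""
    total = 0.0
    for line in listing.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or "POWER_IN" not in fields[0]:
            continue
        try:
            total += float(fields[1])
        except ValueError:
            # Assumption: a sensor with no numeric reading is skipped.
            continue
    return total


print(total_power_in_watts(SAMPLE_LISTING))
```

Matching on the sub-string rather than on exact sensor names means the same function works whether a chassis has one power supply or several.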
We leverage these IPMI power sensors to assess the total power consumption of the participant algorithms in the T3 track of the NeurIPS Billion-Scale Approximate Nearest Neighbor Search Challenge.
So How Much Power Do Different Server Workloads Consume?
The graphic below shows the power consumption of various workloads, all running for the same period of time (2 minutes). The workloads were run on Ubuntu Linux in a system with a 2-CPU/56 core Xeon chipset and a V100 GPU, which is one system we will be leveraging for the T3 track. We sampled all the IPMI POWER_IN sensors at 1 second intervals and computed the total power consumption, reported in kilowatt-seconds. We took 3 different readings for each application to capture any significant variance (the red line is the median value). Details of the workloads are outlined below the graphic.
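To make the arithmetic concrete, here is a minimal Python sketch of one way to turn 1 second power samples into kilowatt-seconds and take the median of 3 runs. It assumes each sample holds constant for its whole interval, which may differ from the method we describe in Part 2, and the wattage values are invented rather than taken from the graphic.

```python
from statistics import median


def energy_kilowatt_seconds(samples_watts, interval_s=1.0):
    """Approximate energy by treating each sample as constant over its interval."""
    return sum(samples_watts) * interval_s / 1000.0


# Three hypothetical 2-minute runs (120 one-second samples each) of one workload.
runs = [
    [300.0] * 120,
    [310.0] * 120,
    [305.0] * 120,
]

energies = [energy_kilowatt_seconds(run) for run in runs]
print(energies)          # one kilowatt-second total per run
print(median(energies))  # the value a median line would mark
```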
Definition of the workloads A-F:
- For A, we ran no special workloads. So this is the power consumption of the system when it’s idle, for 2 minutes.
- For B, we ran the Linux utility called “stress” with the parameters “--cpu 1”, which keeps 1 of the 56 CPU cores busy at 100%, for 2 minutes.
- For C, we ran the Linux utility called “stress” with the parameters “--cpu 14”, which keeps 14 of the 56 CPU cores busy at 100%, for 2 minutes.
- For D, we ran the Linux utility called “stress” with the parameters “--cpu 56”, which keeps all 56 CPU cores busy at 100%, for 2 minutes.
- For E, we ran the utility called “gpu-burn”, which keeps all cores of the V100 GPU busy, for 2 minutes. None of the CPU cores were stressed during this time.
- For F, we ran the Linux “stress” utility and the “gpu-burn” utility simultaneously for 2 minutes. For stress, we used the parameters “--cpu 56 --vm 56 --io 56 --hdd 56”, which keeps all 56 cores busy at 100% and applies load to memory, I/O, and the hard drive. “gpu-burn” keeps all cores of the V100 GPU busy at 100%.
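For readers who want to try the CPU-only workloads B, C, and D on their own machine, here is a minimal Python sketch that builds the corresponding “stress” command lines without running them. The use of the “--timeout” option to stop after 2 minutes is our assumption for this example; the runs above may have been stopped another way. The helper name `stress_command` is also made up for illustration.

```python
import shlex

DURATION_S = 120  # 2 minutes, matching the runs described above

# CPU worker counts for workloads B, C and D.
CPU_WORKERS = {"B": 1, "C": 14, "D": 56}


def stress_command(cpu_workers, duration_s=DURATION_S):
    """Build (but do not run) a `stress` command line as an argument list."""
    # Assumption: --timeout ends the run; the source does not say how runs were stopped.
    return ["stress", "--cpu", str(cpu_workers), "--timeout", f"{duration_s}s"]


for label, workers in CPU_WORKERS.items():
    print(label, shlex.join(stress_command(workers)))
```

Each argument list can be passed to `subprocess.run` on a system where “stress” is installed.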
In Part 2 of this blog we get into the following details:
- How we convert the time series information from the IPMI power sensors to total power consumption and the software tools we use to do this.
- Why there is a small variance in the computed power consumption for the same workload and how we deal with that for the competition.
- How we use power consumption to rank participants’ algorithms in the T3 track’s power leaderboard.