Measuring Intel Hyper-Thread Overhead
Multi-core processors can run multiple software tasks concurrently: a single physical processor simultaneously executes instructions from multiple processes or threads. A core is the part of the processor that executes application instructions.
Each core is shared by hardware threads (called Hyper-Threads). When two hyper-threads are active in the same core, compute-intensive tasks run slower than a single thread using the core exclusively.
Per Intel docs:
“Intel® HT technology is a great performance feature that can boost performance by up to 30%.”
HT does not double core throughput; it improves it by up to 30%. Thus two compute-intensive tasks sharing a core each run at roughly 60–70% of exclusive-core performance (30–40% slower).
Traditional Linux tools (vmstat, mpstat, ...) do not show true core utilization, which makes it hard to estimate the cost of core sharing. One can measure hyper-thread overhead by disabling Hyper-Threading, selectively binding tasks to available cores, or comparing CPI (cycles per instruction) or IPC (instructions per cycle) metrics collected via Linux perf.
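As a quick sketch, perf stat can report cycles, instructions, and the derived IPC for a running task (the PID and the 10-second window below are placeholders):
# Count cycles and instructions for an existing process for 10 seconds;
# perf prints the derived "insn per cycle" (IPC); CPI = 1/IPC
perf stat -e cycles,instructions -p <pid> sleep 10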
Software multithreading (MT) refers to the execution of multiple tasks within a single process. A multi-core processor does the same in hardware by executing multiple software threads simultaneously across multiple cores and hardware threads (Hyper-Threads, or HT) within a single physical processor (socket).
Multi-core processors are ideal for throughput computing. Concurrency in the software is required to gain significant throughput by utilizing all available hardware threads and cores in the physical cpu.
The Linux scheduler sees each hardware thread in a core as a separate cpu on which a task can be scheduled.
Caches in the physical processor are also shared by the hardware threads.
The Linux scheduler uses a hierarchical relationship when scheduling a process/task to a cpu:
Hyper-Threads → Core → Physical CPU (Socket)
When there is an available core in the physical cpu, a new task is assigned to that core. Once all cores are occupied, cores are shared (two HT per core).
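One way to view this hierarchy (assuming a reasonably recent util-linux) is lscpu in extended mode, which prints each vcpu with the core and socket it belongs to:
# Each row maps a vcpu (CPU column) to its core and socket
lscpu --extended=CPU,CORE,SOCKET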
Why Multi-Core
Before multi-core, the processor industry was primarily focused on increasing cpu clock rates and deepening pipelines to improve serial performance. This required more logic and silicon space, and resulted in higher power requirements and heat dissipation.
Multi-core architecture took a different approach: it traded serial performance for higher throughput. Instead of implementing complicated logic and deep pipelines, it duplicated compute logic by placing multiple dedicated processing units on the die instead of just one.
The end result is a simpler processor design with lower power draw and less heat dissipation, but massive throughput capabilities.
Multi-core processors are ideal when software is designed to run multiple tasks in parallel (multiple threads or processes) to use the full potential of the vast number of compute engines (cores) packed in a CPU socket.
As the gap between processor and memory speeds widens, performance gains from ramping up the processor clock show diminishing returns, with the processor stalling while waiting for memory.
Studies have shown that processors in most real-world server deployments spend as much as 80% of their time stalled waiting on memory or IO; the high clock rates and deep pipelines of traditional processors are thus wasted stalling on cache refills from main memory.
Hardware threads in a multi-core processor reduce the overhead of these frequent cache stalls and achieve higher memory bandwidth by automatically parking stalled hardware threads and switching to the next ready-to-run hardware thread, leading to efficient processor utilization. The core can fetch instructions from both threads within the same time slice, which reduces cpu stalls and improves efficiency and throughput.
Cores and Hyper-Thread (HT) Bindings
Linux counts each HT as a vcpu. The hierarchical relationship seen by Linux running on a cloud instance may not be the same as on the physical system.
$ egrep "(( id|processo).*:|^ *$)" /proc/cpuinfo
shows the relationship between HTs and cores, where:
Socket: Physical cpu on a motherboard
Cores: Number of cores built into physical cpu or socket
Core ID: Each core is assigned an ID
HT: Each core is shared by 2 HT
To see the sibling HTs (vcpus) sharing each core, use the script below:
#!/bin/bash
# For each vcpu listed in /proc/cpuinfo, print its HT siblings
for num in `cat /proc/cpuinfo|grep processor|awk '{print $3}'`
do
echo sibling of cpu$num
cat /sys/devices/system/cpu/cpu$num/topology/thread_siblings_list
done
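On a hypothetical 4-core/8-vcpu instance where vcpu N and vcpu N+4 share a core, the output would look like:
sibling of cpu0
0,4
sibling of cpu1
1,5
...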
How to Disable HT
One can disable HT in the BIOS. However, cloud instances do not allow BIOS access. There are other ways to disable HT:
- Update the grub kernel command line with maxcpus=<#ofcores>, so that only as many vcpus as there are cores are brought online at boot
- Use a script to disable HT on a live system, as shown below:
#!/bin/sh
if [ "$(id -u)" != "0" ]; then
echo "This script must be run as root. You should type:sudo -s" and then run the script 1>&2
exit 1
fi
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list |
sort -u |
while read sibs
do
case "$sibs" in
*,*)
# A comma in the sibling list means this core still has both HTs online;
# keep the first sibling online and take the rest offline
oldIFS="$IFS"
IFS=",$IFS"
set $sibs
IFS="$oldIFS"
shift
while [ "$1" ]
do
echo Disabling CPU $1 ..
echo 0 > /sys/devices/system/cpu/cpu$1/online
shift
done
;;
*)
;;
esac
done
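Once the script has run, only one vcpu per core should remain online, which can be verified with lscpu:
# Off-lined siblings show up in the "Off-line CPU(s) list" row
lscpu | egrep 'On-line|Off-line'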
Enable (online) all cpus by using the script below:
#!/bin/bash
# Total number of vcpus seen by Linux
NCPUS=`lscpu|grep '^CPU(s)'|awk '{print $2}'`
# cpu0 cannot be taken offline, so start from cpu1
NUM=1
for (( cpuid=$NUM; cpuid<$NCPUS; cpuid++ ))
do
echo enabling cpu$cpuid
echo 1 > /sys/devices/system/cpu/cpu$cpuid/online
cat /sys/devices/system/cpu/cpu$cpuid/online
done
When To Disable HT
Multi-Core processors are designed for throughput computing.
Throughput computing is about performing multiple tasks in parallel by spreading the work across many compute engines (HT and cores).
Each task may take a little longer due to the slower clock rate and the shared cpu resources used by HT, but more tasks will be completed per unit of time, and that improves application throughput.
In general, when HT is enabled, some cpu core resources are statically partitioned between the two hardware threads and others are shared to run the extra thread in the core. How much HT hurts performance depends on application design:
- Compute-intensive applications with a small working set that fits into cpu caches are impacted the most when HT is enabled.
- Lack of concurrency in an application may result in higher contention for shared resources; more active processors means more contention.
Higher contention due to lack of concurrency reduces useful execution: processors either sit idle or do no productive work because they are blocking on locks (context switching) or spinning on locks (busy-waiting).
One should also take into account additional factors such as:
- Proper application sizing (threads). The performance penalty from lack of concurrency in an application is aggravated with more cpus.
- Additional cpus do not help if the hot code (frequently run functions) runs sequentially.
- A heavily memory-intensive application that can already saturate the full memory controller bandwidth may not see a performance gain when HT is enabled.
- False sharing can happen when two processors modify data residing on the same cache line; it commonly occurs with global and static variables. This results in inefficient use of cpu caches and may cause the application to run at memory speed due to frequent load/store operations.
- NUMA latencies. Verify whether the system is NUMA (Non-Uniform Memory Access). If not planned correctly, an application running on a NUMA system may experience higher memory latencies. An application should use the libnuma library or the numactl utility to hint to the kernel how its memory allocations should be handled, as sketched below.
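As a minimal sketch (./app stands in for your application binary), numactl can both inspect the topology and pin an application's cpus and memory to one node:
# Show NUMA nodes with their cpus and memory sizes
numactl --hardware
# Confine the app's cpus and memory allocations to node 0
numactl --cpunodebind=0 --membind=0 ./app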
Measuring HT Overhead
To test HT overhead, one can use:
- the Linux taskset utility to set task affinity, or
- Linux containers or Docker, constrained to a subset of cpus.
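For example (cpu numbers and the job name are illustrative; vcpus 0 and 4 are assumed to be HT siblings), pinning a job to the two siblings of one core exercises core sharing, while pinning it to two distinct cores avoids it:
# Run a job constrained to the two HT siblings of one core
taskset -c 0,4 ./job
# The Docker equivalent: constrain the container to a cpu subset
docker run --cpuset-cpus="0,4" myimage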
Comparing performance data captured with and without HT will help quantify the performance gain or loss. Low cpu utilization is a sign of an application scaling problem due to insufficient software threads, serialized code, or lack of concurrency.
To estimate HT overhead, one should measure:
Core Utilization: CPU utilization may not be the best way to measure and compare HT overhead. Utilization measures how much cpu headroom is available. One may assume cpu utilization would be cut in half considering HT doubles the number of vcpus; utilizing all vcpus, however, does not translate into a 2x speedup.
Instead of cpu utilization, one should look at other metrics such as work done per unit time (RPS) and elapsed time (latency) to assess performance changes due to HT.
Linux tools like top, mpstat, and others do not offer clear insight into core utilization; all you get is the vcpu utilization.
One can use the script below to capture core utilization:
#!/bin/bash
# Highest "physical id" in /proc/cpuinfo (number of sockets - 1)
SOCKETS=`grep "physical id" /proc/cpuinfo|sort -ru|head -1|awk '{print $4}'`
sockets="$SOCKETS"
# Cores per socket
NCORES=`grep cores /proc/cpuinfo|sort -u|awk '{print $4}'`
ncores="$NCORES"
NUM=0
if [ $sockets -ne 0 ]; then
# Multi-socket system: total cores = number of sockets x cores per socket
NCORES=$((($sockets + 1) * $ncores))
fi
for (( core=$NUM; core<$NCORES; core++ ))
do
# Report both HT siblings of each core together
SIBLING=`cat /sys/devices/system/cpu/cpu$core/topology/thread_siblings_list`
echo Core $core Utilization: Threads:$SIBLING
mpstat -P $SIBLING 1 2
done
HT doubles the number of vcpus on which Linux can schedule tasks. Thus twice as many threads can run simultaneously.
Let’s assume a system with four cores, HT disabled, running four compute threads in parallel. If each thread computes 1 unit of work per second, four threads will compute 4 units of work per second (4 units/s). With HT enabled, we can now run 8 threads (sharing cores) in parallel. The expected gain is about 25%: 4 x 1.25 = 5 units/s, not 8 units/s. Because cores are shared, compute latency increases: 8 units / 5 units/s = 1.6 seconds, i.e., each thread now needs 1.6 s instead of 1 s to compute its unit of work. Thus HT improved overall throughput by 25%, but at the cost of higher latency.
Although it seems like response time should increase with HT, in practice it often does not, because more available cpus mean less context switching.
Core and Thread CPI: The cpu utilization metric does not offer reliable insight into how the cpu is actually being used.
CPI stands for Cycles Per Instruction: the average number of processor cycles consumed per instruction executed. CPI is indicative of the instruction-level parallelism in the code.
CPI can also be used to estimate memory fetch latency when a cache line is invalidated due to stale data found in cpu caches.
For example, an Intel processor core can execute up to 4 instructions per clock, which is equivalent to a CPI of 0.25. Due to cache misses and branch mispredictions, real-world applications have an average CPI closer to 1.0 or 2.0.
To capture Core CPI, disable HT and measure CPI. Since the core is dedicated to a single thread, this gives you the Core CPI. Now enable HT. Since two threads share the core, they may execute a different number of instructions and thus show different CPIs.
Let’s assume that over a sampling period, two threads sharing a core consumed 1 million core cycles. During that period, Thread-1 executed 750k instructions and Thread-2 executed 500k. In this case: Thread-1 CPI: 1.33, Thread-2 CPI: 2.0, and Core CPI: 0.80 (1 million cycles / (750k + 500k) instructions).
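Where PMU access is available, a sketch of this measurement with perf (vcpus 0 and 4 taken as a hypothetical sibling pair, 10-second window):
# Count cycles and instructions on each HT sibling; Thread CPI = cycles/instructions
perf stat -e cycles,instructions -C 0 sleep 10
perf stat -e cycles,instructions -C 4 sleep 10
# Aggregated over both siblings; Core CPI = total cycles / total instructions
perf stat -e cycles,instructions -C 0,4 sleep 10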
Note: CPI data is available through the Intel PMU (Performance Monitoring Unit) and can be extracted using the Linux perf tool and the Intel pcm utility. Unfortunately, access to PMU registers is restricted on Amazon instances. We are working with Amazon to provide these capabilities.
To start a compute-bound job:
#!/bin/bash
# Each iteration pipes ~2 GB of zeros through gzip, keeping one vcpu busy
for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
Change 1..2 to 1..4 to start four processes, or use the /proc data (as shown earlier) to start the compute-bound job on selected cores and vcpus, as sketched below.
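A minimal sketch of the pinned variant (vcpus 0 and 4 again assumed to be the siblings of core 0), which deliberately makes the two jobs share one core:
#!/bin/bash
# Pin one gzip job to each HT sibling of core 0 (vcpu numbers are illustrative)
for cpu in 0 4
do
taskset -c $cpu sh -c 'dd if=/dev/zero bs=1M count=2070 2>/dev/null | gzip -c > /dev/null' &
done
wait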
Originally published at http://techblog.cloudperf.net.