ARM Processor

SSE, OOE, Pipeline, Von Neumann, Harvard, Cortex, SoC, Affinity

Vince

Published in

vswe

15 min read · Mar 3, 2021

SSE vs OOE

The Arm architecture describes instructions as following a Simple Sequential Execution (SSE) model, which means the processor fetches, decodes, and executes one instruction at a time, in the order in which the instructions appear in memory.
In practice, however, modern processors have pipelines that can execute multiple instructions at once, and may do so out of order. This is called out-of-order execution (OOE or O3).

Source: This diagram shows an example pipeline for an Arm Cortex processor.

Target (minimize CPU Time)

CPU Time = clock cycle time * clock cycles
clock cycle time = 1 / clock rate
clock cycles = CPI * IC
CPI: cycles per instruction = total program execution cycles / IC
IC: instruction count
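
As a quick check of the formula above, here is a minimal sketch that computes CPU time from IC, CPI, and clock rate. The function name and the sample numbers are made up for illustration; they do not describe any real processor.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = (CPI * IC) * (1 / clock rate)."""
    clock_cycles = instruction_count * cpi   # total cycles = CPI * IC
    # Dividing by the clock rate is the same as multiplying by the cycle time.
    return clock_cycles / clock_rate_hz


# Hypothetical program: 1e9 instructions, CPI of 2, 1 GHz clock.
base = cpu_time_seconds(1_000_000_000, 2.0, 1_000_000_000)
# Halving CPI or doubling the clock rate both halve the CPU time.
lower_cpi = cpu_time_seconds(1_000_000_000, 1.0, 1_000_000_000)
faster_clock = cpu_time_seconds(1_000_000_000, 2.0, 2_000_000_000)
print(base, lower_cpi, faster_clock)
```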

Reduce the clock cycle time: raise the clock frequency
Reduce the number of clock cycles (CPI * IC)
=> Reduce CPI: a deeper pipeline / higher pipeline utilization
=> Reduce IC: a better instruction set (RISC/CISC) / compiler / program

A deeper pipeline is not always better, however, because it can introduce hazards. Another way to reduce CPI is ILP (instruction-level parallelism), which is where superscalar and out-of-order execution come from.

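Superscalar: raise the fetch IPC (instructions per cycle), the number of macro-ops (mops) decoded per cycle, and the micro-op (uop) bandwidth of dispatch/issue together, and thereby reduce CPI. Note that widening only one of these stages does not help, because the bottleneck stays at the other stages.
Out-of-order execution: the hardware keeps a sliding window of consecutive instructions, called the instruction window, and executes the instructions in it out of order in the Execute stage.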
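
The bottleneck remark above can be sketched in a few lines. This is a deliberately simplified model, assuming each instruction turns into exactly one macro-op and one micro-op; the function name and the widths are hypothetical.

```python
def sustained_ipc(fetch_width, decode_width, issue_width):
    """Upper bound on instructions per cycle for a simple front end.

    Simplifying assumption: one instruction = one macro-op = one micro-op,
    so the narrowest stage limits the whole pipeline.
    """
    return min(fetch_width, decode_width, issue_width)


print(sustained_ipc(4, 1, 1))  # only fetch was widened
print(sustained_ipc(4, 4, 4))  # every stage was widened
```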

Pipeline

A pipeline splits the processing of an instruction into several distinct steps (stages).

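ARM7: 3-Stage Pipeline

The ARM7 uses a three-stage pipeline: at any moment, one instruction is being fetched, the previous one is being decoded, and the one before that is being executed.

The PC (R15), the instruction pointer, holds the address of the next instruction to be fetched, not the one in Decode or Execute. Each ARM instruction is 4 bytes, so in ARM state PC = address of the executing instruction + 8 bytes. In Thumb state it is + 4 bytes.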
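
The PC offset can be written down as a tiny helper. This is only a sketch of the arithmetic described above, assuming 16-bit Thumb instructions (2 bytes each); the function name and the example address are hypothetical.

```python
def pc_read_value(executing_address, thumb=False):
    """Value seen when reading PC (R15) on a 3-stage pipeline core."""
    instruction_size = 2 if thumb else 4   # bytes per instruction
    # The PC points at the instruction being fetched, two instructions ahead
    # of the one currently in Execute.
    return executing_address + 2 * instruction_size


print(hex(pc_read_value(0x8000)))              # ARM state
print(hex(pc_read_value(0x8000, thumb=True)))  # Thumb state
```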

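IF (Instruction Fetch): loads one instruction from memory
ID (Instruction Decode): identifies the instruction that is about to be executed
EX (Execute): processes the instruction and writes the result back to a register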
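
To see how the three stages overlap, the following sketch prints which instruction occupies each stage in each cycle. It models an ideal pipeline with no stalls or flushes, and the four-instruction program is just a placeholder.

```python
STAGES = ["IF", "ID", "EX"]


def pipeline_table(instructions, stages=STAGES):
    """Return {cycle: {stage: instruction}} for an ideal pipeline without stalls."""
    table = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(stages):
            cycle = i + s + 1   # instruction i is in stage s during cycle i + s + 1
            table.setdefault(cycle, {})[stage] = instr
    return table


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to drain n instructions through an ideal pipeline."""
    return n_stages + n_instructions - 1


program = ["ADD", "SUB", "MOV", "LDR"]
table = pipeline_table(program)
for cycle in sorted(table):
    row = "  ".join(f"{stage}={table[cycle].get(stage, '-'):<4}" for stage in STAGES)
    print(f"cycle {cycle}: {row}")
```

Once the pipeline is full (cycle 3 here), one instruction completes per cycle, which is why the ideal CPI approaches 1.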

Cortex-M4 (Armv7E-M) — 3-stage + branch speculation

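5-Stage Pipeline

The ARM9 uses a five-stage pipeline, adding the LS1 (buffer/data) and LS2 (write-back) stages. The memory access only does real work for LDR (memory to register) and STR (register to memory); other instructions do not need it and simply pass through. Splitting the work this way reduces the amount of work done per cycle, balances the stages, and allows a higher clock rate.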

LS1 (Buffer/Data): reads or writes data memory if needed; otherwise the instruction is buffered for one cycle so that all instructions go through the same pipeline.
LS2 (Write-back): writes the result of the operation, or the loaded value, back to a register.
Hazards: One of the source registers being read in decode might be the same as the destination register being written in writeback. On silicon, many implementations of memory cells will not operate correctly when read and written at the same time.
ARM9 cores were released from 1998 to 2006 and are no longer recommended for new IC designs; ARM Cortex-A, ARM Cortex-M, and ARM Cortex-R cores are preferred instead.

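6-Stage Pipeline (Branch Prediction)

When the code reaches a branch, the instructions already in the pipeline may become invalid and have to be fetched again. From ARM10 onward, branch prediction is supported, which reduces the chance that the pipeline contents are invalidated and flushed.

Without branch prediction, the processor has to wait until the branch instruction has passed the execute stage of the pipeline. This is called a pipeline stall or bubble, and it costs performance.

Modern microprocessors tend to use very long pipelines, so the penalty for a misprediction is larger. The longer the pipeline, the better the branch prediction has to be.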
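
A rough way to quantify this is to add the expected flush cost to the base CPI. The sketch below uses a common textbook approximation, not a model of any specific Arm core; the branch fraction, misprediction rate, and penalties are hypothetical numbers.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Base CPI plus the average cycles lost to mispredicted branches.

    branch_fraction: share of instructions that are branches
    mispredict_rate: share of branches that are predicted wrongly
    flush_penalty:   cycles lost per misprediction (grows with pipeline depth)
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Same program and predictor, short pipeline vs. long pipeline.
shallow = effective_cpi(1.0, 0.20, 0.15, flush_penalty=3)
deep = effective_cpi(1.0, 0.20, 0.15, flush_penalty=13)
print(f"shallow: {shallow:.2f}, deep: {deep:.2f}")
```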

---------- ---------- ---------- ---------- ---------- ---------- 
|          |  ARM &   |          | AddrCalc | Multiply |          |
|  Branch  |  Thumb   | Register |----------|  Adder   |          |
| Predict  |  Decode  |   Read   | Multiply |__________| Register |
|----------|----------|----------|----------|          |  Write   |
|          | Co-proc  |  Final   |   ALU/   |  Memory  |          |
|  Fetch   |  Issue   |  Decode  |  Shift   |  Access  |          |
 ---------- ---------- ---------- ---------- ---------- ---------- 
   Fetch      Issue      Decode    Execute     Memory     Write

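8-Stage Pipeline (Scalar Architecture)

The ARM11 uses a scalar pipeline. From the Issue stage onward, the ALU (arithmetic logic unit), MAC, and load/store units each have their own pipeline.

The Issue stage dispatches each instruction to its execution path and completes in one cycle.
The first Fetch stage performs dynamic branch prediction, which predicts from the history of earlier branch outcomes.

The second Fetch stage performs static branch prediction, which handles branches whose addresses are not covered by the first stage. Static prediction does not use run-time history, so all decisions are made at compile time.
On average, branch prediction is about 85% accurate.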
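
To make "predicting from history" concrete, here is a minimal sketch of a 2-bit saturating-counter predictor, a classic textbook scheme. It is an illustration only and is not claimed to be the exact ARM11 design; the class name, the branch address, and the loop pattern are hypothetical.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    """

    def __init__(self):
        self.counters = {}  # branch address -> counter state

    def predict(self, address):
        # Unknown branches start in state 1 (weakly not taken).
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        state = self.counters.get(address, 1)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[address] = state


def accuracy(predictor, address, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    hits = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            hits += 1
        predictor.update(address, taken)
    return hits / len(outcomes)


# A loop branch: taken 9 times, then not taken once, repeated 10 times.
outcomes = ([True] * 9 + [False]) * 10
acc = accuracy(TwoBitPredictor(), 0x8000, outcomes)
print(f"accuracy: {acc:.0%}")
```

The second bit is what keeps a single loop exit from flipping the prediction for the next run of the loop.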

10/13-Stage Pipeline (ARM Cortex-A Superscalar Architecture)

ARM Cortex-A introduced a superscalar pipeline. Superscalar means a pipeline that issues multiple instructions per cycle. For example, the Cortex-A8 has a 13-stage integer pipeline and a 10-stage NEON pipeline, and it is a dual-issue in-order design: it can issue two integer instructions in the same cycle and process them in parallel.

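Dynamic Instruction Scheduling: Register Renaming & Out-of-Order Execution

Instructions are reordered to avoid hazards and hide latency. This can be done statically by the compiler, or dynamically by the CPU, which looks ahead several instructions and starts executing every one that has no outstanding dependency. The dynamic approach is called out-of-order execution (OOO).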

Fetch many instructions into instruction window (Cortex-X1 has 224 entries)
Rename registers to avoid false deps
Execute instructions as soon as possible
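
Register renaming can be illustrated with a small sketch. It assumes a simplified three-operand instruction format and an unlimited supply of physical registers; the register names (`r0`.., `p0`..) and the function are hypothetical, not a description of any particular core.

```python
def rename(instructions, num_arch_regs=4):
    """Rename destination registers to fresh physical registers.

    Each instruction is (dest, src1, src2) using architectural names like "r1".
    """
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}  # arch -> physical
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, so true (RAW) dependencies are kept.
        s1, s2 = mapping[src1], mapping[src2]
        # Every write gets a new physical register, which removes the false
        # (WAR and WAW) dependencies.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((mapping[dest], s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r0", "r1", "r2"),  # r0 = r1 op r2   (RAW on r1: must wait for the 1st)
    ("r1", "r3", "r3"),  # r1 = r3 op r3   (WAW with the 1st, WAR with the 2nd)
]
renamed = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, the third instruction writes a different physical register than the first, so it no longer has to wait for the first two.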

Throughput (Typical MIPS @ MHz)

1994 ARM7: 40 MIPS at 45 MHz, 0.889 IPC
2002 ARM11: 515 MIPS at 412 MHz, 1.25 IPC
2005 ARM Cortex-A8: 2000 MIPS at 1.0 GHz, 2.0 IPC
2016 ARM Cortex-A73 (4-core): 71120 MIPS at 2.8 GHz, 25.4 IPC, 6.35 IPC/core
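
The IPC column above is simply MIPS divided by MHz (and by the core count for the per-core figure). The sketch below recomputes it from the numbers in the list; the function name is made up, and the result is a throughput ratio for this benchmark, not a measured hardware IPC.

```python
def ipc(mips, mhz, cores=1):
    """Instructions per cycle per core = MIPS / MHz / cores."""
    return mips / mhz / cores


# (name, MIPS, MHz, cores) taken from the list above.
rows = [
    ("ARM7", 40, 45, 1),
    ("ARM11", 515, 412, 1),
    ("Cortex-A8", 2000, 1000, 1),
    ("Cortex-A73", 71120, 2800, 4),
]
for name, mips, mhz, cores in rows:
    print(f"{name}: {ipc(mips, mhz, cores):.3f} IPC/core")
```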

Memory Architecture

Von Neumann Architecture

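Program instructions and data are stored together. The storage is separate from the CPU, and both instructions and data travel over the same bus.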

Harvard Architecture

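Program instructions and data are stored separately. The advantage is that data and instruction accesses can happen at the same time, and the two can have different widths.
Harvard-architecture microprocessors usually execute more efficiently, because the next instruction can be prefetched while the current one is executing.
Arm's ARM9, ARM10, and ARM11 use the Harvard architecture, as do most DSPs.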

Modified Harvard Architecture

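Most modern processors use this architecture. The caches inside the CPU separate instructions from data, so that part follows the Harvard architecture, while external RAM does not distinguish between instructions and data, so that part follows the Von Neumann architecture. Most DSPs have no cache, so they use the Harvard architecture directly.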

Cortex Processor Family (A > R > M)

https://zh.wikipedia.org/wiki/ARM%E6%9E%B6%E6%A7%8B

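Early cores were named ARM1 to ARM11; later ones became Cortex-A, -R, and -M.
Cortex-Application (A78): application processor cores for operating systems based on virtual memory and for user applications. They have an MMU (memory management unit) and can run a general-purpose OS such as Linux/Android, a multi-user, multi-process time-sharing OS.
Cortex-Real-time (R52): high-performance cores for real-time applications. They have no MMU, so software sees only physical addresses; they cannot run Linux, only an RTOS. An RTOS runs a ready process with minimal delay and avoids long latencies, which ensures that high-priority tasks run first.
Cortex-Microcontroller (CM4): microcontroller cores for all kinds of embedded applications. Performance is not the priority; low power is. Compared with R they are simpler, with a shorter pipeline (3 stages vs. 8) and fewer execution units.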

Cortex Processor Spec

List of ARM microarchitectures

Cortex-A76
Application profile, AArch32 (non-privileged level or EL0 only) and AArch64, 1–4 SMP cores, TrustZone, NEON advanced SIMD, VFPv4, hardware virtualization, 4-width decode superscalar, 8-way issue, 13 stage pipeline, deeply out-of-order pipeline
64 / 64 KB L1, 256−512 KB L2 per core, 512 KB−4 MB L3 shared

4-width decode superscalar: in one cycle, decode can turn 4 instructions into 4 macro-ops (Mops)
8-way issue: in one cycle, issue can send out 8 uops
13 stage pipeline

SoC Examples

https://en.wikipedia.org/wiki/ARM_Cortex-X1

2020 Q4 — Qualcomm Snapdragon 888 (samsung 5nm)

2.84 GHz single-core ARM Cortex-X1 (customized)
2.4 GHz triple-core ARM Cortex-A78
1.8 GHz quad-core ARM Cortex-A55
Shipped in: Xiaomi Mi 11

Introducing the Arm Cortex-X Custom Program

2017 Q1 — Helio X30 MT6799 (TSMC 10nm), W5y

2.5 GHz dual-core ARM Cortex-A73
2.2 GHz quad-core ARM Cortex-A53
1.9 GHz quad-core ARM Cortex-A35
Shipped in: Meizu Pro 7 Plus
Losing to Qualcomm at the high end: comparing the MediaTek X30 and the Snapdragon 835

2015 Q2 — Helio X20 MT6797 (TSMC 20nm), E5t

2.3 GHz dual-core ARM Cortex-A72
2 GHz quad-core ARM Cortex-A53
1.4 GHz quad-core ARM Cortex-A53
An integrated sensor hub featuring an embedded ARM Cortex-M4 processor (CM4) operates on an isolated, low-power domain to support diverse always-on applications, such as MP3 playback and voice-activated apps.
Shipped in: LeEco Le 2, Sharp Z2, Redmi Note 4, Meizu MX6

Processor Affinity (taskset COREMASK PID)

# Show the CPU affinity of PID 1 (the mask is hexadecimal; 3 = CPUs 0 and 1)
$ taskset -p 1
pid 1's current affinity mask: 3
$ taskset -c -p 1
pid 1's current affinity list: 0,1

# Run a program on specific CPU cores (mask 0x2 = CPU 1)
$ taskset 0x2 top
$ taskset -p $(ps -a | grep top | awk '{print $1}')
pid 3690's current affinity mask: 2

# Change the affinity of an existing process
$ taskset -p 0x2 $(ps -a | grep top | awk '{print $1}')
pid 3733's current affinity mask: 3
pid 3733's new affinity mask: 2
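
The relation between the mask and the CPU list that taskset prints can be sketched as follows. The two helper functions are hypothetical names for illustration; `os.sched_getaffinity` is part of the Python standard library but is only available on some platforms (such as Linux), so the call is guarded.

```python
import os


def mask_to_cpus(mask):
    """Convert a taskset-style bitmask (e.g. 0x3) to a sorted CPU list."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def cpus_to_mask(cpus):
    """Convert an iterable of CPU numbers to a bitmask."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


print(mask_to_cpus(0x3))       # CPUs selected by mask 3
print(hex(cpus_to_mask([1])))  # mask that selects only CPU 1

# Query the affinity of the calling process (pid 0 means "this process").
if hasattr(os, "sched_getaffinity"):
    current = os.sched_getaffinity(0)
    print("current affinity mask:", hex(cpus_to_mask(current)))
```

On platforms that provide it, `os.sched_setaffinity(pid, cpus)` is the counterpart of `taskset -p MASK PID`.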

Scheduling Priority reference (nice -n NUM COMMAND)

Niceness ranges from -20 (highest priority) to 19 (lowest priority).
nice -n NUM COMMAND

# Run top with a niceness of 10
$ nice -n 10 top
$ ps -a -o pid,ni,cmd
2048  10 top
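
The same idea can be tried from a script. The sketch below is POSIX-only and assumes the `python` interpreter itself is used as the child command; `run_with_niceness` is a hypothetical helper, while `os.nice` and `subprocess.run` are standard-library calls.

```python
import os
import subprocess
import sys


def run_with_niceness(increment, argv):
    """Run argv in a child whose niceness is raised by `increment` (POSIX only)."""
    result = subprocess.run(
        argv,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# The child prints its own niceness; os.nice(0) returns the current value
# without changing it.
child = [sys.executable, "-c", "import os; print(os.nice(0))"]
print("parent niceness:", os.nice(0))
print("child niceness:", run_with_niceness(10, child).strip())
```

Raising niceness (lowering priority) works for any user; lowering it below the current value normally requires elevated privileges.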