SNAPDRAGON 8CX PROCESSOR — AN OVERVIEW

Published in somosfit · 18 min read · Jul 12, 2023

1. INTRODUCTION

Snapdragon is a family of Qualcomm processors widely used in mobile devices from a variety of manufacturers (Asus, Xiaomi, Motorola, etc.). The processors are divided into families, or lines, according to their features and capabilities.

The Snapdragon 8cx is a SoC (System on Chip) that integrates an Adreno 680 GPU, an octa-core Kryo 495 CPU and an X24 LTE modem.

It supports Quick Charge 4+ as well as Qualcomm’s Aqstic and aptX audio technologies.

A SoC can be advantageous because it integrates multiple subsystems, each customized for a particular application domain and specialized for its task. This may yield performance and power improvements beyond what would be possible with a homogeneous CPU-based computing platform.

Some key features of the Snapdragon 8cx are described below.

2. SNAPDRAGON 200–800 PROCESSOR LINES

2.1 Main characteristics of processor lines

Snapdragon processors are divided into distinct lines according to their application and release date, each with specific features, as depicted in the following figure:

Figure 1 — Qualcomm categorization for their processors

And some key features of Snapdragon S600-S800 family models can be viewed in the following table:

Table 1 — Snapdragon S600-S800 models specifications

Regarding its architectural design, the Snapdragon has the subsystems depicted in Figure 2 below:

Figure 2 — Block Diagram of the Snapdragon 800 SOC

Figure 3 — Die size and transistor count comparison

When considering the Snapdragon 8cx, some of its key features can be described as follows:

It was announced in December 2018, to ship in devices in 2019.
It includes 4 Cortex-A76-based big cores and 4 Cortex-A55-based LITTLE cores.
It is manufactured on TSMC 7 nm technology.
Die size of ~112 mm², transistor count of ~10 billion.
The Snapdragon 8cx is Qualcomm’s first processor designed exclusively for Windows 10 devices.
The “c” stands for “Compute” and the “x” for “eXtreme”.
Sustained power consumption of 7 W, which allows fan-less systems with multi-day battery life.

The Snapdragon 8cx contains several functional units, each providing specific functionality on the chip, as depicted in the figure:

Figure 4 — Main functional units on the Snapdragon 8cx

The following sections present the main components shown in Figure 4.

2.2 Adreno 680

Adreno is a series of Graphics Processing Units (GPUs) developed by Qualcomm and used in many of their SoCs. The name is an anagram of Radeon, AMD’s graphics card brand. Adreno started as Qualcomm’s in-house brand of graphics technologies used in their mobile chipset products.

Early Adreno models included the Adreno 100 and 120, which had 2D graphics acceleration and limited multimedia capabilities. At the time, 3D graphics on mobile platforms were commonly handled by software-based rendering engines, which limited their performance. Adreno 200, released in 2008, was integrated into the first Snapdragon SoC. In 2009, Qualcomm bought Imageon (previously ATI Imageon) from ATI (later AMD), a series of media co-processors and mobile chipsets used in mobile phones and PDAs, in order to add hardware-accelerated 3D capabilities to their mobile products.

The Qualcomm Adreno 680 is the integrated GPU of the Qualcomm Snapdragon 8cx for Windows laptops. According to Qualcomm, it is 2x faster than the previous Adreno 630 in the Snapdragon 850, with 60% better efficiency thanks to the 7 nm process.

The performance of the Adreno 680 should be similar to Intel UHD Graphics 620 (8th generation Core i5) when running native ARM64 compiled Windows apps and games. When running emulated 32-bit games (64-bit emulation is not supported), performance is notably slower.

Compared to the Adreno 630 (Snapdragon 850), the Adreno 680 has 2x the transistors and 2x the memory bandwidth. It also supports the DirectX 12 API and dual 4K HDR external monitors.

2.3 Hexagon 690

Qualcomm developed the Hexagon Digital Signal Processor (DSP) with both CPU and DSP functionality to support the deeply embedded processing needs of the mobile platform for both multimedia and modem functions. It is an advanced, variable instruction length, Very Long Instruction Word (VLIW) processor architecture with hardware multi-threading.

The Hexagon architecture and family of cores give Qualcomm Technologies a competitive advantage in performance and power efficiency for modem and multimedia applications such as image enhancement, computer vision and augmented reality, video, sensors, media acceleration and modem processing. Multimedia tasks are offloaded from the CPU to the DSP, which is designed to deliver superior energy efficiency compared to a mobile CPU.

Figure 5 — Hexagon DSP use cases

The Snapdragon 800 has two instances of Hexagon DSP:

The modem DSP (mDSP), customized for modem processing;
The application DSP (aDSP), for multimedia acceleration.

The mDSP is a closed subsystem and is programmed only within Qualcomm Technologies. The aDSP, however, is licensed for programming.

Figure 6 — Hexagon architecture

In 2011, Qualcomm started to allow customers to program the Hexagon DSP and thus exploit the power and performance benefits of offloading the ARM cores, which helped meet performance, power dissipation and concurrency requirements. Hexagon cores are optimized for both high performance and energy efficiency, and energy efficiency is often the most critical metric. Rather than pushing for higher frequencies, the cores are designed for high levels of work per cycle at a reduced clock speed, which avoids the power cost of high clock rates.

One design goal of multi-threading is for power to scale with the number of running threads. Hexagon cores use a semi-custom physical design methodology with customization oriented to power reduction.

All versions of the Hexagon DSP core are hardware multi-threaded to enable the concurrency needed in mobile applications. Implementations have evolved from simple interleaved multi-threading (IMT) to more advanced prioritized scheduling that fills as many execution slots as possible. The initial Hexagon V1 core supported six threads, while Hexagon V5 features three. To the programmer, these hardware threads can be considered separate processor cores with shared memory, and they are programmed using conventional software threading.

The RTOS maps user software threads onto the processor’s hardware threads. These hardware threads share the entire memory hierarchy, including L1. Thus, it is beneficial for the software to employ threads that cooperate on shared data. To facilitate this, a very fast RTOS kernel has been designed for Hexagon. The RTOS globally schedules the highest priority runnable software threads and always directs interrupts to the lowest priority hardware thread. The instruction set originated and evolved assuming the existence of a multi-threaded implementation. The inherent latency tolerance afforded by multi-threading enabled ISA optimizations that would not otherwise be practical.
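The scheduling policy described above can be illustrated with a small sketch. This is a toy model, not Qualcomm's RTOS: the names `SoftThread`, `schedule` and `interrupt_target`, the convention that a larger number means higher priority, and the default of three hardware threads are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftThread:
    name: str
    priority: int          # assumption: larger value = higher priority
    runnable: bool = True


def schedule(threads, hw_threads=3):
    """Return the software threads that occupy the hardware threads:
    the highest-priority runnable ones, at most one per hardware thread."""
    runnable = [t for t in threads if t.runnable]
    return sorted(runnable, key=lambda t: t.priority, reverse=True)[:hw_threads]


def interrupt_target(running):
    """Pick the hardware thread that takes an interrupt: the one
    currently running the lowest-priority software thread."""
    return min(running, key=lambda t: t.priority)


threads = [
    SoftThread("audio", 9),
    SoftThread("video", 7),
    SoftThread("sensor", 5),
    SoftThread("logger", 1),
    SoftThread("blocked", 8, runnable=False),
]
running = schedule(threads)
print([t.name for t in running])       # ['audio', 'video', 'sensor']
print(interrupt_target(running).name)  # sensor
```

Sending the interrupt to the hardware thread with the least important work means the most important software threads keep running undisturbed.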

The ISA allows grouping of both independent and many forms of dependent instructions. As an example, the common load-compare-branch idiom can be expressed in a single Hexagon instruction packet. Such techniques enable extraction of high instruction parallelism even from irregular control-code applications.

Hexagon is a multi-threaded very long instruction word (VLIW) DSP. Multi-threading and VLIW are complementary technologies. Multi-threading hides pipeline latencies, which makes instruction latencies appear low. The perception of low instruction latencies allows the compiler to utilize the VLIW packets more effectively.

The Hexagon ISA is a hybrid DSP-CPU design that features a 4-issue VLIW comprising dual load/store slots and dual 64-bit vector execution slots. All instructions operate on a shared 32-entry per-thread register file. Vector operations use register pairs from the general register file. The ISA features a rich set of DSP arithmetic support including 16-bit and 32-bit fractional and complex data types, 32-bit floating-point and full 64-bit integer arithmetic support.

Figure 7 — Hexagon block diagram

The figure shows the Hexagon block diagram. The architecture features a four-wide very long instruction word (VLIW) with dual load/store and dual single-instruction multiple-data (SIMD) execution units and supports hardware multithreading.

The Hexagon processor features a unified byte-addressable memory. This memory has a single 32-bit virtual address space that holds both instructions and data. It operates in little-endian mode. A full-featured memory management unit (MMU) translates virtual addresses to physical addresses.

There are two sets of user registers:

General registers
Control registers

The general registers comprise thirty-two 32-bit registers that can be accessed either as single registers or as aligned 64-bit register pairs. They contain all pointer, scalar, vector and accumulator data.
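A minimal sketch may help picture the pair access. The `RegisterFile` class and its method names are hypothetical, and the layout (the odd register of a pair holds the upper 32 bits) is an assumption of this sketch rather than something stated above.

```python
MASK32 = 0xFFFFFFFF


class RegisterFile:
    """Toy model of thirty-two 32-bit general registers."""

    def __init__(self):
        self.r = [0] * 32

    def write(self, n, value):
        self.r[n] = value & MASK32

    def read(self, n):
        return self.r[n]

    def read_pair(self, even):
        """Read an aligned 64-bit pair starting at an even register.
        Assumption: register even+1 holds the high word."""
        if even % 2 != 0:
            raise ValueError("register pairs must start at an even register")
        return (self.r[even + 1] << 32) | self.r[even]

    def write_pair(self, even, value):
        """Write a 64-bit value across registers even and even+1."""
        if even % 2 != 0:
            raise ValueError("register pairs must start at an even register")
        self.r[even] = value & MASK32
        self.r[even + 1] = (value >> 32) & MASK32


regs = RegisterFile()
regs.write_pair(2, 0x1122334455667788)
print(hex(regs.read(2)), hex(regs.read(3)))  # 0x55667788 0x11223344
```

The alignment check mirrors the "aligned" requirement: a pair can only start at an even register number.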

The control registers include special-purpose registers such as the program counter, status register and loop registers.

There are 2 identical 64-bit single-instruction, multiple-data (SIMD) execution units. Each unit supports all multiply, shift, arithmetic logic unit (ALU) and bit manipulation instructions.

Supported data types include:

8-, 16-, 32-, and 64-bit integers;
16- and 32-bit fractional with optional rounding and saturation;
16-bit complex;
Single-precision IEEE-compatible floating point.

Each unit can support:

Four 16x16 multiplies;
Two 32x16 multiplies; or
One 32x32 multiply, one complex multiply, or one floating-point fused multiply-add (FMA).

Many of the instructions are complex and application specific. Complex instructions targeted to a particular application can provide high performance and energy efficiency. As an example of such optimization, the fast Fourier transform (FFT), one of the most widely used algorithms in signal processing, is depicted in the picture below:

Figure 8 — FFT algorithm diagram

Figure 8 shows a complex multiply instruction used in a 16-bit fixed-point fast Fourier transform (FFT). It is worth pointing out that, without such an instruction, it would take 4 multiplies, 4 shifts, 4 adds and 2 saturates to perform the operation. Packing all of this work into a single instruction executed in a single pipelined execution unit provides large efficiency gains.
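To make the amount of work concrete, here is a sketch of a 16-bit fixed-point complex multiply written out step by step. It assumes the Q15 format (values scaled by 2^15) with round-to-nearest. The helper names `sat16` and `cmpy_q15` are made up for this example, and the exact rounding and operation count of the real instruction may differ.

```python
def sat16(x):
    """Saturate an integer to the signed 16-bit range."""
    return max(-32768, min(32767, x))


def cmpy_q15(a, b):
    """Multiply two complex numbers given as (real, imag) Q15 integers."""
    ar, ai = a
    br, bi = b
    # Four 16x16 multiplies produce Q30 partial products,
    # combined with one subtract and one add.
    re = ar * br - ai * bi
    im = ar * bi + ai * br
    # Round to nearest, shift back to Q15, then saturate each component.
    re = sat16((re + (1 << 14)) >> 15)
    im = sat16((im + (1 << 14)) >> 15)
    return re, im


half = 1 << 14                              # 0.5 in Q15
print(cmpy_q15((half, 0), (0, half)))       # (0, 8192): 0.5 * 0.5j = 0.25j
print(cmpy_q15((-32768, 0), (-32768, 0)))   # (32767, 0): (-1) * (-1) saturates
```

The second call shows why saturation matters: the exact result, +1.0, cannot be represented in Q15, so it is clamped to the largest positive value instead of wrapping around.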

The Hexagon Instruction Set Architecture (ISA) contains numerous special-purpose instructions designed to accelerate key multimedia kernels. Multimedia algorithms with special instruction support include:

Variable length code/decode, such as context-adaptive binary arithmetic coding (CABAC) processing in the H.264 video standard;
Features from accelerated segment test (FAST) corner detection in image processing;
FFT algorithms;
Sliding window filters;
Linear feedback shift;
Table lookup from an arbitrary bit field index;
Elliptic curve cryptography;
Cyclic redundancy check (CRC) calculation.

2.4 Hexagon Instruction Set

2.4.1 Load/Store Instruction

Dual load/store units access signed or unsigned 8-, 16-, 32- and 64-bit values in memory. There is a rich variety of addressing modes, including:

Absolute 32-bit
Base plus scaled immediate and base plus scaled register
Auto-incrementing by register and immediate
Circular addressing and bit reversed

To increase the number of instruction combinations allowed in packets, the load/store units also support 32-bit ALU instructions.
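Two of the addressing modes listed above, circular and bit-reversed, are classic DSP features. The sketch below shows in software the index arithmetic that such modes perform in hardware. The function names are illustrative only and do not correspond to Hexagon instructions.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index, as used to reorder FFT data."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def circular_next(offset, step, length):
    """Advance an offset inside a circular buffer of `length` elements,
    wrapping around at the end (typical for filter delay lines)."""
    return (offset + step) % length


# Access order for an 8-point FFT with bit-reversed addressing.
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]

# Stepping by 3 through an 8-element circular buffer, starting at 6.
print(circular_next(6, 3, 8))                 # 1
```

Doing this arithmetic as part of the load or store itself saves the separate instructions a general-purpose CPU would need for every access.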

2.4.2 Conditional execution and program flow

A unique feature of Hexagon conditional execution is that the processor can generate and use a predicate in the same VLIW instruction packet. This reduces packet count and creates denser packets, both of which improve performance and reduce energy consumption.

Like many DSPs, Hexagon includes a zero-overhead, hardware-counted looping mechanism with support for two levels of nesting. An instruction initializes the loop count and the start address. Bits encoded in the last packet of the loop delineate the end of the loop. This architecture allows execution of loops with no branch mispredictions or stalls, and with no hardware devoted to loop branch prediction.

2.4.3 Compound and memop instructions

Compound instructions combine two or more dependent operations in a single instruction. These instructions improve code size and save power by reducing register file and forwarding power. Examples include shift-add, shift-or, add, compare-branch, shift-xor and many classic DSP multiply-add forms.

Another class of instruction performs simple operations directly on memory, including add, subtract, logical-or and logical-and. Without these memory operations, three instructions would be necessary to perform the same task:

Load the value;
Perform the arithmetic/logical operation;
Store the result.

2.4.4 VLIW instruction grouping

VLIW instruction packets are variable sized and contain one to four instructions. If a packet contains more than one instruction, the instructions execute in parallel. The instruction combinations allowed in a packet are limited to the instruction types that can be executed in parallel in the four execution units. The processor uses parallel execution semantics: all registers are read, then all instructions are executed, then all registers are written.
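The read-all, execute-all, write-all rule can be illustrated with a short sketch. The function `execute_packet` and the tuple format used for instructions are inventions for this example. It models only the basic rule stated above, not special cases such as a predicate generated and consumed in the same packet.

```python
def execute_packet(regs, packet):
    """Execute a packet with read-all-then-write-all semantics.

    regs   : dict mapping register name to value
    packet : list of (dest, function, sources) tuples, one to four entries
    """
    if not 1 <= len(packet) <= 4:
        raise ValueError("a packet holds one to four instructions")
    # Phase 1: read every source operand from the old register state.
    operands = [[regs[s] for s in srcs] for _, _, srcs in packet]
    # Phase 2: execute all instructions on the values read in phase 1.
    results = [fn(*ops) for (_, fn, _), ops in zip(packet, operands)]
    # Phase 3: write all results back.
    new_regs = dict(regs)
    for (dest, _, _), value in zip(packet, results):
        new_regs[dest] = value
    return new_regs


regs = {"r0": 1, "r1": 2}
swap = [
    ("r0", lambda x: x, ["r1"]),   # r0 = r1
    ("r1", lambda x: x, ["r0"]),   # r1 = r0
]
print(execute_packet(regs, swap))  # {'r0': 2, 'r1': 1}
```

Because both instructions read the old values, a single packet can swap two registers without a temporary. Executed one after the other, the same two moves would leave both registers holding 2.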

2.4.5 Duplex Instructions

Hexagon instructions have a fixed length of 32 bits. To improve code size, the duplex feature enables some use of 16-bit instructions by creating a 32-bit sub-packet containing two 16-bit instructions. These sub-packets are called duplexes.

Figure 9 — Hexagon duplex instructions

The figure visualizes a duplex: two 16-bit instructions packed into one 32-bit sub-packet.

Because duplexes are always 32 bits, packet sizes continue to be multiples of 32 bits. This leads to a simpler and lower-power implementation compared to instruction sets that mix 16- and 32-bit instructions. Additionally, duplexes must always end a packet and are always dispatched to the same two execution units, which further simplifies the implementation.

The instructions allowed in duplexes, called sub-instructions, are the most common subset of normal Hexagon instructions, with reduced ranges of registers and immediate operands.

2.5 Hexagon Multi-threading and micro architecture

The number of threads in the Hexagon processor varies by implementation. Early implementations included six hardware threads, but more recent cores include three. Additional threads provide more latency tolerance and enable power-saving opportunities in the microarchitecture by serializing work rather than speculating. On the other hand, they increase cache pressure and the software programming burden.

Figure 10 — Hexagon multi-threading view

The figure shows the programmer’s view of multi-threading. To the programmer, it appears as three VLIW cores with shared caches. Software threads are mapped to hardware threads by the Hexagon operating system.

Hexagon is designed to look like a multi-core architecture with communication through shared memory. Physically, however, there is only one processor, which the three hardware threads share. Hexagon V1 through V4 implemented a simple round-robin Interleaved Multi-Threading (IMT) approach: on every clock tick, a different thread is given a turn at each pipe stage.

Figure 11 — Hexagon pipeline example

The figure shows a 3-stage execution pipeline with 3 threads taking turns dispatching packets.

With the number of threads matched to the execution pipe depth, all of a thread’s instructions from one VLIW packet complete before its next VLIW packet starts.

The obvious problem with IMT is that when threads are idle or stalled, their slice of the processor goes unused. With Hexagon V5, the processor opportunistically executes packets faster if threads are idle or stalled and simple packets are available.
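The difference between fixed IMT and the V5 behaviour can be sketched as a toy simulation. The two functions below are not a model of the real pipeline. They assume that a stalled thread stays stalled for the whole run, and that in the dynamic case every packet is simple enough to use a freed slot.

```python
def imt_issue(stalled, cycles, n_threads=3):
    """Fixed round-robin: cycle c belongs to thread c % n_threads,
    and the slot is wasted when that thread is stalled."""
    issued = [0] * n_threads
    for c in range(cycles):
        t = c % n_threads
        if t not in stalled:
            issued[t] += 1
    return issued


def dynamic_issue(stalled, cycles, n_threads=3):
    """Dynamic rotation: stalled threads are removed from the rotation,
    so the remaining threads share every slot."""
    active = [t for t in range(n_threads) if t not in stalled]
    issued = [0] * n_threads
    for c in range(cycles):
        if active:
            issued[active[c % len(active)]] += 1
    return issued


# Thread 2 is stalled (for example on a cache miss) for 12 cycles.
print(imt_issue({2}, 12))      # [4, 4, 0]: a third of the slots are wasted
print(dynamic_issue({2}, 12))  # [6, 6, 0]: the other threads absorb them
```

With one of three threads stalled, the remaining two threads each issue 6 packets instead of 4 over the same 12 cycles.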

Figure 12 — Hexagon Instruction per cycle (IPC)

The figure shows the instructions per cycle (IPC) for various multimedia benchmarks. They are sorted as multi-threaded applications on the left and single-threaded on the right. The boost from V5 dynamic multi-threading is shown as the additional red bar. The DSP includes an extensive ability to help hide cache latency. The CPU and DSP are not cache coherent with each other so coherency must be maintained in software with explicit cache maintenance operations.

A software remote procedure call (RPC) interface lets a CPU application offload work to the DSP. When the RPC is made, any data associated with the call is flushed to the main memory from the CPU cache and mapped into the DSP virtual address space. The DSP is then interrupted to process the RPC call, after which any results are flushed from the DSP caches back to main memory, and a completion interrupt is sent to the CPU.

Figure 13 — Hexagon DSP Instructions Per Cycle

The overheads of software-managed coherency preclude offloading very small tasks to the DSP. Large kernels that run continuously or process large data (full image frames) are typically needed to amortize the overhead.
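A back-of-the-envelope cost model shows why small tasks do not pay off. The function below is purely illustrative. It assumes a fixed per-call overhead (cache flushes, mapping, interrupts) plus a constant cost per work item on each processor, and the numbers in the example are made up rather than measured.

```python
def break_even_items(cpu_cost, dsp_cost, overhead):
    """Smallest number of work items for which offloading is cheaper.

    Offloading wins when overhead + n * dsp_cost < n * cpu_cost.
    All arguments use the same (arbitrary) integer cost unit.
    Returns None if the DSP is not cheaper per item.
    """
    if dsp_cost >= cpu_cost:
        return None
    return overhead // (cpu_cost - dsp_cost) + 1


# Hypothetical costs: 3 units per item on the CPU, 1 unit per item
# on the DSP, and 200 units of fixed RPC and cache-maintenance overhead.
print(break_even_items(3, 1, 200))  # 101
```

Below the break-even point the fixed overhead dominates and the work is better left on the CPU, which is why large or continuously running kernels are the usual candidates for offloading.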

In simple round-robin thread scheduling (IMT):

The number of threads matches the execution pipe depth (3 threads → 3 execute stages);
All instructions complete before the next packet is dispatched;
The compiler schedules for zero latency, which helps to increase the number of instructions per VLIW packet.

Hexagon DSP V5 adds dynamic hardware multi-threading. It removes a thread from the IMT rotation:

On L2 cache misses;
When in wait-for-interrupt or off mode.

It also adds forwarding to support 2-cycle packets:

VLIW packets with dependencies between long-latency instructions will stall;
Many VLIW packets with simple instructions can complete in 2 processor clocks.

Communication between the DSP and CPU is done through a traditional shared-memory-plus-interrupt mechanism. Both the DSP and CPU can access the full physical address space and share the external memory. Access to memory is cache-based and there is no explicit data mover.

2.6 Kryo 495

The Kryo series is the successor to the Krait cores. Unlike Krait, the Kryo 495 is not Qualcomm’s own design, but a semi-custom implementation of 4x Arm Cortex-A55 (Kryo Silver) and 4x Arm Cortex-A76 (Kryo Gold), arranged in a DynamIQ big.LITTLE configuration.

Figure 14 — Kryo 495 CPU architecture

Arm big.LITTLE technology is a heterogeneous processing architecture that uses two types of processor. “LITTLE” processors are designed for maximum power efficiency, while “big” processors are designed to provide maximum compute performance. With two dedicated processor types, the big.LITTLE solution is able to adjust to the dynamic usage pattern: it handles periods of high processing intensity, such as those seen in mobile gaming and web browsing, as well as long periods of low-intensity tasks such as texting, e-mail and audio.

On top of all this sits a software layer (Global Task Scheduling) that schedules the right application onto the right CPU. On Kryo:

The A76 (Kryo Gold) is a low-power/high-performance processor, the “big” cluster in big.LITTLE;
The A55 (Kryo Silver) is an ultra-high-efficiency processor, the “LITTLE” cluster in big.LITTLE.
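The idea of scheduling "the right application onto the right CPU" can be sketched as a simple migration rule. This is not Arm's Global Task Scheduling algorithm. The function name, the load scale from 0.0 to 1.0, and the two thresholds are assumptions chosen only to show the principle of moving a task between core types as its load changes.

```python
def next_cluster(current, load, up=0.7, down=0.3):
    """Decide where a task should run next.

    current : "big" or "LITTLE", where the task runs now
    load    : recent load of the task, from 0.0 to 1.0 (assumed metric)
    up/down : hypothetical migration thresholds; the gap between them
              keeps a task from bouncing between core types
    """
    if current == "LITTLE" and load > up:
        return "big"
    if current == "big" and load < down:
        return "LITTLE"
    return current


# A task that idles, bursts (a game starts), eases off, then idles again.
cluster = "LITTLE"
history = []
for load in [0.1, 0.8, 0.5, 0.2]:
    cluster = next_cluster(cluster, load)
    history.append(cluster)
print(history)  # ['LITTLE', 'big', 'big', 'LITTLE']
```

At a load of 0.5 the task stays on the big core: it only moves back once the load drops below the lower threshold.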

In the original big.LITTLE, each group of processors is located in a different cluster and does not share resources with the other cluster.

All cores in a cluster are identical: they share the cluster’s L2 cache and run at the same frequency with the same implementation.

DynamIQ improves on big.LITTLE technology. Big and LITTLE cores can now be placed in the same cluster, managed by the DynamIQ Shared Unit (DSU). Moving all cores into one cluster helps reduce memory latency, which increases performance without increasing power consumption, and also makes it easier for the cores to communicate with one another.

Any combination of cores is now possible, and each core can be operated at a different frequency. ARM processors also get an L3 cache for the first time.

With all the processors in the same cluster, manufacturers can pick and choose which cores they want to put in that cluster. ARM claims there are more than 3k different configurations one could make in a cluster.

Figure 15 — DynamIQ possible configurations

Up to 8 cores are now supported in each cluster, and the cores can differ from each other in microarchitecture, implementation, cache configuration and frequency. Each core has its own L1 and L2 cache, and an L3 cache is shared by the entire cluster, which leads to more performance and efficiency.

2.7 Connectivity

2.7.1 Snapdragon X24 Modem

The Snapdragon X24 is an integrated LTE (4G) modem, which provides up to 4x4 MIMO on five aggregated carriers, and up to 20 LTE spatial streams. It is designed to help mobile operators fully mobilize their spectrum assets, support Gigabit speeds across a wide range of scenarios, and maximize the capacity of their Gigabit LTE networks, reaching up to 2 Gbps.

2.7.2 Snapdragon X55 Modem

The Lenovo Flex 5G comes with an X55 modem external to the SoC, which provides 5G sub-6 GHz and mmWave connectivity. It promises speeds of up to 7 Gbps for download and 3 Gbps for upload.

2.7.3 Wi-Fi 6

Wi-Fi 6 uses 1024-QAM to pack more data into each symbol (more efficiency) and a 160 MHz channel to offer a wider channel for more speed. It also uses 8x8 uplink/downlink MU-MIMO, OFDMA and BSS Color to provide up to 4x more capacity and handle more devices.

With 1024-QAM, each symbol carries 10 bits instead of 8, which improves the raw speed by 25% compared to 802.11ac 256-QAM.
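The 10-versus-8 figure follows directly from the constellation sizes, since a constellation with N points encodes log2(N) bits per symbol. A quick check (the function name is just for this example):

```python
import math


def bits_per_symbol(constellation_points):
    """Bits carried by one symbol of an N-point QAM constellation."""
    return int(math.log2(constellation_points))


bits_ax = bits_per_symbol(1024)   # Wi-Fi 6 (802.11ax)
bits_ac = bits_per_symbol(256)    # 802.11ac
gain = bits_ax / bits_ac - 1

print(bits_ax, bits_ac)           # 10 8
print(f"{gain:.0%}")              # 25%
```

The gain applies to the raw bit rate only. A denser constellation also needs a cleaner signal to be decoded correctly.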

Figure 16 — 1024-QAM vs. 256-QAM bit rate comparison

OFDM (orthogonal frequency-division multiplexing) transfers data by dividing it across many smaller subcarriers, which improves stability and coverage area. Wi-Fi 6 (802.11ax) uses a 4x longer OFDM symbol to create 4x more subcarriers. The longer symbol offers better coverage and makes the connection about 11% faster.

Wi-Fi 6 expands the channel bandwidth from 80 MHz to 160 MHz, doubling the bandwidth and creating a faster connection from the router to the device.

2.7.5 Bluetooth 5.0

Characteristics:

Standard: IEEE 802.15.1;
Frequency: 2.4 GHz;
Range: 200 meters;
Power: 100 mW (class 1);
Speed: 50 Mbps (maximum).

Advantages over previous version:

2x faster pairing time and less transmission delay;
4x range, which is 40m indoors and 200m in a line-of-sight distance;
8x data transfer rate and support for a dual audio mode that connects multiple devices to the same source;
Wireless coexistence with other technologies.

2.8 Spectra 390

Spectra is an ISP (Image Signal Processor) that offloads image processing from the CPU. This saves 4x more power compared to previous generations, as the ISP is more energy efficient for image processing than the CPU. Depth detection, object classification and object segmentation can all happen in 4K HDR at 60 Hz. It also improves Artificial Intelligence and Augmented Reality workloads. It supports higher resolutions: 22 MP at 30 Hz for dual cameras and 48 MP at 30 Hz for a single camera. It integrates a parallax-based depth-sensing system that works much like the human eye, judging the relative distance of objects from a two-lens perspective.

Qualcomm claims it will enable many dual-camera devices to achieve competitive depth-sensing performance at low cost.

2.9 Security

The Qualcomm Snapdragon 8cx uses a TPM (Trusted Platform Module) to provide security. The TPM is a chip integrated into the Snapdragon 8cx SoC.

The TPM generates encryption keys and keeps part of them to itself rather than on the disk. This means an attacker can’t just remove the drive from the computer and attempt to access the files elsewhere.

The TPM provides hardware-based authentication and tamper detection, so an attacker can’t remove the chip and place it on another motherboard, nor tamper with the motherboard itself to bypass the encryption. It ensures the boot process starts from a trusted combination of hardware and software and continues until the operating system has fully booted and applications are running.

The responsibility for ensuring this integrity using the TPM lies with the firmware and the OS. UEFI can use the TPM to form a root of trust: the TPM contains several Platform Configuration Registers (PCRs) that allow secure storage and reporting of security-relevant metrics, which can be used to detect changes to previous configurations and decide how to proceed.
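PCRs detect configuration changes through the "extend" operation: a PCR is never written directly, only replaced by the hash of its old value concatenated with the digest of a new measurement. The sketch below imitates this with SHA-256 in plain Python. The function names and the boot stage strings are invented for illustration, and a real TPM performs this operation inside the chip.

```python
import hashlib


def pcr_extend(pcr, measurement):
    """new PCR = SHA-256(old PCR || SHA-256(measurement))"""
    digest = hashlib.sha256(measurement).digest()
    return hashlib.sha256(pcr + digest).digest()


def measure_boot(stages):
    """Fold a sequence of boot-stage measurements into one PCR value."""
    pcr = bytes(32)  # assumption: the PCR starts at all zeros after reset
    for stage in stages:
        pcr = pcr_extend(pcr, stage)
    return pcr


good = measure_boot([b"firmware", b"bootloader", b"kernel"])
tampered = measure_boot([b"firmware", b"evil-bootloader", b"kernel"])

print(good.hex())
print(good != tampered)  # True: any changed stage changes the final value
```

Because each value depends on everything measured before it, the final PCR acts as a fingerprint of the whole boot chain, including the order of the stages.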

There are several versions of TPM. The one used on the Snapdragon 8cx is TPM 2.0, which supports newer algorithms and thereby improves performance when generating new keys and digital signatures. In TPM 2.0, SHA-1 and SHA-256 are required for hashes. RSA and ECC with the Barreto-Naehrig 256-bit curve and the NIST P-256 curve are used for public-key cryptography and asymmetric digital signature generation and verification.

For symmetric digital signature generation it uses HMAC, and for symmetric-key encryption it uses 128-bit AES.

The table below compares TPM 1.2 with TPM 2.0

Table 2 — TPM 1.2 and TPM 2.0 comparison

3. CONCLUSION

This article has shown that the SoC concept allows systems in notebooks and mobile devices to be optimized for their applications. In the mobile market, Snapdragon improves the efficiency of recent devices, allowing programmers to develop better applications. The article has also presented a brief overview of Snapdragon’s main architectural components to illustrate its capabilities and to explain why its presence in the market is growing consistently.