An Architectural Deep-Dive into AMD’s TeraScale, GCN & RDNA GPU Architectures

Abheek Gulati
High Tech Accessible
25 min read · Nov 11, 2019


With an overview of AMD’s GPUs and supporting prerequisite information behind us, it’s time to delve into TeraScale, GCN and RDNA’s architectural depths…

This post is the second in a two-part series; please find Part 1, An Overview of AMD’s GPU Architectures, here:

1. TeraScale

Let’s start within TeraScale’s core and build our way out:

A processor is a complex ensemble of fetch, decode & execute logic coupled with storage registers and caches, all working in tandem to carry out whatever number-crunching is required. These execution blocks are themselves built up of simpler foundational blocks.

With TeraScale, as with all of AMD’s GPUs today, this most fundamental execution block is the Stream Processor or SP. An SP is what AMD chooses to call an Arithmetic and Logic Unit (ALU) in the context of their GPUs; ALUs, as their name suggests, specialize in executing mathematical operations.

In TeraScale, several SPs and a branch control unit along with storage registers all come together to make up a single Stream Processing Unit, or SPU.

Further up, several SPUs along with more control units and storage registers together make up a single SIMD core. Several SIMD cores and even more control hardware ultimately come together to make a complete TeraScale GPU.

Of course, a pictorial representation will do a far better job of getting all this across:

Several Stream Processors along with registers & control logic comprise a single Stream Processing Unit (SPU), while several SPUs along with more control hardware & registers comprise a single SIMD core…

…And several SIMD cores together with additional logic & cache build a complete TeraScale chip!

Above we see 10 SIMD cores coming together to make an RV770, the chip at the heart of the Radeon HD 4870 GPU

It goes without saying that the complete GPU is more complex than what’s seen here, but this gives you a fair idea as well as a glimpse into the heart of this now-defunct beast.

Let’s now see how it all comes together to process those vectors we spoke of earlier:

A VLIW Heart

TeraScale happens to be what’s called a VLIW chip.

VLIW stands for Very Long Instruction Word and is another type of Instruction Set Architecture (ISA). Recall from earlier that an ISA comprises the set of instructions that a chip can understand and therefore act on, and that ISAs can be of different types: x86–64 is a CISC type ISA while ARM is a RISC type ISA. Similarly, AMD’s TeraScale GPUs implemented a VLIW-type ISA.

Simply put, VLIW is another attempt at speeding up chips. While the obvious approach involves building faster cores that churn through more instructions per clock cycle, another is to simply do more stuff at once. This latter approach necessitates multiple processing cores in a single system, which explains the many-core CPUs of today, with even cellphones now boasting 8-core CPUs.

Having multiple cores is one thing, utilizing them effectively is quite another. When you run a program, it creates a system process (visible in the task manager) which in turn spawns one or more “threads”. A thread is a self-sufficient bunch of instructions awaiting the CPU’s attention for execution (self-sufficient as it contains all the data and state information necessary for its execution). A thread is thus the smallest sequence of instructions that can be scheduled for execution by a scheduler.

A simple approach to utilizing more cores then would be to have them execute several independent threads in parallel. Indeed, this approach is used by CPUs and is called Thread Level Parallelism (TLP).
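To make TLP a little more concrete, here’s a minimal Python sketch; the task names and values are invented for illustration, and note that for CPU-bound work CPython’s GIL limits true parallelism, so the point here is simply the scheduling of independent, self-contained threads:

```python
# A minimal sketch of thread-level parallelism (TLP): the program spawns
# independent tasks and a pool of threads, and the scheduler is free to run
# them in parallel across whatever cores are available.
from concurrent.futures import ThreadPoolExecutor

def work(task_id, a, b):
    # Each task is self-contained: it carries everything it needs to execute.
    return task_id, a + b

tasks = [("t0", 1, 2), ("t1", 3, 4), ("t2", 5, 6), ("t3", 7, 8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(work, *t) for t in tasks]
    for f in futures:
        print(f.result())   # results arrive as the scheduler sees fit
```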

Instruction Level Parallelism (ILP) is an alternate take on parallel processing: with ILP, several independent instructions are packed together into a single, very long instruction, giving us what’s called a Very Long Instruction Word (VLIW). This VLIW thread is then sent off to the processor, where it’s unpacked at execution time and the resulting operations are executed by the available processing units.

Both ILP and TLP approaches share a common critical requirement though: operations executing in parallel must be independent of each other, be they disparate threads or the instructions from within a VLIW thread. This makes sense: if an operation relies on the output of another it’ll simply have to await those results before it can itself execute.

Consider a very simple example:

A + B = C

X + Y = Z

C * Z = R

While the first two are entirely independent of the other instructions, the third relies on the preceding two and will thus have to await their execution.
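To make the compiler’s job concrete, here’s a toy sketch of dependency-aware bundling using the three operations above; the five-slot limit mirrors TeraScale’s VLIW5 arrangement, but the instruction encoding and everything else is invented purely for illustration:

```python
# Toy compile-time VLIW bundling: operations that don't consume a result
# produced within the current bundle can be packed together; anything that
# does must wait for the next bundle. The instruction encoding is made up.
SLOTS_PER_BUNDLE = 5          # TeraScale's VLIW5 arrangement

ops = [
    ("C", ("A", "B")),        # C = A + B
    ("Z", ("X", "Y")),        # Z = X + Y
    ("R", ("C", "Z")),        # R = C * Z  -- depends on the two results above
]

bundles, current, current_outputs = [], [], set()

for dest, sources in ops:
    # Start a new bundle if this op depends on a result still pending in the
    # current bundle, or if the current bundle is already full.
    if any(src in current_outputs for src in sources) or len(current) == SLOTS_PER_BUNDLE:
        bundles.append(current)
        current, current_outputs = [], set()
    current.append(dest)
    current_outputs.add(dest)

bundles.append(current)
print(bundles)                # [['C', 'Z'], ['R']] -- two issues instead of three
```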

This might appear a subtle difference, but it raises a very important question: whose job is it to identify such independent data for parallel execution? With TLP, that burden is shared by the application programmer and the hardware: while the programmer is responsible for writing thread-aware code that takes advantage of multiple processing cores, the processor itself ultimately schedules threads for parallel execution at run-time, maximizing utilization. TLP thus follows a dynamic, run-time approach to scheduling wherein the processor itself acts as the scheduler.

With an ILP approach things are not as simple: independent instructions must be identified ahead of time and packaged into a single VLIW thread. This leaves the entire burden of scheduling on the software and, more specifically, on the compiler. In software parlance, a compiler is a special program that converts code written in a near-English (and thus high-level) language such as Java or C/C++ into low-level machine code based on the processor’s ISA, thus acting as an intermediary translator.

While the compiler gets the advantage of a full view of the program and could therefore be expected to schedule intelligently, there are conditions the compiler remains blind to, as some outcomes remain unknown until actual execution time. Exacerbating this problem is the fact that the scheduling set by the compiler cannot be altered at run-time by the processor, leaving us with a static, compile-time approach to scheduling, in stark contrast to TLP.

So ILP is a static scheduling approach that complicates the design of the compiler and inevitably leaves compute resources idle at times; why ever use it then? Because graphics is a highly parallelizable application domain that can put an ILP approach to arguably good use. Further, when combined with TLP as done on TeraScale (surprise!), ILP can lead to some very impressive performance figures.

So how did AMD utilize ILP and further combine it with TLP on their VLIW-based TeraScale architecture? Let’s look down the compute lane:

TeraScale at Work: TLP + VLIW ILP on a SIMD Core

Recall that a GPU fetches several datapoints or pixels at once in a grouping called a “vector” along with a corresponding instruction in accordance with its SIMD nature. AMD likes to refer to these vectors as “wavefronts” and with TeraScale, 64 VLIW threads of pixel values or datapoints are grouped into a wavefront and dispatched to a SIMD core for processing. With 16 SPUs per SIMD core, the full 64-wide wavefront is executed in four cycles.

With the 16 SPUs of a SIMD core each processing a VLIW thread every clock cycle, we see thread level parallelism or TLP in action as 16 VLIW threads are processed at any given time.

Instruction level parallelism (ILP) comes in next as each VLIW thread is dissected for its constituent datapoints which are then executed individually by the stream processors within the SPU.

And with 16 VLIW threads executing against the same instruction at any given time, a SIMD (Single Instruction Multiple Data) architecture is in play throughout.

Utilization remains a big concern though, for both the SPUs and the SPs within them: not only must the compiler do its best to identify 5 independent datapoints for each VLIW thread, but 64 VLIW threads must also be packed together within each wavefront. Further, the 64 items in a wavefront should all execute against the same instruction; imagine a scenario wherein one thread executes against an entirely different instruction from the other 63! Opportunities for additional clock cycles & poor utilization thus abound and the compiler must do its best to schedule around them.

With 5 SPs in each SPU, attaining 100% utilization necessitates five datapoints per VLIW thread. That’s the best case; in the worst case an entire thread comprises just a single datapoint, resulting in an abysmal 20% utilization as 4 SPs simply engage in idle chit-chat. Extremities aside, AMD noted an average utilization of 68%, or 3.4 SPs per clock cycle. A diagram from AnandTech’s GCN preview article depicts this scenario, and it’s a good time to borrow it here:

Some cycles see 100% utilization of the SPs as others see just 20% utilization with only one SP engaged. On average, AMD notes 68% utilization per cycle, or 3.4 SPs.
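As a back-of-the-envelope check on those figures, here’s a minimal sketch of the utilization arithmetic in Python; the per-thread slot counts are invented purely so that the average lands on the reported 3.4-of-5 (68%) figure:

```python
# Back-of-the-envelope SPU utilization on TeraScale (VLIW5): each SPU has
# 5 stream processors, so utilization per cycle is (slots filled) / 5.
SLOTS_PER_SPU = 5

slots_filled = [5, 4, 1, 4, 3, 4, 3, 4, 5, 1]       # hypothetical VLIW bundles
per_cycle = [filled / SLOTS_PER_SPU for filled in slots_filled]

print(f"best case : {max(per_cycle):.0%}")           # 100% -- all 5 SPs busy
print(f"worst case: {min(per_cycle):.0%}")           # 20%  -- a lone SP busy
print(f"average   : {sum(per_cycle) / len(per_cycle):.0%}")   # 68%, i.e. 3.4 SPs
```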

TeraScale over Three Generations: Optimizing ILP until the End

TeraScale evolved across three generations over its lifetime, starting with Gen1 on the Radeon HD 2xxx series and culminating with the Gen3-based Radeon HD 69xx series. Three primary enhancements sum up the changes over this period: more SIMD cores, smaller process nodes and a more optimized SPU.

The Radeon HD 2900 XT served as TeraScale’s debut flagship: manufactured on TSMC’s 80nm process with just 4 SIMD cores, it puts into perspective how far we’ve come today!

The HD 3000 series followed with similar specs albeit on TSMC’s newer 55nm process and, like its predecessor, proved underwhelming against Nvidia’s offerings at the time. Things really did turn in AMD’s favor with the HD 4000 series as the flagship HD 4870 dramatically upped the SIMD core count straight up to ten, in addition to adopting newer GDDR5 VRAM netting 1.5x gains in memory bandwidth.

While the HD 4000 series were good, the HD 5000 series would usher in TeraScale’s heyday: debuting the TeraScale 2 architecture on a brand new 40nm process, the Radeon HD 5000 family of GPUs remain arguably AMD’s best to date and are in fact so well regarded that AMD’s newest RDNA-based RX 5000 series of cards are named in honor of this GPU family! With the flagship HD 5870, AMD once again doubled the number of SIMD cores, now twenty, along with the L2 cache & VRAM.

TeraScale 3 would feature only on the Radeon HD 6900 series with a significant change: reducing the number of stream processors per SPU from five to four. This was AMD responding to their observation of SP utilization averaging around 3.4 SPs per SPU every clock cycle. This reduction would aid utilization & efficiency as well as leave silicon for additional SIMD cores. Indeed, the flagship Radeon HD 6970 GPU modestly increased SIMD core count to 24.

The HD 6900 series would serve as the last of the flagship TeraScale GPUs, even as TeraScale-based cards continued to release until October of 2013. As compute applications began to take center-stage for GPU acceleration, games too evolved. Newer generations of graphics APIs brought along increasingly complex shaders that made the VLIW-centric design of TeraScale ever more inefficient and impractically difficult to schedule for. The Radeon HD 7000 series would accordingly usher in the GCN architecture, TeraScale’s inevitable successor that would abandon VLIW and ILP entirely and in doing so cement AMD’s focus on GPU compute going forward.

2. GCN — Graphics Core Next

With a mission to end Nvidia’s dominance in the GPU compute space, GCN set out with big goals. To do so would require AMD to free their GPUs of VLIW’s shackles & its accompanying instruction-centric approach to parallelism, building a new GPU architecture from the ground-up. New architectures are never easy, and with this transition necessitating major changes & complete overhauls, it would be no mean feat.

Why though? Why go through all this fuss? Simply put, because AMD really had no choice in the matter:

You see, the enterprise & high-performance computing (HPC) markets are among the highest-margin customers hardware companies appeal to, and the compute potential of GPUs happens to be of great value to these folk, a target audience whose potential Nvidia had recognized first and addressed with Fermi, their first-ever compute-centric architecture. Nvidia complemented Fermi’s development with heavy investments in the surrounding software ecosystem, most notably CUDA, which continues to dominate the GPU compute space even today.

If you’re wondering what the point here is, it’s all in the numbers: Fermi released in early 2010 and by the third quarter of 2011, Nvidia was already enjoying the spoils: Q3’11 saw Nvidia declare a net income of $146M from a total $644M in gaming revenue while during the same period, profits from the professional market amounted to $95M from a total revenue of just $230M: that’s equivalent to 65% of the gaming profit from just 35% of the sales revenue; talk about healthy, healthy margins!

And so, with both finances and the evolving software & gaming ecosystem demanding it, GCN arrived nearly two years later as AMD’s Fermi moment, boldly announcing their arrival on the GPU compute playground. What changed & how did AMD’s take on a thread-parallel GPU shape up? Let’s dive right in:

Recall that with TeraScale, the stream processor (SP) forms the foundational compute execution block. “Stream Processor” is another term for an ALU and TeraScale houses five SPs (four with TeraScale 3) in a single Stream Processing Unit (SPU) with 16 SPUs coming together to make a single SIMD core, several of which build a complete chip.

In the case of GCN this layering is shifted up a stage: individual stream processors still form the foundational blocks, but now 16 of them come together directly to build a single SIMD core. Further, four SIMD cores together build a single Compute Unit, or CU with several CUs finally coming together to build a single GCN chip.

Diagrams once again:

A SIMD core in GCN comprises sixteen Stream Processors (SPs) directly, rather than sixteen Stream Processing Units (SPUs) which themselves each comprise five or four SPs

And four SIMD cores now come together in a single Compute Unit (CU), which a GCN GPU contains several of. Also illustrated above are the Scalar ALU, the branch & fetch/decode logic and the registers/cache that form part of the CU.

Let’s look at work distribution in GCN:

With VLIW and ILP out the window, GCN is a pure SIMD architecture: wavefronts are no longer comprised of VLIW threads but rather of 64 individual datapoints which are executed by the 16 SPs within SIMD cores. Wavefronts remain 64-wide, necessitating the same four cycles to churn through. Further, each compute unit contains four SIMD cores & each of these may work on separate wavefronts so at any time, a CU may be processing up-to four different wavefronts.
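A quick sketch of that issue arithmetic, using only the figures given above:

```python
# GCN's execution model in numbers: a 64-item wavefront is pushed through a
# 16-lane SIMD over 64 / 16 = 4 cycles, and each of a CU's four SIMDs can be
# working on its own wavefront, so a fully occupied CU retires 64 items/cycle.
SIMD_WIDTH, WAVE_SIZE, SIMDS_PER_CU = 16, 64, 4

cycles_per_wavefront = WAVE_SIZE // SIMD_WIDTH          # 4 cycles
items_per_cu_cycle   = SIMD_WIDTH * SIMDS_PER_CU        # 64 items per cycle
wavefronts_in_flight = SIMDS_PER_CU                     # up to 4 per CU

print(cycles_per_wavefront, items_per_cu_cycle, wavefronts_in_flight)
```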

All this brings along a massive benefit: that of the software no longer having to identify and schedule independent data into VLIW threads, greatly simplifying the design of compilers. Independent threads are instead dynamically scheduled by the hardware at runtime resulting in a much simpler approach to scheduling. All this lends itself very favorably to compute applications as well as to modern games.

New to GCN, and specifically for the scheduling of compute workloads across CUs, are the Asynchronous Compute Engines, or ACEs, which preside over resource allocation, context switching & task priorities. As GCN is built to concurrently work on multiple tasks, ACEs independently schedule wavefronts across CUs. A GCN GPU may carry multiple ACEs.

The Graphics Command Processor (GCP) serves as the graphics counterpart to the ACE & makes a return from TeraScale. The GCP works to schedule activities across the graphics subsystem, primarily involving scheduling across the “primitive” pipelines: complex surfaces & objects in games are built up of simpler geometrical shapes, called primitives, bunched together in large numbers. Triangles are the dominant choice here as their position in 3D space can be entirely determined by just three points. This process of breaking complex objects, shapes & surfaces down into such simpler primitives is referred to as tessellation. The primitive pipelines are responsible for this tessellation in addition to other geometry & high-order surface processing, & the GCP is responsible for scheduling work over these pipelines.

Do notice the Scalar ALU within the CU: this is a special ALU dedicated to any “one-off” mathematical and transcendental (e.g. logarithmic, sine/cosine) operations. The very meaning of a SIMD core implies vector processing, and that involves applying an instruction to a group of values (a vector) rather than to a single value (a scalar). A scalar or one-off operation disrupts this flow, and a separate Scalar ALU alleviates this by keeping these operations out of the SIMD cores.

So where was this with TeraScale? Unfortunately, within the SPUs: in TeraScale Gen1 & 2 the 5th SP in each SPU served as the Special Function Unit, or SFU, while Gen3 bunched 3 of the 4 SPs within an SPU together for this. This resulted in severe latency for scalar operations as they had to be scheduled within a wavefront: the HD 6900 series had a nasty 44-cycle latency for scalar operations. With the separation of the Scalar ALU from the SIMD core entirely, GCN brings this down to one cycle.

GCN Through the Years & Today

The first GPUs featuring GCN debuted on the 9th January 2012 with the Radeon HD 7950 & HD 7970 GPUs. On the 7th of January 2019, AMD announced the Radeon VII: the last GCN GPU. That’s a good seven years, practically eons in the compute world. GCN wasn’t a stagnant architecture over this duration, instead evolving with a mix of typical incremental upgrades in addition to other, more significant enhancements. Let’s peek at GCN’s journey:

First & Second Generation GCN:

While Gen1 GCN entirely abandoned VLIW in favor of a pure SIMD architecture, Gen2 brought along incremental upgrades with more functional units, higher memory bandwidth & better power management. New compute-centric instructions in GCN’s ISA accompanied these, along with support for a greater number of ACEs, with the R9 290X flagship sporting 8 where Gen1 had a hard limit of 2.

Bridgeless Crossfire was introduced here as well: Crossfire enables the use of multiple Radeon GPUs in a system & previously necessitated a hardware bridge to connect these GPUs. Bandwidth limitations over the bridge would require the CPU to mediate exchange over the PCIe bus, invoking a frametime penalty. Dedicated hardware in the form of the XDMA Crossfire engine would now control this & the much higher bandwidth of the PCIe bus meant a dedicated bridge was no longer necessary.

In terms of raw numbers, the Gen2 flagship R9 290X came bearing 44 CUs, up from 32 CUs on the Gen1 HD 7970 & R9 280X, along with a wider memory bus (512-bit vs 384-bit) & an additional gig of VRAM.

Third Generation GCN:

Debuting in September of 2014, Gen3 GCN brought along two major features on the compute side: GPU pre-emption & support for FP16 arithmetic. Pre-emption is the act of interrupting the execution of a task, without its consent, in favor of another higher-priority task, with the intention of resuming it later. This is a big deal as GPUs have always been poor at context switching.

With regard to FP16 operations: GPUs deal almost exclusively with floating point numbers (non-integer decimals/fractions) which typically occupy 32 bits in computer memory and are thus referred to as FP32, or single-precision, numbers. Not every application requires that much precision, with many compute applications adequately addressed by half-precision numbers which occupy half the space in memory at just 16 bits. These are referred to as FP16 numbers & lead to significant memory savings.
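A quick illustration of that saving, as a minimal sketch assuming NumPy is available (the array size is arbitrary):

```python
# The same element count at half the precision occupies half the memory.
import numpy as np

n = 1_000_000
fp32 = np.ones(n, dtype=np.float32)
fp16 = np.ones(n, dtype=np.float16)

print(f"FP32: {fp32.nbytes / 1e6:.1f} MB")   # ~4.0 MB
print(f"FP16: {fp16.nbytes / 1e6:.1f} MB")   # ~2.0 MB
```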

On the graphics side AMD introduced lossless delta color compression for the transfer of large textures. The use of compression for textures is not new, though with this AMD claimed a massive 40% gain in memory bandwidth efficiency.

Lastly, FreeSync & HBM were introduced here as well. FreeSync is AMD’s implementation of adaptive refresh-rate technology, allowing the monitor to change its refresh rate on-the-fly to match the frame output rate of the GPU, eliminating the stutter that occurs when the framerate falls below the refresh rate and the tearing that occurs when it exceeds it. HBM, or High Bandwidth Memory, is a memory standard that places the VRAM and GPU core on the same large slice of silicon, dubbed an ‘interposer’, as opposed to soldering memory chips separately onto the PCB. HBM allows for much higher bandwidth, lower latency as well as reduced power consumption. The trade-off? Much higher cost.

The flagship Fury X increased CU count to 64 while HBM enabled a colossal 8-fold increase in memory bus width to 4096 bits from 512 bits on the R9 290X, resulting in a 60% increase in memory bandwidth to 512GB/s from 320GB/s. The Fury cards were the first GPUs to use HBM, which makes a comeback with GCN Gen5, which uses HBM2 exclusively.

Fourth Generation GCN: Polaris

Gen4 GCN debuted on the Polaris RX 400 series of GPUs in June of 2016. With the flagship RX 480 squarely a mid-tier card with just 36 CUs, Polaris represented a big shift in AMD’s GPU strategy as they set out for the mainstream market first, deferring the launch of high-end GPUs to a hitherto-undisclosed date. This speaks volumes of the extent of GCN’s success, or lack thereof, in the high-end space but we’ll defer that discussion for a while, instead maintaining focus on the new features Gen4 GCN brought along.

Polaris aimed to make big improvements in the domain of power consumption, an area where GCN had fared poorly so far with several flagships running hot & loud while happily chugging on the power lines.

In addition, Polaris brings along support for instruction pre-fetching, a predictive process wherein processors guess the instructions they’ll be executing next based on the current execution state & then fetch those instructions. Correct pre-fetching leads to significant performance gains as the processor need not wait for data to be read in from memory, which is orders of magnitude slower. Incorrectly pre-fetched instructions are discarded, degrading efficiency, though pre-fetching techniques are constantly refined to minimize this. Either way, GCN Gen4 GPUs could now pre-fetch instructions, something prior GCN GPUs simply could not do. A larger instruction buffer naturally accompanies this.

On the graphics side, AMD added the Primitive Discard Accelerator. As surfaces & complex shapes are ‘tessellated’, i.e. made up of many smaller & simpler polygons (typically triangles), the primitive discard accelerator culls visually insignificant triangles (hidden or too small) for increased performance.

The RX 500 series launched a year later, bringing improvements to clockspeeds & reductions in power consumption. While the RX 400 series were AMD’s first cards on the 14nm process, the RX 590 later launched on a more refined 12nm process, mildly improving clockspeeds & power efficiency.

GCN Gen4 remains the only GPU family from AMD lacking a high-end flagship member.

Fifth Generation GCN: Vega

While an entire article can be written on Vega’s release shenanigans, we’ll refrain from going down that path here. Barring a little context, we’ll retain focus on the notable new features Vega brought along.

Vega launched on Monday the 14th of August 2017 as the Vega 56 and Vega 64 GPUs, with those numbers denoting the number of compute units in those respective GPUs. The high CU count marked AMD’s return to the high-end space after a two-year absence. AMD’s recently launched Ryzen CPUs had delivered on value & performance far beyond expectations, cementing hopes for Vega to do the same in the GPU space. This would prove to be flawed chain reasoning, causing expectations & excitement to spiral out of control prior to launch: at one point enthusiasts were offering to privately fund the overseas travel of knowledgeable YouTubers such as Buildzoid to have them analyze AMD’s Vega events! Though I really blame AMD’s mix of dramatic & drip-feed marketing for this, I’ll refrain from talking further about it here.

The GPU landscape wasn’t favorable for AMD either: they barely held 25% of desktop GPU market share at this point, so it would be very hard to get game developers to support any new Vega-centric gaming features as they’d benefit very few. Further exacerbating this situation was Nvidia’s dominance of the entire GPU landscape with their Pascal architecture, one of their best architectures ever, now legendary for refinement, efficiency & raw performance.

Regardless, AMD’s engineers did consider Vega to be their largest architectural overhaul in five years, even actively distancing themselves from the GCN tag & referring to this as the ‘Vega’ architecture instead. This is still very much GCN though, so naming conventions aside let’s peek into the changes:

The biggest change comes to GCN’s FP16 compute capabilities: while Gen3 introduced FP16 support, the operations themselves didn’t execute any faster as each individual stream processor could still handle only one operation at a time, be it FP16 or FP32. Vega changes that significantly: each SP can now handle two FP16 operations in place of a single FP32 op, a feature AMD dubbed ‘Rapid Packed Math’.
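To see what that means for peak throughput, here’s a small sketch of the arithmetic; the SP count and boost clock are the Vega 64’s published figures, used purely as a worked example, and the usual one-FMA-per-SP-per-clock convention behind such peak figures is assumed:

```python
# Peak-throughput arithmetic under Rapid Packed Math: each stream processor
# retires one FP32 FMA per clock (2 FLOPs), and with packed math, two FP16
# FMAs per clock (4 FLOPs).
STREAM_PROCESSORS = 4096     # Vega 64
BOOST_CLOCK_GHZ   = 1.546

fp32_tflops = STREAM_PROCESSORS * 2 * BOOST_CLOCK_GHZ / 1000
fp16_tflops = fp32_tflops * 2       # two FP16 ops per FP32 slot

print(f"FP32 peak: ~{fp32_tflops:.1f} TFLOPS")   # ~12.7
print(f"FP16 peak: ~{fp16_tflops:.1f} TFLOPS")   # ~25.3
```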

The next big change comes to the memory subsystem with the adoption of HBM2 and the introduction of the High Bandwidth Cache Controller (HBCC). More relevant as a compute feature, the HBCC extends the GPU VRAM’s reach to system RAM & secondary storage for datasets too large for the VRAM alone.

Next are improvements to the graphics engines with the introduction of primitive shaders which allow for high-speed discarding of primitives, i.e. visually insignificant polygons (hidden or very small triangles) along with the Draw Stream Binning Rasterizer to further aid in this regard. This continued emphasis on culling helps prevent the clogging of render pipelines & frees the GPU of useless work, reducing memory access and saving memory bandwidth while reducing power consumption.

Lastly, we have better load balancing across the geometry/shader engines resulting in a 2x speedup in geometry throughput. While geometry performance hadn’t been poor on GCN, this helps squeeze out any potential bottlenecks.

Built on the 14nm process, the Vega 56 & 64 GPUs packed 8GB of HBM2 each. While HBM2 was meant to double the bandwidth over HBM1, it missed that target by a bit, affording Vega 484GB/s in bandwidth over its 2048-bit wide bus, where the Fury cards enjoyed 512GB/s over their 4096-bit wide bus.

The Radeon VII would rectify that: released on the 7th of February 2019, the “world’s first 7nm gaming GPU” and the last GCN GPU carries 60 CUs and a whopping 16GB of HBM2 memory over a full 4096-bit wide bus, unleashing a mindboggling 1TB/s in memory bandwidth! Slightly anticlimactically though, it cost and performed nearly the same as the mighty GTX 1080 Ti, the then two-year-old Pascal flagship, which also carried a 45W lower TDP.

Overall, the Vega cards lacked appeal against Nvidia’s highly refined & efficient Pascal-based GTX 10-series which not only offered similar performance at similar price points but had also been on the shelves for a while already. The use of expensive HBM2 memory also hindered AMD’s ability to price the cards more aggressively, and these factors together sadly crippled Vega’s success as a gaming architecture.

GCN: Conclusion

While AMD did lose significant GPU market share with their GCN architecture at the helm, GCN did have its moments: the HD 7970 and R9 290X are remembered as formidable flagships that held the performance crown in their day, GCN did see AMD make inroads into the compute space, Vega gave us some great Ryzen APUs & cards like the RX 470 and Vega 56 have done well for AMD. GCN also powers the consoles of this generation, from the PlayStation 4 to the Xbox One.

Notably, GCN did well in the low-power envelopes of mid-range and embedded graphics products such as the consoles and APUs while it suffered at the high end: the R9 290X ran hot and loud, the R9 Fury cards still cause their past owners to break into random bouts of sweating and worst of all, the Vega cards consumed power equivalent to Nvidia GPUs from a tier above. Indeed, it appeared that the flagship cards worsened with respect to power draw and efficiency as GCN matured and clearly, something needed to change.

RDNA seems set to rectify all that GCN did wrong, especially in the domain of power consumption and efficiency. While it’s barely out the door and therefore too early to judge, it’s already enjoyed some large wins in the enterprise arena, spanning the cloud, supercomputer and low-power segments. Indeed, RDNA seems set to answer every enthusiast’s call for a new GPU architecture from AMD, but does it ultimately fare well and what does it change? Let’s dig in.

3. RDNA: Radeon Re-defined

RDNA arrives to make significant changes to the core of GCN, or as AMD prefers to put it more organically: to the very DNA of the Radeon architecture. While not as major a redesign as the move away from VLIW, RDNA does bring along a complete reworking of AMD’s approach to thread level parallelism on its GPUs. Accordingly, we get three major changes manifesting as updates to the compute units (CUs), the addition of a new caching layer and accompanying improvements to power efficiency.

Beginning at the heart of the matter, the fundamental SIMD cores see a major change once again, now beefing up to twice the size of GCN’s SIMD cores, doubling both the number of stream processors (SPs) & the register space:

TeraScale ushered in the 16-wide SIMD design, which GCN followed despite the move away from VLIW by simply replacing TeraScale’s 16 SPUs (each of which housed 4 or 5 SPs) with the fundamental stream processors themselves. RDNA continues shaking things up for the SIMD core.

This in turn alters the compute unit itself:

As we see, significant changes to the compute unit abound with RDNA. Most notably, the four SIMD cores of GCN are now merged into two, each twice as wide, doubling both the number of SPs and the register space. Each SIMD core now also gets its own scheduler, as opposed to GCN’s approach of deploying a single scheduler across the entire CU.

Further on we see that while GCN equips its CUs with a single scalar unit, RDNA carries two Scalar ALUs per CU, one per SIMD core. Recall that the SALU is dedicated to one-off operations which would otherwise bog down the SIMD cores, wasting their parallel compute capabilities.

Further changes alter resource sharing across CUs: RDNA pairs two CUs into a single Work Group Processor, or WGP with the scalar data & instruction cache shared across the WGP along with the Local Data Share (LDS). GCN on the other hand shared its scalar cache across four adjacent CUs while maintaining a dedicated LDS per CU.

While such merging of SIMD cores and changes to sharing across CUs may appear to be trivial changes, they’re anything but, and have far-reaching implications on how work is distributed and executed by the GPU: recall that work is distributed to the CUs in a fundamental grouping called a wavefront, with 64 pixels or datapoints in each wavefront. Ideally, these datapoints hold no inter-dependencies and await execution against the same instruction, making them ideal for parallel execution via the SIMD cores, which had been 16-wide with both GCN & TeraScale, thus processing a wavefront every 4 cycles.

RDNA changes this four-cycle execution model entirely: with the SIMD cores doubled up to 32 SPs, the wavefronts too are slashed in half to 32 elements per wavefront. The implication is obvious: a wavefront is executed every cycle resulting in a 4x speedup in wavefront execution times!

The reduction in wavefront size helps tremendously: identifying 64 independent datapoints for each wavefront can prove challenging even for highly parallelizable applications like graphics and GPU compute tasks. This challenge often necessitates additional cycles where some datapoints execute against different instructions from the others, also resulting in poor utilization. With just half the datapoints required per wavefront, this situation is greatly eased.
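Putting GCN and RDNA side by side makes the latency win clear; a small sketch follows (the 256-item workload is arbitrary):

```python
# Comparing wavefront issue latency: GCN pushes a 64-wide wavefront through a
# 16-lane SIMD (4 cycles per wavefront); RDNA pushes a 32-wide wavefront
# through a 32-lane SIMD (1 cycle per wavefront). Per-lane throughput is
# unchanged -- the win is in how quickly each wavefront starts and finishes.
def issue_stats(work_items, wave_size, simd_width):
    waves = -(-work_items // wave_size)             # ceiling division
    cycles_per_wave = -(-wave_size // simd_width)   # 4 on GCN, 1 on RDNA
    return waves, cycles_per_wave

for name, wave, width in [("GCN  (wave64, SIMD16)", 64, 16),
                          ("RDNA (wave32, SIMD32)", 32, 32)]:
    waves, latency = issue_stats(256, wave, width)
    print(f"{name}: {waves} wavefronts, {latency} cycle(s) each")
```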

So RDNA’s Compute Units can execute wavefronts four times faster than their predecessors; this necessitates keeping the CUs well fed with data, and a new caching layer is thus introduced. Notice the CU block diagram above wherein GCN’s local L1 cache is changed in nomenclature to the ‘L0’ cache. But that’s all that it is: a change in nomenclature. The real change comes at the next step: RDNA adds a new L1 layer that’s shared across 10 Compute Units (5 WGPs) which in turn interacts with the L2 cache by the memory controllers. On the other hand, the L1 cache in the CUs of GCN interacts directly with this L2 cache, with no intermediate caching layer.

Caching layers have a significant impact on performance, lowering execution times & upping efficiency further: when data isn’t found in the caching layers there’s no choice but to look for it in the VRAM, which isn’t just orders of magnitude slower than the caches closest to the cores, it’s also significantly more energy intensive.
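To get an intuitive feel for why the extra layer helps, here’s a rough sketch using the standard average-memory-access-time (AMAT) recurrence; every hit rate and latency below is invented for illustration and is not an RDNA (or GCN) specification:

```python
# Average memory access time (AMAT): each level's latency is paid by the
# fraction of accesses that miss every level above it. All hit rates and
# latencies here are made-up illustrative values.
def amat(levels):
    """levels: list of (hit_rate, latency_cycles); the last level catches everything."""
    total, reach_prob = 0.0, 1.0
    for hit_rate, latency in levels:
        total += reach_prob * latency
        reach_prob *= (1.0 - hit_rate)
    return total

no_shared_l1   = [(0.50, 4), (0.80, 60), (1.00, 350)]              # L0 -> L2 -> VRAM
with_shared_l1 = [(0.50, 4), (0.60, 20), (0.80, 60), (1.00, 350)]  # L0 -> L1 -> L2 -> VRAM

print(f"without the new L1: ~{amat(no_shared_l1):.0f} cycles per access")   # ~69
print(f"with the new L1   : ~{amat(with_shared_l1):.0f} cycles per access") # ~40
```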

All put together, that’s a 4x speedup in wavefront execution times coupled with smaller wavefronts and additional caching, improving both performance & efficiency. The results reflect this: RDNA’s mid-range 5700XT GPU performs in the same ballpark as the Radeon VII GPU, which represents the very best of GCN. Built on the same 7nm process, the R7 carries 3840 SPs and a 295W TDP. For RDNA to match that with a third fewer SPs and a roughly 24% lower TDP is very good progress indeed: the 5700XT carries 2560 SPs and a 225W TDP, with the most impressive reduction being the $400 USD price tag against the Radeon VII’s $700 USD MSRP!

One thing is clear: AMD has made significant strides with RDNA, and that’s great news for the GPU market. Hopefully, RDNA signals AMD’s long overdue return to competitiveness in the GPU space.

RDNA: Additional New Features

Two significant enhancements accompany the design reworks described above, chief among them being the shift to GDDR6 memory. First introduced on Nvidia’s RTX 2000 GPUs in 2018, GDDR6 brings along a 1.75x speedup over GDDR5, affording the 5700XT 448GB/s in memory bandwidth over the same 256-bit wide bus as the RX 580, which merely enjoys 256GB/s.
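Those bandwidth figures fall straight out of bus width and per-pin data rate; a small sketch (14 Gbps and 8 Gbps are the data rates commonly quoted for these two cards):

```python
# Peak memory bandwidth = (bus width in bits / 8 bits per byte) x per-pin data rate.
def peak_bandwidth_gb_per_s(bus_width_bits, data_rate_gbps):
    return bus_width_bits / 8 * data_rate_gbps

print(f"RX 5700 XT, GDDR6: {peak_bandwidth_gb_per_s(256, 14):.0f} GB/s")  # 448
print(f"RX 580,     GDDR5: {peak_bandwidth_gb_per_s(256, 8):.0f} GB/s")   # 256
```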

Next up is the shift to the PCIe Gen4 standard with 32GB/s in bandwidth over 16 PCIe lanes, doubling bandwidth over Gen3’s 16GB/s. Though GPUs today struggle to saturate Gen3’s bandwidth, the move to Gen4 aids with the exploding popularity of NVMe SSDs, which utilize PCIe’s much higher bandwidth to attain read/write speeds unattainable by typical drives over SATA. Since a GPU can now be adequately fed over 8 Gen4 lanes, more lanes remain available for these SSDs.

RDNA: Looking Forward

For now, the best we have with RDNA is the 5700XT and with 40 CUs, 4 Asynchronous Compute Engines and a 256-bit wide memory bus, it’s clearly a mid-range part. “Big Navi” remains due at an undisclosed date, with rumors of AMD referring to this part as the “Nvidia killer” internally. It would be nice for this rumor to be true: Team Green’s domination over the past few years has been absolute, resulting in their complete stagnation on price/performance improvements, so a shake-up is definitely overdue.

One thing is for certain: efficiency will play a major role in dictating RDNA’s success much more than raw performance alone. While the 5700XT seems to be doing okay, if Big Navi pops out sporting maxed out voltages and bursting along the clock speed limits like Vega did, then all hope is truly lost for a formidable AMD challenger.

Things with RDNA do look very promising for now, though many definitive products remain due: beyond just high-end enthusiast-class graphics cards, it’s the APUs, console SoCs and compute GPUs that RDNA spawns that’ll ultimately dictate its success. For now, I dare say it would be safe to remain cautiously optimistic!

4. Conclusion & Looking Forward

Over the course of this article, we’ve looked at AMD’s GPUs and observed the constant & significant transformation they’ve undergone over the past 13 years, from aggressively scaling up and optimizing VLIW to abandoning it entirely for a more compute-centric architectural layout. While TeraScale made a significant impact on the gaming market, GCN has made inroads into the GPU computing and heterogeneous system landscape over the past decade. Today, RDNA arrives to take over with significant changes aggressively rectifying GCN’s shortcomings in a focused effort to dominate the GPU landscape. Indeed, RDNA’s many early wins firmly lay out this path.

With just three new cards out for now, it’s clear that RDNA’s story has just begun. While AMD seems to have big plans for RDNA and are already enjoying several large wins across the spectrum, a few more generations will be necessary to effectively gauge RDNA’s impact. This sure seems like the architecture that restores AMD’s technology lead while cementing their mindshare, though. Indeed, only time (and a few more releases) will tell the complete tale.

Please find Part 1, An Overview of AMD’s GPU Architectures, here:
