<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by The Arch Bytes: From Core to Code on Medium]]></title>
        <description><![CDATA[Stories by The Arch Bytes: From Core to Code on Medium]]></description>
        <link>https://medium.com/@himanshu0525125?source=rss-9288ddc93351------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*DLkvLDFL_WdDbEn8pdlZnw.jpeg</url>
            <title>Stories by The Arch Bytes: From Core to Code on Medium</title>
            <link>https://medium.com/@himanshu0525125?source=rss-9288ddc93351------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 15 May 2026 08:39:52 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@himanshu0525125/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[A Phone Chip Inside a Laptop? The Curious Case of MacBook Neo and the A18 Pro]]></title>
            <link>https://medium.com/@himanshu0525125/a-phone-chip-inside-a-laptop-the-curious-case-of-macbook-neo-and-the-a18-pro-c318f8afecf2?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/c318f8afecf2</guid>
            <category><![CDATA[macbook-neo]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[apple-silicon]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sat, 07 Mar 2026 05:10:11 GMT</pubDate>
            <atom:updated>2026-03-07T05:10:11.347Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZDZcDDTCnIhV5cMN2k4xUw.png" /></figure><p>When Apple launched the new MacBook Neo, one detail caught the attention of many hardware enthusiasts: the laptop is powered by the Apple A18 Pro — a chip originally designed for the iPhone 16 Pro.</p><p>At first glance this might not seem unusual. Apple has been designing its own silicon for years. But historically, Apple has kept its chip families separate:</p><ul><li><strong>A-series chips</strong> power iPhones and iPads</li><li><strong>M-series chips</strong> power Macs</li></ul><p>The MacBook Neo changes that pattern. For the first time, Apple has shipped a Mac laptop using an <strong>A-series processor</strong>. This decision raises an interesting question:</p><blockquote><em>If the MacBook Neo uses the same chip as the iPhone, how similar are they really?</em></blockquote><h3>The A18 Pro in Two Very Different Devices</h3><p>Both the MacBook Neo and the iPhone 16 Pro use variants of the <strong>A18 Pro SoC</strong>. At a high level, the architecture is the same.</p><p>Typical configuration includes:</p><ul><li><strong>6-core CPU</strong></li><li>2 performance cores</li><li>4 efficiency cores</li><li><strong>Apple GPU</strong></li><li><strong>16-core Neural Engine</strong></li><li>Built on <strong>TSMC’s 3-nm process</strong></li></ul><p>From a microarchitecture perspective, the CPU complex and neural engine are largely identical. However, one specification stands out when comparing the two devices. A18 Pro in iPhone has 6 cores, while Macbook Neo chip has 5 cores.</p><p>At first glance, this seems counterintuitive. One would expect a laptop to have <strong>equal or greater GPU resources</strong> than a smartphone.</p><p>So why does the MacBook Neo ship with fewer GPU cores?</p><h3>Understanding GPU Core Differences</h3><p>There are several reasons this configuration makes sense from a chip design and product strategy perspective.</p><h4>1. Silicon Binning</h4><p>Modern chips are manufactured in massive batches, and not every die comes out perfect. Some chips may have a small defect in one GPU core or may not meet the frequency target for that unit.</p><p>Instead of discarding the entire chip, manufacturers disable the faulty core and sell the chip with a reduced configuration.</p><p>This process is known as <strong>silicon binning</strong>.</p><p>A simplified example:</p><pre>Fully functional die  → 6 GPU cores → iPhone 16 Pro<br>Minor defect die      → 5 GPU cores → MacBook Neo</pre><p>This allows Apple to <strong>improve manufacturing yield</strong> while still using nearly every produced chip.</p><h4>2. Product Segmentation</h4><p>Another factor is product positioning.</p><p>Apple’s Mac lineup already includes laptops powered by <strong>M-series chips</strong>, such as the Apple M3 and Apple M4. These chips offer significantly larger GPUs and higher memory bandwidth.</p><p>If the MacBook Neo shipped with the full GPU configuration of the A18 Pro, it could start to overlap with higher-end Macs in certain workloads.</p><p>Reducing the GPU core count helps keep the product stack clean:</p><pre>MacBook Neo  → entry-level Mac<br>MacBook Air  → mainstream laptop<br>MacBook Pro  → high performance</pre><h4>3. Target Workloads</h4><p>The MacBook Neo is designed as an <strong>entry-level laptop</strong>. 
Typical workloads include:</p><ul><li>web browsing</li><li>document editing</li><li>light programming</li><li>media playback</li></ul><p>These tasks rarely saturate the GPU. In many cases, <strong>CPU performance and battery life matter far more</strong>.</p><p>Disabling one GPU core has minimal impact on these workloads but can improve chip availability and cost efficiency.</p><h4>The Bigger Story: Mobile Chips Are Now Laptop-Class</h4><p>While the GPU difference is interesting, the bigger takeaway is something else entirely.</p><p>A chip originally designed for a <strong>smartphone thermal envelope</strong> is now powerful enough to run <strong>macOS on a full laptop</strong>.</p><p>This highlights how far mobile SoCs have evolved.</p><p>A decade ago:</p><ul><li>laptop CPUs consumed <strong>15–45 W</strong></li><li>smartphone chips consumed <strong>3–5 W</strong></li></ul><p>Today, modern mobile silicon is efficient enough that the same architecture can scale across multiple device classes.</p><p>Apple’s silicon strategy now looks something like this:</p><pre>A-series   → phones<br>A-series   → entry-level laptops<br>M-series   → mainstream Macs<br>M Pro/Max  → high-performance systems</pre><p>The MacBook Neo is an example of how these boundaries are starting to blur.</p><h4>Final Thoughts</h4><p>At first glance, the MacBook Neo having <strong>fewer GPU cores than the iPhone</strong> seems odd. But when we consider manufacturing yield, product segmentation, and real-world workloads, the decision makes sense.</p><p>In many ways, the real story isn’t about GPU cores at all.</p><p>It’s about the fact that a <strong>phone-class processor is now capable of powering a full laptop</strong>.</p><p>And that says a lot about the trajectory of modern computer architecture.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c318f8afecf2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How I Simulated a GPU Scheduler: A Deep Dive into microGPU]]></title>
            <link>https://medium.com/@himanshu0525125/how-i-simulated-a-gpu-scheduler-a-deep-dive-into-microgpu-7fc9e503fa06?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/7fc9e503fa06</guid>
            <category><![CDATA[cpp-programming]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[gpu-architecture]]></category>
            <category><![CDATA[gpu]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sat, 21 Feb 2026 01:10:53 GMT</pubDate>
            <atom:updated>2026-02-21T01:10:53.907Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4CvPVxqQJQ-H8BhGcbmSGA.png" /></figure><p>In my last post, we went deep into the <a href="https://medium.com/@himanshu0525125/an-architectural-look-at-gpu-compute-units-b7a20d8124dd">architectural blueprint of GPU Compute Units</a>, discussing how they manage massive parallelism in theory. But as any architect knows, there is a massive gulf between understanding a block diagram and seeing those cycles actually move. To bridge that gap, I decided to build it. I’ve spent the last few weeks developing <a href="https://github.com/himanshu5-prog/microGPU">microGPU</a>, a functional C++ simulator designed to demystify the hardware-software contract. It’s one thing to read about ‘Warps’ and ‘Round-Robin Scheduling’; it’s another thing entirely to watch a scheduler dispatch threads across virtual silicon in real-time. In this post, I’m breaking down how I modelled the execution pipeline and sharing the repository so you can compile, trace, and even break your own GPU model.</p><p>Classes defined in codebase:</p><h4>Thread</h4><p>To simulate a GPU, we must first define the smallest unit of execution: the <strong>Thread</strong>. In microGPU, <a href="https://github.com/himanshu5-prog/microGPU/edit/main/src/thread/thread.hh">Thread class</a> acts as a container for the architectural state of a single lane of execution.</p><p>Instead of a complex, bloated object, I kept the thread model lean to ensure the simulation stays performant:</p><ul><li><strong>State Management:</strong> Each thread exists in a ThreadState—either <strong>ACTIVE</strong> or <strong>INACTIVE</strong>. This is crucial for simulating &quot;predication&quot; or &quot;branch divergence,&quot; where some threads in a warp might be disabled during execution.</li><li><strong>The Register File:</strong> Each thread is allocated a private RegisterFile. In this implementation, I’ve defined THREAD_REGISTER_COUNT as <strong>64 registers</strong> per thread using a std::array&lt;int, 64&gt;. This provides a fixed-size, fast-access memory space for computational operands.</li><li><strong>Identification:</strong> Every thread carries a unique id, allowing the Global Scheduler and Warp units to track work distribution across the entire virtual chip.</li></ul><h4>Warp</h4><p>If the thread is the atomic unit, the <strong>Warp</strong> is the management unit. In <a href="https://github.com/himanshu5-prog/microGPU">microGPU</a>, a Warp groups <strong>32 threads</strong> together to execute in lockstep — a fundamental concept in GPU architecture known as SIMT.</p><p>The <a href="https://github.com/himanshu5-prog/microGPU/blob/main/src/warp/warp.hh">Warp class</a> is responsible for maintaining the shared state that these threads rely on:</p><ul><li><strong>Shared Program Counter (PC):</strong> Unlike a CPU where every thread has its own PC, all threads in this Warp share a single PC. They move through the code together, one instruction at a time.</li><li><strong>The Active Mask:</strong> I implemented the ActiveMask using a std::bitset&lt;32&gt;. This is critical for handling <strong>branch divergence</strong>. If an if/else statement causes half the threads to take one path, the mask simply &quot;turns off&quot; the inactive threads during that cycle.</li><li><strong>Reconvergence Stack:</strong> To handle complex control flows, I included a reconvergenceStack. 
This allows the warp to remember where divergent paths should meet back up, ensuring the threads stay synchronized after a conditional block finishes.</li><li><strong>Pipeline Tracking:</strong> Each warp tracks its own PipelineStage (from STAGE_0 to DONE) and WarpState (READY, RUNNING, or STALLED). This allows the Compute Unit to easily identify which warps are waiting for data and which are ready to execute.</li></ul><h4>Compute Unit</h4><p>The <a href="https://github.com/himanshu5-prog/microGPU/blob/main/src/computeUnit/computeUnit.hh">ComputeUnit class</a> manages the complexity of cycle-accurate simulation through several key mechanisms:</p><ul><li><strong>Warp Collection:</strong> Each CU maintains its own internal pool of warps. This mimics real hardware where a specific number of warps are &quot;resident&quot; on a shader core or streaming multiprocessor.</li><li><strong>The Round-Robin Scheduler:</strong> To keep the execution fair and prevent any single warp from hogging resources, I implemented a calculateNextWarpId() method. It follows a simple yet effective Round-Robin strategy, rotating through ready warps every cycle.</li><li><strong>The Pipeline State Machine:</strong> A major feature of this class is the 5-stage pipeline simulation. Each warp progresses through:</li><li>STAGE_0 to STAGE_3 (Execution &amp; Latency)</li><li>DONE (Retirement)</li><li><strong>Cycle-Accurate Tracking:</strong> Using the incrementCycle() method, the CU tracks the precise passage of time. This allows us to measure the performance and throughput of the simulated kernels.</li></ul><h4>GPU</h4><p>The <a href="https://github.com/himanshu5-prog/microGPU/blob/main/src/microGPU/ugpu.hh">MicroGPU class</a> is the entry point of the entire simulation. It represents the physical GPU chip, managing a collection of <strong>16 Compute Units</strong> (defined by CU_COUNT) and a global pool of work. While the CUs handle the heavy lifting of execution, the MicroGPU acts as the hardware&#39;s controller and dispatcher.</p><p>Key responsibilities defined in this top-level class include:</p><ul><li><strong>Global Warp Collection:</strong> Before execution begins, all work is stored in the globalWarpCollection. This acts as the &quot;Global Scheduler&#39;s&quot; queue, holding all warps that are waiting to be dispatched to an available Compute Unit.</li><li><strong>The Global Scheduler:</strong> I implemented a performWarpScheduling() method that handles the distribution of work. In the current iteration, I&#39;ve also included a performWarpSchedulingSimple() method—a testing-focused scheduler that assigns alternating warps to specific CUs to verify that the hand-off between global logic and local execution is seamless.</li><li><strong>The System Heartbeat:</strong> The executeGPU() method is the main loop of the simulation. It drives the currentCycle forward, calling executeComputeUnits() on every tick until the allWarpsCompleted() check returns true.</li><li><strong>Verification &amp; Testing:</strong> To ensure the hardware model actually works, the class includes createGlobalWarpCollectionTest(). This populates the GPU with test warps containing simple instructions (like ADD), allowing for a full &quot;dry run&quot; of the pipeline from dispatch to retirement.</li></ul>
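<p>To make that hand-off concrete, here is a rough C++ sketch of the heartbeat and the round-robin pick described above. It is paraphrased from this post, not copied from the repository; the method names follow the write-up, but the exact signatures, members, and the getState() accessor on Warp are assumptions.</p><pre>// Illustrative sketch only: paraphrased from the description above,<br>// not lifted from the microGPU repository.<br>void MicroGPU::executeGPU() {<br>    // Distribute the global warp pool across the Compute Units up front.<br>    performWarpScheduling();<br><br>    // The &quot;heartbeat&quot;: tick every CU once per cycle until all warps retire.<br>    while (!allWarpsCompleted()) {<br>        executeComputeUnits();   // each CU advances one warp through its pipeline<br>        currentCycle++;          // advance simulated time<br>    }<br>}<br><br>// Round-robin selection inside a Compute Unit: start just after the warp<br>// issued last cycle and take the first READY warp found.<br>void ComputeUnit::calculateNextWarpId() {<br>    for (size_t i = 1; i &lt;= warps.size(); ++i) {<br>        size_t candidate = (currentWarpId + i) % warps.size();<br>        if (warps[candidate].getState() == WarpState::READY) {<br>            currentWarpId = candidate;<br>            return;<br>        }<br>    }<br>    // No READY warp this cycle: every resident warp is stalled or done.<br>}</pre>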
<h4>Explore the Code</h4><p>If you want to dive deeper into the implementation or contribute to the project:</p><ul><li><strong>Source Code:</strong> <a href="https://github.com/himanshu5-prog/microGPU">Check out the microGPU Repository on GitHub</a></li><li><strong>Technical Docs:</strong> <a href="https://himanshu5-prog.github.io/microGPU/">Full API Reference &amp; Documentation</a></li></ul><p><em>I’ll be adding more features like memory hierarchy simulation in the future — feel free to star the repo to follow the progress!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7fc9e503fa06" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Architectural Look at GPU Compute Units]]></title>
            <link>https://medium.com/@himanshu0525125/an-architectural-look-at-gpu-compute-units-b7a20d8124dd?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/b7a20d8124dd</guid>
            <category><![CDATA[cpp-programming]]></category>
            <category><![CDATA[oops-concepts]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[gpu]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Thu, 12 Feb 2026 05:34:59 GMT</pubDate>
            <atom:updated>2026-02-12T05:34:59.133Z</atom:updated>
            <content:encoded><![CDATA[<p>A common belief about GPUs is that they are fast because they contain thousands of cores.</p><p>While that is partially true, it misses the real story.</p><p>GPUs are fast because they are designed to <strong>stay busy</strong>, even when individual operations take hundreds of cycles to complete. At the center of this design sits one of the most important building blocks of modern GPUs:</p><h3>Compute Unit</h3><p>Think of the compute unit as a miniature processor — an independent execution engine capable of scheduling, managing, and executing groups of threads with remarkable efficiency.</p><p>Understanding this component is key to understanding why GPUs behave so differently from CPUs.</p><p>Before diving into compute units, let’s establish an important contrast.</p><blockquote>A CPU is optimized to <strong>minimize latency</strong>. When a program requests data from memory, the CPU deploys large caches, sophisticated branch predictors, and out-of-order execution to reduce waiting time.</blockquote><blockquote>A GPU takes a very different approach.</blockquote><blockquote>Instead of trying to make memory faster, GPUs assume memory <em>will</em> be slow — often <strong>400–800 cycles</strong> for global memory accesses.</blockquote><blockquote>GPUs simply switch to another set of threads that are ready to run.</blockquote><p>And the hardware responsible for orchestrating this constant motion is the compute unit.</p><p>A <strong>compute unit</strong> is an independent hardware block inside the GPU that fetches instructions, schedules work, and executes groups of threads known as <strong>warps</strong> (or wavefronts in AMD terminology).</p><p>Each compute unit contains everything needed to keep execution flowing:</p><ul><li><strong>Warp schedulers</strong></li></ul><p>The scheduler continuously searches for a warp that is ready to execute.</p><p>Every cycle, it asks:</p><blockquote><em>“Which warp can make progress right now?”</em></blockquote><p>If one warp stalls on memory, the scheduler immediately pivots to another.</p><p>This ability to rapidly switch work is what allows GPUs to tolerate massive latency without slowing down.</p><ul><li><strong>Execution pipelines</strong></li></ul><p>Once a warp is selected, its instruction flows into execution pipelines — arithmetic units, load/store units, and specialized math hardware.</p><p>But here is something beginners often misunderstand:</p><p>More pipelines do <strong>not</strong> automatically mean higher performance.</p><p>Performance depends on whether the scheduler can keep those pipelines fed with ready work.</p><p>If all warps are stalled, even the widest machine goes idle.</p><ul><li><strong>A large register file</strong></li></ul><p>One reason compute units can switch between warps so quickly is that each warp’s state lives in a massive on-chip register file.</p><p>Unlike CPUs, there is no expensive context switch.</p><p>No saving to memory.<br> No restoring state.</p><p>The hardware simply selects a different register bank and continues execution.</p><p>This is one of the quiet design choices that enables GPU efficiency.</p><ul><li><strong>On-chip shared memory</strong></li><li><strong>Control logic</strong></li></ul><p>You can think of it as a highly specialized throughput machine whose primary goal is simple: Always have something to execute. 
DO NOT remain idle.</p><h3>Everything sounds good so far, but what happens inside a compute unit every cycle?</h3><p>Let’s zoom into a single cycle inside a compute unit.</p><p>A simplified flow looks like this:</p><p><strong>Cycle N:</strong></p><ol><li>The scheduler selects a READY warp</li><li>An instruction is issued</li><li>Some warps may stall (for example, waiting on memory)</li><li>The scheduler searches for another runnable warp</li></ol><p>And the loop repeats.</p><h3><strong>Compute Unit class definition</strong></h3><p>I have created a simple Compute Unit class in C++. I have added comments to explain the purpose of class methods and variables.</p><pre>#ifndef COMPUTEUNIT_HH<br>#define COMPUTEUNIT_HH<br><br>#include&lt;iostream&gt;<br>#include&lt;string&gt;<br>#include&lt;vector&gt;<br>#include&lt;array&gt;<br>#include&lt;bitset&gt;<br>#include&lt;cassert&gt;<br><br>#include &quot;../warp/warp.hh&quot;<br><br>enum SMState {<br>    IDLE,<br>    BUSY,<br>    ERROR<br>};<br><br><br>class ComputeUnit {<br>    std::vector&lt;Warp&gt; warps;<br><br>    // Each compute unit has its own ID<br>    int smId;<br><br>    // Current warp ID being executed<br>    size_t currentWarpId;<br><br>    // Current cycle<br>    int currentCycle;<br><br>    // State of the compute unit<br>    SMState state;<br><br>    public:<br>        ComputeUnit() : currentWarpId(0), currentCycle(0), state(SMState::IDLE) {}<br>       <br>        // Setter methods<br>        void setState(SMState newState);<br>        void setCurrentWarpId(int warpId);<br>        void setWarp(const Warp &amp;warp);<br>        void setSmId(int id) { smId = id; }<br><br>        // Increment cycle count for the compute unit<br>        void incrementCycle() { currentCycle++; }<br><br>        // Execute the current warp and advance its pipeline stage<br>        void execute(); <br><br>        // Getter methods<br>        int getCurrentWarpId();<br>        SMState getState() const;<br>        int getWarpCollectionSize() const;<br>        int getCurrentCycle() const { return currentCycle; }<br>        int getSmId() const { return smId; }<br><br>        // Print method for debugging<br>        void printId() const { std::cout &lt;&lt; &quot;(ComputeUnit) ComputeUnit ID: &quot; &lt;&lt; smId &lt;&lt; std::endl; }<br><br>        // Method to calculate the next warp ID to execute based on round-robin scheduling<br>        void calculateNextWarpId();<br>        <br>};<br><br>#endif // COMPUTEUNIT_HH</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b7a20d8124dd" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Threads to Warps: How GPUs Actually Execute Code]]></title>
            <link>https://medium.com/@himanshu0525125/from-threads-to-warps-how-gpus-actually-execute-code-4d924eae1ad5?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/4d924eae1ad5</guid>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[cpp]]></category>
            <category><![CDATA[gpu-architecture]]></category>
            <category><![CDATA[gpu]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sun, 25 Jan 2026 06:29:58 GMT</pubDate>
            <atom:updated>2026-01-25T08:48:55.305Z</atom:updated>
            <content:encoded><![CDATA[<p>In the previous post, we looked at <strong>threads</strong> as the fundamental programming abstraction in GPU programming. Threads are how <em>we</em> think about parallelism: each thread has its own registers, its own thread ID, and its own piece of work.</p><p>However, threads are <strong>not</strong> the unit of execution in GPU hardware.</p><p>To understand performance, control flow, and memory behavior on a GPU, we need to introduce the concept that sits one level below threads:</p><blockquote><strong>The warp</strong> — the hardware execution unit of the GPU.</blockquote><h3>What Is a Warp?</h3><p>A <strong>warp</strong> is a fixed-size group of threads that are <strong>executed together in lockstep</strong> by the GPU.</p><p>On <strong>NVIDIA GPUs</strong>, a warp consists of <strong>32 threads</strong>.</p><p>All threads in a warp:</p><ul><li>Share a <strong>single program counter</strong></li><li>Execute the <strong>same instruction at the same time</strong></li><li>Operate on <strong>different data</strong></li></ul><p>This execution model is known as <strong>SIMT (Single Instruction, Multiple Threads)</strong>.</p><h3>Why Do GPUs Use Warps?</h3><p>GPUs are designed for <strong>throughput</strong>, not single-thread latency. A modern GPU may need to manage <strong>tens of thousands of active threads</strong>. Tracking a separate instruction stream for each thread would be prohibitively expensive in hardware.</p><p>Instead, the GPU:</p><ol><li>Groups threads into warps</li><li>Shares control logic across the group</li><li>Executes them together</li></ol><p>This design dramatically reduces hardware complexity while still exposing massive parallelism to the programmer.</p><h3>Warp Execution: Lockstep in Practice</h3><p>Consider the following code:</p><pre>int tid = threadIdx.x;<br>A[tid] = B[tid] + C[tid];</pre><p>From the programmer’s perspective:</p><ul><li>Each thread computes its own tid</li><li>Each thread updates a different array element</li></ul><p>From the hardware’s perspective:</p><ul><li>One instruction is issued</li><li>32 threads execute it simultaneously</li><li>Each thread uses its own registers and memory addresses</li></ul><p>Same instruction. Same cycle. Different data.</p><h3>Warp Divergence: When Threads Disagree</h3><p>The lockstep nature of warps becomes visible when control flow differs between threads.</p><pre>if (tid % 2 == 0)<br>  A[tid] = 1;<br>else<br>  A[tid] = 2;</pre><p>Within a single warp:</p><ul><li>Some threads take the if path</li><li>Others take the else path</li></ul><p>The GPU handles this by:</p><ol><li>Executing the if path while masking inactive threads</li><li>Executing the else path while masking the other threads</li><li>Reconverging the warp</li></ol><p>Both paths execute <strong>serially</strong>.</p><p>This phenomenon is called <strong>warp divergence</strong>.</p><blockquote>Bottom line: Divergence does not break correctness, but it does reduce performance.</blockquote>
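<p>To make the masking mechanics concrete, here is a small, purely illustrative C++ sketch (plain host code, not GPU code) that walks one warp through both sides of the branch above. The 32-lane width and the two-pass execution mirror the steps just listed; everything else is an assumption made for illustration.</p><pre>#include &lt;array&gt;<br>#include &lt;bitset&gt;<br>#include &lt;cstdio&gt;<br><br>constexpr int WARP_SIZE = 32;   // warp width used throughout this post<br><br>int main() {<br>    std::array&lt;int, WARP_SIZE&gt; A{};            // one element per lane<br>    std::bitset&lt;WARP_SIZE&gt; ifMask, elseMask;<br><br>    // Masks for: if (tid % 2 == 0) A[tid] = 1; else A[tid] = 2;<br>    for (int tid = 0; tid &lt; WARP_SIZE; ++tid) ifMask[tid] = (tid % 2 == 0);<br>    elseMask = ~ifMask;<br><br>    // Pass 1: issue the &quot;if&quot; path; lanes outside the mask are switched off.<br>    for (int tid = 0; tid &lt; WARP_SIZE; ++tid)<br>        if (ifMask[tid]) A[tid] = 1;<br><br>    // Pass 2: issue the &quot;else&quot; path with the complementary mask.<br>    for (int tid = 0; tid &lt; WARP_SIZE; ++tid)<br>        if (elseMask[tid]) A[tid] = 2;<br><br>    // Reconvergence: the full mask is restored and lockstep execution resumes.<br>    std::printf(&quot;A[0]=%d A[1]=%d\n&quot;, A[0], A[1]);   // prints A[0]=1 A[1]=2<br>    return 0;<br>}</pre><p>Both passes run one after the other, which is exactly why divergent warps pay a serialization penalty.</p>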
<h3>Warps and Scheduling</h3><p>Each SM maintains a pool of <strong>resident warps</strong>. In every cycle, a <strong>warp scheduler</strong> selects a ready warp and issues its next instruction.</p><p>Key properties:</p><ul><li>Warp context switching is essentially <strong>free</strong></li><li>When one warp stalls (e.g., waiting on memory), another warp is scheduled</li><li>Latency is hidden through <strong>warp-level multithreading</strong></li></ul><p>This is why GPUs rely less on large caches and more on massive parallelism.</p><h3>Why Warps Matter for Performance</h3><p>Understanding warps explains many GPU performance behaviors:</p><ul><li>Branch-heavy code performs poorly due to divergence</li><li>Memory accesses should be aligned across a warp</li><li>Occupancy is measured in <strong>warps per SM</strong>, not threads</li><li>More threads do not automatically mean more performance</li></ul><p>Efficient GPU code:</p><ul><li>Minimises warp divergence</li><li>Encourages uniform control flow within a warp</li><li>Keeps many warps ready to run</li></ul><h3><strong>Conceptual Warp model (C++)</strong></h3><p>Where possible, I like to explain things with code, since it keeps the information readable and concise. I have created a C++ class for Warp with basic functionality.</p><pre>#ifndef SRC_WARP_WARP_HH_<br>#define SRC_WARP_WARP_HH_<br><br>#include&lt;iostream&gt;<br>#include&lt;string&gt;<br>#include&lt;vector&gt;<br>#include&lt;array&gt;<br>#include&lt;bitset&gt;<br>#include &quot;../thread/thread.hh&quot;<br><br>// Define the number of threads in a warp<br>#define WARP_THREAD_COUNT 32<br><br>// Type alias for a group of threads in a warp<br>using ThreadGroup = std::array&lt;Thread*, WARP_THREAD_COUNT&gt;;<br><br>// Type alias for the active mask of threads in a warp<br>using ActiveMask = std::bitset&lt;WARP_THREAD_COUNT&gt;;<br><br>// Instruction types enumeration<br>enum InstructionType {<br>    ADD,<br>    SUB,<br>    LOAD,<br>    STORE,<br>    BRANCH<br>};<br><br>//Instruction structure<br>struct Instruction {<br>    InstructionType type;<br>    int dest;<br>    int src1;<br>    int src2;<br><br>    Instruction(InstructionType t, int d, int s1, int s2)<br>        : type(t), dest(d), src1(s1), src2(s2) {}<br>    Instruction() : type(ADD), dest(0), src1(0), src2(0) {}<br><br>};<br><br>// Reconvergence point structure<br>struct reconvergencePoint {<br>    int pc;<br>    ActiveMask mask;<br><br>    reconvergencePoint(int pc_, const ActiveMask&amp; mask_)<br>        : pc(pc_), mask(mask_) {}<br>    reconvergencePoint() : pc(0), mask() {}<br>};<br><br>// Warp state enumeration<br>enum WarpState {<br>    READY,<br>    RUNNING,<br>    STALLED<br>};<br><br>class Warp {<br>    int id;<br>    int pc;<br>    ThreadGroup threads;<br>    ActiveMask activeMask;<br>    Instruction currentInstruction;<br>    std::vector&lt;reconvergencePoint&gt; reconvergenceStack;<br>    WarpState state;<br><br>    public:<br>    Warp();<br>    Warp(int warpId, const ThreadGroup&amp; threadGroup, WarpState warpState = WarpState::READY);<br><br>     // Getter and Setter methods<br>    int getId() const;<br>    int getPc() const;<br>    void setPc(int pc_);<br>    void setCurrentInstruction(const Instruction&amp; instr);<br>    Instruction getCurrentInstruction() const;<br><br>    const ActiveMask&amp; getActiveMask() const;<br><br>};<br><br><br>#endif  // SRC_WARP_WARP_HH_</pre>
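<p>For readers who want to poke at the model, here is a hypothetical snippet showing how the header above could be driven. The .cc implementations are not shown in this post, so the exact constructor and getter behaviour is an assumption; only the declared interface is taken from the header.</p><pre>// Hypothetical usage of the Thread/Warp headers above (include path and behaviour assumed).<br>#include &quot;warp.hh&quot;<br><br>int main() {<br>    // Build 32 threads and group them into a single warp.<br>    std::array&lt;Thread, WARP_THREAD_COUNT&gt; pool;<br>    ThreadGroup group{};<br>    for (int i = 0; i &lt; WARP_THREAD_COUNT; ++i) {<br>        pool[i] = Thread(i, ThreadState::ACTIVE);<br>        group[i] = &amp;pool[i];<br>    }<br><br>    Warp warp(0, group);                    // warp 0, defaults to READY<br>    warp.setPc(0);                          // shared program counter for all 32 lanes<br>    warp.setCurrentInstruction(Instruction(ADD, 2, 0, 1));   // assumed meaning: r2 = r0 + r1, per lane<br><br>    // One issue slot: every active lane would execute this same ADD in lockstep.<br>    std::cout &lt;&lt; &quot;Warp &quot; &lt;&lt; warp.getId()<br>              &lt;&lt; &quot; issuing instruction at PC &quot; &lt;&lt; warp.getPc() &lt;&lt; std::endl;<br>    return 0;<br>}</pre>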
<p><strong>Note</strong>: This is a <em>conceptual</em> model intended to make the warp abstraction concrete. It is <strong>not</strong> a cycle-accurate GPU simulator. The goal is to expose the shared program counter, active mask, and lockstep execution semantics.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4d924eae1ad5" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Threads in GPUs: The Smallest Units That Drive Massive Parallelism]]></title>
            <link>https://medium.com/@himanshu0525125/threads-in-gpus-the-smallest-units-that-drive-massive-parallelism-2359cb271336?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/2359cb271336</guid>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[cpp-programming]]></category>
            <category><![CDATA[gpu-computing]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 02:34:09 GMT</pubDate>
            <atom:updated>2026-01-22T02:34:09.274Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9dg2oZELWTNmGCcY4Lkv5Q.png" /></figure><p>When people talk about GPUs, they often mention <em>thousands of cores</em>, <em>massive parallelism</em>, or <em>SIMT execution</em>. But at the heart of all of this is a much smaller abstraction: GPU thread.</p><p>Understanding GPU threads — how they are created, scheduled, grouped, and executed — is foundational to writing fast GPU programs and to understanding modern accelerator architecture. This post breaks down what GPU threads really are, how they differ from CPU threads, and how they fit into the larger execution model.</p><h3><strong>Why GPUs care about thread?</strong></h3><p>GPUs are designed for <strong>throughput</strong>, not latency.</p><p>While CPUs optimize for:</p><ul><li>Fast single-thread execution</li><li>Sophisticated control flow</li><li>Large caches</li></ul><p>GPUs optimise for:</p><ul><li>Running <strong>many threads at once</strong></li><li>Hiding memory latency with execution</li><li>Simple control logic replicated at scale</li></ul><p>The result: a GPU may run <strong>tens of thousands of threads concurrently</strong>, each doing a small piece of work.</p><h3><strong>But what is a GPU thread?</strong></h3><p>A <strong>GPU thread</strong> is the <strong>smallest unit of execution</strong> in a GPU program.</p><p>Each thread:</p><ul><li>Executes the same kernel code</li><li>Has its own registers and local variables</li><li>Has a unique thread ID</li><li>Works on a different portion of data</li></ul><p>If you’ve used <strong>NVIDIA CUDA</strong>, this is the threadIdx.x you’re familiar with.</p><blockquote><em>Conceptually:<br> </em><strong><em>One thread = one data element (or a few elements)</em></strong></blockquote><p>This “one thread per element” mindset is key to GPU programming.</p><p>A GPU expects many threads to stall on memory while others continue executing. This is how GPUs hide memory latency without large caches or speculation.</p><p>Of course, threads are grouped before being scheduled to a compute unit in the GPU, and this scheduling plays an important role in determining the speedup.</p><p>Some workloads (e.g., graphics) can be divided into sub-problems that can be handled in parallel with minimal dependencies. 
<h3><strong>Implementing the Thread class in C++</strong></h3><p>To make the idea of a GPU thread concrete, let’s model a thread using a simple C++ class.</p><p>Each GPU thread:</p><ul><li>Has a unique ID</li><li>Owns a private register file</li><li>Can be active or inactive depending on the control flow</li></ul><p>Below is one possible implementation of a thread for anyone who wants to write a functional model of a GPU:</p><pre>#ifndef SRC_THREAD_THREAD_HH_<br>#define SRC_THREAD_THREAD_HH_<br><br>#include &lt;iostream&gt;<br>#include &lt;string&gt;<br>#include &lt;vector&gt;<br>#include &lt;cassert&gt;<br>#include &lt;array&gt;<br>// Define the number of registers available to each thread<br>#define THREAD_REGISTER_COUNT 64<br><br>using RegisterFile = std::array&lt;int, THREAD_REGISTER_COUNT&gt;;<br><br>enum ThreadState {<br>    ACTIVE, // Thread is active and can execute instructions<br>    INACTIVE // Thread is inactive and should not execute instructions<br>};<br><br>class Thread {<br>    int id;<br>    ThreadState state;<br>    RegisterFile registers;<br><br>public:<br><br>    Thread();<br>    Thread(int threadId, ThreadState threadState);<br><br>    // Getter methods<br>    int getId() const;<br>    ThreadState getState() const;<br>    int getRegisterValue(int index) const;<br><br>    // Setter methods<br>    void setId(int threadId);<br>    void setState(ThreadState threadState);<br>    void setRegisters(const RegisterFile&amp; regs);<br>    void setRegisterValue(int index, int value);<br>};<br><br><br>#endif  // SRC_THREAD_THREAD_HH_</pre><p>Note that this is just a functional model and might not be hardware-accurate.</p><p>I will be writing more blog posts on the programming model and architecture of GPUs, and will add to the code I shared here.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2359cb271336" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Want to learn Grad-level Computer Architecture? This GitHub Repo is a game changer]]></title>
            <link>https://medium.com/@himanshu0525125/want-to-learn-grad-level-computer-architecture-this-github-repo-is-a-game-changer-a2f81bfb1f8d?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/a2f81bfb1f8d</guid>
            <category><![CDATA[cpu]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[simulator]]></category>
            <category><![CDATA[top-github-repository]]></category>
            <category><![CDATA[cache]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Wed, 26 Nov 2025 19:40:49 GMT</pubDate>
            <atom:updated>2025-11-26T19:40:49.208Z</atom:updated>
            <content:encoded><![CDATA[<p>Most people learn computer architecture from textbooks — Hennessy &amp; Patterson, the RISC-V Reader, or an ISA spec — plus some experimentation with microcontrollers. A few explore heavyweight simulators like gem5 or ChampSim.</p><p>But there’s a missing middle ground:<br> A <strong>simple, clean, step-by-step simulator that teaches microarchitecture through coding</strong>, not through 400-page PDFs.</p><p>Recently, I found a GitHub repository that fills this gap perfectly:</p><p><a href="https://github.com/fabwu/eth-computer-architecture"><strong>https://github.com/fabwu/eth-computer-architecture</strong></a></p><p>This open-source project was originally part of ETH Zurich’s Computer Architecture coursework. It implements a small, pipelined CPU with cache support — and more importantly, it’s organised into <strong>four lab assignments</strong> that walk you through building or extending parts of the architecture.</p><p>For anyone learning computer architecture or anyone considering building their own simulator, this is an incredible resource.</p><h3><strong>Why is this repo different?</strong></h3><p>There are thousands of architecture repos on GitHub, but most are either:</p><ul><li>incomplete classroom exercises,</li><li>extremely complicated research simulators,</li><li>or undocumented student projects.</li></ul><p>This ETH repo is the opposite.</p><h3>✅ Clear structure</h3><p>The project is broken into multiple labs, each focusing on a single aspect of microarchitecture. You follow them in sequence, and each builds on the last.</p><h3>✅ Readable and hackable codebase</h3><p>The simulator is small enough to understand but realistic enough to behave like a real pipeline. Perfect balance.</p><h3>✅ High-quality academic design</h3><p>ETH is known for clean, well-designed architecture coursework. This repo reflects that quality: good documentation, a modular simulator, and meaningful lab goals.</p><h3>✅ Practical learning, not theoretical</h3><p>Instead of reading about forwarding or cache hits, you <em>implement</em> them and see the effect immediately.</p><p>This is the kind of resource I wish I had when I started learning microarchitecture.</p><p>Take some time to explore the repository, and make sure to read each lab assignment description carefully. For each lab, the repository also includes research papers that can be implemented in the simulator, which I find amazing.</p><p>This project comes from the Computer Architecture course at ETH Zurich, taught by <strong>Prof. Onur Mutlu</strong> — a globally recognised researcher whose work spans memory systems, DRAM architecture, processing-in-memory, and microarchitecture design. His lectures and course material are regularly cited and used by universities around the world. The design quality of this simulator reflects the same clarity and rigour that his courses are known for.</p><p>I hope the repository proves useful to people looking for hands-on projects in Computer Architecture.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a2f81bfb1f8d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Memory Coalescing in GPU]]></title>
            <link>https://medium.com/@himanshu0525125/memory-coalescing-in-gpu-23f222b26ca2?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/23f222b26ca2</guid>
            <category><![CDATA[memory-subsystem]]></category>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[gpu-architecture]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sat, 22 Nov 2025 06:13:11 GMT</pubDate>
            <atom:updated>2025-11-22T06:13:11.506Z</atom:updated>
            <content:encoded><![CDATA[<p>Modern GPUs rely on enormous memory bandwidth to keep thousands of threads busy. But raw bandwidth alone isn’t enough — the way threads access memory determines whether a kernel is fast or painfully slow.<br> This is where memory coalescing becomes one of the most important performance concepts in GPU programming.</p><p>Memory coalescing is the process by which a GPU hardware unit (usually the memory subsystem + L1 coalescer) merges multiple memory requests from threads in the same warp into as few DRAM transactions as possible.</p><ul><li>A warp = 32 threads, all executing the same instruction in lockstep (SIMT).</li><li>If those 32 threads read or write data stored in consecutive, properly aligned addresses, the GPU can combine their memory requests into one large memory transaction.</li><li>If they access scattered or misaligned positions, the hardware performs multiple transactions → slower.</li></ul><p>If memory coalescing is successful, a handful of transactions can service every request from the warp, and all threads get their data quickly without stalling.</p><p>GPUs access memory in fixed-sized segments (typically 32-, 64-, or 128-byte-aligned, depending on architecture and data type).</p><p>If warp accesses fall inside the same aligned segment, the hardware merges them.</p><p>Let’s say the segments map to the following address ranges:</p><pre>Segment 0: [0 ... 127]<br>Segment 1: [128 ... 255]</pre><p><strong>If threads in a warp access addresses:</strong></p><pre>[20, 24, 28, ... up to 120] → All inside segment 0 : 1 transaction</pre><p>You might be thinking:</p><blockquote>“If all the addresses are on the same cache line, why do we even need to think about memory coalescing?”</blockquote><p>And the short answer is:</p><blockquote>Because GPU coalescing is <em>not</em> just about cache lines — it’s about how the <em>warp’s</em> memory requests map to fixed-sized memory <em>segments</em> and how many hardware transactions get issued.</blockquote><p>Let’s take a step back and understand a few things:</p><ol><li>GPU memory transactions operate at the warp level, not the cache-line level</li></ol><p>Even if data is cached, the GPU still:</p><ul><li>collects 32 addresses from the warp,</li><li>groups them into memory <em>segments</em> (e.g., 32B, 64B, 128B),</li><li>and issues the minimum number of segment fetches.</li></ul><p>Caches reduce latency but don’t magically turn many scattered accesses into one request.</p><p>We still pay:</p><ul><li>extra memory transactions (even if served from cache)</li><li>extra L1/L2 bandwidth</li><li>extra instruction cycles</li><li>more stress on cache and TLB</li></ul><p>So coalescing helps both DRAM accesses <em>and</em> cache hits.</p>
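<p>Here is a small back-of-the-envelope C++ sketch of that segment-counting step. It is a toy model, not a description of any particular GPU: the 128-byte segment size and 4-byte elements are assumptions chosen to match the examples above.</p><pre>#include &lt;cstdio&gt;<br>#include &lt;set&gt;<br><br>// Count how many aligned segments a warp of 32 addresses touches.<br>// Fewer segments means fewer transactions the coalescer has to issue.<br>int segmentsTouched(const unsigned (&amp;addr)[32], unsigned segmentSize = 128) {<br>    std::set&lt;unsigned&gt; segments;<br>    for (unsigned a : addr)<br>        segments.insert(a / segmentSize);   // index of the aligned segment this address falls in<br>    return static_cast&lt;int&gt;(segments.size());<br>}<br><br>int main() {<br>    unsigned coalesced[32], strided[32];<br>    for (int tid = 0; tid &lt; 32; ++tid) {<br>        coalesced[tid] = tid * 4;     // consecutive 4-byte elements → 1 segment<br>        strided[tid]   = tid * 128;   // one element per segment     → 32 segments<br>    }<br>    std::printf(&quot;coalesced: %d segment(s), strided: %d segment(s)\n&quot;,<br>                segmentsTouched(coalesced), segmentsTouched(strided));<br>    return 0;<br>}</pre>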
<p>2. Coalescing is all about aligning warp accesses, not cache loading</p><p>Coalescing rule:</p><blockquote><em>Warp accesses must fall in as few aligned memory segments as possible (typically 32B/64B/128B-aligned).</em></blockquote><p>The cache is hierarchical, but the GPU pipeline must still issue one request <em>per segment</em>.<br>If a warp touches 8 segments, it pays 8× the internal bandwidth.</p><p>If you want to know more about coalescing, refer to this paper: <a href="https://scispace.com/pdf/warppool-sharing-requests-with-inter-warp-coalescing-for-3yqgz2qq0a.pdf">https://scispace.com/pdf/warppool-sharing-requests-with-inter-warp-coalescing-for-3yqgz2qq0a.pdf</a><br>I find these papers useful for understanding the fundamentals that are not readily available online.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=23f222b26ca2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed up with a multi-level cache hierarchy]]></title>
            <link>https://medium.com/@himanshu0525125/speed-up-with-a-multiple-cache-hierarchy-db1a53c0e22b?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/db1a53c0e22b</guid>
            <category><![CDATA[l1-cache]]></category>
            <category><![CDATA[l2-cache]]></category>
            <category><![CDATA[cache]]></category>
            <category><![CDATA[cache-hit-ratio]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sun, 24 Aug 2025 02:31:04 GMT</pubDate>
            <atom:updated>2025-08-24T02:31:04.284Z</atom:updated>
            <content:encoded><![CDATA[<p>Whenever the CPU tries to access its cache, it leads to either a cache hit or a cache miss. In case of a miss, it needs to go down the memory hierarchy. Here, we will see how multiple cache levels help in reducing average memory access time.</p><p>Now, let’s see the equation for memory access time:<br>Average Memory Access Time (AMAT) = Hit latency + miss-ratio * Miss penalty</p><h3>Cache and main memory characteristics</h3><h4>L1 cache</h4><p>Hit latency: 2 cycles</p><p>Hit ratio: 80%</p><h4>L2 cache</h4><p>Hit latency: 5 cycles</p><p>Hit ratio: 90%</p><h4>Main memory</h4><p>Latency: 20 cycles</p><p><strong>Case 1: Only L1 is present</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t9ft0UTk4_APbn_UsTlkTw.png" /></figure><p>AMAT = Hit latency + miss-ratio * Miss penalty</p><p>= 2 + 0.2 * 20</p><p>= 6 cycles</p><p>Note that the above AMAT is from the L1 perspective.</p><p><strong>Case 2: L1 and L2 are present</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_46Qv3-9ipKKvjvCBb7ytQ.png" /></figure><p>From the L2 perspective:</p><p>AMAT_L2 = Hit latency + miss-ratio * Miss penalty</p><p>= 5 + 0.1 * 20 = 7 cycles</p><p>From the L1 perspective:</p><p>AMAT_L1 = Hit latency + miss-ratio * Miss penalty</p><p>Here, the miss-ratio refers to the miss ratio of the L1 cache. The miss penalty is the AMAT of L2 (calculated previously), since an L1 miss leads to an L2 access.</p><p>So AMAT_L1 = 2 + 0.2 * 7 = 3.4 cycles</p><p>So, having both L1 and L2 caches reduces the average memory access time from 6 cycles to 3.4 cycles, which is a significant improvement!</p><p>To achieve this improvement, we need extra area to accommodate the L2 cache, plus logic to allocate lines into L2 from L1 and main memory.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=db1a53c0e22b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cache access pattern]]></title>
            <link>https://medium.com/@himanshu0525125/cache-access-pattern-f7e272f47aca?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/f7e272f47aca</guid>
            <category><![CDATA[cache]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[lru-cache]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sun, 24 Aug 2025 01:40:00 GMT</pubDate>
            <atom:updated>2025-08-24T01:40:00.913Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JBSCu3uXqmmWyVmX2U5SlQ.png" /></figure><p>We run multiple applications on our computing devices every day. It can be browsing, playing games, editing videos and so on. Here we are at the application layer. If we go down the level of abstraction and move towards the processor, we will find that each of the applications mentioned above will be moving data/instructions towards or out of the cache in the processor. If we look at the pattern of cache line access as a whole, it might not make sense but we can try to identify few patterns and analyse each of them to understand how microarchitecture changes will affect cache performance. Remember, we are able to run a processor at a Gigahertz frequency, but memory is still the bottleneck. So, any improvement in cache performance will drastically improve the overall performance of the system.</p><h4>Recency-Friendly</h4><p>Cache gets incoming requests in this pattern:</p><blockquote>a_1, a_2, a_3, .. ……., a_k, a_k, a_(k-1)</blockquote><p>a_1, a_2 refer to a unique address sequence.</p><p>All addresses map to the same block in the cache. We are loading a memory block into the cache and then accessing it again. This will lead to maximum hit-ratio. In this case, Least-Recently Used replacement will provide the best result.</p><p>If k is equal to associativity, then we will have the best-case scenario which will lead to a 100% hit ratio after warm-up.</p><h4>Thrashing access pattern</h4><blockquote>a_1,a_2,a_3,….,a_k, a_1, a_2, a_3,..</blockquote><p>If k is greater than associativity, then the new line will replace the line which will be accessed later. This will be a nightmare for LRU replacement policy</p><h4>Streaming Access pattern</h4><blockquote>a_1, a_2, a_3, a_4,……………</blockquote><p>Here, the sequence has poor temporal locality, and no replacement policy will help to prevent cache misses.</p><p>In reality, we have a mix of the above three patterns, and each workload will have some characteristic, such as the majority of access patterns coming under the Recency-pattern, and in whichcase, LRU will be helpful. Computer architects study these characteristics to decide what microarchitecture features will be included in a processor.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f7e272f47aca" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[RRIP: Smarter Cache replacement than LRU]]></title>
            <link>https://medium.com/@himanshu0525125/rrip-smarter-cache-replacement-than-lru-94c257478849?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/94c257478849</guid>
            <category><![CDATA[lru-cache]]></category>
            <category><![CDATA[computer-hardware]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[cache]]></category>
            <category><![CDATA[cache-memory]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sat, 09 Aug 2025 08:27:14 GMT</pubDate>
            <atom:updated>2025-08-09T08:27:14.427Z</atom:updated>
            <content:encoded><![CDATA[<p>When a processor’s cache fills up, something has to go.<br> The <strong>cache replacement policy</strong> decides <em>which</em> cache line gets evicted to make space for new data.</p><p>For decades, <strong>LRU</strong> (<em>Least Recently Used</em>) has been the go-to choice. The idea is simple:</p><blockquote><em>“If it hasn’t been used recently, it’s probably safe to evict.”</em></blockquote><p>This works well for workloads with <strong>strong temporal locality</strong> — where data is likely to be reused soon after it’s accessed.<br> But LRU struggles badly with <strong>streaming or scan workloads</strong>: imagine reading a huge array sequentially. Each access evicts something you actually <em>will</em> need soon, and by the time you come back to it, it’s gone.</p><p>So, can we do better?<br> In 2010, researchers from Intel and the University of Maryland proposed <strong>RRIP</strong> — <em>Re-Reference Interval Prediction</em> — a smarter way to decide what to evict.</p><h3>The Core Idea</h3><p>Instead of tracking the exact order of past accesses (like LRU), <strong>RRIP predicts how far in the future a cache line will be reused</strong>.<br> It keeps a tiny counter called <strong>RRPV</strong> (<em>Re-Reference Prediction Value</em>) for each cache line:</p><p><strong>RRPV Value → Meaning</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iMM2JdFgCU7-ByDWkCl0kA.png" /></figure><p>0: Will be used very soon — keep it.</p><p>1–2: Medium-term reuse likelihood.</p><p>Max (e.g., 3): Will be used far in the future — best eviction candidate.</p><p><strong>Higher RRPV = more likely to be evicted.</strong></p><p>Most implementations use <strong>2 bits per cache line</strong> → RRPV values from 0 to 3.</p><h3>How RRIP Works</h3><h3>1. Victim Selection</h3><ul><li>Look for a line with <strong>RRPV = max</strong> (e.g., 3). Evict it.</li><li>If none found, increment all RRPVs (saturating at max) and repeat.</li><li>This gradual “aging” simulates the line becoming less useful over time.</li></ul><h3>2. Insertion Policies</h3><p>RRIP’s strength lies in how it sets the <strong>initial RRPV</strong> for new lines:</p><ol><li><strong>SRRIP (Static RRIP)</strong><br> Insert with RRPV = <em>max–1</em> (e.g., 2).<br> → New lines get a short trial before being evicted.</li><li><strong>BRRIP (Bimodal RRIP)</strong><br> Insert with RRPV = <em>max</em> most of the time, and <em>max–1</em> occasionally (e.g., 1 in 32 insertions).<br> → Keeps most new lines “low priority,” good for streaming data.</li><li><strong>DRRIP (Dynamic RRIP)</strong><br> Dynamically switches between SRRIP and BRRIP using <strong>set-dueling</strong>: a few sets run each policy, and the better-performing one is applied globally.</li></ol><h3>3. On a Cache Hit</h3><p>When a line is hit, <strong>reset its RRPV to 0</strong> — meaning “will be used soon.”</p><h3>Why RRIP Works Better Than LRU</h3><ul><li><strong>Scan-resistant:</strong> Doesn’t let streaming data evict useful lines prematurely.</li><li><strong>Simple hardware:</strong> Just a few bits per line and simple update logic.</li><li><strong>Adaptable:</strong> DRRIP can auto-tune itself to different workloads.</li></ul><p>In many last-level cache studies, RRIP outperforms LRU by 5–15% in hit rate for real workloads.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=94c257478849" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>