<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by The Arch Bytes: From Core to Code on Medium]]></title>
        <description><![CDATA[Stories by The Arch Bytes: From Core to Code on Medium]]></description>
        <link>https://medium.com/@himanshu0525125?source=rss-9288ddc93351------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*DLkvLDFL_WdDbEn8pdlZnw.jpeg</url>
            <title>Stories by The Arch Bytes: From Core to Code on Medium</title>
            <link>https://medium.com/@himanshu0525125?source=rss-9288ddc93351------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 15 May 2026 08:39:52 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@himanshu0525125/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[A Phone Chip Inside a Laptop? The Curious Case of MacBook Neo and the A18 Pro]]></title>
            <link>https://medium.com/@himanshu0525125/a-phone-chip-inside-a-laptop-the-curious-case-of-macbook-neo-and-the-a18-pro-c318f8afecf2?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/c318f8afecf2</guid>
            <category><![CDATA[macbook-neo]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[apple-silicon]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sat, 07 Mar 2026 05:10:11 GMT</pubDate>
            <atom:updated>2026-03-07T05:10:11.347Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZDZcDDTCnIhV5cMN2k4xUw.png" /></figure><p>When Apple launched the new MacBook Neo, one detail caught the attention of many hardware enthusiasts: the laptop is powered by the Apple A18 Pro — a chip originally designed for the iPhone 16 Pro.</p><p>At first glance this might not seem unusual. Apple has been designing its own silicon for years. But historically, Apple has kept its chip families separate:</p><ul><li><strong>A-series chips</strong> power iPhones and iPads</li><li><strong>M-series chips</strong> power Macs</li></ul><p>The MacBook Neo changes that pattern. For the first time, Apple has shipped a Mac laptop using an <strong>A-series processor</strong>. This decision raises an interesting question:</p><blockquote><em>If the MacBook Neo uses the same chip as the iPhone, how similar are they really?</em></blockquote><h3>The A18 Pro in Two Very Different Devices</h3><p>Both the MacBook Neo and the iPhone 16 Pro use variants of the <strong>A18 Pro SoC</strong>. At a high level, the architecture is the same.</p><p>Typical configuration includes:</p><ul><li><strong>6-core CPU</strong></li><li>2 performance cores</li><li>4 efficiency cores</li><li><strong>Apple GPU</strong></li><li><strong>16-core Neural Engine</strong></li><li>Built on <strong>TSMC’s 3-nm process</strong></li></ul><p>From a microarchitecture perspective, the CPU complex and neural engine are largely identical. However, one specification stands out when comparing the two devices. A18 Pro in iPhone has 6 cores, while Macbook Neo chip has 5 cores.</p><p>At first glance, this seems counterintuitive. One would expect a laptop to have <strong>equal or greater GPU resources</strong> than a smartphone.</p><p>So why does the MacBook Neo ship with fewer GPU cores?</p><h3>Understanding GPU Core Differences</h3><p>There are several reasons this configuration makes sense from a chip design and product strategy perspective.</p><h4>1. Silicon Binning</h4><p>Modern chips are manufactured in massive batches, and not every die comes out perfect. Some chips may have a small defect in one GPU core or may not meet the frequency target for that unit.</p><p>Instead of discarding the entire chip, manufacturers disable the faulty core and sell the chip with a reduced configuration.</p><p>This process is known as <strong>silicon binning</strong>.</p><p>A simplified example:</p><pre>Fully functional die  → 6 GPU cores → iPhone 16 Pro<br>Minor defect die      → 5 GPU cores → MacBook Neo</pre><p>This allows Apple to <strong>improve manufacturing yield</strong> while still using nearly every produced chip.</p><h4>2. Product Segmentation</h4><p>Another factor is product positioning.</p><p>Apple’s Mac lineup already includes laptops powered by <strong>M-series chips</strong>, such as the Apple M3 and Apple M4. These chips offer significantly larger GPUs and higher memory bandwidth.</p><p>If the MacBook Neo shipped with the full GPU configuration of the A18 Pro, it could start to overlap with higher-end Macs in certain workloads.</p><p>Reducing the GPU core count helps keep the product stack clean:</p><pre>MacBook Neo  → entry-level Mac<br>MacBook Air  → mainstream laptop<br>MacBook Pro  → high performance</pre><h4>3. Target Workloads</h4><p>The MacBook Neo is designed as an <strong>entry-level laptop</strong>. 
Typical workloads include:</p><ul><li>web browsing</li><li>document editing</li><li>light programming</li><li>media playback</li></ul><p>These tasks rarely saturate the GPU. In many cases, <strong>CPU performance and battery life matter far more</strong>.</p><p>Disabling one GPU core has minimal impact on these workloads but can improve chip availability and cost efficiency.</p><h4>The Bigger Story: Mobile Chips Are Now Laptop-Class</h4><p>While the GPU difference is interesting, the bigger takeaway is something else entirely.</p><p>A chip originally designed for a <strong>smartphone thermal envelope</strong> is now powerful enough to run <strong>macOS on a full laptop</strong>.</p><p>This highlights how far mobile SoCs have evolved.</p><p>A decade ago:</p><ul><li>laptop CPUs consumed <strong>15–45 W</strong></li><li>smartphone chips consumed <strong>3–5 W</strong></li></ul><p>Today, modern mobile silicon is efficient enough that the same architecture can scale across multiple device classes.</p><p>Apple’s silicon strategy now looks something like this:</p><pre>A-series   → phones<br>A-series   → entry-level laptops<br>M-series   → mainstream Macs<br>M Pro/Max  → high-performance systems</pre><p>The MacBook Neo is an example of how these boundaries are starting to blur.</p><h4>Final Thoughts</h4><p>At first glance, the MacBook Neo having <strong>fewer GPU cores than the iPhone</strong> seems odd. But when we consider manufacturing yield, product segmentation, and real-world workloads, the decision makes sense.</p><p>In many ways, the real story isn’t about GPU cores at all.</p><p>It’s about the fact that a <strong>phone-class processor is now capable of powering a full laptop</strong>.</p><p>And that says a lot about the trajectory of modern computer architecture.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c318f8afecf2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How I Simulated a GPU Scheduler: A Deep Dive into microGPU]]></title>
            <link>https://medium.com/@himanshu0525125/how-i-simulated-a-gpu-scheduler-a-deep-dive-into-microgpu-7fc9e503fa06?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/7fc9e503fa06</guid>
            <category><![CDATA[cpp-programming]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[gpu-architecture]]></category>
            <category><![CDATA[gpu]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sat, 21 Feb 2026 01:10:53 GMT</pubDate>
            <atom:updated>2026-02-21T01:10:53.907Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4CvPVxqQJQ-H8BhGcbmSGA.png" /></figure><p>In my last post, we went deep into the <a href="https://medium.com/@himanshu0525125/an-architectural-look-at-gpu-compute-units-b7a20d8124dd">architectural blueprint of GPU Compute Units</a>, discussing how they manage massive parallelism in theory. But as any architect knows, there is a massive gulf between understanding a block diagram and seeing those cycles actually move. To bridge that gap, I decided to build it. I’ve spent the last few weeks developing <a href="https://github.com/himanshu5-prog/microGPU">microGPU</a>, a functional C++ simulator designed to demystify the hardware-software contract. It’s one thing to read about ‘Warps’ and ‘Round-Robin Scheduling’; it’s another thing entirely to watch a scheduler dispatch threads across virtual silicon in real-time. In this post, I’m breaking down how I modelled the execution pipeline and sharing the repository so you can compile, trace, and even break your own GPU model.</p><p>Classes defined in codebase:</p><h4>Thread</h4><p>To simulate a GPU, we must first define the smallest unit of execution: the <strong>Thread</strong>. In microGPU, <a href="https://github.com/himanshu5-prog/microGPU/edit/main/src/thread/thread.hh">Thread class</a> acts as a container for the architectural state of a single lane of execution.</p><p>Instead of a complex, bloated object, I kept the thread model lean to ensure the simulation stays performant:</p><ul><li><strong>State Management:</strong> Each thread exists in a ThreadState—either <strong>ACTIVE</strong> or <strong>INACTIVE</strong>. This is crucial for simulating &quot;predication&quot; or &quot;branch divergence,&quot; where some threads in a warp might be disabled during execution.</li><li><strong>The Register File:</strong> Each thread is allocated a private RegisterFile. In this implementation, I’ve defined THREAD_REGISTER_COUNT as <strong>64 registers</strong> per thread using a std::array&lt;int, 64&gt;. This provides a fixed-size, fast-access memory space for computational operands.</li><li><strong>Identification:</strong> Every thread carries a unique id, allowing the Global Scheduler and Warp units to track work distribution across the entire virtual chip.</li></ul><h4>Warp</h4><p>If the thread is the atomic unit, the <strong>Warp</strong> is the management unit. In <a href="https://github.com/himanshu5-prog/microGPU">microGPU</a>, a Warp groups <strong>32 threads</strong> together to execute in lockstep — a fundamental concept in GPU architecture known as SIMT.</p><p>The <a href="https://github.com/himanshu5-prog/microGPU/blob/main/src/warp/warp.hh">Warp class</a> is responsible for maintaining the shared state that these threads rely on:</p><ul><li><strong>Shared Program Counter (PC):</strong> Unlike a CPU where every thread has its own PC, all threads in this Warp share a single PC. They move through the code together, one instruction at a time.</li><li><strong>The Active Mask:</strong> I implemented the ActiveMask using a std::bitset&lt;32&gt;. This is critical for handling <strong>branch divergence</strong>. If an if/else statement causes half the threads to take one path, the mask simply &quot;turns off&quot; the inactive threads during that cycle.</li><li><strong>Reconvergence Stack:</strong> To handle complex control flows, I included a reconvergenceStack. 
This allows the warp to remember where divergent paths should meet back up, ensuring the threads stay synchronized after a conditional block finishes.</li><li><strong>Pipeline Tracking:</strong> Each warp tracks its own PipelineStage (from STAGE_0 to DONE) and WarpState (READY, RUNNING, or STALLED). This allows the Compute Unit to easily identify which warps are waiting for data and which are ready to execute.</li></ul><h4>Compute Unit</h4><p>The <a href="https://github.com/himanshu5-prog/microGPU/blob/main/src/computeUnit/computeUnit.hh">ComputeUnit class</a> manages the complexity of cycle-accurate simulation through several key mechanisms:</p><ul><li><strong>Warp Collection:</strong> Each CU maintains its own internal pool of warps. This mimics real hardware where a specific number of warps are &quot;resident&quot; on a shader core or streaming multiprocessor.</li><li><strong>The Round-Robin Scheduler:</strong> To keep the execution fair and prevent any single warp from hogging resources, I implemented a calculateNextWarpId() method. It follows a simple yet effective Round-Robin strategy, rotating through ready warps every cycle.</li><li><strong>The Pipeline State Machine:</strong> A major feature of this class is the 5-stage pipeline simulation. Each warp progresses through:</li><li>STAGE_0 to STAGE_3 (Execution &amp; Latency)</li><li>DONE (Retirement)</li><li><strong>Cycle-Accurate Tracking:</strong> Using the incrementCycle() method, the CU tracks the precise passage of time. This allows us to measure the performance and throughput of the simulated kernels.</li></ul><h4>GPU</h4><p>The <a href="https://github.com/himanshu5-prog/microGPU/blob/main/src/microGPU/ugpu.hh">MicroGPU class</a> is the entry point of the entire simulation. It represents the physical GPU chip, managing a collection of <strong>16 Compute Units</strong> (defined by CU_COUNT) and a global pool of work. While the CUs handle the heavy lifting of execution, the MicroGPU acts as the hardware&#39;s controller and dispatcher.</p><p>Key responsibilities defined in this top-level class include:</p><ul><li><strong>Global Warp Collection:</strong> Before execution begins, all work is stored in the globalWarpCollection. This acts as the &quot;Global Scheduler&#39;s&quot; queue, holding all warps that are waiting to be dispatched to an available Compute Unit.</li><li><strong>The Global Scheduler:</strong> I implemented a performWarpScheduling() method that handles the distribution of work. In the current iteration, I&#39;ve also included a performWarpSchedulingSimple() method—a testing-focused scheduler that assigns alternating warps to specific CUs to verify that the hand-off between global logic and local execution is seamless.</li><li><strong>The System Heartbeat:</strong> The executeGPU() method is the main loop of the simulation. It drives the currentCycle forward, calling executeComputeUnits() on every tick until the allWarpsCompleted() check returns true.</li><li><strong>Verification &amp; Testing:</strong> To ensure the hardware model actually works, the class includes createGlobalWarpCollectionTest(). This populates the GPU with test warps containing simple instructions (like ADD), allowing for a full &quot;dry run&quot; of the pipeline from dispatch to retirement.</li></ul>
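<p>To make that hand-off concrete, here is a rough C++ sketch of the heartbeat and the round-robin pick described above. It is paraphrased from this post, not copied from the repository; the method names follow the write-up, but the exact signatures, members, and the getState() accessor on Warp are assumptions.</p><pre>// Illustrative sketch only: paraphrased from the description above,<br>// not lifted from the microGPU repository.<br>void MicroGPU::executeGPU() {<br>    // Distribute the global warp pool across the Compute Units up front.<br>    performWarpScheduling();<br><br>    // The &quot;heartbeat&quot;: tick every CU once per cycle until all warps retire.<br>    while (!allWarpsCompleted()) {<br>        executeComputeUnits();   // each CU advances one warp through its pipeline<br>        currentCycle++;          // advance simulated time<br>    }<br>}<br><br>// Round-robin selection inside a Compute Unit: start just after the warp<br>// issued last cycle and take the first READY warp found.<br>void ComputeUnit::calculateNextWarpId() {<br>    for (size_t i = 1; i &lt;= warps.size(); ++i) {<br>        size_t candidate = (currentWarpId + i) % warps.size();<br>        if (warps[candidate].getState() == WarpState::READY) {<br>            currentWarpId = candidate;<br>            return;<br>        }<br>    }<br>    // No READY warp this cycle: every resident warp is stalled or done.<br>}</pre>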
<h4>Explore the Code</h4><p>If you want to dive deeper into the implementation or contribute to the project:</p><ul><li><strong>Source Code:</strong> <a href="https://github.com/himanshu5-prog/microGPU">Check out the microGPU Repository on GitHub</a></li><li><strong>Technical Docs:</strong> <a href="https://himanshu5-prog.github.io/microGPU/">Full API Reference &amp; Documentation</a></li></ul><p><em>I’ll be adding more features like memory hierarchy simulation in the future — feel free to star the repo to follow the progress!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7fc9e503fa06" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Architectural Look at GPU Compute Units]]></title>
            <link>https://medium.com/@himanshu0525125/an-architectural-look-at-gpu-compute-units-b7a20d8124dd?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/b7a20d8124dd</guid>
            <category><![CDATA[cpp-programming]]></category>
            <category><![CDATA[oops-concepts]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[gpu]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Thu, 12 Feb 2026 05:34:59 GMT</pubDate>
            <atom:updated>2026-02-12T05:34:59.133Z</atom:updated>
            <content:encoded><![CDATA[<p>A common belief about GPUs is that they are fast because they contain thousands of cores.</p><p>While that is partially true, it misses the real story.</p><p>GPUs are fast because they are designed to <strong>stay busy</strong>, even when individual operations take hundreds of cycles to complete. At the center of this design sits one of the most important building blocks of modern GPUs:</p><h3>Compute Unit</h3><p>Think of the compute unit as a miniature processor — an independent execution engine capable of scheduling, managing, and executing groups of threads with remarkable efficiency.</p><p>Understanding this component is key to understanding why GPUs behave so differently from CPUs.</p><p>Before diving into compute units, let’s establish an important contrast.</p><blockquote>A CPU is optimized to <strong>minimize latency</strong>. When a program requests data from memory, the CPU deploys large caches, sophisticated branch predictors, and out-of-order execution to reduce waiting time.</blockquote><blockquote>A GPU takes a very different approach.</blockquote><blockquote>Instead of trying to make memory faster, GPUs assume memory <em>will</em> be slow — often <strong>400–800 cycles</strong> for global memory accesses.</blockquote><blockquote>GPUs simply switch to another set of threads that are ready to run.</blockquote><p>And the hardware responsible for orchestrating this constant motion is the compute unit.</p><p>A <strong>compute unit</strong> is an independent hardware block inside the GPU that fetches instructions, schedules work, and executes groups of threads known as <strong>warps</strong> (or wavefronts in AMD terminology).</p><p>Each compute unit contains everything needed to keep execution flowing:</p><ul><li><strong>Warp schedulers</strong></li></ul><p>The scheduler continuously searches for a warp that is ready to execute.</p><p>Every cycle, it asks:</p><blockquote><em>“Which warp can make progress right now?”</em></blockquote><p>If one warp stalls on memory, the scheduler immediately pivots to another.</p><p>This ability to rapidly switch work is what allows GPUs to tolerate massive latency without slowing down.</p><ul><li><strong>Execution pipelines</strong></li></ul><p>Once a warp is selected, its instruction flows into execution pipelines — arithmetic units, load/store units, and specialized math hardware.</p><p>But here is something beginners often misunderstand:</p><p>More pipelines do <strong>not</strong> automatically mean higher performance.</p><p>Performance depends on whether the scheduler can keep those pipelines fed with ready work.</p><p>If all warps are stalled, even the widest machine goes idle.</p><ul><li><strong>A large register file</strong></li></ul><p>One reason compute units can switch between warps so quickly is that each warp’s state lives in a massive on-chip register file.</p><p>Unlike CPUs, there is no expensive context switch.</p><p>No saving to memory.<br> No restoring state.</p><p>The hardware simply selects a different register bank and continues execution.</p><p>This is one of the quiet design choices that enables GPU efficiency.</p><ul><li><strong>On-chip shared memory</strong></li><li><strong>Control logic</strong></li></ul><p>You can think of it as a highly specialized throughput machine whose primary goal is simple: Always have something to execute. 
DO NOT remain idle.</p><h3>Everything sounds good so far, but what happens inside a compute unit every cycle?</h3><p>Let’s zoom into a single cycle inside a compute unit.</p><p>A simplified flow looks like this:</p><p><strong>Cycle N:</strong></p><ol><li>The scheduler selects a READY warp</li><li>An instruction is issued</li><li>Some warps may stall (for example, waiting on memory)</li><li>The scheduler searches for another runnable warp</li></ol><p>And the loop repeats.</p><h3><strong>Compute Unit class definition</strong></h3><p>I have created a simple Compute Unit class in C++. I have added comments to explain the purpose of class methods and variables.</p><pre>#ifndef COMPUTEUNIT_HH<br>#define COMPUTEUNIT_HH<br><br>#include&lt;iostream&gt;<br>#include&lt;string&gt;<br>#include&lt;vector&gt;<br>#include&lt;array&gt;<br>#include&lt;bitset&gt;<br>#include&lt;cassert&gt;<br><br>#include &quot;../warp/warp.hh&quot;<br><br>enum SMState {<br>    IDLE,<br>    BUSY,<br>    ERROR<br>};<br><br><br>class ComputeUnit {<br>    std::vector&lt;Warp&gt; warps;<br><br>    // Each compute unit has its own ID<br>    int smId;<br><br>    // Current warp ID being executed<br>    size_t currentWarpId;<br><br>    // Current cycle<br>    int currentCycle;<br><br>    // State of the compute unit<br>    SMState state;<br><br>    public:<br>        ComputeUnit() : currentWarpId(0), currentCycle(0), state(SMState::IDLE) {}<br>       <br>        // Setter methods<br>        void setState(SMState newState);<br>        void setCurrentWarpId(int warpId);<br>        void setWarp(const Warp &amp;warp);<br>        void setSmId(int id) { smId = id; }<br><br>        // Increment cycle count for the compute unit<br>        void incrementCycle() { currentCycle++; }<br><br>        // Execute the current warp and advance its pipeline stage<br>        void execute(); <br><br>        // Getter methods<br>        int getCurrentWarpId();<br>        SMState getState() const;<br>        int getWarpCollectionSize() const;<br>        int getCurrentCycle() const { return currentCycle; }<br>        int getSmId() const { return smId; }<br><br>        // Print method for debugging<br>        void printId() const { std::cout &lt;&lt; &quot;(ComputeUnit) ComputeUnit ID: &quot; &lt;&lt; smId &lt;&lt; std::endl; }<br><br>        // Method to calculate the next warp ID to execute based on round-robin scheduling<br>        void calculateNextWarpId();<br>        <br>};<br><br>#endif // COMPUTEUNIT_HH</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b7a20d8124dd" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Threads to Warps: How GPUs Actually Execute Code]]></title>
            <link>https://medium.com/@himanshu0525125/from-threads-to-warps-how-gpus-actually-execute-code-4d924eae1ad5?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/4d924eae1ad5</guid>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[cpp]]></category>
            <category><![CDATA[gpu-architecture]]></category>
            <category><![CDATA[gpu]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sun, 25 Jan 2026 06:29:58 GMT</pubDate>
            <atom:updated>2026-01-25T08:48:55.305Z</atom:updated>
            <content:encoded><![CDATA[<p>In the previous post, we looked at <strong>threads</strong> as the fundamental programming abstraction in GPU programming. Threads are how <em>we</em> think about parallelism: each thread has its own registers, its own thread ID, and its own piece of work.</p><p>However, threads are <strong>not</strong> the unit of execution in GPU hardware.</p><p>To understand performance, control flow, and memory behavior on a GPU, we need to introduce the concept that sits one level below threads:</p><blockquote><strong>The warp</strong> — the hardware execution unit of the GPU.</blockquote><h3>What Is a Warp?</h3><p>A <strong>warp</strong> is a fixed-size group of threads that are <strong>executed together in lockstep</strong> by the GPU.</p><p>On <strong>NVIDIA GPUs</strong>, a warp consists of <strong>32 threads</strong>.</p><p>All threads in a warp:</p><ul><li>Share a <strong>single program counter</strong></li><li>Execute the <strong>same instruction at the same time</strong></li><li>Operate on <strong>different data</strong></li></ul><p>This execution model is known as <strong>SIMT (Single Instruction, Multiple Threads)</strong>.</p><h3>Why Do GPUs Use Warps?</h3><p>GPUs are designed for <strong>throughput</strong>, not single-thread latency. A modern GPU may need to manage <strong>tens of thousands of active threads</strong>. Tracking a separate instruction stream for each thread would be prohibitively expensive in hardware.</p><p>Instead, the GPU:</p><ol><li>Groups threads into warps</li><li>Shares control logic across the group</li><li>Executes them together</li></ol><p>This design dramatically reduces hardware complexity while still exposing massive parallelism to the programmer.</p><h3>Warp Execution: Lockstep in Practice</h3><p>Consider the following code:</p><pre>int tid = threadIdx.x;<br>A[tid] = B[tid] + C[tid];</pre><p>From the programmer’s perspective:</p><ul><li>Each thread computes its own tid</li><li>Each thread updates a different array element</li></ul><p>From the hardware’s perspective:</p><ul><li>One instruction is issued</li><li>32 threads execute it simultaneously</li><li>Each thread uses its own registers and memory addresses</li></ul><p>Same instruction. Same cycle. Different data.</p><h3>Warp Divergence: When Threads Disagree</h3><p>The lockstep nature of warps becomes visible when control flow differs between threads.</p><pre>if (tid % 2 == 0)<br>  A[tid] = 1;<br>else<br>  A[tid] = 2;</pre><p>Within a single warp:</p><ul><li>Some threads take the if path</li><li>Others take the else path</li></ul><p>The GPU handles this by:</p><ol><li>Executing the if path while masking inactive threads</li><li>Executing the else path while masking the other threads</li><li>Reconverging the warp</li></ol><p>Both paths execute <strong>serially</strong>.</p><p>This phenomenon is called <strong>warp divergence</strong>.</p><blockquote>Bottom line: Divergence does not break correctness, but it does reduce performance.</blockquote>
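<p>To make the masking mechanics concrete, here is a small, purely illustrative C++ sketch (plain host code, not GPU code) that walks one warp through both sides of the branch above. The 32-lane width and the two-pass execution mirror the steps just listed; everything else is an assumption made for illustration.</p><pre>#include &lt;array&gt;<br>#include &lt;bitset&gt;<br>#include &lt;cstdio&gt;<br><br>constexpr int WARP_SIZE = 32;   // warp width used throughout this post<br><br>int main() {<br>    std::array&lt;int, WARP_SIZE&gt; A{};            // one element per lane<br>    std::bitset&lt;WARP_SIZE&gt; ifMask, elseMask;<br><br>    // Masks for: if (tid % 2 == 0) A[tid] = 1; else A[tid] = 2;<br>    for (int tid = 0; tid &lt; WARP_SIZE; ++tid) ifMask[tid] = (tid % 2 == 0);<br>    elseMask = ~ifMask;<br><br>    // Pass 1: issue the &quot;if&quot; path; lanes outside the mask are switched off.<br>    for (int tid = 0; tid &lt; WARP_SIZE; ++tid)<br>        if (ifMask[tid]) A[tid] = 1;<br><br>    // Pass 2: issue the &quot;else&quot; path with the complementary mask.<br>    for (int tid = 0; tid &lt; WARP_SIZE; ++tid)<br>        if (elseMask[tid]) A[tid] = 2;<br><br>    // Reconvergence: the full mask is restored and lockstep execution resumes.<br>    std::printf(&quot;A[0]=%d A[1]=%d\n&quot;, A[0], A[1]);   // prints A[0]=1 A[1]=2<br>    return 0;<br>}</pre><p>Both passes run one after the other, which is exactly why divergent warps pay a serialization penalty.</p>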
<h3>Warps and Scheduling</h3><p>Each SM maintains a pool of <strong>resident warps</strong>. In every cycle, a <strong>warp scheduler</strong> selects a ready warp and issues its next instruction.</p><p>Key properties:</p><ul><li>Warp context switching is essentially <strong>free</strong></li><li>When one warp stalls (e.g., waiting on memory), another warp is scheduled</li><li>Latency is hidden through <strong>warp-level multithreading</strong></li></ul><p>This is why GPUs rely less on large caches and more on massive parallelism.</p><h3>Why Warps Matter for Performance</h3><p>Understanding warps explains many GPU performance behaviors:</p><ul><li>Branch-heavy code performs poorly due to divergence</li><li>Memory accesses should be aligned across a warp</li><li>Occupancy is measured in <strong>warps per SM</strong>, not threads</li><li>More threads do not automatically mean more performance</li></ul><p>Efficient GPU code:</p><ul><li>Minimises warp divergence</li><li>Encourages uniform control flow within a warp</li><li>Keeps many warps ready to run</li></ul><h3><strong>Conceptual Warp model (C++)</strong></h3><p>Where possible, I like to explain things with code, since it keeps the information readable and concise. I have created a C++ class for Warp with basic functionality.</p><pre>#ifndef SRC_WARP_WARP_HH_<br>#define SRC_WARP_WARP_HH_<br><br>#include&lt;iostream&gt;<br>#include&lt;string&gt;<br>#include&lt;vector&gt;<br>#include&lt;array&gt;<br>#include&lt;bitset&gt;<br>#include &quot;../thread/thread.hh&quot;<br><br>// Define the number of threads in a warp<br>#define WARP_THREAD_COUNT 32<br><br>// Type alias for a group of threads in a warp<br>using ThreadGroup = std::array&lt;Thread*, WARP_THREAD_COUNT&gt;;<br><br>// Type alias for the active mask of threads in a warp<br>using ActiveMask = std::bitset&lt;WARP_THREAD_COUNT&gt;;<br><br>// Instruction types enumeration<br>enum InstructionType {<br>    ADD,<br>    SUB,<br>    LOAD,<br>    STORE,<br>    BRANCH<br>};<br><br>//Instruction structure<br>struct Instruction {<br>    InstructionType type;<br>    int dest;<br>    int src1;<br>    int src2;<br><br>    Instruction(InstructionType t, int d, int s1, int s2)<br>        : type(t), dest(d), src1(s1), src2(s2) {}<br>    Instruction() : type(ADD), dest(0), src1(0), src2(0) {}<br><br>};<br><br>// Reconvergence point structure<br>struct reconvergencePoint {<br>    int pc;<br>    ActiveMask mask;<br><br>    reconvergencePoint(int pc_, const ActiveMask&amp; mask_)<br>        : pc(pc_), mask(mask_) {}<br>    reconvergencePoint() : pc(0), mask() {}<br>};<br><br>// Warp state enumeration<br>enum WarpState {<br>    READY,<br>    RUNNING,<br>    STALLED<br>};<br><br>class Warp {<br>    int id;<br>    int pc;<br>    ThreadGroup threads;<br>    ActiveMask activeMask;<br>    Instruction currentInstruction;<br>    std::vector&lt;reconvergencePoint&gt; reconvergenceStack;<br>    WarpState state;<br><br>    public:<br>    Warp();<br>    Warp(int warpId, const ThreadGroup&amp; threadGroup, WarpState warpState = WarpState::READY);<br><br>     // Getter and Setter methods<br>    int getId() const;<br>    int getPc() const;<br>    void setPc(int pc_);<br>    void setCurrentInstruction(const Instruction&amp; instr);<br>    Instruction getCurrentInstruction() const;<br><br>    const ActiveMask&amp; getActiveMask() const;<br><br>};<br><br><br>#endif  // SRC_WARP_WARP_HH_</pre>
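<p>For readers who want to poke at the model, here is a hypothetical snippet showing how the header above could be driven. The .cc implementations are not shown in this post, so the exact constructor and getter behaviour is an assumption; only the declared interface is taken from the header.</p><pre>// Hypothetical usage of the Thread/Warp headers above (include path and behaviour assumed).<br>#include &quot;warp.hh&quot;<br><br>int main() {<br>    // Build 32 threads and group them into a single warp.<br>    std::array&lt;Thread, WARP_THREAD_COUNT&gt; pool;<br>    ThreadGroup group{};<br>    for (int i = 0; i &lt; WARP_THREAD_COUNT; ++i) {<br>        pool[i] = Thread(i, ThreadState::ACTIVE);<br>        group[i] = &amp;pool[i];<br>    }<br><br>    Warp warp(0, group);                    // warp 0, defaults to READY<br>    warp.setPc(0);                          // shared program counter for all 32 lanes<br>    warp.setCurrentInstruction(Instruction(ADD, 2, 0, 1));   // assumed meaning: r2 = r0 + r1, per lane<br><br>    // One issue slot: every active lane would execute this same ADD in lockstep.<br>    std::cout &lt;&lt; &quot;Warp &quot; &lt;&lt; warp.getId()<br>              &lt;&lt; &quot; issuing instruction at PC &quot; &lt;&lt; warp.getPc() &lt;&lt; std::endl;<br>    return 0;<br>}</pre>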
<p><strong>Note</strong>: This is a <em>conceptual</em> model intended to make the warp abstraction concrete. It is <strong>not</strong> a cycle-accurate GPU simulator. The goal is to expose the shared program counter, active mask, and lockstep execution semantics.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4d924eae1ad5" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Threads in GPUs: The Smallest Units That Drive Massive Parallelism]]></title>
            <link>https://medium.com/@himanshu0525125/threads-in-gpus-the-smallest-units-that-drive-massive-parallelism-2359cb271336?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/2359cb271336</guid>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[cpp-programming]]></category>
            <category><![CDATA[gpu-computing]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Thu, 22 Jan 2026 02:34:09 GMT</pubDate>
            <atom:updated>2026-01-22T02:34:09.274Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9dg2oZELWTNmGCcY4Lkv5Q.png" /></figure><p>When people talk about GPUs, they often mention <em>thousands of cores</em>, <em>massive parallelism</em>, or <em>SIMT execution</em>. But at the heart of all of this is a much smaller abstraction: GPU thread.</p><p>Understanding GPU threads — how they are created, scheduled, grouped, and executed — is foundational to writing fast GPU programs and to understanding modern accelerator architecture. This post breaks down what GPU threads really are, how they differ from CPU threads, and how they fit into the larger execution model.</p><h3><strong>Why GPUs care about thread?</strong></h3><p>GPUs are designed for <strong>throughput</strong>, not latency.</p><p>While CPUs optimize for:</p><ul><li>Fast single-thread execution</li><li>Sophisticated control flow</li><li>Large caches</li></ul><p>GPUs optimise for:</p><ul><li>Running <strong>many threads at once</strong></li><li>Hiding memory latency with execution</li><li>Simple control logic replicated at scale</li></ul><p>The result: a GPU may run <strong>tens of thousands of threads concurrently</strong>, each doing a small piece of work.</p><h3><strong>But what is a GPU thread?</strong></h3><p>A <strong>GPU thread</strong> is the <strong>smallest unit of execution</strong> in a GPU program.</p><p>Each thread:</p><ul><li>Executes the same kernel code</li><li>Has its own registers and local variables</li><li>Has a unique thread ID</li><li>Works on a different portion of data</li></ul><p>If you’ve used <strong>NVIDIA CUDA</strong>, this is the threadIdx.x you’re familiar with.</p><blockquote><em>Conceptually:<br> </em><strong><em>One thread = one data element (or a few elements)</em></strong></blockquote><p>This “one thread per element” mindset is key to GPU programming.</p><p>A GPU expects many threads to stall on memory while others continue executing. This is how GPUs hide memory latency without large caches or speculation.</p><p>Of course, threads are grouped before being scheduled to a compute unit in the GPU, and this scheduling plays an important role in determining the speedup.</p><p>Some workloads (e.g., graphics) can be divided into sub-problems that can be handled in parallel with minimal dependencies. 
<h3><strong>Implementing the Thread class in C++</strong></h3><p>To make the idea of a GPU thread concrete, let’s model a thread using a simple C++ class.</p><p>Each GPU thread:</p><ul><li>Has a unique ID</li><li>Owns a private register file</li><li>Can be active or inactive depending on the control flow</li></ul><p>Below is one possible implementation of a thread for anyone who wants to write a functional model of a GPU:</p><pre>#ifndef SRC_THREAD_THREAD_HH_<br>#define SRC_THREAD_THREAD_HH_<br><br>#include &lt;iostream&gt;<br>#include &lt;string&gt;<br>#include &lt;vector&gt;<br>#include &lt;cassert&gt;<br>#include &lt;array&gt;<br>// Define the number of registers available to each thread<br>#define THREAD_REGISTER_COUNT 64<br><br>using RegisterFile = std::array&lt;int, THREAD_REGISTER_COUNT&gt;;<br><br>enum ThreadState {<br>    ACTIVE, // Thread is active and can execute instructions<br>    INACTIVE // Thread is inactive and should not execute instructions<br>};<br><br>class Thread {<br>    int id;<br>    ThreadState state;<br>    RegisterFile registers;<br><br>public:<br><br>    Thread();<br>    Thread(int threadId, ThreadState threadState);<br><br>    // Getter methods<br>    int getId() const;<br>    ThreadState getState() const;<br>    int getRegisterValue(int index) const;<br><br>    // Setter methods<br>    void setId(int threadId);<br>    void setState(ThreadState threadState);<br>    void setRegisters(const RegisterFile&amp; regs);<br>    void setRegisterValue(int index, int value);<br>};<br><br><br>#endif  // SRC_THREAD_THREAD_HH_</pre><p>Note that this is just a functional model and might not be hardware-accurate.</p><p>I will be writing more blog posts on the programming model and architecture of GPUs, and will add to the code I shared here.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2359cb271336" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Want to learn Grad-level Computer Architecture? This GitHub Repo is a game changer]]></title>
            <link>https://medium.com/@himanshu0525125/want-to-learn-grad-level-computer-architecture-this-github-repo-is-a-game-changer-a2f81bfb1f8d?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/a2f81bfb1f8d</guid>
            <category><![CDATA[cpu]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[simulator]]></category>
            <category><![CDATA[top-github-repository]]></category>
            <category><![CDATA[cache]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Wed, 26 Nov 2025 19:40:49 GMT</pubDate>
            <atom:updated>2025-11-26T19:40:49.208Z</atom:updated>
            <content:encoded><![CDATA[<p>Most people learn computer architecture from textbooks — Hennessy &amp; Patterson, the RISC-V Reader, or an ISA spec — plus some experimentation with microcontrollers. A few explore heavyweight simulators like gem5 or ChampSim.</p><p>But there’s a missing middle ground:<br> A <strong>simple, clean, step-by-step simulator that teaches microarchitecture through coding</strong>, not through 400-page PDFs.</p><p>Recently, I found a GitHub repository that fills this gap perfectly:</p><p><a href="https://github.com/fabwu/eth-computer-architecture"><strong>https://github.com/fabwu/eth-computer-architecture</strong></a></p><p>This open-source project was originally part of ETH Zurich’s Computer Architecture coursework. It implements a small, pipelined CPU with cache support — and more importantly, it’s organised into <strong>four lab assignments</strong> that walk you through building or extending parts of the architecture.</p><p>For anyone learning computer architecture or anyone considering building their own simulator, this is an incredible resource.</p><h3><strong>Why is this repo different?</strong></h3><p>There are thousands of architecture repos on GitHub, but most are either:</p><ul><li>incomplete classroom exercises,</li><li>extremely complicated research simulators,</li><li>or undocumented student projects.</li></ul><p>This ETH repo is the opposite.</p><h3>✅ Clear structure</h3><p>The project is broken into multiple labs, each focusing on a single aspect of microarchitecture. You follow them in sequence, and each builds on the last.</p><h3>✅ Readable and hackable codebase</h3><p>The simulator is small enough to understand but realistic enough to behave like a real pipeline. Perfect balance.</p><h3>✅ High-quality academic design</h3><p>ETH is known for clean, well-designed architecture coursework. This repo reflects that quality: good documentation, a modular simulator, and meaningful lab goals.</p><h3>✅ Practical learning, not theoretical</h3><p>Instead of reading about forwarding or cache hits, you <em>implement</em> them and see the effect immediately.</p><p>This is the kind of resource I wish I had when I started learning microarchitecture.</p><p>Take some time to explore the repository, and make sure to read each lab assignment description carefully. For each lab, the repository also includes research papers that can be implemented in the simulator, which I find amazing.</p><p>This project comes from the Computer Architecture course at ETH Zurich, taught by <strong>Prof. Onur Mutlu</strong> — a globally recognised researcher whose work spans memory systems, DRAM architecture, processing-in-memory, and microarchitecture design. His lectures and course material are regularly cited and used by universities around the world. The design quality of this simulator reflects the same clarity and rigour that his courses are known for.</p><p>I hope the repository proves useful to people looking for hands-on projects in Computer Architecture.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a2f81bfb1f8d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Memory Coalescing in GPU]]></title>
            <link>https://medium.com/@himanshu0525125/memory-coalescing-in-gpu-23f222b26ca2?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/23f222b26ca2</guid>
            <category><![CDATA[memory-subsystem]]></category>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[gpu-architecture]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sat, 22 Nov 2025 06:13:11 GMT</pubDate>
            <atom:updated>2025-11-22T06:13:11.506Z</atom:updated>
            <content:encoded><![CDATA[<p>Modern GPUs rely on enormous memory bandwidth to keep thousands of threads busy. But raw bandwidth alone isn’t enough — the way threads access memory determines whether a kernel is fast or painfully slow.<br> This is where memory coalescing becomes one of the most important performance concepts in GPU programming.</p><p>Memory coalescing is the process by which a GPU hardware unit (usually the memory subsystem + L1 coalescer) merges multiple memory requests from threads in the same warp into as few DRAM transactions as possible.</p><ul><li>A warp = 32 threads, all executing the same instruction in lockstep (SIMT).</li><li>If those 32 threads read or write data stored in consecutive, properly aligned addresses, the GPU can combine their memory requests into one large memory transaction.</li><li>If they access scattered or misaligned positions, the hardware performs multiple transactions → slower.</li></ul><p>If memory coalescing is successful, a handful of transactions can service every request from the warp, and all threads get their data quickly without stalling.</p><p>GPUs access memory in fixed-sized segments (typically 32-, 64-, or 128-byte-aligned, depending on architecture and data type).</p><p>If warp accesses fall inside the same aligned segment, the hardware merges them.</p><p>Let’s say the segments map to the following address ranges:</p><pre>Segment 0: [0 ... 127]<br>Segment 1: [128 ... 255]</pre><p><strong>If threads in a warp access addresses:</strong></p><pre>[20, 24, 28, ... up to 120] → All inside segment 0 : 1 transaction</pre><p>You might be thinking:</p><blockquote>“If all the addresses are on the same cache line, why do we even need to think about memory coalescing?”</blockquote><p>And the short answer is:</p><blockquote>Because GPU coalescing is <em>not</em> just about cache lines — it’s about how the <em>warp’s</em> memory requests map to fixed-sized memory <em>segments</em> and how many hardware transactions get issued.</blockquote><p>Let’s take a step back and understand a few things:</p><ol><li>GPU memory transactions operate at the warp level, not the cache-line level</li></ol><p>Even if data is cached, the GPU still:</p><ul><li>collects 32 addresses from the warp,</li><li>groups them into memory <em>segments</em> (e.g., 32B, 64B, 128B),</li><li>and issues the minimum number of segment fetches.</li></ul><p>Caches reduce latency but don’t magically turn many scattered accesses into one request.</p><p>We still pay:</p><ul><li>extra memory transactions (even if served from cache)</li><li>extra L1/L2 bandwidth</li><li>extra instruction cycles</li><li>more stress on cache and TLB</li></ul><p>So coalescing helps both DRAM accesses <em>and</em> cache hits.</p>
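<p>Here is a small back-of-the-envelope C++ sketch of that segment-counting step. It is a toy model, not a description of any particular GPU: the 128-byte segment size and 4-byte elements are assumptions chosen to match the examples above.</p><pre>#include &lt;cstdio&gt;<br>#include &lt;set&gt;<br><br>// Count how many aligned segments a warp of 32 addresses touches.<br>// Fewer segments means fewer transactions the coalescer has to issue.<br>int segmentsTouched(const unsigned (&amp;addr)[32], unsigned segmentSize = 128) {<br>    std::set&lt;unsigned&gt; segments;<br>    for (unsigned a : addr)<br>        segments.insert(a / segmentSize);   // index of the aligned segment this address falls in<br>    return static_cast&lt;int&gt;(segments.size());<br>}<br><br>int main() {<br>    unsigned coalesced[32], strided[32];<br>    for (int tid = 0; tid &lt; 32; ++tid) {<br>        coalesced[tid] = tid * 4;     // consecutive 4-byte elements → 1 segment<br>        strided[tid]   = tid * 128;   // one element per segment     → 32 segments<br>    }<br>    std::printf(&quot;coalesced: %d segment(s), strided: %d segment(s)\n&quot;,<br>                segmentsTouched(coalesced), segmentsTouched(strided));<br>    return 0;<br>}</pre>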
<p>2. Coalescing is all about aligning warp accesses, not cache loading</p><p>Coalescing rule:</p><blockquote><em>Warp accesses must fall in as few aligned memory segments as possible (typically 32B/64B/128B-aligned).</em></blockquote><p>The cache is hierarchical, but the GPU pipeline must still issue one request <em>per segment</em>.<br>If a warp touches 8 segments, it pays 8× the internal bandwidth.</p><p>If you want to know more about coalescing, refer to this paper: <a href="https://scispace.com/pdf/warppool-sharing-requests-with-inter-warp-coalescing-for-3yqgz2qq0a.pdf">https://scispace.com/pdf/warppool-sharing-requests-with-inter-warp-coalescing-for-3yqgz2qq0a.pdf</a><br>I find these papers useful for understanding the fundamentals that are not readily available online.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=23f222b26ca2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed up with a multi-level cache hierarchy]]></title>
            <link>https://medium.com/@himanshu0525125/speed-up-with-a-multiple-cache-hierarchy-db1a53c0e22b?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/db1a53c0e22b</guid>
            <category><![CDATA[l1-cache]]></category>
            <category><![CDATA[l2-cache]]></category>
            <category><![CDATA[cache]]></category>
            <category><![CDATA[cache-hit-ratio]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sun, 24 Aug 2025 02:31:04 GMT</pubDate>
            <atom:updated>2025-08-24T02:31:04.284Z</atom:updated>
            <content:encoded><![CDATA[<p>Whenever the CPU tries to access its cache, it leads to either a cache hit or a cache miss. In case of a miss, it needs to go down the memory hierarchy. Here, we will see how multiple cache levels help in reducing average memory access time.</p><p>Now, let’s see the equation for memory access time:<br>Average Memory Access Time (AMAT) = Hit latency + miss-ratio * Miss penalty</p><h3>Cache and main memory characteristics</h3><h4>L1 cache</h4><p>Hit latency: 2 cycles</p><p>Hit ratio: 80%</p><h4>L2 cache</h4><p>Hit latency: 5 cycles</p><p>Hit ratio: 90%</p><h4>Main memory</h4><p>Latency: 20 cycles</p><p><strong>Case 1: Only L1 is present</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t9ft0UTk4_APbn_UsTlkTw.png" /></figure><p>AMAT = Hit latency + miss-ratio * Miss penalty</p><p>= 2 + 0.2 * 20</p><p>= 6 cycles</p><p>Note that the above AMAT is from the L1 perspective.</p><p><strong>Case 2: L1 and L2 are present</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_46Qv3-9ipKKvjvCBb7ytQ.png" /></figure><p>From the L2 perspective:</p><p>AMAT_L2 = Hit latency + miss-ratio * Miss penalty</p><p>= 5 + 0.1 * 20 = 7 cycles</p><p>From the L1 perspective:</p><p>AMAT_L1 = Hit latency + miss-ratio * Miss penalty</p><p>Here, the miss-ratio refers to the miss ratio of the L1 cache. The miss penalty is the AMAT of L2 (calculated previously), since an L1 miss leads to an L2 access.</p><p>So AMAT_L1 = 2 + 0.2 * 7 = 3.4 cycles</p><p>So, having both L1 and L2 caches reduces the average memory access time from 6 cycles to 3.4 cycles, which is a significant improvement!</p><p>To achieve this improvement, we need extra area to accommodate the L2 cache, plus logic to allocate lines into L2 from L1 and main memory.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=db1a53c0e22b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cache access pattern]]></title>
            <link>https://medium.com/@himanshu0525125/cache-access-pattern-f7e272f47aca?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/f7e272f47aca</guid>
            <category><![CDATA[cache]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[lru-cache]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sun, 24 Aug 2025 01:40:00 GMT</pubDate>
            <atom:updated>2025-08-24T01:40:00.913Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JBSCu3uXqmmWyVmX2U5SlQ.png" /></figure><p>We run multiple applications on our computing devices every day. It can be browsing, playing games, editing videos and so on. Here we are at the application layer. If we go down the level of abstraction and move towards the processor, we will find that each of the applications mentioned above will be moving data/instructions towards or out of the cache in the processor. If we look at the pattern of cache line access as a whole, it might not make sense but we can try to identify few patterns and analyse each of them to understand how microarchitecture changes will affect cache performance. Remember, we are able to run a processor at a Gigahertz frequency, but memory is still the bottleneck. So, any improvement in cache performance will drastically improve the overall performance of the system.</p><h4>Recency-Friendly</h4><p>Cache gets incoming requests in this pattern:</p><blockquote>a_1, a_2, a_3, .. ……., a_k, a_k, a_(k-1)</blockquote><p>a_1, a_2 refer to a unique address sequence.</p><p>All addresses map to the same block in the cache. We are loading a memory block into the cache and then accessing it again. This will lead to maximum hit-ratio. In this case, Least-Recently Used replacement will provide the best result.</p><p>If k is equal to associativity, then we will have the best-case scenario which will lead to a 100% hit ratio after warm-up.</p><h4>Thrashing access pattern</h4><blockquote>a_1,a_2,a_3,….,a_k, a_1, a_2, a_3,..</blockquote><p>If k is greater than associativity, then the new line will replace the line which will be accessed later. This will be a nightmare for LRU replacement policy</p><h4>Streaming Access pattern</h4><blockquote>a_1, a_2, a_3, a_4,……………</blockquote><p>Here, the sequence has poor temporal locality, and no replacement policy will help to prevent cache misses.</p><p>In reality, we have a mix of the above three patterns, and each workload will have some characteristic, such as the majority of access patterns coming under the Recency-pattern, and in whichcase, LRU will be helpful. Computer architects study these characteristics to decide what microarchitecture features will be included in a processor.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f7e272f47aca" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[RRIP: Smarter Cache replacement than LRU]]></title>
            <link>https://medium.com/@himanshu0525125/rrip-smarter-cache-replacement-than-lru-94c257478849?source=rss-9288ddc93351------2</link>
            <guid isPermaLink="false">https://medium.com/p/94c257478849</guid>
            <category><![CDATA[lru-cache]]></category>
            <category><![CDATA[computer-hardware]]></category>
            <category><![CDATA[computer-architecture]]></category>
            <category><![CDATA[cache]]></category>
            <category><![CDATA[cache-memory]]></category>
            <dc:creator><![CDATA[The Arch Bytes: From Core to Code]]></dc:creator>
            <pubDate>Sat, 09 Aug 2025 08:27:14 GMT</pubDate>
            <atom:updated>2025-08-09T08:27:14.427Z</atom:updated>
            <content:encoded><![CDATA[<p>When a processor’s cache fills up, something has to go.<br> The <strong>cache replacement policy</strong> decides <em>which</em> cache line gets evicted to make space for new data.</p><p>For decades, <strong>LRU</strong> (<em>Least Recently Used</em>) has been the go-to choice. The idea is simple:</p><blockquote><em>“If it hasn’t been used recently, it’s probably safe to evict.”</em></blockquote><p>This works well for workloads with <strong>strong temporal locality</strong> — where data is likely to be reused soon after it’s accessed.<br> But LRU struggles badly with <strong>streaming or scan workloads</strong>: imagine reading a huge array sequentially. Each access evicts something you actually <em>will</em> need soon, and by the time you come back to it, it’s gone.</p><p>So, can we do better?<br> In 2010, researchers from Intel and the University of Maryland proposed <strong>RRIP</strong> — <em>Re-Reference Interval Prediction</em> — a smarter way to decide what to evict.</p><h3>The Core Idea</h3><p>Instead of tracking the exact order of past accesses (like LRU), <strong>RRIP predicts how far in the future a cache line will be reused</strong>.<br> It keeps a tiny counter called <strong>RRPV</strong> (<em>Re-Reference Prediction Value</em>) for each cache line:</p><p><strong>RRPV Value → Meaning</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iMM2JdFgCU7-ByDWkCl0kA.png" /></figure><p>0: Will be used very soon — keep it.</p><p>1–2: Medium-term reuse likelihood.</p><p>Max (e.g., 3): Will be used far in the future — best eviction candidate.</p><p><strong>Higher RRPV = more likely to be evicted.</strong></p><p>Most implementations use <strong>2 bits per cache line</strong> → RRPV values from 0 to 3.</p><h3>How RRIP Works</h3><h3>1. Victim Selection</h3><ul><li>Look for a line with <strong>RRPV = max</strong> (e.g., 3). Evict it.</li><li>If none found, increment all RRPVs (saturating at max) and repeat.</li><li>This gradual “aging” simulates the line becoming less useful over time.</li></ul><h3>2. Insertion Policies</h3><p>RRIP’s strength lies in how it sets the <strong>initial RRPV</strong> for new lines:</p><ol><li><strong>SRRIP (Static RRIP)</strong><br> Insert with RRPV = <em>max–1</em> (e.g., 2).<br> → New lines get a short trial before being evicted.</li><li><strong>BRRIP (Bimodal RRIP)</strong><br> Insert with RRPV = <em>max</em> most of the time, and <em>max–1</em> occasionally (e.g., 1 in 32 insertions).<br> → Keeps most new lines “low priority,” good for streaming data.</li><li><strong>DRRIP (Dynamic RRIP)</strong><br> Dynamically switches between SRRIP and BRRIP using <strong>set-dueling</strong>: a few sets run each policy, and the better-performing one is applied globally.</li></ol><h3>3. On a Cache Hit</h3><p>When a line is hit, <strong>reset its RRPV to 0</strong> — meaning “will be used soon.”</p><h3>Why RRIP Works Better Than LRU</h3><ul><li><strong>Scan-resistant:</strong> Doesn’t let streaming data evict useful lines prematurely.</li><li><strong>Simple hardware:</strong> Just a few bits per line and simple update logic.</li><li><strong>Adaptable:</strong> DRRIP can auto-tune itself to different workloads.</li></ul><p>In many last-level cache studies, RRIP outperforms LRU by 5–15% in hit rate for real workloads.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=94c257478849" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>