Under the hood: Know your memory (and caches!)

aarti gupta
software under the hood
11 min read · Feb 8, 2018

Random Access Memory (RAM)

A memory unit is designated as Random Access Memory (RAM) if any location can be accessed in a fixed amount of time that is independent of the location's address.

Static Random-Access Memory (SRAM)

Static random-access memory (static RAM or SRAM) is a type of semiconductor memory that uses bistable latching circuitry (a flip-flop) to store each bit. SRAM exhibits data remanence, but it is still volatile in the conventional sense: data is eventually lost when the memory is not powered.

The term static differentiates SRAM from DRAM (dynamic random-access memory) which must be periodically refreshed. SRAM is faster and more expensive than DRAM; it is typically used for CPU cache while DRAM is used for a computer’s main memory.

Dynamic random-access memory (DRAM)

Dynamic random-access memory (DRAM) is a type of random access semiconductor memory that stores each bit of data in a separate tiny capacitor within an integrated circuit. The capacitor can either be charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1. The electric charge on the capacitors slowly leaks off, so without intervention the data on the chip would soon be lost. To prevent this, DRAM requires an external memory refresh circuit which periodically rewrites the data in the capacitors, restoring them to their original charge. Because of this refresh requirement, it is a dynamic memory, as opposed to static random-access memory (SRAM), which does not require data to be refreshed.

DRAM is widely used in digital electronics where low-cost and high-capacity memory is required. One of the largest applications for DRAM is the main memory (colloquially called the “RAM”) in modern computers and graphics cards (where the “main memory” is called the graphics memory). It is also used in many portable devices and video game consoles. In contrast, SRAM, which is faster and more expensive than DRAM, is typically used where speed is of greater concern than cost, such as the cache memories in processors.

Variations in DRAM

Synchronous dynamic random-access memory (SDRAM) is any dynamic random-access memory (DRAM) in which the operation of the external pin interface is coordinated by an externally supplied clock signal. Earlier DRAM used an asynchronous interface, in which input control signals have a direct effect on internal functions, delayed only by the trip across the semiconductor pathways. SDRAM has a synchronous interface, whereby changes on control inputs are recognised after a rising edge of its clock input. In SDRAM families, the clock signal controls the stepping of an internal finite state machine that responds to incoming commands. These commands can be pipelined to improve performance, with previously started operations completing while new commands are received; this lets the device work on a memory access command in each bank simultaneously and speeds up access in an interleaved fashion, so SDRAMs achieve greater concurrency and higher data transfer rates than asynchronous DRAMs could.

Pipelining means that the chip can accept a new command before it has finished processing the previous one. For a pipelined write, the write command can be immediately followed by another command without waiting for the data to be written into the memory array. For a pipelined read, the requested data appears a fixed number of clock cycles (the latency) after the read command, during which additional commands can be sent.

Double data rate type three SDRAM (DDR3 SDRAM) <also interleaving>

DDR3 SDRAM is a type of synchronous dynamic random-access memory (SDRAM) with a high-bandwidth ("double data rate") interface, and has been in use since 2007. It is the higher-speed successor to DDR and DDR2 and the predecessor to DDR4 synchronous dynamic random-access memory.

The modifications to the access circuitry give it the capacity to transfer data twice per clock cycle: once at the beginning of the cycle and once at its end. Thus a DDR module that operates at 133 MHz, for instance, delivers performance equivalent to that of a single-data-rate SDRAM module running at 266 MHz. To make the double data access per clock cycle possible, the cell array is organized into two memory banks, each of which can be accessed separately. Additionally, consecutive words of a given block are stored in different banks, which leads to the interleaving concept. It is this interleaving of words across memory banks that allows simultaneous access to two words, which are then transferred on successive clock edges.
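To make the interleaving idea concrete, here is a minimal sketch (in C) of low-order interleaving: consecutive word addresses alternate between banks. The two-bank count mirrors the organization described above; everything else is an illustrative assumption.

```c
#include <stdio.h>

/* Minimal sketch of low-order interleaving: consecutive word addresses
 * alternate between banks, so two neighbouring words can come from
 * different banks and be driven out on successive clock edges.
 * NUM_BANKS = 2 mirrors the two-bank DDR organization described above. */
#define NUM_BANKS 2

static unsigned bank_of(unsigned word_addr)        { return word_addr % NUM_BANKS; }
static unsigned offset_in_bank(unsigned word_addr) { return word_addr / NUM_BANKS; }

int main(void) {
    for (unsigned addr = 0; addr < 8; addr++)
        printf("word %u -> bank %u, offset %u\n",
               addr, bank_of(addr), offset_in_bank(addr));
    return 0;
}
```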

Fast Page Mode DRAM (FPM DRAM) <also burst mode>

FPM DRAM implements an improvement on conventional DRAM in which the row address is held constant while data from multiple columns is read out through the MDR using several column addresses. The data held in the MDR forms an open page that can be accessed relatively quickly, which speeds up successive accesses. This mechanism is known as burst mode access, and it permits a block of (typically) four sequential words to be transferred from/to the memory bank. The first word takes the same time as on conventional DRAM, but the subsequent three words are transferred faster. This can be modelled with the access-time pattern x-y-y-y. FPM DRAM usually has an access time of 5–3–3–3 (5 cycles for the first access, with the subsequent three taking 3 cycles each), with speeds of 60 to 80 ns (for the first access) and a maximum bus rate of 66 MHz.
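The 5–3–3–3 figure turns into a quick back-of-the-envelope comparison against four independent conventional accesses. The cycle counts below come from the paragraph above; the rest is a simplification.

```c
#include <stdio.h>

/* Back-of-the-envelope timing for a 4-word burst on FPM DRAM (5-3-3-3)
 * versus four independent conventional accesses (5-5-5-5). */
int main(void) {
    int first = 5, subsequent = 3, words = 4;

    int burst_cycles        = first + (words - 1) * subsequent; /* 5+3+3+3 = 14 */
    int conventional_cycles = words * first;                    /* 4*5     = 20 */

    printf("burst:        %d cycles\n", burst_cycles);
    printf("conventional: %d cycles\n", conventional_cycles);
    return 0;
}
```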

Extended Data Out DRAM (EDO DRAM)

EDO DRAM adds a buffer on the data-out bus drivers. This buffer stores the output data (in SRAM) and keeps it stable for the time required for it to be read over the bus. Because the data is held in these extra buffers, the chip can overlap the next read access with the previous one, i.e., the next column address can be processed while the previous data is still being read out. This enhancement allows the access times of the second, third and fourth data accesses of a burst mode access to be overlapped and therefore accelerated.

Rambus DRAM (RDRAM)

The strategy used to develop this memory is based on dividing it into a larger number of memory banks. This supports simultaneous data transfers from/to several memory banks, allowing higher operating frequencies. Due to electromagnetic interference resulting from the high-frequency data transfers, the width of the data bus had to be reduced to allow greater isolation, achieved by increasing the spacing between wires.

Memory Bank

A memory bank is a logical unit of storage in electronics, which is hardware-dependent. In a computer, the memory bank may be determined by the memory controller along with the physical organization of the hardware memory slots. A bank consists of multiple rows and columns of storage units, and is usually spread out across several chips. In a single read or write operation, only one bank is accessed; therefore the number of bits in a column or a row, per bank and per chip, equals the memory bus width in bits (single channel). The size of a bank is further determined by the number of bits in a column and a row, per chip, multiplied by the number of chips in a bank.
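As a hedged illustration of that arithmetic, here is a tiny sketch with made-up geometry (8 chips of 8 bits each, 32,768 rows and 1,024 columns per chip). The figures are not taken from any real module; they only show how bus width and bank capacity are derived.

```c
#include <stdio.h>

/* Illustrative bank arithmetic; all figures below are assumptions chosen
 * only to show how bus width and bank capacity are derived. */
int main(void) {
    unsigned chips_per_bank = 8;      /* chips accessed together         */
    unsigned bits_per_chip  = 8;      /* data bits each chip contributes */
    unsigned long rows      = 32768;  /* rows per chip                   */
    unsigned long cols      = 1024;   /* columns per chip                */

    unsigned bus_width_bits = chips_per_bank * bits_per_chip;   /* 64-bit bus */
    unsigned long long bank_bits =
        (unsigned long long)rows * cols * bits_per_chip * chips_per_bank;

    printf("bus width: %u bits\n", bus_width_bits);
    printf("bank size: %llu MiB\n", bank_bits / 8 / (1024 * 1024));
    return 0;
}
```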

Data Bus

A bus is a communication system that transfers data between different components in a computer, or between different computers.

A memory bus is the bus that connects the processor (or memory controller) to main memory.

Memory Centric Architectures

The prediction is that memory behaviour will come to dominate the overall performance of the computing system.

These architectures go by many names: smart memories, intelligent memories, intelligent RAM (IRAM), merged DRAM/logic (MDL), processor-in-memory (PIM), etc.

Memory chips have huge internal bandwidth. The connection pins are responsible for degrading the external bandwidth, which is thousands of times slower than the internal bandwidth. Eliminating these connections not only improves the bandwidth but also improves the latency, as logic and storage are closer to each other. To increase the amount of integrated storage, most smart-memory proposals use DRAM instead of SRAM.

Processor In Memory (PIM)

A large array of simple computation elements (typically over 1,000) is built into the DRAM arrays. These processing elements are usually integrated at the outputs of the sense amplifiers and are controlled by a single control unit, as in a SIMD processor. This design strategy can exploit the massive on-chip bandwidth of the DRAM, as the computation elements are integrated directly at the DRAM outputs.

Vector DRAM

This strategy integrates a complete vector processor on a DRAM chip. Unlike the PIM strategy, the computation sits outside the DRAM array, which reduces the peak throughput. However, fewer but more powerful processing elements can be used, as the spatial limitations are not as severe. Consequently it is possible to achieve performance improvements on a larger set of applications.

Multiprocessor-on-a-Chip

The single-chip multiprocessor PPRAM design tries to avoid the central-control-unit bottleneck that has been a problem for other architectures. It does this by integrating several relatively simple, but fully independent, cached RISC processors, each with a reasonable amount of local memory (about 8 MB of DRAM); each such unit is called a processing element or PPRAM node. These nodes are connected by a very high-bandwidth interface (PPRAM-Link) and can be programmed using standard shared-memory or message-passing parallel algorithms. Because of the high-level programmability of these designs, they are more easily programmed for maximum parallelism than the other smart-memory designs. However, the large amount of resources needed for each node limits the total number of parallel nodes to far fewer than the parallel processing elements of the previously described PIM design.

Caches

CPU caches are small pools of memory that store information the CPU is most likely to need next. Which information is loaded into cache depends on sophisticated algorithms and certain assumptions about programming code. The goal of the cache system is to ensure that the CPU has the next bit of data it will need already loaded into cache by the time it goes looking for it (also called a cache hit).
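One concrete way to see this in action is to walk the same array sequentially (which plays along with the cache) and with a large stride (which mostly defeats it). The sizes and the stride below are arbitrary illustrative choices; timing the two loops on a real machine shows the gap.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: sum the same data sequentially (cache-friendly, mostly hits once
 * a line is loaded) and with a large stride (cache-hostile, a new line is
 * touched on almost every access). Sizes and stride are arbitrary. */
#define N (1 << 24)          /* 16M ints, far larger than typical caches */
#define STRIDE 4096          /* jump far enough to defeat spatial locality */

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;

    long seq = 0, strided = 0;
    for (long i = 0; i < N; i++)                 /* sequential walk */
        seq += a[i];
    for (long s = 0; s < STRIDE; s++)            /* strided walk    */
        for (long i = s; i < N; i += STRIDE)
            strided += a[i];

    printf("seq=%ld strided=%ld\n", seq, strided);
    free(a);
    return 0;
}
```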

Cache Miss

A cache miss, on the other hand, means the CPU has to go scampering off to find the data elsewhere. This is where the L2 cache comes into play — while it’s slower, it’s also much larger. Some processors use an inclusive cache design (meaning data stored in the L1 cache is also duplicated in the L2 cache) while others are exclusive (meaning the two caches never share data). If data can’t be found in the L2 cache, the CPU continues down the chain to L3 (typically still on-die), then L4 (if it exists) and main memory (DRAM).
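That walk down the chain can be modelled as a simple sequence of lookups that accumulates latency level by level. The latencies and hit flags below are illustrative assumptions, not figures for any particular CPU.

```c
#include <stdio.h>
#include <stdbool.h>

/* Toy model of the miss chain: try L1, then L2, then L3, then DRAM,
 * accumulating latency along the way. Latencies and hit flags are
 * illustrative assumptions, not measurements of a real CPU. */
struct level { const char *name; int latency_ns; bool hit; };

static int access_latency(const struct level *levels, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += levels[i].latency_ns;   /* pay the cost of checking this level */
        if (levels[i].hit)               /* data found here, stop descending    */
            return total;
    }
    return total;                        /* served from the last level (DRAM)   */
}

int main(void) {
    struct level chain[] = {
        { "L1",    1,  false },
        { "L2",   10,  false },
        { "L3",   30,  false },
        { "DRAM", 100, true  },
    };
    printf("miss all the way to DRAM: %d ns\n", access_latency(chain, 4));
    return 0;
}
```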

Hit Rate

The percentage of accesses that result in cache hits is known as the hit rate or hit ratio of the cache. The alternative situation, when the cache is consulted and found not to contain data with the desired tag, has become known as a cache miss.
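The hit rate and the cost of a miss are often folded into the textbook average memory access time, AMAT = hit time + miss rate × miss penalty. The little helper below just encodes that formula; the input numbers are made up.

```c
#include <stdio.h>

/* Hit ratio and the textbook average-memory-access-time formula:
 *   AMAT = hit_time + miss_rate * miss_penalty
 * The inputs below are made up for illustration. */
static double hit_ratio(long hits, long misses) {
    return (double)hits / (double)(hits + misses);
}

static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    long hits = 970, misses = 30;
    double hr = hit_ratio(hits, misses);
    printf("hit ratio: %.1f%%\n", hr * 100.0);
    printf("AMAT: %.2f ns\n", amat(1.0, 1.0 - hr, 10.0));
    return 0;
}
```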

Tag RAM and Cache Associativity

The tag RAM is a record of all the memory locations that can map to any given block of cache. If a cache is fully associative, it means that any block of RAM data can be stored in any block of cache. The advantage of such a system is that the hit rate is high, but the search time is extremely long — the CPU has to look through its entire cache to find out if the data is present before searching main memory.

At the opposite end of the spectrum we have direct-mapped caches. A direct-mapped cache is a cache where each cache block can contain one and only one block of main memory. This type of cache can be searched extremely quickly, but since it maps 1:1 to memory locations, it has a low hit rate. In between these two extremes are n-way associative caches. A 2-way associative cache (Piledriver’s L1 is 2-way) means that each main memory block can map to one of two cache blocks. An eight-way associative cache means that each block of main memory could be in one of eight cache blocks.
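These mapping rules can be written down directly: a direct-mapped cache derives exactly one candidate slot from an address, while an n-way set-associative cache derives a set holding n candidate slots. The geometry below (64-byte lines, a 32 KiB cache, 2 ways) is an illustrative assumption.

```c
#include <stdio.h>
#include <stdint.h>

/* Where can an address live in the cache? The geometry below (64 B lines,
 * 32 KiB cache, 2-way sets) is an illustrative assumption. */
#define LINE_SIZE   64
#define CACHE_SIZE  (32 * 1024)
#define WAYS        2
#define NUM_LINES   (CACHE_SIZE / LINE_SIZE)   /* 512 lines        */
#define NUM_SETS    (NUM_LINES / WAYS)         /* 256 sets (2-way) */

int main(void) {
    uint64_t addr  = 0x7ffd1234;
    uint64_t block = addr / LINE_SIZE;         /* which memory block */

    /* Direct-mapped: exactly one slot is possible. */
    uint64_t dm_index = block % NUM_LINES;
    uint64_t dm_tag   = block / NUM_LINES;

    /* 2-way set-associative: the block may sit in either way of its set. */
    uint64_t sa_set = block % NUM_SETS;
    uint64_t sa_tag = block / NUM_SETS;

    printf("direct-mapped:  line %llu, tag %llu\n",
           (unsigned long long)dm_index, (unsigned long long)dm_tag);
    printf("2-way set-assoc: set %llu (2 candidate ways), tag %llu\n",
           (unsigned long long)sa_set, (unsigned long long)sa_tag);
    return 0;
}
```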

How cache design impacts CPU performance

The performance impact of adding a CPU cache is directly related to its efficiency or hit rate; repeated cache misses can have a catastrophic impact on CPU performance. The following example is vastly simplified but should serve to illustrate the point.

Imagine that a CPU has to load data from the L1 cache 100 times in a row. The L1 cache has a 1ns access latency and a 100% hit rate. It therefore takes our CPU 100 nanoseconds to perform this operation.

Now, assume the cache has a 99 percent hit rate, but the data the CPU actually needs for its 100th access is sitting in L2, with a 10-cycle (10ns) access latency. That means it takes the CPU 99 nanoseconds to perform the first 99 reads and 10 nanoseconds to perform the 100th. A 1 percent reduction in hit rate has just slowed the CPU down by roughly 10 percent.

In the real world, an L1 cache typically has a hit rate between 95 and 97 percent, but the performance impact of those two values in our simple example isn't 2 percent; it's 14 percent. Keep in mind, we're assuming the missed data is always sitting in the L2 cache. If the data has been evicted from the cache and is sitting in main memory, with an access latency of 80–120ns, the gap widens further: with a 100ns memory latency, the 95 percent case takes roughly 50 percent longer than the 97 percent case.
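The numbers above can be reproduced with a few lines of arithmetic. The latencies are the ones used in this example (1ns L1, 10ns L2), with 100ns taken as a representative point in the 80–120ns main-memory range.

```c
#include <stdio.h>

/* Reproduce the arithmetic above: 100 accesses, 1 ns on a hit, and each
 * miss served either by L2 (10 ns) or by main memory (100 ns, a
 * representative point in the 80-120 ns range quoted above). */
static double total_ns(int accesses, double hit_rate,
                       double hit_ns, double miss_ns) {
    double hits   = accesses * hit_rate;
    double misses = accesses - hits;
    return hits * hit_ns + misses * miss_ns;
}

int main(void) {
    /* Misses served by L2 */
    double t95 = total_ns(100, 0.95, 1.0, 10.0);   /* 95 + 50  = 145 ns */
    double t97 = total_ns(100, 0.97, 1.0, 10.0);   /* 97 + 30  = 127 ns */
    printf("L2 misses:   95%% -> %.0f ns, 97%% -> %.0f ns (%.0f%% slower)\n",
           t95, t97, (t95 / t97 - 1.0) * 100.0);

    /* Misses served by main memory */
    double m95 = total_ns(100, 0.95, 1.0, 100.0);  /* 95 + 500 = 595 ns */
    double m97 = total_ns(100, 0.97, 1.0, 100.0);  /* 97 + 300 = 397 ns */
    printf("DRAM misses: 95%% -> %.0f ns, 97%% -> %.0f ns (%.0f%% slower)\n",
           m95, m97, (m95 / m97 - 1.0) * 100.0);
    return 0;
}
```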

Memory hierarchy

In a memory hierarchy, a processor is connected to a hierarchical set of memories, each of which is larger, slower, and cheaper (per byte) than the memories closer to the processor.

CPU DRAM gap

The performance of the processor-memory interface is characterized by two parameters: latency and bandwidth. Latency is the time between the initiation of a memory request by the processor and its completion. In fact, the problem of the increasing divergence between memory and processor speeds is essentially a problem of growing latency. Bandwidth is the rate at which information can be transferred to or from the memory system.

There are two major classes of techniques to reduce the impact of long memory latencies: latency reduction and latency tolerance. Latency reduction decreases the time between the issue of a memory request and the return of the needed operand. Latency tolerance involves performing other computation while a memory request is being serviced, so that the memory latency for that request is partially or completely hidden. The use and success of these techniques exposes bandwidth limitations, since they speed up the instruction rate and consequently the demand for operands also grows; they may also fetch more items than are actually needed, increasing the absolute amount of memory traffic. Insufficient bandwidth slows the response to processor requests, i.e. it increases the latency. Given the complex interactions between memory latency and bandwidth, it is therefore difficult to determine whether memory-related processor degradation is due to raw memory latency or to insufficient bandwidth (which also increases the latency).

So far, the largest effort to decrease the performance gap between processor and physical memory has been concentrated on efficient implementations of a memory hierarchy. Particular techniques have been developed to reduce the miss rate, the miss penalty and the hit time:

— reducing miss rate: increasing the size of the cache and/or its blocks, higher associativity, victim and/or pseudo-associative caches, hardware prefetching, compiler-controlled prefetching (see the sketch after this list), and compiler optimizations that reduce misses;

— reducing miss penalty: read priority over write on miss, sub-block placement, early restart and critical word first on miss, lockup-free caches, multi-level caches;

— reducing hit time: simple and small caches, pipelined write.
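As a minimal sketch of the software side of prefetching (the compiler-controlled prefetching listed above), GCC and Clang expose the __builtin_prefetch hint. The array size and look-ahead distance below are illustrative assumptions, and real gains depend heavily on the access pattern and the hardware.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal software-prefetching sketch using the GCC/Clang builtin.
 * The array size and look-ahead distance are illustrative assumptions. */
#define N (1 << 20)
#define LOOKAHEAD 16     /* how many elements ahead to prefetch */

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = (int)i;

    long sum = 0;
    for (long i = 0; i < N; i++) {
        if (i + LOOKAHEAD < N)
            /* hint: we will read this address soon, low temporal locality */
            __builtin_prefetch(&a[i + LOOKAHEAD], 0, 1);
        sum += a[i];
    }
    printf("sum=%ld\n", sum);
    free(a);
    return 0;
}
```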

The CPU-DRAM gap levelled out around 2005 with 3 GHz processors; since then, processors have scaled using more cores and hyperthreads, plus multi-socket configurations, all of which put more demand on the memory subsystem. Processor manufacturers have tried to reduce this memory bottleneck with larger and smarter CPU caches and faster memory buses and interconnects. But we are still usually stalled waiting for memory.

References

https://pdfs.semanticscholar.org/6ebe/c8701893a6770eb0e19a0d4a732852c86256.pdf

https://en.wikipedia.org/wiki/Memory_bank


aarti gupta
software under the hood

Distributed computing enthusiast, staff engineer at VMware Inc