
[How it Works] Memory and Performance. Part 1.


Photo by Michael Dziedzic on Unsplash

If you are a software engineer and care about performance, this article is for you. It doesn't really matter which language you prefer: C, C++, Java, Python, Go, etc. What I will cover here should give you an insight into what happens with your computer's memory under the hood. Because basically every program interacts with memory, it is crucial to have a good understanding of it.

I will try to cover it in depth and provide a few examples from the real world, but don't expect this to be a collection of "best practices" and advice. You will be the one giving advice after reading this :). The power is in knowledge, not in blind rule-following.

If you are a beginner, don't worry, I will make sure we cover the basics.

Ok, let's start with the basics. "Computer memory" means many things, so let's classify them by volatility first.

Now we have two types: volatile and non-volatile (and let's politely ignore the semi-volatile type).

Non-volatile memory

Non-volatile memory is computer memory that retains the stored information even when not powered. We will briefly cover HDDs and SSDs in this article.

HDD. Hard Disk Drive.

HDD. You have most probably heard about the HDD (Hard Disk Drive) before: it's a "magnetic disk" that spins crazy fast and can read/write your precious information. Its main advantage over other non-volatile storage types is the balance between capacity and cost.

To read something, the drive has to seek out your file on the surface of the disk with its magnetic heads and start translating magnetic fields into bytes. You may ask: what if my file is huge and spread across the surface of the disk (a condition known as fragmentation)? Well, you will spend more time, because every moment spent waiting for the disk to spin to the next file fragment is wasted time.

On the left: high fragmentation, 5 spins required. | On the right: no fragmentation, less than 1 spin required.

Because of the design, we have the following characteristics:

Fast sequential operations — you will spend some time finding the file, but if the file is not spread out, you can read it in one go!

Slow random operations — reading/writing files at random places on the surface of the disk adds significant overhead, because the magnetic head has to travel to each place, often across multiple spins.

By the way, the time it takes for a magnetic head to move to the desired location is also known as access time. Other memory types don't have magnetic heads, but they still pay an access-time penalty; keep this in mind.
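
To feel the difference yourself, here is a minimal C sketch, assuming a POSIX system and any large placeholder file named data.bin, that reads the same amount of data first sequentially and then from random offsets. The OS page cache will blur the numbers on repeated runs, but on a real HDD the random pass should be dramatically slower:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLOCK_SIZE 4096
#define BLOCKS     1024

/* Wall-clock seconds (POSIX clock_gettime). */
static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Reads BLOCKS blocks of BLOCK_SIZE bytes, either front to back
   or by jumping to a random block on every iteration. */
static double timed_read(FILE *f, int sequential) {
    char buf[BLOCK_SIZE];
    double start = now();
    for (int i = 0; i < BLOCKS; i++) {
        long offset = sequential ? (long)i * BLOCK_SIZE
                                 : (long)(rand() % BLOCKS) * BLOCK_SIZE;
        fseek(f, offset, SEEK_SET);
        fread(buf, 1, BLOCK_SIZE, f);
    }
    return now() - start;
}

int main(void) {
    FILE *f = fopen("data.bin", "rb"); /* placeholder: any large file */
    if (!f) { perror("fopen"); return 1; }
    printf("sequential: %.3fs\n", timed_read(f, 1));
    printf("random:     %.3fs\n", timed_read(f, 0));
    fclose(f);
    return 0;
}
```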

SSD. Solid State Drive.

SSD (Solid State Drive). This one doesn't have any physical spinning disks or moving parts, so you can expect better access times and therefore superior random read/write performance. Generally speaking, SSDs are around 10x faster than HDDs. Of course, this speed comes at a price: for the same money you will get less capacity than with an HDD.

Did you know that some computer games duplicate data (typically textures) on disk to make sure the HDD keeps reading sequentially? Knowing this small thing helps developers improve performance significantly. This technique isn't optimal for SSDs, however, because their access-time overhead is small.

Volatile memory

Volatile memory is computer memory that requires power to maintain the stored information. We will talk about RAM a lot.

RAM. Random Access Memory.

RAM (Random Access Memory). As the name suggests, this one allows information to be read or written in almost the same amount of time, irrespective of the physical location of the data inside the memory. Typically, it is around 1000x faster than an SSD/HDD. It's also super expensive compared to an HDD/SSD and offers a hundred times less capacity.

Let’s recap memory characteristics:

Capacity — it’s about how much data we can store.

Volatility — how the memory behaves when there is no power.

Access time — how much time we need to find the physical location of our data.

Fragmentation — how much our data is broken up into many pieces that are not close together physically.

To go further, we need to agree on a way to represent memory visually. The flat memory model (or linear memory model) is the most common memory-addressing paradigm; we have built a lot of abstraction layers on top of it, but be patient, we will cover those later. In modern computer architectures, one memory address points to one byte of data. Here is how it looks:

Linear memory model.

Wait a second! If one byte of information is associated with an address, then the address itself should take some space in memory, right? If we spend 1 byte to store the address and 1 byte to store the information itself, then 50% of memory is wasted just on keeping track of addresses. It's madness!

Yes it is! That's why nobody does that :). Instead, modern computers simply count memory cells. A memory address is just the number of bytes the CPU has to skip from the beginning of memory to reach the one it's looking for. For example:

  • To access the first byte, it has to skip 0 bytes, so the first byte’s address is 0.
  • To access the second byte, it has to skip 1 byte, so its address is 1.
  • (and so on…)

You got the idea, right?
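
You can even watch this counting from C: printing the address of each byte in an array shows that every address is just the start address plus the number of bytes skipped. A minimal sketch (the concrete addresses will differ on every run):

```c
#include <stdio.h>

int main(void) {
    unsigned char bytes[4] = {10, 20, 30, 40};

    /* Each element's address is the array's start address plus
       the number of bytes skipped to reach that element. */
    for (int i = 0; i < 4; i++)
        printf("bytes[%d] lives at %p (start + %d)\n",
               i, (void *)&bytes[i], i);
    return 0;
}
```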

Okay, but I heard about limitations, like 32-bit CPUs not being able to address more than 4 GB of RAM. Why is that?

Oh, I like your questions. The thing is that when you instruct your CPU to do something, you have to give it an instruction with a parameter. Let's imagine we are executing "goto 1000", which means "go to memory address 1000". The CPU has to store this number somewhere, then pick up the instruction and execute it. This storage cell is called a CPU register, and the capacity of that register is limited to 32 bits. 32 bits can represent 2³² (or 4,294,967,296) distinct values, enough to address exactly 4 GB of bytes. This means you can't tell your CPU to go to a memory address above that number. You can install more RAM, but the CPU will not be able to read/write anything above that limit. Luckily, most modern computers have a 64-bit architecture (even if, in practice, the physical address lines are fewer than 64), which means they can address up to 16 exabytes (2⁶⁴ bytes) of RAM (sadly, that much RAM costs a lot; don't expect to see it in your average home PC).
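
Here is the arithmetic as a quick C sketch (nothing more than the power-of-two computation above, plus the pointer size of whatever machine you run it on):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* A 32-bit register holds 2^32 distinct values, so it can
       address 2^32 bytes = 4 GiB. */
    uint64_t addresses = 1ULL << 32;
    printf("32-bit: %llu addressable bytes = %llu GiB\n",
           (unsigned long long)addresses,
           (unsigned long long)(addresses >> 30));

    /* Pointer size on this machine: 8 bytes on a 64-bit build,
       4 bytes on a 32-bit one. */
    printf("sizeof(void *) here: %zu bytes\n", sizeof(void *));
    return 0;
}
```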

Good to know: memory access to RAM is relatively slow from the CPU's perspective and may take a number of clock ticks to complete. Access to registers is very fast (usually one clock tick). To overcome this issue, modern CPUs use a memory cache.

But what if my program needs to use numbers higher than 255 (1 byte)? How will those be stored in RAM?

Not a problem! All you need to know is the memory address where your integer starts and how many bytes it takes. In the image below, you can see how a 2-byte integer is stored in memory: we start reading the data at address 0x6000 and stop at address 0x6001.
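
We can peek at this directly from C. A minimal sketch: a 2-byte integer occupies two consecutive addresses, which we can inspect byte by byte through a char pointer. (The addresses will of course differ from the picture's 0x6000, and the order of the two bytes depends on your CPU's endianness.)

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t value = 0x1234;                 /* a 2-byte integer */
    unsigned char *p = (unsigned char *)&value;

    /* The integer spans the addresses &value and &value + 1. */
    printf("starts at %p, takes %zu bytes\n",
           (void *)&value, sizeof value);

    /* On a little-endian CPU (e.g. x86) this prints 34 12:
       the least significant byte is stored first. */
    printf("byte 0: %02x, byte 1: %02x\n", p[0], p[1]);
    return 0;
}
```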

Great! Now we know the basics: the different memory types and the main characteristics of memory. We also learned a bit about the linear memory model and how the CPU accesses it. Let's move on to programs.

Let's start with the basic workflow of a program and see how it works from a memory perspective. But before that, how are these programs made? Generally, you have simple text files which we call source files; then, using your language-specific compiler, you compile them into an executable file.

Hey, I am a programmer, I obviously know it!

Okay, okay… The executable file contains instructions readable by your OS and CPU. There are two main pieces in the executable file: data (global variables, constants, and other static data) and code (your compiled functions).
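
As a tiny C sketch of what goes where (the exact section names and layout depend on your toolchain; on Linux, tools like size or readelf let you inspect these pieces in a real binary):

```c
#include <stdio.h>

/* These end up in the data piece of the executable: */
const char banner[] = "hello";   /* constant, static data */
int counter = 42;                /* global variable       */

/* This compiles into the code piece: */
int add(int a, int b) { return a + b; }

int main(void) {  /* also code */
    printf("%s %d %d\n", banner, counter, add(1, 2));
    return 0;
}
```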

OK, clear… But how does my program end up in RAM?

I am glad you asked! This part of the job is handled by the operating system's loader. When the user runs the executable file, "the loader" allocates memory for your program and then copies it right into main memory (RAM). But it isn't all that simple! We need to go deeper! Typically, the OS loader splits the allocated memory into a few memory segments.

There are four memory segments which we will cover: code (obviously, your compiled functions), static (that data part of the executable file), stack, and heap.

Program Memory Segments.
  • Code segment: usually marked as read-only by the loader (this depends on the OS), so that the CPU cannot overwrite this address range. If an attempt is made to change this segment, the OS may detect it and abort the program immediately.
  • Static segment: the size of this segment is fixed, and it persists in memory for as long as the program is running, whether you use it or not. In contrast to the previous segment, this one can be changed during execution.
  • Stack segment (aka call stack): if the static segment contains global variables and constants, this one contains local variables. This segment is broken into many stack frames. A stack frame is pushed onto the stack before its function runs and removed after the function finishes.
Call stack/Stack segment

A stack frame contains all the information needed for one function call.

A stack frame contains the function arguments, local variables, and the return address.

A function call is broken into a few steps:

  • Allocate memory for a stack frame
  • Push the stack frame into the call stack
  • Run the function the frame was created for
  • Pop the executed stack frame

As you can see, programmers don't have to worry much about this: the memory is allocated and freed automatically. Please note, the size of a frame is fixed!
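
You can observe stack frames from C itself. In this sketch, each nested call gets its own frame, so each call's local variable lives at a different address (whether the addresses grow downward or upward depends on the platform; on most, the stack grows downward):

```c
#include <stdio.h>

/* Each call to level() gets a fresh stack frame, so `local`
   has a different address at every depth. */
static void level(int depth) {
    int local = depth;
    printf("depth %d: local at %p\n", depth, (void *)&local);
    if (depth < 3)
        level(depth + 1);
    /* Returning pops this frame; `local` ceases to exist. */
}

int main(void) {
    level(1);
    return 0;
}
```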

Hmm, does that mean calling functions is more expensive than writing everything as a single flow of instructions?

Yes, it has a bit of overhead, but don't be afraid to split your logic into small functions: modern compilers are smart and will often "inline" functions, so the overhead ranges from zero to minimal. Still, it's good to be aware that going "abstract" will not go unnoticed performance-wise. (Maybe one day I will write an article on how to design your code to make use of the CPU cache and other advanced techniques like branchless programming, but for now we are talking about memory.)
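
For illustration, here is the kind of function an optimizing compiler will happily inline. This is a sketch, not a guarantee: whether inlining actually happens depends on the compiler and optimization level (e.g. -O2 in gcc/clang):

```c
#include <stdio.h>

/* Small and called in a hot loop: a prime inlining candidate.
   The compiler will likely replace each call with the body,
   skipping the stack-frame push/pop entirely. */
static inline int square(int x) {
    return x * x;
}

int main(void) {
    int sum = 0;
    for (int i = 0; i < 10; i++)
        sum += square(i);   /* likely compiled as sum += i * i */
    printf("%d\n", sum);
    return 0;
}
```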

  • Heap segment: this part is designed for cases when programmers need to allocate large and dynamically resizable data structures, something which is not really possible in the stack or static segments. We control the lifecycle of the data and decide when we want to allocate and free the memory. Heap management is a very complex topic; programming languages implement different memory managers. These usually try to allocate a continuous block of memory, which is later split into smaller chunks as you request memory reservations. In a nutshell, the manager keeps track of which chunks are used and which are free. If, during program execution, the memory manager allocates and frees chunks of memory in an unfortunate sequence, you may end up with a very fragmented heap segment.
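
In C, that lifecycle is the classic malloc()/realloc()/free() interface: we decide when the memory appears, how it grows, and when it is gone. A minimal sketch:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Ask the memory manager for a chunk of the heap... */
    int *data = malloc(100 * sizeof *data);
    if (!data) return 1;
    data[0] = 42;

    /* ...grow it on demand (impossible for a stack array)... */
    int *bigger = realloc(data, 1000 * sizeof *bigger);
    if (!bigger) { free(data); return 1; }
    data = bigger;

    /* ...and decide ourselves when its lifetime ends. */
    printf("%d\n", data[0]);
    free(data);
    return 0;
}
```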

Hey, fragmentation shouldn't be an issue, since we have fast access times in RAM, right?

Well… yes and no. Here is an example of when RAM fragmentation may become a problem.

High memory fragmentation

We have 400 kB of used memory and 400 kB of unused memory in total. Now, imagine your program requests an array which takes 200 kB. Even though you have 400 kB of free memory in total, you can't find a contiguous chunk of 200 kB or bigger, and you will get a memory allocation failure. A solution to this is memory compaction. Memory what? Memory compaction! This is the process of relocating information blocks in main memory in order to maximize the available contiguous free space. But keep in mind, to do this the memory manager has to copy the live chunks around: your program must stop its execution while the memory is reorganized. Obviously, this is quite expensive from a performance point of view.

Memory after compaction.

As you can see, your programs can suffer from serious heap fragmentation, leading to low memory utilization, degraded performance, and application failure due to memory exhaustion.
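
Here is the allocation pattern from the picture above as a C sketch: eight 100 kB chunks with every other one freed leaves 400 kB free, but no hole bigger than 100 kB. Fair warning: on a 64-bit OS with virtual memory the final 200 kB malloc() will usually still succeed, because the allocator simply asks the OS for more address space; on a small fixed-size heap, though, this is exactly how allocation fails:

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNKS   8
#define CHUNK_KB 100

int main(void) {
    char *chunks[CHUNKS];

    /* Allocate 8 x 100 kB, then free every other chunk: 400 kB
       is free in total, yet no single hole exceeds 100 kB. */
    for (int i = 0; i < CHUNKS; i++)
        chunks[i] = malloc(CHUNK_KB * 1024);
    for (int i = 0; i < CHUNKS; i += 2) {
        free(chunks[i]);
        chunks[i] = NULL;
    }

    /* A 200 kB request cannot be served from any of the holes. */
    char *big = malloc(200 * 1024);
    printf("200 kB allocation %s\n", big ? "succeeded" : "failed");

    free(big);
    for (int i = 1; i < CHUNKS; i += 2)
        free(chunks[i]);
    return 0;
}
```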

Does this mean that "immutability" leads to very high fragmentation and puts a lot of pressure on memory managers?

Most probably yes, so use it rationally: it's a good solution for multithreading for a number of reasons, but excessive memory allocations will have a significant impact.

Remember the CPU memory cache mentioned before? Memory lookups, which are expensive from the CPU's perspective, can be served from the cache, but the cache itself has a limited capacity. High RAM fragmentation may exhaust that cache. We will cover this again when we learn about memory pages.
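
To see the cost of scattered data from C, this sketch (error handling and cleanup omitted for brevity) sums the same values twice: from a contiguous array and from a linked list of individually allocated nodes. Pointer chasing defeats the cache and the prefetcher; on a genuinely fragmented heap, where the nodes are far apart, the gap gets even wider:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

struct node { struct node *next; long value; };

int main(void) {
    /* The same N values stored twice: contiguously in an array,
       and in a linked list of individually malloc()-ed nodes. */
    long *arr = malloc(N * sizeof *arr);
    struct node *head = NULL;
    for (long i = 0; i < N; i++) {
        arr[i] = i;
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }

    /* Sequential scan: one fetched cache line serves several
       elements, and the CPU can prefetch ahead. */
    clock_t t0 = clock();
    long sum1 = 0;
    for (long i = 0; i < N; i++) sum1 += arr[i];

    /* Pointer chasing: the next address is unknown until the
       current node is loaded, so scattered nodes mean misses. */
    clock_t t1 = clock();
    long sum2 = 0;
    for (struct node *n = head; n; n = n->next) sum2 += n->value;
    clock_t t2 = clock();

    printf("array: %ld in %.3fs | list: %ld in %.3fs\n",
           sum1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           sum2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```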

Languages like Java, Go, and Python have garbage collectors, some of which (notably Java's) run cycles of memory compaction. Languages like C/C++ have different memory allocators/managers which attempt to solve the same problem. Even if it looks like the problem is magically solved by memory managers and garbage collectors, you need to remember that compacting the memory involves additional copy operations, scans, and long pauses!

That's it for the first part. In the next parts we will learn about memory alignment, memory pages, memory mapping, caching, and a few performance tricks. If you are interested in the second part, please let me know in the comments. What do you think about the level of detail? Is it too superficial, or too complex and hard to follow? I will take your feedback into account for the second part. Also, feel free to suggest the next topic if you are interested in something.

Thank you for reading this!
