An Overview of Operating Systems

Karan Shukla
14 min read · Sep 26, 2017


With modern programming languages and advanced frameworks, it’s so easy to abstract away the lower-level details of the operating system. For many programmers, highly-abstracted JavaScript or Python is all they’ve ever known. But highly-performant systems require deeper knowledge of the underlying OS. Deep learning, quantitative analytics, cloud computing — many of the hottest areas in tech today utilize low-level programming to create incredibly efficient systems.

Operating Systems was one of my most useful classes last year, and I regret waiting so long in my college career to take it. I was recently reading my notes and realized it would be really useful to make a hyper-condensed summary giving a high-level overview of the most important concepts in operating systems. Hopefully, this document will be useful for students who are taking OS for the first time, or for those of us who just need a quick refresher.

I will try to discuss these ideas in an order that is easy to follow, but I may discuss some ideas before formally introducing them. It’s very hard to teach operating systems in a linear manner because so many of the components interact with each other in a non-hierarchical manner. Nevertheless, these are all introductory concepts, so they shouldn’t be difficult to understand.

The kernel is the operating system’s core; it acts as an abstraction layer between resources (like the CPU, RAM, and I/O devices) and the processes/applications that use them. The kernel handles resource management so that programmers don’t need to, but programmers can access kernel functions using system calls. This is what operating systems are all about.
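
For a concrete (if tiny) example, here’s a C sketch that asks the kernel to print a message by invoking the write() system call, a thin user-space wrapper around the kernel’s write entry point:

    #include <unistd.h>

    int main(void) {
        const char msg[] = "hello from user space\n";
        /* File descriptor 1 is standard output; the kernel, not our code,
         * actually talks to the terminal device on our behalf. */
        write(STDOUT_FILENO, msg, sizeof(msg) - 1);
        return 0;
    }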

The CPU is the processing unit — it’s the component of the hardware that processes (AKA programs) run on.

The memory, also known as the RAM, consists of very fast data storage. Executing processes and their relevant data are stored here.

The hard disk is slower than memory, but much cheaper. Because it’s cheaper, there’s often much more space to store programs and data, but actually accessing that data from the hard disk is very slow.

We’ll get to devices later.

The stack, also referred to as the call stack, is the section of RAM containing locally-scoped data. When a function foo() is executed, data created by foo() is thrown onto the stack, and is only accessible in the scope of foo(). If foo() calls another function bar(), a new portion of the stack is created for bar(), and foo()’s portion becomes inaccessible within the scope of bar(). When bar() terminates, its stack data is popped off with it, and the scope returns to foo()’s portion of the stack. The operating system uses the call stack to keep track of a program’s control flow.

The heap, on the other hand, is the section of RAM containing dynamically-allocated data created with malloc(). This data is globally accessible in a process, and is persistent beyond the lifespan of the function that created it, unlike stack data. Note that the heap's data persistence can lead to a memory leak if never de-allocated with free().
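
Here’s a rough C illustration of the difference (the names are just for the example):

    #include <stdlib.h>

    /* Stack allocation: freed automatically when the function returns. */
    int on_the_stack(void) {
        int local = 42;              /* lives in this function's stack frame */
        return local;                /* the value is copied out; the frame then disappears */
    }

    /* Heap allocation: persists until free() is called. */
    int *on_the_heap(void) {
        int *p = malloc(sizeof *p);  /* lives on the heap */
        if (p) *p = 42;
        return p;                    /* still valid after this function returns */
    }

    int main(void) {
        int *p = on_the_heap();
        /* ... use *p ... */
        free(p);                     /* forgetting this is a memory leak */
        return 0;
    }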

Processes are executing programs. They are typically independent of each other, meaning they don’t share memory space with each other. Threads are lightweight executing entities within a process that share the process’s memory space. This means they share the same heap and execution code, but they do not share the stack.
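
A minimal POSIX threads sketch of that sharing (the counter here is a global standing in for any shared heap or global data, and the increment is deliberately left unsynchronized; mutexes are discussed later in the article):

    #include <pthread.h>
    #include <stdio.h>

    int shared_counter = 0;          /* one copy, visible to every thread in the process */

    void *worker(void *arg) {
        int local = 0;               /* each thread gets its own copy on its own stack */
        local++;
        shared_counter++;            /* shared, and unsynchronized! (see mutexes later) */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared_counter = %d\n", shared_counter);
        return 0;
    }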

Parallelism is what it sounds like — two or more processes are running at the exact same time. For this to occur, a computer needs to have multiple CPUs operating simultaneously — this is known as multiprocessing. A single CPU can only simulate parallelism, but cannot truly provide it.

Amdahl’s Law states that the potential speedup due to parallelization for a system is 1/((1-p)+(p/n)), where n is the number of processors and p is the proportion of the system that is parallelizable. As you add more and more processors, the maximum speedup begins to approach 1/(1-p). Of course, due to the overhead of adding more CPUs, this upper bound is difficult to reach.
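
As a quick worked example: with p = 0.9 and n = 4, the speedup is 1/(0.1 + 0.225), or about 3.08x, while the ceiling is 1/(1 - 0.9) = 10x no matter how many CPUs you add. In C:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - p) + p / n) */
    double amdahl_speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        printf("%.2f\n", amdahl_speedup(0.9, 4));   /* prints 3.08 */
        return 0;
    }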

Concurrency just means that two programs are running in the same timeframe, but not necessarily at the exact same time. Concurrency can occur on a single CPU with multiprogramming — having multiple programs in memory and rapidly switching between them via round-robin scheduling or some other CPU scheduling algorithm. Timesharing is a specific application of multiprogramming that allows a single computer to support multiple users by rapidly switching the CPU between them.

For a process to run, it must be loaded into memory from the hard disk, an operation known as paging. Since a process must be in memory to run, the number of processes we can have running in a given timeframe appears to be limited by the size of our RAM. However, we can artificially expand the amount of accessible memory using virtual memory.

Virtual memory takes advantage of the fact that processes usually don’t use all of their data at once. Instead of importing a process’s entire memory space, we could just import its working set — the set of pages being used by a process at some moment in time. The process of only fetching the pages needed for immediate use by a process is known as demand paging. This reduces the amount of memory consumed by each program, allowing more programs to fit into memory and improving the OS’s multiprogramming.

If a program’s working set is too big to fit in memory, pages will need to be continuously moved in and out of memory, greatly slowing down the system. This is known as thrashing, and is extremely undesirable. To avoid thrashing, we use page replacement algorithms that intelligently minimize the number of times pages need to be replaced. A good page replacement algorithm should evict pages that aren’t likely to be referenced in the future. A page is likely to be referenced in the future if it has been referenced in the immediate past. This is known as temporal locality. We would also prefer not to evict pages that have been modified, because that means we will need to write those modifications to the disk.

Least-Recently-Used (LRU) is a very simple page replacement algorithm; simply replace the page that was used least recently. This is inefficient, as it requires the system to record every time a page is used. Instead, LRU is often approximated with the “second-chance” algorithm.

Second Chance gives each page a reference bit that is set to 1 whenever a page is referenced. When a page needs to be evicted, iterate through all the pages, setting any reference bits that are 1 to 0, until a page is found whose reference bit is already 0. Evict that page. This algorithm prioritizes pages that have not been referenced recently.
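
Here’s a rough C sketch of that scan (often drawn as a “clock” sweeping over the frames); the frame table and reference bits are simplified stand-ins for what the hardware and OS actually maintain:

    #define NFRAMES 8

    struct frame {
        int page;        /* which page currently occupies this frame */
        int referenced;  /* set to 1 by the hardware whenever the page is accessed */
    };

    static struct frame frames[NFRAMES];
    static int hand = 0;             /* the "clock hand": where the last scan stopped */

    /* Pick a frame to evict using the second-chance policy. */
    int choose_victim(void) {
        for (;;) {
            if (frames[hand].referenced == 0) {
                int victim = hand;               /* bit already 0: evict this one */
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            frames[hand].referenced = 0;         /* bit was 1: clear it (second chance) */
            hand = (hand + 1) % NFRAMES;
        }
    }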

Enhanced Second Chance adds to each page a modify bit that is initialized to 0 and set to 1 whenever a page is modified. When a page needs to be evicted, iterate through all the pages, setting any reference bits that are 1 to 0, until a page is found whose reference bit and modify bit are both 0. Evict that page. If no unmodified, unreferenced page is found after iterating over all pages, iterate again, this time searching for a page whose reference bit is 0 and whose modify bit is 1. This algorithm prioritizes pages that have not been referenced or modified recently.

Note that a “page” is a logical unit used by a program. A frame is the physical equivalent of a page, and is an actual unit of storage in memory and hard disk. A page table is used to map between pages and frames. In addition to the page table is the Translation Lookaside Buffer (TLB), a high-speed cache for pages. When a process wants to access a page, it queries the TLB first, then the page table. If the page is not in the page table, a page fault occurs, and the process must retrieve the frame from disk and store it as a page. This is an I/O action, and thus activates a trap interrupt, slowing down the system.
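
A simplified sketch of that lookup order; the table sizes and the translate() function here are illustrative, not a real MMU interface:

    #define TLB_SIZE    16
    #define NPAGES      256
    #define NOT_PRESENT (-1)

    static int tlb_page[TLB_SIZE];   /* which page each TLB slot caches (NOT_PRESENT if empty) */
    static int tlb_frame[TLB_SIZE];  /* the frame that page maps to */
    static int page_table[NPAGES];   /* frame number, or NOT_PRESENT */

    int translate(int page) {
        for (int i = 0; i < TLB_SIZE; i++)      /* 1. fast path: TLB hit */
            if (tlb_page[i] == page)
                return tlb_frame[i];

        if (page_table[page] != NOT_PRESENT)    /* 2. slower path: walk the page table */
            return page_table[page];

        /* 3. Page fault: trap into the kernel, read the frame in from disk,
         *    update the page table and TLB, then retry the access (omitted here). */
        return NOT_PRESENT;
    }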

A trap is a type of interrupt caused by a program exception. An interrupt is a signal that halts the currently-executing process to execute a piece of code called an interrupt handler. One major source of interrupts is I/O. For a kernel to access I/O devices, the devices must be connected to a Peripheral Component Interconnect (PCI) bus, and the kernel must have the appropriate device driver software installed. When an I/O device is ready to interact with the system (for example, when a monitor has an update to display, or when a keyboard has registered a keystroke), the device sends an interrupt to the CPU scheduler, prompting the scheduler to switch to the corresponding I/O process. I/O-bound processes are those processes that primarily interact with I/O devices, while CPU-bound processes are processes that primarily do calculations in the CPU. Because I/O-bound processes spend the majority of their time waiting for I/O interrupts, efficient CPU scheduling algorithms schedule CPU-bound processes while the system waits for I/O. When an I/O interrupt arrives, the CPU scheduler switches to the I/O process, executes it, and switches back to a CPU-bound process.

A Direct Memory Access (DMA) Controller can improve the performance of the overall system by bypassing the CPU and transferring data directly between I/O devices and memory. Another way to improve performance with I/O is with spooling. A buffer, known as a “spool”, can be used by processes to pick up or drop off I/O data. The devices then read data from the spool at their own pace, preventing slow devices from slowing down the CPU’s processes. This is also useful for devices such as printers that are accessed by multiple processes.

Let’s go back to discussing the CPU scheduler. CPU scheduling algorithms are one of the most important aspects of operating system performance. Different CPU scheduling algorithms optimize for different factors, such as throughput, waiting time, turnaround time, response time, and average completion time. A few common CPU scheduling algorithms are discussed below.

First-In-First-Out schedules processes based on when they arrived in the system. This can lead to the convoy effect, where a short process is scheduled after a long process and thus has to wait a long time to complete. This is undesirable for users, since they would like short tasks to be finished instantly. Shortest-Job-First prioritizes processes by their predicted duration. This can lead to longer processes being indefinitely delayed as new short processes are added to the system, and so is also undesirable. Round-Robin rotates the CPU between processes without any sort of prioritization. This is an effective algorithm, provided that the time slice allocated isn’t too short (excess overhead) or too long (convoy effect). Some algorithms also have an aging mechanism where older processes get an increase in priority to speed up their completion.
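
A toy round-robin simulation in C, with made-up burst times and a made-up time slice, just to show the rotation:

    #include <stdio.h>

    int main(void) {
        int remaining[] = {5, 3, 8};     /* remaining CPU time for P0, P1, P2 */
        int n = 3, quantum = 2, clock = 0, unfinished = 3;

        while (unfinished > 0) {
            for (int i = 0; i < n; i++) {
                if (remaining[i] == 0) continue;           /* already done */
                int slice = remaining[i] < quantum ? remaining[i] : quantum;
                clock += slice;
                remaining[i] -= slice;
                printf("t=%2d: ran P%d for %d unit(s)\n", clock, i, slice);
                if (remaining[i] == 0) unfinished--;
            }
        }
        return 0;
    }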

If too many processes are running, main memory can fill up. In this case, a process can be swapped out from memory onto the hard disk, in a similar manner to paging. Because a process’s entire memory space has to be moved, this is very inefficient, and is typically used as a last resort if demand paging is insufficient.

Processes running concurrently can also run into conflicts, such as read-write or write-write conflicts, if they access shared resources like data or devices. A critical section is the part of a program where these shared resources are accessed. No two processes accessing the same shared resource should enter their critical sections at the same time.

The problem of avoiding concurrent access of shared resources is known as the critical section problem. Solutions to this problem must meet three criteria:

  1. Mutual Exclusion (safety) — We don’t want processes simultaneously accessing data, or accessing data when they shouldn’t.
  2. Progress (liveness) — The system should be continuously in a state of motion, meaning that work is being done and programs are progressing in their execution.
  3. Bounded Waiting (fairness) — Each process should get an opportunity to execute its work; it should not be possible to eternally lock a process out from ever executing.

Synchronization mechanisms, such as mutexes and semaphores, are a very common solution to the critical section problem.

A mutex, lock, or binary semaphore is a variable that can be acquired or released by a process. Once acquired, the next process that tries to acquire the same mutex will be forced to wait until the mutex is released. Each shared resource should have its own mutex. Any process that accesses a shared resource should acquire the mutex before entering its critical section, and should release the mutex after exiting its critical section. This is a very simple synchronization mechanism that meets the three criteria if used correctly.
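
A minimal sketch with a POSIX mutex; the balance variable stands in for any shared resource:

    #include <pthread.h>

    static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;
    static long balance = 0;         /* the shared resource this mutex protects */

    void deposit(long amount) {
        pthread_mutex_lock(&balance_lock);    /* acquire: enter the critical section */
        balance += amount;                    /* only one thread can be here at a time */
        pthread_mutex_unlock(&balance_lock);  /* release: exit the critical section */
    }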

While mutexes can only represent a single copy of a resource, semaphores (also referred to as counting semaphores) can represent multiple copies of a resource. They are implemented as counters initialized to the number of copies of a resource available. When a process wants to enter a critical section, it decrements the counter and, if the counter goes below 0, enters a waiting state. When a process exits its critical section, it increments the counter and, if the counter is still 0 or less (meaning other processes are waiting), wakes up a waiting process.

Let’s look at an example of semaphores. One of the most common methods of inter-process communication is via a message queue, also known as the producer-consumer or publisher-subscriber model, where a producer process inputs data to a buffer and a consumer process reads data from that buffer. Note that the buffer has a limited size. This introduces the bounded-buffer problem: if the producer wants to transmit data while the buffer is full, it must go to sleep until the buffer is no longer full. Similarly, if a consumer tries to read from an empty buffer, it must go to sleep until the buffer is no longer empty. This problem can be solved using two semaphores, full and empty. Initialize full to 0 and empty to the size of the buffer. Before a producer can add to the buffer, it must decrement empty; after adding to the buffer, it increments full. Similarly, before removing from the buffer, a consumer must decrement full; after removing from the buffer, it increments empty. This solution prevents producers from adding to a full buffer, or consumers from reading from an empty one.
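
Here’s a sketch of that solution using POSIX semaphores, plus a mutex to protect the buffer indices themselves (the buffer size and names are just for the example):

    #include <pthread.h>
    #include <semaphore.h>

    #define BUF_SIZE 8

    static int buffer[BUF_SIZE];
    static int in = 0, out = 0;
    static sem_t empty, full;                        /* counting semaphores */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void buffer_init(void) {
        sem_init(&empty, 0, BUF_SIZE);   /* all slots start out free */
        sem_init(&full, 0, 0);           /* no items yet */
    }

    void produce(int item) {
        sem_wait(&empty);                /* sleep if the buffer is full */
        pthread_mutex_lock(&lock);
        buffer[in] = item;
        in = (in + 1) % BUF_SIZE;
        pthread_mutex_unlock(&lock);
        sem_post(&full);                 /* one more item is available */
    }

    int consume(void) {
        sem_wait(&full);                 /* sleep if the buffer is empty */
        pthread_mutex_lock(&lock);
        int item = buffer[out];
        out = (out + 1) % BUF_SIZE;
        pthread_mutex_unlock(&lock);
        sem_post(&empty);                /* one more slot is free */
        return item;
    }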

Unfortunately, synchronization mechanisms can cause a system to enter a deadlock when used poorly. A deadlock is when a circular chain of processes are each holding onto a mutually-exclusive resource that is required by the next process in the chain. None of the processes can execute because they have no way of forcibly taking the resources they need, so the system grinds to a halt.

One simple pitfall, closely related to deadlock, is priority inversion, where a high-priority process is waiting on a resource held by a low-priority process. Because the low-priority process will not get any CPU allocated to it, it won’t be able to complete execution, preventing the high-priority process from running. This is solved with the priority inheritance protocol: the low-priority process temporarily inherits the high priority (in effect, the high-priority process donates its CPU allocation), so the low-priority process can finish and then give up the resource that the high-priority process is waiting on.

Most deadlock solutions are not that simple. A more general deadlock solution is Banker’s Algorithm, which keeps track of each process’s maximum resource requests. The algorithm only allows a process to obtain a resource if the system will be in a safe state afterwards; that is, if there exists some sequence of processes such that each process can obtain its maximum resources, then terminate and return all of its resources. This algorithm is limited by the fact that it requires knowledge of each process’s maximum resource request, which is usually not known in real-world situations. Banker’s Algorithm is known as a deadlock avoidance algorithm because it tries to avoid ever entering a deadlocked state. Solutions that detect deadlocked states after the fact and then try to back out of them are known as deadlock detection and recovery algorithms.
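
A sketch of the safety check at the heart of Banker’s Algorithm, assuming the per-process maximum needs are known up front (the process and resource counts here are illustrative):

    #include <stdbool.h>

    #define P 5   /* number of processes */
    #define R 3   /* number of resource types */

    /* need[i][j]: how much more of resource j process i may still request.
     * allocation[i][j]: how much of resource j process i currently holds. */
    bool is_safe(int available[R], int allocation[P][R], int need[P][R]) {
        int work[R];
        bool finished[P] = { false };
        for (int j = 0; j < R; j++) work[j] = available[j];

        for (int done = 0; done < P; ) {
            bool progressed = false;
            for (int i = 0; i < P; i++) {
                if (finished[i]) continue;
                bool can_finish = true;
                for (int j = 0; j < R; j++)
                    if (need[i][j] > work[j]) { can_finish = false; break; }
                if (can_finish) {
                    /* Pretend process i runs to completion and returns everything it holds. */
                    for (int j = 0; j < R; j++) work[j] += allocation[i][j];
                    finished[i] = true;
                    progressed = true;
                    done++;
                }
            }
            if (!progressed) return false;   /* no safe sequence exists */
        }
        return true;   /* every process can finish in some order: the state is safe */
    }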

Let’s jump to another aspect of operating systems — file system management (sorry, I couldn’t think of a smoother transition). Blocks are the unit of storage in a file system. Each block consists of some fixed number of bits. There are several block allocation methods for file systems, the most obvious being contiguous allocation — simply place all of a file’s blocks adjacent to one another. Unfortunately, this creates fragmentation, meaning that, over time, as small files are deleted, gaps will appear in the file system that can only be filled by other small files. This is wasteful, since it means that deleting two 1-block small files may not provide sufficient space to add one 2-block file. In this case, the data needs to be compacted — that is, the blocks need to be squished together to remove the empty gaps and add more contiguous space for large files. This is a slow process.

Linked allocations help us avoid compaction. Rather than requiring files to be composed of adjacent blocks, each block can contain a pointer to the next block in the file. This solves the fragmentation problem by allowing a file’s blocks to be spread across storage. A third allocation method is to give each file an index block with pointers to all the other blocks. This allows for nonsequential access, which is useful for files with nonsequential access patterns, but is less useful for sequential data (like media files).

In addition to keeping track of file-block allocations, we would also like to keep track of free blocks for quick access and allocation. A simple solution is to represent all blocks in the file system with a bit vector, where 1 indicates occupied and 0 indicates free. This consumes little space. Another solution is to treat the free blocks as a linked list, so that each free block links to the next one. However, this slows down traversal. A faster solution is to split free-space blocks into groups of size n, where the first block is an index block and the other (n-1) blocks are empty. The index block has n pointers — one for each of the other blocks in the group, and one for the next group’s index block. Essentially, we can traverse over n blocks at a time instead of traversing block-by-block. This is much faster.
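
A small C sketch of the bit-vector approach (the block count is illustrative):

    #include <stdint.h>

    #define NBLOCKS 1024

    static uint8_t bitmap[NBLOCKS / 8];   /* one bit per block: 1 = occupied, 0 = free */

    int find_free_block(void) {
        for (int b = 0; b < NBLOCKS; b++)
            if (!(bitmap[b / 8] & (1 << (b % 8))))
                return b;                 /* first free block found */
        return -1;                        /* no free space left */
    }

    void mark_allocated(int b) { bitmap[b / 8] |=  (uint8_t)(1 << (b % 8)); }
    void mark_free(int b)      { bitmap[b / 8] &= (uint8_t)~(1 << (b % 8)); }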

When dynamically allocating file data to free space, some of the common allocation algorithms include first-fit (first gap), next-fit (next gap after the previously-inserted file), best-fit (smallest gap), and worst-fit (largest gap).

Finally, let’s discuss disk storage techniques to improve redundancy, read speed, and write speed. Redundant Disks prevent data loss by mirroring the same data across multiple disks. It is a reality that, over time, disks will fail. It’s cheaper to simply buy more disks than to monitor and repair existing disks. Data that is stored on mirrored disks can also be read simultaneously, resulting in an n-fold increase in read speed. On the other hand, writes see no such speedup, since the data must be written to every disk to ensure redundancy (in fact, the total write work grows n-fold).

Mirroring is still expensive, however, even with cheap hardware. It can be cheaper to use intelligent techniques to improve speed and redundancy. One example is striping, a technique where consecutive segments of bits are split, or “striped”, across multiple disks. Data in the stripe can then be simultaneously read across all the disks, resulting in an n-fold increase in reading speed. This allows for a similar read-speed boost to mirroring without requiring any extra storage space or increase in write times. The largest disadvantage of disk striping is that it greatly hampers redundancy — losing any disk will cause all of the data in the system to be lost, unless a data recovery technique such as parity is used.

Parity is the property of being even or odd. A redundant parity disk can be added to a set of disks to ensure that the number of 1’s in each stripe has constant parity. For example, a parity disk can ensure the parity is always odd by storing a 0 if a stripe’s parity is odd, or a 1 if the parity is even. Then, if one of the disks fails, that disk can be reconstructed by using the same parity-evaluation process used to create the parity disk. By maintaining parity, a failed disk can be easily restored so long as all the other disks in the system are alive.
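
One common way to implement this bookkeeping is XOR (which keeps the combined parity even rather than odd, but works the same way): the parity byte is the XOR of the corresponding bytes on the data disks, and XORing the parity with the surviving disks recovers the lost one. A small sketch with made-up disk contents:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t disk[3] = {0x5A, 0x3C, 0xF0};           /* bytes from three data disks */
        uint8_t parity  = disk[0] ^ disk[1] ^ disk[2];  /* stored on the parity disk */

        /* Suppose disk 1 fails: rebuild its byte from the survivors plus parity. */
        uint8_t rebuilt = parity ^ disk[0] ^ disk[2];
        printf("rebuilt=0x%02X original=0x%02X\n", rebuilt, disk[1]);  /* identical */
        return 0;
    }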

Hopefully that was a helpful overview of operating system concepts! I see a lot of aspiring software engineers in college who lack knowledge of OS theory. That’s truly a pity given how important it is.
