Hypervisors are the virtualization technique that powers cloud computing infrastructure such as Amazon EC2 and Google Compute Engine. Although container technologies like Docker and Kubernetes have taken the spotlight recently, containers are often deployed on top of hypervisors in the cloud.
In this article, we will first outline the architecture of classical trap-and-emulate hypervisors that were invented in the 1970s. We will then describe how hypervisors evolved from the 1970s to today’s cloud computing era. Finally, we will look at future trends that affect hypervisor design.
(This article was inspired by an awesome talk on Amazon’s Nitro project by Anthony Liguori, which I highly recommend to everyone interested in hypervisors and cloud infrastructure.)
A hypervisor is a piece of system software that provides virtual machines (VMs), on which users can run their OS and applications. The hypervisor provides isolation between VMs, which run independently of each other, and also allows each VM to run its own OS. Like other virtualization techniques, hypervisors provide multitenancy, which simplifies machine provisioning and administration. One of the main criticisms of hypervisors is that they tend to be heavy-weight compared to other virtualization techniques like containers (Morabito et al., 2015). However, it’s also possible to build hypervisors that are light-weight (Manco et al., 2017) and to make the guest OS more light-weight when it runs under a hypervisor (Madhavapeddy, 2013).
A hypervisor can be decomposed into two major parts: the virtual machine monitor (VMM) and the device model. The VMM is responsible for setting up VMs and handling traps (a.k.a. VM exits) caused by the guest OS executing privileged instructions such as I/O accesses. The device model, on the other hand, is responsible for implementing I/O interfaces for all the devices the hypervisor supports, such as network cards and storage devices. Hypervisor architecture is illustrated in the following diagram.
(The terms hypervisor and VMM are often used interchangeably. In this article, however, hypervisor refers to the combination of a VMM and a device model.)
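To make the split between the VMM and the device model more concrete, here is a minimal sketch of a trap-and-emulate loop in Python. It is a toy model, not how any real hypervisor is written: the instruction format, the `ToyVMM` and `SerialPort` classes, and the port number are all made up for illustration. Unprivileged instructions run without VMM involvement, while an I/O instruction counts as a VM exit and is forwarded to the device registered for that port.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A toy guest instruction: (opcode, port, value). Hypothetical format.
Instruction = Tuple[str, int, int]


@dataclass
class SerialPort:
    """Toy device model: records every value written to its I/O port."""
    output: List[int] = field(default_factory=list)

    def write(self, value: int) -> None:
        self.output.append(value)


@dataclass
class ToyVMM:
    """Runs guest instructions and delegates trapped I/O to device models."""
    devices: Dict[int, SerialPort] = field(default_factory=dict)
    exits: int = 0  # number of traps (VM exits) handled so far
    acc: int = 0    # the guest's only register

    def run(self, program: List[Instruction]) -> None:
        for opcode, port, value in program:
            if opcode == "add":
                # Unprivileged instruction: executes without the VMM stepping in.
                self.acc += value
            elif opcode == "out":
                # Privileged I/O instruction: traps into the VMM, which
                # hands the request to the device model for that port.
                self.exits += 1
                self.devices[port].write(value)
            else:
                raise ValueError(f"unknown opcode: {opcode}")
```

For example, running `[("add", 0, 2), ("add", 0, 3), ("out", 0x3F8, 65)]` on a `ToyVMM` with a `SerialPort` registered at port `0x3F8` leaves the guest register at 5 and causes exactly one exit.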
Virtual machine monitor (VMM)
A VMM must satisfy three properties (Popek and Goldberg, 1973):
- The equivalence property states that program execution has identical observable behavior on bare metal and under a VMM, except for timing and resource availability, which are difficult to preserve because the physical hardware is shared.
- The efficiency property states that the majority of program instructions are executed directly on a physical CPU without interference from the hypervisor.
- The resource control property states that the VMM manages all hardware resources. Virtual machines require permission from the hypervisor to directly access hardware.
As a side note, emulators satisfy both the equivalence and resource control properties, but do not satisfy the efficiency property.
The KVM subsystem in the Linux kernel (and in other OSes it has been ported to), for example, provides the building blocks for implementing a VMM. The KVM subsystem is effectively a portable abstraction over CPU hardware virtualization capabilities, which userspace applications like QEMU can use to implement a VMM or a full hypervisor.
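As a small illustration of how userspace talks to KVM, the sketch below opens `/dev/kvm` and issues the `KVM_GET_API_VERSION` ioctl, which is typically the first call a KVM-based VMM makes. The helper name `kvm_api_version` is made up for this example, and the hard-coded request number assumes the common Linux ioctl encoding used on x86-64 and arm64. The function returns `None` when KVM is not available or not accessible.

```python
import fcntl
import os
from typing import Optional

# KVM_GET_API_VERSION is defined as _IO(0xAE, 0x00) in <linux/kvm.h>.
# The numeric value below assumes the common ioctl encoding (e.g. x86-64).
KVM_GET_API_VERSION = 0xAE00


def kvm_api_version(path: str = "/dev/kvm") -> Optional[int]:
    """Return the KVM API version, or None if KVM cannot be used."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None  # no KVM device node, or no permission to open it
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    except OSError:
        return None  # the file does not understand the KVM ioctl
    finally:
        os.close(fd)
```

A real VMM would go on to create a VM and vCPUs with further ioctls on the returned file descriptors. That part is left out here.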
The device model is the part of a hypervisor that provides I/O interfaces for virtual machines. While the VMM is responsible for handling traps, it delegates I/O requests to the appropriate device model. Examples of device models are virtualized NICs and storage devices. A device model can provide the interface of either a real hardware device or a paravirtualized device. It can be implemented either in software, like the virtio family of devices, or in hardware, using SR-IOV, for example.
To implement a device model, I/O virtualization is needed. The two approaches for I/O virtualization are software-based and hardware-assisted.
Software-based I/O virtualization implements I/O interfaces in software, which allows the same physical devices to be shared across multiple virtual machines. It can be implemented on top of different backends. For example, a software-based storage device can be layered on top of a block device or a filesystem. One issue with the software-based approach is that the device model uses the same CPU resources as the vCPUs, which reduces available CPU capacity and causes jitter.
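The storage example can be sketched in a few lines of Python. The `FileBackedDisk` class below is hypothetical and only shows the layering idea: the guest sees fixed-size sectors, and the device model turns each sector access into a read or write on an ordinary host file. The 512-byte sector size is an assumption, and real device models also deal with request queues, caching, and flushes.

```python
import os

SECTOR_SIZE = 512  # assumed sector size for this sketch


class FileBackedDisk:
    """Toy virtual disk whose sectors live in a regular host file."""

    def __init__(self, path: str, num_sectors: int) -> None:
        self.num_sectors = num_sectors
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Size the backing file; sectors never written read back as zeros.
        os.ftruncate(self._fd, num_sectors * SECTOR_SIZE)

    def _check(self, sector: int) -> None:
        if not 0 <= sector < self.num_sectors:
            raise ValueError(f"sector {sector} out of range")

    def read_sector(self, sector: int) -> bytes:
        self._check(sector)
        return os.pread(self._fd, SECTOR_SIZE, sector * SECTOR_SIZE)

    def write_sector(self, sector: int, data: bytes) -> None:
        self._check(sector)
        if len(data) != SECTOR_SIZE:
            raise ValueError("data must be exactly one sector long")
        os.pwrite(self._fd, data, sector * SECTOR_SIZE)

    def close(self) -> None:
        os.close(self._fd)
```

Note that every `read_sector` and `write_sector` call here runs on a host CPU, which is exactly the CPU sharing problem mentioned above.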
Hardware-assisted I/O virtualization implements I/O interfaces in hardware. This approach requires hardware support for sharing the same physical device across multiple virtual machines. SR-IOV, for example, is a PCI extension that allows a physical PCI function to be partitioned into multiple virtual PCI functions.
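On Linux, SR-IOV capable physical functions expose the `sriov_totalvfs` and `sriov_numvfs` attributes in sysfs, which report how many virtual functions the device supports and how many are currently enabled. The following sketch lists such devices. The function name `sriov_capable_devices` is made up for this example, and the `root` parameter exists only so the code can be pointed at a different directory for testing.

```python
from pathlib import Path
from typing import Dict, Tuple


def sriov_capable_devices(
    root: str = "/sys/bus/pci/devices",
) -> Dict[str, Tuple[int, int]]:
    """Map PCI address -> (enabled VFs, maximum VFs) for SR-IOV devices."""
    result: Dict[str, Tuple[int, int]] = {}
    base = Path(root)
    if not base.is_dir():
        return result  # not Linux, or sysfs is not mounted
    for dev in sorted(base.iterdir()):
        try:
            enabled = int((dev / "sriov_numvfs").read_text())
            maximum = int((dev / "sriov_totalvfs").read_text())
        except (OSError, ValueError):
            continue  # device has no SR-IOV attributes or is unreadable
        result[dev.name] = (enabled, maximum)
    return result
```

On a machine without SR-IOV hardware the function simply returns an empty dictionary.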
The semantics of a trap-and-emulate VMM were formalized in the early 1970s (Popek and Goldberg, 1973) and made popular again in the late 1990s for running commodity OSes on scalable multiprocessor machines (Bugnion et al., 1997). However, the most popular machine architecture at the time, Intel x86, was not virtualizable because some of its sensitive instructions did not trap when executed in unprivileged mode.
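Popek and Goldberg's condition for classic virtualization can be stated as a simple set relation: every sensitive instruction (one that reads or changes privileged machine state) must also be a privileged instruction (one that traps in user mode). The sketch below checks that condition. The function names are made up, and the instruction sets in the usage note are a small illustrative subset, not a complete description of x86.

```python
from typing import FrozenSet, Set


def non_trapping_sensitive(sensitive: Set[str], privileged: Set[str]) -> FrozenSet[str]:
    """Return the sensitive instructions that do not trap in user mode."""
    return frozenset(sensitive - privileged)


def is_classically_virtualizable(sensitive: Set[str], privileged: Set[str]) -> bool:
    """True if every sensitive instruction is also privileged."""
    return not non_trapping_sensitive(sensitive, privileged)
```

For example, with `privileged = {"lgdt", "hlt"}` and `sensitive = {"lgdt", "hlt", "popf", "sgdt"}`, the check fails and reports `popf` and `sgdt` as the offenders. These two are well-known examples of instructions that were sensitive but did not trap on x86 before the hardware virtualization extensions.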
The VMware hypervisor, which targeted x86, was first released in 1999. It used binary translation to replace the problematic instructions with code that traps into the hypervisor, while still running unprivileged instructions directly on the physical CPU, which solved x86’s virtualization issues (Adams and Agesen, 2006). This allowed the VMware hypervisor to run unmodified commodity OSes on x86 hardware in virtual machines without the performance penalty of emulation.
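The core idea of binary translation can be shown with a toy rewriter. This is only a sketch under strong simplifying assumptions. It works on symbolic instruction tuples rather than machine code, the `SENSITIVE` set and the `call_vmm` pseudo-instruction are invented for illustration, and it ignores the on-demand translation and caching that a real translator needs.

```python
from typing import List, Set, Tuple

# A toy instruction: mnemonic followed by operands. Hypothetical format.
Instr = Tuple[str, ...]

# Illustrative set of instructions that must not run natively in the guest.
SENSITIVE: Set[str] = {"popf", "sgdt"}


def translate(block: List[Instr]) -> List[Instr]:
    """Rewrite a block so sensitive instructions call into the VMM."""
    translated: List[Instr] = []
    for instr in block:
        if instr[0] in SENSITIVE:
            # Replace the instruction with an explicit call into the VMM,
            # keeping the original as an argument so it can be emulated.
            translated.append(("call_vmm", *instr))
        else:
            # Everything else is copied unchanged and runs directly.
            translated.append(instr)
    return translated
```

Given the block `[("mov", "eax", "1"), ("popf",), ("add", "eax", "2")]`, only the middle instruction is rewritten, to `("call_vmm", "popf")`. The other two pass through untouched, which is why this approach avoids the cost of full emulation.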
The Xen hypervisor, first released in 2003, took a different approach to solving the x86 virtualization issue. Instead of using binary translation, its developers modified the source code of the guest OS to trap to the hypervisor instead of executing non-trapping sensitive instructions, an approach known as paravirtualization.
Intel and AMD released x86 CPUs with virtualization extensions in 2005 and 2006, which made classic trap-and-emulate virtualization possible. KVM, initially developed for Linux, implements a kernel subsystem that, in combination with QEMU’s device model, provides a full hypervisor. Initially, the KVM project provided a software-based device model that emulated full hardware devices, but it later gained a paravirtualized I/O device model when virtio was introduced.
The classic hypervisor architecture has stood the test of time, but there are some trends that affect hypervisor design.
Hardware virtualization is becoming more widespread. For example, the Amazon Nitro project (talk by Anthony Liguori) takes an unconventional approach to hypervisor design, which replaces all of the software-based device model with hardware virtualization as illustrated in this diagram. Amazon’s Nitro hypervisor also uses a custom-designed VMM that leverages Linux KVM.
Operating systems have also started to evolve to accommodate hypervisors better. Unikernels are an interesting OS design approach that packages the OS and the application into one bundle, both of which run at the same CPU protection level (Madhavapeddy, 2013). This eliminates the traditional separation between kernel and user space, which reduces context switch and system call overheads at the expense of losing some OS functionality. The basic idea was pioneered earlier in the form of library OSes, but the much simpler device model of a hypervisor compared to bare metal made the idea much more feasible for real-world use.
Light-weight virtualization is becoming more and more important as the use of cloud computing grows. Containers are an excellent technology for providing light-weight virtualization. However, containers are unable to provide the full isolation capabilities of VMs, and have various security problems because containers share the same host OS and have access to the large OS system call interface (Manco et al., 2017). Hypervisors can be slimmed down significantly (Manco et al., 2017), and unikernels provide an even larger opportunity to optimize the hypervisor if we relax the equivalence property requirement of VMMs (Williams, 2016).
Serverless computing is a new computing model, better described as Functions as a Service (FaaS), that allows application developers to deploy functions instead of applications to a managed platform. One approach to serverless computing is to use hypervisors and unikernels for packaging and deploying the functions (Koller and Williams, 2017).
Energy efficiency is another important future direction for hypervisor design. Communications technology, of which cloud computing is a large part, is forecast to consume around 20% of global electricity by 2030, or as much as 50% in the worst case (Andrae and Edler, 2015)! The energy overhead of a hypervisor can be extremely high depending on the workload. One experiment reports between 59% and 273% energy overhead for KVM (Jin et al., 2012)!
Kernel-bypass networking has become important recently because NICs are getting faster and the traditional TCP/IP and POSIX socket abstractions are proving to have high overheads (Han et al., 2012; Jeong et al., 2014; Yasukata et al., 2016). Hypervisors that implement the device model using paravirtualized I/O effectively introduce another layer into the networking data path, which increases networking overheads. In Linux, the vhost architecture is one solution to the problem. Vhost moves the virtio paravirtualized I/O device model from QEMU (which is the userspace VMM process) to the host kernel (which also hosts the KVM module), which eliminates the exit from the host kernel to the userspace VMM. Another solution is full hypervisor kernel-bypass using hardware NIC virtualization, introduced by the Arrakis project (Peter et al., 2014).
The hypervisor architecture invented in the 1970s has stood the test of time. The quirks of the x86 architecture meant that the first successful hypervisors had to resort to binary translation to handle sensitive instructions that did not trap. Binary translation was followed by paravirtualization (popularized by Xen), but hypervisor architectures converged back on the classic model as Intel and AMD added virtualization extensions to the x86 architecture.
Although containers have recently become a very popular virtualization technique, emerging computing paradigms like serverless computing could make hypervisors an attractive technique again. Light-weight hypervisor designs, unikernels, and hardware-assisted virtualization all reduce hypervisor overheads, which also makes hypervisors more competitive against containers.
Keith Adams and Ole Agesen. 2006. A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS XII). ACM, New York, NY, USA, 2–13. DOI: https://doi.org/10.1145/1168857.1168860
Anders S. G. Andrae and Tomas Edler. 2015. On Global Electricity Usage of Communication Technology: Trends to 2030. In Challenges, 6(1):117–157, 2015. DOI: http://dx.doi.org/10.3390/challe6010117
Edouard Bugnion, Scott Devine, and Mendel Rosenblum. 1997. Disco: running commodity operating systems on scalable multiprocessors. In Proceedings of the sixteenth ACM symposium on Operating systems principles (SOSP ‘97), William M. Waite (Ed.). ACM, New York, NY, USA, 143–156. DOI: http://dx.doi.org/10.1145/268998.266672
Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. 2012. MegaPipe: a new programming interface for scalable network I/O. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, USA, 135–148.
Yichao Jin, Yonggang Wen, and Qinghua Chen. 2012. Energy efficiency and server virtualization in data centers: An empirical investigation. In Proceedings IEEE INFOCOM Workshops, Orlando, FL, 2012, pp. 133–138. DOI: http://dx.doi.org/10.1109/INFCOMW.2012.6193474
Eun Young Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. 2014. mTCP: a highly scalable user-level TCP stack for multicore systems. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI’14). USENIX Association, Berkeley, CA, USA, 489–502.
Ricardo Koller and Dan Williams. 2017. Will Serverless End the Dominance of Linux in the Cloud?. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HotOS ‘17). ACM, New York, NY, USA, 169–173. DOI: https://doi.org/10.1145/3102980.3103008
Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. 2013. Unikernels: library operating systems for the cloud. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems (ASPLOS ‘13). ACM, New York, NY, USA, 461–472. DOI: http://dx.doi.org/10.1145/2451116.2451167
Filipe Manco, Costin Lupu, Florian Schmidt, Jose Mendes, Simon Kuenzer, Sumit Sati, Kenichi Yasukata, Costin Raiciu, and Felipe Huici. 2017. My VM is Lighter (and Safer) than your Container. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP ‘17). ACM, New York, NY, USA, 218–233. DOI: https://doi.org/10.1145/3132747.3132763
Roberto Morabito, Jimmy Kjällman, and Miika Komu. 2015. Hypervisors vs. Lightweight Virtualization: A Performance Comparison. In Proceedings of the 2015 IEEE International Conference on Cloud Engineering (IC2E ‘15). IEEE Computer Society, Washington, DC, USA, 386–393. DOI: http://dx.doi.org/10.1109/IC2E.2015.74
Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2014. Arrakis: the operating system is the control plane. In Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, USA, 1–16.
Gerald J. Popek and Robert P. Goldberg. 1973. Formal requirements for virtualizable third generation architectures. In Proceedings of the fourth ACM symposium on Operating system principles (SOSP ‘73). ACM, New York, NY, USA, 121-. DOI: http://dx.doi.org/10.1145/800009.808061
Dan Williams and Ricardo Koller. 2016. Unikernel monitors: extending minimalism outside of the box. In Proceedings of the 8th USENIX Conference on Hot Topics in Cloud Computing (HotCloud’16). USENIX Association, Berkeley, CA, USA, 71–76.
Kenichi Yasukata, Michio Honda, Douglas Santry, and Lars Eggert. 2016. StackMap: low-latency networking with the OS stack and dedicated NICs. In Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ‘16). USENIX Association, Berkeley, CA, USA, 43–56.