Advancements in Virtualization Environments

Ankit Bagde
Advances in Operating Systems
15 min read · Nov 25, 2020

This article is submitted as a reading assignment for the course Advances in Operating Systems Design by Adhikansh Singh, Ankit Bagde, Gaurav Goyal, Rajat Kumar Jenamani, and Rohan Henry Ekka.

Introduction

With recent developments across different domains of virtualization, such as virtual drives in cloud storage and direct I/O, continuous efforts are being made to make virtualization memory-efficient, feasible to deploy, and cost-effective. A common concern with all such optimizations is that

Irrespective of the domain, it should be ensured that these advancements do not lead to a trade-off between security and performance.

Here we discuss such developments in different domains related to virtualization. First, we discuss how static pinning hinders memory management in direct I/O and how coIOMMU provides fine-grained pinning by decoupling the pinning and protection problems. We then discuss how LeapIO makes it feasible to offload the storage stack to ARM co-processors, bringing substantial cost savings to cloud providers. Finally, we summarize DVH, which optimizes nested virtualization by providing virtual hardware directly to nested VMs, and KylinX, a dynamic library OS for simplified and efficient cloud virtualization.

Tackling static pinning in Direct I/O

Direct I/O is the best-performing I/O virtualization method and is widely deployed in clouds and data centres. It allows the guest to interact directly with I/O devices without intervention from a software intermediary.

Protection against memory corruption and malicious Direct Memory Access (DMA) attacks is provided by DMA remapping for device drivers, which in turn provides a higher level of device compatibility. An I/O memory management unit (IOMMU) is an MMU that provides this DMA remapping capability, preventing DMA attacks in direct I/O. As shown in the figure below, each assigned device is associated with an IOMMU page table (IOPT), configured by the hypervisor so that only the memory of the guest that owns the device is mapped.

The IOMMU, using the IOPTs, validates and translates DMA requests, achieving inter-guest protection among directly assigned devices.
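
To make this concrete, here is a minimal C sketch, under assumed hypervisor helper names (iopt_map, gpa_to_hpa are hypothetical, not a real IOMMU driver API), of how a hypervisor might populate a per-device IOPT so that DMA from an assigned device can only reach the owning guest's memory:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

struct iopt;                                  /* per-device IOMMU page table (opaque) */

/* Hypothetical helpers assumed to exist in the hypervisor. */
uint64_t gpa_to_hpa(uint64_t gpa);            /* guest-physical -> host-physical */
int iopt_map(struct iopt *t, uint64_t iova, uint64_t hpa, int perm);

/* Map a guest-physical range into the device's IOPT so that DMA addressed
 * with guest-physical addresses (IOVA == GPA here) is remapped and validated
 * by the IOMMU; DMA to anything outside these mappings faults. */
void iopt_map_guest_range(struct iopt *t, uint64_t gpa, size_t len, int perm)
{
    for (uint64_t off = 0; off < len; off += PAGE_SIZE)
        iopt_map(t, gpa + off, gpa_to_hpa(gpa + off), perm);
}
```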

In direct I/O virtualization, the hypervisor has no visibility of guest DMA activity, leading to static pinning.

However, direct I/O faces the problem of static pinning, for two reasons:

  1. Most devices do not tolerate DMA faults, implying that guest buffers must be pinned in host memory and mapped in the IOPT before they are accessed by DMAs.
  2. Second, since the hypervisor has no visibility of guest DMA activity, it has to assume that every guest page might be a DMA page.

Consequently, the hypervisor has to pre-allocate and pin the entire guest memory upfront (a.k.a. static pinning) before any guest DMA operation starts, e.g. at VM creation time.
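
As an illustration of what static pinning means in practice, here is a hedged C sketch (struct vm, alloc_and_pin_host_page and iopt_map are hypothetical names) of the pin-everything-upfront loop a hypervisor would effectively run at VM creation:

```c
#include <stdint.h>

#define PAGE_SIZE  4096UL
#define IOPT_READ  0x1
#define IOPT_WRITE 0x2

struct iopt;                                  /* per-device IOMMU page table */
struct vm { uint64_t mem_size; };             /* only what the sketch needs  */

/* Hypothetical helpers assumed to exist in the hypervisor. */
uint64_t alloc_and_pin_host_page(struct vm *vm, uint64_t gpa);
int iopt_map(struct iopt *t, uint64_t iova, uint64_t hpa, int perm);

/* Static pinning: with no visibility into which guest pages will be DMA
 * targets, the hypervisor pins and maps every guest page at VM creation. */
void static_pin_all(struct vm *vm, struct iopt *t)
{
    for (uint64_t gpa = 0; gpa < vm->mem_size; gpa += PAGE_SIZE) {
        uint64_t hpa = alloc_and_pin_host_page(vm, gpa);  /* not reclaimable */
        iopt_map(t, gpa, hpa, IOPT_READ | IOPT_WRITE);
    }
    /* Cost: all guest memory is resident and unreclaimable, and VM creation
     * time grows linearly with guest memory size. */
}
```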

The problem with static pinning is quite obvious: it worsens memory utilization, as pinned pages cannot be reclaimed for other purposes, and we have to tolerate a much-increased VM creation time, as shown below.

The problem of static pinning: VM creation time increases with guest memory size. Up to 73x longer time observed for a VM with 128GB memory

To tackle this problem, previous studies have proceeded in two directions, each with its own drawback, as discussed below:

  • First, making the device support DMA page faults. This suffers from long fault-handling latency, which in turn demands larger on-device buffers and further complicates the device, leading to higher device cost.
  • Second, exposing DMA buffer information to the hypervisor through software approaches. Knowing when a guest page is mapped or unmapped allows the hypervisor to pin or unpin it dynamically. However, a big performance penalty is incurred when blindly issuing hypercalls to notify the hypervisor of every guest map/unmap operation. The number of notifications can be reduced with a guest-side pin-down cache, but this comes at the cost of losing intra-guest protection.
vIOMMU: Guest uses vIOMMU to map/unmap DMA buffers & vIOMMU requests hypervisor to pin/unpin guest DMA buffers

One possible solution to static pinning is to expose a virtual IOMMU (vIOMMU) to the guest, whose primary purpose is to provide intra-guest protection through virtual DMA remapping. As seen in the figure, the guest uses the vIOMMU to map/unmap DMA buffers, and the vIOMMU asks the hypervisor to pin/unpin the corresponding guest pages, which enables fine-grained pinning. However, the emulation cost of vIOMMUs can be high, and the aggressive optimizations that have been proposed may compromise security.
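
The following sketch, again using hypothetical helper names (pin_guest_page, iopt_map, etc.), shows why this path is expensive: every guest map/unmap of a DMA buffer traps into the hypervisor, which pins or unpins the page and updates a shadow IOPT, costing a full VM exit per operation.

```c
#include <stdint.h>

struct vm;                                    /* guest VM (opaque)       */
struct iopt;                                  /* shadow IOMMU page table */

/* Hypothetical helpers assumed to exist in the hypervisor. */
uint64_t pin_guest_page(struct vm *vm, uint64_t gpa);
void unpin_guest_page(struct vm *vm, uint64_t gpa);
int iopt_map(struct iopt *t, uint64_t iova, uint64_t hpa, int perm);
int iopt_unmap(struct iopt *t, uint64_t iova);

/* Guest map: trap into the hypervisor, pin the page on demand, and install
 * the fine-grained mapping in the shadow IOPT. Each operation is a VM exit. */
void viommu_emulate_map(struct vm *vm, struct iopt *shadow, uint64_t iova,
                        uint64_t gpa, int perm)
{
    uint64_t hpa = pin_guest_page(vm, gpa);
    iopt_map(shadow, iova, hpa, perm);
}

/* Guest unmap: tear down the mapping and unpin, making the page reclaimable. */
void viommu_emulate_unmap(struct vm *vm, struct iopt *shadow, uint64_t iova,
                          uint64_t gpa)
{
    iopt_unmap(shadow, iova);
    unpin_guest_page(vm, gpa);
}
```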

In reality, virtual DMA remapping is disabled for established vIOMMUs by most guest OSes; users may opt in when security is prioritized over performance. For example, Linux uses a passthrough policy by default (which completely disables DMA remapping), leaving the ‘strict’ and ‘lazy’ policies for users to opt into. Thus, established vIOMMUs cannot reliably eliminate static pinning in direct I/O, due to the emulation cost of their DMA remapping interfaces.

For efficient memory management, a new vIOMMU architecture, coIOMMU, is proposed. Before looking at what coIOMMU offers, it is important to understand how the two tasks, DMA tracking and DMA remapping, are linked to each other.

In the established vIOMMU architecture, mixing the requirements of protection and pinning through the same costly DMA remapping interface is needlessly constraining. Protection is a guest requirement and relies on the DMA remapping capability, while pinning serves host memory management and needs the capability of tracking guest DMA buffers. The two do not always match, so favouring one may break the other if both are enabled through the same interface. This leads to a dilemma:

the hypervisor must either fall back to static pinning by assuming that most guests disable protection, or adopt fine-grained pinning by forcing all guests to enable protection and bear the added cost.

So, what if we had a separate DMA buffer tracking mechanism in the vIOMMU that does not rely on any semantics of DMA remapping, so that the pinning and protection problems could be handled independently? Enter coIOMMU.

coIOMMU introduces a cooperative DMA buffer tracking mechanism for fine-grained pinning that is orthogonal to the costly DMA remapping interface. The new mechanism uses a DMA tracking table (DTT) shared between the hypervisor and the guest to exchange DMA buffer information without incurring excessive notifications from the guest, thanks to two optimizations: smart pinning and lazy unpinning.
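
A minimal sketch of what cooperative tracking could look like is shown below, assuming a hypothetical DTT layout with one entry per guest page frame (the actual coIOMMU table format differs): the guest records DMA buffers directly in shared memory, notifies the host only when a page is not already pinned (smart pinning), and never notifies on unmap (lazy unpinning).

```c
#include <stdint.h>
#include <stdbool.h>

/* One entry per guest page frame, in memory shared between guest and host. */
struct dtt_entry {
    uint8_t mapped;    /* set by the guest: page is currently a DMA buffer */
    uint8_t pinned;    /* set by the host: page is pinned in host memory   */
};

/* Hypothetical notification primitive assumed to exist. */
void hypercall_request_pin(uint64_t gfn);

/* Guest map: record the buffer in the DTT; only notify the host if the page
 * is not already pinned, so notifications stay rare ("smart pinning"). */
void guest_dma_map(struct dtt_entry *dtt, uint64_t gfn)
{
    dtt[gfn].mapped = true;
    if (!dtt[gfn].pinned)
        hypercall_request_pin(gfn);
}

/* Guest unmap: just clear the flag. The host unpins later, in bulk, when it
 * actually wants the memory back ("lazy unpinning"), by scanning the DTT. */
void guest_dma_unmap(struct dtt_entry *dtt, uint64_t gfn)
{
    dtt[gfn].mapped = false;
}
```

On the host side, under memory pressure, the hypervisor can scan the table and reclaim pages that are pinned but no longer mapped, without any guest involvement.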

In the authors' experiments, coIOMMU not only dramatically improves the efficiency of memory management across a wide range of direct I/O usages at negligible cost, but also sustains the security required by specific protection policies.

LeapIO: Portable Virtual NVMe Storage on ARM System-on-Chips

Cloud storage has improved drastically in size and speed in the last decade, and with this growth, there are new challenges in making the storage efficient.

To satisfy customer needs, today’s cloud providers must implement a wide variety of storage (drive-level) functions like supporting virtual drives, drive-level atomicity and other performance and reliability features. As a result of these requirements, the cloud storage stack is extremely resource hungry, with 10–20% of x86 cores reserved for storage functions.

Offloading the storage stack to ARM co-processors can bring substantial cost savings. But just dropping an ARM SoC on a PCIe slot would not be enough, we also need to rethink the entire storage stack to meet all the deployment challenges — enter LeapIO.

To address the deployment goals in a holistic way, LeapIO employs a set of OS/software techniques on top of new hardware capabilities, allowing storage services to portably leverage ARM co-processors. LeapIO helps cloud providers cut the storage tax and improve utilization without sacrificing performance. All the involved software layers see the same abstraction: the virtual NVMe drive. Let us divide the discussion into three parts: the goals, the design and the results.

The Goals

The Goals for our LeapIO design
  1. Fungibility and portability: We need to keep servers fungible regardless of their acceleration/offloading capabilities.
  2. Virtualizability and composability: We need to support virtualizing and composing not just local and remote SSDs but also local and remote I/O services via NVMe-over-PCIe. A user can obtain a local virtual drive that is mapped to a portion of a local SSD, which is shared by another remote service that glues many virtual drives into a single drive (e.g. RAID).
  3. Efficiency: It is important to deliver performance close to bare metal.
  4. Service extensibility: Unlike traditional block-level services that reside in kernel space for performance, LeapIO allows storage services to be implemented in user space.

The Design

Hardware View

We require four new hardware properties in our SoC design:

  1. host DRAM access (for NVMe queue mapping)
  2. IOMMU access (for address translation)
  3. SoC’s DRAM mapping to host address space (for efficient data path)
  4. NIC sharing between x86 and ARM SoC (for RDMA purposes).

All these features are addressable from the SoC side and no host-side hardware changes are needed.

Software View

The different stages of the software I/O flow are explained below:

  1. User VM: On the client side, a user runs a guest OS of her choice in a VM, with no modification required. For storage, the guest VM uses the standard NVMe device interface.
  2. Host OS: The design ensures that the LeapIO runtime sees the same NVMe command queues that are exposed to the VM.
  3. Ephemeral Storage: If the user VM uses local SSDs, the requests are placed in the NVMe queue mapped between the LeapIO runtime and the SSD device.
  4. Client-side LeapIO runtime and services: The client-side runtime “glues” all the NVMe queue pairs into end-to-end storage paths, over a network connection when needed (a simplified polling sketch follows this list).
  5. Remote access (NIC): If the user stores data on a remote SSD or service, the client runtime simply forwards the I/O requests to the server runtime via TCP through the NIC.
  6. Server-side LeapIO runtime and services: The server-side runtime picks up the incoming commands and data by polling the queues of that client.
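
The C sketch below illustrates, with hypothetical types and helpers (struct vqp, forward_to_server, submit_to_local_ssd are illustrative names, not LeapIO's actual code), the kind of polling loop such a runtime could run on the ARM SoC to glue virtual NVMe queue pairs to local SSDs or to a remote runtime:

```c
#include <stdint.h>
#include <stdbool.h>

struct nvme_cmd { uint8_t raw[64]; };         /* NVMe submission entry */
struct nvme_cpl { uint8_t raw[16]; };         /* NVMe completion entry */

struct vqp {                                  /* one virtual NVMe queue pair */
    struct nvme_cmd *sq; uint16_t sq_head, sq_tail, depth;
    struct nvme_cpl *cq; uint16_t cq_tail;
    bool remote;                              /* backed by a remote SSD/service? */
};

/* Hypothetical helpers assumed to exist in the runtime. */
void forward_to_server(struct vqp *q, struct nvme_cmd *cmd);   /* TCP/RDMA via NIC */
void submit_to_local_ssd(struct vqp *q, struct nvme_cmd *cmd); /* mapped HW queue  */
void drain_completions(struct vqp *q);        /* post CQ entries, signal the guest */

/* Busy-poll every virtual queue pair on the ARM SoC: pick up commands the
 * guest submitted to its virtual NVMe SQ and route them locally or remotely. */
void runtime_poll(struct vqp *qps, int nqp)
{
    for (;;) {
        for (int i = 0; i < nqp; i++) {
            struct vqp *q = &qps[i];
            while (q->sq_head != q->sq_tail) {
                struct nvme_cmd *cmd = &q->sq[q->sq_head];
                if (q->remote)
                    forward_to_server(q, cmd);
                else
                    submit_to_local_ssd(q, cmd);
                q->sq_head = (uint16_t)((q->sq_head + 1) % q->depth);
            }
            drain_completions(q);
        }
    }
}
```

Because the queue pairs live in memory shared between the host and the SoC, a loop like this needs no host-side hardware changes, consistent with the hardware view above.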

The Results and Evaluation

The hardware properties of this design make ARM-to-peripheral communication as efficient as x86-to-peripheral communication.

A portable runtime was developed that abstracts away hardware capabilities and exploits the uniform address space to make offloading seamless and flexible. The authors also built several novel services composed of local/remote SSDs and services, and performed detailed performance benchmarks and analysis.

Comparing LeapIO with “pass-through” technology (PT) on the FIO benchmark shows that it is not far from bare-metal performance, dropping only 2% and 5% for read-only and read-write throughput respectively.

This and several other tests performed by the authors show that LeapIO is a promising next-generation cloud storage stack that leverages ARM SoCs to relieve taxed x86 CPUs. The performance overhead of moving from x86 to ARM is very small, which makes it attractive to move existing host storage services to LeapIO.

Optimizing Nested Virtualization Performance Using Direct Virtual Hardware

Introduction

Nested virtualization, running virtual machines and hypervisors on top of other virtual machines and hypervisors, is increasingly important because of the need to deploy virtual machines running software stacks on top of virtualized cloud infrastructure. However, performance remains a key impediment to further adoption as application workloads can perform many times worse than native execution.

To address this problem, we introduce DVH (Direct Virtual Hardware), a new approach that enables a host hypervisor, the hypervisor that runs directly on the hardware, to directly provide virtual hardware to nested virtual machines without the intervention of multiple levels of hypervisors. We introduce four DVH mechanisms: virtual-passthrough, virtual timers, virtual inter-processor interrupts, and virtual idle. DVH provides virtual hardware for these mechanisms that mimics the underlying hardware and in some cases adds new enhancements that leverage the flexibility of software without the need for matching physical hardware support.

We have implemented DVH in the Linux KVM hypervisor. Our experimental results show that DVH can provide near native execution speeds and improve KVM performance by more than an order of magnitude on real application workloads.

DVH Mechanisms

Virtual-Passthrough

  • Virtual-passthrough is similar to device passthrough in allowing a nested VM to directly access an I/O device, but it assigns virtual I/O devices to nested VMs instead of physical I/O devices. Loosely speaking, virtual-passthrough takes the virtual I/O device model of the host hypervisor and combines it with the passthrough model for subsequent guest hypervisors. The virtual device provided to the guest hypervisor is in turn assigned to the nested VM. The nested VM can interact directly with the assigned virtual device, bypassing the guest hypervisor(s).
  • Unlike the virtual I/O device model, virtual-passthrough avoids the need for guest hypervisors to provide their own virtual I/O devices, removing expensive guest hypervisor interventions for virtual I/O device emulation. Unlike the passthrough model, virtual-passthrough supports I/O interposition and all its benefits, as the host hypervisor provides a virtual I/O device for use by the L1 VM instead of a physical I/O device.

Virtual Timers

A per-virtual-CPU virtual timer is software provided by the host hypervisor that appears to guest hypervisors as an additional hardware timer capability. For example, on x86 CPUs the virtual timer appears as an additional LAPIC timer, so that guest hypervisors see two different LAPIC timers: the regular LAPIC timer and the virtual LAPIC timer. Like the LAPIC timer, the virtual LAPIC timer has its own set of configuration registers. Although x86 hardware provides APIC virtualization (APICv), APICv only provides a subset of APIC functionality, mostly related to interrupt control; there is no notion of virtual timers in APICv. As is typically done when adding a new virtualization hardware capability, we add one bit in the VMX capability register and one in the VM execution control register to enable the guest hypervisor to discover and enable/disable the virtual timer functionality, respectively.

Virtual timers are designed to be transparent to nested VMs and require no changes to them. Hardware timers used by nested VMs are transparently remapped by the host hypervisor to virtual timers. When a nested VM programs the hardware timer, it causes an exit to the host hypervisor, which confirms that virtual timers are enabled via the VM execution control register. Rather than forwarding the exit to the respective guest hypervisor to emulate the timer, the host hypervisor handles the exit by programming the virtual timer directly. This can be done either by using software timer functionality or architectural timer support, similar to regular LAPIC timer emulation. Our KVM implementation uses Linux hrtimers to emulate virtual timer functionality. Using virtual timers, no guest hypervisor intervention is needed for nested VMs to program timers, avoiding the high cost of exiting to the guest hypervisor on frequent programming of the timer by the guest OS in a nested VM.
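
A hedged sketch of this exit-handling path, with hypothetical names (vmcs12_ctrl_enabled, start_host_timer) rather than the actual KVM code, might look like this:

```c
#include <stdint.h>

struct nested_vcpu;                            /* L2 vCPU state (opaque)   */

#define CTRL_VIRTUAL_TIMER (1u << 0)           /* hypothetical control bit */

/* Hypothetical helpers assumed to exist in the host hypervisor. */
int  vmcs12_ctrl_enabled(struct nested_vcpu *v, uint32_t ctrl);
void forward_exit_to_guest_hypervisor(struct nested_vcpu *v);
void start_host_timer(struct nested_vcpu *v, uint64_t deadline_ns,
                      void (*cb)(struct nested_vcpu *));
void inject_timer_irq(struct nested_vcpu *v);
void resume_nested_vm(struct nested_vcpu *v);

/* A nested VM programmed its timer, causing an exit to the host hypervisor.
 * With virtual timers enabled, the host programs a software timer itself
 * instead of bouncing the exit through the guest hypervisor. */
void handle_nested_timer_write(struct nested_vcpu *v, uint64_t deadline_ns)
{
    if (!vmcs12_ctrl_enabled(v, CTRL_VIRTUAL_TIMER)) {
        forward_exit_to_guest_hypervisor(v);   /* legacy emulation path */
        return;
    }
    /* e.g. an hrtimer in the KVM prototype; its callback injects the timer
     * interrupt directly into the nested VM. */
    start_host_timer(v, deadline_ns, inject_timer_irq);
    resume_nested_vm(v);
}
```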

Virtual IPIs

Virtual IPIs are a DVH technique for reducing the latency of sending IPIs from nested VMs. They involve two mechanisms: a virtual ICR and a virtual CPU interrupt mapping table. A per-virtual-CPU virtual ICR is software provided by the host hypervisor that appears to guest hypervisors as an additional hardware capability. We also add one bit in the VMX capability register and one in the VM execution control register to enable the guest hypervisor to discover and enable/disable the virtual IPI functionality, respectively. The guest hypervisor can let nested VMs use virtual IPIs by setting the bit in the VM execution control register, which is also visible to the host hypervisor.

Virtual IPIs are designed to be transparent to nested VMs and require no changes to them. The hardware ICR used by nested VMs is transparently remapped by the host hypervisor to the virtual ICR. When a nested VM sends an IPI by writing the ICR, it causes an exit to the host hypervisor, which confirms that virtual IPIs are enabled via the VM execution control register. Rather than forwarding the exit to the respective guest hypervisor, the host hypervisor handles the exit by emulating the IPI send operation and writing the hardware ICR directly. Using virtual IPIs, no guest hypervisor intervention is needed for nested VMs to send IPIs.
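
Analogously to the timer case, a sketch of the ICR-write exit path, assuming a hypothetical vcpu_map table from nested vCPU id to physical CPU, could look like this:

```c
#include <stdint.h>

struct nested_vm   { int *vcpu_map; };         /* nested vCPU id -> physical CPU */
struct nested_vcpu { struct nested_vm *vm; };

#define CTRL_VIRTUAL_IPI (1u << 1)             /* hypothetical control bit */

/* Hypothetical helpers assumed to exist in the host hypervisor. */
int  vmcs12_ctrl_enabled(struct nested_vcpu *v, uint32_t ctrl);
void forward_exit_to_guest_hypervisor(struct nested_vcpu *v);
void send_physical_ipi(int pcpu, uint8_t vector);  /* writes the hardware ICR */
void resume_nested_vm(struct nested_vcpu *v);

/* A nested VM wrote its (virtual) ICR to send an IPI. With virtual IPIs
 * enabled, the host hypervisor looks up where the destination nested vCPU
 * runs and delivers the IPI itself, with no guest hypervisor intervention. */
void handle_nested_icr_write(struct nested_vcpu *sender, uint8_t vector,
                             uint32_t dest_vcpu_id)
{
    if (!vmcs12_ctrl_enabled(sender, CTRL_VIRTUAL_IPI)) {
        forward_exit_to_guest_hypervisor(sender);
        return;
    }
    int pcpu = sender->vm->vcpu_map[dest_vcpu_id]; /* interrupt mapping table */
    send_physical_ipi(pcpu, vector);
    resume_nested_vm(sender);
}
```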

Virtual Idle

Virtual idle is a DVH technique for reducing the latency of switching to and from low-power mode in nested VMs. It leverages existing architectural support for configuring whether to trap the idle instruction, but uses it in a new way. We configure the host hypervisor to trap the idle instruction as before, but configure all guest hypervisors not to trap it. The host hypervisor knows not to forward the idle instruction trap to the guest hypervisor, since it can access the guest hypervisor's configuration for nested VMs through the VMCS, as discussed for virtual timers. A nested VM executing the idle instruction will therefore trap only to the host hypervisor, and the host hypervisor will return to the nested VM directly on a new event. As a result, the cost of switching to and from low-power mode for nested VMs using virtual idle is similar to that for non-nested VMs, avoiding guest hypervisor interventions.
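
A minimal sketch of the resulting exit handling, with hypothetical helper names, is:

```c
struct nested_vcpu;                            /* L2 vCPU state (opaque) */

/* Hypothetical helpers assumed to exist in the host hypervisor. */
void block_vcpu_until_event(struct nested_vcpu *v);  /* sleep until irq/timer */
void resume_nested_vm(struct nested_vcpu *v);

/* A nested VM executed the idle (HLT) instruction and trapped to the host.
 * Since the guest hypervisor was configured not to trap HLT, the host does
 * not forward the exit; it blocks the vCPU and later returns straight to the
 * nested VM, just as it would for a non-nested VM. */
void handle_nested_hlt(struct nested_vcpu *v)
{
    block_vcpu_until_event(v);
    resume_nested_vm(v);
}
```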

Highlights

  • Nested virtualization involves running multiple levels of hypervisors to support running virtual machines (VMs) inside VMs.
  • Direct Virtual Hardware (DVH) can provide better performance than device passthrough while at the same time enabling migration of nested VMs, thereby providing a combination of good performance and key virtualization features not possible with device passthrough.
  • DVH is a new approach for directly providing virtual hardware to nested virtual machines without the intervention of multiple levels of hypervisors.
  • The four DVH mechanisms are: virtual-passthrough, to directly assign virtual I/O devices to nested virtual machines; virtual timers, to transparently remap timers used by nested virtual machines to virtual timers provided by the host hypervisor; virtual inter-processor interrupts, which can be sent and received directly from one nested virtual machine to another; and virtual idle, which enables nested VMs to switch to and from low-power mode without guest hypervisor interventions.
  • DVH provides virtual hardware for these mechanisms that mimics the underlying hardware and in some cases adds new enhancements that leverage the flexibility of software without the need for matching physical hardware support.

Dynamic Library OS for Simplified and Efficient Cloud Virtualization

Virtual machines on the cloud are usually dedicated to a specific application such as web serving or data analysis. These highly specialized single-purpose appliances need only a very small portion of traditional OS support to run the applications they host. However, current general-purpose operating systems are designed for multi-user, multi-application scenarios. This mismatch leads to performance and security penalties.

This problem has recently motivated the design of the Unikernel, a new type of OS in the form of libraries. A Unikernel application is packed with its dependent libraries into a specialized appliance image which runs efficiently and securely on a hypervisor. Unikernel applications eliminate unused code while guaranteeing the same level of isolation. However, the Unikernel approach sacrifices flexibility, efficiency, and applicability. For example, it cannot support dynamic fork, the basis for the commonly-used multi-process abstraction of conventional UNIX applications.

In this paper, the authors examine whether there exists a balance between the best of Unikernel appliances (strong isolation) and processes (high flexibility and efficiency). An analogy is drawn between appliances on a hypervisor and processes on a traditional OS. They extend static Unikernels by proposing a dynamic library operating system called KylinX, which provides a process-like VM abstraction (pVM) to allow simplified and efficient cloud virtualization.

By taking the hypervisor as an OS and the appliance as a process, the pVM allows both page-level and library-level dynamic mapping. At the page level, KylinX supports pVM fork plus a set of APIs for inter-pVM communication (IpC), which is compatible with conventional UNIX IPC. Since IpC is only allowed between a family of mutually-trusting pVMs forked from the same root pVM, its security is guaranteed. At the library level, KylinX allows shared libraries to be dynamically linked into a Unikernel appliance. This allows pVMs to (i) replace old libraries with new ones at runtime via online library updates and (ii) boot quickly by reusing in-memory domains (recycling). The authors analyse dynamic mapping and the potential threats it induces, and enforce corresponding restrictions.
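
To give a feel for the pVM abstraction, here is an entirely hypothetical API sketch in C (pvm_fork, pvm_pipe, pvm_dlopen, pvm_write are illustrative names, not KylinX's actual interface) of how an application might fork a pVM, update a library online, and communicate over IpC:

```c
/* Entirely hypothetical API, sketched for illustration only. */
int  pvm_fork(void);                  /* clone the running appliance as a new pVM */
int  pvm_pipe(int fds[2]);            /* IpC channel, valid within one pVM family */
int  pvm_dlopen(const char *libname); /* dynamically link a library at runtime    */
long pvm_write(int fd, const void *buf, unsigned long len);

int main(void)
{
    int fds[2];
    pvm_pipe(fds);

    if (pvm_fork() == 0) {                   /* child pVM (same trusted family) */
        pvm_dlopen("libfoo_v2.so");          /* online library update           */
        pvm_write(fds[1], "ready", 5);       /* UNIX-style IpC to the parent    */
        return 0;
    }
    /* parent pVM continues and reads from fds[0] ... */
    return 0;
}
```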

The authors demonstrate the performance of a prototype of KylinX that they have built. The prototype achieves performance comparable to Linux, forking a pVM in about 1.3 ms and linking a library to a running pVM in a few ms (process fork on Linux takes about 1 ms). Its IpC performance is also comparable to UNIX IPC. The authors also evaluate KylinX on real-world applications, such as a Redis server and a web server, where KylinX achieves higher applicability and performance than static Unikernels while retaining the isolation guarantees.

They intend to improve the performance of KylinX by adopting more efficient runtimes and adapting it to the MultiLibOS model, which would allow pVMs to span multiple machines.

Conclusion

From the advancements shown above, we see that new technologies mainly focus on improving existing ones, such as direct I/O in virtualization and virtual drives in cloud storage stacks.

coIOMMU improves the efficiency of memory management across a wide range of direct I/O usages at negligible cost, while also providing the security required by specific protection policies.

LeapIO proposes a new architecture to tackle the resource-hungry nature of cloud storage and the implementation of virtual drives and associated features. It facilitates the adoption of ARM SoCs for this at minimal overhead.

DVH shows that nested virtualization can approach native performance by letting the host hypervisor provide virtual hardware, such as virtual-passthrough devices, virtual timers, virtual IPIs and virtual idle, directly to nested VMs.

KylinX is a simplified virtualization architecture that strives to achieve a balance between the best of both Unikernel appliances (strong isolation) and processes (high flexibility/efficiency). It provides a process-like VM abstraction to allow simplified and efficient cloud virtualization.
