The Turtles Project: Design and Implementation of Nested Virtualization
------------------------------------------------------------------------

Nested virtualization built on single-level hardware virtualization support (Intel VMX).
Performance within 6-8% of single-level (non-nested) virtualization for common workloads.

Applications:
- A cloud provider can allow users to run their own hypervisors
- Live migration of hypervisors
- Trapping hypervisor-level rootkits
- Testing hypervisors
- Windows 7 uses a VM for its XP compatibility mode

Discuss Figure 1.
Discuss Figure 2.
Discuss Figure 3.

VMX trap and emulate: L0 emulates VMX instructions, updates the VMCS structures, and resumes L1.
Most VMX instructions executed by L1 cause an exit to L0 followed by a re-entry to L1.
Exception: vmlaunch and vmresume, where L0 enters L2 instead of re-entering L1.

VMCS shadowing: draw parallels with shadow page tables.
What if a trappable event in VMCS0->2 was not specified in VMCS1->2?
L0 handles it itself; from L1's point of view nothing happened.
What if a trappable event in VMCS0->2 was not specified in VMCS0->1?
L0 forwards the event to L1 by switching L1 to root mode.
(See the first C sketch below.)

Page tables:
- shadow-on-shadow
- shadow-on-EPT
- EPT-on-shadow (multi-dimensional paging)
Shadowing is expensive when the tables being shadowed are updated frequently; guest page tables
change often while EPT mappings rarely do, so it is better to do the shadowing at the lower
(EPT) level (roughly a 3x difference). (See the second C sketch below.)

VPIDs: L0 needs to map the VPIDs that L1 uses into valid L0 VPIDs. (See the third C sketch below.)

Device virtualization and IOMMUs: the IOMMU allows only a single level of address translation.
For a paravirtualized L1, L1 can use a hypercall to tell L0 the L1-to-L2 mapping.
For an unmodified guest, L0 emulates an IOMMU for L1; L1 programs that emulated IOMMU, and L0
combines those mappings with its own to form a shadow IOMMU.

Transitions between L1 and L2 are slower than transitions between L0 and L1.
Exit-handling code running in L1 is slower than the same code running in L0.

Hack: it was empirically noted that certain VMCS fields can be accessed with ordinary memory
read/write instructions instead of vmread/vmwrite. L0 uses this to improve performance by
writing multiple VMCS fields at once during the merge; the same observation can be used to
binary-translate L1's instructions into efficient reads/writes of the VMCS.

Table 2: kernbench shows 10.3% overhead over a single-level guest and about 20% over native;
SPECjbb (compute-intensive) shows less overhead.

Figure 5: the nested case has more than double the CPU utilization for kernbench
(5.17% versus 2.28%).

There is large variance in the handling times of different types of exits: the cost of an exit
is proportional to the number of privileged instructions the L1 handler performs
(exit multiplication). A PIO handler takes about 12K cycles with single-level virtualization
but about 192K cycles in the nested case.

The difference in execution time between L0 and L1 handlers has two causes:
-- handlers in L1 execute privileged instructions, each of which exits to L0
-- handlers in L1 run for a long time, so more external interrupts arrive during their runtime

Figure 7: an emulated NIC becomes CPU-bottlenecked even at a single level of virtualization;
paravirtual I/O is better, and direct device assignment is best. Direct-on-direct raises CPU
utilization from about 20% on bare metal to about 60% in the nested case
(2.9 GHz core, 1 Gbps network card).

Hypothesis: the large number of interrupt forwardings causes the high CPU utilization.
Try polling with varying packet sizes. With polling, CPU utilization is always 100% (the polling
thread consumes all available CPU), but throughput is better for small packets as well (Figure 8).

Figure 9: compares shadow-on-EPT with EPT-on-shadow (multi-dimensional paging). On kernbench the
latter is about 3.5x faster than the former, due to the large number of page faults kernbench
generates.
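To make the exit-reflection logic above concrete, here is a minimal C sketch. It is not KVM's
actual code: the struct layout, the single exit bitmap, and names such as switch_to_l1_root_mode
are invented for illustration (real VMX exit controls are spread across several VMCS fields).
It shows the decision L0 makes on an exit from L2: reflect the exit to L1 if L1 requested it in
VMCS1->2, otherwise handle it in L0 and resume L2 so that L1 never notices.

```c
#include <stdbool.h>
#include <stdint.h>

struct vmcs {                        /* toy stand-in for a real VMCS */
    uint64_t exit_bitmap;            /* exit reasons the owner wants to trap */
    uint32_t exit_reason;            /* filled in when an exit happens */
    uint64_t exit_qualification;
};

/* Did L1 ask, in VMCS1->2, to trap this exit reason? */
static bool l1_wants_exit(const struct vmcs *vmcs12, uint32_t reason)
{
    return reason < 64 && ((vmcs12->exit_bitmap >> reason) & 1);
}

/* Stubs standing in for the real mechanisms. */
static void l0_handle_exit(struct vmcs *vmcs02)         { (void)vmcs02; }
static void resume_l2(struct vmcs *vmcs02)              { (void)vmcs02; }
static void switch_to_l1_root_mode(struct vmcs *vmcs12) { (void)vmcs12; }

void on_exit_from_l2(struct vmcs *vmcs02, struct vmcs *vmcs12)
{
    uint32_t reason = vmcs02->exit_reason;

    if (l1_wants_exit(vmcs12, reason)) {
        /* Copy the exit state into VMCS1->2 so the exit looks to L1 exactly
         * like a normal exit from its own guest, then run L1's handler. */
        vmcs12->exit_reason        = reason;
        vmcs12->exit_qualification = vmcs02->exit_qualification;
        switch_to_l1_root_mode(vmcs12);
    } else {
        /* The event was trapped only because of VMCS0->1 (L0's own settings):
         * L0 handles it and goes straight back to L2. */
        l0_handle_exit(vmcs02);
        resume_l2(vmcs02);
    }
}
```

The two branches correspond to the two "what if" cases above: an event L1 did not ask for is
invisible to it, while an event L0 did not need for itself is forwarded to L1 in root mode.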
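For the page-table discussion, a second sketch of the compression idea behind multi-dimensional
paging: L0 builds EPT0->2 entries by composing EPT1->2 (maintained by L1) with EPT0->1
(maintained by L0). This is a simplified illustration, not the real algorithm: each translation
stage is modeled as a flat array rather than a multi-level EPT, and fill_ept02_entry and the
other names are invented.

```c
#include <stdint.h>

#define NPAGES  1024
#define INVALID UINT64_MAX

typedef uint64_t pfn_t;              /* page frame number */

struct flat_table {
    pfn_t map[NPAGES];               /* map[pfn] = next-level pfn, or INVALID */
};

/* On an EPT violation for L2 page l2_pfn, fill in one EPT0->2 entry by
 * walking EPT1->2 and then EPT0->1. Returns 0 on success, -1 if the fault
 * must instead be handled by L1 (missing EPT1->2 entry) or by L0 (missing
 * EPT0->1 entry). */
int fill_ept02_entry(const struct flat_table *ept12,  /* L2-physical -> L1-physical */
                     const struct flat_table *ept01,  /* L1-physical -> machine     */
                     struct flat_table *ept02,        /* L2-physical -> machine     */
                     pfn_t l2_pfn)
{
    if (l2_pfn >= NPAGES)
        return -1;

    pfn_t l1_pfn = ept12->map[l2_pfn];
    if (l1_pfn == INVALID || l1_pfn >= NPAGES)
        return -1;                   /* reflect the fault to L1 */

    pfn_t machine_pfn = ept01->map[l1_pfn];
    if (machine_pfn == INVALID)
        return -1;                   /* L0 must first map the L1 page */

    ept02->map[l2_pfn] = machine_pfn;  /* cache the composed translation */
    return 0;
}
```

Because EPT mappings change rarely, entries built this way stay valid for a long time, which is
why doing the shadowing at this level is so much cheaper than shadowing L2's page tables.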
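Similarly for VPIDs, a third toy sketch of the remapping L0 has to do so that L1's tags never
collide with L0's own. It assumes a single L1, uses a linear search, and ignores reclamation;
all names and sizes are illustrative.

```c
#include <stdint.h>

#define MAX_VPIDS 64

struct vpid_map {
    uint16_t l1_vpid[MAX_VPIDS];     /* VPID as assigned by L1 to its guest */
    uint16_t l0_vpid[MAX_VPIDS];     /* real VPID programmed into VMCS0->2  */
    int      used;
    uint16_t next_free;              /* next unused hardware VPID; must be
                                        initialized above the VPIDs L0 keeps
                                        for itself (L0 reserves VPID 0) */
};

/* Return the hardware VPID that L0 uses on behalf of L1's VPID, allocating
 * a fresh one the first time it is seen. */
uint16_t l0_vpid_for(struct vpid_map *m, uint16_t l1_vpid)
{
    for (int i = 0; i < m->used; i++)
        if (m->l1_vpid[i] == l1_vpid)
            return m->l0_vpid[i];

    uint16_t fresh = m->next_free++;
    if (m->used < MAX_VPIDS) {
        m->l1_vpid[m->used] = l1_vpid;
        m->l0_vpid[m->used] = fresh;
        m->used++;
    }
    return fresh;
}
```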
Running VMware as a guest hypervisor: VMware uses the VMX initialization instructions
(vmxon, vmxoff, vmptrld, vmclear) several times during L2 execution.

Notion of virtualization-friendly hypervisors.

Ask students to read through steps 1-10 of what happens when L2 executes a cpuid instruction
(rough sketch below).
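The following is not the paper's numbered steps 1-10, only a rough, runnable C illustration of
the overall flow and of why a single cpuid in L2 turns into many exits handled by L0
(exit multiplication). The count of vmread/vmwrite traps and all function names are arbitrary.

```c
#include <stdio.h>

static int l0_exits;                 /* hardware exits taken by L0 */

static void trap_to_l0(const char *why)
{
    l0_exits++;
    printf("L0 handles exit: %s\n", why);
}

static void l2_executes_cpuid(void)
{
    /* cpuid in L2 traps to L0 (only L0 runs in true root mode). */
    trap_to_l0("cpuid from L2");

    /* L0 sees that L1 asked to trap cpuid, copies the exit state into
     * VMCS1->2, and resumes L1 inside L1's exit handler. */

    /* L1's handler reads and updates VMCS1->2 with vmread/vmwrite; each of
     * those privileged instructions itself traps to L0 and is emulated there.
     * (Four is an arbitrary count chosen for illustration.) */
    for (int i = 0; i < 4; i++)
        trap_to_l0("vmread/vmwrite by L1");

    /* L1 emulates cpuid for its guest and issues vmresume ... */
    trap_to_l0("vmresume by L1");

    /* ... which L0 intercepts; L0 merges VMCS0->1 and VMCS1->2 into
     * VMCS0->2 and finally re-enters L2. */
}

int main(void)
{
    l2_executes_cpuid();
    printf("one cpuid in L2 -> %d exits handled by L0\n", l0_exits);
    return 0;
}
```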