The Turtles Project: Design and Implementation of Nested Virtualization
------------------------------------------------------------------------

Nested virtualization built on single-level hardware virtualization support (Intel VMX).
Performance within 6-8% of single-level (non-nested) virtualization for common workloads.

Applications:
- A cloud provider can allow users to run their own hypervisors
- Live migration of hypervisors
- Trapping hypervisor-level rootkits
- Testing hypervisors
- Windows 7 uses a VM for its XP compatibility mode

Discuss Figure 1.
Discuss Figure 2.
Discuss Figure 3.

VMX trap and emulate: L0 emulates VMX instructions, updates the VMCS structures, and resumes L1.
Most VMX instructions executed by L1 cause an exit to L0 followed by a re-entry to L1.
Exception: vmlaunch and vmresume, where L0 enters L2 instead of re-entering L1.

VMCS shadowing: draw parallels with shadow page tables.
What if a trappable event in VMCS0->2 was not specified in VMCS1->2?
L0 handles it itself; from L1's point of view nothing happened.
What if a trappable event in VMCS0->2 was not specified in VMCS0->1?
L0 forwards the event to L1 by switching L1 to root mode.
(See the first C sketch below.)

Page tables:
- shadow-on-shadow
- shadow-on-EPT
- EPT-on-shadow (multi-dimensional paging)
Shadowing is expensive when the tables being shadowed are updated frequently; guest page tables
change often while EPT mappings rarely do, so it is better to do the shadowing at the lower
(EPT) level (roughly a 3x difference). (See the second C sketch below.)

VPIDs: L0 needs to map the VPIDs that L1 uses into valid L0 VPIDs. (See the third C sketch below.)

Device virtualization and IOMMUs: the IOMMU allows only a single level of address translation.
For a paravirtualized L1, L1 can use a hypercall to tell L0 the L1-to-L2 mapping.
For an unmodified guest, L0 emulates an IOMMU for L1; L1 programs that emulated IOMMU, and L0
combines those mappings with its own to form a shadow IOMMU.

Transitions between L1 and L2 are slower than transitions between L0 and L1.
Exit-handling code running in L1 is slower than the same code running in L0.

Hack: it was empirically noted that certain VMCS fields can be accessed with ordinary memory
read/write instructions instead of vmread/vmwrite. L0 uses this to improve performance by
writing multiple VMCS fields at once during the merge; the same observation can be used to
binary-translate L1's instructions into efficient reads/writes of the VMCS.

Table 2: kernbench shows 10.3% overhead over a single-level guest and about 20% over native;
SPECjbb (compute-intensive) shows less overhead.

Figure 5: the nested case has more than double the CPU utilization for kernbench
(5.17% versus 2.28%).

There is large variance in the handling times of different types of exits: the cost of an exit
is proportional to the number of privileged instructions the L1 handler performs
(exit multiplication). A PIO handler takes about 12K cycles with single-level virtualization
but about 192K cycles in the nested case.

The difference in execution time between L0 and L1 handlers has two causes:
-- handlers in L1 execute privileged instructions, each of which exits to L0
-- handlers in L1 run for a long time, so more external interrupts arrive during their runtime

Figure 7: an emulated NIC becomes CPU-bottlenecked even at a single level of virtualization;
paravirtual I/O is better, and direct device assignment is best. Direct-on-direct raises CPU
utilization from about 20% on bare metal to about 60% in the nested case
(2.9 GHz core, 1 Gbps network card).

Hypothesis: the large number of interrupt forwardings causes the high CPU utilization.
Try polling with varying packet sizes. With polling, CPU utilization is always 100% (the polling
thread consumes all available CPU), but throughput is better for small packets as well (Figure 8).

Figure 9: compares shadow-on-EPT with EPT-on-shadow (multi-dimensional paging). On kernbench the
latter is about 3.5x faster than the former, due to the large number of page faults kernbench
generates.
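To make the exit-reflection logic above concrete, here is a minimal C sketch. It is not KVM's
actual code: the struct layout, the single exit bitmap, and names such as switch_to_l1_root_mode
are invented for illustration (real VMX exit controls are spread across several VMCS fields).
It shows the decision L0 makes on an exit from L2: reflect the exit to L1 if L1 requested it in
VMCS1->2, otherwise handle it in L0 and resume L2 so that L1 never notices.

```c
#include <stdbool.h>
#include <stdint.h>

struct vmcs {                        /* toy stand-in for a real VMCS */
    uint64_t exit_bitmap;            /* exit reasons the owner wants to trap */
    uint32_t exit_reason;            /* filled in when an exit happens */
    uint64_t exit_qualification;
};

/* Did L1 ask, in VMCS1->2, to trap this exit reason? */
static bool l1_wants_exit(const struct vmcs *vmcs12, uint32_t reason)
{
    return reason < 64 && ((vmcs12->exit_bitmap >> reason) & 1);
}

/* Stubs standing in for the real mechanisms. */
static void l0_handle_exit(struct vmcs *vmcs02)         { (void)vmcs02; }
static void resume_l2(struct vmcs *vmcs02)              { (void)vmcs02; }
static void switch_to_l1_root_mode(struct vmcs *vmcs12) { (void)vmcs12; }

void on_exit_from_l2(struct vmcs *vmcs02, struct vmcs *vmcs12)
{
    uint32_t reason = vmcs02->exit_reason;

    if (l1_wants_exit(vmcs12, reason)) {
        /* Copy the exit state into VMCS1->2 so the exit looks to L1 exactly
         * like a normal exit from its own guest, then run L1's handler. */
        vmcs12->exit_reason        = reason;
        vmcs12->exit_qualification = vmcs02->exit_qualification;
        switch_to_l1_root_mode(vmcs12);
    } else {
        /* The event was trapped only because of VMCS0->1 (L0's own settings):
         * L0 handles it and goes straight back to L2. */
        l0_handle_exit(vmcs02);
        resume_l2(vmcs02);
    }
}
```

The two branches correspond to the two "what if" cases above: an event L1 did not ask for is
invisible to it, while an event L0 did not need for itself is forwarded to L1 in root mode.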
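For the page-table discussion, a second sketch of the compression idea behind multi-dimensional
paging: L0 builds EPT0->2 entries by composing EPT1->2 (maintained by L1) with EPT0->1
(maintained by L0). This is a simplified illustration, not the real algorithm: each translation
stage is modeled as a flat array rather than a multi-level EPT, and fill_ept02_entry and the
other names are invented.

```c
#include <stdint.h>

#define NPAGES  1024
#define INVALID UINT64_MAX

typedef uint64_t pfn_t;              /* page frame number */

struct flat_table {
    pfn_t map[NPAGES];               /* map[pfn] = next-level pfn, or INVALID */
};

/* On an EPT violation for L2 page l2_pfn, fill in one EPT0->2 entry by
 * walking EPT1->2 and then EPT0->1. Returns 0 on success, -1 if the fault
 * must instead be handled by L1 (missing EPT1->2 entry) or by L0 (missing
 * EPT0->1 entry). */
int fill_ept02_entry(const struct flat_table *ept12,  /* L2-physical -> L1-physical */
                     const struct flat_table *ept01,  /* L1-physical -> machine     */
                     struct flat_table *ept02,        /* L2-physical -> machine     */
                     pfn_t l2_pfn)
{
    if (l2_pfn >= NPAGES)
        return -1;

    pfn_t l1_pfn = ept12->map[l2_pfn];
    if (l1_pfn == INVALID || l1_pfn >= NPAGES)
        return -1;                   /* reflect the fault to L1 */

    pfn_t machine_pfn = ept01->map[l1_pfn];
    if (machine_pfn == INVALID)
        return -1;                   /* L0 must first map the L1 page */

    ept02->map[l2_pfn] = machine_pfn;  /* cache the composed translation */
    return 0;
}
```

Because EPT mappings change rarely, entries built this way stay valid for a long time, which is
why doing the shadowing at this level is so much cheaper than shadowing L2's page tables.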
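Similarly for VPIDs, a third toy sketch of the remapping L0 has to do so that L1's tags never
collide with L0's own. It assumes a single L1, uses a linear search, and ignores reclamation;
all names and sizes are illustrative.

```c
#include <stdint.h>

#define MAX_VPIDS 64

struct vpid_map {
    uint16_t l1_vpid[MAX_VPIDS];     /* VPID as assigned by L1 to its guest */
    uint16_t l0_vpid[MAX_VPIDS];     /* real VPID programmed into VMCS0->2  */
    int      used;
    uint16_t next_free;              /* next unused hardware VPID; must be
                                        initialized above the VPIDs L0 keeps
                                        for itself (L0 reserves VPID 0) */
};

/* Return the hardware VPID that L0 uses on behalf of L1's VPID, allocating
 * a fresh one the first time it is seen. */
uint16_t l0_vpid_for(struct vpid_map *m, uint16_t l1_vpid)
{
    for (int i = 0; i < m->used; i++)
        if (m->l1_vpid[i] == l1_vpid)
            return m->l0_vpid[i];

    uint16_t fresh = m->next_free++;
    if (m->used < MAX_VPIDS) {
        m->l1_vpid[m->used] = l1_vpid;
        m->l0_vpid[m->used] = fresh;
        m->used++;
    }
    return fresh;
}
```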
Running VMware as a guest hypervisor: VMware uses the VMX initialization instructions
(vmxon, vmxoff, vmptrld, vmclear) several times during L2 execution.

Notion of virtualization-friendly hypervisors.

Ask students to read through steps 1-10 of what happens when L2 executes a cpuid instruction
(rough sketch below).
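The following is not the paper's numbered steps 1-10, only a rough, runnable C illustration of
the overall flow and of why a single cpuid in L2 turns into many exits handled by L0
(exit multiplication). The count of vmread/vmwrite traps and all function names are arbitrary.

```c
#include <stdio.h>

static int l0_exits;                 /* hardware exits taken by L0 */

static void trap_to_l0(const char *why)
{
    l0_exits++;
    printf("L0 handles exit: %s\n", why);
}

static void l2_executes_cpuid(void)
{
    /* cpuid in L2 traps to L0 (only L0 runs in true root mode). */
    trap_to_l0("cpuid from L2");

    /* L0 sees that L1 asked to trap cpuid, copies the exit state into
     * VMCS1->2, and resumes L1 inside L1's exit handler. */

    /* L1's handler reads and updates VMCS1->2 with vmread/vmwrite; each of
     * those privileged instructions itself traps to L0 and is emulated there.
     * (Four is an arbitrary count chosen for illustration.) */
    for (int i = 0; i < 4; i++)
        trap_to_l0("vmread/vmwrite by L1");

    /* L1 emulates cpuid for its guest and issues vmresume ... */
    trap_to_l0("vmresume by L1");

    /* ... which L0 intercepts; L0 merges VMCS0->1 and VMCS1->2 into
     * VMCS0->2 and finally re-enters L2. */
}

int main(void)
{
    l2_executes_cpuid();
    printf("one cpuid in L2 -> %d exits handled by L0\n", l0_exits);
    return 0;
}
```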