A Comparison of Software and Hardware Techniques for x86 Virtualization
------------------------------------------------------------------------

Theme of the paper: "Surprisingly, we find that the first-generation hardware support rarely offers performance advantages over existing software techniques."

Review the Popek-and-Goldberg requirements; emphasize safety. Classically virtualizable: an architecture that can be virtualized purely with trap-and-emulate.

De-privileging: traps can occur because of the instruction type itself (e.g., out) or because of access to a protected structure (e.g., the address range of a memory-mapped I/O device).

Primary and shadow structures: on-CPU state (privileged registers) and off-CPU state (page tables, memory-mapped regions of I/O devices).

Memory traces: the running example is x86 page tables. Traces keep shadow PTEs from becoming incoherent with guest PTEs. Distinguish true page faults from hidden page faults. Traces are expensive; at the other extreme, avoiding all use of traces causes either a large number of hidden faults or an expensive context switch to prevalidate shadow page tables for the new context. This three-way tradeoff (trace cost vs. hidden faults vs. context-switch cost) is hard to get right.

IBM's S/370 had an interpretive-execution mode (the SIE instruction) to make certain virtualization operations faster.

x86 obstacles to classical virtualization: visibility of privileged state (e.g., CPL) and lack of traps during unprivileged execution (e.g., popf).

A software interpreter (fetch-decode-execute loop) burns hundreds of physical instructions per guest instruction. The better alternative is binary translation.

VMware's VMM is a system-level binary translator: it cannot assume things like "return addresses are always produced by calls", and a buffer overflow attack in the guest should actually occur even when running virtualized.

Binary translation: subsetting, adaptive. A translation unit (TU) has a cap of 12 instructions; typically a TU is a basic block. Use "continuations" for jumps; point out [fallthroughAddr]. Chaining: the translator captures an execution trace of the guest code, so TC code has good icache locality if the first and subsequent executions follow similar paths through the guest code. Rare paths get placed away from the hot path (a nice side effect).

Discuss how memory is virtualized and how VMM memory is protected. %gs is used as an escape into the VMM's data structures.

Non-IDENT instructions: PC-relative addressing, direct control flow, indirect control flow, privileged instructions. Indirect control flow is costly (hash-table lookup). Privileged instructions can become either cheaper (e.g., cli becomes vcpu.IF = 0) or costlier (e.g., a context switch).

Binary translation costs for rdtsc: trap-and-emulate 2030 cycles, callout-and-emulate 1254, in-TC emulation 216.

Adaptive binary translation: eliminate traps from loads and stores, which are the most common trap reason. Draw Figure 1. Sketches of the translator loop and of adaptive retranslation follow.
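To make the TU/continuation/non-IDENT discussion concrete, here is a minimal C sketch of a translator loop. It is not VMware's code: the guest is modeled as pre-decoded instructions rather than raw x86 bytes, and GuestInsn, the instruction kinds, and the printed "emitted" code are illustrative assumptions. It only shows the shape of the algorithm: copy IDENT instructions, turn a privileged instruction like cli into cheap in-TC code (vcpu.IF = 0), and end the TU at a branch by emitting two continuations that chaining later patches to jump straight into translated successors.

    /* Minimal sketch of a BT-style translator loop (hypothetical types and
     * names, not VMware's code).  Guest code is modeled as pre-decoded
     * instructions instead of raw x86 bytes to keep the example
     * self-contained; the "emitted" TC code is just printed. */
    #include <stdio.h>

    #define TU_CAP 12                  /* a translation unit is capped at 12 insns */

    typedef enum { I_IDENT, I_CLI, I_JCC, I_END } Kind;

    typedef struct {
        Kind kind;
        int  target;                   /* taken-branch guest address (for I_JCC) */
    } GuestInsn;

    /* Translate one TU starting at guest address 'pc'.  Most instructions are
     * IDENT (copied unchanged); a privileged instruction like cli becomes cheap
     * in-TC code; the terminating branch ends the basic block and yields two
     * continuations, which chaining later patches to jump straight into the
     * translated successors. */
    static void translate_tu(const GuestInsn *guest, int pc)
    {
        for (int n = 0; n < TU_CAP; n++) {
            const GuestInsn *in = &guest[pc];
            switch (in->kind) {
            case I_IDENT:
                printf("  [ident]     copy guest insn at %d\n", pc);
                pc++;
                break;
            case I_CLI:
                printf("  [non-ident] cli -> vcpu.IF = 0   (no trap)\n");
                pc++;
                break;
            case I_JCC:
                printf("  [non-ident] jcc -> continuation %d, fallthroughAddr %d\n",
                       in->target, pc + 1);
                return;                /* basic block ends the TU */
            case I_END:
                return;
            }
        }
    }

    int main(void)
    {
        /* toy guest basic block: two ordinary insns, a cli, then a branch */
        GuestInsn bb[] = {
            { I_IDENT, 0 }, { I_IDENT, 0 }, { I_CLI, 0 },
            { I_JCC, 42 },  { I_END, 0 },
        };
        translate_tu(bb, 0);
        return 0;
    }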
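And a sketch of adaptive retranslation, again under assumed names and an assumed trap-count threshold (the paper does not specify one): a translated store that keeps faulting on a traced page is rewritten so that it calls out to the trace handler directly instead of taking a hardware protection trap on every execution.

    /* Sketch of adaptive retranslation (hypothetical names, and an assumed
     * trap-count threshold -- the paper does not give one).  A translated
     * store that keeps faulting on a traced page is rewritten so that it
     * calls out to the trace handler directly instead of taking a hardware
     * protection trap on every execution. */
    #include <stdio.h>

    #define RETRANSLATE_THRESHOLD 3

    typedef struct {
        int trap_count;                /* traps seen at this TC store site */
        int adapted;                   /* 1 once the site has been retranslated */
    } StoreSite;

    static void trace_handler(int addr)
    {
        /* propagate the guest PTE write to the shadow page table */
        printf("    trace handler: update shadow PTE for write at %#x\n", addr);
    }

    /* Called when the translated store at 'site' writes guest address 'addr'. */
    static void execute_store(StoreSite *site, int addr, int page_is_traced)
    {
        if (site->adapted) {
            trace_handler(addr);       /* adapted form: in-TC callout, no trap */
            return;
        }
        if (page_is_traced) {
            printf("  protection trap on traced page\n");
            trace_handler(addr);
            if (++site->trap_count >= RETRANSLATE_THRESHOLD) {
                printf("  retranslating store site into callout form\n");
                site->adapted = 1;
            }
        }
        /* stores to untraced pages just execute normally (IDENT) */
    }

    int main(void)
    {
        StoreSite site = { 0, 0 };
        for (int i = 0; i < 5; i++)
            execute_store(&site, 0x1000, 1);
        return 0;
    }

This is the mechanism that lets the software VMM absorb much of the pgfault and ptemod cost in the measurements below.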
Hardware virtualization:
- vmrun transfers from host to guest mode.
- Guest execution proceeds until some condition, expressed by the VMM using control bits of the VMCB, is reached. At that point the hardware performs an "exit" operation: it saves guest state to the VMCB, loads VMM-supplied state into the hardware, and resumes in host mode, now executing the VMM.
- The VMM programs the VMCB to exit on guest page faults, TLB flushes, address-space switches, I/O instructions, and accesses to privileged data structures such as page tables and memory-mapped devices. Discuss "why" for each of these.
- On a guest exit, the VMM reads the VMCB fields describing the conditions for the exit and vectors to the appropriate emulation code. Most of this emulation code is shared with the software VMM.
- Important: since current virtualization hardware does not include explicit support for MMU virtualization, the hardware VMM also inherits the software VMM's shadow page table implementation.

Example operation: process creation (notice the overhead of tracing and hidden page faults). This scenario also has true page faults, because the guest implements copy-on-write and demand paging.

Reducing the frequency of exits is the most important optimization for classical VMMs. Hence the hardware maintains shadow copies of some privileged state in the VMCB.

BT over VT: trap elimination, no need to decode the trapping instruction, and callouts can be avoided with in-TC emulation. VT over BT: better code density, precise exceptions, and system calls need no VMM intervention.

Macrobenchmarks:
- SPECint, SPECjbb: compute-intensive; not much to choose between the two VMMs.
- ApacheWin (single address space): hardware 67%, software 53% of native.
- ApacheLin (multiple address spaces): hardware 38%, software 45% of native. More address spaces mean more page-table switches, which mean more traps for the hardware VMM; the software VMM absorbs some of those traps through adaptive BT.
- PassMark: I/O-intensive, so both perform equally.
- LargeRAM: software much better than hardware, because adaptive BT hides some of the traps caused by paging activity.
- 2DGraphics: system-call-intensive; hardware performs better because system calls need no exit.
- compileWin: higher overhead than compileLin due to more IPC (context switches).
- forkwait: hardware much slower than software; lots of system calls, context switching, creation of address spaces, modification of traced page table entries, and injection of page faults.

"Nanobenchmarks":
- syscall: software much worse than hardware. Why?
- in/out: the software VMM executes this instruction fifteen times faster (!!) than native; the hardware VMM has almost 10x overhead.
- cr8wr: no effect for the hardware VMM (no exit); the software VMM is four times faster than native!
- call/ret: software VMM 8x slower; hardware VMM no difference. Why?
- pgfault: true page faults. The software VMM is faster due to adaptive BT; the hardware VMM has an 11x slowdown. EPTs have almost no slowdown.
- divzero: the software VMM has a slowdown because exceptions must be intercepted and forwarded by the VMM.
- ptemod: 10000x slowdown for the hardware VMM; the software VMM absorbs much of it using adaptive BT. Note that this slowdown does not exist with EPTs.

Figure 5 shows the relative costs of each operation during a Windows XP boot/halt sequence. Notice that the translation cost is very small; I/O and ptemod costs dominate for the hardware VMM.

Section 7.1: notice how microarchitectural changes can profoundly impact system performance. forkwait natively takes 6.02 seconds on the P4 but only 2.62 seconds on Core! Virtualization overheads are lower as well (Table 1).

Hardware VMM algorithmic changes: VMware has iteratively improved tracing performance, making traces relatively cheap, so traces are used generously wherever needed. With the hardware VMM this does not work well. VMware was contemplating changing its trace points for the hardware VMM to obtain better performance, but Intel's EPT made this unnecessary.

Introduce how nested page tables will work in hardware. This is a good time to preview the AMD nested page tables paper. A toy two-dimensional walk is sketched below.
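The preview can be anchored with a toy model. The sketch below (hypothetical names; single-level, 8-entry tables standing in for the real 4-level x86-64 tables) shows the two-dimensional walk the hardware performs with nested paging: every guest-physical address the walker touches, including the guest page-table entries themselves, is translated through the nested table. The VMM therefore no longer needs write-protection traces on guest page tables, but a TLB miss gets more expensive: with 4-level guest and nested tables, a full walk can take up to 24 memory references.

    /* Toy model of the two-dimensional walk done by hardware with nested
     * page tables (hypothetical names; single-level tables with 8-entry
     * pages stand in for the real 4-level x86-64 tables). */
    #include <stdio.h>

    #define PAGES        32
    #define GUEST_PT_GPA 4             /* gPA page that holds the guest page table */

    static int memory[PAGES][8];       /* host-physical memory: [page][slot] */
    static int nested_pt[PAGES];       /* nested table: gPA page -> hPA page */

    /* nested walk: guest-physical page -> host-physical page */
    static int nwalk(int gpa_page) { return nested_pt[gpa_page]; }

    /* full 2-D walk: guest-virtual page -> host-physical page.  Every
     * guest-physical address touched -- including the guest PTE itself --
     * goes through the nested table. */
    static int walk(int gva_page)
    {
        int pt_hpa   = nwalk(GUEST_PT_GPA);      /* find the guest PT in host memory */
        int gpa_page = memory[pt_hpa][gva_page]; /* read the guest PTE: gVA -> gPA   */
        return nwalk(gpa_page);                  /* translate the final gPA -> hPA   */
    }

    int main(void)
    {
        nested_pt[GUEST_PT_GPA] = 10;  /* guest PT really lives in hPA page 10   */
        nested_pt[7]            = 21;  /* guest data page 7 lives in hPA page 21 */
        memory[10][3]           = 7;   /* guest PTE: gVA page 3 -> gPA page 7    */

        printf("gVA page 3 -> hPA page %d\n", walk(3));
        return 0;
    }

The trade-off to highlight: traces, hidden faults, and ptemod exits disappear, but TLB misses become more expensive, which is the cost the follow-on AMD paper works to reduce.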