A Comparison of Software and Hardware Techniques for x86 Virtualization
------------------------------------------------------------------------

Theme of the paper: "Surprisingly, we find that the first-generation hardware support rarely offers performance advantages over existing software techniques."

Review the Popek-and-Goldberg requirements; emphasize safety. Classically virtualizable: an architecture that can be virtualized purely with trap-and-emulate.

De-privileging: traps can occur because of the instruction type itself (e.g., out) or because of access to a protected structure (e.g., the address range of a memory-mapped I/O device).

Primary and shadow structures: on-CPU state (privileged registers) and off-CPU state (page tables, memory-mapped regions of I/O devices).

Memory traces: the running example is x86 page tables. Traces keep shadow PTEs from becoming incoherent with guest PTEs. Distinguish true page faults from hidden page faults. Traces are expensive; at the other extreme, avoiding all use of traces causes either a large number of hidden faults or an expensive context switch to prevalidate shadow page tables for the new context. This three-way tradeoff (trace cost vs. hidden faults vs. context-switch cost) is hard to get right.

IBM's S/370 had an interpretive-execution mode (the SIE instruction) to make certain virtualization operations faster.

x86 obstacles to classical virtualization: visibility of privileged state (e.g., CPL) and lack of traps during unprivileged execution (e.g., popf).

A software interpreter (fetch-decode-execute loop) burns hundreds of physical instructions per guest instruction. The better alternative is binary translation.

VMware's VMM is a system-level binary translator: it cannot assume things like "return addresses are always produced by calls", and a buffer overflow attack in the guest should actually occur even when running virtualized.

Binary translation: subsetting, adaptive. A translation unit (TU) has a cap of 12 instructions; typically a TU is a basic block. Use "continuations" for jumps; point out [fallthroughAddr]. Chaining: the translator captures an execution trace of the guest code, so TC code has good icache locality if the first and subsequent executions follow similar paths through the guest code. Rare paths get placed away from the hot path (a nice side effect).

Discuss how memory is virtualized and how VMM memory is protected. %gs is used as an escape into the VMM's data structures.

Non-IDENT instructions: PC-relative addressing, direct control flow, indirect control flow, privileged instructions. Indirect control flow is costly (hash-table lookup). Privileged instructions can become either cheaper (e.g., cli becomes vcpu.IF = 0) or costlier (e.g., a context switch).

Binary translation costs for rdtsc: trap-and-emulate 2030 cycles, callout-and-emulate 1254, in-TC emulation 216.

Adaptive binary translation: eliminate traps from loads and stores, which are the most common trap reason. Draw Figure 1. Sketches of the translator loop and of adaptive retranslation follow.
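To make the TU/continuation/non-IDENT discussion concrete, here is a minimal C sketch of a translator loop. It is not VMware's code: the guest is modeled as pre-decoded instructions rather than raw x86 bytes, and GuestInsn, the instruction kinds, and the printed "emitted" code are illustrative assumptions. It only shows the shape of the algorithm: copy IDENT instructions, turn a privileged instruction like cli into cheap in-TC code (vcpu.IF = 0), and end the TU at a branch by emitting two continuations that chaining later patches to jump straight into translated successors.

    /* Minimal sketch of a BT-style translator loop (hypothetical types and
     * names, not VMware's code).  Guest code is modeled as pre-decoded
     * instructions instead of raw x86 bytes to keep the example
     * self-contained; the "emitted" TC code is just printed. */
    #include <stdio.h>

    #define TU_CAP 12                  /* a translation unit is capped at 12 insns */

    typedef enum { I_IDENT, I_CLI, I_JCC, I_END } Kind;

    typedef struct {
        Kind kind;
        int  target;                   /* taken-branch guest address (for I_JCC) */
    } GuestInsn;

    /* Translate one TU starting at guest address 'pc'.  Most instructions are
     * IDENT (copied unchanged); a privileged instruction like cli becomes cheap
     * in-TC code; the terminating branch ends the basic block and yields two
     * continuations, which chaining later patches to jump straight into the
     * translated successors. */
    static void translate_tu(const GuestInsn *guest, int pc)
    {
        for (int n = 0; n < TU_CAP; n++) {
            const GuestInsn *in = &guest[pc];
            switch (in->kind) {
            case I_IDENT:
                printf("  [ident]     copy guest insn at %d\n", pc);
                pc++;
                break;
            case I_CLI:
                printf("  [non-ident] cli -> vcpu.IF = 0   (no trap)\n");
                pc++;
                break;
            case I_JCC:
                printf("  [non-ident] jcc -> continuation %d, fallthroughAddr %d\n",
                       in->target, pc + 1);
                return;                /* basic block ends the TU */
            case I_END:
                return;
            }
        }
    }

    int main(void)
    {
        /* toy guest basic block: two ordinary insns, a cli, then a branch */
        GuestInsn bb[] = {
            { I_IDENT, 0 }, { I_IDENT, 0 }, { I_CLI, 0 },
            { I_JCC, 42 },  { I_END, 0 },
        };
        translate_tu(bb, 0);
        return 0;
    }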
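And a sketch of adaptive retranslation, again under assumed names and an assumed trap-count threshold (the paper does not specify one): a translated store that keeps faulting on a traced page is rewritten so that it calls out to the trace handler directly instead of taking a hardware protection trap on every execution.

    /* Sketch of adaptive retranslation (hypothetical names, and an assumed
     * trap-count threshold -- the paper does not give one).  A translated
     * store that keeps faulting on a traced page is rewritten so that it
     * calls out to the trace handler directly instead of taking a hardware
     * protection trap on every execution. */
    #include <stdio.h>

    #define RETRANSLATE_THRESHOLD 3

    typedef struct {
        int trap_count;                /* traps seen at this TC store site */
        int adapted;                   /* 1 once the site has been retranslated */
    } StoreSite;

    static void trace_handler(int addr)
    {
        /* propagate the guest PTE write to the shadow page table */
        printf("    trace handler: update shadow PTE for write at %#x\n", addr);
    }

    /* Called when the translated store at 'site' writes guest address 'addr'. */
    static void execute_store(StoreSite *site, int addr, int page_is_traced)
    {
        if (site->adapted) {
            trace_handler(addr);       /* adapted form: in-TC callout, no trap */
            return;
        }
        if (page_is_traced) {
            printf("  protection trap on traced page\n");
            trace_handler(addr);
            if (++site->trap_count >= RETRANSLATE_THRESHOLD) {
                printf("  retranslating store site into callout form\n");
                site->adapted = 1;
            }
        }
        /* stores to untraced pages just execute normally (IDENT) */
    }

    int main(void)
    {
        StoreSite site = { 0, 0 };
        for (int i = 0; i < 5; i++)
            execute_store(&site, 0x1000, 1);
        return 0;
    }

This is the mechanism that lets the software VMM absorb much of the pgfault and ptemod cost in the measurements below.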
Hardware virtualization:
- vmrun transfers from host to guest mode.
- Guest execution proceeds until some condition, expressed by the VMM using control bits of the VMCB, is reached. At that point the hardware performs an "exit" operation: it saves guest state to the VMCB, loads VMM-supplied state into the hardware, and resumes in host mode, now executing the VMM.
- The VMM programs the VMCB to exit on guest page faults, TLB flushes, address-space switches, I/O instructions, and accesses to privileged data structures such as page tables and memory-mapped devices. Discuss "why" for each of these.
- On a guest exit, the VMM reads the VMCB fields describing the conditions for the exit and vectors to the appropriate emulation code. Most of this emulation code is shared with the software VMM.
- Important: since current virtualization hardware does not include explicit support for MMU virtualization, the hardware VMM also inherits the software VMM's shadow page table implementation.

Example operation: process creation (notice the overhead of tracing and hidden page faults). This scenario also has true page faults, because the guest implements copy-on-write and demand paging.

Reducing the frequency of exits is the most important optimization for classical VMMs. Hence the hardware maintains shadow copies of some privileged state in the VMCB.

BT over VT: trap elimination, no need to decode the trapping instruction, and callouts can be avoided with in-TC emulation. VT over BT: better code density, precise exceptions, and system calls need no VMM intervention.

Macrobenchmarks:
- SPECint, SPECjbb: compute-intensive; not much to choose between the two VMMs.
- ApacheWin (single address space): hardware 67%, software 53% of native.
- ApacheLin (multiple address spaces): hardware 38%, software 45% of native. More address spaces mean more page-table switches, which mean more traps for the hardware VMM; the software VMM absorbs some of those traps through adaptive BT.
- PassMark: I/O-intensive, so both perform equally.
- LargeRAM: software much better than hardware, because adaptive BT hides some of the traps caused by paging activity.
- 2DGraphics: system-call-intensive; hardware performs better because system calls need no exit.
- compileWin: higher overhead than compileLin due to more IPC (context switches).
- forkwait: hardware much slower than software; lots of system calls, context switching, creation of address spaces, modification of traced page table entries, and injection of page faults.

"Nanobenchmarks":
- syscall: software much worse than hardware. Why?
- in/out: the software VMM executes this instruction fifteen times faster (!!) than native; the hardware VMM has almost 10x overhead.
- cr8wr: no effect for the hardware VMM (no exit); the software VMM is four times faster than native!
- call/ret: software VMM 8x slower; hardware VMM no difference. Why?
- pgfault: true page faults. The software VMM is faster due to adaptive BT; the hardware VMM has an 11x slowdown. EPTs have almost no slowdown.
- divzero: the software VMM has a slowdown because exceptions must be intercepted and forwarded by the VMM.
- ptemod: 10000x slowdown for the hardware VMM; the software VMM absorbs much of it using adaptive BT. Note that this slowdown does not exist with EPTs.

Figure 5 shows the relative costs of each operation during a Windows XP boot/halt sequence. Notice that the translation cost is very small; I/O and ptemod costs dominate for the hardware VMM.

Section 7.1: notice how microarchitectural changes can profoundly impact system performance. forkwait natively takes 6.02 seconds on the P4 but only 2.62 seconds on Core! Virtualization overheads are lower as well (Table 1).

Hardware VMM algorithmic changes: VMware has iteratively improved tracing performance, making traces relatively cheap, so traces are used generously wherever needed. With the hardware VMM this does not work well. VMware was contemplating changing its trace points for the hardware VMM to obtain better performance, but Intel's EPT made this unnecessary.

Introduce how nested page tables will work in hardware. This is a good time to preview the AMD nested page tables paper. A toy two-dimensional walk is sketched below.
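The preview can be anchored with a toy model. The sketch below (hypothetical names; single-level, 8-entry tables standing in for the real 4-level x86-64 tables) shows the two-dimensional walk the hardware performs with nested paging: every guest-physical address the walker touches, including the guest page-table entries themselves, is translated through the nested table. The VMM therefore no longer needs write-protection traces on guest page tables, but a TLB miss gets more expensive: with 4-level guest and nested tables, a full walk can take up to 24 memory references.

    /* Toy model of the two-dimensional walk done by hardware with nested
     * page tables (hypothetical names; single-level tables with 8-entry
     * pages stand in for the real 4-level x86-64 tables). */
    #include <stdio.h>

    #define PAGES        32
    #define GUEST_PT_GPA 4             /* gPA page that holds the guest page table */

    static int memory[PAGES][8];       /* host-physical memory: [page][slot] */
    static int nested_pt[PAGES];       /* nested table: gPA page -> hPA page */

    /* nested walk: guest-physical page -> host-physical page */
    static int nwalk(int gpa_page) { return nested_pt[gpa_page]; }

    /* full 2-D walk: guest-virtual page -> host-physical page.  Every
     * guest-physical address touched -- including the guest PTE itself --
     * goes through the nested table. */
    static int walk(int gva_page)
    {
        int pt_hpa   = nwalk(GUEST_PT_GPA);      /* find the guest PT in host memory */
        int gpa_page = memory[pt_hpa][gva_page]; /* read the guest PTE: gVA -> gPA   */
        return nwalk(gpa_page);                  /* translate the final gPA -> hPA   */
    }

    int main(void)
    {
        nested_pt[GUEST_PT_GPA] = 10;  /* guest PT really lives in hPA page 10   */
        nested_pt[7]            = 21;  /* guest data page 7 lives in hPA page 21 */
        memory[10][3]           = 7;   /* guest PTE: gVA page 3 -> gPA page 7    */

        printf("gVA page 3 -> hPA page %d\n", walk(3));
        return 0;
    }

The trade-off to highlight: traces, hidden faults, and ptemod exits disappear, but TLB misses become more expensive, which is the cost the follow-on AMD paper works to reduce.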