Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor
-------------------------------

Primary need for the hosted architecture: PC hardware diversity.

What is a world switch? A page table switch plus saving and restoring hardware registers (including some control registers, which can be expensive to access), plus the hidden cost of the TLB flush.

Entities: Guest, VMM, Host OS, VMApp, VMNet driver (in the host kernel). What is each one's function? At what privilege level does each execute?

VMNet is the interface between VMApp and the host. VMApp can call select() on the VMNet fd to find out whether a packet is waiting for it; VMNet determines whether an incoming packet belongs to the guest, and hence is a VMApp packet (polling sketch below).

VMApp performs I/O on behalf of the virtual machine. Advantage: it can use the host's syscall interface. All OSes let a hardware-independent program access the underlying devices through some interface. Example: a generic CD-ROM interface vs. the SCSI/IDE CD-ROM interface. Write a driver for each device versus write a driver for each class of devices (the latter is what VMApp does).

All device interrupts are handled by the host: the VMM yields control to the host OS on receiving a hardware interrupt. The hardware interrupt is reasserted in the host world so that the host OS processes the interrupt as if it came directly from hardware.

What is the VMM? A kernel module, a device driver, whatever the host OS wants to call it. It has all the power, but it is written to cooperate with the host. The VMM behaves as though the VM is just another process: the VM gets scheduled by the host OS's scheduler, and the VM's pages may be swapped in and out by the host OS. Downside: the VM is at the mercy of the host OS.

Virtual device choice: just emulate an old device that all OSes are likely to have a driver for. Is device performance important? Yes, but only at order-of-magnitude granularity (so that the guest's device driver does not do things unnecessarily slowly). Supporting no other device greatly restricts the set of emulated ports and code paths, which is useful for optimization.

What are the most significant sources of overhead in I/O virtualization on a hosted architecture? World switches, trap costs, and copying from guest physical memory to the host kernel's buffers.

Bridged vs. NAT: the paper focuses on bridged. Why does bridged mode need the host NIC in promiscuous mode? Inbound frames are addressed to the guest's MAC, not the host's, so the NIC would otherwise drop them (sketch below).

Show the state diagram of the virtualized NIC. The inputs are IN/OUT instructions; in certain states, the frame goes out on the network, and when the send completes an interrupt is delivered to the guest. The virtual world emulates the same thing: use write() on the VMNet fd to send the packet out, then send an interrupt back to the guest (in the unoptimized version, this is done synchronously). Accesses to the Lance's address register are handled completely within the VMM, while all accesses to the data register switch back to handling code in VMApp. Why? Address register accesses are guaranteed not to cause a data transfer; data register accesses may require a packet send or a read of incoming network traffic (dispatch sketch below).

Overheads on receive: a world switch, up to three interrupt handlers (VMM, host, and guest), two device drivers (guest and host), and an extra copy.

nettest: what does it do? Mention ACKs. Use rdtsc to plot Figure 5 (timing sketch below). Figure 5 breakdown of a packet send:
  0.57: checking the I/O port to identify the device; copying data to the host kernel's buffer.
  1.23: checking which device; system call into the VMNet driver.
  17.55: emulating the device; writing packet fields (sending the packet on the network).
  2.97: collating results; copying data back.
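A minimal sketch of how such a breakdown can be collected with rdtsc: sample the cycle counter at each stage boundary and difference the samples. The stage names mirror the Figure 5 segments above; the inline-asm helper is the standard x86 pattern, not VMware's actual instrumentation.

    /* Sketch: rdtsc-based timing of the packet-send path (stage names from notes). */
    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void) {
        uint64_t t[5];

        t[0] = rdtsc();
        /* ... check the I/O port to identify the device, copy data out ... */
        t[1] = rdtsc();
        /* ... pick the device, system call into the VMNet driver ... */
        t[2] = rdtsc();
        /* ... emulate the device, write packet fields, send on the network ... */
        t[3] = rdtsc();
        /* ... collate results, copy data back ... */
        t[4] = rdtsc();

        const char *stage[] = {"port check + copy", "device demux + syscall",
                               "device emulation + send", "collate + copy back"};
        for (int i = 0; i < 4; i++)
            printf("%-26s %llu cycles\n", stage[i],
                   (unsigned long long)(t[i + 1] - t[i]));
        return 0;
    }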
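A sketch of VMApp's polling path described earlier, assuming a hypothetical /dev/vmnet0 device node: select() on the VMNet fd tells VMApp whether a guest-bound packet is waiting, and read() pulls it in.

    /* Sketch: VMApp polling VMNet for a guest-bound packet (illustrative only). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/vmnet0", O_RDWR);   /* hypothetical VMNet device node */
        if (fd < 0) { perror("open"); return 1; }

        for (;;) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);

            /* Block until VMNet has a packet destined for this guest.
             * Each call is a full system call, and the kernel rescans
             * the fd set every time -- part of why select() is costly. */
            if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0) break;

            if (FD_ISSET(fd, &rfds)) {
                char frame[1514];                 /* max Ethernet frame */
                ssize_t n = read(fd, frame, sizeof frame);
                if (n > 0) {
                    /* Copy into guest physical memory and ask the VMM
                     * to raise the virtual Lance receive interrupt. */
                }
            }
        }
        close(fd);
        return 0;
    }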
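Why promiscuous mode matters for bridging, as a sketch. VMNet sets this up in-kernel; the userspace ioctl equivalent below (with an assumed interface name) shows the same effect.

    /* Sketch: putting the host NIC into promiscuous mode (VMNet does this
     * in-kernel; the userspace ioctl equivalent is shown for illustration). */
    #include <net/if.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int enable_promisc(const char *ifname) {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) return -1;

        struct ifreq ifr;
        memset(&ifr, 0, sizeof ifr);
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

        if (ioctl(s, SIOCGIFFLAGS, &ifr) < 0) { close(s); return -1; }
        /* Without IFF_PROMISC the NIC drops frames whose destination MAC
         * is the guest's rather than the host's, so VMNet would never
         * see the guest's inbound traffic. */
        ifr.ifr_flags |= IFF_PROMISC;
        int rc = ioctl(s, SIOCSIFFLAGS, &ifr);
        close(s);
        return rc;
    }

    int main(void) {
        return enable_promisc("eth0") == 0 ? 0 : 1;  /* interface name assumed */
    }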
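A sketch of the address-register/data-register split in the Lance emulation. The RAP/RDP roles and the TDMD bit follow the Lance programming model (RAP selects a CSR, RDP accesses it, TDMD is bit 3 of CSR0); the port offsets, handler names, and IRQ number are illustrative, not VMware's code.

    /* Sketch of the virtual Lance I/O-port dispatch (offsets/names hypothetical). */
    #include <stdint.h>

    #define LANCE_RAP  0x02   /* register address port: selects a CSR   */
    #define LANCE_RDP  0x00   /* register data port: reads/writes a CSR */

    struct lance_state { uint16_t rap; uint16_t csr[4]; };

    /* Stubs for the expensive parts: the world switch to VMApp (which
     * write()s the frame to VMNet) and the virtual interrupt delivery. */
    static void world_switch_to_vmapp(void) { }
    static void raise_guest_irq(int irq)    { (void)irq; }

    /* Called by the VMM when the guest executes OUT to a Lance port. */
    static void lance_out(struct lance_state *s, uint16_t port, uint16_t val) {
        switch (port) {
        case LANCE_RAP:
            /* Address-register access: guaranteed not to transfer data,
             * so it is handled entirely inside the VMM -- no world switch. */
            s->rap = val;
            break;
        case LANCE_RDP:
            s->csr[s->rap] = val;
            if (s->rap == 0 && (val & 0x0008)) {   /* CSR0 TDMD: transmit demand */
                /* Data-register access that starts a send: switch back to
                 * VMApp to put the packet on the network, then raise the
                 * send-complete interrupt in the guest (synchronously, in
                 * the unoptimized version). */
                world_switch_to_vmapp();
                raise_guest_irq(9);                /* IRQ number illustrative */
            }
            break;
        }
    }

    int main(void) {
        struct lance_state s = {0};
        lance_out(&s, LANCE_RAP, 0);        /* guest selects CSR0     */
        lance_out(&s, LANCE_RDP, 0x0008);   /* guest sets TDMD: send! */
        return 0;
    }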
Time-based sampling for profiling IRQ processing: depending on where the interrupt is received, it may require one or two world switches. The first "real" processing of the IRQ is always done by the host; VMM interrupt processing just forwards it. Interrupt handling routines typically execute privileged instructions, which usually require traps and often require world switches (e.g., I/O to the PIC). Most of the I/O instructions are to the PIC; draw a diagram to explain. Most IRET instructions are returns from interrupt handlers (what are the other IRETs? e.g., the scheduler). Basically, we would like to lower the number of interrupt handling calls. Servicing an interrupt taken in the VMM world costs more because of the extra world switch: any interrupt in the VMM has to cause a switch to the host, because the VMM knows nothing about device handling.

VMApp gets notified by the VMM on an interrupt. VMApp can also come into the picture by periodically calling select(). What is select()? Why is it expensive? (It is a system call, and the kernel must copy in and scan the fd sets on every invocation.)

Optimization #1: emulate the NIC in the VMM as much as possible. Translate I/O instructions into mov instructions (sketch below).

Optimization #2: as a follow-up to #1, do not switch worlds on every packet that needs to be sent; queue up to three packets. The world switches on every timer tick anyway. A corollary benefit: transmitting multiple packets at once increases the probability that native send-complete interrupts are taken while executing in the host world (sketch below).

Optimization #3: reduce host system calls by establishing a piece of shared memory between VMApp and the VMNet driver (sketch below). This could be done in general (e.g., for a web server) but usually isn't, since it involves a special kernel module.

How is guest idle time measured? By actually running an idle loop in the VMM.

Why does throughput increase with data size? Larger packets simply mean less overhead per MB.

CPU idle time is still much lower for the virtualized guest than for native. Why? One, the guest is at the mercy of the host OS scheduler; two, the heavy I/O itself eats CPU; three, the guest is being binary translated.

Performance enhancements:
1. Reduce CPU virtualization overhead: replace PIC accesses with mov instructions (basically, reduce callouts), since the PIC is already emulated in the VMM.
2. Modify the guest OS: don't switch page tables for the idle thread.
3. A paravirtualized NIC.
4. Specialize the host OS to make VMApp faster (boooo... bad idea).
5. Implement device drivers in the VMM: impractical. Basically, the best you can do is move as much functionality into the VMM as possible.
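Optimization #1 as a before/after sketch: a guest OUT that used to trap into the emulator becomes, after binary translation, a plain mov into the VMM's in-memory image of the device. The shadow structure and function names are invented for illustration.

    /* Sketch of Optimization #1: trapping I/O becomes plain memory moves. */
    #include <stdint.h>
    #include <stdio.h>

    /* In-VMM memory image of the virtual NIC/PIC registers (illustrative). */
    static struct { uint16_t rap; uint16_t csr[4]; } shadow;

    /* Slow path: the guest's OUT faults, the VMM decodes and emulates it.
     * Cost: a trap (hundreds of cycles), possibly a world switch. */
    static void out_via_trap(uint16_t val) { shadow.rap = val; }

    /* Fast path: the binary translator has established that this OUT only
     * updates device state held in the VMM, so the code it emits is just
     * a mov into the shadow image -- no trap, no callout. */
    static inline void out_translated(uint16_t val) { shadow.rap = val; }

    int main(void) {
        out_via_trap(1);      /* before optimization #1 */
        out_translated(2);    /* after: same effect, no trap */
        printf("rap=%u\n", shadow.rap);
        return 0;
    }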
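A sketch of Optimization #2's transmit queue: buffer outgoing frames inside the VMM and world-switch only when the queue fills (three packets, per the paper) or when the timer tick forces a switch anyway. Helper names and sizes are assumptions.

    /* Sketch of Optimization #2: queue transmits, flush when full or on tick. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define TXQ_DEPTH 3            /* the paper queues up to 3 packets */
    #define MAX_FRAME 1514

    struct txq {
        uint8_t frame[TXQ_DEPTH][MAX_FRAME];
        size_t  len[TXQ_DEPTH];
        int     count;
    };

    /* Stub: in reality a world switch to VMApp, which write()s each frame. */
    static void world_switch_and_send(struct txq *q) { (void)q; }

    static void flush_txq(struct txq *q) {
        if (q->count == 0) return;
        world_switch_and_send(q);  /* one switch amortized over <= 3 packets */
        q->count = 0;
    }

    /* Called in the VMM when the guest hands the virtual NIC a frame. */
    static void vmm_transmit(struct txq *q, const uint8_t *data, size_t len) {
        memcpy(q->frame[q->count], data, len);
        q->len[q->count] = len;
        if (++q->count == TXQ_DEPTH)
            flush_txq(q);          /* queue full: pay for the switch now */
        /* Otherwise do nothing: the next timer tick switches worlds anyway,
         * and its handler calls flush_txq() for us. */
    }

    int main(void) {
        struct txq q = { .count = 0 };
        uint8_t pkt[64] = {0};
        for (int i = 0; i < 4; i++)   /* four sends -> one flush at three */
            vmm_transmit(&q, pkt, sizeof pkt);
        flush_txq(&q);                /* the timer tick would do this */
        return 0;
    }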
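A sketch of Optimization #3, assuming an invented shared-page layout: VMApp mmap()s a region exported by the VMNet driver and tests flags in it with plain loads, instead of paying for a select() system call per check.

    /* Sketch of Optimization #3: shared-memory flags replace per-check syscalls. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical page layout shared between VMApp and the VMNet module. */
    struct vmnet_shared {
        volatile uint32_t rx_pending;   /* set by VMNet: packet waiting   */
        volatile uint32_t tx_space;     /* set by VMNet: room to transmit */
    };

    int main(void) {
        int fd = open("/dev/vmnet0", O_RDWR);          /* hypothetical node */
        if (fd < 0) return 1;
        struct vmnet_shared *sh = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (sh == MAP_FAILED) return 1;

        for (;;) {
            /* A memory load instead of a select() system call; VMApp only
             * enters the kernel when there is real work (and would fall
             * back to a blocking select() when idle, rather than spin). */
            if (sh->rx_pending) {
                char frame[1514];
                if (read(fd, frame, sizeof frame) > 0) {
                    /* ... copy into guest memory, raise the virtual IRQ ... */
                }
            }
        }
    }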