Xen and the Art of Virtualization
---------------------------------

Main differences between Xen and Linux:
-- In Xen, each instance exports an ABI identical to a non-virtualized Linux, so the same apps can run on both!
-- Complete isolation implies no configuration headaches due to multiple installations, etc. (e.g., the Windows registry).
-- Performance isolation! e.g., disk activity, memory consumption, CPU consumption (scheduler complications). The problem stems from unclear boundaries within an OS.

Existing techniques:
- Resource containers for performance isolation (getrlimit, setrlimit).
- Micro-kernels: allow QoS to be implemented among processes by correct accounting, as most operations are done in the process context (exokernel, Nemesis). Discuss micro-kernel architecture.
- Change the ABI (Denali) to make instances more isolated from each other. How will you change the ABI? What is the issue?

Tradeoff in Xen: running/spawning OS instances is more expensive than running/forking processes.

Full virtualization? What are shadow page tables? What does trapping every update mean? What is the other option -- lazily trap on the first read attempt (usually more complicated and not worth it)?

Examples where the guest OS would like to see real resources? TCP timeouts, superpages (for performance), page colouring (for performance through better cache management).

Denali -- single-user, single-application instances. Unpopular! Xen -- each instance has "resource guarantees" for memory and disk. Denali virtualizes namespaces to some private naming; Xen uses the existing namespaces and applies only access control within the hypervisor.

Why is x86 the worst case for memory virtualization? Efficiently virtualizing hardware page tables is more difficult than virtualizing a software-managed TLB. Why? Software-managed TLB instructions are explicit and can be translated explicitly. On the other hand, hardware-managed page tables need to be emulated using shadow page tables and traps.

How is a tagged TLB useful?
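A toy sketch of the trap-on-update shadow-page-table idea mentioned above (names and structure are illustrative, not Xen's code): every guest write to its page table traps to the hypervisor, which translates the guest's pseudo-physical frame to a machine frame and refuses mappings to frames the guest does not own.

```python
# Toy model of trap-on-update shadow page tables (illustrative only).
# The guest edits the page table it believes in; each write "traps" to
# the hypervisor, which validates it and installs the real translation
# in the shadow table that the MMU actually walks.

class ShadowMMU:
    def __init__(self, p2m):
        self.p2m = p2m          # guest pseudo-physical frame -> machine frame
        self.guest_pt = {}      # what the guest believes: vpn -> guest pfn
        self.shadow_pt = {}     # what the MMU walks: vpn -> machine frame

    def guest_pt_write(self, vpn, gpfn):
        """Trap taken on every guest page-table update."""
        if gpfn not in self.p2m:
            raise PermissionError("guest mapped a frame it does not own")
        self.guest_pt[vpn] = gpfn
        self.shadow_pt[vpn] = self.p2m[gpfn]   # propagate validated entry

    def translate(self, vpn):
        return self.shadow_pt[vpn]             # hardware sees only the shadow

mmu = ShadowMMU(p2m={0: 700, 1: 312})
mmu.guest_pt_write(vpn=5, gpfn=1)
assert mmu.translate(5) == 312
```

The cost being discussed in the notes is exactly the trap on every update; the lazy alternative defers shadow construction until a translation is first needed.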
Avoids TLB flushes across world switches. Xen reserves the top 64MB of address space in each instance's page table; hence, entering or leaving the hypervisor is easy! Who else does this? Linux/Windows/VMware. Who does not? Exokernel.

The guest OS registers its page table with Xen and relinquishes write privileges on page tables. Why? For isolation: otherwise, a guest could potentially add page-table entries mapping pages that belong to other guests. Hence, all updates must be validated by Xen. A process in an OS is never allowed to change its page table explicitly either. It can only do it implicitly -- how?

How does x86 segmentation work? The guest OS runs in ring 1. Why not run it in ring 3? To provide separation between the guest app and the guest OS. What kind of separation? mem/IO/etc. (the u/s bit only checks whether cpl=3).

What is the hlt instruction? How does it get emulated?

The IDT is registered with Xen. The interrupt stack frame is identical to the hardware interrupt stack frame, except for #PF, where CR2 is pushed onto the stack (as the guest cannot read the hardware CR2). How is a syscall executed? Usually via a software exception (int 0x80 on Linux). How is it executed under Xen? By converting it into an indirect function call through an exception table previously installed by the guest and validated by Xen. The #PF handler still needs to go through Xen, because CR2 can only be read at ring 0: Xen reads the hardware CR2, stores it in the interrupt stack frame, and transfers control to the guest OS handler in ring 1. How is an exception propagated? It reaches Xen, which then calls the guest OS handler. Double faults? Xen terminates the guest instance on a double fault because it is usually not expected.

I/O is done through shared-memory asynchronous buffer-descriptor rings. Asynchronous "signals" can be delivered to the guest instance; the guest may "hold off" signals to perform more efficient batch processing.

What is needed to port a guest to Xen? Page-handling code, initial bootup code, hardware-specific privileged instructions.
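Since the guest relinquishes write access to its page tables, updates go through a validated, batched hypercall. A minimal sketch of that paravirtualized path (the class and method names are made up for illustration, not Xen's actual interface):

```python
# Sketch of paravirtualized page-table updates (illustrative, not Xen's
# API): the guest queues updates and flushes them with one "hypercall";
# the hypervisor validates every entry before touching the real table.

class Hypervisor:
    def __init__(self, owned_frames):
        self.owned = owned_frames     # machine frames this guest may map
        self.pt = {}                  # the real (machine) page table

    def mmu_update(self, updates):    # one hypercall covers a whole batch
        for vpn, mfn in updates:
            if mfn not in self.owned:
                raise PermissionError(f"frame {mfn} not owned by guest")
        self.pt.update(updates)       # apply only after all entries validate

class Guest:
    def __init__(self, hyp):
        self.hyp = hyp
        self.queue = []               # batching amortizes the hypercall cost

    def set_pte(self, vpn, mfn):
        self.queue.append((vpn, mfn))

    def flush(self):
        self.hyp.mmu_update(self.queue)
        self.queue.clear()

hyp = Hypervisor(owned_frames={100, 101, 102})
g = Guest(hyp)
g.set_pte(0, 100)
g.set_pte(1, 101)
g.flush()
assert hyp.pt == {0: 100, 1: 101}
```

Contrast with the shadow-table model: here there is no trap per write; the guest cooperates and pays one explicit, validated call per batch.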
The management layer is another guest instance (Domain0 or Dom0). Dom0 communicates with Xen through a control interface. What can the control interface do? Create/terminate instances, control scheduling, memory allocation, physical devices, etc. Virtual devices = VBDs (virtual block devices) and VIFs (virtual interface cards).

Control transfer = hypercalls and events. Examples of hypercalls? Page-table updates (need validation by Xen). Examples of events? Timer interrupt, device-completion interrupt.

"We attempt to minimize the work required to demultiplex data to a specific domain when an interrupt is received from a device." Why? So that most of the processing is accounted to the respective domain (performance isolation). Similarly, they use instance pages for I/O as far as possible (later they mention that they keep only about 20kB of state per instance inside Xen).

The ring buffer, shared between requests and responses, allows the best utilization of space, and is a generic interface for many I/O paradigms. How is batching done? For requests, a domain may enqueue multiple entries before invoking the hypercall. Similarly, a domain can defer delivery of a notification event by specifying a threshold number of responses.

Scheduling: need low-latency wakeup (dispatch). Why? Interactivity! Refer to the network ping-response experiment later in the paper. Real time (gettimeofday) can be synchronized with an external NTP service. Virtual time is the time the domain actually gets on the CPU; this is typically used by the guest scheduler to schedule applications. Why can't the guest scheduler use real time to schedule apps? Because real time is misleading if the guest has been descheduled in between.

Full virtualization: guest page tables versus MMU-visible shadow page tables require explicit propagation of dirty and accessed bits. Frame types: PD, PT, LDT, GDT, RW. On an app context switch, Xen can simply update the pointer to a previously validated PD frame! The frame type dictates what operations can/cannot be done on a frame.
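The batching on the request side and the notification threshold on the response side can be seen in one toy descriptor ring (field names are illustrative; the real interface lives in Xen's ring macros):

```python
# Toy shared descriptor ring (illustrative; not Xen's actual layout).
# Requests and responses share one circular buffer; producer/consumer
# indices advance modulo the ring size, and the guest is notified only
# after a threshold number of responses has accumulated.

class DescriptorRing:
    def __init__(self, size, notify_threshold=1):
        self.buf = [None] * size
        self.req_prod = self.req_cons = 0     # request producer/consumer
        self.rsp_prod = self.rsp_cons = 0     # response producer/consumer
        self.notify_threshold = notify_threshold
        self.events = 0                       # notification events delivered

    def push_request(self, desc):             # guest side; no hypercall yet
        self.buf[self.req_prod % len(self.buf)] = desc
        self.req_prod += 1

    def hypercall_notify(self):               # one hypercall for the batch
        while self.req_cons < self.req_prod:  # "Xen" drains the requests
            desc = self.buf[self.req_cons % len(self.buf)]
            # response reuses the slot of the request it answers
            self.buf[self.rsp_prod % len(self.buf)] = ("done", desc)
            self.req_cons += 1
            self.rsp_prod += 1
        if self.rsp_prod - self.rsp_cons >= self.notify_threshold:
            self.events += 1                  # deliver event to the guest

ring = DescriptorRing(size=8, notify_threshold=4)
for blk in range(4):
    ring.push_request(("read", blk))          # batch four requests...
ring.hypercall_notify()                       # ...one hypercall, one event
assert ring.rsp_prod == 4 and ring.events == 1
```

Reusing slots for both requests and responses is the "best utilization of space" point: the ring never needs separate request and response queues.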
Balloon driver? To reclaim memory from a guest. A PPN->MPN mapping is maintained by Xen and shared with the guest. Why? It may be needed for page tables, and for optimizations (cache placement for physically-indexed caches, etc.).

Use scatter-gather DMA for a zero-copy network interface, for performance. Examples of copying in an OS? (user-kernel). Why is copying needed in the kernel but not in Xen? Because the interface between Xen and a guest is completely controllable; the guest is trusted in some sense (because it has been verified and patched).

Disk: request reordering can happen at two levels (guest and Xen). This can screw up performance in some cases. When? The guest OS's scheduling algorithms can get obviated. Again, zero-copy data transfer takes place using DMA between the disk and pinned memory pages in the guest domain. What are reorder barriers? Similar to barriers in a compiler/OS.

Building a new domain: construct the state of the domain on the OS's behalf before transferring control. Not very different from the exec() system call. Strips out a lot of bootstrap code from the guest.

What is hyperthreading? Why was it disabled in the evaluation?

Discuss Figure 3. OLTP benchmarks require many synchronous disk operations, resulting in many protection-domain transitions. An SMP kernel run on a uniprocessor will have lower performance due to locking overhead, especially visible in open-close and slct-TCP. The slct-TCP difference is astounding (5x); wonder what's going on -- no attempt to explain it in the paper. fork, exec, sh: significant overhead under Xen because of the large number of page-table updates. mmap and page-fault tests incur many #PFs; hence almost a factor-of-2 overhead, because Xen needs to kick in both at fault time and at page-installation time. By default, Xen uses a 5ms time slice.

What is the scalability issue in Apache suspected by the authors? Two instances give higher throughput than two threads inside Apache. PostgreSQL: why is Xen faster than Linux? Perhaps scalability issues in PostgreSQL and poor utilization of Linux's block cache.
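The balloon-driver idea above can be sketched in a few lines (class and method names are invented for illustration): the hypervisor reclaims memory by asking an in-guest driver to "inflate", i.e., allocate guest pages and surrender the machine frames backing them.

```python
# Toy balloon driver (illustrative, not the Xen/Linux driver).
# Inflating allocates pages inside the guest, so the guest's own memory
# manager sheds load, and the backing machine frames return to Xen.

class BalloonGuest:
    def __init__(self, frames):
        self.free_frames = list(frames)   # machine frames backing free guest pages
        self.balloon = []                 # frames currently held by the balloon

    def inflate(self, n):
        """Allocate n pages in-guest and surrender their machine frames."""
        surrendered = [self.free_frames.pop() for _ in range(n)]
        self.balloon.extend(surrendered)
        return surrendered                # hypervisor may now reassign these

    def deflate(self, frames):
        """Hypervisor returns frames; the guest frees the ballooned pages."""
        self.balloon = self.balloon[: -len(frames)]
        self.free_frames.extend(frames)

g = BalloonGuest(frames=[10, 11, 12, 13])
taken = g.inflate(2)
assert len(taken) == 2 and len(g.free_frames) == 2
```

The trick is that the guest's existing page allocator does the hard work of deciding which pages can be given up; Xen never has to pick victim pages itself.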
In Figures 4 and 5, why is performance highest for 2 instances? Because the machine is a 2-CPU machine. Figure 6: with a 5ms time slice, Xen throughput is 7.5% lower than native; with a 50ms time slice, Xen throughput matches native. At what time slice would Xen throughput approach zero? At a very small slice (say 5 microseconds), switching overhead would dominate and the system would effectively livelock.
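A back-of-the-envelope model for the time-slice question (the 50-microsecond world-switch cost is an assumed number for illustration, not a figure from the paper): the useful fraction of the CPU is slice / (slice + switch cost), which collapses once the slice shrinks toward the switch cost.

```python
# Illustrative model only: useful CPU fraction as a function of the
# scheduler time slice, assuming a fixed per-switch cost of 50us
# (assumed value, not measured in the paper).

def useful_fraction(slice_us, switch_us=50.0):
    """Fraction of CPU time spent on guest work rather than switching."""
    return slice_us / (slice_us + switch_us)

assert useful_fraction(50_000) > 0.99   # 50ms slice: overhead negligible
assert useful_fraction(5_000) > 0.90    # 5ms slice: small overhead
assert useful_fraction(5) < 0.10        # 5us slice: switching dominates
```

The model is too crude to reproduce the exact 7.5% figure, but it shows why throughput is flat for large slices and falls off a cliff for tiny ones.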