Memory Resource Management in VMware ESX Server
-----------------------------------------------

Overcommit memory to reap the benefits of statistical multiplexing. A common systems theme: exploit statistical multiplexing. Other examples? The cloud, processes, packet switching, the internet, ...

ESX vs. hosted VMMs: significantly higher I/O performance and complete control over resource management. High-level resource management policies compute a target memory allocation for each VM based on specified parameters and system load.

A guest expects a zero-based "physical" address space at boot. Explain the 32-bit x86 paging structure.

Two structures: shadow page tables and the pmap data structure, kept consistent with each other. There can be multiple shadow page tables (one per guest process), but a single pmap per VM. Draw three diagrams: guest page table (VPN -> PPN), pmap (PPN -> MPN), and shadow page table (VPN -> MPN).

How often is the guest page table switched? On every process switch.
How often is the pmap switched? On every VM switch.
How often is the shadow page table switched? On every process switch.
How do we decide how many shadow page tables to keep, given that a guest can run an unbounded number of processes? Ans: use a cache that retains only the "most-used" shadow page tables.

Each VM has a "max" size representing the maximum amount of machine memory it can be allocated (current OSes do not support dynamic changes to physical memory). A VM is allocated its max size when memory is not overcommitted.

An extra level of paging requires a meta-level page replacement policy. Problems? This policy has to make relatively uninformed decisions, and guests are diverse. "Double paging": say page P is not being used by a guest, so the VMM swaps it out. The guest now decides to swap P out itself. So first the VMM reads P back from disk, then the guest immediately writes it back to disk -- double paging.

Balloon: show figure.
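Going back to the translation structures: a minimal sketch of how the shadow page table caches the composition of the guest page table (VPN -> PPN) with the per-VM pmap (PPN -> MPN). The dicts and the `compose_shadow` helper are purely illustrative, not the real ESX data structures.

```python
# Toy model of ESX's two-level address translation (illustrative only).
def compose_shadow(guest_pt, pmap):
    """The shadow page table caches VPN -> MPN, the composition of the
    guest page table (VPN -> PPN) with the per-VM pmap (PPN -> MPN),
    so the hardware MMU walks a single table on the fast path."""
    return {vpn: pmap[ppn] for vpn, ppn in guest_pt.items() if ppn in pmap}

guest_pt = {0: 10, 1: 11, 2: 12}   # one guest process's page table
pmap     = {10: 7, 11: 3, 12: 9}   # per-VM "physical" -> machine mapping
shadow   = compose_shadow(guest_pt, pmap)
# shadow == {0: 7, 1: 3, 2: 9}: hardware translates VPN -> MPN directly
```

On a guest process switch only the shadow table changes; the pmap changes only on a VM switch, which is why a per-process cache of shadow tables pays off.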
Inflate = allocate pinned physical pages within a VM, using appropriate native interfaces, and communicate the page numbers to the VMM so they can be used for other VMs. Deflate = free some of the pinned pages. Pop = free all ballooned pages, so the guest is back at its max allocation; if the VMM has overcommitted, it may fall back on meta-level page replacement. In which situations does popping occur? Boot time, crash time, or a guest write to a ballooned page (unlikely to happen; indicates an anomaly in the guest).

The balloon driver polls the guest once per second to obtain a target balloon size. Allocation rates are limited adaptively to avoid stressing the guest. ESX Server allocation respects cache coloring by the guest OS; when possible, distinct PPN colors are mapped to distinct MPN colors.

Figure 2: why are the gray bars lower than the black bars (by 1-4%)? Primarily due to guest OS data structures that are sized based on the amount of "physical" memory.

Meta-level page replacement policy = randomized page replacement, used to prevent pathological interference with the native guest OS memory management algorithms.

Content-based page sharing (as opposed to OS-based page sharing of common code/rodata pages). Advantages? Independence from the guest OS, and more sharing opportunities. Explain the COW implementation using page tables.

How to identify shareable read-only pages? Use a "hint" entry for every unshared page: a write to the page does not take a page fault, but its hash entry exists in the hash table. Any future page with the same hash as a hint page causes the hint page to be rehashed (its contents may have changed since the hint was recorded). If the contents still match, the pages are shared; otherwise the stale hint is replaced.

Which pages to scan for copies? Use any policy -- random, sequential, or heuristics for identifying the most promising candidates (e.g., code pages). Also, "share before swap": always attempt to share a page before swapping it out; this way we can avoid disk writes by just updating disk block pointers to the shared page.
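The hint mechanism can be sketched as follows. The `ShareTable` class, its dict layout, and the page-content comparison are my own toy rendering under stated assumptions; the real ESX frame table is far more compact, and a write to a hint page simply makes the hint stale rather than faulting.

```python
import hashlib

def page_hash(data):
    return hashlib.sha1(data).hexdigest()

class ShareTable:
    """Toy content-based sharing table: 'hint' entries for unshared pages,
    'shared' entries (with a refcount) for copy-on-write shared pages."""
    def __init__(self):
        self.table = {}   # hash -> ("hint", page_id) | ("shared", refcount)
        self.memory = {}  # page_id -> bytes (stand-in for machine pages)

    def scan(self, page_id, data):
        self.memory[page_id] = data
        h = page_hash(data)
        entry = self.table.get(h)
        if entry is None:
            # First sighting: record a hint. The page stays writable,
            # so no COW fault is taken if the guest later modifies it.
            self.table[h] = ("hint", page_id)
            return "hint"
        if entry[0] == "hint":
            hint_id = entry[1]
            # Rehash/recheck the hint page: it may have been written since.
            if self.memory[hint_id] == data:
                # Contents really match: share both pages copy-on-write.
                self.table[h] = ("shared", 2)
                return "shared"
            self.table[h] = ("hint", page_id)  # stale hint: replace it
            return "hint"
        # Already shared: bump the refcount (a real implementation also
        # does a full content comparison to guard against hash collisions).
        self.table[h] = ("shared", entry[1] + 1)
        return "shared"

t = ShareTable()
zeros = bytes(4096)
t.scan(1, zeros)   # -> "hint"   (first copy seen)
t.scan(2, zeros)   # -> "shared" (hint confirmed, refcount 2)
t.scan(3, zeros)   # -> "shared" (refcount 3)
```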
An interesting optimization: assume that all shared pages have unique hash values. Hence, if two sets of shared pages have the same hash value, one of those sets will not be shared. This makes management much easier and more efficient.

Page sharing decreases with increasing diversity in workloads. In general, it seems to have a positive impact on performance (improved temporal memory locality).

What are working sets? Recall OS. Shares? Memory performance isolation guarantees. Min-funding revocation? A simple way of reallocating resources based on shares: revoke memory from the VM paying the fewest shares per allocated page. A randomized policy? Lottery scheduling.

Idle memory tax? Why is it needed? To get some efficiency along with fairness. How is idle memory measured? At the start of each sampling period, a small number n of the VM's physical pages are selected randomly using a uniform distribution and invalidated in the page table, so a page fault is taken on the next access to each; by default, 100 pages are sampled every 30 seconds. If t of the n pages get touched during the period, the active fraction is t/n and the idle fraction is 1 - t/n. There are two fast adaptive estimates and one slow adaptive estimate; the target is the max of all three, so the algorithm adapts quickly to increases in memory usage and slowly to decreases.

For each VM, disk space equal to the difference between its max and min allocations is reserved, ensuring that the system can preserve VM memory under any circumstances (via meta-level page replacement).

I/O page remapping: devices only see the lower 4 GB of memory, so a guest may repeatedly copy data from high memory to low memory to send it to a device. The VMM can notice this and transparently remap the high-memory pages to low memory. The guest still believes the page has been copied, while the actual copying never takes place.

Related work: ballooning is similar to the self-paging introduced in the Nemesis OS.
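The idle-memory sampling can be sketched as follows. The function name and the set-based "touched" tracking are my own stand-ins; the real ESX implementation invalidates the sampled mappings and counts the resulting page faults.

```python
import random

def estimate_idle_fraction(all_pages, touched_during_period, n=100, seed=42):
    """Pick n pages uniformly at random and 'invalidate' them; if t of
    them are re-touched during the sampling period, the active fraction
    is f = t/n and the estimated idle fraction is 1 - f."""
    rng = random.Random(seed)          # fixed seed for a repeatable example
    sampled = rng.sample(all_pages, n)
    t = sum(1 for p in sampled if p in touched_during_period)
    f = t / n                          # fraction of sampled pages touched
    return 1.0 - f                     # estimated idle fraction

pages = list(range(10_000))
touched = set(range(2_500))            # guest touched 25% of its pages
idle = estimate_idle_fraction(pages, touched)
# idle is close to 0.75, up to sampling error
```

Sampling only n pages keeps the overhead tiny while still giving a statistically useful estimate per period; the idle tax then lets the allocator reclaim preferentially from VMs with high idle fractions.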