vIC: Interrupt Coalescing for Virtual Machine Storage Device IO
---------------------------------------------------------------

Interrupt coalescing balances IO latency against CPU (and hence system) efficiency. Traditionally, two parameters are balanced: MIDL (maximum interrupt delivery latency) and MCC (maximum coalesce count). MCC can be configured based on a recent estimate of the interrupt arrival rate; MIDL puts a hard cap on latency. Some devices allow MIDL to be configured in increments of tens of microseconds.

This paper instead controls R, the delivery ratio: the number of interrupts delivered to the guest divided by the number of actual IO completions received by the hypervisor on the guest's behalf. R is chosen based on CIF (commands in flight) and IOPS. CIF matters because many important applications issue synchronous IOs: delaying the completion of prior IOs delays the issue of future ones.

On a multiprocessor system, one core can receive an interrupt meant for a guest running on another core. To reduce latency, an IPI is used to notify the remote core. IPIs are expensive, and the paper discusses an optimization to reduce them (covered below).

Without explicit interrupt coalescing, the VMM asserts the level-triggered interrupt line for every IO. Level-triggered lines do some implicit coalescing already, but that only helps if two IOs complete back to back in the very short window before the guest's interrupt service routine has had a chance to deassert the line.

Emulating MCC and MIDL on the CPU is simply too costly: it would require timer interrupts at ~100us resolution. Compare this with the Windows 7 default timer resolution of 15.6ms.

The coalescing rate R is controlled using CIF as the primary parameter and the IO completion rate as a secondary one. SSDs can typically do tens of thousands of IOPS even with CIF = 1 (SSDs have much higher throughput than mechanical disks), so a high IOPS rate does not by itself imply a high CIF.

How is CIF used? To keep the IO device busy. Coalescing 4 IO completions out of 32 outstanding is not a problem, since the storage device remains busy with the remaining 28; coalescing 2 IOs out of 4, on the other hand, could leave the device underutilized. The policy:

* IOPS <= 2000 (iopsThreshold) implies R = 1 (since completions then arrive at least 500us apart on average, this upper-bounds the *average* added latency at 500us).
* CIF <= 4 (cifThreshold) implies R = 1. E.g., for dd (where CIF = 1), coalescing an interrupt would hang it forever: with only one command in flight, no later completion ever arrives to flush the delayed one.
* Otherwise, R is varied based on CIF.

R is implemented using fast integer arithmetic with two integer variables, "countUp" and "skipUp", and is recomputed every "epochPeriod" (200ms). Note that this logic sits on the fast path of every IO and hence must be cheap (see the first sketch at the end of this section).

The other optimization reduces the number of IPIs. On a multiprocessor system, hardware interrupts may be handled without causing guest exits. To reduce latency, however, IPIs are used to deliver virtual interrupts to the guest: the concern is that the guest OS may have scheduled a compute-intensive task, so the VMM may not receive an intercept for a long time. In the worst case, the VMM would wait until the next timer interrupt, which could be tens of milliseconds away, to deliver the virtual interrupt. The solution is simply to rate-limit IPIs with a 100us threshold (two IPIs are spaced at least 100us apart; see the second sketch below).

Implementation size: roughly 120 LoC plus 50 LoC for the two pieces of logic, adding 400 bytes to the .text section and 104 bytes to the .data section. Code size matters because the VMM shares its virtual address space with the guest.
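Below is a minimal sketch of the delivery-ratio logic, in C. The names countUp, skipUp, epochPeriod, iopsThreshold, and cifThreshold come from the paper; the exact update rule, the (cif - 1)/cif formula, and all struct/function names here are assumptions for illustration, not the paper's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Deliver countUp out of every skipUp completions, i.e. R = countUp/skipUp,
 * using only integer adds and compares on the per-IO fast path. */
typedef struct {
    uint32_t countUp;       /* numerator of R   */
    uint32_t skipUp;        /* denominator of R */
    uint32_t acc;           /* fast-path accumulator */
    uint32_t epochIOs;      /* completions seen in the current epoch */
    uint64_t epochStartUs;  /* start of the current epoch */
} VicState;

/* Fast path: called on every IO completion; returns true if a virtual
 * interrupt should be asserted for this completion. */
static bool vic_should_deliver(VicState *s, uint32_t cif)
{
    s->epochIOs++;

    /* With few commands in flight the device risks running dry (and a
     * CIF=1 workload like dd would hang), so always deliver. */
    if (cif <= 4 /* cifThreshold */ || s->skipUp == 0)
        return true;

    /* Integer-only "deliver countUp of every skipUp" pattern. */
    s->acc += s->countUp;
    if (s->acc >= s->skipUp) {
        s->acc -= s->skipUp;
        return true;
    }
    return false;
}

/* Slow path: recompute R once per epochPeriod (200ms in the paper). */
static void vic_epoch_update(VicState *s, uint64_t nowUs, uint32_t cif)
{
    if (nowUs - s->epochStartUs < 200000)  /* epochPeriod */
        return;

    uint64_t iops = s->epochIOs * 1000000ULL / (nowUs - s->epochStartUs);

    if (iops <= 2000 /* iopsThreshold */ || cif <= 4 /* cifThreshold */) {
        s->countUp = s->skipUp = 1;        /* R = 1: deliver everything */
    } else {
        /* Assumed policy: scale R down as CIF grows, delivering
         * (cif - 1) of every cif completions. The paper varies R with
         * CIF; this particular formula is illustrative only. */
        s->countUp = cif - 1;
        s->skipUp  = cif;
    }

    s->epochIOs = 0;
    s->acc = 0;
    s->epochStartUs = nowUs;
}
```

The accumulator trick is why this is cheap: no division, multiplication, or floating point on the per-IO path, just two adds and two compares.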
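And a sketch of the IPI rate limiting. The paper only specifies the 100us spacing between IPIs; the per-vCPU timestamp check and all names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define IPI_THRESHOLD_US 100

typedef struct {
    uint64_t lastIpiUs;  /* timestamp of the last IPI sent to this vCPU */
} VcpuIpiState;

/* Returns true if an IPI should be sent now to notify the remote vCPU of
 * a pending virtual interrupt. If suppressed, the interrupt is picked up
 * at the guest's next natural exit instead (e.g. its next timer tick). */
static bool ipi_allowed(VcpuIpiState *v, uint64_t nowUs)
{
    if (nowUs - v->lastIpiUs < IPI_THRESHOLD_US)
        return false;
    v->lastIpiUs = nowUs;
    return true;
}
```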
IOmeter parameters: OIOs, number of worker threads (not consequential since the CPU is not the bottleneck), read/write mix, block size (4KB and 8KB in these experiments), and a fully cached LUN (so reads complete very fast). What would happen with write requests? The IO device could become the bottleneck. CPU cycles per IO drop by 18% at 64 outstanding IOs (the gap widens with more outstanding IOs because there is more coalescing). The paper claims that throughput never *decreased*. Why might it decrease? Because coalescing increases average latency (and since this workload is not CPU bound, there is little reason for throughput to increase either). CPU usage breakdown: guest idle time increased from 65% to 71.4%, with a slight increase in the message-delivery handler function due to the extra coalescing logic.

SQLIOSim/GSBlaster: targets an "ideal" IO latency to tune for (the default of 100ms was used in the experiments) and keeps increasing the number of OIOs until the average latency exceeds the target (a sketch of this ramp-up loop appears at the end of these notes). With vIC, throughput (IOPS) increased by around 17%; the CPU cost difference was similar, which shows the benchmark was CPU-bottlenecked.

TPC-C: the number of users is always kept large enough to fully utilize the CPU, such that adding more users won't increase performance. vIC allowed TPC-C to go from 80 to 90 users. The transaction rate increased by 3% and 5.1% for cifThreshold of 4 and 2 respectively (Table 8). The last column of Table 8 also shows a marginal increase in average latency. The marginal increase in IOPS in Table 8 is explained by the increased parallelism (more users) in the input workload; between cifThreshold = 4 and cifThreshold = 2, throughput increases while the number of users stays the same.

Figure 5: the X-axis is the IPI threshold. For noHyperPi, changing the IPI threshold makes no difference because the guest is mostly idle and picks up the interrupt almost immediately regardless of IPIs. Similarly, HyperPi-noIO is agnostic to the IPI threshold. When both run simultaneously, HyperPi's performance improves with an increasing threshold, but IOmeter throughput decreases. ESX favours compute-intensive tasks over IO-intensive ones here (a tradeoff that perhaps makes more sense for less IO-intensive workloads than this extreme synthetic one).
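For concreteness, a hypothetical sketch of the SQLIOSim/GSBlaster-style ramp-up described above. The benchmark's real control logic is not shown in the paper, so this is only a guess at the loop; the function names and the latency-probe callback are invented.

```c
#include <stdint.h>

/* Keep adding outstanding IOs (OIOs) until the measured average latency
 * exceeds the target (100ms default in the paper's experiments), then
 * report the last OIO count that still met the target. */
uint32_t find_max_oios(double targetLatencyMs,
                       double (*measureAvgLatencyMs)(uint32_t oios))
{
    uint32_t oios = 1;
    while (measureAvgLatencyMs(oios) <= targetLatencyMs)
        oios++;          /* add load while the latency target is met */
    return oios - 1;     /* 0 means even one OIO exceeds the target */
}
```

This explains why vIC helps here: coalescing frees CPU cycles, letting the controller sustain more OIOs (and hence more IOPS) before latency crosses the target.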