Accelerating Two-Dimensional Page Walks for Virtualized Systems
---------------------------------------------------------------

Virtualization overhead on SPEC 2000: less than 5%. Reduction in VM exits is the most significant optimization. Shadow page tables require VM exits; this paper attempts to eliminate those exits at low overhead. Do they succeed?

Why nm+n+m references? Basically (n+1)*(m+1) - 1. With n guest levels and m nested levels, each of the n guest page entries is a guest physical address that needs an m-step nested walk before it can be read, and the final guest physical data address needs one more nested walk: n(m+1) + m = nm + n + m = (n+1)(m+1) - 1. For n = m = 4, that is 24 references versus 4 natively.

Show 64-bit paging (actually 48-bit virtual addresses, with the upper 16 bits sign-extended). Can be extended later by adding another page-directory level. 48 = 12 + 9 + 9 + 9 + 9 (page offset plus four 9-bit indices).

Explain Figure 1b. Circles are lookups in the nested page table; squares are lookups in the guest page table.

Large pages are 2MB pages: they basically remove the last level of indirection in the page table. Draw the shortcut edge in Figure 1. Most large-page benefits are neutralized if a guest uses a large page to map a block of memory that the nested page table maps with smaller pages. Why? Because the TLB must treat the page size of a given translation as the smaller of the two (called splintering).

Table 1:
Column 1: TLB misses per 100K instructions, assuming a standard TLB size (given in Table 2). Hence this column is a property of the workload.
Column 2: the expected translation slowdown of virtualized versus native execution with no TLB acceleration; there are that many more page-table lookups.
Column 3: with a perfect TLB, performance would be this much better than on the current hardware.

Figure 2 shows that a small percentage of unique page entries captures a large fraction of accesses (basically because virtualized walks generate many more redundant accesses).

Page entry reuse, Figure 3: {nL1, gPA} and {G, gL1} should be the same but differ only because of large pages in the guest. For guest large pages, the accesses never reach {G, gL1}.

Spatial locality: some benchmarks (SPECint and SPECfp) have high spatial locality in the page entries they use; others have lower reuse. Still, more than 30% of cache lines hold at least two used {nL1, gPA} page entries. This indicates that reading a cache line's worth of page-table entries is useful!

PWC, Page Walk Cache: a small, fast, fully-associative, physically-tagged page entry cache. It stores all page-table levels except L1, which is effectively stored in the TLB. Not caching page-table entries in the L1 data caches is a design decision, not a fundamental requirement of page entry caching.
1-D PWC: cache only the guest's page-table entries (the squares).
2-D PWC: cache everything except the guest's L1.
2-D PWC with nested translations (NT): also cache complete translations of the nested page table (a sort of nested TLB). All page entries are tagged with the system physical address, and {G, gL1} is not cached. Caching nL1 page entries is preferable to caching G entries because of the superior reuse characteristics of nL1 (Figure 3).

They use a simulator (a common theme in architecture papers); in this case, the simulator has been validated against silicon.

Figure 6: with 2D_PWC+NT, performance is 51%-78% of native. Compare this to shadow page tables. With software support for large nested pages, the performance is 86%-93% of native.

1-D PWC is not good enough because 21 of the 24 page entry references still require memory hierarchy accesses; the sketches below make the arithmetic and the lookup concrete.
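A minimal sketch (not from the paper; the function name is made up) that counts the memory references of a two-dimensional walk and shows how little a 1-D PWC can shave off:

```c
/* Sketch: counting the memory references of a two-dimensional page walk.
 * n = guest page-table levels, m = nested page-table levels.
 * Each of the n guest entries is a guest physical pointer that needs an
 * m-step nested walk before it can be read, and the final guest physical
 * data address needs one more nested walk: n*(m+1) + m = nm + n + m. */
#include <stdio.h>

static int twod_walk_refs(int n, int m) {
    return n * (m + 1) + m;   /* == (n+1)*(m+1) - 1 */
}

int main(void) {
    int n = 4, m = 4;                                        /* x86-64: 4 levels each */
    printf("native walk: %d refs\n", n);                     /* 4 */
    printf("2D walk:     %d refs\n", twod_walk_refs(n, m));  /* 24 */
    /* A 1-D PWC caches only the guest entries above gL1: at best n-1 = 3
     * hits, leaving 21 of the 24 references to the memory hierarchy. */
    return 0;
}
```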
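And a minimal sketch of the PWC lookup itself, assuming a fully associative array searched linearly and tagged by the system physical address of the page entry, as the paper describes; the size matches the 24-entry baseline, but the field names are my own:

```c
/* Sketch of a fully associative, physically tagged Page Walk Cache lookup. */
#include <stdint.h>
#include <stdbool.h>

#define PWC_ENTRIES 24   /* baseline PWC size used in the paper */

struct pwc_slot {
    bool     valid;
    uint64_t tag;    /* system physical address of the page entry */
    uint64_t data;   /* cached page entry (next-level pointer) */
};

static struct pwc_slot pwc[PWC_ENTRIES];

/* On a hit, one 2D-walk reference is served without touching the
 * memory hierarchy; on a miss, the walker reads memory and refills. */
static bool pwc_lookup(uint64_t spa, uint64_t *entry) {
    for (int i = 0; i < PWC_ENTRIES; i++) {
        if (pwc[i].valid && pwc[i].tag == spa) {
            *entry = pwc[i].data;
            return true;
        }
    }
    return false;
}
```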
Figure 7, left side: accesses to the PWC. Without NT, all rows get almost equal hits (expected). With NT, the PWC Guest accesses remain the same; similarly, the Nested gPA accesses remain the same because of low reuse. The NTLB translation for gL1 has reuse equivalent to {G, gL2}. Even though {nL1, gL1} has low reuse (Figure 3), it exhibits short-term spatial locality: one NTLB entry can eliminate a nested page walk for every page entry on the same page in memory. Hence the "Nested gL1" bar is thinner for 2D_PWC+NT than for 2D_PWC. The same factor allows 2D_PWC+NT to eliminate even more of the PWC Nested gL2..4 accesses.

Many hypervisors use 4KB pages due to the complexity of eliminating sub-2MB fragmentation.

Figure 7, right side: number of memory accesses for complete misses in the TLB/NTLB/PWC. 2D_PWC+NT only eliminates the nested (circled) accesses, which exhibit high reuse; hence the number of misses is reduced primarily in that region. It does not eliminate a significant portion of the accesses with the highest penalty ({nL1, gPA}, {G, gL1}). 2D_PWC+NT's memory accesses are 40% lower than 2D_PWC's.

Figure 8, access rates with NTLB enabled: the G column is not cached by the NTLB, so all of its accesses go through. Similarly, the gPA row exhibits low reuse, so the NTLB is ineffective there. Hit rates basically follow the reuse characteristics of Figure 3.

A miss in the PWC often misses in the L2 cache as well. Why? Because the PWC is a much more specialized cache than the L2: a reference that even the PWC cannot capture has poor enough locality that the L2 is unlikely to capture it either.
Table 4, first column: L2 accesses incurred during a 2D page walk with the 2D_PWC+NT configuration generate 2.7-5.5 times more L2 misses than the native page walk.
Table 4, second column: the miss rates are quite high (almost one in four) because the easy-to-cache accesses have already been filtered out by the PWC and NTLB.

Figure 9: baseline is a 24-entry PWC and a 16-entry NTLB; increase the NTLB and PWC sizes to see the effects in simulation. Conclusion: not much additional benefit. For example, increasing the L2 data TLB to 8096 4KB page entries increases overall performance for MiscServer by 11.3% (not shown), compared to 8.5% with 8096 PWC entries.

Large nested pages help a lot: eliminating the nL1 references provides a significant reduction. AMD Opteron provides 2MB and 1GB pages.
Table 5: TLB misses are reduced because fewer entries are needed. PWC accesses are reduced because nL1 is eliminated. L2 cache misses (the biggest improvement!) are reduced because the poor-locality references are eliminated.

Related work, structures requiring fewer accesses:
- hashed page table (hash the VA to obtain the PA): works well for sparse address spaces; see the sketch after this list.
- clustered page table: improves on hashed page tables by clustering frequently used pages, reducing the number of entries and generating better locality.
- guarded page table: for levels that have only one entry, completely bypass the access to that level.
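A minimal sketch of the hashed page table idea, assuming chained buckets; all names and sizes here are made up for illustration:

```c
/* Sketch of a hashed page table lookup with collision chaining. */
#include <stdint.h>
#include <stddef.h>

#define HPT_BUCKETS 4096

struct hpt_entry {
    uint64_t vpn;            /* virtual page number (tag) */
    uint64_t pfn;            /* physical frame number */
    struct hpt_entry *next;  /* collision chain */
};

static struct hpt_entry *buckets[HPT_BUCKETS];

/* Hash the virtual page number straight to a bucket: one expected memory
 * access instead of one per radix level, which is why hashed tables suit
 * large, sparse address spaces. */
static struct hpt_entry *hpt_lookup(uint64_t vaddr) {
    uint64_t vpn = vaddr >> 12;   /* 4KB pages */
    for (struct hpt_entry *e = buckets[vpn % HPT_BUCKETS]; e; e = e->next)
        if (e->vpn == vpn)
            return e;
    return NULL;                  /* page-fault path */
}
```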