MY NOTES, 2nd revision
-----------------------

* What is a superpage?
  - Support from the processor allows a single TLB entry to map a larger
    physical memory region into the virtual address space.
  - Helps increase TLB coverage, hence promises performance improvement.
  - Superpage:
    + a memory page of larger size than an ordinary page (base page)
    + occupies only one TLB entry
    + helps increase TLB coverage

* What is the motivation for superpages?
  - TLB sizes increase slowly. Why? We want to keep the TLB fast.
  - Application working sets increase dramatically.
  - Hence, we want to increase TLB coverage.

* What goes wrong if we use superpages incorrectly?
  - Wasted memory: for example, an app's working set is 1MB, but we use a
    superpage of 16MB --> 15MB wasted.
  - Worse, if a superpage is dirty, we don't know which base pages were
    modified, and if we are not careful, we need to write out all of its
    base pages to persistent store. In this case, the gain from superpages
    is outweighed by the cost of the extra IO.

* What is the benefit of using multiple page sizes (vs. one fixed size)?
  - increase TLB coverage (both do)
  - reduce internal fragmentation
  - reduce disk traffic
  - best fit for each application, since each has a preferred page size

* What are the hardware-imposed constraints on superpages?
  - a superpage size must be supported by the processor
  - a superpage must be contiguous in physical and virtual address space
  - the starting address of a superpage must be a multiple of its size
  - more importantly, the TLB entry of a superpage provides a single
    reference bit and a single dirty bit
    ==> hard to know which base page is referenced or dirtied

* What are the operations in superpage management?
  - Allocation
  - Promotion
  - Demotion
  - Page replacement (or eviction)
  - Fragmentation control

* When can fragmentation happen? Does relocation-based allocation suffer
  from fragmentation?
  - I guess not.
  - Fragmentation appears mostly in reservation-based allocation with
    multiple page sizes.
  - Scattered wired pages (kernel pages that cannot be evicted) also
    fragment memory.

* Relocation vs. Reservation:
  - Relocation:
    + copy overhead
    + even when there are plenty of contiguous available regions, it still
      needs to relocate base pages before promotion
    + can be transparent to user level
    + robust to fragmentation
    + when to relocate: need to keep track of some info to make this
      decision
  - Reservation:
    + no copy overhead
    + suffers from fragmentation (with multiple page sizes)
      ~ may require fragmentation control
      ~ contiguity-aware page replacement
    + may not be transparent; the user may define the reservation size

* What data structures are used in the paper, and why?
  - Buddy allocator: to allocate page frames.
  - Multi-list reservation scheme (see the sketch after this list):
    + to track partially used memory reservations
    + helps in choosing a reservation for preemption (in case of memory
      pressure)
  - Population map:
    + keeps track of memory allocations in each memory object
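A minimal sketch of how the multi-list reservation scheme might look, assuming a small fixed set of supported page sizes and one singly linked list per size; `reservation`, `reservation_touched`, and `choose_victim` are my names, not the paper's. Each list is kept ordered by the time of the most recent page-frame allocation, so the head holds the best preemption candidate.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the multi-list reservation scheme: one list per supported
 * superpage size.  Each list is ordered by the time of the most recent
 * page-frame allocation, so the head holds the reservation whose latest
 * allocation happened *least* recently -- the preferred victim. */

enum { NSIZES = 3 };                        /* e.g. 8KB, 64KB, 512KB */

struct reservation {
    uintptr_t base;                         /* start of reserved extent  */
    size_t    size;                         /* reserved bytes            */
    size_t    populated;                    /* bytes actually faulted in */
    uint64_t  last_alloc;                   /* time of most recent alloc */
    struct reservation *next;               /* sorted, singly linked     */
};

static struct reservation *res_list[NSIZES];   /* one list per page size */

/* Re-sort r after it allocates a frame: unlink it and append at the
 * tail, keeping the list ordered by last_alloc (monotonic clock). */
static void reservation_touched(int i, struct reservation *r, uint64_t now)
{
    struct reservation **p = &res_list[i];
    while (*p && *p != r)                   /* unlink r if present */
        p = &(*p)->next;
    if (*p)
        *p = r->next;
    r->last_alloc = now;
    r->next = NULL;
    for (p = &res_list[i]; *p; p = &(*p)->next)
        ;                                   /* find the tail */
    *p = r;
}

/* Under memory pressure, preempt from the head of the smallest list
 * that can yield an extent of at least the requested size. */
static struct reservation *choose_victim(int wanted_size_idx)
{
    for (int i = wanted_size_idx; i < NSIZES; i++)
        if (res_list[i])
            return res_list[i];
    return NULL;
}
```

The design point this illustrates: because lists are kept sorted as allocations happen, victim selection is O(1), matching the heuristic that reservations without recent allocations are the least likely to ever fill up.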
* Compare the contiguity-aware page replacement scheme with traditional
  ones (LRU, etc.)?

* The paper uses reservation-based allocation. One of the challenges is
  choosing the reservation size. Why does this matter? What is the policy
  used in the paper?
  - Each app has a preferred superpage size, and predicting future
    behavior is hard.
  - If the reservation size is too small, it is hard to revert, i.e. to
    increase the superpage size later, because we need to pay relocation
    cost.
    E.g.: || A | A || B | - ||  -- we have a superpage of size 2 base
    pages holding A, but now we want to grow it to size 4, so we need to
    relocate B.
  - If the size is too large, memory is wasted, since the app may never
    touch all the pages.
    ==> This is OK, since a partially filled reservation can be preempted.
  - Solution in the paper:
    + for fixed-size objects (code, etc.): choose the superpage size as
      large as possible, as soon as possible, but not reaching beyond the
      end of the object
    + for variable-size objects (stack, heap): start with the base page

* In the paper, why choose for preemption the partially filled
  reservation whose most recent page allocation occurred least recently?
  - Based on experience, a heuristic:
    + useful reservations are often populated quickly
    + reservations that have not experienced any recent allocations are
      less likely to be fully allocated in the near future

* What is the incremental promotion described in the paper? And why?
  - A superpage is created *as soon as* any superpage-sized and aligned
    extent *within* a reservation gets *fully* populated.
  - E.g.: we have a reservation of size 16K, the base page size is 2K,
    and the system supports 4K, 8K, and 16K superpages:

      0    1    2    3    4    5    6    7
    | BP | R  | R  | R  | R  | R  | R  | R  |   R: reserved
                                                BP: populated base page

    If we fault in page 1, pages 0-1 fully populate a 4K-aligned extent,
    so we promote them to a 4K superpage:

      0    1    2    3    4    5    6    7
    |   SP    | R  | R  | R  | R  | R  | R  |

    If we then fault in pages 6 and 7:

      0    1    2    3    4    5    6    7
    |   SP    | R  | R  | R  | R  |   SP    |

    If we then fault in pages 4 and 5, pages 4-7 form a fully populated
    8K-aligned extent:

      0    1    2    3    4    5    6    7
    |   SP    | R  | R  |        SP         |

  - Why not just promote when some fraction of that size is populated?
    + it may inflate the application's memory footprint
    + most applications populate their address space densely and
      relatively early in their execution

* The paper uses multi-list reservations. Describe the purpose of this.
  - Keep track of partially allocated reservations:
    + so that when there is memory pressure, we can preempt a reservation
      to free a memory extent
  - one reservation list for each page size supported by the hardware
  - a reservation R is in the 64K list when preempting R can yield a
    largest free extent of 64K
  - each list is kept sorted by the time of its reservations' most recent
    page frame allocations

* What happens when a reservation is preempted?
  - Say reservation R is preempted.
  - R is broken into smaller extents R1 ... Rn.
  - Free extents are given back to the buddy allocator.
  - Partially allocated extents are reinserted into the appropriate
    lists.
  - Fully allocated extents don't have to be reinserted.

* What operations does the population map support? (see the sketch below)
  1. Fast reserved-frame lookup: when a page fault happens at a virtual
     address, find whether this virtual address belongs to a
     reservation.
  2. Overlap avoidance: when making a reservation, make sure it does not
     overlap with others.
  3. Promotion decisions.
  4. Preemption assistance.
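A toy sketch of a population map, under a simplifying assumption: the paper uses a radix tree per memory object, but here a flat bitmap per reservation stands in for it, and all names (`popmap`, `promotion_size`, etc.) are illustrative. It shows the reserved-frame lookup and the incremental-promotion decision (has some superpage-sized, aligned extent just become fully populated?).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy population map: one bitmap per reservation (up to 64 base pages).
 * The paper uses a radix tree per memory object; a bitmap is enough to
 * show the operations. */

struct popmap {
    uintptr_t base;                     /* reservation start (aligned)  */
    size_t    npages;                   /* base pages in reservation    */
    uint64_t  populated;                /* bit i = base page i faulted  */
};

/* 1. Fast reserved-frame lookup: does vaddr fall in this reservation? */
static bool popmap_contains(const struct popmap *pm, uintptr_t vaddr,
                            size_t base_page)
{
    return vaddr >= pm->base && vaddr < pm->base + pm->npages * base_page;
}

/* Mark a base page populated; return its index. */
static size_t popmap_fill(struct popmap *pm, uintptr_t vaddr,
                          size_t base_page)
{
    size_t idx = (vaddr - pm->base) / base_page;
    pm->populated |= (uint64_t)1 << idx;
    return idx;
}

/* 3. Promotion decision (incremental promotion): after page idx is
 * populated, return the largest power-of-two run of base pages whose
 * aligned extent containing idx is now fully populated. */
static size_t promotion_size(const struct popmap *pm, size_t idx)
{
    size_t best = 1;
    for (size_t run = 2; run <= pm->npages; run *= 2) {
        size_t start = idx - (idx % run);          /* aligned extent */
        uint64_t mask = (run == 64) ? ~(uint64_t)0
                                    : (((uint64_t)1 << run) - 1) << start;
        if ((pm->populated & mask) != mask)
            break;                                 /* not fully populated */
        best = run;                                /* can promote to run */
    }
    return best;
}
```

On the example above (2K base pages, 16K reservation), after pages 4 and 5 fault in, `promotion_size` at index 4 returns 4 base pages, i.e. the 8K superpage over pages 4-7.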
* In order to support superpages, the authors need to tweak the paging
  daemon of FreeBSD. List some of the changes and the reasons for them.
  - Consider cache pages as available for reservations:
    + since cache pages are clean and unmapped
    + easy to free under memory pressure
    How: the buddy allocator keeps cache pages coalesced with free pages.
  - The paging daemon is activated not only under memory pressure, but
    also when available contiguity is low:
    + i.e. when the system fails to allocate a contiguous region of the
      preferred size.
  - All clean pages backed by a file are moved to the inactive list as
    soon as the file is closed by all processes.

* What does the paper do with pages that the kernel uses internally
  (i.e. wired pages)?
  - Cluster all of them to avoid fragmentation.

* Why do the authors need to change the mmap system call?
  - To support superpages with multiple mappings, for example a shared
    mapped file.
  - Since most processes do not specify the address to map, change the
    mmap syscall to return a virtual address that works for superpages.

* What overheads are there when allocating a page of memory?
  - promotion
  - preemption of a reservation

* Can you think of a workload that is the worst fit for superpages?
  - allocate memory
  - access 1 byte in each page
  - deallocate

* Between random and sequential workloads reading a file, which is
  better with superpages?
  - Sequentially reading a very large file will trigger a lot of
    promotions.
  - Randomly reading a large file may not trigger any promotion, so
    superpages help less there.

* In the experiment section, why doesn't the Web server benefit much
  from superpages?
  - Superpages tend to benefit large objects.
  - The Web server accesses hundreds of small files, and the system does
    not attempt to build superpages that span multiple memory objects.


**Practical, transparent operating system support for superpages**
===================================================================

Good summary:
1) http://www.cs.toronto.edu/~demke/469F.06/Lectures/Lecture14_4up.pdf

# 1. Motivation:
- increasing cost of TLB miss overhead:
  + growing working sets of applications
  + TLB size does not grow at the same pace (why? because if we want the
    TLB to be fast, we had better keep it small)
- processors now support superpages (multiples of the base page size;
  see the constraint check sketched below):
  + i.e. one TLB entry can map a large region
  + a superpage size must be a multiple of the base page size
  + contiguous in physical and virtual address space
  + its starting address must be a multiple of its size
  + a TLB entry has a *single* reference bit, a single dirty bit, and
    one set of protection attributes (i.e. uniform attributes)
    ==> so how can we tell which base page is dirtied? referenced?
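A small sketch of the mapping constraints just listed, with made-up names and sizes taken from the paper's Alpha platform (8KB base pages; 64KB, 512KB, 4MB superpages): a (virtual, physical) pair can share one superpage TLB entry only if the size is supported and both addresses are aligned to it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the hardware constraints: sizes follow the paper's Alpha
 * platform; a superpage mapping needs a supported size and virtual and
 * physical addresses that are multiples of that size. */

static const size_t sp_sizes[] = { 8 << 10, 64 << 10, 512 << 10, 4 << 20 };

static bool can_map_as_superpage(uintptr_t vaddr, uintptr_t paddr,
                                 size_t size)
{
    bool supported = false;
    for (size_t i = 0; i < sizeof sp_sizes / sizeof sp_sizes[0]; i++)
        if (sp_sizes[i] == size)
            supported = true;
    /* the starting address must be a multiple of the superpage size,
     * in both the virtual and the physical address space */
    return supported && (vaddr % size) == 0 && (paddr % size) == 0;
}
```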
# 2. The CRUX: HOW do we increase TLB coverage without enlarging the TLB?
(i.e. reduce the chance of a TLB miss, hence increase performance)
- Solution 1: use a larger base page
  ==> problem of internal fragmentation
- Solution 2: use superpages (this is what we are talking about)
  + increase TLB coverage
  + no internal fragmentation
  + no increase in TLB size

# 3. Previous solutions (Note: add more here)
- reservations: support only one superpage size
- relocation (i.e. physical copy):
  + moves pages at promotion time to make them contiguous
  + the copying cost is difficult to recover, especially in a busy system
- superpages allocated at page-fault time, based on the amount of memory
  available and on a user-specified per-segment hint
  => size specified by the user --> not transparent

# 4. THE SUPERPAGE PROBLEM (or challenges)

*Allocation*
------------
- HOW/WHEN/WHAT size to allocate
  + relocation: i.e. physical copy (see above)
  + reservation:
    > when allocating a page frame, reserve the subsequent page frames
      that are part of, and aligned with, a superpage of the chosen size
    > later, when the process touches a page frame within that boundary,
      the corresponding base page frame is allocated and mapped
    ==> requires a priori knowledge of the superpage size, but different
        apps have different "best" sizes

*Promotion*
-----------
- What is promotion?
  + create a superpage out of a set of smaller base pages
  + mark the base page table entries to reflect the new size
- When to do promotion?
  1. incrementally: when a certain number of contiguous base pages have
     been allocated, promote to a small superpage; promote to a larger
     superpage if subsequent base pages are allocated
     ==> trades the benefit of early promotion against increased memory
         consumption (if not all constituent pages of the superpage are
         used)
  2. forcibly populate pages? ==> may cause internal fragmentation
  3. wait for the app to touch pages? ==> may lose the opportunity to
     increase TLB coverage

*Demotion*
----------
- demotion: convert a superpage into smaller base pages
- when?
  + the attributes of the base pages of a superpage become non-uniform
    (e.g. the process no longer uses a specific base page)
  ==> How does the OS efficiently detect which parts of a superpage are
      actively used (given the uniform attributes in the TLB)?

*Fragmentation*
---------------
- Memory becomes fragmented due to:
  + the use of multiple page sizes
  + the persistence of file cache pages (they linger after the file is
    no longer in use)
  + scattered wired (non-pageable) kernel pages
- Contiguity becomes a contended resource.
- The OS must:
  + use contiguity restoration techniques
  + trade off the impact of contiguity restoration against superpage
    benefits

*Eviction*
----------
- similar to the eviction of base pages, when memory pressure demands it
- because of the single dirty bit:
  + a superpage may have to be flushed out entirely (even if some of its
    base pages are clean)

# 5. THIS PAPER's SOLUTION (this is what this paper is all about)

# Allocation (Sections 4.1 --> 4.3 and 4.8 in the paper)
- use preemptible reservations (an opportunistic policy):
  + make superpages as large and as soon as possible
  + as long as there is no penalty for a wrong decision
  + go for the biggest size that is no larger than the memory object
    (what is a memory object to the OS, anyway? a code segment, file,
    heap, stack) -- see the sketch after this section
  + if that size is not available, try preemption before resigning to a
    smaller size
    (a preempted reservation has already had its chance)
- Why do all of this? *Observations*:
  1) Once an application touches the first page of a memory object, it
     is likely to quickly touch every page of that object.
     ==> Hence the motivation for superpages as large and as soon as
         possible.
  2) Useful reservations are often populated quickly, and reservations
     that have not experienced any recent allocations are less likely to
     be fully allocated in the near future.
     ==> Hence the motivation for trying preemption before refusing an
         allocation.
- multi-list reservation scheme:
  + why multiple lists? different apps have different *best* sizes
  + one reservation list for each page size supported by the hardware
    E.g., with 512KB, 64KB, and 8KB pages: the reservation list for 8KB
    may contain reservations of size 64KB and 512KB; the list for 64KB
    may contain reservations of size 512KB
  + reservations in each list are sorted by the time of their most
    recent allocation
    Why? the reservation at the head of the list is chosen for
    preemption
    ==> the one whose most recent allocation occurred *least* recently
    Why do this? See observation 2).
  + do not break a reservation into base pages immediately on
    preemption:
    > break it into smaller extents
    > free the extents without allocations: unpopulated extents are
      transferred to the buddy allocator
    > insert the *partially* populated extents at the head of the
      smaller-size reservation list
      (remember: fully populated extents do not need to be reinserted)
    (for more, see the example in Section 4.8)
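A sketch of the opportunistic size choice, with my own naming (`choose_reservation_size` is not from the paper): pick the largest supported size whose aligned extent around the faulting address still lies entirely within the memory object. This would apply to fixed-size objects; for variable-size objects like the heap, the paper instead starts at the base page size.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the opportunistic allocation policy: choose the largest
 * supported superpage size whose aligned extent around the faulting
 * address does not reach beyond the memory object.  Sizes follow the
 * paper's Alpha platform; names are illustrative. */

static const size_t sp_sizes[] = { 8 << 10, 64 << 10, 512 << 10, 4 << 20 };
enum { NSP = sizeof sp_sizes / sizeof sp_sizes[0] };

static size_t choose_reservation_size(uintptr_t fault_va,
                                      uintptr_t obj_start, size_t obj_len)
{
    size_t best = sp_sizes[0];                      /* base page fits */
    for (int i = 1; i < NSP; i++) {
        size_t sz = sp_sizes[i];
        uintptr_t start = fault_va - (fault_va % sz);   /* aligned start */
        /* the reserved extent must stay within the object */
        if (start >= obj_start && start + sz <= obj_start + obj_len)
            best = sz;
    }
    return best;
}
```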
# Promotion: incrementally
- A superpage is created whenever any superpage-sized and aligned extent
  within a reservation is *fully populated*.
  ==> again, an opportunistic policy
- Why not promote when an extent is only partially populated?
  + it may inflate the app's memory footprint
  + apps tend to populate their address spaces densely and early in
    their execution

# Demotions:
- Speculative demotion:
  + one reference bit per superpage ==> how do we detect portions of a
    superpage that are no longer referenced?
  + under memory pressure (i.e. when we need to choose a page to evict),
    demote superpages when resetting their reference bit
    ==> Why? because afterwards we will know which portions are not
        referenced, and those can be evicted
  + re-promote (incrementally) as pages are referenced again
- Dealing with dirty superpages:
  + how do we know which portions are dirty, so we can flush only those
    to disk?
  + if we flush out whole superpages --> IO overhead
  + Solution (this avoids flushing all base pages of the superpage):
    > demote a clean superpage on the first write to it (sketched after
      this section)
    > re-promote (incrementally) later, when all base pages are dirty
    E.g.: a superpage P contains 4 base pages p0 p1 p2 p3, and a process
    attempts to write to p2. Before the write happens, we demote P into
    P1 and P2, where P1 = p0 p1 and P2 = p2 p3; P2 is demoted further
    into base pages, so now only p2 is marked dirty. Later, if p3 is
    dirtied, we re-promote to P2 = p2 p3, and so on.
  Alternative:
    - no demotion of dirty superpages
    - each base page has a hash, computed when it is read from disk
    - when a base page in a superpage is dirtied, all base pages are
      considered dirty by the OS; but when flushing each base page to
      disk, the hash is recomputed to see whether the page was actually
      dirtied, and only if so is it written to disk
    - disadvantages:
      + the overhead of hash computation
      + hash collisions --> there is still a chance that a dirtied page
        is not written to disk (although this chance is very small)
      ==> mitigations:
        + compute hashes only in the idle loop
        + apply hashing to partially dirty superpages only
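A sketch of demote-on-first-write, with invented names and a recursive binary split standing in for the paper's actual size hierarchy: on the first write to a clean superpage, split repeatedly so that only the base page containing the faulting address ends up marked dirty, exactly as in the p0..p3 example above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of demotion on first write: a clean superpage is split
 * (halving each time) until only the base page containing the faulting
 * write remains in this mapping, so the dirty bit stays precise.  In a
 * real kernel each split would update page-table entries and the TLB. */

enum { BASE_PAGE = 2 << 10 };              /* 2KB, as in the example */

struct mapping {
    uintptr_t va;       /* superpage-aligned start */
    size_t    size;     /* current mapping size    */
    bool      dirty;
};

static void demote_on_write(struct mapping *m, uintptr_t fault_va)
{
    while (m->size > BASE_PAGE) {
        size_t half = m->size / 2;
        /* keep the half containing the faulting address in this
         * mapping; the other half becomes a separate, still-clean
         * mapping (omitted here) */
        if (fault_va >= m->va + half)
            m->va += half;
        m->size = half;
        printf("split: now mapping %zu bytes at %#lx\n",
               m->size, (unsigned long)m->va);
    }
    m->dirty = true;    /* only this base page is marked dirty */
}

int main(void)
{
    struct mapping m = { 0x10000, 8 << 10, false };      /* 8KB superpage */
    demote_on_write(&m, 0x10000 + 2 * BASE_PAGE + 100);  /* write hits p2 */
    return 0;
}
```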
# Fragmentation control
- Coalesce available memory regions whenever possible:
  + works when most of main memory is available
  + this is done by the buddy allocator: when memory is freed, it
    automatically coalesces contiguous free memory regions
    (i.e. on free, check whether the adjacent buddy is free; if yes,
    combine them)
- Under memory pressure, we face low contiguity:
  + the modified page daemon performs *contiguity-aware page
    replacement*
- Contiguity-aware page replacement:
  + restore contiguity:
    > 3 lists: active, inactive, free
    > move clean, inactive pages to the free list
    > page out dirty inactive pages
  + minimize the impact:
    > prefer pages that contribute the most to contiguity
    > keep page contents as long as possible, to avoid IO when the page
      is needed again (even when the page is part of a reservation: if
      it is reactivated, break the reservation)
- Dealing with wired pages:
  + cluster them to avoid fragmentation
    (what is a wired page, anyway? a kernel page that cannot be evicted)
NOTE: I don't understand the detailed implementation in Section 4.9.

# Population map (more notes)

# Implementation
Some of the implementation tricks with FreeBSD:
- Modify the paging daemon to make it contiguity-aware:
  + cache pages are available for reservations, and are put at the end
    of the reservation lists
  + the paging daemon is activated on:
    > memory pressure
    > and also on failure to allocate a contiguous region of the
      preferred size
  + all pages backed by a file are moved to the inactive list (and hence
    become candidates for reservations) as soon as the file is closed by
    all processes; by default, these pages would never move to the
    inactive list except under memory pressure
- Cluster "wired pages" (pages used for the kernel's internal data
  structures, which cannot be evicted) so that they do not fragment
  memory.
- Change the (virtual) addresses chosen when multiple processes map the
  same file, so that the addresses are compatible with superpage
  allocation (otherwise, if the addresses differ by, say, one base page,
  we cannot build superpages for that file in the page tables of both
  processes).

# Evaluation
- Applications benefit from superpages.
- The best superpage size depends on the application (given an OS that
  supports only one superpage size).
- Support for multiple page sizes yields higher performance than support
  for a single superpage size, because the system can dynamically select
  the best size for every region of memory.

Question: what kinds of application workloads benefit most from
superpages? Which do not?

Note: if the OS gives out a superpage that is larger than the
application's footprint, the OS semantics may change, because now an
invalid access cannot be caught.