
Practical, transparent operating system support for superpages

J. Navarro, S. Iyer, P. Druschel, and A. Cox.

Comments

Summary

Use of superpages improves TLB coverage by reducing TLB misses. The paper deals with the various memory management issues that arise upon introducing variable-sized superpages.

Problem

Main memory sizes have been steadily increasing over the years, keeping up with Moore's law. But this has led to a steady decline in TLB coverage, which in turn leads to a higher TLB miss rate. Increasing the page size addresses this problem but introduces new problems like internal fragmentation. Though modern hardware provides support for a variety of page sizes, the OS has to deal with hard challenges such as fragmentation control to ensure contiguous extents of physical memory, promotion, and efficient paging out of large superpages. The paper proposes various policies to solve these issues efficiently.

Contributions

Reservation-based page allocation is not novel; it was already published in the work by Talluri and Hill. However, this paper brings in the idea of dynamically selecting the superpage size based on the attributes of the memory object. The other new ideas proposed in this paper are:

1. Preemption of reserved but unallocated frames under increased memory pressure.

2. Memory allocation and deallocation over a period of time can lead to excessive fragmentation. To address this, they use a buddy allocator to coalesce free regions of memory into larger chunks, and they do this incrementally (see the buddy-allocator sketch after this list). Similarly, when evicting or writing a dirty superpage to disk, they recursively break it into smaller pages and check per-page hashes (which they optimize by computing them lazily), which again reduces unwanted I/O.

3. They address the issue of allocating memory under memory pressure by preempting existing reservations based on an LRU-like policy (the reservation whose most recent allocation is oldest goes first), which seems a fair choice, but they neither cite previous work nor present statistical evidence for it.

4. A population map is used to keep track of allocated pages and to help in the promotion and demotion of superpages.

5. Speculative superpage demotion strategies that reduce the cost of superpage eviction and optimize the disk I/O costs of partially modified superpages.
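To make the coalescing in point 2 concrete, here is a minimal, self-contained sketch (my own, not code from the paper) of a binary buddy allocator's free path; the bitmap representation and the toy sizes are assumptions chosen only for illustration.

#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 6        /* block sizes of 2^0 .. 2^6 base pages (toy values) */
#define NPAGES    1024     /* toy physical memory of 1024 base pages            */

/* free_blk[order][pg] is true when a free block of 2^order base pages starts
 * at base page pg.  A real allocator keeps free lists; a bitmap keeps the
 * sketch short. */
static bool free_blk[MAX_ORDER + 1][NPAGES];

/* Free the block of 2^order pages starting at page pg, coalescing it with its
 * "buddy" (the equally sized neighbour) for as long as that buddy is also
 * free, so freed memory grows back into larger contiguous extents. */
static void buddy_free(unsigned pg, unsigned order)
{
    while (order < MAX_ORDER) {
        unsigned buddy = pg ^ (1u << order);   /* buddy differs in a single bit */
        if (!free_blk[order][buddy])
            break;                             /* buddy still in use: stop      */
        free_blk[order][buddy] = false;        /* absorb the buddy ...          */
        pg = pg < buddy ? pg : buddy;
        order++;                               /* ... into a block twice as big */
    }
    free_blk[order][pg] = true;
}

int main(void)
{
    buddy_free(0, 0);                          /* free base page 0              */
    buddy_free(1, 0);                          /* free base page 1: they merge  */
    printf("order-1 block at page 0 free: %d\n", free_blk[1][0]);   /* prints 1 */
    return 0;
}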

Evaluation

The authors have carried out an extensive evaluation of the proposed system to show that their design decisions do not negatively impact performance. They demonstrate the performance improvement by reporting the speedup obtained along with the TLB miss reduction for all the chosen benchmarks. They back up the claim of restoring contiguity by executing the web server and FFTW applications and comparing the contiguity-aware page daemon against the cache-based page replacement scheme, where the daemon performs better. They have also measured the system against various pathological cases and show that the overheads are negligible. The evaluation could have been performed on a wider range of hardware, since the system depends on the page sizes the hardware supports.


Confusion

I have a few questions:

1. This is a basic question. Larger memories cost more money, but what exactly makes a larger structure slower even when it is random access? Why can't we increase the TLB size beyond a certain limit without affecting latency?

2. The buddy allocator has been referred to multiple times in the paper in different scenarios, but its actual purpose and implementation are not clear to me.

3. I would like the contiguity-aware page daemon to be discussed further.

Summary
The paper discusses the challenges that supporting superpages imposes on the OS. The authors then provide a practical and efficient superpage management system that supports multiple superpage sizes while ensuring that the application memory footprint is not bloated.

Problem
Main memory sizes are increasing at a faster pace than TLB coverage. As a result, applications with large working sets see a higher number of TLB misses and hence suffer performance degradation. To mitigate this problem, superpages are used, which dramatically increase the working-set coverage of applications.
Superpage management comes with its own set of problems with respect to page promotion, demotion and eviction. There are also issues with allocation strategies, as relocation-based allocation is costly compared to a reservation-based approach, and using a single fixed large page size can lead to internal fragmentation. Thus, there arises a need to design a management system that supports multiple superpage sizes and handles internal fragmentation and contiguity effectively.

Contribution
The authors' design uses a reservation-based method and extends it to support multiple superpage sizes and incremental promotion and demotion of superpages. They use a buddy allocator to manage physical memory and a new data structure called the population map for easy decision making. The buddy allocator helps decrease fragmentation and is backed by contiguity-aware page replacement to preserve contiguity.
The population map keeps track of the memory allocated in each memory object and helps in reserved-frame lookups, deciding on promotions, and identifying unallocated regions during preemptions. The authors propose demoting a clean superpage when a write is attempted and re-promoting it once all the base pages are dirty. This avoids the large I/O required to flush the contents of an entire superpage that is only partially dirty.
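To picture the demote-on-write behaviour described above, here is a rough, self-contained sketch of the idea (my own toy model, not the FreeBSD code); the 64-page superpage size and the structure layout are assumptions.

#include <stdbool.h>
#include <stdio.h>

#define BASE_PAGES_PER_SP 64           /* base pages in one toy superpage       */

struct superpage {
    bool promoted;                      /* currently mapped by one TLB entry?   */
    bool dirty[BASE_PAGES_PER_SP];      /* per-base-page dirty state            */
};

/* First write to a clean, promoted superpage: demote it, so the hardware's
 * single dirty bit no longer forces the whole superpage to be flushed. */
static void on_write_fault(struct superpage *sp, int base_idx)
{
    if (sp->promoted)
        sp->promoted = false;           /* demote: map base pages separately    */

    sp->dirty[base_idx] = true;

    /* Re-promote only once every base page is dirty, at which point flushing
     * the whole superpage on eviction no longer wastes I/O. */
    bool all_dirty = true;
    for (int i = 0; i < BASE_PAGES_PER_SP; i++)
        if (!sp->dirty[i]) { all_dirty = false; break; }
    if (all_dirty)
        sp->promoted = true;
}

int main(void)
{
    struct superpage sp = { .promoted = true };
    on_write_fault(&sp, 3);
    printf("promoted after first write: %d\n", sp.promoted);    /* prints 0 */
    for (int i = 0; i < BASE_PAGES_PER_SP; i++)
        on_write_fault(&sp, i);
    printf("promoted after all dirty:   %d\n", sp.promoted);    /* prints 1 */
    return 0;
}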

Evaluation

The authors have evaluated their proposal on many real-world workloads to understand the impact of single-size versus multiple-size superpages and the reduction in TLB misses due to superpages. Based on the results, about 50% of the workloads show a ~5% improvement and around 30% of the workloads show an improvement of 25%. Only one workload (Mesa) sees a performance degradation, because the allocator does not differentiate zeroed-out pages from other free pages. The majority of the workloads see a TLB miss reduction of about 90% or more, which is quite significant. This piece of the evaluation, however, does not give an internal picture of how many pages undergo promotion, demotion and so on.

Over time, as the system undergoes allocation and deallocation, memory fragmentation becomes prominent and can hurt the performance of superpages. The authors evaluate their design in this respect by first fragmenting the system with a web server load and then running another workload either sequentially or in parallel. The results show that in the sequential execution the system recovers from the fragmentation, finding enough contiguity to build superpages and recovering 20 of the 60 requested superpages. Similarly, for the parallel execution, the goal is to measure how the page replacement policy performs in the presence of a contiguity-seeking application; about 30% of the superpage requests were fulfilled in that scenario.

Lastly, the authors evaluate the system under various pathological cases to determine the performance implications in such scenarios. They observe that the incremental promotion overhead is significant (8.9%), but point out that 7.2% of this is due to hardware while 1.7% is due to population map management. The overhead due to the population map is not reported for the extensive workload evaluation mentioned earlier.

The authors show that, without modifying applications or hardware, they are able to mitigate the TLB coverage problem for most applications. However, it would have been better to discuss or evaluate how easily the proposed system could be ported to a new OS or hardware. Would it be a smooth port?

Confusion
I am confused about how cache coloring is subsumed by this new proposal.

1. Summary
The paper addresses issues with supporting memory pages of multiple sizes. While small pages are generally better for memory management thanks to less fragmentation and lower I/O bandwidth, larger pages are better for improving TLB coverage. The authors analyze these issues and implement an effective memory management system for the Alpha processor in FreeBSD, addressing page allocation, fragmentation control, and the promotion, demotion and eviction of pages.

2. Problem
Advancements in memory systems over the last decade have resulted in a significant increase in available physical memory (and, in parallel, in the memory demanded by applications for best performance), but this has not been paralleled by an increase in TLB coverage. This has led to an increase in the TLB miss rate, which significantly affects application performance. Naive hardware optimizations such as increasing the TLB size, banking the TLB, adding second levels, etc. have been attempted without sufficient performance improvement, leading to the implementation of multiple page size support in hardware and the OS. Increasing page sizes creates different trade-offs requiring better memory management, which this paper addresses.

3. Contributions
Reservation based allocation:
Reserving an entire superpage-sized extent of physical memory (if available) when one of its base pages is first accessed is an efficient allocation technique that cuts down the overheads that other schemes (such as relocation) suffer from. This paper extends the scheme to support multiple page sizes.

Policy on deciding superpage size:
The decision on the size has to be made 'eagerly' under the reservation mechanism: the entire region has to be tentatively reserved at the first base-page access. The implemented policy looks at attributes of the memory object at page-fault time and makes the size decision based on whether the object is static or dynamic in size, alignment constraints, and avoidance of overlap with existing reservations. When new allocations are requested, the system preempts existing reservations (those whose most recent page allocation occurred least recently) rather than refusing an allocation of the desired size. The use of a contiguity-aware page daemon increases the likelihood of being able to allocate the required superpage sizes: page reclamation occurs not only under memory pressure but also when contiguity falls low, and the daemon performs reclamation based on approximate LRU and on how reclaiming a particular page will help improve contiguity.
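A hedged sketch of what such a size policy could look like as a small decision function run at fault time follows; the page sizes, the fixed/dynamic distinction and the function itself are illustrative assumptions, not the paper's exact policy code.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Page sizes of a hypothetical processor, largest first. */
static const size_t page_sizes[] = { 4u << 20, 512u << 10, 64u << 10, 8u << 10 };
static const int    n_sizes      = 4;

/* Pick a reservation size for a faulting memory object: for fixed-size
 * objects, the largest size whose aligned extent stays inside the object;
 * for dynamically sized objects (heap, stack), a size no larger than the
 * object's current size, to limit over-reservation. */
static size_t preferred_size(size_t obj_size, size_t fault_off, bool fixed_size)
{
    for (int i = 0; i < n_sizes; i++) {
        size_t sz    = page_sizes[i];
        size_t start = fault_off & ~(sz - 1);     /* aligned start of candidate */

        if (fixed_size) {
            if (start + sz <= obj_size)
                return sz;                        /* fits entirely in the object */
        } else {
            if (sz <= obj_size)
                return sz;                        /* no larger than current size */
        }
    }
    return page_sizes[n_sizes - 1];               /* fall back to the base page  */
}

int main(void)
{
    printf("%zu\n", preferred_size(3u << 20, 0, true));    /* 524288: 512 KB fits */
    printf("%zu\n", preferred_size(100u << 10, 0, false)); /* 65536: capped heap  */
    return 0;
}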

Promotions/ Demotions:
Promotions are performed incrementally from one supported superpage size to the next as the populated region grows, to avoid inflating the application's memory footprint. Demotions, on the other hand, are done speculatively, since otherwise it is impossible to figure out which base pages within a superpage are inactive. Similarly, at page-reclamation time, when dirty pages need to be written back, it is impossible to know which base pages within the superpage are dirty; therefore clean superpages are demoted when the first write to them is attempted and are re-promoted incrementally as base pages are dirtied, which significantly reduces the I/O cost of writing to disk and of copy-on-write.
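To illustrate the incremental step, here is a small self-contained sketch that, after a base page is populated, promotes the smallest fully populated aligned region and then tries the next larger size; the size ratios, array representation and function names are toy assumptions rather than the actual implementation.

#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 3                    /* toy sizes: 1, 8, 64, 512 base pages    */
#define FANOUT    8                    /* each size is 8x the previous one       */
#define NPAGES    512

static bool populated[NPAGES];         /* which base pages are currently mapped  */

/* Count populated base pages in the aligned region of FANOUT^order pages
 * that contains base page pg. */
static int region_population(int pg, int order)
{
    int span = 1;
    for (int i = 0; i < order; i++) span *= FANOUT;
    int start = (pg / span) * span;
    int count = 0;
    for (int i = 0; i < span; i++) count += populated[start + i];
    return count;
}

/* After mapping base page pg, promote incrementally: first the smallest
 * superpage whose region has become fully populated, then the next larger
 * one, stopping at the first region that is not yet full. */
static int promote_incrementally(int pg)
{
    int promoted_order = 0;
    for (int order = 1; order <= MAX_ORDER; order++) {
        int span = 1;
        for (int i = 0; i < order; i++) span *= FANOUT;
        if (region_population(pg, order) < span)
            break;                     /* region not full yet: stop promoting    */
        promoted_order = order;        /* here the OS would rewrite the mappings */
    }
    return promoted_order;
}

int main(void)
{
    for (int i = 0; i < 8; i++) populated[i] = true;   /* fill one 8-page region */
    printf("promoted to order %d\n", promote_incrementally(0));    /* prints 1   */
    return 0;
}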


4. Evaluation
The authors implement the design on the FreeBSD-4.3 kernel atop an Alpha 21264 processor with a 128-entry TLB. While increasing TLB coverage is the fundamental aspect of the paper, few details are provided about the hardware support required to do so: the TLB needs to be able to handle varying page sizes.

The best-case benefits analysis is rigorous, but since it is a best-case analysis, it does not account for the overheads from preemption and from the unavailability of required superpage sizes (though separate analyses of those overheads found them to be minimal). While the paper compares the use of multiple page sizes against baselines with a single page size, it would have been interesting to see the use of only two page sizes (say 8KB and 512KB) and to analyze how different the speedup is from using all page sizes. Also, it would be interesting to analyze the impact of larger page sizes on multiprocessor environments with significant sharing; the overheads from transferring pages between processors might impact performance more.

The analysis of adversary applications is a clear highlight: most of the proposed techniques in the paper, namely incremental promotion, preemption of reservations (for better allocation), and the modified contiguity-aware replacement policy, are evaluated for overhead in their area of significance and are shown to be reasonable techniques.

It would have been more interesting if there were some breakdown of the individual performance improvements from each of their innovations.


5. Confusion
H/W changes to the TLB (installing comparators to check the page size, programmable masking of page-number bits) are all overheads that affect latency. Is that still better than increasing the TLB size, banking the TLB and/or adding an L2 TLB?

Are multiple page sizes (more than two extremes) really necessary?

It is not clear how speculative demotion works: are base pages chosen at random from the superpage and tested for their activity?

Population map usage is not very clear.

1. Summary
The paper provides a practical, transparent and effective solution for supporting superpages in operating systems through a reservation-based superpage management system. Superpages are memory pages of large sizes that allow a single TLB entry to map a larger portion of physical memory into a virtual address space, thus dramatically reducing TLB misses and providing performance improvements for various applications.

2. Problem
The number of TLB entries has grown very slowly compared to the exponential growth in the size of main memory and of the working sets of current applications, because TLB access times must be kept low. How can we increase TLB coverage without increasing the TLB size? We could use larger pages, called superpages, in multiple sizes, which would increase TLB coverage while keeping internal fragmentation low. Providing support for superpages imposes several challenges: a superpage size must be supported by the processor; a superpage must be contiguous in physical and virtual memory and aligned on its size boundary; and the TLB entry of a superpage provides a single reference bit and a single dirty bit, so it is hard to know which base page was referenced or dirtied. Techniques and policies for operations like superpage allocation, promotion, demotion, and replacement or eviction need to be designed to support effective superpage management.


3. Contributions
The main contribution of this paper is the construction of an effective and practical superpage management system from a few previously used mechanisms and some novel ones, guided by a set of policies that decide when to invoke them, in order to support the superpage management operations: reservation-based superpage allocation, incremental superpage promotion, speculative demotion, eviction, etc. i) Previous work on reservation-based superpage allocation is extended with reservations that can be preempted (choosing the one whose most recent allocation occurred least recently) by means of a novel multi-list reservation scheme. ii) A new structure called the population map is introduced that keeps track of allocated base pages for each virtual memory object (mmapped regions, code, stack and heap segments are all virtual memory objects) and supports efficient lookups during promotion and preemption. iii) A contiguity-aware paging daemon that restores contiguity and a buddy allocator that automatically coalesces contiguous free memory regions are implemented to keep fragmentation under control.
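A minimal sketch of the multi-list reservation idea in (i) follows: one list of partially populated reservations per superpage size, with the most recently allocated reservation at the head, so the preemption victim is simply the tail. The structures are illustrative assumptions, and re-insertion of an already listed reservation (which would first unlink it) is omitted to keep the sketch short.

#include <stddef.h>
#include <stdio.h>

#define N_SIZES 4                          /* number of supported superpage sizes */

struct reservation {
    unsigned long last_alloc_tick;         /* time of most recent base-page alloc */
    struct reservation *next;
};

/* One list of partially populated reservations per superpage size,
 * kept ordered from most to least recently allocated. */
static struct reservation *resv_list[N_SIZES];

static void note_allocation(struct reservation *r, int size_idx, unsigned long tick)
{
    r->last_alloc_tick = tick;
    r->next = resv_list[size_idx];         /* newest allocation goes to the head  */
    resv_list[size_idx] = r;
}

/* Under memory pressure, preempt the reservation of the requested size whose
 * most recent allocation happened least recently, i.e. the list tail. */
static struct reservation *pick_victim(int size_idx)
{
    struct reservation *r = resv_list[size_idx];
    if (!r) return NULL;
    while (r->next) r = r->next;
    return r;
}

int main(void)
{
    struct reservation a = {0}, b = {0};
    note_allocation(&a, 0, 10);
    note_allocation(&b, 0, 20);
    printf("victim's last allocation tick: %lu\n",
           pick_victim(0)->last_alloc_tick);       /* prints 10 */
    return 0;
}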

4. Evaluation
The evaluations cover a wide range of workloads and benchmarks, spanning both best-case and pathological scenarios. First, benchmarks and applications with varying memory requirements are run in isolation under no memory pressure to demonstrate the benefits of supporting superpages of multiple sizes. It was nice to see how applications that create large superpages obtain significant benefits, while an application like the web server, which creates a huge number of smaller pages as a result of accessing a large number of small files, suffers only negligible (or no) performance degradation.

The evaluation of the contiguity-aware page replacement technique, which runs a workload that demands very large superpages on memory that is already fragmented and also runs two applications with exactly opposite superpage requirements concurrently, again shows that applications which create large superpages benefit significantly, while workloads that do not create or require large superpages suffer only negligible degradation.

Long-running workloads that use a combination of benchmarks in various orders could have been evaluated, as opposed to just running two applications concurrently, to show how the system performs with a mix of applications that have varying superpage requirements. Workloads of 10 minutes or less are still short, and it would have been interesting to see the performance for workloads that run for longer periods, typical of data-center workloads that access huge datasets.

Also, there were no measurements showing the additional memory overhead of maintaining the new data structures, such as the per-size reservation lists and the per-object population maps.

5. Confusion
I didn't quite understand the page table entry replication scheme. Also, how is virtual address translation changed to accommodate pages of different sizes? The PTE is modified to include the page size; is the TLB entry modified too?

1. Summary
This paper describes mechanisms for supporting multiple superpage sizes in a modern OS. It tackles several issues including superpage promotion/demotion policies, fragmentation, and managing disk I/O for dirty superpages.

2. Problem
For most modern computer systems the size of the TLB has not scaled as much as physical memory or on-chip cache sizes. This leads to limited TLB reach and a loss in performance that can be dealt with using superpages. However, superpages impose constraints on OS memory allocation due to contiguity and alignment requirements. They present allocation issues such as selecting the optimal superpage size for reservations. Also, large pages lead to higher disk I/O demands because there is a single dirty bit for the whole region.

3. Contributions
The authors manage multiple superpage sizes with a dynamic reservation-based scheme:
1- The OS reserves contiguous extents of main memory in anticipation of promoting them later to superpages. A key idea is to use the largest reasonable page size based on the memory object the page belongs to. An overcommitted reservation is easily undone using preemption.
2- On facing memory pressure, the system pre-empts a suitable reservation that was least recently allocated. This is done using linked reservation lists and population map data structures. The reservation itself is broken into smaller extents of the next lower super page size.
3- The system incrementally promotes a set of base pages in a reservation to a superpage once all the contained base pages are populated. This allows for a scalable way to create larger superpages.
4 - Demotion of superpages gives a mechanism to adjust the optimal super-page size for a memory region. Demotion is done incrementally to the next lower super page size. The system also speculatively demotes superpages to determine if parts of it are not being referenced at all and thus reclaim those memory pages.
5- The authors introduce a contiguity-aware page daemon that demotes superpages to reclaim free memory. The daemon is also activated when the system deems that the level of contiguity has fallen too low (see the sketch after this list).
6- Lastly, clean superpages are demoted whenever a process attempts to write to them, and are re-promoted once all sub-pages are dirtied; this reduces unnecessary disk I/O for superpages.
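As a concrete (and purely illustrative) reading of point 5, the sketch below shows a wake-up check that fires not only on low free memory but also when the number of free contiguous extents of some size drops below a threshold; the thresholds, counters and names are assumptions, not the FreeBSD code.

#include <stdbool.h>
#include <stdio.h>

#define N_SIZES 4

/* Illustrative global state; in a real kernel these come from the allocator. */
static long free_pages;                      /* free base pages overall          */
static long free_extents[N_SIZES];           /* free contiguous extents per size */
static long low_free_threshold   = 1024;
static long low_extent_threshold = 4;

/* A classic page daemon wakes only on memory pressure; a contiguity-aware one
 * also wakes when too few contiguous extents remain to satisfy likely
 * superpage reservations, and then reclaims inactive/cache pages chosen so
 * that freeing them helps rebuild contiguous regions. */
static bool should_wake_page_daemon(void)
{
    if (free_pages < low_free_threshold)
        return true;                         /* ordinary memory pressure         */
    for (int i = 0; i < N_SIZES; i++)
        if (free_extents[i] < low_extent_threshold)
            return true;                     /* contiguity has fallen too low    */
    return false;
}

int main(void)
{
    free_pages = 100000;                     /* plenty of free memory ...        */
    free_extents[3] = 1;                     /* ... but almost no large extents  */
    printf("wake daemon: %d\n", should_wake_page_daemon());        /* prints 1   */
    return 0;
}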

4. Evaluation
The authors use a broad range of benchmarks including the SPEC INT/FP 2000 suites and many other programs: a web server, a linker, FFTW. Several memory-intensive programs show large benefits from using superpages due to reduced TLB misses. They also give interesting comparisons of performance with different fixed page sizes, pointing out that no single size fits all, thus defending their dynamic page-size scheme. A sustained-run test also shows the page daemon is able to recover contiguity lost to fragmentation. However, they do not specifically talk about the benefits of speculative demotion of superpages, i.e. they do not provide a breakdown of why and how this mechanism works.

5. Confusion
For the page replacement daemon, what is the difference between the cache and inactive pages, are the cache pages easily replaceable or unwanted?

Summary

This paper presents the design of a superpage management scheme for general-purpose operating systems and provides an evaluation of the implementation. The proposed solution supports multi-sized superpages and can be leveraged for transparent superpaging of application memory. Speedups of over 25% were observed in 30% of the benchmarks.

Problem

There has been an increasing disparity between TLB coverage and main memory sizes. TLB reach has not grown in proportion, owing to its associativity and access-time requirements. Application performance is degraded substantially by TLB misses stemming from this low coverage. Superpages are presented as a solution to this problem in this paper; however, an accurate implementation of superpages requires a number of careful considerations. An attempt has been made to address these challenges and present a transparent solution.

Contributions

The primary contribution of the paper is the extension of an existing reservation-based superpage management system with enhancements such as a contiguity-aware page replacement mechanism, multiple superpage sizes, and improved disk I/O for partially modified superpages.
The key notions are variable superpage sizes, reserved according to a preferred-size policy, which mitigates the problems of fragmentation and contiguous memory allocation; preemption of reserved but unallocated frames under memory pressure; and a buddy allocator and contiguity-aware page daemon that help coalesce fragmented memory chunks.
A number of data structures are also introduced: a multi-list reservation scheme that keeps track of partially used memory reservations for objects, and a population map that keeps track of the allocated memory in each object. Several immediate and speculative superpage demotion strategies that reduce the cost of superpage eviction and optimize the disk I/O costs of partially modified superpages are also discussed.
Incremental promotion of base pages to superpages of increasing sizes is implemented; similarly, if a page is chosen to be evicted it is demoted to the next smaller superpage size.

Evaluation

The authors have implemented their design in the FreeBSD-4.3 kernel and present an evaluation on a varied range of benchmarks. Superpages consistently outperform the unmodified version, with improvements of 5%-25% across over 35 benchmarks in best-case scenarios. Multiple superpage sizes clearly provide a benefit over a fixed superpage size; allowing the system to choose the page size gives higher performance, which is an important observation.
The system actively restores contiguity of memory regions, a key requirement for good performance. The authors evaluate two workloads:
Sequential execution: first fragmenting memory (using the web server workload) and then restoring it (using FFTW). The daemon is able to restore 20 of the 60 requested superpages.
Concurrent execution: the web server is run concurrently with a contiguity-seeking application. The daemon is again able to fulfill ~30% of the superpage requests.
The contiguity-aware page daemon thus shows significant performance improvements even in the presence of widespread memory fragmentation.
The authors also evaluate the overall overhead in practice and show that it is 1%.
The evaluations show that the contiguity-aware daemon costs the web server process only about a 3% performance hit relative to the traditional least-recently-used policy; it could be more significant in other scenarios, which should have been highlighted in the evaluation.
Additionally, evaluations could have been done for speculative demotion and the multi-list reservation scheme against traditional approaches.

Confusion
The concept of the buddy allocator and its implementation are not so clear to me; also, a bit more explanation of the population map is needed.

1. Summary : Most processors provide hardware support for large pages, called superpages. These increase TLB coverage, reduce TLB misses and improve performance. The OS faces challenges such as superpage allocation, promotion tradeoffs, and fragmentation control in order to support superpages. In this paper the authors analyze these issues and propose the design of a superpage management system. They also evaluate it on a varied range of intensive workloads and benchmarks and claim benefits of over 30%.

2. Problem : The TLB lies in the critical path of every memory access and is fully associative with low access times. The size of TLBs has grown at a much slower pace than that of main memory and caches. To cater to this, TLB coverage should be increased by using larger pages of multiple sizes to cover the working sets of applications. Even though this improves performance by around 30%, it gives rise to enlarged application footprints, increased fragmentation, and higher paging I/O traffic due to the increased page granularity. To tackle this complex, multidimensional optimization task, the authors develop a general, transparent superpage management system that balances the tradeoffs in the allocation of superpages to yield sustained benefits when memory is plentiful.

3. Contributions: The paper extends a previously proposed reservation-based approach to work with multi-size superpages. It puts forth contributions in the following areas:
a] Reservation-based allocation, where a contiguous region (its size determined by attributes of the faulting memory object) is reserved at page-fault time and promoted when the number of populated frames reaches a promotion threshold. Promotions are incremental in nature to avoid inflating the application's memory footprint.
b] Reservation lists, which keep track of the reserved extents that are partially populated, sorted by the time of their most recent page-frame allocation. When the need for a contiguous region of free memory arises, the system first relies on the buddy allocator and then on preempting a reservation from these lists.
c] The system prefers to preempt existing reservations over refusing an allocation when free physical memory is scarce or excessively fragmented, choosing victims by approximate LRU.
d] The buddy allocator coalesces available memory regions as and when possible.
e] Speculative demotion: eviction of a base page by the page daemon causes a superpage to be demoted recursively to the next smaller superpage size, down to the one that contains the victim page.
f] Clean superpages are demoted on an attempt to write and re-promoted later, to prevent the I/O overhead of writing out the entire superpage when its single dirty bit is set. An optimized hash computation over the superpage contents aids this design.
g] The population map keeps track of the allocated base pages (using counters) within each memory object and aids in address mapping, detecting overlapping regions during allocation, page promotion decisions, and identifying unallocated regions at preemption. It is implemented as a radix tree, where the root level corresponds to the maximum superpage size and subsequent levels correspond to the next smaller superpage sizes.
h] A contiguity-aware page daemon maintains active, inactive and cache lists based on approximate LRU. It is activated on memory pressure or when available contiguity falls low, with the goal of restoring contiguity. All clean pages backed by a file are moved to the inactive list once the file is closed.

4. Evaluation: The authors have performed the evaluation of their transparent support for superpages judiciously. They implement their design in a FreeBSD kernel running on an Alpha processor, as a loadable module with hooks in the OS that call module functions when necessary. The firmware supports superpages through page table entry (PTE) replication. Several workloads demonstrate benefits from superpages when free memory is plentiful and non-fragmented. Only Mesa shows a negligible degradation, which the authors attribute to their allocator not differentiating zeroed-out pages from other free pages. One shortcoming is that this design subsumes the page coloring used to reduce cache conflicts. The results with multiple superpage sizes show an improvement, as the best superpage size is application dependent. I highly appreciate how the authors have evaluated their ideas (the buddy allocator and the contiguity-aware page daemon) for sustained performance improvement in the long run over conventional systems; the improvement can be clearly attributed to the active restoration of contiguity. The effort the authors put into designing three synthetic workloads to exhaustively test the overhead of incremental promotion, sequential access, and preemption is highly commendable. Most of these overheads are negligible or can be attributed to hardware-specific reasons such as PTE replication. However, having noted the extensive evaluation above, I feel a little insight into the following aspects would have made the evaluation complete: a] as the overhead in many cases is attributed to hardware specifics, the authors could have evaluated the system on more hardware architectures rather than constraining themselves to the Alpha; b] the extra computation and memory for computing hashes, maintaining counters, and accessing and storing the population map as a radix tree have not been measured. In my understanding these new data structures and computations could have a significant effect in terms of time and space, which the authors did not evaluate.

5. Confusion: The granularity of each of the data structures, such as the population map, reservation lists and hashing, and their association with each process, is not clear.

1. Summary
This paper describes techniques for managing superpages in a way that decreases fragmentation and the need to copy data. They implement this as a loadable module for FreeBSD and evaluate it on benchmark suites, real-world programs, and pathological examples.
2. Problem
When this paper was written, memory had increased in size much faster than TLBs had increased in size. This led to a greater number of TLB misses, which noticeably degraded performance. Many processors supported superpages, memory pages of larger size than ordinary pages, which could be used to increase TLB coverage. However, superpages could not be used efficiently: all existing systems that managed them either led to too much memory fragmentation or too much memory copying.
3. Contributions
This paper contributes a superpage management system that uses reservation-based page allocation. This means that when the system allocates a page, it tries to allocate it in a region that can expand to the maximum superpage size. The system chooses to preempt reservations rather than deny an allocation whenever possible. In addition, the system promotes superpages incrementally: if a process populates more memory than the current superpage allocated to it, the system increases it to the next superpage size, not necessarily the maximum possible superpage size. For demotion, the system will demote a superpage speculatively, without knowing whether the full superpage is in use, even though it must be rebuilt if it is. The system also maintains a linked list of superpages ordered by the extent of memory that can be gained by demoting them.
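As an illustration of how demotion can proceed step by step when a single base page must be evicted, here is a toy sketch (my own, under assumed size ratios) that repeatedly demotes only the piece containing the victim page, leaving sibling pieces promoted.

#include <stdio.h>

#define FANOUT 8   /* each superpage size is 8x the next smaller one (toy ratio) */

/* When the page daemon picks a victim base page inside a superpage, the
 * superpage is demoted in steps: at each step only the piece containing the
 * victim is demoted further, so the sibling pieces can stay promoted. */
static void demote_for_eviction(int sp_start, int sp_pages, int victim)
{
    while (sp_pages > 1) {
        int piece = sp_pages / FANOUT;                 /* next smaller size      */
        int idx   = (victim - sp_start) / piece;       /* piece holding victim   */
        printf("demote [%d,%d) into %d pieces of %d page(s); descend into piece %d\n",
               sp_start, sp_start + sp_pages, FANOUT, piece, idx);
        sp_start += idx * piece;
        sp_pages  = piece;
    }
    printf("victim base page %d can now be evicted on its own\n", victim);
}

int main(void)
{
    demote_for_eviction(0, 64, 37);   /* 64-page superpage, evict base page 37 */
    return 0;
}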
4. Evaluation
The evaluation seems fairly well done. They evaluate on benchmark suites (which include common programs such as gzip) as well as common workloads such as a web server. They find that every test except one shows a performance improvement, which demonstrates well that their system would help actual computing work. They also evaluate their system on pathological worst cases and show that its complexity is either constant or linear in the number of superpage sizes.
5. Confusion
I was confused by the many references to a buddy allocator, which the system depends upon deeply but which the paper never describes in detail.

1. Summary: To support variably sized pages, the authors demonstrate a reservation-based superpage allocation policy. This policy, in combination with carefully selected data structures and a novel page reclamation daemon, reduces pressure on the TLB and keeps fragmentation low, while incurring a small CPU overhead.

2. Problem: In keeping with Moore's law, memory density and thus main memory size have maintained exponential growth over the past few decades. This has yielded an increased rate of TLB misses, and miss handling comprises a larger portion of program overhead than in earlier decades. Simply increasing page sizes increases TLB coverage, but it increases both the potential for internal fragmentation and the cost of writing out dirty pages. Hardware manufacturers have, however, constructed mechanisms that allow a single TLB to handle variably sized pages. OS mechanisms for handling variably sized pages still require careful consideration, as contiguity becomes a resource to be managed; different forms of naive contiguity management may yield badly fragmented address spaces over the course of a long-running workload, or may over-allocate memory, yielding under-utilized superpages.

3 Contributions:
The core of the system is a fairly simple reservation policy: when a process requests that a superpage-aligned memory object be mapped into its address space, the system speculatively provides a reservation for the smallest superpage size that can contain the object. The requesting process is thus given preferential access to subsequent pages in the reservation; as other pages are used, they are mapped from the reservation, and the region is upgraded to a superpage once all pages within a superpage-sized extent have been touched. To minimize fragmentation, the other policies support the demotion of superpages as well as the revocation of unused pages within reservations. In particular, the contiguity-aware page daemon maintains a hierarchy (from high to low) of active (used and mapped), inactive (unused but mapped), and cached (unmapped and clean) pages. Periodically, the page daemon checks the status of pages in each list and downgrades them as necessary. Cached pages are treated as though they were completely freed, and are also coalesced into the free list, though they may be recovered if their contents are remapped into a process.

For each memory object, a radix tree is maintained, with a level of nodes for each page size. Thus, for each address in a memory object, the appropriate superpage or reservation can be located in time that is constant for a given architecture.
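To make the radix-tree idea concrete, here is a small self-contained sketch of a population map whose nodes each cover one superpage-sized region and carry a population counter; the fan-out, number of levels and structure names are assumptions made for illustration, not the actual data structure.

#include <stdio.h>
#include <stdlib.h>

#define FANOUT 8     /* toy ratio between consecutive superpage sizes             */
#define LEVELS 3     /* the root covers FANOUT^LEVELS base pages of one object    */

/* One node per superpage-sized region: a counter of populated base pages below
 * it, plus children for the next smaller size. */
struct popnode {
    int population;
    struct popnode *child[FANOUT];
};

/* Record that the base page at offset `off` (in base pages) is now populated,
 * creating nodes on the way down; each level's counter lets the promotion code
 * ask "is this superpage-sized region fully populated?" in constant time. */
static void popmap_insert(struct popnode *root, unsigned off)
{
    struct popnode *n = root;
    int shift = 3 * (LEVELS - 1);                  /* log2(FANOUT) == 3           */
    for (int level = 0; level < LEVELS; level++) {
        n->population++;
        unsigned idx = (off >> shift) & (FANOUT - 1);
        if (!n->child[idx])
            n->child[idx] = calloc(1, sizeof(struct popnode));
        n = n->child[idx];
        shift -= 3;
    }
    n->population++;                               /* leaf: the base page itself  */
}

int main(void)
{
    struct popnode root = {0};
    for (unsigned off = 0; off < 8; off++)
        popmap_insert(&root, off);
    /* The smallest superpage-sized region (8 base pages) is now fully populated. */
    printf("population of first 8-page region: %d\n",
           root.child[0]->child[0]->population);   /* prints 8 */
    return 0;
}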

4. Measurements: The authors demonstrate the best-case behavior for a large set of programs by ensuring that they are guaranteed access to fully contiguous physical memory. With the exception of two programs, TLB misses are reduced by at least 75%, and nearly every workload experiences a speedup. Next, in a choice I appreciated, the authors examine the performance of their system on several degenerate workloads and measure negligible performance degradation. The workloads are also run under more realistic memory conditions, and they also run the web server concurrently with a synthetic app that demands contiguous memory. I liked that the authors chose benchmark workloads from both ends of the spectrum, creating use cases that artificially stressed their specific design choices; I do wish they had included the tables for those experiments as well, however. I also wish they had chosen a slightly more realistic concurrent workload, such as one hosting a multi-user development environment.

Confusion:
I'm still not perfectly clear on what a buddy allocator is - structurally, it sounds remarkably similar to a radix tree.

1. Problem
TLB coverage is small compared to physical memory size, which makes performance with large working sets worse due to frequent TLB misses.
A superpage consumes a large amount of physical memory and can lead to heavy paging traffic, because the superpage determines how much data resides in physical memory and its size is large compared to the base page size. The system also pays a performance cost to manage the resulting fragmentation, so superpages have been hard to use.
Superpages have two allocation methods: relocation-based allocation and reservation-based allocation. Both methods have drawbacks: relocation-based allocation decreases performance due to frequent movement of data, and reservation-based allocation causes fragmentation.

2. Summary
The superpage design in this paper reserves memory at page-fault time so that the limited TLB entries are utilized efficiently. On the other hand, superpages, which promise a performance benefit from more TLB hits thanks to the large page size, have drawbacks that can make performance worse rather than better. To cope with these shortcomings, the paper introduces and implements several concepts: reservation lists, promotion and demotion control, fragmentation control, a population map, and a buddy allocator.

3. Contribution
Several things in this paper are not new concepts, for example the memory allocation mechanisms (reservation-based and relocation-based allocation) and the buddy allocator, while the promotion and demotion scheme is (at least in my opinion) introduced here. Promotion happens gradually as page faults occur, so the size of a superpage is increased gradually as well. Demotion takes place when there is no available memory because a superpage occupies physical memory; the size of the superpage is recursively decreased whenever required.
To support promotion and demotion, several mechanisms are introduced, such as the buddy allocator, reservation lists and the population map. The buddy allocator helps reduce fragmentation. The page replacement daemon reclaims pages to create contiguous free space when available memory falls below a certain limit. The reservation lists and the population map are used to track reserved space, which is preempted by the OS when memory is scarce.

4. Evaluation
The paper evaluates superpage efficiency with several benchmark programs. Most of the benchmark programs showed better performance with superpages; in particular, Matrix obtained a speedup of 7.5, and the average speedup over the SPEC benchmarks is 1.11. Multiple superpage sizes help improve performance, which comes from the increased chance of TLB hits. Even when fragmentation makes free memory scarce, performance improves after the daemon runs.
I am not sure, but running one program at a time on the system seems an unfair way to show the performance improvement, especially for a memory management system.
If the system were a distributed system with message passing, I wonder what results the paper would get.

5. Confusion
Why are both a reservation list and a population map required in this paper? The population map already knows the unused memory regions, am I right?
Can this concept be implemented in a system with a hardware-managed TLB? If so, how is it possible and what resources does the hardware require?
If a superpage mapping is frequently touched by other processes to maintain memory consistency, what happens to performance? And what about this case: if there are four cores in a system, three cores are running the same program with shared memory, and just one of the three forms a superpage because of a different workload on each core, is it possible for just one core to have a superpage, or must all three cores have the same superpage?

Summary
The paper proposes a design for operating-system-level support for superpages that improves TLB coverage and the performance of real application workloads while effectively handling the complexity and overhead associated with maintaining such a system. This reservation-based superpage design proposes various memory management techniques to efficiently manage the contiguity and paging-overhead problems that naturally arise in a superpage-based system. It has then been implemented and its performance measured.

Problem
Improvements in TLB capacity lagged improvements in main memory capacity, causing a degradation in TLB coverage. This was important because reduced TLB coverage severely impacted application performance, leading to a higher TLB miss rate and therefore a larger number of expensive page table lookups through main memory. Superpages are memory pages of sizes much larger than an ordinary (base) page. They allow a single TLB entry to map a large address region equal in size to the superpage, providing the potential to significantly improve TLB coverage, TLB hit rates and, subsequently, application performance. However, existing superpage implementations were inefficient at handling all the complexities of running a superpage-based system.

Contribution
The superpages design proposed by the author relies on the assumption that the underlying hardware will allow support for different page sizes. This approach extends previous work on reservation-based approaches to support large superpages of multiple sizes, while effectively managing fragmentation and paging I/O overheads. This is done by use of 1. reservation-based allocation of memory objects (with a preferred superpage size policy) with the set of contiguous frames for the reservation obtained from the buddy allocator 2. Dynamic, incremental promotions and speculative demotions of superpages to balance TLB hit rate performance with 3. Use of multi-list reservation scheme to assist in reservation preemption. 4. Use of a tree-based population map to perform efficient reserved frame lookup, overlap avoidance, promotion decisions and preemption assistance. 5. Memory coalescing performed by the buddy allocator to discover available memory regions.

Evaluation
The authors evaluate their design by implementing it in the FreeBSD-4.3 kernel as a kernel module. The evaluation experiments were run on a Compaq XP1000 machine with the Alpha 21264 processor, which supports four page sizes. They first ran their implementation against the unmodified system for the CINT2000 and CFP2000 benchmark suites, as well as on other applications (Web, FFTW, etc.) with differing memory usage patterns over the course of their execution. They observed that, as expected, superpages improved TLB hit rates and sped up the respective applications. Specific speedup cases for applications with outlier performance characteristics (Matrix, mesa) are then explained. The authors then conduct another study in which they restrict the system to superpages of only a single size. The results obtained are worse than the corresponding ones where multiple superpage sizes are allowed, indicating the effectiveness of letting the system choose between multiple page sizes. The issue of fragmentation is studied next for both the sequential and concurrent execution cases, in which the performance of the cache-based and daemon-based contiguity management techniques is compared. The page daemon used in the superpage design is observed to successfully reclaim contiguity with the passage of time. The authors also developed adversarial applications to gauge the potential overhead of their design components in pathological cases, and found that the performance degradation was not significant. The authors then demonstrate the benefit of their handling of dirty superpages and analyze the scalability of their approach for future hardware.

While the authors do provide a substantial body of work in their evaluation, I would like to point out two drawbacks. First, they compared their system only against an unmodified system during the benchmark runs, which does not reflect the whole objective of their design; they should have run some tests against the relocation-based HP-UX and/or IRIX systems to show how much they improved upon an existing superpage implementation. Second, they did not evaluate the average memory and execution-time overhead that each aspect of the superpage design adds to the unmodified FreeBSD system. Such profiling would have helped identify probable performance bottlenecks in their implementation.

Questions/ Confusion
1. The superpage management for the stack/heap regions was unclear.
2. The implementation of the page daemon's handling of cache/inactive pages could have been better explained.

1. summary
The paper talks about memory management techniques in the OS that support the use of superpages, a common feature provided by modern hardware. Increased page sizes result in better usage of the TLB, and the paper discusses how to avoid the various memory management issues brought about by superpages.
2. Problem
The size of main memory has been growing exponentially in recent years, while the size of the TLB keeps lagging behind. This means that TLB coverage drops quickly, and applications that require a large amount of memory can suffer greatly from TLB misses. Using a larger page size relieves the problem, but using large pages in all cases can increase internal fragmentation and memory pressure.
3. Contributions
The paper proposes memory management using superpages with dynamic size. Their design includes,
i. Reservation-based allocation. When a page is required by an application, instead of just handing out that single base page, the OS may determine a preferred superpage size for the containing contiguous region and, if possible, reserve that larger space in advance for the application. Preemption may be used once memory comes under pressure.
ii. Incremental promotion and demotion
Once a superpage-sized and aligned extent within a reservation is fully populated, the OS promotes it to a superpage; this growth in size is incremental, moving to the next larger superpage size as more of the reservation is populated. On the other hand, to save I/O cost during swapping, a superpage is demoted to smaller pages if it is to be paged out while only a small fraction of it has been modified.
iii. When paging dirty superpages out to disk, the OS needs to know which base pages of the superpage are dirty in order to perform effective demotion and avoid redundant I/O. The paper discusses using hash digests of page contents to detect modified base pages with a low collision probability.
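To illustrate the hashing idea in (iii), the sketch below stores a per-base-page digest taken while the superpage is clean and, at flush time, writes back only the base pages whose digest has changed; the FNV-1a hash, the page count and all names are illustrative assumptions rather than the paper's exact mechanism.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BASE_PAGE_SIZE 8192
#define PAGES_PER_SP   4               /* toy superpage of 4 base pages          */

/* FNV-1a: any digest with a low collision probability would do here. */
static uint64_t digest(const uint8_t *p, size_t n)
{
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

static uint8_t  superpage[PAGES_PER_SP][BASE_PAGE_SIZE];
static uint64_t clean_digest[PAGES_PER_SP];     /* digests taken while clean     */

static void snapshot_clean(void)
{
    for (int i = 0; i < PAGES_PER_SP; i++)
        clean_digest[i] = digest(superpage[i], BASE_PAGE_SIZE);
}

/* The hardware only says "the superpage is dirty"; comparing digests tells
 * (with high probability) which base pages actually changed, so only those
 * need to be written to disk. */
static void flush_dirty_superpage(void)
{
    for (int i = 0; i < PAGES_PER_SP; i++) {
        if (digest(superpage[i], BASE_PAGE_SIZE) != clean_digest[i])
            printf("write base page %d\n", i);          /* would issue disk I/O  */
        else
            printf("skip  base page %d (unchanged)\n", i);
    }
}

int main(void)
{
    snapshot_clean();
    memset(superpage[2], 0xAB, 100);   /* dirty only base page 2                 */
    flush_dirty_superpage();
    return 0;
}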
4. Evaluation
The paper makes comprehensive experiments to evaluate the performance of the memory management module. The evaluation is divided into three categories: the best-case scenario where common operations are performed with plentiful memory, fragmentation testing, and specifically designed memory usage patterns that deny the benefits of superpages, which together give an upper bound, a long-term view, and a lower bound on this module's performance. Their results show that the superpage management achieves a large performance boost of around 30% in the best cases, is resistant to fragmentation, and has acceptable overhead in pathological cases. It might have been better to also test long-term fragmentation and worst-case patterns under conditions that combine heavy memory pressure with frequent small allocations and deallocations, though, since contiguity of memory regions is harder to retain when available memory is scarce.
5. Confusion
Can we talk specifically about how the multi-list reservation scheme and the population map work in detail?

1. Summary
The paper presents the design of an effective superpage management system for increasing TLB coverage and reducing TLB misses, thus increasing performance for applications. The paper also discusses the various challenges the OS faces in superpage management, such as fragmentation and superpage promotions and demotions, and explains how their design and policies handle these challenges. The authors implement their design on the Alpha CPU and present their evaluation on real workloads.

2. Problem
With the increase in main memory sizes, TLB coverage hasn't increased much and still remains a very small fraction of memory. This leads to applications with large working sets suffering performance loss due to TLB misses. Forming superpages, where a memory page is of much larger size than a normal page and occupies only one TLB entry, leads to a substantial increase in TLB coverage. However, there are tradeoffs associated with the use of superpages, such as higher I/O costs, increased physical memory requirements, and physical memory fragmentation. The problem addressed is to design a management system that balances these tradeoffs and achieves high performance.

3. Contributions
The design presented by the authors extends basic reservation-based superpage allocation to support multiple superpage sizes, demotion of sparsely referenced superpages, effective preservation of contiguity, and efficient disk I/O for partially modified superpages. The design uses a buddy allocator to allocate physical memory in contiguous chunks. A multi-list reservation scheme keeps track of partially used memory reservations and also helps with choosing reservations for preemption or demotion. A population map keeps track of memory allocations in each memory object. These data structures are used to implement the allocation, preemption, promotion and demotion policies. Memory fragmentation is controlled by performing page replacement in a contiguity-aware manner.

4. Evaluation
The authors have implemented the design in FreeBSD on the Alpha CPU, using OS-specific techniques like a contiguity-aware page daemon and wired page clustering. The authors evaluate the best-case benefits of superpages, i.e. when free memory is plentiful and non-fragmented, on various real workloads, showing decent speedups between 5% and 25% depending on the workload. The authors also evaluate these benchmarks to illustrate the benefits of multiple superpage sizes. Experiments were designed to show the efficiency of fragmentation control and of the contiguity-aware page replacement policy. However, the memory overheads of data structures like the population map and the multi-list reservation scheme are not evaluated.

5. Confusion
Is the population map a replacement for the page table, or does the page table still exist along with these data structures?
How is the buddy allocation system implemented?

1. Summary
This paper describes the design of an effective superpage management system. The authors employ multiple strategies like reservation, preemption, multi-size page support, promotion and demotion to overcome the challenges and support superpages adequately without compromising system performance.

2. Problem
Over the past few years, the memory footprint of processes has grown significantly, but this has not been accompanied by a proportional increase in TLB coverage. This paper describes the use of superpages to combat this problem without increasing the size of the TLB, improving overall performance. Supporting superpages presents several challenges:


  • Multiple page sizes introduce memory alignment issues and lead to fragmentation.

  • It is hard to identify a base page as dirty, since the TLB now only has flags indicating which superpages are dirty, and this coarse-grained management can hurt performance.

  • Large pages necessitate the availability of contiguous physical memory which is hard to obtain without relocation or reservation strategies.

3. Contributions
The issues faced by a superpage management system are allocation, promotion, demotion, eviction and fragmentation control. Past approaches were based on relocation or depended on hardware support. The system uses a reservation-based allocation scheme, where a superpage size is chosen based on the attributes of the memory object of the faulting page. A superpage is created as soon as any superpage-sized and aligned extent within a reservation becomes fully populated, and this promotion is incremental. The system may also periodically demote active superpages speculatively in order to determine whether they are still being used in their entirety. Dirty base pages are identified via hash digests. A multi-list reservation scheme is employed per supported page size. A population map is used to keep track of allocated pages and to help in the promotion and demotion of superpages.
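The speculative demotion mentioned here can be pictured as in the sketch below: the page daemon occasionally splits a promoted superpage so that per-piece reference bits become visible, and re-promotes it only if every piece gets referenced again; the structures and the recheck policy are my own illustrative assumptions, not the paper's exact policy.

#include <stdbool.h>
#include <stdio.h>

#define PIECES 8     /* pieces a superpage splits into when demoted by one level */

/* Hardware keeps one reference bit for a promoted superpage, so the OS cannot
 * tell whether the whole region is still live.  Speculative demotion exposes
 * a reference bit per piece; idle pieces then become reclaimable. */
struct sp_state {
    bool promoted;
    bool piece_referenced[PIECES];
};

static void speculative_demote(struct sp_state *sp)
{
    sp->promoted = false;
    for (int i = 0; i < PIECES; i++)
        sp->piece_referenced[i] = false;        /* start observing each piece    */
}

static void daemon_recheck(struct sp_state *sp)
{
    bool all = true;
    for (int i = 0; i < PIECES; i++)
        all = all && sp->piece_referenced[i];
    if (all)
        sp->promoted = true;                    /* whole region live: re-promote */
    /* otherwise the unreferenced pieces are candidates for reclamation */
}

int main(void)
{
    struct sp_state sp = { .promoted = true };
    speculative_demote(&sp);
    for (int i = 0; i < PIECES; i++)            /* simulate the process touching  */
        sp.piece_referenced[i] = true;          /* every piece before the recheck */
    daemon_recheck(&sp);
    printf("re-promoted: %d\n", sp.promoted);   /* prints 1 */
    return 0;
}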

4. Evaluation
The authors implemented their design as a loadable module for the FreeBSD-4.3 kernel on Alpha firmware, which implements superpages via PTE replication. They used a diverse set of workloads, from SPEC CPU2000 to FFTW, to evaluate their system. The initial experiments targeted the best-case benefits when memory is plentiful and non-fragmented; almost all the workloads showed benefits from 5% to over 25%. The next experiments studied a system that supports multiple page sizes: the authors repeated the previous experiments with the system modified to support only one superpage size and compared it against the multi-size implementation. Here again the results indicate that the best superpage size depends on the application. The next experiment studied the long-term benefits of superpages under sequential and concurrent execution. The authors initially fragmented system memory, but the contiguity-aware page daemon successfully restored the required contiguity with negligible overhead, aided by the page replacement policy. The last experiments evaluate the system on synthetic pathological workloads (incremental promotion overhead, sequential access overhead, preemption overhead), but the observed overhead in practice was a mere 2%, and the pathological situations are rarely observed. The authors also study the impact of dirty superpages and the scalability of the system.

Overall the authors have done a commendable job evaluating their system on a diverse set of workloads by running a wide variety of benchmarks.

5. Confusion
The description of contiguity aware page daemon is not very clear and it would be nice to learn more about this in class.

1. Summary
This paper describes the various issues that arise from naively increasing TLB coverage by providing support for superpages, and the performance tradeoffs involved. The authors go on to give details of an effective superpage management system. The proposed system has been implemented in FreeBSD on the Alpha CPU and, based on the results obtained from an extensive evaluation, the authors claim that their system offers a substantial increase in performance, even under stressful workload scenarios.
2. Problem
The size of main memory has increased exponentially in the recent past. However, the TLB size has not increased at the same rate. This has caused most applications to have working sets greater than the TLB coverage, which leads to performance degradation due to a higher number of TLB misses. This can be solved by introducing superpages. However, doing so gives rise to a number of trade-offs, such as optimistic vs. pessimistic allocation and the impact of fragmentation control techniques, along with challenges in determining the ideal promotion, demotion and eviction mechanisms. The authors solve precisely this problem while taking the tradeoffs into consideration.
3. Contribution
The authors extend the reservation-based management paradigm by taking into account all aspects involved in superpage management. They support multi-sized superpages that are reserved at page-fault time, with the size determined by whether the object is of fixed or dynamically varying size. I believe making the allocation policy aware of the nature of the memory object is an important decision, as the performance of the system would fall apart if the initial reservation were not done appropriately. The system supports preemption of reservations via an approximate LRU policy, and fragmentation control is achieved by coalescing available memory regions whenever possible; the buddy allocator, which is responsible for managing physical memory, does this. The system also supports incremental promotion of superpages to the next size when fully populated, and speculative demotion that allows the system to determine whether a superpage is being used in its entirety. If not, it makes sense to demote the superpage, as that increases the control granularity over it. Lastly, in order to mitigate the disk I/O overhead, clean superpages are demoted on a write operation.
4. Evaluation
The authors have carried out an extensive evaluation of the proposed system to show that their design decisions do not negatively impact performance. They demonstrate that the system improves performance by reporting the speed-up obtained along with TLB misses for the 35 chosen benchmarks. Next, the authors show the effectiveness of multi-sized superpages and their page daemon in sustained, long-running scenarios. Lastly, they show that the performance degradation for adversarial applications is negligible. Though the evaluation is thorough, I feel it would have made sense to evaluate the system across several underlying architectures instead of one. I also feel that a comparison of the proposed reservation-based paradigm against a relocation-based paradigm would be beneficial, as it would give insight into the benefits of the adopted approach. Lastly, different speculative strategies could have been tested to see which one performs best.
5. Confusion
The working of the contiguity-aware page daemon isn’t quite clear. Discussion about the same would be helpful.

1. Summary
The paper proposes a new design to manage superpages efficiently without incurring significant overheads. This design extends reservation based superpage management scheme to support various features like multiple superpage sizes and optimizes it for low overhead execution. The work evaluates the design implementation using a range of real world workloads and benchmarks.
2. Problem
Main memory has grown at a faster rate than TLB sizes. This has resulted in poor TLB coverage, and TLB misses have become a major performance bottleneck. This can be solved by having a larger TLB (which would increase access latency on the critical path) or by having larger pages called superpages. Fixed-size superpages suffer from internal fragmentation and increased I/O latency when evicting, as an entire page needs to be flushed to disk even if only a part of it was changed. Thus the solution needs to support multiple page sizes, decide when to create or destroy a superpage, and solve the fragmentation problem, all without significant overhead.
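To make the coverage gap concrete, here is a back-of-the-envelope calculation; the entry count and page sizes are illustrative assumptions (they match the Alpha hardware used in the paper, but the snippet is mine):

```c
#include <stdio.h>

int main(void) {
    const long entries   = 128;                 /* e.g., a 128-entry data TLB      */
    const long base_page = 8L * 1024;           /* 8 KB base pages                 */
    const long superpage = 4L * 1024 * 1024;    /* 4 MB largest superpage          */
    printf("coverage with base pages: %ld KB\n", entries * base_page / 1024);        /* 1024 KB */
    printf("coverage with superpages: %ld MB\n", entries * superpage / (1024*1024)); /* 512 MB  */
    return 0;
}
```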
3. Contribution
The work builds upon Talluri and Hill's reservation-based superpage allocation scheme. The paper's contributions extend that work on several fronts that make the technique practical in a real-world system.
(i) The design adopts incremental superpage promotion, which lets it reach the correct superpage size based on how many base pages are actually utilized. It also demotes superpages speculatively to ease memory pressure. To avoid excessive I/O while paging out, clean superpages are demoted when a process writes to them, so only part of the former superpage needs to be written back on eviction.
(ii) To address fragmentation, the page daemon is made contiguity-aware and the buddy allocator coalesces memory from time to time. Coalescing is done not only under memory pressure but also when contiguity falls low. In this way, the system favours future superpage creation over keeping recently accessed pages in memory longer. It also clusters the OS's wired internal data structures, which would otherwise hurt contiguity and cause fragmentation.
(iii) The design proposes a data structure called the population map to speed up reserved-frame lookup on a page fault, avoid overlapping reservations, and support policy decisions such as demotion and reservation preemption; a minimal sketch of this structure follows.
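A minimal sketch of the population-map idea in (iii), under the assumption of a fixed fan-out between page sizes; the node layout, field names and helper are illustrative, not the paper's implementation:

```c
#include <stddef.h>

#define RADIX 8                      /* assumed fan-out between consecutive page sizes */

/* One radix-tree level per superpage size; two counters per node let the
 * system answer "fully populated?" (promotion) and "partially populated?"
 * (overlap avoidance, preemption) with a single lookup. */
struct popmap_node {
    int somepop;                     /* children with at least one base page allocated */
    int fullpop;                     /* children that are completely populated         */
    struct popmap_node *child[RADIX];
};

/* An extent can be promoted to the size this node represents only when
 * every child below it is fully populated. */
int fully_populated(const struct popmap_node *n)
{
    return n->fullpop == RADIX;
}
```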
4.Evaluation
The authors have evaluated their design by implementing it in the FreeBSD kernel and running it on an Alpha 21264 processor. They chose a gamut of workloads, ranging from SPEC benchmarks to real-world workloads such as a web server servicing over 50000 requests, to test the design in best-case scenarios and to characterize the overheads under adverse conditions. The evaluation tabulates the speedup and TLB-miss reduction under optimistic conditions, i.e., with plenty of free and unfragmented memory. It also evaluates the benefits of supporting multiple superpage sizes and the efficiency of the coalescing technique by running the design for a prolonged time under sequential and concurrent workload execution. The results convincingly show that the implementation does not add significant operational overhead, reduces TLB misses and speeds up execution. However, it would have been interesting to see how this technique compares against others, such as the ones implemented in HP-UX and IRIX (under similar conditions, e.g., fixed superpage sizes); such comparisons would have made it clear whether making such extensive kernel changes is worthwhile. Also, stating that 3500 lines of C code were added to the kernel does not clearly convey the engineering effort involved in implementing this design.
5. Doubts
How much of this idea has been implemented in present-day OS design? Given that applications demand such varied page sizes, has hardware design caught up to support them (e.g., tracking base-page changes)?
Also, how feasible is this idea given such extensive changes to the kernel? (Other papers considered this a problem and proposed designs that kept the changes transparent to the kernel.)

1.Summary:
This paper focuses on superpage management, where a single TLB entry maps a large physical memory region into the virtual address space. The design handles challenges such as superpage allocation and promotion tradeoffs and fragmentation control, and reduces TLB misses by increasing TLB coverage. It is implemented in the FreeBSD kernel and evaluated for its efficiency on various workloads.

2.Problem:
TLB coverage does not grow with increasing main memory size. This results in more TLB misses and significant performance degradation for applications. A superpage is a memory page larger than an ordinary page; using superpages reduces the number of TLB entries needed and increases coverage. This approach poses challenges such as hardware support, fragmentation, contiguity, and promotion, demotion and paging out of superpages.

3.Contributions:
1) Reservation-based allocation that grows superpages of multiple sizes during page faults and preempts reservations (least recently allocated) under memory pressure.
2) The problem of external fragmentation is addressed by a buddy allocator that coalesces available free memory regions, together with a contiguity-aware page replacement policy.
3) The problem of choosing the superpage size is solved by having different reservation policies for fixed-size memory objects (a superpage no larger than the object) and dynamically sized memory objects (allocate more space for growth).
4) Incremental promotion of superpage sizes, and speculative demotion of superpages to determine whether the whole page is in use or just parts of it.
5) The problem of writing out an entire superpage is handled by demoting a clean superpage on the first write and re-promoting it when all the base pages are dirty.
6) Data structures such as the multi-list reservation scheme to track reservations for preemption, and population maps to track the allocated portion of reservations (see the sketch below).
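A hedged sketch of the multi-list reservation idea from item 6: partially used reservations are kept on one list per page size, and under memory pressure the allocator preempts, from the appropriate list, the reservation whose most recent page allocation happened longest ago. The types and the scan are illustrative assumptions, not the paper's implementation (which keeps the lists ordered so the victim is at the head).

```c
#include <stddef.h>

struct reservation {
    unsigned long last_alloc_time;   /* "time" of the most recent base-page allocation */
    struct reservation *next;
};

/* One list per supported page size (other than the largest). */
struct reservation *reservation_list[3];

struct reservation *pick_preemption_victim(int size_index)
{
    struct reservation *r, *victim = NULL;
    for (r = reservation_list[size_index]; r != NULL; r = r->next)
        if (victim == NULL || r->last_alloc_time < victim->last_alloc_time)
            victim = r;              /* least recently extended reservation */
    return victim;
}
```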

4.Evaluations:
The authors conclude that they achieve a 30%-60% performance improvement in some cases, based on extensive tests using an accepted set of benchmarks and application programs. They evaluate the best-case benefits of superpages in large-memory systems and show how drastically TLB misses are reduced; the web workload benefits least because it touches a large number of small files, but even it sees a lower TLB miss rate. They back up the claim of improving contiguity by executing the web and FFTW applications and comparing the contiguity-aware page daemon against the cache-based scheme, where the daemon performs better. They also measure the system against various pathological cases and show that the overheads are negligible.
The evaluations could have been performed on a wider range of hardware, since hardware support for superpages is one of the issues the authors are trying to address. A comparison against existing operating systems that support superpages, such as IRIX, would also have been a good evaluation study.

5.Confusion:
More details on working of contiguity-aware page daemon?

Summary:
The paper describes superpage management on hardware that supports multiple page sizes. It proposes mechanisms for fragmentation control via contiguity-aware page replacement, eager reservation and incremental promotion of superpages, and avoidance of superfluous disk writes for partially dirty superpages, all to support efficient use of superpages.

Problem:
The paper addresses the problems that arise from supporting multiple page sizes, namely fragmentation of physical memory, which often prevents efficient use of the bigger page sizes, and intelligent promotion and demotion of superpages, to improve TLB coverage and relieve memory pressure respectively.

Contributions:

Reservation based page allocation had already been explored in previous work. However, an important contribution of this paper was the policy used for reserving superpages depending on the growth potential of various memory objects.
Page replacement in traditional systems had not considered physical location of the pages being paged out. In this paper, they propose a page replacement daemon which is contiguity aware. It works in tandem with the buddy allocator which does free memory coalescing. The daemon walks through inactive pages and reclaims those pages which contribute to contiguity while still respecting the A-LRU policy that is originally implemented.
A set of base pages is promoted to the next larger superpage size using a lazy promotion policy. Promotion is done only when the memory object fully populates an extent equal to the next superpage size. This policy ensures that promotion happens only when it is certain the memory object will use the entire superpage.
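A minimal sketch of this lazy, incremental promotion rule, under the simplifying assumption that population is tracked as a byte count; the struct and helper are illustrative, not the paper's code:

```c
#include <stddef.h>

struct extent {
    size_t size;       /* size of the candidate superpage extent              */
    size_t populated;  /* bytes of it that the process has actually touched   */
};

int should_promote(const struct extent *e)
{
    /* Promotion is deferred until the whole extent is in use, so a superpage
     * is never created for memory the application may never touch. */
    return e->populated == e->size;
}
```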
Demotion of superpages happens on two counts: one, as a consequence of page replacement; two, speculatively, to avoid building up memory pressure. Demotion is done incrementally, one size at a time.
The paper discusses a technique to share files that are mapped in superpages across processes. Although page sharing is an old trick, implementing it in the context of superpages is not trivial.
The policies mentioned above require bookkeeping about allocated and reserved pages/superpages. This is maintained in a radix tree data structure that provides for fast lookups.

Evaluation:

The system is evaluated on multiple fronts. A best-case speedup and TLB-miss-reduction scenario is presented to showcase the upper bound of the benefit of using superpages; this is an important experiment that can also be used to gauge the overheads in non-ideal scenarios. The paper evaluates the efficiency of the page replacement daemon in restoring contiguity by invoking it on an already well-fragmented memory footprint, which is a strong indicator of the daemon's performance. The paper also discusses the performance of the system under adverse workloads and demonstrates that the overheads are much smaller than the performance benefits. Overall, the evaluation has all the bases covered and is holistic.

Confusions:
How speculative demotions work
The finer working details of population maps.

1. Summary
This paper uses the support for multiple page sizes available in most processors to create superpages. These help increase TLB reach, reduce misses and offer various performance improvements. The paper analyzes challenges such as ensuring contiguity and promotion/demotion of a superpage while using minimal explicit hints from application programmers.
2. Problem
Memory became cheaper and plentiful, but TLBs did not grow at the same rate, leading to a reduction in TLB coverage. To increase TLB reach by utilizing multiple page sizes, the OS has to manage allocation strategies, contiguity requirements and fragmentation control. Previous solutions did not tackle the problem in its entirety, relying on either reservation or relocation, each with its own tradeoffs.
3. Contribution
The primary contribution of this paper is its amalgamation of previously existing strategies into one usable module. The paper uses reservation-based allocation, but with a superpage-size policy based on the nature of the memory contents. Reservations may be preempted by other reservation requests under memory pressure, or may be incrementally promoted to a superpage. The system also monitors superpage usage by periodically demoting superpages to a smaller size. Writing to a clean superpage causes immediate demotion, so that dirtiness is tracked at base-page granularity and unnecessary I/O is reduced. Various data structures such as the population map and reservation lists track memory allocations. The paper also changes the page daemon to make it contiguity-aware. Other implementation details concern the mapping of kernel memory and shared memory/file mappings. The main issue with this implementation is that it does not seem OS- or hardware-agnostic: the authors mention where they rely on hardware features without providing alternatives, for example the reference bit used for speculative demotion and the multiple page sizes provided by the Alpha processor. The lack of these features on other hardware platforms may lead to substantial porting effort.
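A hedged sketch of the speculative demotion just mentioned: when the daemon resets the reference bit of a superpage, the mapping is demoted so that later faults reveal which base pages are really active (the paper applies this selectively/probabilistically). The struct and stub are illustrative assumptions.

```c
struct mapping { int is_superpage; };

/* Placeholder: a real demotion would remap the extent with smaller pages. */
static void demote(struct mapping *m) { m->is_superpage = 0; }

/* Called when the page daemon resets the reference bit of a mapping. */
void on_reference_bit_reset(struct mapping *m)
{
    if (m->is_superpage)
        demote(m);   /* later references show which base pages are really active */
}
```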
4. Evaluation
The authors evaluate the best case, the sustained long-term benefits, and the performance penalties of pathological worst cases. While these methods do demonstrate the advantages of the solution, they do not investigate all the performance impacts and costs of such a deployment in a fine-grained manner. For example, the authors do not evaluate the latency impact of waiting for the preemption of other reservations to allocate contiguous memory versus quickly assigning non-contiguous memory. This would have added meaningful data to the concurrent-execution experiment where contiguous memory was scarce, and the latency numbers would show the impact of the contiguity-aware daemon on time-sensitive tasks. The paper does not compare performance against pre-existing superpage solutions in IRIX and HP-UX; this might have led to the conclusion that such complexity was not required in the OS and that most users could have statically analyzed and precomputed the ideal superpage size. The paper also does not profile the cost of each optimization, instead relying on black-box performance results. Seeing the cost of individual operations such as speculative demotions and incremental promotions might have revealed additional bottlenecks and helped direct future work in this area.
5. Confusion
I am confused by the interaction of the reservation list with the population map. Additionally, I am not clear on the policy to decide superpage sizes for dynamic/static components and how much information the OS has on this a priori.

1. Summary
Supporting multiple page sizes in the OS, including large pages known as superpages, helps increase the performance of memory-hungry modern applications by increasing TLB coverage. However, providing such support brings new challenges, like superpage allocation and promotion tradeoffs and fragmentation control. In this paper, the authors highlight these issues and present an efficient superpage management system that achieves performance improvements exceeding 30%.
2. Problem
Over the last decade, main memory sizes and the memory footprints of applications have increased at a much faster rate than TLB coverage. Thus, TLB misses contribute significantly to the performance degradation of applications. Increasing the page size could increase TLB coverage but leads to internal fragmentation, so systems use multiple page sizes, which brings a new set of issues. There are hardware constraints that need to be satisfied - the page sizes supported by hardware, alignment and contiguity constraints, and coarse-grained hardware structures like the reference and protection bits for superpages. In the OS, there are tradeoff challenges - allocation (reservation vs. relocation), fragmentation control, promotion, demotion and eviction.
3. Contributions
The authors have motivated the problem quite well by presenting the TLB coverage trends and discussing the challenges in detail. An impressive part of this work is that their design focuses on various aspects of superpage management. Fragmentation control has been analyzed and a novel contiguity-aware page replacement algorithm has been designed for contiguity restoration. Speculative demotion helps uncover idle base pages and reclaim their memory by demoting the superpages. Demoting the superpages on observing a write to a superpage is a very clever optimization to prevent wasteful IO. The design also enhances the traditional reservation-based allocation with policies to decide preferred superpage size and to preempt reserved pages. All these schemes work together to minimize the superpage management related overheads and extract significant performance benefits.
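A minimal sketch of the demote-on-write idea praised above: hardware keeps only one dirty bit per superpage, so a clean superpage is demoted on the first write fault, only the base pages actually written become dirty, and the superpage is re-promoted if every base page eventually becomes dirty. The structure and helper are illustrative assumptions, not the paper's code.

```c
#include <stdbool.h>
#include <stddef.h>

struct superpage {
    size_t nbase;      /* number of base pages in the extent */
    size_t ndirty;     /* base pages written so far          */
    bool   promoted;   /* currently mapped as one superpage? */
};

/* Assumes each base page is write-protected until its first write, so it
 * faults exactly once when first dirtied. */
void handle_write_fault(struct superpage *sp)
{
    if (sp->promoted)
        sp->promoted = false;   /* demote: track dirtiness per base page        */
    sp->ndirty++;
    if (sp->ndirty == sp->nbase)
        sp->promoted = true;    /* everything is dirty anyway; re-promote       */
}
```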
4. Evaluation
The authors implemented the design in the FreeBSD-4.3 kernel and evaluated it on a Compaq XP-1000, which supports four different page sizes. The evaluation does an excellent job of driving home the point the authors are trying to make, but leaves some questions unanswered. The set of benchmarks used for studying best-case benefits and multiple page sizes is quite varied - mainstream benchmarks, server workloads, scientific applications, etc. They include memory-intensive applications like matrix manipulation and image rotation, which are well suited for studying the impact of TLB coverage. The study of best-case performance benefits shows TLB-miss reductions as high as 100% with performance gains of more than 25%. Further, different applications benefit from different superpage sizes depending on their working sets and memory access patterns, highlighting the need for multiple page sizes and the benefit of on-the-fly page-size selection. This becomes even more important for applications like mcf that benefit from different page sizes during different phases. The authors do well to stress the severe fragmentation that systems suffer after just 15 minutes of use, and then use the webserver and FFTW benchmarks to make a case for the efficacy of the contiguity restoration mechanisms, with minimal performance degradation caused by the modifications to the traditional page replacement policy. Towards the end, the authors present a very insightful breakdown of the overheads associated with the various mechanisms using microbenchmarks.
However, the authors used the wide variety of benchmarks only for an optimistic estimate of performance gains when memory is plentiful and unfragmented. It would have made sense to also run all the benchmarks without these assumptions to get a more realistic set of results and compare them against the maximum possible gains. Similarly, no comparison against existing superpage management designs is presented. It would also have been interesting to isolate the benefits achieved by speculative demotion of superpages. Lastly, some experiments trying out different policies for deciding page size and for promoting reserved superpages would have lent another dimension to the analysis.
5. Confusion
The organization and usage of population maps for 4 distinct purposes is not quite clear. A detailed discussion on maps would be quite useful.

1. Summary
This paper addresses the problem of increasing TLB coverage to keep pace with increasing memory sizes, and of making the right design decisions given the various issues and tradeoffs. The authors propose a multi-list reservation scheme for superpages that uses a population map to keep track of memory allocation and to implement the eviction, promotion and demotion policies. They also justify most of their design choices using various workloads.

2. Problem
The size of main memory has grown exponentially, but the size of the TLB could not keep up with it; the main motivation for this paper is improving TLB coverage. Superpages were introduced to solve this problem, but they in turn introduce many issues and tradeoffs, like the need to control fragmentation, deciding the type of allocation (relocation vs. reservation) and having proper policies for evicting, promoting and demoting pages.

3. Contributions
The main contribution of the paper is its extension of superpage use and the various policies associated with it. They have a multi-size, reservation-based allocation scheme that supports preemption (based on LRU) when necessary. They use incremental promotion, which happens when a superpage-sized and aligned extent within a reservation is fully populated. They also support speculative demotion (one level down, based on probability), which becomes necessary when the protection bits of individual base pages change. They use coalescing and contiguity-aware page replacement to control fragmentation. Of all the features, I found paging out dirty pages via demotion and re-promotion the most interesting. I believe most of the policies mentioned in the paper had not been used with superpages before. I did not understand how the population map data structure is used for its various functions.

4. Evaluation
The evaluation in this paper was impressive, as the authors were able to justify most of their design decisions. They used many different benchmarks and showed the improved performance (in terms of speedup and the reduced percentage of TLB misses). They showed that different page sizes produce the best speedup for different applications (in all benchmarks except mesa, which they explain), and also how the whole system performs when multiple page sizes are used. They then showed how effective their page daemon is for sustained long and concurrent runs. I liked that they showed that even in worst-case scenarios (adversarial applications) the performance degradation was negligible. I would have liked to see for how many of these benchmarks relocation could have been better than reservation.

5. Confusion
I would like to see an example of how the population map (with its somepop and fullpop counters) is used for promotion, preemption, etc., and why upward propagation of the counters is required.

Summary:
The paper proposes a design, along with an implementation and detailed evaluation results, for a superpage management system, primarily to address the problem of TLB coverage failing to keep up with increasing memory sizes and the growing working sets of applications.

Problem:
Main memory size has been increasing, but TLB coverage has not kept pace, so the number of TLB misses has been increasing, which hurts performance. Hardware now supports superpages: memory pages larger than a normal page that are mapped by a single TLB entry, thereby enhancing TLB coverage. But since larger and varied page sizes can lead to internal fragmentation, this paper describes a system that manages superpages with appropriate policies and mechanisms.

Contributions:
The paper begins by mentioning the most common issues in superpage management namely allocation, fragmentation control, promotion, demotion and eviction. It talks about the relocation and reservation based approaches to give a background about the currently used methods.

The paper then describes the design of the superpage management system, which is based on reservation-based superpage management and uses a multi-list reservation scheme together with a buddy allocator to allocate frames at page-fault time. Since superpages are not of a fixed size, a preferred-superpage-size policy is used to determine the size dynamically. Reservations are also preempted under memory pressure instead of refusing to allocate memory for new requests. The buddy allocator coalesces memory regions to reduce fragmentation. Pages are promoted incrementally as the reserved memory is populated, up to the maximum superpage size, and demoted in the case of partially modified superpages.

The multi-list reservation scheme essentially tracks partially used reservations by keeping track of their usage. Population maps track the allocated pages of each memory object and thus help in faster reserved-frame lookup, overlap avoidance and promotion decisions. The contiguity-aware page daemon coalesces free pages to keep contiguous regions available. The paper also describes optimizations for paging out dirty superpages, such as hash digests of base-page contents.
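A hedged sketch of the general hash-digest idea mentioned above (illustrative only, not the paper's code): keep a digest of each base page's contents so that, at page-out time, a base page whose digest is unchanged need not be written again, keeping I/O low for a mostly-clean superpage. The paper uses a 160-bit SHA-1 computed lazily to stay off the critical path; the toy digest below stands in for it.

```c
#include <stddef.h>
#include <stdint.h>

#define BASE_PAGE 8192

/* Toy 64-bit FNV-1a digest; purely a stand-in for SHA-1. */
static uint64_t digest(const unsigned char *page)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < BASE_PAGE; i++)
        h = (h ^ page[i]) * 0x100000001b3ULL;
    return h;
}

int needs_writeback(const unsigned char *page, uint64_t *stored_digest)
{
    uint64_t now = digest(page);
    if (now == *stored_digest)
        return 0;                 /* contents unchanged: skip the write */
    *stored_digest = now;
    return 1;
}
```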

Overall, the main contributions of this paper are variable-sized superpages instead of a single superpage size, and the allocation, promotion and demotion machinery described above, built on top of contiguous physical memory allocation.

Evaluation:
A wide range of experiments was conducted on real workloads ranging from a web server and image processing to matrix multiplication. The design is implemented in the FreeBSD-4.3 kernel. As expected, a reduction in TLB misses was observed across the majority of the benchmarks, and the best-case performance has also been noted. The page demotion technique also appears particularly effective based on the evaluation results. But since the system was not designed for workloads in which a large number of small files are accessed, building superpages spanning multiple objects would be required to improve performance there. One of the main components of the system is the page daemon, which maintains contiguity, and experiments were conducted to evaluate its performance, though not in fully realistic scenarios. The advantage of multiple superpage sizes over a single size has also been evaluated.

The paper mentions the high overhead of the 160-bit SHA-1 hash function and describes optimizations, but does not provide an evaluation of these optimizations, such as lazy hash computation and removing the hash cost from the critical path. The worst-case overhead of the population map structure has not been analysed, though the worst-case overheads of the policies - promotion, demotion, preemption of reservations - have been. Since the paper is based on a reservation-based scheme, a detailed comparison against a relocation-based scheme could also have been done.

Doubts:
I would like to know about the current usage of superpages in modern OSes, because the idea seems quite reasonable. I also could not clearly understand the inference from the mesa workload results.

Summary

This paper presents various transparent superpage management mechanisms for general-purpose operating systems that can increase TLB coverage and minimize TLB misses, while still ensuring better performance for user-level applications.

Problem

Superpages have been introduced as a solution to the problem of the growing disparity between TLB coverage and main memory size. However, an effective implementation of superpages requires a number of careful considerations: not only are there hardware-imposed constraints, but also challenges in making appropriate trade-offs for page allocation, deallocation, eviction and fragmentation control. This paper addresses these challenges and presents a transparent scheme for efficient superpage management.

Contributions

The paper extends the reservation-based superpage management mechanism from previous works and provides major enhancements like support for multiple superpage sizes, efficient contiguity preservation of physical memory and improved disk I/O for partially modified pages. The key ideas contributed by the paper are:

  1. instead of having a single superpage size, superpages are reserved based on a preferred size policy that allows efficient tradeoff between increased memory footprint and fragmentation,
  2. preemption of reserved unallocated frames during increased memory pressure,
  3. a buddy allocator that coalesces contiguous memory regions (see the sketch after this list) and a contiguity-aware page daemon that strives to maintain memory contiguity, thereby increasing performance,
  4. an efficient design of a population map data structure to keep track of allocated base pages to memory objects,
  5. wired-page clustering of kernel pages to avoid memory fragmentation,
  6. a multi-list reservation scheme that allows for efficient reservation preemption, and
  7. several immediate and speculative superpage demotion strategies that reduce the cost of superpage eviction and optimize disk I/O costs of partially modified superpages.
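A minimal sketch (my own, under stated assumptions) of the coalescing buddy allocator from item 3: when a block of order k is freed and its buddy is also free, the two merge into one block of order k+1, and the process repeats, restoring the large contiguous regions superpages need. The bookkeeping helpers are placeholders.

```c
#include <stddef.h>

#define MAX_ORDER 9   /* e.g., order 9 blocks of 8 KB base pages = 4 MB */

/* Placeholder bookkeeping; a real allocator would consult per-order free lists. */
static int  buddy_is_free(size_t block, int order) { (void)block; (void)order; return 0; }
static void remove_free(size_t block, int order)   { (void)block; (void)order; }

/* Merge the freed block (given as a base-page index) upward as long as its
 * buddy is also free; returns the order of the final, possibly larger block. */
int coalesce(size_t block, int order)
{
    while (order < MAX_ORDER) {
        size_t buddy = block ^ ((size_t)1 << order);  /* buddy differs in exactly one bit */
        if (!buddy_is_free(buddy, order))
            break;
        remove_free(buddy, order);
        if (buddy < block)
            block = buddy;            /* merged block starts at the lower address */
        order++;
    }
    return order;
}
```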

Evaluation

The authors have implemented their design in the FreeBSD operating system and present a good summary of the performance results for an extensive suite of benchmarks. Under best-case scenarios, superpages consistently outperform the unmodified version, with improvements of 5% to over 25% across the 35 benchmarks. Their idea of using multiple superpage sizes clearly shows significant benefit over a fixed superpage size. The contiguity-aware page daemon shows dramatic speed-ups even in the presence of widespread memory fragmentation. They have also shown that their proposed solution is robust and resilient even in worst-case pathological scenarios.

However, I believe the paper misses evaluation of a couple of subtle points that deserve attention. The contiguity-aware page daemon causes some performance degradation because the page replacement policy deviates from A-LRU. In the evaluations, the authors show that this amounts to an insignificant 3% performance hit for the web server process (when it is the only other concurrent process running). But I suspect that this hit may be much more significant in a real-world scenario with multiple concurrent processes contending for pages, and the evaluations do not highlight this aspect.

Another point to note is that the evaluations present an overall picture that shows the advantages of superpages over normal unmodified operating system. However, I was also keenly interested in the benefits of each of the various design decisions that the authors have explicitly made over previous related works. But, evaluations fail to give any inference here. For example, how much is to be gained by just using the speculative demotions or the multi-list reservation scheme over any other related approaches. It seems to me that the overall solution aims to fine-tune many variables at once and it is difficult to distill the value of each design decision from the evaluations.

Confusion

The implementation of population map and how population maps are used to make promotion and preemption decisions are somewhat not clear.

1. Summary
The paper discusses dynamic superpage management techniques for allocation and revocation in operating systems. It also discusses how the side effects of adopting such a strategy are handled, by progressively promoting base pages into superpages or demoting them back.

2. Problem
TLB misses increased because improvements in main memory outpaced the growth in TLB reach. Existing superpage approaches had high memory footprints and suffered from internal fragmentation. Also, increasing the size of the TLB was not feasible due to cost and performance constraints.

3. Contributions
The authors address the problems of superpages with several techniques. In reservation-based allocation, they 'reserve' a region whose size anticipates the eventual size of the memory object. For fixed-size objects, they allocate the largest possible superpage that does not exceed the size of the object, falling back to progressively smaller sizes; for dynamically sized objects, they reserve a larger superpage in the expectation that the object will need more memory in the near future. This is better than previously proposed techniques since it reduces the excess memory footprint.
They then address allocation under memory pressure by preempting existing reservations in LRU fashion, which is a fair choice, though they do not cite previous work or present statistical evidence for it.
Memory allocation and deallocation over time can lead to excessive fragmentation. To address this, they use a buddy allocator to coalesce regions of memory into larger chunks, and do so incrementally. Similarly, when evicting or writing a dirty superpage to disk, they recursively break it into smaller pages and check hashes of base-page contents (computed lazily as an optimization) so that unmodified pages need not be written, which reduces unwanted I/O.
The multi-list reservation scheme further augments the buddy allocator by tracking partially used reservations for preemption, while population maps use a radix tree to efficiently track the allocated base pages of each memory object. Overall, these are significant contributions that show superpages, though a concept that existed earlier, can actually be implemented in practice.

4. Evaluation
Although they evaluated the implementation quite extensively across benchmarks and pathological cases, the authors do not present a test case where the system is subjected to sustained heavy load that creates large-scale fragmentation (the web server application only creates fragmentation), given their claim that the underlying page daemon works only during idle CPU cycles. They could also have provided profiling details for the access overheads of population maps and the average-case traversal cost of the reservation lists.

5. Confusion
I would like to discuss multi-list reservations and population maps in class. Also, how will the system behave when an object eventually outgrows its already-allocated superpage?

1. Summary
This paper presents a transparent solution for super-page support in the Operating System. This solution supports multi-sized super-pages and can be leveraged for transparent application memory super-paging even in general-purpose operating systems. Best-case performance speed-ups of over 25% in 30% of the benchmarks were observed, while worst case slow-down is negligible in most cases.

2. Problem
• Main memory and working set size of applications have grown exponentially.
• However, TLB reach has not grown proportionally due to associativity and access time requirements.
• Hence, TLB misses are degrading performance by larger proportions.
• Also, physically addressed caches can contain more data than the TLB covers.
• Hence, there is a lot of potential speedup that is unrealised due to the low TLB coverage.
• Finally, these are all trends that are showing signs of increasing, not abating.

3. Contributions
Primary contributions :
• The extension/generalization of an existing reservation-based super-page management system
• A novel contiguity-aware page-replacement mechanism
• Pioneering investigation on the effects of fragmentation in super-page systems
• Intelligent mechanisms that are vital for an efficient practical solution, such as super-page demotion and dirty super-page eviction.
Reservation-based Mechanism :
Free physical memory regions are reserved at run-time upon a page-fault and promoted when a threshold number of page frames are used. The size of the reserved region is decided by effective policies for fixed and dynamic sized memory objects. A pre-existing reservation-based mechanism was extended to support multiple sizes, scalability, demotion, efficient contiguity preservation, efficient disk I/O.
Mechanisms, Memory Pressure and Fragmentation :
If the requested page size is not available in a contiguous segment, existing reservations are pre-empted, using the Reservation Lists. External fragmentation is alleviated by coalescing, and by the contiguity aware page replacement daemon. This daemon is activated under conditions of memory pressure, or lack of contiguity.
Promotion and Demotion are both incremental. This is empirically justified in promotion, and leads to minimal superfluous I/O in demotion.
Population Map :
The Population Map is the main data structure that enables all these features. It is a radix tree that keeps track of allocated sub-super-pages within each memory object. Each level corresponds to a super-page size.

4. Evaluation
The authors have performed comprehensive evaluation and benchmarking to provide data-backed observations on the following - best case speed-up, worst-case slow-down, real-world evaluation, estimation of overhead, and justification of design choices.
The test environment was an augmented FreeBSD-4.3 kernel, on an Alpha 21264 processor with 4 page sizes. Various benchmarks / applications were used.
For the best-case analysis, plentiful and non-fragmented free memory was made available. Out of 35 benchmarks, 10 had over 25% improvement, while 18 had over 5% improvement. In real-world applications, an average speed-up of 11% was observed. For the worst-case analysis, adversarial pathological workloads were synthetically created to make the system pay all the mechanism costs without gaining any of the benefits; performance degradation was in the range of 0%-2%. This is less than the average improvement, and the pathological cases are unlikely, worst-case scenarios. An experiment was also performed to show that the performance gains are maintained over time, as fragmentation is handled by coalescing and the page replacement daemon. As justification for superpage demotion: without demotion, flushing an entire partially dirty superpage penalizes performance by a factor of 20. A comparison of multiple superpage sizes versus a single superpage size, in the context of the other innovations, is also presented.
My Opinion :
The authors have provided data to justify their design choices rather than mere 'empirical observations', which is great, and the measurements are extensive. The performance degradation in a couple of cases is blamed on the Alpha firmware/hardware; it would be reassuring to see this verified in an alternate test environment. While the benefits are notable given the low overheads, it seems to me that the performance gains are not as impressive as they could be. On that note, it would be nice to see more data on profiling and performance.

5. Confusion

Super-page size policy for dynamically growing memory objects - unclear.

1. summary
In this paper the authors analyze the issues of supporting superpages, both single-sized and multiple-sized, in terms of superpage allocation and promotion tradeoffs and fragmentation control, and propose the design of an effective superpage management system that tackles those issues. The authors show substantial performance benefits from this management system that are sustained even under stressful workload scenarios.

2. Problem
Computer memories have increased in size faster than TLBs; most TLBs (at that time) covered a megabyte or less of physical memory. Reduced TLB coverage makes virtual memory less efficient, since many applications have working sets that are not completely covered by the TLB. Superpages can address the coverage issue but pose other challenges, such as increased memory footprints and higher paging traffic, and adopting multiple superpage sizes can increase physical memory fragmentation.

3. Contributions
The main contribution of this paper is that it addresses four major issues of supporting superpages which had not been addressed *collectively* before: allocation, promotion and demotion of superpages, and fragmentation control due to multiple superpage sizes.
* It builds on top of the previously proposed reservation-based approach to support multiple superpage sizes.
* It proposes a novel contiguity-aware page replacement algorithm to control fragmentation.
* It proposes demoting and paging out dirty superpages as appropriate to handle memory pressure, which is crucial to making the overall system practical.
* It proposes the population map to keep track of base pages within each memory object, enabling efficient overlap detection, promotion decisions and identification of unallocated regions.

4. Evaluation
The authors evaluated their system under different scenarios and factors: the best case when memory is plentiful and non-fragmented versus stressed situations, when only one superpage size is used, when workloads are run for a long period of time, and concurrent workloads, using a diverse set of applications. Thus, they make convincing arguments about their assumptions and proposed solutions. They evaluated their design in the FreeBSD-4.3 kernel using a Compaq XP-1000 machine. Their experiments show that when memory is plentiful, TLB miss rates drop drastically, by around 99%, increasing speed by more than 25% for many applications. The authors also show experimentally that supporting multiple superpage sizes is most beneficial, since the best superpage size depends on the application. They analyze the effect on contiguity of the cache scheme versus the contiguity-aware page replacement scheme (daemon), and also attempt to analyze the overhead of their design, for instance in the concurrent execution section.
In addition to these experiments, I would have liked to see evaluations of each of their design decisions in isolation to understand their overheads and contributions more clearly. It would also be interesting to see how this system performs on hardware other than the Alpha, with different support for superpages.

5. Confusion
It is not clear whether the system they used for evaluation actually used the population map using the radix tree. In the evaluation section the authors mention that the Alpha firmware implements superpages by means of page table entry (PTE) replication. But the population map using the radix tree seems to be an important part of the design.
I am not sure how the reservation (not the list) is actually represented.

1. Summary
To utilize the superpage support provided by hardware and reduce TLB misses, the authors built a transparent superpage management system with reservation-based allocation, fragmentation control, and incremental promotion and demotion. No modification to the hardware or applications is required.

2. Problem
Memory sizes are growing while TLBs remain small. To achieve higher TLB coverage, larger pages or a mix of page sizes is necessary. Previous superpage management systems did not address the fragmentation problem very well: some use memory relocation and perform poorly, some require further hardware changes to weaken the contiguity requirement, and others rely on hints from applications.

3. Contributions
The paper extends reservation-based allocation to support multiple page sizes. Since reservations may later be preempted, the system tends to choose the largest size that causes no wastage. The preemption victim is the reservation whose most recent page allocation is the oldest.
The system uses the buddy allocator to manage free frames and multiple lists to manage preemptable reservations. The former is effective in achieving better contiguity by coalescing buddy pages.
The contiguity-aware page daemon is more aggressive in reclaiming pages. Cached pages are considered available for reservations, and such a reservation may be preempted if the page is reactivated. Inactive pages are cleaned and moved to the cache list when contiguity is low, and active pages from closed files are marked inactive as soon as they lose all references.
A hierarchy of promotions and demotions is maintained via population maps. Pages get promoted to the next size level when all pages in the region are populated. Besides the required demotion when a page gets evicted or its protection bits change, superpages are also broken into smaller ones to achieve finer granularity for reference and dirty bits.

4. Evaluation
The paper first shows the benefits of superpages when fragmentation is not a problem. A large variety of workloads is used, and the speedup against the unmodified system as well as profiles like page usage and TLB-miss reduction rates are provided. Some noticeable cases are analyzed in detail, including the low TLB-miss reduction for web and the performance penalty for mesa. The authors also account for the effects of page coloring.
The benefits of having multiple page sizes are then illustrated by running the same workloads with restricted page size choices. The results prove the assumption well and anomalies get explained.
The effectiveness of contiguity-aware page replacement is evaluated by running the web workload alongside another one that requires contiguity. Their new replacement policy outperforms the original one significantly. However, the generalizability of this policy is not well tested: because the new policy aggressively inactivates clean pages backed by files, it is in effect tuned specially for the web application.
The overall and detailed overheads are measured by running pathological applications. Setting aside the PTE replication required by the hardware, the overhead of their implementation is rather small.
The authors also use another pathological application to show the benefit of demoting partially dirty pages. While this is a good way to bound the worst-case overhead, it is not well suited to confirming the benefits; results on the earlier comprehensive workloads with and without demotion would be better.
They finally discuss the scalability of the system with a purely theoretical analysis.
In general, they show that the idea of superpages is powerful and that the overhead can be small, but they do not fully demonstrate how well their fragmentation control and demotion mechanisms perform in complex settings.

5. Confusion
How is the preferred page size for dynamically sized objects set? It says the reservation is allowed to “reach beyond the end of the object”, but also requires it to be limited to the current size of the object.

1. Summary
This paper introduces an effective superpage management system for the OS, which is shown to deliver high sustained performance with negligible pathological degradation. The OS employs reservation-based allocation, a policy for choosing the superpage size, the ability to preempt reservations, memory coalescing, promotion and demotion of memory extents, and specialized data structures such as the population map to track base-page allocation.
2. Problem
TLB coverage has lagged behind the exponential growth of main memory because the TLB must stay small, so it cannot cover the larger working sets of modern applications. With larger on-board, physically addressed caches, TLB misses become even more expensive. Simply making base pages larger causes internal fragmentation. A modern OS also has to handle hardware constraints, such as the set of supported page sizes, the need for contiguous physical memory, and alignment of addresses to the superpage size.
3. Contributions
With reservation-based allocation, the authors provide the physical contiguity needed to reduce TLB misses for applications with large working sets; instead of relocation-based schemes, the system reserves the largest aligned superpage-sized extent at fault time. The page replacement daemon running in the background is tweaked to be contiguity-aware so that it can handle persistent memory pressure, acting as a companion to the buddy allocator, which performs memory coalescing. Application memory footprints are kept in check through incremental promotions and demotions. Data structures such as a radix tree and hash table enable efficient lookups of base pages within each virtual memory object. By letting the kernel choose the largest superpage-aligned virtual address for shared mappings, each process mapping the same file retains the ability to build superpages in its address space. Disk I/O is made efficient by demoting clean superpages when a write is issued and re-promoting them on demand.
4. Evaluation
The authors evaluate their system in FreeBSD on the Alpha CPU with real-life workloads and benchmarks such as a web server, image manipulation, dynamic allocation over hash maps, and compute-bound compilation and linking. Given plentiful free memory, the core problem of TLB misses was virtually eliminated by effective superpage management. Thanks to the availability of large contiguous memory, common desktop applications gained large performance improvements from the flexibility of selecting among multiple page sizes. Another side effect the authors address is page coloring, which benefits from the OS maintaining contiguity in both the virtual and physical address spaces. Contiguous memory was recovered more effectively by the tweaked page daemon and page replacement policy than by the cache scheme. Some overhead, however, was imposed by the Alpha-firmware-specific PTE replication mechanism during incremental promotion of pages. With the advent of superpages, hardware manufacturers also need to provide compatible structures for scalability.
5. Confusion
There are possible hardware-level solutions as well, like allowing the memory controller to have holes in superpages or enabling the TLB to permit superpages that consist of non-contiguous base pages. How does this approach compare to those?

1. Summary

In this paper, the authors propose the design for an effective superpage management system. They evaluate their design proposal based on the performance of their implementation in FreeBSD on the Alpha CPU.

2. Problem

Superpages play a key role in increasing the TLB coverage. However, they suffer from serious limitations due to the hardware imposed constraints on page size and page boundaries, tradeoffs in page allocations, fragmentation and paging operations. In this paper they study the existing superpage management system and propose a design solution that carefully analyses the limitations and achieves better performance.

3. Contribution

Superpages are used to increase TLB coverage and improve performance, and it is essential to understand the underlying hardware and software carefully to maximise their utility. The superpage management system implements reservation-based allocation with a multi-list reservation scheme, where the largest aligned superpage that contains the faulting page and does not overlap existing reservations or allocations is reserved. It uses multiple superpage sizes, and pages are incrementally promoted or demoted depending on usage. When memory is scarce, the system preempts existing reservations. The buddy allocator maintains multiple lists of free blocks. Population maps keep track of allocated base pages within each memory object. The paper also proposes solutions for paging out dirty superpages, which is otherwise a costly operation involving much I/O, using hash digests. To accommodate the underlying hardware, some implementation-specific issues are also described: modifying the page daemon to make it contiguity-aware, clustering pages about to be wired into pools of contiguous physical memory, and handling multiple mappings.
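A hedged sketch of the wired-page clustering mentioned above: pages the kernel is about to wire (and will therefore never page out) are taken from a dedicated, physically contiguous pool instead of being scattered, so they do not break up the contiguity needed for superpage reservations. The pool, its size and the fallback allocator are all illustrative assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

static struct { char buf[64 * 8192]; size_t used; } wired_pool;  /* toy contiguous pool */

void *alloc_page(int will_be_wired, size_t page_size)
{
    if (will_be_wired && wired_pool.used + page_size <= sizeof(wired_pool.buf)) {
        void *p = wired_pool.buf + wired_pool.used;   /* keep wired pages clustered */
        wired_pool.used += page_size;
        return p;
    }
    return malloc(page_size);         /* stand-in for the general page allocator */
}
```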

4. Evaluation

The authors evaluate the performance of the design on a FreeBSD kernel. They effectively evaluate the design by comparing performance in both best-case and worst-case scenarios in terms of pressure on system memory, and use a wide range of benchmarks to evaluate performance on different workloads. Improvements in common desktop applications are measured. They also measure the performance with only one superpage size versus multiple sizes, showing that allowing the system to choose among multiple page sizes performs best. They evaluate their proposed solution for contiguity restoration, and evaluations on pathological workloads show negligible performance degradation. Thus, the authors carefully test the proposed solution and determine its efficiency and areas for improvement.

5. Confusion

1. How do the page daemons control fragmentation?

Summary
Most operating systems support superpages, which increase the coverage of the TLB, but supporting superpages involves many challenges. The paper proposes an effective superpage management system with variable superpage sizes.

Problem
The size of main memory has increased over the years, but TLB coverage has lagged behind. TLB coverage can be increased by using superpages, and most OSes support them, but doing so involves many challenges such as fragmentation, policy tradeoffs, and hardware constraints (the superpage size must be among the set of page sizes the hardware supports). The paper proposes an effective superpage management system with support for different superpage sizes.

Contributions:

The main contribution of the paper is an effective superpage management system based on the following ideas:
1] Extending reservation-based allocation: the authors extend reservation-based allocation of superpages to support multiple superpage sizes efficiently by introducing: a] preempting reservations - if physical memory becomes scarce or excessively fragmented, the system can preempt frames that are reserved but not yet used; b] incremental promotion - when a superpage-sized region is fully used, it is promoted to the next larger superpage size; c] demotion - demoting a superpage so that disk I/O is efficient for partially modified superpages.
2] A multi-list reservation scheme, which keeps track of partially used reservations and of how recently they were extended, which helps in preemption.
3] The population map, which keeps track of the memory allocation of each memory object and helps in reserved-frame lookup, overlap avoidance, promotion and preemption decisions.
4] A contiguity-aware page daemon, which carries out the following operations to increase contiguity: a] it coalesces free pages; b] it traverses the inactive page list and moves pages to the cache (if a cached page is referenced, it is moved back to the active list); c] all clean pages backed by a file are moved to the inactive list. (See the sketch after this list.)
5] Improving contiguity for large pages by identifying wired pages and clustering them together.
6] Control of fragmentation by performing page replacement in a contiguity-aware manner.
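A minimal sketch of the contiguity-aware page daemon from item 4, under stated assumptions: when free contiguity drops, the daemon walks the inactive list and prefers to move out clean, file-backed pages whose frames would let the buddy allocator coalesce a larger block. All structures and helpers are illustrative, not FreeBSD's.

```c
#include <stddef.h>

struct page {
    int clean;                 /* no dirty data to write back                  */
    int file_backed;           /* can be dropped and re-read from its file     */
    int helps_coalescing;      /* freeing it would complete a buddy pair       */
    struct page *next;
};

/* Move pages from the inactive list to the cache until enough contiguity
 * has been restored; preference goes to cheap, contiguity-restoring pages. */
void restore_contiguity(struct page *inactive_list, int needed)
{
    for (struct page *p = inactive_list; p != NULL && needed > 0; p = p->next) {
        if (p->clean && p->file_backed && p->helps_coalescing) {
            /* move_to_cache(p); -- hypothetical: the page stays reclaimable but
             * its frame becomes available for reservations and coalescing */
            needed--;
        }
    }
}
```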


Evaluation
The authors evaluate different types of workloads, starting from the best case where there is enough unfragmented free memory to serve every superpage request. The results are compared with an unmodified system, and a table gives a detailed breakdown of the number of superpages used and the percentage of TLB misses eliminated for the different applications, which in most cases is around 90% or above; the application speed-ups range from 5% to 25%. The authors evaluate recovery from severe fragmentation using the cache scheme and the page daemon, and show that the page daemon recovers up to 42% of the contiguity and performs well under sequential as well as concurrent execution of processes (one that increases fragmentation and one that requires contiguous memory).
The paper also evaluates real workloads and benchmarks, showing that the system is robust, and properly evaluates each of the concepts explained - incremental promotion, preemption, scalability. Overall the system imposes negligible overhead.

These evaluations are fairly convincing that the superpage management system works well, but the authors do not show how well applications perform when there is not enough unfragmented memory to serve a superpage request; I think the performance gain would not be the same, since the time taken by the page daemon to recover contiguous segments would affect the performance of applications.

Efficient dirty-page handling, which seems to be an important part of the system, shows that not demoting can cause a penalty of a factor of 20. The authors note that superpages do not help processes that don't write to all the base pages of a superpage, but most real-world applications write only a few pages, whose combined size is definitely smaller than a superpage; hence, I don't think this approach would be useful for workloads with that behavior (which is the common case in the real world).
Superpages also don't seem to help much in scenarios with many lightweight processes, such as web servers, even though the world is increasingly web-oriented these days.
The paper also lacks a detailed explanation of how the reservation lists and population maps are actually implemented.

Doubts:

Is the PTE-replication mechanism used as the population map? How is the reservation list implemented in the evaluation? Which operating systems in industry use superpages, and is their implementation similar to what is explained in the paper, or is it different?

1. Summary
A superpage is a memory page larger than an ordinary page that allows each TLB entry to map a large physical memory region into the virtual address space, improving performance between the CPU and main memory. The work in this paper is about managing such variable-sized superpages for application memory. The authors employ reservation-based allocation, study fragmentation issues, devise a contiguity-aware replacement algorithm, and introduce an eviction model for dirty superpages. They test the design on real workloads and benchmarks and show significant performance gains.
2. Problem
TLB size is kept small to maintain access times, so while main memory size increased exponentially, TLB coverage fell behind by a large factor. Possible solutions to increase TLB coverage: increase the page size => internal fragmentation + high I/O traffic; make the TLB larger => slower access times; 2-level TLB => not much gain; multiple page sizes (superpages) => must choose between reservation-based and relocation-based allocation, estimate the superpage size, handle fragmentation, decide when to promote, while the single reference and dirty bits make demotion and eviction hard. Since hardware these days offers support for superpages, the OS has to harness it.
3. Contributions
The paper builds on a preemptive reservation model, using the principle of locality to estimate the superpage size; the policy is opportunistic (go for the biggest size possible, no larger than the file/memory object) and preempts reservations in LRU fashion. This is a neat design to start with. They then allow incremental promotions: a superpage is created whenever a superpage-sized and aligned extent within a reservation is fully populated. With only a single reference and dirty bit per base page, they handle demotion elegantly: demote the superpage speculatively (with some probability) when its reference bit is reset, or demote on the first write to a clean superpage (hash-based optimizations prove costly). To control fragmentation, they adopt contiguity-aware replacement, which biases the policy toward pages that contribute most to contiguity and also performs periodic coalescing. All in all, the paper addresses the complicated issues of managing superpages with a sound methodology and efficient data structures.
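To make the speculative-demotion idea above concrete, here is a toy C sketch; the structure, the helper, and the way the probability is applied are my own simplifications rather than anything taken from the paper.

    #include <stdlib.h>

    /* Toy superpage record: just a size order (0 = base page). */
    struct superpage { int order; };

    /* Hypothetical helper: split the superpage into pieces of the next
     * smaller supported size (here modelled by decrementing the order). */
    static void demote_one_level(struct superpage *sp)
    {
        if (sp->order > 0)
            sp->order--;
    }

    /* Demote with probability p when the reference bit is reset, so that
     * reference information is regathered at a finer granularity.  The paper
     * reports that even always demoting (p = 1) works, because superpages
     * that are referenced in their entirety are quickly re-promoted. */
    static void on_reference_bit_reset(struct superpage *sp, double p)
    {
        if ((double)rand() / RAND_MAX < p)
            demote_one_level(sp);
    }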
4. Evaluation
The superpage design is implemented in the FreeBSD kernel on an Alpha processor that supports 4 page sizes. They extensively test every aspect of the design on various workloads and applications, which gain significant speedups under no memory pressure. They also trace negative or lower speedups to their allocator not differentiating between zeroed-out and free pages. They conclude that multiple superpage sizes and contiguity-aware allocation are essential. They could have shown the impact on recursive applications, where the stack undergoes dynamic resizing. Since hardware support for more page sizes keeps growing, they could also have simulated a scenario with 5/10/20 page sizes to see whether population-map tree traversals and lookups, whose cost grows with the number of supported page sizes, become an overhead.
5. Comments/Confusion
I did not clearly understand the implementation of the contiguity-aware page daemon for fragmentation control. I also remain unconvinced about the cost of speculative demotions (with the hard heuristic of p = 1) and the subsequent re-promotions.

1. summary
This paper proposes superpages of multiple sizes as a solution to the problem of decreasing TLB coverage. Efficient superpage management is achieved by supporting reservation-based superpage allocation with incremental promotion, implementing a contiguity-aware page daemon, and speculatively demoting pages during eviction.
2. Problem
Over the years the size of main memory has increased, but TLB coverage has not increased at the same rate. This problem can be addressed using superpages. The main problems with supporting superpages are as follows: first, there are hardware-imposed constraints, including page alignment and page-size restrictions as well as the coarse granularity of status bits in the TLB; second, space is wasted due to fragmentation; third, a high I/O overhead is incurred by flushing clean base pages to disk along with dirty ones.

3. Contributions

When a page fault occurs and a physical frame needs to be allocated, instead of allocating a single frame, the new design determines a preferred superpage size based on the attributes of the memory object and reserves a contiguous extent of that size. This is the crux of the reservation-based approach. When enough base pages have been populated, they are promoted to form a superpage, and this incremental promotion continues as the population count of base pages increases. When a base page is marked for eviction, instead of demoting the superpage all the way down to base pages, it is demoted incrementally to the next smaller superpage size.
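A rough sketch of this reservation step, assuming Alpha-like page sizes (8KB base pages plus 64KB, 512KB, and 4MB superpages); the structures and the buddy-allocator hook are hypothetical placeholders, not the paper's FreeBSD code.

    #include <stddef.h>
    #include <stdbool.h>

    /* Supported sizes in base pages, largest first: 4MB, 512KB, 64KB, 8KB. */
    static const int sp_sizes[] = { 512, 64, 8, 1 };

    /* Toy memory-object descriptor: size in base pages, and whether it is
     * fixed-size (code, mapped file) or can grow (stack, heap). */
    struct mem_object { int size_pages; bool fixed_size; };

    /* Assumed allocator hook: ask the buddy allocator for a contiguous,
     * aligned extent of 'pages' frames; returns a frame number or -1. */
    static long buddy_alloc(int pages) { (void)pages; return -1; }

    /* Pick the preferred reservation size for a faulting object: the largest
     * supported size that does not reach beyond the end of a fixed-size
     * object, falling back to smaller sizes when contiguity is unavailable. */
    static long reserve_frames(const struct mem_object *obj)
    {
        for (size_t i = 0; i < sizeof(sp_sizes) / sizeof(sp_sizes[0]); i++) {
            int pages = sp_sizes[i];
            if (obj->fixed_size && pages > obj->size_pages)
                continue;               /* would overshoot the object        */
            long frame = buddy_alloc(pages);
            if (frame >= 0)
                return frame;           /* reservation of 'pages' frames     */
        }
        return -1;                      /* caller may preempt a reservation  */
    }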

Allocation of physical pages in contiguous extents of multiple sizes can lead to memory fragmentation. A contiguity-aware page daemon has been implemented to handle this problem. The buddy allocator maintains multiple lists of free blocks ordered by size and coalesces adjacent blocks to form larger blocks. Population maps track the allocated portions of reservations. Memory pages used by the kernel are marked non-pageable and non-evictable, which can lead to fragmentation; these wired pages are clustered together in pools of contiguous memory to avoid the problem.
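For readers unfamiliar with buddy coalescing, here is a small, self-contained C sketch of the merge step; the toy bitmap of free blocks is my own simplification of what a real free-list-based buddy allocator would keep.

    #include <stdbool.h>

    #define MAX_ORDER 9          /* up to 2^9 = 512 base pages (a 4MB extent) */
    #define NFRAMES   4096       /* toy physical memory size in base pages    */

    /* free_at[o][f] is true when a free block of 2^o frames starts at frame f. */
    static bool free_at[MAX_ORDER + 1][NFRAMES];

    /* Free a block of 2^order frames starting at 'frame', coalescing with its
     * buddy repeatedly so large contiguous extents are rebuilt incrementally. */
    static void buddy_free(unsigned long frame, int order)
    {
        while (order < MAX_ORDER) {
            unsigned long buddy = frame ^ (1UL << order);   /* sibling block */
            if (buddy >= NFRAMES || !free_at[order][buddy])
                break;                        /* buddy busy: stop coalescing */
            free_at[order][buddy] = false;    /* merge the two blocks        */
            frame = frame < buddy ? frame : buddy;
            order++;
        }
        free_at[order][frame] = true;   /* publish the (possibly larger) block */
    }

The key property is that freeing a page can cascade into rebuilding a full superpage-sized extent, which is exactly the contiguity the reservation allocator needs.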

When a superpage needs to be flushed to disk, there is no way of knowing which base pages are dirty, which forces the entire superpage to be flushed. To prevent this, clean superpages are demoted to the next smaller superpage size whenever a process writes to them. An alternative is to maintain hash digests of the base pages and compare them before flushing, to detect which pages were actually modified.
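A toy C sketch of the demote-on-write idea; the per-superpage dirty bitmap and the immediate re-promotion check are my own simplifications (a real kernel tracks dirtiness through PTEs and defers re-promotion to the promotion path).

    #include <stdbool.h>

    #define BASE_PAGES 64                  /* base pages per (toy) superpage */

    struct superpage {
        bool mapped_as_superpage;          /* one TLB entry vs. per-page PTEs */
        bool dirty[BASE_PAGES];            /* per-base-page dirty state       */
    };

    /* Write fault on base page 'idx' of a clean superpage: demote first, so
     * only the pages actually written need to be flushed at pageout time. */
    static void on_write_fault(struct superpage *sp, int idx)
    {
        if (sp->mapped_as_superpage)
            sp->mapped_as_superpage = false;    /* demote to base pages */
        sp->dirty[idx] = true;

        /* Re-promote once every base page is dirty; flushing the whole
         * superpage then wastes no I/O on clean pages. */
        bool all_dirty = true;
        for (int i = 0; i < BASE_PAGES; i++)
            if (!sp->dirty[i]) { all_dirty = false; break; }
        if (all_dirty)
            sp->mapped_as_superpage = true;
    }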

4. Evaluation
By providing sufficient contiguous free memory to a range of benchmarks, the best-case benefit of implementing superpages is shown, and the results of the experiment are well justified. They further consider the overhead of not differentiating zeroed-out pages and the effect of subsuming page coloring. Similar experiments are conducted with multiple page sizes. The system is then evaluated with fragmented memory through sequential and concurrent workloads; these experiments show that it achieves nearly best-case speedup. Performance degradation under synthetic pathological workloads is also justified; for example, the degradation due to incremental promotion overhead is mainly caused by hardware-specific PTE replication constraints. Running experiments to empirically justify choosing reservation-based superpage allocation over relocation-based allocation would have made the evaluation section stronger.

5. Confusion
The contiguity-aware page daemon's principle of considering cached pages as available for reservations could have been explained in more detail.

Summary
The paper presents a general and transparent superpage management system that incurs negligible overhead even in pathological situations. The system is implemented in FreeBSD on the Alpha architecture and is evaluated on a range of real workloads and benchmarks. The performance benefits range from 30% to 60% in several cases and are sustained even under stressful workload conditions.
The Problem
TLB coverage has not been able to keep up with the exponential increase in main memory size; in fact, it covers only a tiny fraction of physical memory (usually less than a megabyte). Relative TLB coverage decreased by roughly a factor of 100 over ten years, which results in severe performance degradation for applications with large working sets. Moreover, modern machines are equipped with on-board physically addressed caches that are larger than the TLB coverage, which makes TLB misses even more expensive. To combat this problem, most general-purpose CPUs provide support for superpages. However, superpages have to be used with caution, as they can lead to problems like increased application footprint, heightened physical memory requirements, more paging traffic, and memory fragmentation, which could outweigh the advantages. On top of this, TLB hardware design imposes additional constraints. Effective management of superpages is hence a complex, multi-dimensional optimization task. The paper analyses these issues and presents a practical design for superpage management.
Contributions
1. It extends the previously proposed reservation-based allocation to accommodate multiple superpage sizes, including very large superpages. The preferred superpage size is the maximum superpage size that can be selected for a memory object. During periods of scarce free physical memory or excessive fragmentation, reserved but unused frames are preempted.
2. The paper describes a novel page daemon that uses contiguity-aware page replacement for fragmentation control under persistent memory pressure. This helps sustain the contiguity needed to build superpages of multiple sizes.
3. It provides effective solutions for the various aspects of superpage management:
Incremental promotions - a superpage is created when a region of any supported size becomes fully populated within a reservation; a smaller superpage is then promoted to the next larger size (see the sketch after this list).
Speculative demotions - the superpage containing a base page targeted for eviction by the page daemon is demoted to the next smaller size.
Eviction of dirty pages - clean superpages are demoted whenever a write attempt is made; re-promotion happens later if all base pages are dirtied.
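As a rough, hypothetical illustration of the incremental promotion mentioned above (the structure names, the 512-page extent, and the promotion check are my own simplifications, not the paper's code):

    #include <stdbool.h>

    /* Supported sizes in base pages (Alpha: 8KB, 64KB, 512KB, 4MB). */
    static const int sp_sizes[] = { 1, 8, 64, 512 };
    #define NSIZES ((int)(sizeof(sp_sizes) / sizeof(sp_sizes[0])))

    /* Toy reservation: which base pages of a 512-page extent are populated,
     * and the size (index into sp_sizes) currently mapped as a superpage. */
    struct reservation {
        bool populated[512];
        int  current_size_idx;
    };

    /* Call after marking r->populated[page] = true on a fault: promote to the
     * next larger size as soon as the aligned region of that size containing
     * 'page' is fully populated, and keep climbing while regions stay full. */
    static void try_incremental_promotion(struct reservation *r, int page)
    {
        for (int i = r->current_size_idx + 1; i < NSIZES; i++) {
            int span  = sp_sizes[i];
            int start = (page / span) * span;        /* aligned region start */
            for (int p = start; p < start + span; p++)
                if (!r->populated[p])
                    return;                          /* not yet full: stop   */
            r->current_size_idx = i;                 /* promote one level    */
        }
    }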
Evaluation
The evaluation section is quite comprehensive. A series of extensive experiments has been carried out on as many as 11 different benchmarks as well as a real workload. The first set of experiments presents the best-case speedups obtained when free memory is plentiful and non-fragmented. The paper accounts for each observation; for example, it argues that the web server benchmark shows little gain in performance because it accesses a large number of small files, and that a possible solution is building superpages that span multiple memory objects. I really like the fact that the paper checks the effectiveness of the proposed contiguity restoration policies via two sets of experiments: the first fragments the system by running the web server and then executes FFTW sequentially four times, while the second runs the web server concurrently with a contiguity-seeking application. The paper measures an empirical contiguity metric as a function of time to monitor the contiguity available in the system. A comparative performance analysis of fixed-size versus multiple superpage sizes is also provided, which I feel is a crucial part of the evaluation. Another aspect I really appreciated is the analysis of worst-case overheads for the different features of the superpage system, such as incremental promotion and preemption; I have not seen such an analysis in other papers and found it an innovative way of measuring overhead. The design choice of demoting clean superpages on a write is also justified. Finally, the paper provides a discussion on scaling the superpage size. However, I would have liked to see a performance analysis of the radix-tree implementation of the population map, and how performance would change across architectures with differing hardware TLB support for superpages.
Confusion
What h/w changes are required by the TLB to support a mixture of superpages of various sizes?


1. Summary: This paper talks about providing transparent superpaging support in commercial operating systems, which helps increase effective TLB coverage, reduce critical-path TLB misses, and improve system performance. The authors present their design and test it on a range of workloads.
2. Problem: TLBs are kept small and fully associative to keep their access time low. As a result, TLB reach has not increased much, even though physical memory and cache sizes have, so the relative impact of TLB misses is on the rise. The existing strategies were: a) reservation-based, which supported a single superpage size; b) relocation-based, which suffered from copying costs. The contemporary OSs of the time, IRIX and HP-UX, followed an eager promotion policy and required the user to specify a hint, so they were not transparent. The authors aim to provide a transparent implementation that can handle multiple superpage sizes, without copying costs, and with improved efficiency.
3. Contribution: In their effort to provide a practical and transparent implementation, the authors introduce new implementation strategies. They identify memory objects in the virtual address space of a process and make superpage-aware allocations using a buddy allocator, with the maximum superpage size determined by the type of the memory object; in this way they add support for multi-sized superpages. They implement a contiguity-aware page replacement policy to retain maximum contiguity and use multi-list reservations to keep track of sparsely filled reservations. They also add incremental promotion/demotion to efficiently utilize resources and adapt to the needs of the application. To solve the problem of partially dirty superpages and the increased I/O cost they cause, they provide a flexible demotion policy, aided by the incremental approach discussed before. They also overhaul the existing page table structure and propose a radix-tree-structured population map (which resembles a multi-level page table) for each memory object, then use its counters to make promotion decisions.
4. Evaluation: I would like to point out three areas where the authors fall short in evaluation. First, they describe their idea of a population map per memory object implemented as a radix tree, but do not evaluate it; instead, they use Alpha's existing page table structures with PTE replication and credit slowdowns in some cases to this limitation. This could be interesting because traversing the radix tree and computing lookups through the hash table may require multiple memory transactions, and maintaining a tree per memory object may incur extra overhead. Second, the authors run their benchmarks with "enough free memory" to eliminate the gains of page coloring used in traditional OSs. It would be interesting to see the effects of cache conflicts, since the authors use physical memory expansively in their effort to provide contiguity, which could result in more cache conflicts. Third, they evaluate only on the Alpha architecture; it would be interesting to see results on hardware or simulators with pre-existing support for multi-sized superpages. Apart from this, the authors provide a detailed evaluation of most of the policies and designs implemented. They show that their support for multiple superpage sizes helps applications adapt and perform as well as if they were running on a system with their best single superpage size, and in some cases even better. They also show the effectiveness of their contiguity-aware replacement policy, which provides benefits in the long term; this is important, as it is one of the new features they introduce. They also design workloads to test their policy of immediate demotion on a write. Finally, they show that their implementation reduces TLB misses for applications that access contiguous pages, and this analysis shows which workloads will benefit more.
5. Confusion: Why is there a radix tree per memory object? The authors mention that current superpages map contiguous virtual memory to contiguous physical memory; how can the mappings be discontiguous on one side? Even with no zero-page hints for the buddy allocator, why is the Mesa workload so slow? How are multi-sized superpages implemented in modern OSs?

Summary: This paper proposes a new superpage management system that gives substantial performance benefits, often exceeding 30%. The system can be easily integrated into existing general-purpose operating systems. The performance benefits are sustained even under complex workload conditions and memory pressure, and the overheads of the implementation are very small.

Problem: Most general-purpose processors support memory pages of large sizes, called superpages. Superpages enable each TLB entry to map a large physical memory region, which dramatically increases TLB coverage, reduces TLB misses, and improves performance. However, they pose challenges to the operating system in terms of superpage allocation, promotion tradeoffs, fragmentation control, etc.

Contribution : In my opinion the primary contributions of this implementation are:
i) Variable-size superpages: previous work on superpages has mostly dealt with a single superpage size. Supporting variable sizes lets the system adapt to a larger variety of applications, controlling fragmentation and contiguous memory allocation according to application requirements.
ii) Making the page replacement policy itself aware of memory contiguity: this is strikingly different and amortizes the cost of page replacement to actually restore memory contiguity; it has also been shown effective even under concurrent execution of contiguity-seeking and contiguity-breaking applications.
iii) The concept of reserving a large chunk initially and allocating and promoting incrementally. This ensures that allocation happens contiguously, rather than creating space as and when required.

Evaluation :
This design was implemented on FreeBSD-4.3 kernel as a loadable module. Compaq XP-1000 machine (with Alpha 21264 processor at 500MHz , 512MB RAM, fully associative TLB with 128 entries for data and 128 for instructions) was used. The evaluation was done on a varied range of Benchmarks.
1) When adequate contiguous memory regions are provided and free memory is plentiful and non-fragmented: TLB miss reduction is close to 100%, and 10 of 35 benchmarks show greater than 25% improvement. However, this is an idealistic case, since when multiple processes run on a system there is always contention for free physical memory. More importantly, it gives an understanding of which applications will show little or no improvement with this implementation of superpages, for example:
• Web servers, because they access a large number of small files. This brings out the limitation that the system never attempts to build superpages that span multiple memory objects.
• Mesa shows a performance degradation because the allocator does not differentiate zeroed-out pages from other free pages.
However, at the other extreme are cases like FFTW, which forms as many as 60 superpages.

2) This paper evaluates the superpage implementation for a lot of different cases :
i) The benefit of multiple superpage sizes is evaluated by changing the system to support only one superpage size at a time, for each of 64KB, 512KB and 4MB. This is important because it is one of the novel features compared to older research, and the results show that the best size is application dependent; letting the system choose the page size gives higher performance, which is an important observation.
ii) Maintaining contiguous memory regions is one of the key requirements for good performance of this system, and the system actively restores contiguity. This is evaluated by:
• Sequential execution: first fragmenting memory (using the web server workload) and then restoring it (using FFTW). The daemon is able to restore 20 of the 60 requested superpages.
• Concurrent execution: the web server run concurrently with a contiguity-seeking application. The daemon is again able to fulfill roughly 30% of superpage requests.
3) Lastly, it also evaluates the overall overhead in practice and shows that it is only about 1%.

The benefits of superpages and the key distinguishing features (multiple superpage sizes and contiguity maintenance) have both been evaluated over a wide set of workloads.
Why not consider simply moving pages (by changing page table entries) to restore contiguity periodically? My intuition says page moving would be costly, but it is one more alternative they could have evaluated, since this approach will occur intuitively to many readers.

Confusion :
Seems too good an approach. I cannot think of any reason why this should not be implemented. Is this idea commercialized?

Summary
This paper focuses on the design and implementation of a superpage management system and also provides an evaluation of the effectiveness of their design.

Problem
With the increase of main memory in systems, there arose a need to increase TLB coverage. Keeping the TLB size constant, using large pages, i.e. superpages, was a way to increase TLB coverage: each TLB entry then maps a larger region of memory, increasing its coverage. The existing techniques for supporting superpages came with issues and tradeoffs within the operating system, and this paper aims to solve them with its design.

Contributions
1. Use of multiple superpage sizes: when a page fault occurs, instead of providing a single new page frame, the system reserves a superpage of a preferred size, decided by a superpage size policy. A reservation-based allocation is made in which contiguous page frames corresponding to the chosen superpage size are set aside.
2. Incremental promotions and speculative demotions: a superpage is created when the reserved superpage-sized region of a memory object becomes fully populated. The conversion of base pages into a superpage starts from the smallest superpage size, so promotion is incremental. Similarly, if a page is chosen (speculatively) to be evicted from a superpage, the superpage is demoted to the next smaller superpage size.
3. Paging out dirty superpages: when a superpage is marked dirty, there is no record of exactly which base pages were dirtied. For a large superpage this can cause I/O overhead, since unmodified pages would be written out too. Hence, clean superpages are demoted to smaller sizes whenever a process writes to them, and re-promoted once all base pages are dirtied. The paper also suggests a possible use of hashing to detect dirtied base pages within a superpage.
4. The paper introduces several data structures to implement the above policies: a buddy allocator to manage physical memory as contiguous regions of different sizes, multi-list reservation lists to keep track of partially used reservations (a sketch of the preemption lookup follows this list), and a population map to keep track of the allocated memory in each memory object.
5. Contiguity-aware page daemon: allocating and reserving memory in contiguous chunks leads to fragmentation; the page daemon helps coalesce fragmented memory and performs contiguity-aware page replacement.
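Here is a hypothetical C sketch of the multi-list reservation lookup mentioned in item 4; the list count, field names, and the tail-is-LRU convention are assumptions for illustration, not the paper's data structure.

    #include <stddef.h>

    #define NLISTS 3   /* one list per superpage size below the largest */

    /* A partially populated reservation; it sits on the list indexed by the
     * size of its largest remaining unallocated extent. */
    struct reservation {
        struct reservation *next;   /* list order approximates recency of use */
        unsigned long base_frame;
    };

    static struct reservation *lists[NLISTS];   /* heads = most recently used */

    /* Under memory pressure, preempt a reservation whose free extent is large
     * enough for the request, taking the least recently used one (the tail). */
    static struct reservation *pick_preemption_victim(int needed_list)
    {
        for (int i = needed_list; i < NLISTS; i++) {
            struct reservation *r = lists[i];
            if (r == NULL)
                continue;
            while (r->next != NULL)     /* walk to the LRU end of the list */
                r = r->next;
            return r;
        }
        return NULL;                    /* nothing suitable to preempt */
    }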

Evaluation
The paper provides a detailed evaluation of all its newly introduced design techniques, testing each of them in best-case and worst-case scenarios. A set of workloads with varying memory usage is used to demonstrate the effectiveness of the design. A good speedup and a reduction in TLB miss rate are observed for almost all benchmarks with multiple superpage sizes as compared to a single superpage size. The paging daemon is evaluated against the use of the cache for fragmentation control, and more contiguous memory is found to be available with the paging daemon over time. The paper also evaluates a worst case for each of its superpaging policies, such as promotions/demotions and preemption of reservations, along with the total overhead of all these policies. The authors mention a degradation in performance in some cases but attribute most of it to hardware limitations of the Alpha systems used for the evaluation, while the rest show negligible performance downgrade. However, I feel these are not really justified, as the actual design does not match the platform used for evaluation. The paper also does not account for the memory or compute overheads of data structures like population maps and their associated hash tables. Since the performance degradations were mostly blamed on the hardware implementation, it would have been good had the authors presented a comparison of the existing hardware implementation (use of PTEs) with their software-based approach of using population maps.

Confusion
A hash table is used to locate population maps. Is this hash table global or local to a process?

1. Summary
This paper describes the use of multiple sized large pages to manage Virtual Memory transparently. It explains various tradeoffs while allocating superpages to achieve high performance for real workloads. It also presents a detailed experimental evaluation to demonstrate the merits of the approach.

2. Problem
The ratio of TLB coverage to main memory size has been decreasing dramatically over the past few years. The working set of an application exceeds the usual TLB coverage, which results in frequent TLB misses. These misses are especially wasteful given that the data is often already present in the cache. The solution to this problem is superpages. The implementation of superpages is constrained by hardware restrictions such as virtual and physical address contiguity, page sizes being multiples of the sizes supported by the processor, alignment of the starting address of a page, and a single reference bit per large page in the TLB. These restrictions lead to issues that are difficult to deal with: use of large pages can result in enlarged application footprints, leading to increased physical memory requirements and higher paging (I/O) traffic, and a scheme with multiple page sizes leads to fragmentation issues and complexity in finding the right policy to assign page sizes to different applications.

3. Contribution
The major contribution of this paper is identifying the urgency of improving TLB coverage and developing mechanisms and policies to support multiple superpage sizes while reducing the overheads and issues that such a system can lead to. They use a reservation-based allocation scheme that kicks in on a page fault; the reservation size depends on whether the region belongs to the code segment, data segment, stack or heap. Available physical memory is classified into contiguous regions of different sizes, which are looked up to get a region of the required size. To control fragmentation they perform contiguity-aware replacement and coalescing, and they can also preempt reservations to handle fragmentation or memory scarcity. As the memory footprint of an application grows, incremental promotions are performed to increase the page size. To ensure that superpage allocation is not overly aggressive, speculative demotions are performed so that unused base pages return to the free list. To handle the I/O traffic of large superpages, clean superpages are demoted whenever an application writes to them and re-promoted later when all the sub-pages are dirty. For fast lookup and overlap avoidance, they use a radix-tree structure whose levels correspond to the page sizes possible in the system; this structure makes the implementation of promotion- and demotion-related mechanisms faster and easier.

4. Evaluation
They implemented their design in the FreeBSD-4.3 kernel and used a Compaq Alpha machine to evaluate the performance. Workloads varied from a web server, image processing, linking, and SPEC benchmarks to matrix multiplication. They evaluate the best-case performance of applications with multiple large page sizes and see that such a system can improve performance by an average of 30 percent and reduce TLB misses to almost zero. The evaluation also includes running applications for long periods of time, concurrent execution, evaluation of demoting superpages on write, pathological workload analysis, and targeted experiments to measure overheads. However, the superpage policy is not effective for applications that have a large memory footprint but access a large number of small files, as the system does not try to build a superpage that spans multiple memory objects. It would have been interesting to observe the performance of such a system on a real-world server.

5. Questions
1) Application footprint has grown by orders of magnitude in the present era. How effective would these mechanisms be for very large page sizes to better suit the current needs?
2) I was not able to understand the performance degradation of the Mesa workload.

1. Summary
This paper seeks to address the performance degradation caused by application working sets outgrowing TLB coverage. It presents and evaluates a virtual memory management system which can handle multiple page sizes as supported by the hardware. The mechanism presented is transparent to the applications and effectively tackles challenges required to support memory pages of large sizes called superpages.

2. Problem
TLB coverage had been decreasing relative to main memory size, resulting in increasing TLB misses that negatively affected application performance. One option to tackle this problem was to increase the size of the base pages used by the operating system, but this approach has drawbacks caused by increased internal fragmentation leading to premature memory pressure. Some earlier research suggested using partial-subblock TLBs, which allow superpage entries with holes, but this approach supported pages only moderately larger than the base pages, and it was not clear how to support multiple superpage sizes. Contemporary operating systems like IRIX and HP-UX supported eager superpage creation at page-fault time based on the memory available at allocation time and a user hint. This approach required experimentation to determine the optimal page size for the various segments of an application, and a suboptimal setting was known to lower application performance.

3. Contributions
This paper proposes and tackles the issues of superpage demotion and eviction, which were not addressed by any previous work. Another contribution is a biased page replacement policy that can regain contiguity of fragmented physical memory without relocating pages in most cases. The design uses reservation-based allocation for superpages. The policy differs between fixed-size memory objects such as code segments and memory-mapped files and dynamically sized objects such as the heap and stack: for fixed-size objects the policy is to reserve the largest superpage that does not reach beyond the end of the object, while for dynamic objects, which can grow one page at a time, it is to reserve the largest superpage that does not overlap existing reservations or allocations. The reserved frames are kept in a reservation list until they are accessed, and reservations can be preempted under memory pressure to serve other memory requests. The design allows smaller superpages to be promoted to larger ones when a region of the next larger superpage size becomes fully populated, and allows superpage demotion to limit the amount of I/O when a base page of a superpage is evicted. Superpage demotion can also be triggered speculatively under memory pressure to determine whether the superpage is actively used in its entirety. The design also proposes a multi-list reservation scheme and a population map to support efficient lookups.

4. Evaluation
The paper presents an extensive analysis of the new mechanism under best-case scenarios, when there is enough memory available for both the modified and unmodified systems to benefit from page coloring. From this experiment the authors observe that 18 of the 35 benchmarks show over 5% improvement, while 10 show improvements over 25%. One application, Mesa, shows performance degradation because the allocator does not distinguish between zeroed-out pages and other free pages. This can be a huge problem if applications allocate a large chunk of memory and zero it out, because instead of mapping each zeroed virtual page to a single special zero page in physical memory, the allocator will allocate several physical pages. Though the performance impact is negligible when memory is plentiful, this behavior can quickly lead to memory pressure and extra I/O, reducing performance in real-world systems, as has already been observed (and documented) with Linux's Transparent Huge Pages implementation. Another cause for concern is that the system does not attempt to build superpages from memory that spans multiple objects; this will not lead to suitable performance gains in applications that do a large number of small memory allocations, such as web servers, as observed by the authors. The authors also run some sustained long runs and concurrent runs that show the benefit of their paging daemon in a controlled environment, but I think the analysis would have been more comprehensive if they had shown the benefits of their implementation on a real-world server.

5. Confusion
I am confused as to how multi-list reservation scheme and the population map interact with each other.

1. Summary
In this paper, the authors propose policies and mechanisms to take full advantage of hardware support for superpages, while making it completely transparent to applications.
2. Problem
Even though the size of main memory continues to increase drastically, the TLB reach has not scaled at the same speed. As a result, many programs have poor performance because their working set is not covered by the TLB, causing expensive TLB misses. Hardware vendors have added the option of superpages, where a TLB entry can map a page bigger than the base page size (e.g. a 4MB superpage compared to a 4KB base page). Through this feature, applications can potentially lower their TLB miss rate.
3. Contributions
In order to take advantage of the hardware superpage feature, the OS needs to be aware of it and exploit it fully. For that reason, the authors propose policies and mechanisms that help many applications benefit transparently from superpages. The design includes a policy of choosing a preferred superpage size and reserving that much space on the first allocation of a base page. Through this reservation-based allocation, the OS bets that the application will end up populating the rest of the contiguous memory reserved for it, making the reservation worthwhile. This works because if an application does not end up using the entire guessed superpage size, the unused base pages can be freed and used elsewhere.
The design includes incremental promotions to avoid internal fragmentation and speculative demotions in case of memory pressure and the need to swap out. In addition, the system demotes clean superpages on a write so it can avoid writing out an entire superpage to disk later.
To make this happen, the system needs an eager memory handler, which demotes or moves pages around when contiguous memory becomes scarce. The system includes a buddy allocator and a multi-list reservation structure to quickly find enough contiguous space for an allocation, and a population map to keep track of the base pages allocated within a memory object.
4. Evaluation
The authors did thorough experiments over many situations and showed how their system performs in each of them. They show the performance of the benchmarks when only one superpage size is used, when multiple page sizes are used, when workloads run for a long period of time, and when workloads act as worst cases. These are great ways to evaluate such a design proposal, because benchmarks can perform differently across the various experiments, so doing all of them shows where each benchmark does better. In particular, using more than 10 benchmarks helps cover various workloads. I found the long-running experiment most interesting because papers usually do not perform such an experiment, and the way the authors did it was controlled, so it gave precise results: a 55% speedup in the FFTW benchmark's performance using their custom page daemon.
5. Confusion
Could we go over the structure of TLBs created for superpages? Are there designs to allow dynamic number of various superpage sizes?

1. Summary
Superpages map large physical memory regions into the virtual address space and drastically reduce TLB misses. Implementing them, however, is a complex and multidimensional task, which the authors of this paper have accomplished. Their FreeBSD implementation demonstrates this across 11 workloads that stress the system and shows how effective superpages are at improving performance.

2. Problem
Allocating superpages is a challenge, and managing them leads to fragmentation of the memory space. Early solutions used a reservation-based system that immediately promoted base pages to a superpage, where a suboptimal setting led to lower performance. Relocation-based approaches use a software-managed TLB to keep a counter updated by the TLB miss handler, and when it exceeds a threshold, the pages are relocated in memory to form a superpage; this has drawbacks at multiple levels, robustness to fragmentation being the only plus point. The operating system also cannot distinguish which base page caused the dirty bit of a superpage to be set, and as a result a high I/O penalty is paid by flushing out the entire superpage. Hardware support in the form of partial-subblock TLBs, wherein holes are allowed in superpages, and address-remapping memory controllers, which introduce another level of address translation to allow non-adjacent pages to be promoted to superpages, have not been commercially successful.

3. Contributions
The authors propose a hybrid solution that makes use of a buddy allocator, a multi-list reservation scheme and a population map. It provides scalability via multiple, large superpage sizes, contiguity without compaction, and efficient I/O for partially modified superpages. The buddy allocator looks at the size of the memory object of the faulting process and allocates a reservation of the relevant superpage size; it uses coalescing for fragmentation control. The multi-list reservation scheme divides reservations among lists: it returns free pages to the buddy allocator, keeps partially populated reservations on the appropriate list, and removes completely filled reservations from the lists. The population map helps map virtual pages to physical frames at page-fault time and maintains the reservation information; it detects and avoids overlapping regions, keeps track of unallocated contiguous memory regions, and thereby helps with page promotion as well. Demoting a clean superpage when it is first written and re-promoting it only when all the base pages are dirty helps increase I/O efficiency.

4. Evaluations
Detailed analysis has been done with multiple workloads, evaluating the speedups and the percentage reduction in TLB misses with various individual superpage sizes and with all of them combined; drastic performance benefits of 30-60% were observed. The speculative demotion of superpages to recursively find the dirty base pages helped a lot in reducing I/O volume. I like the fact that they target the smaller superpage first for eviction when memory pressure arises, since flushing out and re-promoting larger superpages would be more expensive. Presuming that applications allocate the majority of a memory object in the early stages of execution, as the preferred superpage size policy does, can turn out to be inefficient when a process periodically grows its memory objects. What I did not appreciate is that there is no provision for combining multiple small memory objects to build a superpage, which would help workload performance; this could be accommodated as an extension to the population map and would considerably increase efficiency for many distributed-systems tasks.

5. Questions
What is the cost of the buddy allocator differentiating zeroed-out pages from other free pages?

1. Summary
This paper describes a model of transparent superpage support by the operating system, thus allowing for increased TLB coverage and fewer TLB misses. It describes a system for promotion and demotion of superpages as well as ameliorating some inherent problems such as fragmentation and wired page discontinuity. The authors then implement it in FreeBSD on the Alpha CPU and evaluate their system against the ideal “best case” scenario with a series of workloads.

2. Problem
While memory size has grown dramatically over the past several decades, TLB coverage has not: at the time of this paper, coverage was generally about 1MB or less, which can cause many TLB misses and performance penalties. Most modern CPUs provide support for something called a superpage, or a memory page of larger size that still takes up only one TLB entry. If used improperly, though, superpages can actually increase I/O costs and fragmentation. A management system is needed to balance the various tradeoffs.

3. Contributions
They address the five main issues involved with superpages: allocation, promotion, demotion, fragmentation, and eviction. Previous approaches had either been hardware-based or one of two paradigms: reservation-based or relocation-based. Reservation-based schemes reserve regions at page fault time that can be preempted to regain free space; these regions are then promoted when the number of frames in use reaches a certain threshold. Relocation-based schemes create superpages by copying sufficient frames to a contiguous region.
Here, the authors choose to improve on the reservation-based scheme. They determine a preferred superpage size, and the buddy allocator gives them enough page frames for that size, which are then tentatively reserved; a specific policy determines the preferred size based on the attributes of the memory object. Pages are promoted incrementally whenever a reserved region is fully populated, all the way up to the maximum superpage size reserved. They are demoted when base pages are evicted, or on a speculative basis according to a preset probability. Clean superpages are demoted into base pages on a write, to avoid having to page out the entire block of memory later.
There are several overall improvements. First, they move to a solely OS-based approach, with the only hardware requirement being support for the page sizes themselves. Second, they come up with the contiguity-aware page daemon, which as far as I can tell is an algorithm that aggressively tries to free cached/unused page space. Third, they also go into more detail on demotion of superpages. In [6] of their sources, for instance, demotion is mentioned only briefly, and mechanics such as speculative demotion and demotion upon dirtying of base pages are not mentioned at all.
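My reading of that daemon, expressed as a hedged C sketch: the predicate, the free-list hook, and the single-pass structure are guesses at the general shape of a contiguity-seeking reclaim pass, not the actual FreeBSD code.

    #include <stdbool.h>
    #include <stddef.h>

    struct page {
        struct page *next;        /* inactive-list linkage            */
        bool clean;               /* clean pages can be freed cheaply */
        unsigned long frame;      /* physical frame number            */
    };

    /* Assumed predicate: would releasing this frame let the buddy allocator
     * coalesce it with already-free neighbours into a larger extent? */
    static bool restores_contiguity(unsigned long frame) { (void)frame; return false; }

    /* Assumed hook: hand the page back to the free lists (the buddy allocator
     * then coalesces it); the list node itself is not deallocated here. */
    static void move_to_free_list(struct page *p) { (void)p; }

    /* One pass of a contiguity-seeking daemon: walk the inactive list and
     * reclaim clean pages that help rebuild contiguous regions, stopping once
     * 'goal' pages have been freed. */
    static int daemon_pass(struct page *inactive, int goal)
    {
        int freed = 0;
        for (struct page *p = inactive; p != NULL && freed < goal; p = p->next) {
            if (p->clean && restores_contiguity(p->frame)) {
                move_to_free_list(p);
                freed++;
            }
        }
        /* A real daemon would fall back to other inactive pages, and finally
         * to laundering dirty ones, if the goal were still not met. */
        return freed;
    }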

4. Evaluation
They implemented their design on the Alpha, whose firmware supports superpages via PTE replication. First they tested workloads with non-fragmented and plentiful free memory; there are substantial benefits for each type of workload, especially those that access large areas of memory. From the experiments, it is also clear that multiple superpage sizes are the best approach, as different applications have different page-size needs.
As an extension to this, though, I would have liked to see tests at the other extreme of too many superpage sizes. They have tests where they lock the system to a single superpage size (4KB, 512KB, etc.), which reflects the situation on the i386, which only has 4KB and 4MB pages. They mention that the Itanium CPU has ten different sizes, though, as opposed to Alpha's four. I wonder if having too many superpage choices would add overhead in terms of memory footprint and processing.
Also, as far as I can tell, their workloads consist of single applications run one at a time? They don't give detail on how the workloads are specifically structured. Wouldn't the order/variety of a single workload affect the speed benefits of superpage use? I would like to see more detail on that to get a better picture of the benefits.

5. Confusion
How exactly does the contiguity-aware page daemon work? I read it but I'm still confused as to how the improvement is so drastic over the cache scheme. Also, are noncontiguous superpages possible without hardware support? I looked into some other papers but the one that I found dating from 2012 still requires hardware support to function.
