
Practical, transparent operating system support for superpages

Practical, transparent operating system support for superpages. J. Navarro, S. Iyer, P. Druschel, and A. Cox. In OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation, 2002. ACM.

Reviews due Thursday, 2/16

Comments

1. Summary

This paper addresses the issues of increased fragmentation and higher I/O traffic associated with supporting large pages, called superpages. It does this through a reservation-based scheme that reserves a contiguous region at page fault time, preemption of reservations under memory pressure, and fragmentation control through incremental promotion and demotion of superpages.

2. Problem
The number of TLB entries has grown very slowly compared to the exponential growth in main memory size and in the working set demands of current applications.

This causes processes with large working sets to incur a high performance penalty through TLB misses. To keep TLB access time low, there is a need to increase TLB coverage without increasing the TLB size. This can be achieved through superpages, which are an effective way to reduce TLB misses through increased TLB coverage.
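To make the coverage gap concrete, here is a rough, hypothetical calculation (the entry count and page sizes are illustrative, not figures from the paper):

    128-entry TLB x 8 KB base pages  =   1 MB of coverage
    128-entry TLB x 4 MB superpages  = 512 MB of coverage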

However, the use of superpages brings challenges of memory fragmentation and increased I/O demands. Page-relocation based schemes deal with TLB misses only reactively. The paper discusses techniques such as superpage allocation, promotion, demotion, and replacement (eviction) to support superpage management.

3. Contributions
a. The main contribution of this paper is reservation-based allocation.
This method reserves a contiguous region of memory at page fault time, which reduces the chance of not finding contiguous space for superpages later due to fragmentation.
Reservations can also be preempted; using a multi-list reservation scheme, the system preempts the reservation whose most recent allocation occurred least recently.

b. A population map keeps track of allocated base pages for each virtual memory object and supports efficient lookup during promotion and preemption.

c. A contiguity-aware paging daemon that restores contiguity and a buddy allocator that automatically coalesces contiguous free memory regions keep fragmentation under control.

d. Incremental promotions and demotions help the system determine the optimal page size for different applications and also reduce fragmentation.

e. Optimizations to improve performance such as demotion of clean superpages when a process attempts to write to them.

f. Wired page clustering to avoid scattering of kernel memory pages and contiguity-aware page daemon to increase available contiguity.

4. Evaluation
The authors provide thorough evaluations covering a wide range of workloads/benchmarks, including both best-case and adversarial-case scenarios.
They demonstrate the benefits of superpages, when there is plentiful free memory for superpage allocation.
The issue of fragmentation was studied under different memory behaviors, i.e., both sequential and concurrent execution, and the system is shown to work well in both cases.
An adversary application benchmark is then run to show the performance overheads associated with the population map data structures, and only a very small degradation is seen.

5. Confusion
a. Why do processors only support certain page sizes, as mentioned in the paper?
b. Working of population map was not very clear. Can you elaborate on this?

1. Summary
The paper provides mechanisms and policies for handling large pages in the operating system. The authors implement their proposed solution in FreeBSD and show that their approach can reduce the TLB miss rate by up to 99%.
2. Problem
With the increasing size of physical memory, the pressure on the TLB increases, causing reduced address space coverage and an increase in the miss rate, which carries a heavy performance penalty.
3. Contributions
The authors extend and combine previously developed ideas of reservation, relocation and hardware support for superpages to develop their solution. Their primary contribution, in my opinion, is the ability to handle multiple sizes for these superpages and mechanisms to transparently move to larger or smaller sizes as and when required. They term these mechanisms promotion and demotion. They also develop speculative and preemptive versions of these mechanisms. They also deal with issues related to what to do when a large page is dirty (whether to page out the whole thing or demote and page out the smaller pages). They also develop supporting data structures such as lists for the various superpage sizes and population maps (something that resembles an additional page table).
4. Evaluation
The authors try to provide an in-depth look at their implementation by testing on a real system with real hardware and an inclusive set of benchmarks for best-case and worst-case scenarios. In this section they show that their approach can lead to up to 60% performance benefits and up to a 99% reduction in the TLB miss rate. This all seems good in terms of impact on timing performance and overheads, but they do not talk about the memory impact/overheads of the additional data structures required by their approach.
5. Confusion
Where do the population maps reside? Are they per process or a global structure?
The authors talk about walking the page tables up to 6 times in the overhead analysis. This could lead to a huge performance overhead in a scenario where we are running inside a virtual machine environment and both guest and host are using superpage support; has anything been done since this paper to mitigate this?

1. Summary
This paper introduces a transparent mechanism to support multiple superpage sizes in the TLB. The system uses reservation-based allocation to allocate memory to each process, and it uses a variety of mechanisms and policies to manage superpage entries in the TLB.

2. Problem
The TLB stores common virtual-to-physical address mappings, serving as a cache for the page table. Memory sizes have been increasing, but TLB sizes must remain relatively small in order to keep access time low. To increase TLB coverage of memory, modern CPUs provide support for superpages, memory pages of a large size. Issues with the use of multiple page sizes include alignment in physical and virtual address spaces, physical memory fragmentation, and higher paging traffic. Furthermore, each superpage has only one reference bit, dirty bit, and set of protection attributes.

3. Contributions
The authors implement operating system support for superpages on FreeBSD with an Alpha 21264 processor, which supports 8KB, 64KB, 512KB, and 4MB superpages. However, their ideas work for any potential set of superpage sizes. The system performs reservation-based allocation, preferring to allocate a large superpage to a memory object upon creation rather than relocate memory objects to create a large superpage later. If part of a superpage is not used and available physical memory is scarce or fragmented, the operating system can preempt unused superpages within a larger superpage. Furthermore, the allocator performs coalescing of available memory to reduce fragmentation. A superpage is incrementally promoted if all smaller superpages are fully populated. The authors also demote a superpage recursively if a base page must be written to disk. A multi-list reservation scheme and population maps keep track of allocated pages. The authors made minor changes to the FreeBSD kernel to enable these new ideas.
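As a sketch of the incremental-promotion rule just described, the following simplified user-space illustration assumes Alpha-like page sizes; the names and layout are not the paper's code:

    #include <stdbool.h>
    #include <stddef.h>

    #define BASE_PAGE (8 * 1024UL)                 /* 8 KB base pages, Alpha-like */

    static const size_t sp_sizes[] = {             /* supported superpage sizes   */
        64 * 1024UL, 512 * 1024UL, 4 * 1024 * 1024UL
    };

    /* populated[i] is true once base page i of the reservation has been touched. */
    static bool extent_fully_populated(const bool *populated,
                                       size_t first, size_t npages)
    {
        for (size_t i = 0; i < npages; i++)
            if (!populated[first + i])
                return false;
        return true;
    }

    /* Returns the largest superpage size whose aligned extent around 'page' is
     * fully populated, or 0 if no promotion is possible yet.  Promotion is
     * incremental: a 64 KB extent can be promoted long before the 4 MB one. */
    static size_t promotion_size(const bool *populated, size_t page)
    {
        size_t best = 0;
        for (size_t s = 0; s < sizeof(sp_sizes) / sizeof(sp_sizes[0]); s++) {
            size_t npages = sp_sizes[s] / BASE_PAGE;
            size_t first  = (page / npages) * npages;   /* aligned extent start */
            if (!extent_fully_populated(populated, first, npages))
                break;                                  /* larger extents contain this one */
            best = sp_sizes[s];
        }
        return best;
    }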

4. Evaluation
The authors ran a variety of benchmarks to demonstrate that supporting superpages reduces the TLB miss rate and generally causes speedup. Certain applications have an ideal superpage size, but even these applications experience speedup if the system supports multiple superpage sizes. The paper did not have much description of the memory overhead of superpage support. For example, how much physical memory is used up if an adversarial application allocates very little memory, but forces the operating system to create large population maps?

5. Confusion
There were a lot of ideas in this paper. Which ideas in Section 4 existed already, and which ones were new to this system? Furthermore, why were both the multi-list reservation schemes and population maps necessary?

1. summary
The paper develops a general and transparent superpage management system, which uses superpages to increase TLB coverage and thus improve application performance. The authors follow a reservation-based superpage management strategy and propose ways to tackle several problems of prior reservation-based methods. Their superpage management system provides some nice properties, including incremental promotions, fragmentation control, speculative demotions, and paging out of dirty superpages, with relatively small overhead according to their evaluation. They also describe the trade-offs this kind of system has to address.

2. Problem
With main memory becoming very large while the TLB size remains relatively small, TLB coverage decreases, which results in performance degradation. Exploiting larger page sizes is a way to address the problem, but many issues must be solved to make this idea really work. Previous work may not handle fragmentation control well, or may not support demotion and dirty page eviction. One important problem is how to manage pages effectively given the loss of fine-grained usage information with superpages, e.g., a big superpage that is partly hot and partly cold, or flushing a small modified part of a big page.

3. Contributions
a). It develops a system that works well using the reservation-based superpage management model with low overhead.
b). It handles fragmentation of superpages, proposing a contiguity-aware page replacement algorithm to address it.
c). It makes several trade-offs and uses specific techniques that make this kind of system more practical, like reservation and preemption, incremental promotion, and speculative demotions.

4. Evaluation
They do a lot of benchmarking and evaluation from different perspectives, and it looks reasonable. They analyzed the best-case benefits of the system when memory is plentiful. They also used different workloads to stress the system. Moreover, they measured the overhead of the system in several pathological cases.
The experiments show that they achieve significant performance improvement with small overhead.

5. Confusion
a). In the multi-list reservation scheme, why are fully populated extents not reinserted into the reservation lists?
b). Will superpages hurt scaling?
c). The overhead still seems large (accessing the reservation lists, the population map, promotion, ...); how do they keep it small?

Summary: While large memory pages (superpages) provide many benefits such as increased TLB coverage, reduced TLB misses, and improved overall system performance, they also lead to challenges such as overheads related to managing their allocation and promotion, and increased fragmentation. This paper tries to build an effective management policy to address these concerns. The paper discusses a reservation-based allocation method, demotion and eviction of superpages, and a contiguity-aware page replacement policy to manage superpages and control fragmentation.

Problem: TLB size has not grown at the same pace as memory. Consequently, applications with large memory footprints suffer from TLB misses and degraded system performance. This motivates non-uniform memory page sizes and hence the concept of very large pages (superpages) of varying size. However, current control mechanisms such as a single reference and dirty bit force the OS to treat a superpage as one single page, causing problems like inefficient page usage tracking and redundant write-backs. Fixed-size superpages have their own issues, such as high paging overheads if the superpage size is too large and small TLB coverage if it is too small.

Contributions: 1. Introduces the concept of multiple sizes for superpages. While very large superpages are helpful for improved TLB coverage, the presence of medium/small superpages gives the OS greater control over paging and fragmentation. Superpages grow from small to large.
2. Uses a reservation-based approach. Base pages are reserved for superpages. A process is allocated a base page on a page fault; it is promoted to a superpage only if the process requires multiple pages and these pages combine to form a size reserved for a superpage.
3. Uses the idea of periodic demotion. To ensure all superpages are actively in use, it demotes superpages periodically. Demotion of a superpage is done to the next smaller superpage size instead of directly demoting it from a superpage to base pages.
4. Improves the page eviction policy. Instead of evicting a superpage entirely, it keeps demoting the superpage to the next smaller superpage until eviction demands are met.
5. Maintains multiple reservation lists. Partially populated reserved page frame extents are tracked to enable pre-emption under memory pressure.
6. Has a contiguity-aware page daemon to move pages from one state to another (inactive to cache) or page them out under memory pressure.

Evaluation: The design was evaluated on a system with a FreeBSD kernel and an Alpha 21264 processor. Various workloads such as an integer benchmark suite, non-blocked matrix transposition, image processing, the flourstones benchmark and others were used to test the performance of the system. The authors also compared the performance of the system with multiple superpage sizes versus a single size. In worst-case and average-case workload scenarios, the performance of this design was either similar with negligible overhead or marginally better, and in the best cases it was significantly better (up to 750%), attributed to the memory usage patterns of the applications. However, I believe the authors should have discussed more ways to address performance issues in applications like Mesa.


Confusion: What factors determine the period in case of periodic demotion? Also, is the period decided on a system level or process level?

1. Summary
Superpages provide various performance improvements by increasing TLB coverage and thus reducing expensive TLB misses. This paper talks about efficient management of superpages using reservation-based allocation. The authors implemented their design in FreeBSD on the Alpha CPU and obtained more than 30% performance gain.

2. Problem
Over the past few decades, the size of TLBs has increased at a much slower rate than the size of main memory. Thus, mappings for only a small fraction of pages can be stored in the TLB, leading to a large number of TLB misses and performance degradation. Superpages are large pages that can be used to increase the TLB reach without increasing its size. However, efficient management of superpages is extremely important to get the maximum performance gain; inappropriate use can lead to high paging traffic and internal fragmentation. Moreover, there are several overheads associated with superpages, like single protection bits for all the base pages, modification of page tables during promotion and demotion of superpages, and paging out dirty pages.

3. Contribution
The authors use a reservation-based allocation policy to avoid page relocation when a process needs more pages. When a page faults and is brought into memory, enough contiguous space is reserved so that pages requested in the future can be stored there. If no more contiguous space is left in memory, then some inactive reservations can be preempted and allocated to other processes. The buddy allocator performs coalescing at regular intervals to avoid fragmentation, and the page replacement daemon performs contiguity-aware page replacement. Promotions are done incrementally, and superpages can be demoted speculatively in case of memory pressure. They also consider hash digests to reduce disk I/O while moving a superpage from memory to disk.
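To illustrate the reservation-based fault path described above, here is a hedged sketch; the helper functions and the size fallback are assumptions for illustration, not FreeBSD code:

    #include <stddef.h>

    struct reservation {
        unsigned long first_frame;   /* start of the reserved contiguous extent */
        unsigned long base_vpn;      /* virtual page the extent is aligned to   */
        size_t        nframes;       /* extent length in base pages             */
    };

    /* Assumed helpers: a buddy-style frame allocator and per-object bookkeeping. */
    extern unsigned long alloc_contiguous(size_t nframes);       /* 0 on failure */
    extern struct reservation *find_reservation(void *obj, unsigned long vpn);
    extern struct reservation *make_reservation(void *obj, unsigned long base_vpn,
                                                unsigned long frame, size_t n);
    extern void map_base_page(unsigned long vpn, unsigned long frame);

    void handle_fault(void *obj, unsigned long vpn, size_t preferred_pages)
    {
        struct reservation *r = find_reservation(obj, vpn);

        if (r == NULL) {
            /* First fault in this region: try to reserve an extent of the
             * preferred superpage size, falling back to smaller sizes (a factor
             * of 8 apart on Alpha) when contiguous memory is unavailable.
             * A real system would preempt an older reservation before giving up. */
            size_t n = preferred_pages;
            unsigned long frame;
            for (;;) {
                frame = alloc_contiguous(n);
                if (frame != 0 || n == 1)
                    break;
                n /= 8;
            }
            r = make_reservation(obj, vpn - (vpn % n), frame, n);
        }

        /* Map only the faulting base page; promotion happens later, once an
         * aligned extent of the reservation is fully populated. */
        map_base_page(vpn, r->first_frame + (vpn - r->base_vpn));
    }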

4. Evaluation
The authors used several benchmarks and applications for evaluation, and most of them showed better performance. Most applications obtained around a 99% TLB miss reduction when a large amount of free, unfragmented memory was available. They also showed that the best superpage size is application dependent and that allowing applications to choose from various sizes provides better performance. They also created some adversary applications to measure the overhead of their implementation; even these applications show very little performance degradation, with an average of 1%.


5. Confusion
It looks like the overhead involved with most of the operations associated with the management of superpages is very high, like modification of PTEs during promotion and demotion, demotion of all pages if any one of them is dirty, and use of the population map. Is it really possible to get 30-60% performance gain under these conditions?

Summary:
TLB coverage can be increased significantly by supporting multiple page sizes, i.e., superpages. This paper identifies and overcomes several challenges in effective memory management to obtain the performance benefits of superpages.
Problem:
In most processors, the TLB coverage represents a very small portion of the physical memory. Applications commonly have large working sets and suffer many TLB misses. Many different works have used the idea of superpages to overcome this: reservation-based, relocation-based and hardware-based approaches. Still, there was a need for effective management of these superpages to optimize fragmentation control, promotion, demotion and eviction policies.
Contributions:
The main goal of this work is to extract performance by analyzing issues in the existing implementations of superpages and build a system which manages them effectively.
> This system builds on the idea of a reservation-based approach where the first page fault reserves a superpage of a size large enough to fit the entire memory object.
> Promotions are incremental, performed whenever the allocated pages reach the next available superpage size.
> Fragmentation control is achieved through a buddy allocator which coalesces memory when possible. The page replacement daemon is also modified to preserve memory contiguity.
> Unallocated portions of reservations are tracked using reservation lists; preemption chooses the reservation whose most recent allocation occurred least recently.
> Demotion is also incremental; a superpage is demoted to a size where the selected base page is no longer part of the superpage.
> Allocations are tracked using a per-memory-object population map, which helps in making allocation, promotion and preemption decisions.
Evaluation:
The design was implemented in FreeBSD on an Alpha CPU for evaluation. Various kinds of workloads were used to evaluate the system. A TLB miss reduction of 98%-99% was obtained for most of the test cases. Also, different applications showed requirements for different superpage sizes, with FFTW standing out with 60 superpages of size 4MB. Only one application, Mesa, showed a small performance degradation, and that because of the allocator rather than their design. They also evaluated their system on a variety of adversary applications, and none of them showed high performance degradation.
Confusion:
How is the desired superpage size selected for dynamically sized objects?
It is not clear how speculative demotion of an active superpage is implemented.

Superpages

1. Summary
RAM sizes have increased significantly compared to TLB sizes, and hence TLB coverage has shrunk relative to memory. Thus, the frequency of TLB misses has increased drastically, degrading the overall performance of applications. This paper proposes large pages to increase TLB coverage and hence system performance.
2. Problem
Processors support different page sizes at the same time, but operating systems may not utilize this feature. A memory page larger than an ordinary page is called a superpage. Providing superpage support in the OS poses some challenges, like internal fragmentation, alignment issues, additional I/O overhead, etc. This paper provides a solution that exploits the performance benefits of superpages while addressing these issues.
3. Contributions
This paper addresses a lot of issues caused by superpages. The paper proposes a reservation-based allocation strategy. This strategy doesn't have high relocation overhead but might lead to internal fragmentation. To control fragmentation, the paper proposes tweaking the allocation algorithm to consider the contiguity of pages in demand. As a result, the OS might face high contention for contiguous allocations; the OS might aggressively swap out inactive pages to solve this problem. Another improvement in this paper is the use of promotion and demotion of superpages. Superpages can be of multiple sizes. The OS can perform incremental promotion of a group of base pages to a superpage. This improves TLB coverage without internal fragmentation. Similarly, the OS can demote incrementally if it finds any inactive base pages in a superpage in case of memory pressure. Superpages have the problem of higher I/O overhead on eviction. The paper addresses this issue by maintaining a hash of each individual page in the given superpage; hashes can be used to detect dirty individual base pages, avoiding writing the entire superpage. The paper proposes two data structures, reservation lists and the population map, to assist the OS in page allocation. Multi-list reservation lists keep track of reserved pages sorted by allocation time; these lists are useful during preemption for superpage demotion or eviction. The population map is an indicator of empty spots in the allocated superpages in the system; the map is helpful for superpage promotion, allocation and preemption.
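The multi-list reservation idea mentioned above might look roughly like the following simplified sketch (one list per size class, heads being the least recently allocated reservations); this is an illustration with invented names, not the paper's implementation:

    #include <stddef.h>

    #define NCLASSES 3                       /* e.g. 64 KB, 512 KB and 4 MB classes */

    struct reservation {
        struct reservation *next;
        unsigned long last_alloc;            /* time of most recent base-page alloc */
        int class;                           /* list it currently lives on          */
    };

    static struct reservation *res_list[NCLASSES];  /* heads: least recently allocated */

    static void res_remove(struct reservation *r)
    {
        struct reservation **pp = &res_list[r->class];
        while (*pp && *pp != r)
            pp = &(*pp)->next;
        if (*pp)
            *pp = r->next;
    }

    static void res_append(struct reservation *r, int class)
    {
        struct reservation **pp = &res_list[class];
        while (*pp)
            pp = &(*pp)->next;
        r->class = class;
        r->next  = NULL;
        *pp = r;
    }

    /* After a base page is allocated from a reservation, move it to the tail of
     * the list matching its (possibly smaller) largest free extent, so the head
     * of every list is the reservation whose most recent allocation occurred
     * least recently -- the preferred preemption victim. */
    void reservation_touched(struct reservation *r, int new_class, unsigned long now)
    {
        r->last_alloc = now;
        res_remove(r);
        res_append(r, new_class);
    }

    /* Under memory pressure, preempt from the head of the smallest class that
     * can yield a free extent of at least the wanted size. */
    struct reservation *preempt(int wanted_class)
    {
        for (int c = wanted_class; c < NCLASSES; c++)
            if (res_list[c]) {
                struct reservation *victim = res_list[c];
                res_list[c] = victim->next;
                return victim;
            }
        return NULL;
    }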
4. Evaluation
The paper describes an extensive evaluation of the superpage design. The evaluation shows best-case speedups for common applications. The paper also describes the speedup gained by having multiple superpage sizes, and it contains an experiment demonstrating the overhead incurred by the page daemon and its performance implications.
5. Confusion
Why do we need to map a superpage at an address which is a multiple of its size?
Can you please explain the impact of multiple mappings of the same file in memory in this design?

Summary
The paper studies the overheads and challenges of supporting superpages, such as allocation and promotion trade-offs as well as fragmentation control. The paper then adopts a reservation-based system design approach and extends the design to support features including multiple superpage sizes and scalability to very large superpages.

Problem
In order to keep TLB lookup efficient, the TLB size cannot grow as memory size grows, resulting in more TLB misses. Superpages intend to increase TLB coverage, but using large superpages carelessly can result in enlarged application footprints.

Using super pages has other constraints and problems:
1. Size must be supported by processor
2. Required to be contiguous in both virtual and physical memory
3. Its starting address must be a multiple of its size
4. Only a single reference bit, dirty bit and set of protection attributes.

Contribution
According to the authors, the paper made four contributions:
1. Extends reservation-based design to support multiple, potentially large superpages
2. Investigate on superpage fragmentation
3. Novel contiguity aware page replacement algorithm
4. Tackles problems that have been overlooked, like eviction of dirty pages.

I think the paper’s design of superpage management also made good contribution to the problem, for example:
1. Fragmentation control: Coalescing + contiguity-aware page replacement
2. Incremental promotions: create superpage as soon as any superpage-sized and aligned extent within a reservation gets fully populated.
3. Demotions: incremental, and done speculatively to determine if the superpage is still being actively used in its entirety.
4. Page out dirty superpages: demote clean superpages whenever a process attempts to write into them (or using a hashing alternative)
5. Reservation list chooses the reservation at the head of the list to preempt.
6. Population map: use a radix tree structure to keep track of allocated base pages within each memory object.

Evaluation
I like how the paper runs comprehensive benchmarks on real-world systems and machines. However, I do have several concerns with the evaluation:
1. I wish the paper evaluated storage overheads more, especially for the population map, which I think can bring significant storage overheads.
2. I was hoping to see a comparison of the paper's implementation not only against base pages but also against other superpage implementations. Otherwise, it is harder to tell whether the speedup achieved comes from the paper's careful design of superpage management or simply from using superpages.
3. In Table 1 we see significant speedups for some of the benchmarks. However, that scenario assumes plenty of memory is available, which is not always the case in the real world. It would be helpful to also provide an evaluation when memory is under pressure.

Confusion
1. Why do processors only support certain page sizes, as mentioned in the paper? How does the page size affect the processor?
2. Do larger-sized reservations appear in multiple reservation lists?

Summary

The use of superpages improves TLB utilization by reducing TLB misses. The paper deals with the various memory management issues which arise upon introducing variable-sized superpages.

Problem

Main memory size has been steadily increasing over the years, keeping up with Moore's law. But this has led to a steady decline in TLB coverage, which in turn leads to a higher TLB miss rate. Increasing the page size solves this problem but introduces new problems like internal fragmentation. Though new hardware provides support for a variety of page sizes, the OS has to deal with hard challenges like fragmentation control to ensure contiguous extents of physical memory, promotion, and efficient paging out of large superpages. The paper proposes various policies to efficiently solve such issues.

Contributions

Reservation-based page allocation is not novel and was already published in the work by Talluri and Hill. However, this paper brings in the idea of dynamically selecting the superpage size based on the attributes of the memory object. Other new ideas proposed in this paper are:

1. Preemption of reserved unallocated frames during increased memory pressure.

2. Memory allocation/deallocation over a period of time can lead to excessive fragmentation. To address this, they use the buddy allocation algorithm to coalesce regions of memory into larger chunks, and they do this incrementally. Similarly, in case of eviction or writing a dirty superpage to disk, they recursively break it into smaller pages and check their hashes (which they optimized by computing them lazily), which again reduces unwanted I/O.

3. They address the issue of allocating memory under memory pressure by preempting existing reservations based on an LRU-like policy, which is a fair choice, but they did not cite any previous work or present statistical evidence.

4. A population map is used to keep track of allocated pages and help in the promotion and demotion of super pages.

5. Speculative superpage demotion strategies that reduce the cost of superpage eviction and optimize the disk I/O costs of partially modified superpages.

Evaluation

The authors have carried out an extensive evaluation of the proposed system to show that their design decisions do not negatively impact performance. They show that their system improves performance by giving details about the speedup obtained, along with TLB misses, for all the chosen benchmarks. They back up the claim of improving contiguity by executing the web and FFTW applications and comparing the daemon-based page replacement against the cache-based scheme, where the daemon performs better. They also measure the system against various pathological cases and show that the overheads are negligible. The evaluation could have been performed on a wider range of hardware, since hardware support for superpages is one of the issues the authors are trying to address.


Confusion

I have a few questions:

1. This is a basic question. More memory means more money spent, but what is the exact reason for it to slow down even in the case of random access memory? Why can't we increase the TLB size beyond a certain limit without affecting latency?

2. The buddy allocator has been referred to multiple times in the paper in different scenarios, but its actual purpose and implementation are not clear to me.

3. I would like the contiguity-aware page daemon to be discussed further.

1. Summary
Superpages are used to increase the TLB reach. Inefficient use of superpages and multiple page sizes leads to fragmentation problems. The paper describes policies to get the maximum performance out of using superpages while keeping the overhead of implementing them to a minimum.

2. Problem
A large number of TLB misses impacts the performance of a system significantly; hence, even though we have larger main memories and on-chip caches, TLB reach limits the performance of the system. Using superpages instead of regular pages increases the TLB reach and thus solves this problem.

3. Contributions
The policies proposed by the authors for the efficient use of superpages are as follows:
-> reservation based policy for allocating superpages.
-> the maximum superpage size that can be used in the memory object of the faulting page is chosen as the superpage size.
-> the buddy allocator coalesces memory regions into contiguous blocks of memory to control fragmentation.
-> when contiguously allocated pages reach the size of one of the superpages and are aligned, they are promoted to a superpage of that size.
-> in case of memory pressure, the system preempts a reservation that is not fully used, instead of refusing to allocate memory.
-> reservation lists of different sizes keep track of all the reservations whose unused part is equal to or less than the size of the corresponding reservation list.
-> each reservation list is sorted by time, so that reservations are preempted in LRU fashion.
-> the population map keeps track of allocated blocks in each superpage and thus helps in deciding which part of a superpage can be preempted in case of memory pressure.

4. Evaluation
The designed policies were implemented on top of FreeBSD-4.3 kernel and was run on the Alpha 21264 processor. Various benchmarks were run to test the proposed design policies.
In the best case, where there is enough memory for all allocations to succeed without having to preempt any reservation, TLB misses were negligible.
The use of multiple page sizes, fragmentation control, and the overheads introduced by their design decisions were also evaluated, showing that the proposed design gives improved performance with minimal overhead.

5. Confusion
Could you talk about a hybrid system which uses both relocation- and reservation-based allocation policies for allocating superpages?

1. Summary
Various general-purpose processors provide support for superpages. Superpages increase the TLB coverage, reduce the rate of TLB misses and hence promise to provide enhanced performance. But superpages come with their own set of problems regarding allocation, fragmentation control etc. The paper aims at providing a solution to these transparently while guaranteeing improved performance.
2. Problem
A superpage is a memory page which is larger than a normal base page. Using superpages, one can dramatically increase TLB coverage and reduce TLB misses, but realizing this benefit requires support for multiple page sizes. This leads to the problems of physical memory fragmentation, compromises between factors like contiguity and overhead, and the trade-offs of promotion and demotion. Since the hardware maintains a single dirty bit and a single reference bit per superpage, this also imposes overhead on the OS during demotion or flushing out to disk.
3. Contributions
The paper studies the issues associated with previously used superpage allocation approaches. Reservation-based allocation is enhanced to overcome many of these drawbacks. The paper introduces preferred superpage size policy to handle the requirements of fixed sized memory objects and dynamically sized memory objects satisfactorily. It introduces a pre-empting policy based on the observation that reservations which have not experienced recent allocations are less likely to be fully allocated soon.
The paper also evaluates an alternative approach to paging out dirty superpages using hashing. Since this incurs high overhead, it is considered for future work. One major contribution of the paper is the implementation of multiple reservation lists. Another is the introduction of the population map data structure to assist in allocation, promotion and pre-emption decisions and in maintaining contiguity.
The paper introduces some changes to FreeBSD’s components to implement an efficient superpage management mechanism. FreeBSD’s page daemon is changed to activate when contiguity falls low and includes some policy changes in the way active, inactive and cache pages are handled. It also implements wired page clustering to avoid the fragmentation of wired pages. The paper also overcomes the problem caused by multiple mappings.
4. Evaluation
A good portion of the paper provides a systematic evaluation of the new OS support for superpages. The authors consider a nearly complete range of benchmarks to evaluate their system, ranging from the best-case scenario to the worst cases. Their results also demonstrate that the best superpage size is application dependent and that allowing a choice of multiple superpage sizes gives higher performance. Not only do they consider well-behaved applications, they also run tests against 3 synthetic pathological workloads: incremental promotion overhead, pre-emption overhead and sequential overhead. The paper gives the performance degradation figures for these workloads and backs them with explanations. The paper also discusses support for scalability and the obstacles it might face.
5. Confusion
1. Can you please explain the concept of page daemon and the modifications made to it by this paper?
2. Can you please explain the solution to multiple mappings problem?
3. What does ‘holes’ in partial subblock TLB mean?

Summary
This paper provides an implementation of superpages to enable larger TLB reach and thus reduce TLB misses, handled transparently to applications. The authors discuss existing solutions and their drawbacks, and provide a practical solution to the small-TLB-reach problem.

Problem
There are two main problems that the authors are trying to solve.
1. TLB reach hasn't been increasing at the same rate as main memory, so it covers only a small fraction of main memory, leading to more TLB misses for applications (whose memory footprints have grown at the same rate as memories).
2. Computers now ship with physically addressed caches that are larger than the TLB coverage; meaning, some TLB misses actually correspond to data already present in the cache (making the cached data less useful).

Other problems include: simply increasing the system page size would not help much either, as it would cause plenty of internal fragmentation due to partly used pages. This would result in early memory pressure leading to higher I/O demands. The single reference and dirty bit makes it harder for the OS to determine which base page has been modified or is being accessed.

Contribution
This paper extends a previous reservation-based approach (by Talluri and Hill) to work with multiple superpage sizes (based on support by the hardware) and demonstrates its advantages. The authors propose a new page replacement algorithm which considers contiguity as a major factor while evicting pages. This algorithm controls fragmentation through superpage promotions and demotions.
In reservation-based allocation, a set of frames is preallocated for a memory object, and on page faults these frames are mapped into the process' page table. When many reservations have been made, memory pressure will cause some of these reservations to be preempted; from the set of candidates, the reservation whose pages were least recently allocated is picked. Promotions are done in an incremental fashion: as regions are populated by applications, sets of pages are incrementally promoted to the next larger page size. Demotions can happen during page replacement. They can also happen when the protection bits of a superpage are modified, because the base page whose protection was changed is part of the big superpage and there is only one set of bits for the entire superpage. Speculative demotions are also done to check whether the superpage is still actively used in its entirety. Clean superpages are demoted when one of their base pages is modified and are promoted again when all of the base pages are dirtied. Another solution to the dirty-bit problem is to compute hash values for each of the base pages and, while writing the page to disk, check the hash values for dirtiness (as expected, this is a costly operation). The authors use a multi-list reservation scheme where they maintain reservation lists to track reserved pages that are not fully populated. These lists are kept sorted so it is easy to find the least recently allocated reservations (for preemption). Population maps keep track of allocated base pages; they are trees in which the root node represents the largest superpage size and the leaves represent base pages. Population maps are basically used for reserved frame lookup, overlap avoidance, promotion decisions and preemption assistance.
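A simplified sketch of such a population-map-style radix tree is shown below; the field names are illustrative rather than the paper's, and the per-node counters are what make the promotion decision cheap:

    #include <stdlib.h>

    #define RADIX 8              /* Alpha superpage sizes grow by a factor of 8 */

    struct popmap_node {
        int some;                /* children with at least one populated page  */
        int full;                /* children that are completely populated     */
        struct popmap_node *child[RADIX];
    };

    /* Mark one base page populated.  Returns 1 if the extent covered by this
     * node has just become fully populated -- the promotion trigger. */
    static int popmap_set(struct popmap_node *n, unsigned long page, int level)
    {
        unsigned long span = 1;                  /* base pages per child at this level */
        for (int i = 1; i < level; i++)
            span *= RADIX;
        int slot = (int)((page / span) % RADIX);

        if (level == 1) {                        /* children are single base pages */
            if (n->child[slot])
                return 0;                        /* page already populated */
            n->child[slot] = calloc(1, sizeof *n->child[slot]);
            n->some++;
            n->full++;
            return n->full == RADIX;
        }

        if (!n->child[slot]) {
            n->child[slot] = calloc(1, sizeof *n->child[slot]);
            n->some++;
        }
        if (popmap_set(n->child[slot], page, level - 1)) {
            n->full++;
            return n->full == RADIX;             /* did this larger extent fill up? */
        }
        return 0;
    }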

Evaluation
The authors developed their design on an Alpha processor running FreeBSD. They chose various workloads, from integer and floating-point benchmarks to web, image processing, etc. They provide the speedup of each of these benchmarks in a best-case (stress-free) environment and under memory pressure. They clearly show that their implementation provides a decent speedup of 5% to 25%, and in some cases up to 750%. Through heavy workloads, the authors show that their approach is capable of sustaining contiguity even during memory pressure.

Confusion
Memory controllers providing another layer of translation (physical address to machine address, probably) seemed a more plausible method of coalescing free memory. Why didn't this approach pick up?

Why aren't these methods (superpages) used more widely in modern OSes?

Summary
The paper proposes the design of a superpage management system to improve TLB coverage without proportionally enlarging the TLB size. The paper introduces various ways to tackle the complexity and overhead associated with maintaining such a system. It proposes various novel memory management techniques to efficiently manage the contiguity and paging overhead problems that generally arise in a superpage-based system.

Problem
TLB coverage has increased at a much lower pace than main memory. This reduced TLB coverage severely impacts application performance, as it leads to a higher TLB miss rate. Superpages are the solution to increase TLB coverage without proportionally enlarging the TLB size. However, existing superpage implementations were inefficient in handling all the complexities associated with running such a system.

Contributions
This approach extends a previously proposed reservation-based approach to work with potentially very large superpage sizes. It proposes a novel contiguity-aware page replacement algorithm to control fragmentation. It tackles issues such as superpage demotion and eviction of dirty superpages. Some of the main techniques used are:
1. Reservation-based allocation of memory objects, with the set of contiguous frames for the reservation obtained from the buddy allocator.
2. Incremental promotions and speculative demotions of superpages to balance TLB hit rate against overhead.
3. A multi-list reservation scheme to assist in reservation preemption.
4. The population map, a radix-tree-based map used for efficient reserved frame lookup, overlap avoidance, promotion decisions and preemption assistance.

Evaluations
The authors evaluate their design by implementing it in the FreeBSD-4.3 kernel as a loadable module. The evaluation is divided into three categories. 1. The best-case scenario, when free memory is plentiful and non-fragmented: here the results show improvements of 5-25% on various benchmarks; only Mesa shows a negligible degradation, which the authors attribute to their allocator not differentiating zeroed-out pages from other free pages. 2. The issue of fragmentation is studied next for both the sequential and concurrent execution cases, in which the performance of the cache-based and daemon-based contiguity management techniques is compared. 3. Finally they evaluate specifically designed memory usage patterns that deny the benefits of superpages. The authors claim performance benefits of 30% to 60% in several cases and show that the system is robust even in pathological cases.

Confusions
The usage of population map for four distinct purposes is not quite clear. It would be useful if we can discuss this in detail in class.

1. Summary
Navarro et al. study how superpages can be the next natural step to increasing performance, given the limited growth of TLB reach compared to main memory.

2. Problem
TLB reach is a limited resource that has not grown nearly at the rate of main memory. Furthermore, research has shown that TLB performance plays a significant role in application performance. This group evaluates how superpages can be used to extend the TLB reach and thus improve performance.

3. Contribution
They implement a transparent solution to superpage management and discuss the aspects of a system required to make superpages successful. Overall, the approach is rather intuitive and borrows from the buddy allocator. The key decision is how to determine superpage size requirements; it is handled with a conservative immediate guess to avoid excess memory pressure, at the cost of possibly requiring multiple promotions. They also evaluate the performance of multiple page sizes as well as each page size individually, which provides justification for multiple superpage sizes.
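One plausible reading of the size-selection rule for fixed-size objects, as a hedged sketch (the sizes are Alpha-like, and the paper's policy for dynamically sized objects is more conservative and omitted here):

    #include <stddef.h>

    /* Supported page sizes, smallest (base page) first; illustrative values. */
    static const size_t sizes[] = { 8UL << 10, 64UL << 10, 512UL << 10, 4UL << 20 };
    #define NSIZES (sizeof(sizes) / sizeof(sizes[0]))

    /* Pick the largest supported superpage whose aligned extent around the
     * faulting offset still fits inside the object. */
    size_t preferred_size(size_t fault_off, size_t object_size)
    {
        size_t best = sizes[0];
        for (size_t i = 1; i < NSIZES; i++) {
            size_t start = fault_off - (fault_off % sizes[i]);  /* aligned start */
            if (start + sizes[i] <= object_size)
                best = sizes[i];
            else
                break;
        }
        return best;
    }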

4. Evaluation
They evaluate this system using several benchmark suites. It is fairly obvious that we'd see improved results with certain benchmarks and only marginal improvements (or degradation) with others, which they explain by memory access patterns or access to several small files. Overall, I'm pretty happy with the numbers they provided. They explored the case of adversarial workloads, which is nice to see, and clearly outline the limitations of their design and implementation. One thing they don't explain is the impact of the adversarial workload patterns running concurrently with other workloads: is one adversarial workload enough to degrade the performance of the entire platform?

5. Discussion
I think the most interesting note is the idea of reevaluating the page table and how to introduce superpages to it. This seems to be a path forward to eliminating a lot of their linear runtime costs like updating the page table with all entries of the superpage.

1. Summary
The paper describes methods to support multiple sizes of superpages in order to improve performance of application programs by decreasing TLB miss rates, and do so in a transparent manner.

2. Problem
Size of main memory and the size of working sets have been growing, but the sizes of fully associative TLBs have not due to the need for fast lookups. As a result, TLBs have been able to cover only a small portion of the available memory, and hence performance penalties have increased due to higher TLB miss rates. Superpages can cover more memory using existing TLB structures, but they can result in higher memory footprint of programs and increase memory pressure.

3. Contributions
The paper tries to minimize the disadvantages of superpages mentioned above by using all available sizes of superpages and using mechanisms to consolidate pages into larger superpages, while avoiding fragmentation. A reservation based scheme is used to allocate pages to avoid overheads of copying pages when forming a larger superpage in a contiguous memory region. For memory objects that take large amounts of space in the virtual address space, appropriately sized superpages are allocated, whereas for smaller objects, base pages are allocated. These base pages can then be incrementally promoted to larger superpages when nearby pages in a contiguous address range are also accessed. When free physical memory becomes scarce, reservations can be preempted to get the required amount of memory. The buddy allocator performs coalescing of available memory regions to avoid fragmentation. Superpages may be demoted in size when there is a need for page eviction. Demotion may be done speculatively to determine which of the base pages of the superpage are being used actively. Population map data structures keep track of reservations, and help in page promotion and preemption decisions.
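The coalescing mentioned above is standard buddy-allocator behavior; a minimal sketch, with assumed free-list helpers rather than the paper's code, looks like this:

    #include <stdbool.h>

    #define MAX_ORDER 10       /* largest block: 2^10 base pages (illustrative) */

    /* Assumed free-list bookkeeping; 'block' is a base-page frame number. */
    extern bool is_free(unsigned long block, int order);
    extern void free_list_remove(unsigned long block, int order);
    extern void free_list_add(unsigned long block, int order);

    /* Free a block of 2^order base pages and merge it with its buddy whenever
     * the buddy is also free, repeating at each larger order.  This is how
     * contiguous free regions large enough for superpages reappear over time. */
    void buddy_free(unsigned long block, int order)
    {
        while (order < MAX_ORDER) {
            unsigned long buddy = block ^ (1UL << order);  /* sibling of same size */
            if (!is_free(buddy, order))
                break;
            free_list_remove(buddy, order);
            if (buddy < block)
                block = buddy;           /* merged block starts at the lower half */
            order++;
        }
        free_list_add(block, order);
    }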

4. Evaluation
The evaluation mostly consists of synthetic benchmarks like SPEC and some pathological workloads. It is shown that TLB misses reduce drastically and some benchmarks gain significant performance benefits with superpages. It is also seen that multiple superpage sizes provide more benefits than having a single size. The recovery from fragmentation of physical memory seemed a bit slow though, as it required multiple runs of the FFTW benchmarks to reach its best case performance. It is not mentioned how long each run of FFTW takes.

5. Confusion
What are the overheads of kernel data structures such as population maps in terms of memory consumed?

1. Summary
This paper presents a solution to efficiently support superpages of multiple sizes in the OS, in a reservation-based manner. The authors implement the proposed system in FreeBSD and demonstrate superpages' performance benefits compared with a single-page-size system.
2. Problem
TLB reach has become a bottleneck for programs with large working sets. For modern processors with a large on-chip last-level cache, a limited TLB reach limits the benefit of the LLC, since TLB lookup, and possibly update, is on the critical path of memory access. One solution is to use superpages of different sizes, to increase TLB reach while reducing the fragmentation and I/O overheads of simply using large pages. Although multiple commercial processors already support superpages, OSes lag behind due to the complexity of managing pages of different sizes, including alignment, fragmentation (both internal and external), I/O overheads of large pages, etc.
3. Contributions
This paper proposes a reservation-based allocation approach to support super pages. It is more efficient than previous proposals because
1) It reduces fragmentation by having a buddy allocator manage all available memory and coalesce it in the background, and by having a preemption mechanism to retrieve contiguous memory from larger contiguous regions previously reserved but not fully populated. To implement this preemption mechanism, they design the reservation list data structure to keep track of the preemptible size of each existing reservation, and a per-memory-object population map to keep track of the actual memory layout at base page granularity (I guess memory objects roughly correspond to memory segments).
2) It reduces the I/O overhead and space waste of unnecessary (partially idle) super pages with speculative demotion. It also lets the TLB get the benefit of super pages as soon as possible with incremental promotion.
3) It uses a contiguity-aware page replacement algorithm. When both allocating a super page from available memory and preemption fail, the page daemon will try to evict contiguous inactive pages to restore the necessary contiguity for the failed request.
The paper also proposes a way to avoid swapping the whole super page to disk when it is only partially dirty, by hashing each base page lazily on write and comparing the new hash with the old hash on page-out.
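A minimal user-space sketch of that hashing alternative (the checksum and helpers are placeholders for illustration, not the paper's code):

    #include <stddef.h>
    #include <stdint.h>

    #define BASE_PAGE 8192UL

    /* Toy checksum; the paper's alternative would use a stronger digest. */
    static uint64_t page_hash(const unsigned char *p)
    {
        uint64_t h = 0;
        for (size_t i = 0; i < BASE_PAGE; i++)
            h = h * 31 + p[i];
        return h;
    }

    /* Assumed helper that writes one base page to its backing store. */
    extern void write_base_page(const unsigned char *page, size_t index);

    /* At page-out time, write back only the base pages whose contents no longer
     * match the checksum recorded when the superpage was brought in or promoted. */
    void page_out_superpage(const unsigned char *sp, size_t nbase,
                            const uint64_t *recorded)
    {
        for (size_t i = 0; i < nbase; i++) {
            const unsigned char *base = sp + i * BASE_PAGE;
            if (page_hash(base) != recorded[i])
                write_base_page(base, i);
        }
    }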
4. Evaluation
The authors implement the system in FreeBSD and use real-world workloads to test its performance. Experiments show that superpages, especially superpages of multiple sizes, benefit performance when memory pressure is low, and that the proposed page replacement algorithm is efficient at reclaiming contiguity. However, the authors do not compare superpages' performance with a single-page system when memory pressure is high. Besides, the authors do not give concrete data or solid analysis of the overheads of supporting superpages, such as the size of additional data structures (the population map specifically), dirty superpages, etc.
5. Confusion
- What could be the size of the population maps? They seem pretty expensive since they are per-object data structures and are rich in pointers.
- Why would wired pages get scattered after some time of execution?

1. summary
This paper describes a mechanism to allow the operating system to utilize superpages in the address space of programs.
2. Problem
As memory sizes have increased, TLB sizes have not been able to keep pace. This has led to a situation where the total amount of memory accessible through the entries contained in the TLB at any given time is an increasingly smaller portion of both total memory and the memory used by programs as their memory demands increase. This then leads to an increasing number of TLB misses, each of which comes with an inherent penalty to fetch the translation. Arbitrarily using larger page sizes can lead to problems of its own. In cases where the larger pages are largely unused, much of memory would be wasted. If memory is not allocated in contiguous sections, fragmentation can prevent larger pages from fitting into the address space later when requested and can prevent current allocations from being combined into a larger page.
3. Contributions
This paper introduced a set of policies to make efficient use of larger superpage sizes. All memory allocation was preferentially placed on memory boundaries that would be able to align with a superpage if/when needed. When memory was allocated, the full superpage space was also reserved, though not actually allocated yet, so that future memory requirements could be placed contiguously in what would become the same superpage. Once the superpage space was filled with smaller allocated pages, the region would be promoted to a larger superpage. In order to be sure the full page was actually in use, occasional demotion was done and the reference bits checked for any sections of the superpage that could potentially be paged out.
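A rough sketch of that occasional-demotion-and-recheck cycle, with all mapping and reference-bit helpers assumed rather than taken from the paper:

    #include <stdbool.h>
    #include <stddef.h>

    /* Assumed mapping and reference-bit helpers. */
    extern void demote_to_base_pages(void *sp);
    extern void promote_to_superpage(void *sp);
    extern bool base_page_referenced(void *sp, size_t i);
    extern void clear_reference_bit(void *sp, size_t i);

    /* Pass 1: demote speculatively and clear the per-base-page reference bits,
     * since the TLB only records one reference bit for the whole superpage. */
    void daemon_demote_speculatively(void *sp, size_t nbase)
    {
        demote_to_base_pages(sp);
        for (size_t i = 0; i < nbase; i++)
            clear_reference_bit(sp, i);
    }

    /* Pass 2 (later): if every base page was touched in the meantime, the whole
     * extent is still active and can be repromoted; otherwise it stays demoted
     * so that the idle base pages become candidates for eviction. */
    void daemon_recheck(void *sp, size_t nbase)
    {
        for (size_t i = 0; i < nbase; i++)
            if (!base_page_referenced(sp, i))
                return;
        promote_to_superpage(sp);
    }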
4. Evaluation
To evaluate their method, they ran a full suite of test programs on an Alpha system under a range of conditions. They show that when memory is plentiful, nearly all programs benefit from the use of superpages through the reduction in TLB misses. They then show that the use of their multiple superpage sizes provides more benefit than any one size alone. Their fragmentation concerns are also analyzed over time, with some evidence that their coalescing method allows for improvements over time on a highly fragmented memory system. Their analysis generally does a good job of showing their improvements over the base system without superpages, but what it doesn't show is how it might compare to other superpage systems that perhaps promote pages more readily, as they mention is possible.
5. Confusion
Somewhat of a side topic, but how do the "wired" kernel pages end up so fragmented over time? I was also slightly unsure of the demotion methodology: are large pages demoted entirely down to base pages or just to one size smaller?

1. Summary
This paper looked at how to effectively use superpages to increase TLB coverage. In order to make sure superpages actually can function, the paper not only looked at how to create superpages but also at how to better ensure contiguous memory for the superpages.

2. Problem
The basic problem that this paper was trying to solve was the severe lack of TLB coverage. In fact it has gotten so bad that physically addressed caches now hold more memory than the TLB covers. Superpages have already been proposed to help alleviate this problem, but their reliance on contiguous sections of memory leads to serious fragmentation problems.

3. Contributions
The paper uses a promotion/demotion system for moving base pages or smaller superpages into bigger superpages. To ensure that contiguous memory is available for a superpage when base pages are ready to be promoted, the paper has the OS try to assign a contiguous section of memory from the first allocation. One way the paper tries to handle fragmentation is through its buddy allocator, which performs coalescing of free memory. However, the buddy allocator only works when there is plenty of memory, so the paper also describes a page replacement daemon. The daemon tries to find inactive pages when it is activated by a lack of contiguous memory. Because the hardware gives no per-base-page usage information for pages residing in a superpage, the paper implements speculative demotions to try to find inactive pages within a superpage.
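A heavily simplified sketch of such a contiguity-driven daemon pass; all kernel state and helpers are assumed, and the paper's actual policy is considerably more involved:

    #include <stdbool.h>
    #include <stddef.h>

    struct page {
        struct page  *next;
        unsigned long frame;
        bool          dirty;
    };

    /* Assumed kernel state and helpers. */
    extern struct page *inactive_list;
    extern size_t largest_free_extent(void);             /* in base pages       */
    extern bool frame_extends_free_run(unsigned long f); /* would freeing help? */
    extern void move_to_cache(struct page *p);           /* reclaimable, cheap  */
    extern void schedule_writeback(struct page *p);

    /* Woken when the largest contiguous free region drops below what a pending
     * superpage reservation needs; reclaims inactive pages whose frames help
     * rebuild that contiguity, preferring clean pages that can be cached at once. */
    void contiguity_daemon(size_t wanted_extent)
    {
        for (struct page *p = inactive_list;
             p != NULL && largest_free_extent() < wanted_extent;
             p = p->next) {
            if (!frame_extends_free_run(p->frame))
                continue;
            if (p->dirty)
                schedule_writeback(p);    /* reclaimed once the write completes */
            else
                move_to_cache(p);         /* restores contiguity immediately    */
        }
    }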

4. Evaluation
The paper begins its evaluation section with a broad overview of benchmarks run with their superpaging under no memory stress. However, this doesn't say much, as even the paper admits fragmentation quickly occurs when running naturally. This portion also contains the strange contradiction that the paper claims the gcc benchmark has a significant performance improvement at 1.3% over the baseline while also claiming that the 1.5% degradation of mesa is negligible. Further sections of the evaluation do cover memory stress and show the remarkable job the daemon does in recovering contiguous sections of memory for superpage use.

5. Confusion
The paper seemed to lack a study of the potential overhead that this system could incur, especially with the PTE replication or the recursive demotion. Was this overhead ever addressed, or how has it been overcome?

1. Summary
Superpages (large memory pages) enable mapping of large physical memory regions into the virtual address space. The paper talks about the design of an OS that supports superpage management transparently and also discusses the different mechanisms therein.

2. Problem
TLB coverage refers to the amount of memory covered by cached virtual-to-physical address mappings. The last decade has seen a slow increase in TLB coverage as compared to main memory size. This results in many TLB misses for applications with large working sets. There is thus a need to use the support for superpages provided by modern processors. Furthermore, superpage allocation becomes a complex task when addressing issues like memory fragmentation while simultaneously ensuring good sustained performance.

3. Contributions
The paper discusses managing superpages efficiently and provides multiple mechanisms and policies to do that. A superpage size policy chooses the desired size from the multiple available sizes, addressing the problem of fragmentation and contiguous memory allocation. Incremental promotion of superpage sizes and speculative demotion, when evicting pages, allow easy adaptation to a variety of applications. To address the problem of performance degradation from paging out large, partially dirty superpages, clean superpages are demoted whenever a process tries to write into them and repromoted later if all base pages are dirtied. A contiguity-aware page daemon supports coalescing of fragmented memory. Different data structures used in the implementation of these policies include the multi-list reservation lists, which track partially used memory reservations, and the population map, which keeps track of allocated base pages within each memory object. All these mechanisms and policies are implemented transparently in the system.
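A hedged sketch of the demote-on-write policy just described; the helpers and bookkeeping are assumptions for illustration, not the paper's code:

    #include <stdbool.h>
    #include <stddef.h>

    /* Assumed mapping helpers. */
    extern void demote_to_base_pages(void *sp);
    extern void promote_to_superpage(void *sp);
    extern void map_base_page_writable(void *sp, size_t i);

    struct sp_state {
        size_t nbase;      /* base pages in the superpage            */
        size_t ndirty;     /* base pages written so far              */
        bool   promoted;   /* currently mapped as a single superpage */
    };

    /* A clean superpage is mapped read-only; the first write fault demotes it so
     * that only the touched base page is marked dirty.  Once every base page has
     * been dirtied there is nothing left to save, so the extent is repromoted. */
    void write_fault(struct sp_state *st, void *sp, size_t base_index)
    {
        if (st->promoted && st->ndirty == 0) {
            demote_to_base_pages(sp);
            st->promoted = false;
        }
        map_base_page_writable(sp, base_index);   /* faults once per base page */
        st->ndirty++;

        if (!st->promoted && st->ndirty == st->nbase) {
            promote_to_superpage(sp);
            st->promoted = true;
        }
    }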

4. Evaluation
The superpage management system is implemented in FreeBSD on the Alpha CPU and evaluated on real workloads. A variety of applications, like a web server, image processing and FFT, are considered to evaluate the system. Considering the best-case performance of applications with multiple superpage sizes, performance improvements of 30% are observed and there are almost no TLB misses. The system is also evaluated against adversarial applications using synthetic pathological workloads. Performance degradation of up to 2% is observed, which implies negligible overhead.

5. Confusion
Overall, the idea of supporting superpages of multiple sizes looks good. Do modern OSes provide any support for such a thing?
Is it common to use synthetic workloads when testing system performance? Couldn't the authors have used real applications as adversaries and then discussed the overhead?

1. Summary
This paper addresses the issues of increased fragmentation and higher I/O traffic associated with supporting superpages. It does this by a reservation-based scheme which reserves a contiguous region at page fault time, preemption of reservations under memory pressure and dealing with fragmentation through incremental promotion and demotion of superpages.

2. Problem
Processes with large working sets can incur a significant performance penalty through TLB misses. Superpages are an effective way to reduce TLB misses by increasing TLB coverage. Other approaches, such as increasing the TLB size, are not feasible because the TLB is fully associative and any size increase would increase access time on the critical path. However, naive use of superpages can cause memory fragmentation and increased I/O demands. Existing reservation-based schemes (IRIX, HP-UX) handle these issues but are not transparent and require experimentation to determine the optimal page size. Page-relocation-based schemes are more robust but deal with TLB misses reactively rather than proactively. The authors propose a system that deals with all these issues in a holistic manner.

3. Contributions
1. Reservation-based allocation: this scheme reserves a contiguous region of memory at page fault time, which avoids the cost of page relocation on the critical path and reduces the chance of being unable to form superpages due to fragmentation.
2. Preemption of reservations: if the system comes under memory pressure, it preempts the reservation whose most recent page allocation occurred least recently (see the sketch below).
3. Incremental promotions and speculative demotions help the system determine the optimal page size for different applications and also reduce fragmentation.
4. Optimizations to improve performance, such as demotion of clean superpages when a process attempts to write to them, wired page clustering to avoid scattering of kernel memory pages, and a contiguity-aware page daemon to increase available contiguity.
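
A minimal sketch of how the preemption policy in point 2 might be kept cheap (illustrative structures only, using BSD-style tail queues rather than the paper's actual code): every reservation moves to the tail of a queue when it allocates a frame, so the head is always the reservation whose most recent allocation is oldest.

    /* Sketch (not the paper's code) of the preemption ordering: reservations
     * sit on a queue ordered by the time of their most recent base-page
     * allocation; under memory pressure the head, i.e. the reservation that
     * has gone longest without allocating a new frame, is preempted first. */
    #include <stdint.h>
    #include <stddef.h>
    #include <sys/queue.h>            /* BSD-style tail queues */

    struct reservation {
        uint64_t last_alloc_time;     /* time of most recent allocation */
        unsigned free_frames;         /* frames reserved but not yet used */
        TAILQ_ENTRY(reservation) link;
    };
    TAILQ_HEAD(resq, reservation);

    /* On every base-page allocation within a reservation, move it to the
     * tail, so the head is always the least recently extended reservation. */
    static void note_allocation(struct resq *q, struct reservation *r, uint64_t now)
    {
        r->last_alloc_time = now;
        r->free_frames--;
        TAILQ_REMOVE(q, r, link);
        TAILQ_INSERT_TAIL(q, r, link);
    }

    /* Under memory pressure, preempt the head and reclaim its unused frames. */
    static struct reservation *preempt_one(struct resq *q)
    {
        struct reservation *victim = TAILQ_FIRST(q);
        if (victim != NULL)
            TAILQ_REMOVE(q, victim, link);
        return victim;                /* caller reclaims victim->free_frames */
    }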

4. Evaluation
The authors evaluate their design with several benchmarks in the best-case scenario of plentiful free memory. Moderate performance benefits are seen in several cases. More importantly, the TLB miss reduction is impressive, exceeding 99% in several workloads. This has very positive implications for virtualization, where TLB flushing can be quite important. The authors also show that their system is able to figure out the ideal page size for various applications, which was a weakness in previous systems.

The authors also demonstrate, through a more realistic test, the impact of their contiguity-aware page daemon, which is able to serve many more superpage requests with a minimal (3%) performance overhead. An adversary application benchmark is also run to show the performance overhead associated with the population map data structures, and a degradation of about 2% is seen. What is not shown is the memory overhead of the contiguity-aware data structures.

5. Confusion
1. Is there a population map for each process or per "memory object"? A walkthrough of the population map structure with all the mechanisms would be really helpful.

1. Summary
This paper introduces a superpage management method. The key features include reservation-based allocation, superpage promotion and demotion, and fragmentation control.

2. Problem
The first problem is that the TLB has not grown as fast as main memory. The memory needed by an application (its working set) is generally far larger than what the TLB can cover, which causes TLB misses. Each TLB miss requires extra memory accesses to consult the page table, which incurs a performance penalty.

The second problem is that superpage support had several constraints at the time. First, the hardware only supports certain superpage sizes (how should the proper page size be chosen?). Second, a superpage has to be contiguous and aligned in both physical and virtual address space (how can superpages be managed to avoid fragmentation?). Third, the TLB entry for a superpage carries only a single reference bit and dirty bit for the whole superpage (how can dirty pages be paged out efficiently?).

3. Contributions
The authors propose a reservation-based allocation method for superpages. On a page fault, a contiguous set of page frames is reserved, but only the mapping for the faulted page is inserted into the page table. Subsequent faults on surrounding pages simply insert their mappings as well; each page still has the default base size (e.g., 4 KB). Once all the reserved pages have been referenced, they are promoted to a superpage: the page table entries for the contiguous pages are updated accordingly, and a single TLB entry for the superpage can then be used. To demote a superpage when memory runs short, if the superpage's reference bit is clear, it is decomposed into superpages of the next smaller size. This demotion method seems aggressive, because a single page determines whether the superpage containing it is demoted or not; however, we are constrained by the fact that there is only one reference bit per superpage. To control fragmentation, the memory allocator coalesces free memory regions whenever possible (although there is a trade-off between memory contiguity and performance overhead).
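
A small sketch of the incremental-promotion idea described above (the page sizes, field names, and the single per-reservation level counter are simplifications; the real system tracks population per aligned sub-region via the population map):

    /* Toy sketch of incremental promotion: a reservation covers up to 1024
     * base pages, and promotion moves one superpage level at a time, only
     * when every base page of the enclosing region has been touched. */
    #include <stdbool.h>

    #define NLEVELS 4
    #define BASE_PAGES 1024                          /* e.g. 4 MB of 4 KB pages */
    static const unsigned level_pages[NLEVELS] = { 1, 16, 128, 1024 };

    struct region {
        bool populated[BASE_PAGES];  /* base page has been faulted in */
        unsigned level;              /* current promotion level, 0 = base pages */
    };

    /* Is the level-sized region around base page idx fully populated? */
    static bool fully_populated(const struct region *r, unsigned idx, unsigned level)
    {
        unsigned start = idx - idx % level_pages[level];
        for (unsigned i = start; i < start + level_pages[level]; i++)
            if (!r->populated[i])
                return false;
        return true;
    }

    /* Page-fault path: mark the base page, then promote one level at a time. */
    static void fault_in(struct region *r, unsigned idx)
    {
        r->populated[idx] = true;
        while (r->level + 1 < NLEVELS && fully_populated(r, idx, r->level + 1))
            r->level++;              /* here the PTEs/TLB entry would be updated */
    }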

To reduce the cost of paging out superpages, the authors mention a method that computes and compares a hash of each base page. Hashing seems like a good idea to me, but the authors say the method is expensive and that further evaluation is needed. I wonder whether a Merkle tree could be used here: each level of the tree would hold the hash for one superpage size, with leaf hashes covering base pages (4 KB) and the root hash covering the whole superpage. When a page is written, the affected part of the tree is updated in the background; when paging out, only part of the hash tree needs to be compared rather than every base page. To solve this problem at its root, though, hardware support for more per-base-page information in superpage entries is needed.
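
For concreteness, a toy version of the per-base-page hashing idea (the checksum choice and function names are mine, not the paper's; a real system would have to worry about hash collisions and about when the reference hashes are taken):

    /* Sketch: remember a checksum of each base page when the superpage is
     * brought in, and at page-out time write back only those base pages
     * whose checksum has changed.  FNV-1a is used purely as an illustrative
     * checksum. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    #define BASE_PAGE_SIZE 4096u

    static uint64_t page_hash(const unsigned char *page)
    {
        uint64_t h = 1469598103934665603ull;          /* FNV-1a, 64-bit */
        for (size_t i = 0; i < BASE_PAGE_SIZE; i++) {
            h ^= page[i];
            h *= 1099511628211ull;
        }
        return h;
    }

    /* Returns the number of base pages that actually need to be written. */
    static unsigned pages_to_flush(const unsigned char *superpage,
                                   const uint64_t *hash_at_pagein,
                                   unsigned nbase, bool *dirty_out)
    {
        unsigned ndirty = 0;
        for (unsigned i = 0; i < nbase; i++) {
            uint64_t h = page_hash(superpage + (size_t)i * BASE_PAGE_SIZE);
            dirty_out[i] = (h != hash_at_pagein[i]);
            if (dirty_out[i])
                ndirty++;
        }
        return ndirty;
    }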

4. Evaluation
The authors implemented their design in a version of FreeBSD. To test the prototype's efficiency, they use both benchmarks and real applications. When free memory is plentiful and unfragmented, superpages show a reduction in TLB misses and a speedup, and the authors also show that the best superpage size depends on the application. For sustained performance in long-running workloads, the overhead of the page daemon's contiguity restoration is only 0.8%. The authors also show that the performance degradation and overhead are around 1% under several adversary applications.

5. Confusion
I don't clearly understand how the population map works together with reservations (the two pointers between them). Could you work through an example like Figure 3 in class?

Summary:

Superpages improve TLB coverage, reduce TLB misses, and promise
performance improvements. However, they come at a cost: superpage
allocation, fragmentation, promotion tradeoffs, and so on. The paper
proposes a design to make superpaging as efficient as possible.


Problem:

1) Applications with large working sets incur many TLB misses.

2) Memory has grown at a much faster rate than TLB size. We want to
increase coverage without increasing TLB size.

3) Naively using larger pages can lead to more I/O traffic.


Contribution

SUPERPAGES - simply a collection of contiguous smaller pages

* Reservation-based allocation. Here, the OS tries to allocate a
  page frame that is part of an available, contiguous range of
  page frames equal in size and alignment to the maximal desired
  superpage size. This way, if the OS later wants to create a
  superpage, the pages are already contiguous and do not need to
  be copied. At page-fault time, the system obtains from the
  allocator a set of contiguous page frames corresponding to
  the chosen superpage size.

* How to select the superpage size? The choice is made very early,
  when future behaviour is hard to predict, and from the reading
  earlier in the week we already know that looking too far ahead
  is a terrible idea. The policy looks at the memory demand of the
  object that caused the page fault. For fixed-size objects like
  mmapped files this is easy; for dynamic memory allocations it
  gets harder. The desired size is the largest aligned superpage
  that contains the faulting page and does not overlap with
  existing reservations or allocations.

* Fragmentation control: the OS tries to keep memory as contiguous
  as possible so superpages stay cheap to build. Under pressure it
  can either refuse to hand out pages or reclaim some of the
  reserved but unused pages; the paper's policy is to never say
  no, but to preempt reserved pages instead.

* Promotion: once the OS has allocated a certain threshold number
  of contiguous pages, it may promote that group of pages into a
  superpage and install a single TLB entry for it.

* Demotion: when a superpage becomes fragmented or there is memory
  pressure, go back to smaller pages.

* Eviction: works just like for regular pages. There is only one
  dirty bit for the entire superpage, so the entire superpage often
  has to be flushed for reliability, which slows things down. The
  fix is to demote superpages when only some of their pages are
  dirty and repromote them once all pages have been dirtied (see
  the sketch below). A hash-based approach could check which pages
  are actually dirty or clean.
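
A tiny sketch of the demote-on-write idea from the eviction point above (structure and field names are illustrative, not FreeBSD's): the first write to a clean superpage demotes it so dirtiness is tracked per base page, and it is repromoted once every base page is dirty.

    /* Sketch of demote-on-write for a clean superpage mapped read-only. */
    #include <stdbool.h>

    struct sp_mapping {
        bool is_superpage;        /* currently mapped by one superpage entry */
        bool dirty[512];          /* per-base-page dirty state after demotion */
        unsigned nbase;           /* base pages covered (<= 512 here) */
        unsigned ndirty;
    };

    static void on_write_fault(struct sp_mapping *m, unsigned base_idx)
    {
        if (m->is_superpage) {
            m->is_superpage = false;   /* demote: reinstall base-page PTEs */
            m->ndirty = 0;
        }
        if (!m->dirty[base_idx]) {
            m->dirty[base_idx] = true; /* only this base page must be flushed */
            m->ndirty++;
        }
        if (m->ndirty == m->nbase)
            m->is_superpage = true;    /* all dirty: safe to repromote */
    }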


Evaluation

This paper employs the policy of preempting old reservations when free
physical memory becomes scarce or excessively fragmented. Though it
prefers reserved but less recently used pages, this seems dangerous to
me, since the system might kick out reservations belonging to key
processes (like a server process or the login process).


I have to imagine there must be a cleverer way to do speculative
demotion. Currently they always demote all base pages of a superpage
with probability 1. Surely they could come up with a better
probability distribution through some form of empirical analysis.

1) Summary

Hardware support for large pages was becoming ubiquitous in the days of this paper. However, OS support for large pages was still lagging behind. In particular, a good OS implementation of large-page support must avoid wasting space, fragmenting memory, and wasting too much time writing partially dirty large pages to disk. The authors propose mechanisms for accomplishing these goals.

2) Problem

As memory grows larger, the percentage of memory covered by the TLB dwindles. One solution to this problem is to increase the page size. This has been shown to produce significant performance speedups due to fewer TLB misses.

However, OS support for large pages poses a challenging problem. When allocating physical memory to processes, the OS must avoid wasting space by giving out too many large pages that are mostly unconsumed. At the same time, the OS wants to prevent as many TLB misses as possible.

Many implementations reserve contiguous memory for a process in anticipation of being able to consolidate a region of memory into a large page. This reduces the amount of page relocation needed, but makes the system susceptible to fragmentation.

Also, large pages can induce high overhead when swapping pages, even if only a portion of a page has been written to. The large amount of I/O can result in large slowdowns, despite the benefit of large pages.

3) Contributions

The authors propose a number of policies for page management, along with two mechanisms for their implementation.

First, the authors propose using reservation preemption and incremental promotion to mitigate fragmentation. Preemption allows a reservation to be reallocated to a different process when there is memory pressure. Meanwhile, incremental promotion coalesces adjacent pages into larger superpages as that becomes possible. By combining pages incrementally, rather than all at once, the allocator avoids giving processes memory they don't need.

Second, the authors propose speculative demotion to reduce the amount of unneeded memory allocated to a process. While the authors do not evaluate this technique, they argue that demoting pages temporarily can be a good way of finding out if a large page is useful to a process, similar to the clock page replacement algorithm.

Third, the authors use dirty page demotion to avoid writing large pages to disk needlessly, reducing the I/O overhead of paging.

To implement these policies, the authors propose the population map and the multi-list reservation scheme. The population map is a data structure that keeps track of which sub-pages of a superpage are allocated, so the allocator can tell when it can coalesce pages into a superpage. The multi-list reservation scheme allows the allocator to find reservations that can still yield contiguous regions of a given size.
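
A rough sketch of what a population map could look like (the radix factor, field names, and recursion here are assumptions for illustration; the paper's structure also stores extra per-node state for reservation lookups):

    /* Sketch of a population map as a radix tree over a memory object.
     * Each node summarizes one aligned region; "populated" counts
     * faulted-in base pages below it, so a node is promotable when
     * populated equals the region's capacity. */
    #include <stdlib.h>

    #define RADIX 8   /* children per node; a real system derives this from page sizes */

    struct popmap_node {
        unsigned long populated;          /* base pages faulted in under this node */
        unsigned long capacity;           /* base pages this region can hold */
        struct popmap_node *child[RADIX]; /* NULL until first fault in that subregion */
    };

    /* Record a base-page fault at offset `page` (in base pages) and report
     * whether the whole region under `node` is now fully populated.
     * (Allocation-failure handling omitted in this sketch.) */
    static int popmap_populate(struct popmap_node *node, unsigned long page)
    {
        node->populated++;
        if (node->capacity > RADIX) {
            unsigned long slot_size = node->capacity / RADIX;
            unsigned idx = page / slot_size;
            if (node->child[idx] == NULL) {
                node->child[idx] = calloc(1, sizeof(*node->child[idx]));
                node->child[idx]->capacity = slot_size;
            }
            popmap_populate(node->child[idx], page % slot_size);
        }
        return node->populated == node->capacity;   /* candidate for promotion */
    }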

4) Evaluation

Overall, the authors demonstrate that their system comprehensively solves many of the problems discussed above while retaining generality. They introduce a lot of novel ideas, and their evaluation section shows that they eliminate almost all TLB misses. They use the SPEC2000 benchmarks for their evaluation, which is reasonable.

Their data do show that most benchmarks experience only a marginal improvement. The benchmarks that benefit the most see a high ratio of large pages to small pages, and this is not achieved for most benchmarks. Still, most benchmarks do at least as well with their techniques as without, and most have a slight benefit.

However, in my opinion, the authors do not explain many of their mechanisms very clearly. They leave out key implementation details (see Confusions). In particular, they do not address the memory overhead of the mechanisms they define: how much memory does the population map itself take? Are these overheads significant? Since large pages are primarily aimed at systems with increasingly large memories, the footprint of these data structures may not scale.

5) Confusion

What is a "memory object"?

Is there a population map per process? This seems like it could have pretty high overhead.

Summary:
The paper presents a reservation-based approach to transparently manage superpages in the operating system. The approach balances various tradeoffs while allocating superpages, so as to achieve high and sustained performance and prevent performance degradation due to fragmentation.

Problem:
Main memory size has increased at a faster pace than TLB coverage. Thus, applications with larger working sets incur many TLB misses and suffer significant performance degradation. Since the TLB lies on the critical path of memory access, increasing its size is not a good option. To address this problem, most modern CPUs provide support for superpages, but operating systems have not been able to use superpages effectively, leading to increased memory footprints, fragmentation, and performance degradation.

Contributions:
The approach suggested in the paper uses the following mechanisms:
-Support for multiple superpage sizes to reduce internal fragmentation.
-On a page fault, it determines a preferred superpage size (see the sketch after this list), reserves the entire set of frames for potential future use as a superpage, and allocates the faulted page by adding a mapping entry to the page table. The reservation is promoted to a superpage once all of its pages have been touched.
-Speculatively demotes infrequently referenced superpages to reclaim memory.
-Demotes a clean superpage when a process attempts to write into it and repromotes it later if all of the base pages are dirtied, to save on I/O traffic.
-Maintains a population map that keeps track of allocated base pages within each memory object and helps with promotion and demotion of superpages.
-Controls fragmentation by modifying the page replacement policy to factor in contiguity restoration and by using a buddy allocator that maintains multiple lists of free blocks.
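
As a concrete illustration of the preferred-size step above, here is a small sketch (the size list and the overlap-check helper are assumptions, and the helper is left hypothetical): the policy walks the supported sizes from largest to smallest and picks the first aligned candidate that fits inside the object and collides with no existing reservation.

    /* Sketch of a preferred-superpage-size policy. */
    #include <stdbool.h>
    #include <stddef.h>

    static const size_t sizes[] = { 4u<<20, 512u<<10, 64u<<10, 4u<<10 }; /* 4MB..4KB */

    /* Hypothetical helper: does [start, start+len) collide with a reservation? */
    bool overlaps_existing_reservation(size_t start, size_t len);

    static size_t preferred_size(size_t fault_off, size_t object_size)
    {
        for (unsigned i = 0; i < sizeof(sizes)/sizeof(sizes[0]); i++) {
            size_t start = fault_off - fault_off % sizes[i];   /* aligned candidate */
            if (start + sizes[i] <= object_size &&
                !overlaps_existing_reservation(start, sizes[i]))
                return sizes[i];
        }
        return sizes[3];   /* fall back to a base page */
    }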

Evaluation:
The authors implemented their design in the FreeBSD-4.3 kernel and ran it on an Alpha 21264 machine. They evaluated the design on a variety of benchmarks such as SPEC and matrix workloads. In the best-case setting where free memory is plentiful and unfragmented, data TLB misses are almost eliminated for most applications and a maximum speedup of 7.5x is observed for the matrix benchmark. The authors also evaluated the benefit of using multiple superpage sizes. In a real-world setting where memory fragments fairly quickly, their page daemon is able to recover contiguous regions of memory efficiently with a minimal performance overhead of 0.8%. The authors report the overhead of their design decisions, but the majority of the overhead is attributed to the underlying hardware design, such as PTE replication. They should have evaluated the performance overhead of demoting dirty superpages under a more realistic write pattern. I would also like to see the storage overhead of the population maps and reservation lists.

Confusion:
I did not understand the multi-list reservation scheme and the mechanism for preempting a reservation.

1. Summary
This paper designs OS support for superpages, addressing challenges such as allocation, promotion, and fragmentation control.
2. Problem
1) Several problems come with the introduction of superpages. Superpages are introduced to address the TLB's small coverage, but inappropriate use of superpages causes many issues:
- enlarged application footprints
- increased physical memory requirements
- higher paging traffic
- physical memory fragmentation, which decreases future opportunities for using large superpages
- I/O costs that can easily outweigh any performance gain from superpages

2) There are hardware constraints on superpages:
- a limited choice of superpage sizes
- contiguity in both physical and virtual address space
- page address alignment
- coarse granularity of the reference, dirty, and protection bits

3) There are many tradeoffs and obstacles in superpage design:
- allocation: relocation-based vs. reservation-based
- fragmentation control:
a) proactively releasing contiguous chunks of inactive memory, at the risk of more I/O
b) weighing the cost of preempting an existing, partially used reservation against the benefit of using large superpages
- promotion: all at once vs. incrementally
- demotion: it is difficult for the OS to detect which portions of a superpage are actively used
- eviction: when a dirty superpage is paged out, the OS has to flush the whole page

The design resolves these as follows:
- allocation and promotion tradeoffs -> choose the maximum superpage size that can be effectively used for an object, and promote incrementally, only for regions fully populated by the application
- fragmentation control -> coalescing and contiguity-aware page replacement


3. Contributions
1. Design an effective superpage management system.
- performance benefits > 30%, sustained even under stressful workload scenarios
- general and transparent
- balances various tradeoffs while allocating superpages

2. Extends a previously proposed reservation-based approach to work with multiple, potentially very large superpage sizes.
- support for multiple superpage sizes
- scalability to very large superpages
- demotion of sparsely referenced superpages
- effective preservation of contiguity without the need for compaction
- efficient disk I/O for partially modified superpages

3. Proposes a novel contiguity-aware page replacement algorithm to control fragmentation. The authors make several modifications on the FreeBSD's page daemon (Section 5.1).

4. Evaluation
The authors test their design on several classes of benchmarks and real applications.
1) When system memory is plentiful, most of the workloads display benefits from superpages, and repeating the experiments with different superpage sizes shows that allowing the system to choose between multiple page sizes yields higher performance.
2) Sustained experiments demonstrate that their design (the daemon) maintains a speedup because of better fragmentation control.
3) Under several adversary applications, the new method shows very small degradation.

5. Confusion
In 4.2, Preferred superpage size policy: "To avoid wastage of contiguity for small objects that may never grow large, the size of this superpage is limited to the current size of the object." How is the object defined here - stacks, heaps? How can you tell the size of these objects before they are allocated and used?
In 4.7, Paging out dirty superpages: why not use a bitmap to record the dirtiness of each base page in a superpage?

1. Summary
This paper presents a system that transparently supports multiple superpage sizes in an operating system. Pages are dynamically promoted to a superpage when the number of contiguous used pages crosses a certain threshold. To tackle memory pressure and the effects of fragmentation, superpages can also be demoted to smaller sizes. A FreeBSD implementation of this superpage management system gives up to a 30% performance benefit to applications.

2. Problem
The size of Translation Lookaside Buffers (TLBs) has not increased in proportion to main memory sizes. As a result, TLB coverage (the total amount of memory whose translations are directly accessible through the TLB) as a fraction of total memory has decreased significantly over the years. This causes processes with bigger working sets to suffer a high number of TLB misses, incurring a significant performance penalty. Larger pages can increase this ratio, since a larger page size means more of memory is accessible directly through the TLB (i.e., without a page table lookup). However, merely increasing the page size leads to higher internal fragmentation and greater paging traffic. Therefore, a more dynamic and transparent page size management scheme is needed. This paper designs and implements one such system.

3. Contributions

When a process allocates memory, a large contiguous segment is proactively reserved instead of a single page. When the number of used pages within that contiguous segment reaches a certain threshold, the segment is promoted to a superpage and the page table and TLB entries are adjusted to reflect the change. These promotions are incremental: a contiguous region is promoted to the next available superpage size, which may in turn be promoted to the next bigger size, up to the maximum page size (in this case 4 MB). To control fragmentation and tackle memory pressure, superpages can be demoted to smaller superpages or to base pages. This matters because paging superpages in and out wholesale would incur so much I/O overhead that it would outweigh any benefit of superpaging. A demotion also occurs when a clean superpage is dirtied: the hardware provides only one dirty bit per superpage, so it is impossible to know which base page within it was written. Hence, a demotion on write occurs to avoid writing the whole superpage to disk; the demoted pages can later be re-consolidated into a superpage.
The authors use a multi-list reservation scheme and population maps to implement this design. These data structures are used by the buddy allocator, which acts as their global superpage manager.

4. Evaluation
The evaluation in the paper seems quite comprehensive. The authors separately measure the various benefits of superpages under about 10 different workloads. Most of the workloads show a TLB miss reduction of about 99% and speedups of over 10%. mesa was the only benchmark that actually suffered from the use of superpages. I think the authors should also have measured the memory overhead of using superpages.

5. Confusion

I did not quite understand how the population map works. It also wasn't clear to me whether there is a population map per process or for the whole system. The authors mention that population maps keep track of allocated base pages within each memory object; wouldn't this have a huge memory overhead?

1. Summary
This paper introduces how to support superpages larger than the traditional 4K page. It presents a series of algorithms and data structures to support various superpage sizes and to deal with problems such as memory fragmentation and reservation preemption.

2. Problems
Computers now have much larger memories than before, and programs' working sets have grown a lot. However, the TLB has not grown in size because it needs to be very fast. So, with the traditional 4K page, the TLB cannot cover a large working set, the TLB miss rate increases sharply, and TLB miss exceptions and their handling become a significant overhead, especially in data-intensive applications with very large working sets. With superpages, one TLB entry can map a much larger page, which increases TLB coverage and reduces the TLB miss overhead. However, managing superpages and dealing with problems such as fragmentation is challenging.

3. Contributions
This paper designs a reservation-based allocation scheme for superpage management. During allocation, it tends to reserve the maximum useful superpage size to avoid future relocation. It has an incremental promotion and demotion mechanism, so superpages can be managed at different granularities (page sizes), which helps when preempting reservations and when paging out dirty superpages, because large superpages can be dynamically split into smaller ones. There are also special data structures to support the system: the reservation lists find a suitable reservation to preempt, and the population map is a hierarchical structure over superpages that tracks reservations and superpages and helps perform promotion during page faults.
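
A minimal sketch of what a multi-list reservation scheme could look like (list layout and names are my own, not the paper's): partially used reservations sit on one queue per attainable extent size, ordered so that the least recently extended reservation is preempted first.

    /* Sketch of a multi-list reservation scheme using BSD tail queues. */
    #include <stddef.h>
    #include <sys/queue.h>

    #define NSIZES 3                  /* e.g. 64 KB, 512 KB, 4 MB size classes */

    struct reservation;
    TAILQ_HEAD(res_list, reservation);

    struct reservation {
        unsigned largest_free_extent; /* size class this reservation can still yield */
        TAILQ_ENTRY(reservation) link;
    };

    static struct res_list res_lists[NSIZES];

    static void res_lists_init(void)
    {
        for (unsigned i = 0; i < NSIZES; i++)
            TAILQ_INIT(&res_lists[i]);
    }

    /* To satisfy a request for a free extent of size class `want`, preempt
     * the least recently extended reservation from the smallest list that
     * can provide it, so large extents are preserved as long as possible. */
    static struct reservation *preempt_for(unsigned want)
    {
        for (unsigned i = want; i < NSIZES; i++) {
            struct reservation *r = TAILQ_FIRST(&res_lists[i]);
            if (r != NULL) {
                TAILQ_REMOVE(&res_lists[i], r, link);
                return r;             /* caller breaks it up and reclaims frames */
            }
        }
        return NULL;                  /* nothing to preempt */
    }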

4. Evaluation
In the experimental section, the paper uses different benchmarks to show the benefit of superpages. The speedup from superpages depends on the memory usage and memory access patterns of the application, and reaches about 7x for the matrix benchmark. Different applications prefer different superpage sizes, and allowing multiple superpage sizes with incremental promotion lets applications automatically reach a suitable page size. For fragmentation, the paper uses two benchmarks with different memory behavior (a web server and FFTW) and runs them in turn to show that the fragmentation control works well. There are also experiments testing the overhead and scalability of the system.

5. Confusion
I am not very clear about how this system deals with fragmentation. Sections 4.4 and 5.1 cover some of it, but I can't follow them, especially how the page daemon works to improve contiguity.

1. Summary
TLB coverage can be extended through the use of superpages. This paper provides mechanisms for the OS to use them effectively and transparently to the user.

2. Problem
The sizes of memory, working sets, and on-chip caches have increased much faster than the coverage provided by the TLB, leading to many TLB misses. Contemporary hardware allows the TLB to use superpage entries that cover a large set of base pages, but OSes have not found a way to use them effectively or without the user's intervention. Using superpages automatically in the OS leads to a plethora of policy choices and issues with fragmentation.

3. Contributions
This paper shows that the overheads associated with the OS managing superpages are far offset by the benefits provided by using them. It also outlines a series of OS policy choices relating to superpages that the authors believe to be practical.
Overall, they show that using superpages transparently by the OS is indeed possible without greatly changing the OS. Many of the policy choices they make feel arbitrary and are not compared with other options, so I think their contribution is more the proof of concept than the policy choices themselves.

4. Evaluation
They modify a FreeBSD kernel running on a machine that supports a variety of page sizes to include their suggested changes. Their workloads come from SPEC and a variety of other benchmark suites and applications.
First, they show the best-case benefits of superpages, when there is plentiful free memory for superpage allocation. Predictably, they see the TLB miss rate shrink to near 0%.
They illustrate that their system gains benefit from allowing pages to dynamically change their size by restricting the system to a single superpage size and comparing.
They also run some interesting long-running experiments by deliberately fragmenting the memory space to see how well the modified page daemon reclaims space over time. They do this by running a web server with a large memory footprint alongside a more conventional benchmark application. First, they run them in series and track how long it takes for the conventional benchmark to regain stable performance. Then, they run them in parallel to see how much the benchmark was affected. I thought it was interesting to measure the performance of the fragmentation-fighting methods for a system under stress.

5. Confusion
I did not understand the discussion of page coloring. They seem to conclude that the superpage system interferes with the page coloring algorithm, but then conclude that "both [...] effectively benefit from page coloring."
