[SMALLER PAGE TABLES: OR HOW TO STOP FILLING MEMORY WITH THOSE DARN THINGS]

We now tackle the second problem that paging introduces: page tables are too big. We start with the linear page table. As you might recall (or might not, this is getting pretty detailed), linear page tables get pretty big. Assume again a 32-bit address space with 4KB pages. As above, we see that this leads to a 4MB page table. Recall also: we have one page table *per process*! Thus, with a hundred active processes (not uncommon on a modern desktop machine), we would be allocating 400MB of memory just for page tables! We are thus in search of some techniques to reduce this heavy burden. There are a lot of them, so let's get going.

[SIMPLE: MAKE BIGGER PAGES]

We could reduce the size of the page table in one simple way: use bigger pages. Take our 32-bit address space again, but this time assume 16KB pages. We would thus have an 18-bit VPN plus a 14-bit offset. Assuming the same size for each PTE (4 bytes), we now have 2^18 entries in our linear page table and thus a total size of 1MB per page table, a factor-of-four reduction in the size of the page table (which exactly mirrors the factor-of-four increase in the size of each page).

The major problem with this approach, however, is that big pages lead to waste *within* each page, a problem known as *internal fragmentation* (as the waste is *internal* to the unit of allocation). Applications thus end up allocating pages but only using little bits and pieces of each, and memory quickly fills up with these overly-large pages. Thus, most systems use relatively small page sizes: 4KB (as in x86) or 8KB (as in SPARCv9). Our problem will not be solved so simply, alas.
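To make the arithmetic concrete, here is a tiny calculator for the size of a linear page table, a minimal sketch that assumes (as above) 4-byte PTEs and one PTE per virtual page:

    #include <stdio.h>
    #include <stdint.h>

    // Size (in bytes) of a linear page table, given the number of
    // virtual-address bits, the page size, and the size of each PTE.
    uint64_t pt_size(int va_bits, uint64_t page_size, uint64_t pte_size) {
        uint64_t num_pages = ((uint64_t)1 << va_bits) / page_size;
        return num_pages * pte_size;   // one PTE per virtual page
    }

    int main(void) {
        // 32-bit address space, 4KB pages: 2^20 entries -> 4MB
        printf("4KB pages:  %llu MB\n",
               (unsigned long long)(pt_size(32, 4096, 4) / (1024 * 1024)));
        // Same address space, 16KB pages: 2^18 entries -> 1MB
        printf("16KB pages: %llu MB\n",
               (unsigned long long)(pt_size(32, 16384, 4) / (1024 * 1024)));
        return 0;
    }

Multiply the first result by a hundred processes and the 400MB figure above falls out directly.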
[HYBRID APPROACH: PAGING AND SEGMENTATION]

Whenever you have two reasonable but different approaches to something in life, you should always examine the combination of the two to see if it can obtain the best of both worlds. We call such a combination a *hybrid*. For example, why eat just chocolate or just peanut butter when you can combine the two in a lovely hybrid known as the Reese's Peanut Butter Cup? [2]

System designers similarly had the idea of combining paging and segmentation in order to reduce the memory overhead of page tables. We can see why this might work by examining a typical linear page table in more detail. Let's say we have a large address space, but the currently used heap and stack are pretty small. Specifically, we use a tiny 16KB address space with 1KB pages:

      0KB  |---------------------|
           |   (program code)    |
      1KB  |---------------------|
           |                     |
           |       (free)        |
           |                     |
      4KB  |---------------------|
           |       (heap)        |
      5KB  |---------------------|
           |         |           |
           |         v           |
           |       (free)        |
           |         ^           |
           |         |           |
     14KB  |---------------------|
           |       (stack)       |
     15KB  |---------------------|
           |       (stack)       |
     16KB  |---------------------|

The linear page table for this address space might look like this:

     PPN | valid | prot | present | dirty
    --------------------------------------
      10 |   1   | r-x  |    1    |   0
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
      23 |   1   | rw-  |    1    |   1
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
      90 |   1   | rw-  |    1    |   1
      91 |   1   | rw-  |    1    |   1

This example assumes the single code page (VPN 0) is mapped to physical page 10, the single heap page (VPN 4) to physical page 23, and the two stack pages at the other end of the address space (VPNs 14 and 15) are mapped to physical pages 90 and 91, respectively. As you can see from the picture, *most* of the page table is unused, full of *invalid* entries. What a waste! And this is for a tiny 16KB address space. Imagine the page table of a 32-bit address space and all the wasted space when most of the address space goes unused. Actually, don't imagine it; it's too gruesome.

Thus, our hybrid approach: instead of having a single page table for the entire address space of the process, why not have one per logical segment? In this example, we might thus have three page tables, one each for the code, heap, and stack portions of the address space. (Note: you could easily do two segments instead of three: one packing the static code and the dynamically-growing heap together, and another holding the dynamically-growing stack. This approach was taken on the VAX minicomputer [3].)

Now, remember with segmentation, we had a segment table that told us where each segment lived in physical memory with base/bounds pairs. We still have that structure in the MMU; here, though, we use it not to point to the location of the segment itself but rather to point to the base physical address of the *page table* of that segment. The bounds register is used similarly to indicate the end of the page table (i.e., how many valid pages it has). A sketch of the resulting address translation follows.
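To make the mechanism concrete, here is a minimal sketch in C of the hybrid translation (all names and the exact bit layout are invented for illustration; we assume a 14-bit virtual address whose top two bits select the segment, matching the 16KB/1KB example above):

    #include <stdbool.h>
    #include <stdint.h>

    // Assumed layout: [ seg (2 bits) | VPN (2 bits) | offset (10 bits) ]
    #define SEG_BITS    2
    #define VPN_BITS    2      // 4 pages per segment with 1KB pages
    #define OFFSET_BITS 10

    typedef struct { uint32_t ppn; bool valid; } pte_t;

    typedef struct {
        pte_t   *pt_base;   // base address of this segment's page table
        uint32_t pt_bound;  // number of valid PTEs in that table
    } seg_t;

    // Returns false on a bounds violation or an invalid mapping;
    // a real MMU would raise an exception instead.
    bool translate(seg_t *segtab, uint32_t vaddr, uint32_t *paddr) {
        uint32_t seg    = vaddr >> (VPN_BITS + OFFSET_BITS);
        uint32_t vpn    = (vaddr >> OFFSET_BITS) & ((1u << VPN_BITS) - 1);
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);

        seg_t *s = &segtab[seg];
        if (vpn >= s->pt_bound)         // past the end of this page table
            return false;
        pte_t *pte = &s->pt_base[vpn];  // index the segment's page table
        if (!pte->valid)
            return false;
        *paddr = (pte->ppn << OFFSET_BITS) | offset;
        return true;
    }

The key point of the hybrid is visible in the bounds check: a segment with little in it gets a short page table, and unused regions of the address space need no page table memory at all.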
[MULTI-LEVEL PAGE TABLES]

A different approach doesn't rely on segmentation but attacks the same problem: how might we get rid of all those invalid regions in the page table, instead of keeping them all in memory? We call this approach a *multi-level page table*, as it turns the linear page table into something like a tree. We now describe this approach in more detail.

Let us imagine a small address space of size 16KB, with 64-byte pages. Thus, we have a 14-bit virtual address space, with 8 bits for the VPN and 6 bits for the offset. A linear page table would have 2^8 (256) entries, even if only a small portion of the address space is in use. Here is one example of such an address space:

                  ______________
     0000 0000   |     code     |
     0000 0001   |     code     |
     0000 0010   |    (free)    |
     0000 0011   |    (free)    |
     0000 0100   |     heap     |
     0000 0101   |     heap     |
     0000 0110   |    (free)    |
     0000 0111   |    (free)    |
      ... ...    |    (free)    |
     1111 1110   |    stack     |
     1111 1111   |    stack     |
                  --------------

In this example, virtual pages 0 and 1 are used for code, virtual pages 4 and 5 for the heap, and virtual pages 254 and 255 for the stack; the rest of the pages of the address space are unused.

To build a two-level page table for this address space, we start with our full linear page table and break it up into page-sized units. Recall our full table (in this example) has 256 entries; let us assume each PTE is 4 bytes in size. Thus, we have a total page table of size 1KB. Given that we have 64-byte pages, the 1KB page table fits into 16 64-byte pages, and each such page can hold 16 PTEs.

What we need now is a structure that can point to each page of the page table. We call this structure the *page directory*; it is a simple linear array with one entry per page of the linear page table. Because we have 16 pages of page table, we need 4 bits to select among them, and we use the top four bits of the VPN to index into the directory:

    VPN:   7  6  5  4  3  2  1  0
          \----------/
                |
                +--> use these bits to index into the page directory

The beauty of the page directory arises when there is a large region of the page table that has all invalid entries. Instead of marking them all invalid in the page table itself, we can instead mark invalid the page directory entry that points to this part of the page table; thus, that portion of the page table does not have to be allocated any memory at all. The page directory for the example address space above:

    pointer to page of PT | valid?
    -------------------------------
          paddr1          |   1
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          paddr2          |   1

Each *page directory entry* (PDE) describes something about a chunk of the page table for the address space. In this example, we have two valid chunks of the address space (at the beginning and end), and a number of invalid mappings in-between. At paddr1 (some physical address), we have the first chunk of 16 page table entries, covering the first 16 VPNs in the address space. It would look something like this:

    PPN | valid | prot
    --------------------
     10 |   1   | r-x
     23 |   1   | r-x
     -- |   0   | ---
     -- |   0   | ---
     80 |   1   | rw-
     59 |   1   | rw-
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---

This chunk of the page table contains the mappings for the first 16 VPNs; in our example, VPNs 0 and 1 are valid (the code segment), as are 4 and 5 (the heap). Thus, the table has mapping information for each of those pages; the rest of the entries are marked invalid.

A similar chunk of page table is found at paddr2; this one is for the last 16 VPNs of the address space:

    PPN | valid | prot
    --------------------
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     55 |   1   | rw-
     45 |   1   | rw-

Thus, for VPNs 254 and 255 (the stack), we have valid mappings.

Hopefully, what we can see from this example is how much space savings are possible with a multi-level indexed structure. In this example, instead of allocating 16 pages for a linear page table, we allocated only 3: one for the page directory, and two for the chunks of the page table that have valid mappings. The savings for realistically-sized address spaces could obviously be much bigger.

Note that we haven't quite explained the entire picture of how to find the right page table entry for a given VPN. Let's take VPN 254 (a stack page) as an example. Recall that we have a 14-bit virtual address space. Thus, an address that refers to VPN 254 might look like this:

        VPN      offset
     11111110    000000

Recall that we use the top 4 bits of the VPN to index into the page directory. Thus, 1111 will choose the last entry of the page directory above. This points us to a valid chunk of the page table located at address paddr2. We then use the remaining 4 bits of the VPN (1110) to index into that page of the page table and find the desired PTE. 1110 is the next-to-last entry on the page, and tells us that page 254 of our virtual address space is mapped at physical page 45. By concatenating PPN=45 with offset=000000, we can thus form our desired physical address and issue the request to the memory system.

Thus, we can think of this process as splitting the VPN into two components, VPN_pagedir and VPN_chunkindex. VPN_pagedir is used to index into the page directory itself, either to find the relevant chunk of the page table or to inform us that this region of the address space is invalid. If valid, VPN_chunkindex is then used to find the desired PTE within that chunk. Pictorially, we thus end up with something like this:

     /------------- VPN -------------\
     | VPN_pagedir | VPN_chunkindex  |  offset

    VPN_pagedir is used to index into the page directory;
    if the PDE is valid, the address in the PDE locates a chunk
    (one page) of the page table, and VPN_chunkindex then indexes
    into that page of PTEs to find the right PTE.

We can also now see how to decide how to split the bits of the VPN into VPN_pagedir and VPN_chunkindex. Given a page size P and a PTE size S, we know that a page can contain P/S PTEs. Thus, VPN_chunkindex must have just enough bits to select among P/S PTEs, namely log2(P/S) bits. The example above has P=64 and S=4, and thus P/S is 16; the number of bits in VPN_chunkindex is therefore log2(16), or 4. The number of bits in VPN_pagedir is then the total number of bits in the VPN minus the number of bits in VPN_chunkindex.
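Putting the pieces together, here is a minimal sketch of the two-level lookup for the example above (the structures and names are invented for illustration; a real TLB-miss handler or hardware walker would also check protection bits and raise exceptions instead of returning false):

    #include <stdbool.h>
    #include <stdint.h>

    // Parameters of the running example: 14-bit addresses, 64-byte pages,
    // 4-byte PTEs => 16 PTEs per page of page table, 16 PDEs in total.
    #define OFFSET_BITS 6
    #define CHUNK_BITS  4   // log2(P/S) = log2(64/4) = 4

    typedef struct { uint32_t ppn; bool valid; } pte_t;
    typedef struct { pte_t *chunk; bool valid; } pde_t; // one per PT page

    // Translate a 14-bit virtual address; returns false if either
    // level of the tree marks the mapping invalid.
    bool translate(pde_t *pagedir, uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn       = vaddr >> OFFSET_BITS;
        uint32_t offset    = vaddr & ((1u << OFFSET_BITS) - 1);
        uint32_t vpn_pd    = vpn >> CHUNK_BITS;               // top 4 bits
        uint32_t vpn_chunk = vpn & ((1u << CHUNK_BITS) - 1);  // next 4 bits

        pde_t *pde = &pagedir[vpn_pd];        // first load: the PDE
        if (!pde->valid)
            return false;                     // whole chunk unallocated
        pte_t *pte = &pde->chunk[vpn_chunk];  // second load: the PTE
        if (!pte->valid)
            return false;
        *paddr = (pte->ppn << OFFSET_BITS) | offset;
        return true;
    }

For VPN 254, vpn_pd comes out to 15 (1111) and vpn_chunk to 14 (1110), exactly reproducing the walkthrough above. Note also the two memory loads in the code, a cost we return to below.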
The other benefit of multi-level structures is that they allow some freedom in the placement of pieces of the page table. So far, we have assumed that page tables live in physical memory (an assumption we will relax below); if that is the case, and we have a linear page table (basically, just an array of PTEs, indexed by VPN), then the entire linear page table must reside contiguously in physical memory. For a large page table (say 4MB), finding such a large chunk of unused contiguous free physical memory can be quite a challenge. With a multi-level structure, we add a level of indirection: the page directory points to pieces of the page table, and that indirection allows us to place those pages wherever we would like in physical memory.

It should be noted that there is a cost to multi-level tables: on a TLB miss, two loads are required to get the right translation information from the page table (one for the page directory entry, and one for the PTE itself), in contrast to just one load with a linear page table. Thus, the multi-level table is a small example of a *time-space trade-off*. We wanted smaller tables (and got them), but not for free; although in the common case (a TLB hit) performance is obviously identical, a TLB miss suffers a higher cost with this smaller table.

One last note: more than two levels are possible (e.g., x86 has a three-level mode as well). These make for even sparser trees of page tables, potentially saving even more space while again increasing the cost of servicing a TLB miss.

[INVERTED PAGE TABLES]

An even more extreme space savings in the world of page tables is found with *inverted page tables*. Here, instead of having many page tables (one per process of the system), we keep a single page table that has an entry for each *physical page* of the system. The entry tells us which process is using this page, and which virtual page of that process maps to this physical page. Finding the correct entry is now a matter of searching through this data structure. A linear scan would be expensive, and thus a hash table is often built over the base structure to speed up lookups. The PowerPC is one example of an architecture that uses such an approach.

More generally, inverted page tables illustrate what we've said from the beginning: page tables are just data structures. You can do lots of crazy things with data structures, making them smaller or bigger, making them slower or faster. Multi-level and inverted page tables are just two examples of the many things one could do. A small sketch of a hashed inverted table follows.
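As a concrete (if greatly simplified) illustration, here is a minimal sketch of an inverted page table with a hash over (process, virtual page) pairs; all names and sizes are invented, and real designs such as the PowerPC's differ in many details:

    #include <stdint.h>

    // One entry per *physical* page, recording which process and
    // which of its virtual pages currently map to that page.
    #define NUM_PPAGES 1024
    #define HASH_SIZE  1024

    typedef struct {
        int      pid;    // which process uses this physical page
        uint32_t vpn;    // which of its virtual pages maps here
        int      next;   // next PPN on this hash chain (-1 = end)
        int      used;   // is this physical page allocated?
    } ipte_t;

    ipte_t ipt[NUM_PPAGES];       // indexed by physical page number
    int    hash_head[HASH_SIZE];  // (pid, vpn) -> first candidate PPN

    static unsigned hash(int pid, uint32_t vpn) {
        return ((unsigned)pid * 31 + vpn) % HASH_SIZE;
    }

    void ipt_init(void) {
        for (int i = 0; i < HASH_SIZE; i++)
            hash_head[i] = -1;    // all chains start empty
    }

    // Find the physical page backing (pid, vpn); -1 if not mapped.
    int lookup(int pid, uint32_t vpn) {
        for (int ppn = hash_head[hash(pid, vpn)]; ppn != -1;
             ppn = ipt[ppn].next)
            if (ipt[ppn].used && ipt[ppn].pid == pid && ipt[ppn].vpn == vpn)
                return ppn;       // hit: this physical page holds the VPN
        return -1;                // miss: the mapping does not exist
    }

Note that the table's size is proportional to *physical* memory, not to the number or size of address spaces, which is the source of the space savings.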
[PAGING THE PAGE TABLES]

Finally, we discuss the relaxation of one last assumption. Thus far, we have assumed that page tables reside in kernel-owned physical memory. Even with our many tricks to reduce the size of page tables, it is still possible that they may be too big to fit into memory all at once. Thus, some systems place page tables in *kernel virtual memory*, thereby allowing the system to swap some of these page tables to disk when memory pressure gets a little tight. Doing so adds a wrinkle to translation: on a TLB miss, the needed page of the page table may itself be swapped out, and the system must first page it back in before the original translation (and thus the original memory access) can complete; a rough sketch of this wrinkle follows.
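As a rough sketch of that wrinkle (all structures hypothetical and greatly simplified; a real kernel would manage this through its page-fault machinery, not explicit calls), consider a lookup in which each page of the page table may itself be on disk:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PTES_PER_PAGE 16

    typedef struct { uint32_t ppn; bool valid; } pte_t;

    // One page of the page table; when the kernel swaps it out,
    // 'entries' is released and 'disk_addr' records where it went.
    typedef struct {
        pte_t *entries;   // NULL while this page of the table is on disk
        bool   on_disk;
        long   disk_addr;
    } pt_page_t;

    // Stub: a real system would read the page from the swap device.
    static void read_from_swap(long disk_addr, void *buf) {
        (void)disk_addr;
        memset(buf, 0, PTES_PER_PAGE * sizeof(pte_t));
    }

    // Fetch the PTE for a VPN, paging the page-table page in if needed.
    pte_t *get_pte(pt_page_t *table, uint32_t vpn) {
        pt_page_t *p = &table[vpn / PTES_PER_PAGE];
        if (p->on_disk) {
            // Before we can translate the user's address, we must
            // first bring this page of the page table back into memory.
            p->entries = malloc(PTES_PER_PAGE * sizeof(pte_t));
            read_from_swap(p->disk_addr, p->entries);
            p->on_disk = false;
        }
        return &p->entries[vpn % PTES_PER_PAGE];
    }

The extra disk I/O here happens before the "real" page fault (if any) is even discovered, one reason memory pressure on page tables can hurt badly.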