[SMALLER PAGE TABLES: OR HOW TO STOP FILLING MEMORY WITH THOSE DARN THINGS]

We now tackle the second problem that paging introduces: page tables are too big. We start with the linear page table. As you might recall (or might not, this is getting pretty detailed), linear page tables get pretty big. Assume again a 32-bit address space with 4KB pages. As above, we see that this leads to a 4MB page table. Recall also: we have one page table *per process*! Thus, with a hundred active processes (not uncommon on a modern desktop machine), we would be allocating 400MB of memory just for page tables! We are thus in search of some techniques to reduce this heavy burden. There are a lot of them, so let's get going.

[SIMPLE: MAKE BIGGER PAGES]

We could reduce the size of the page table in one simple way: use bigger pages. Take our 32-bit address space again, but this time assume 16KB pages. We would thus have an 18-bit VPN plus a 14-bit offset. Assuming the same size for each PTE (4 bytes), we now have 2^18 entries in our linear page table and thus a total size of 1MB per page table, a factor-of-four reduction in the size of the page table (which exactly mirrors the factor-of-four increase in the size of each page).

The major problem with this approach, however, is that big pages lead to waste *within* each page, a problem known as *internal fragmentation* (as the waste is *internal* to the unit of allocation). Applications thus end up allocating pages but only using little bits and pieces of each, and memory quickly fills up with these overly-large pages. Thus, most systems use relatively small page sizes: 4KB (as in x86) or 8KB (as in SPARCv9). Our problem will not be solved so simply, alas.
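To make the arithmetic concrete, here is a tiny calculator for the size of a linear page table, a minimal sketch that assumes (as above) 4-byte PTEs and one PTE per virtual page:

    #include <stdio.h>
    #include <stdint.h>

    // Size (in bytes) of a linear page table, given the number of
    // virtual-address bits, the page size, and the size of each PTE.
    uint64_t pt_size(int va_bits, uint64_t page_size, uint64_t pte_size) {
        uint64_t num_pages = ((uint64_t)1 << va_bits) / page_size;
        return num_pages * pte_size;   // one PTE per virtual page
    }

    int main(void) {
        // 32-bit address space, 4KB pages: 2^20 entries -> 4MB
        printf("4KB pages:  %llu MB\n",
               (unsigned long long)(pt_size(32, 4096, 4) / (1024 * 1024)));
        // Same address space, 16KB pages: 2^18 entries -> 1MB
        printf("16KB pages: %llu MB\n",
               (unsigned long long)(pt_size(32, 16384, 4) / (1024 * 1024)));
        return 0;
    }

Multiply the first result by a hundred processes and the 400MB figure above falls out directly.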
[HYBRID APPROACH: PAGING AND SEGMENTATION]

Whenever you have two reasonable but different approaches to something in life, you should always examine the combination of the two to see if it can obtain the best of both worlds. We call such a combination a *hybrid*. For example, why eat just chocolate or just peanut butter when you can combine the two in a lovely hybrid known as the Reese's Peanut Butter Cup? [2]

System designers similarly had the idea of combining paging and segmentation in order to reduce the memory overhead of page tables. We can see why this might work by examining a typical linear page table in more detail. Let's say we have a large address space, but the currently used heap and stack are pretty small. Specifically, we use a tiny 16KB address space with 1KB pages:

      0KB  |---------------------|
           |   (program code)    |
      1KB  |---------------------|
           |                     |
           |       (free)        |
           |                     |
      4KB  |---------------------|
           |       (heap)        |
      5KB  |---------------------|
           |         |           |
           |         v           |
           |       (free)        |
           |         ^           |
           |         |           |
     14KB  |---------------------|
           |       (stack)       |
     15KB  |---------------------|
           |       (stack)       |
     16KB  |---------------------|

The linear page table for this address space might look like this:

     PPN | valid | prot | present | dirty
    --------------------------------------
      10 |   1   | r-x  |    1    |   0
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
      23 |   1   | rw-  |    1    |   1
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
       - |   0   | ---  |    -    |   -
      90 |   1   | rw-  |    1    |   1
      91 |   1   | rw-  |    1    |   1

This example assumes the single code page (VPN 0) is mapped to physical page 10, the single heap page (VPN 4) to physical page 23, and the two stack pages at the other end of the address space (VPNs 14 and 15) are mapped to physical pages 90 and 91, respectively. As you can see from the picture, *most* of the page table is unused, full of *invalid* entries. What a waste! And this is for a tiny 16KB address space. Imagine the page table of a 32-bit address space and all the wasted space when most of the address space goes unused. Actually, don't imagine it; it's too gruesome.

Thus, our hybrid approach: instead of having a single page table for the entire address space of the process, why not have one per logical segment? In this example, we might thus have three page tables, one each for the code, heap, and stack portions of the address space. (Note: you could easily do two segments instead of three: one packing the static code and the dynamically-growing heap together, and another holding the dynamically-growing stack. This approach was taken on the VAX minicomputer [3].)

Now, remember with segmentation, we had a segment table that told us where each segment lived in physical memory with base/bounds pairs. We still have that structure in the MMU; here, though, we use it not to point to the location of the segment itself but rather to point to the base physical address of the *page table* of that segment. The bounds register is used similarly to indicate the end of the page table (i.e., how many valid pages it has). A sketch of the resulting address translation follows.
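To make the mechanism concrete, here is a minimal sketch in C of the hybrid translation (all names and the exact bit layout are invented for illustration; we assume a 14-bit virtual address whose top two bits select the segment, matching the 16KB/1KB example above):

    #include <stdbool.h>
    #include <stdint.h>

    // Assumed layout: [ seg (2 bits) | VPN (2 bits) | offset (10 bits) ]
    #define SEG_BITS    2
    #define VPN_BITS    2      // 4 pages per segment with 1KB pages
    #define OFFSET_BITS 10

    typedef struct { uint32_t ppn; bool valid; } pte_t;

    typedef struct {
        pte_t   *pt_base;   // base address of this segment's page table
        uint32_t pt_bound;  // number of valid PTEs in that table
    } seg_t;

    // Returns false on a bounds violation or an invalid mapping;
    // a real MMU would raise an exception instead.
    bool translate(seg_t *segtab, uint32_t vaddr, uint32_t *paddr) {
        uint32_t seg    = vaddr >> (VPN_BITS + OFFSET_BITS);
        uint32_t vpn    = (vaddr >> OFFSET_BITS) & ((1u << VPN_BITS) - 1);
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);

        seg_t *s = &segtab[seg];
        if (vpn >= s->pt_bound)         // past the end of this page table
            return false;
        pte_t *pte = &s->pt_base[vpn];  // index the segment's page table
        if (!pte->valid)
            return false;
        *paddr = (pte->ppn << OFFSET_BITS) | offset;
        return true;
    }

The key point of the hybrid is visible in the bounds check: a segment with little in it gets a short page table, and unused regions of the address space need no page table memory at all.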
[MULTI-LEVEL PAGE TABLES]

A different approach doesn't rely on segmentation but attacks the same problem: how might we get rid of all those invalid regions in the page table, instead of keeping them all in memory? We call this approach a *multi-level page table*, as it turns the linear page table into something like a tree. We now describe this approach in more detail.

Let us imagine a small address space of size 16KB, with 64-byte pages. Thus, we have a 14-bit virtual address space, with 8 bits for the VPN and 6 bits for the offset. A linear page table would have 2^8 (256) entries, even if only a small portion of the address space is in use. Here is one example of such an address space:

                  ______________
     0000 0000   |     code     |
     0000 0001   |     code     |
     0000 0010   |    (free)    |
     0000 0011   |    (free)    |
     0000 0100   |     heap     |
     0000 0101   |     heap     |
     0000 0110   |    (free)    |
     0000 0111   |    (free)    |
      ... ...    |    (free)    |
     1111 1110   |    stack     |
     1111 1111   |    stack     |
                  --------------

In this example, virtual pages 0 and 1 are used for code, virtual pages 4 and 5 for the heap, and virtual pages 254 and 255 for the stack; the rest of the pages of the address space are unused.

To build a two-level page table for this address space, we start with our full linear page table and break it up into page-sized units. Recall our full table (in this example) has 256 entries; let us assume each PTE is 4 bytes in size. Thus, we have a total page table of size 1KB. Given that we have 64-byte pages, the 1KB page table fits into 16 64-byte pages, and each such page can hold 16 PTEs.

What we need now is a structure that can point to each page of the page table. We call this structure the *page directory*; it is a simple linear array with one entry per page of the linear page table. Because we have 16 pages of page table, we need 4 bits to select among them, and we use the top four bits of the VPN to index into the directory:

    VPN:   7  6  5  4  3  2  1  0
          \----------/
                |
                +--> use these bits to index into the page directory

The beauty of the page directory arises when there is a large region of the page table that has all invalid entries. Instead of marking them all invalid in the page table itself, we can instead mark invalid the page directory entry that points to this part of the page table; thus, that portion of the page table does not have to be allocated any memory at all. The page directory for the example address space above:

    pointer to page of PT | valid?
    -------------------------------
          paddr1          |   1
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          ------          |   0
          paddr2          |   1

Each *page directory entry* (PDE) describes something about a chunk of the page table for the address space. In this example, we have two valid chunks of the address space (at the beginning and end), and a number of invalid mappings in-between. At paddr1 (some physical address), we have the first chunk of 16 page table entries, covering the first 16 VPNs in the address space. It would look something like this:

    PPN | valid | prot
    --------------------
     10 |   1   | r-x
     23 |   1   | r-x
     -- |   0   | ---
     -- |   0   | ---
     80 |   1   | rw-
     59 |   1   | rw-
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---

This chunk of the page table contains the mappings for the first 16 VPNs; in our example, VPNs 0 and 1 are valid (the code segment), as are 4 and 5 (the heap). Thus, the table has mapping information for each of those pages; the rest of the entries are marked invalid.

A similar chunk of page table is found at paddr2; this one is for the last 16 VPNs of the address space:

    PPN | valid | prot
    --------------------
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     -- |   0   | ---
     55 |   1   | rw-
     45 |   1   | rw-

Thus, for VPNs 254 and 255 (the stack), we have valid mappings.

Hopefully, what we can see from this example is how much space savings are possible with a multi-level indexed structure. In this example, instead of allocating 16 pages for a linear page table, we allocated only 3: one for the page directory, and two for the chunks of the page table that have valid mappings. The savings for realistically-sized address spaces could obviously be much bigger.

Note that we haven't quite explained the entire picture of how to find the right page table entry for a given VPN. Let's take VPN 254 (a stack page) as an example. Recall that we have a 14-bit virtual address space. Thus, an address that refers to VPN 254 might look like this:

        VPN      offset
     11111110    000000

Recall that we use the top 4 bits of the VPN to index into the page directory. Thus, 1111 will choose the last entry of the page directory above. This points us to a valid chunk of the page table located at address paddr2. We then use the remaining 4 bits of the VPN (1110) to index into that page of the page table and find the desired PTE. 1110 is the next-to-last entry on the page, and tells us that page 254 of our virtual address space is mapped at physical page 45. By concatenating PPN=45 with offset=000000, we can thus form our desired physical address and issue the request to the memory system.

Thus, we can think of this process as splitting the VPN into two components, VPN_pagedir and VPN_chunkindex. VPN_pagedir is used to index into the page directory itself, either to find the relevant chunk of the page table or to inform us that this region of the address space is invalid. If valid, VPN_chunkindex is then used to find the desired PTE within that chunk. Pictorially, we thus end up with something like this:

     /------------- VPN -------------\
     | VPN_pagedir | VPN_chunkindex  |  offset

    VPN_pagedir is used to index into the page directory;
    if the PDE is valid, the address in the PDE locates a chunk
    (one page) of the page table, and VPN_chunkindex then indexes
    into that page of PTEs to find the right PTE.

We can also now see how to decide how to split the bits of the VPN into VPN_pagedir and VPN_chunkindex. Given a page size P and a PTE size S, we know that a page can contain P/S PTEs. Thus, VPN_chunkindex must have just enough bits to select among P/S PTEs, namely log2(P/S) bits. The example above has P=64 and S=4, and thus P/S is 16; the number of bits in VPN_chunkindex is therefore log2(16), or 4. The number of bits in VPN_pagedir is then the total number of bits in the VPN minus the number of bits in VPN_chunkindex.
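Putting the pieces together, here is a minimal sketch of the two-level lookup for the example above (the structures and names are invented for illustration; a real TLB-miss handler or hardware walker would also check protection bits and raise exceptions instead of returning false):

    #include <stdbool.h>
    #include <stdint.h>

    // Parameters of the running example: 14-bit addresses, 64-byte pages,
    // 4-byte PTEs => 16 PTEs per page of page table, 16 PDEs in total.
    #define OFFSET_BITS 6
    #define CHUNK_BITS  4   // log2(P/S) = log2(64/4) = 4

    typedef struct { uint32_t ppn; bool valid; } pte_t;
    typedef struct { pte_t *chunk; bool valid; } pde_t; // one per PT page

    // Translate a 14-bit virtual address; returns false if either
    // level of the tree marks the mapping invalid.
    bool translate(pde_t *pagedir, uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn       = vaddr >> OFFSET_BITS;
        uint32_t offset    = vaddr & ((1u << OFFSET_BITS) - 1);
        uint32_t vpn_pd    = vpn >> CHUNK_BITS;               // top 4 bits
        uint32_t vpn_chunk = vpn & ((1u << CHUNK_BITS) - 1);  // next 4 bits

        pde_t *pde = &pagedir[vpn_pd];        // first load: the PDE
        if (!pde->valid)
            return false;                     // whole chunk unallocated
        pte_t *pte = &pde->chunk[vpn_chunk];  // second load: the PTE
        if (!pte->valid)
            return false;
        *paddr = (pte->ppn << OFFSET_BITS) | offset;
        return true;
    }

For VPN 254, vpn_pd comes out to 15 (1111) and vpn_chunk to 14 (1110), exactly reproducing the walkthrough above. Note also the two memory loads in the code, a cost we return to below.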
The other benefit of multi-level structures is that they allow some freedom in the placement of pieces of the page table. So far, we have assumed that page tables live in physical memory (an assumption we will relax below); if that is the case, and we have a linear page table (basically, just an array of PTEs, indexed by VPN), then the entire linear page table must reside contiguously in physical memory. For a large page table (say 4MB), finding such a large chunk of unused contiguous free physical memory can be quite a challenge. With a multi-level structure, we add a level of indirection: the page directory points to pieces of the page table, and that indirection allows us to place those pages wherever we would like in physical memory.

It should be noted that there is a cost to multi-level tables: on a TLB miss, two loads are required to get the right translation information from the page table (one for the page directory entry, and one for the PTE itself), in contrast to just one load with a linear page table. Thus, the multi-level table is a small example of a *time-space trade-off*. We wanted smaller tables (and got them), but not for free; although in the common case (a TLB hit) performance is obviously identical, a TLB miss suffers a higher cost with this smaller table.

One last note: more than two levels are possible (e.g., x86 has a three-level mode as well). These make for even sparser trees of page tables, potentially saving even more space while again increasing the cost of servicing a TLB miss.

[INVERTED PAGE TABLES]

An even more extreme space savings in the world of page tables is found with *inverted page tables*. Here, instead of having many page tables (one per process of the system), we keep a single page table that has an entry for each *physical page* of the system. The entry tells us which process is using this page, and which virtual page of that process maps to this physical page. Finding the correct entry is now a matter of searching through this data structure. A linear scan would be expensive, and thus a hash table is often built over the base structure to speed up lookups. The PowerPC is one example of an architecture that uses such an approach.

More generally, inverted page tables illustrate what we've said from the beginning: page tables are just data structures. You can do lots of crazy things with data structures, making them smaller or bigger, making them slower or faster. Multi-level and inverted page tables are just two examples of the many things one could do. A small sketch of a hashed inverted table follows.
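As a concrete (if greatly simplified) illustration, here is a minimal sketch of an inverted page table with a hash over (process, virtual page) pairs; all names and sizes are invented, and real designs such as the PowerPC's differ in many details:

    #include <stdint.h>

    // One entry per *physical* page, recording which process and
    // which of its virtual pages currently map to that page.
    #define NUM_PPAGES 1024
    #define HASH_SIZE  1024

    typedef struct {
        int      pid;    // which process uses this physical page
        uint32_t vpn;    // which of its virtual pages maps here
        int      next;   // next PPN on this hash chain (-1 = end)
        int      used;   // is this physical page allocated?
    } ipte_t;

    ipte_t ipt[NUM_PPAGES];       // indexed by physical page number
    int    hash_head[HASH_SIZE];  // (pid, vpn) -> first candidate PPN

    static unsigned hash(int pid, uint32_t vpn) {
        return ((unsigned)pid * 31 + vpn) % HASH_SIZE;
    }

    void ipt_init(void) {
        for (int i = 0; i < HASH_SIZE; i++)
            hash_head[i] = -1;    // all chains start empty
    }

    // Find the physical page backing (pid, vpn); -1 if not mapped.
    int lookup(int pid, uint32_t vpn) {
        for (int ppn = hash_head[hash(pid, vpn)]; ppn != -1;
             ppn = ipt[ppn].next)
            if (ipt[ppn].used && ipt[ppn].pid == pid && ipt[ppn].vpn == vpn)
                return ppn;       // hit: this physical page holds the VPN
        return -1;                // miss: the mapping does not exist
    }

Note that the table's size is proportional to *physical* memory, not to the number or size of address spaces, which is the source of the space savings.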
[PAGING THE PAGE TABLES]

Finally, we discuss the relaxation of one last assumption. Thus far, we have assumed that page tables reside in kernel-owned physical memory. Even with our many tricks to reduce the size of page tables, it is still possible that they may be too big to fit into memory all at once. Thus, some systems place page tables in *kernel virtual memory*, thereby allowing the system to swap some of these page tables to disk when memory pressure gets a little tight. Doing so adds a wrinkle to translation: on a TLB miss, the needed page of the page table may itself be swapped out, and the system must first page it back in before the original translation (and thus the original memory access) can complete; a rough sketch of this wrinkle follows.
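As a rough sketch of that wrinkle (all structures hypothetical and greatly simplified; a real kernel would manage this through its page-fault machinery, not explicit calls), consider a lookup in which each page of the page table may itself be on disk:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PTES_PER_PAGE 16

    typedef struct { uint32_t ppn; bool valid; } pte_t;

    // One page of the page table; when the kernel swaps it out,
    // 'entries' is released and 'disk_addr' records where it went.
    typedef struct {
        pte_t *entries;   // NULL while this page of the table is on disk
        bool   on_disk;
        long   disk_addr;
    } pt_page_t;

    // Stub: a real system would read the page from the swap device.
    static void read_from_swap(long disk_addr, void *buf) {
        (void)disk_addr;
        memset(buf, 0, PTES_PER_PAGE * sizeof(pte_t));
    }

    // Fetch the PTE for a VPN, paging the page-table page in if needed.
    pte_t *get_pte(pt_page_t *table, uint32_t vpn) {
        pt_page_t *p = &table[vpn / PTES_PER_PAGE];
        if (p->on_disk) {
            // Before we can translate the user's address, we must
            // first bring this page of the page table back into memory.
            p->entries = malloc(PTES_PER_PAGE * sizeof(pte_t));
            read_from_swap(p->disk_addr, p->entries);
            p->on_disk = false;
        }
        return &p->entries[vpn % PTES_PER_PAGE];
    }

The extra disk I/O here happens before the "real" page fault (if any) is even discovered, one reason memory pressure on page tables can hurt badly.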