[FASTER PAGING WITH HARDWARE SUPPORT: THE TLB]

When we want to make things fast, the OS needs some help. And help usually comes from one place: the hardware. Here, to speed address translation, we are going to add what is called (for historical reasons [1]) a translation-lookaside buffer, or TLB. A TLB (part of the chip's memory-management unit, or MMU) is simply a hardware *cache* of popular virtual-to-physical address translations. On any memory reference, the hardware will look first in the TLB to see if the desired translation is held therein; if it is, the translation is done (quickly) *without* having to consult the page table (which has all the translations).

Thus, the hardware (and the OS) follows this approach when servicing a memory reference by a process to a given virtual address. Note that the left column tells us whether it is the hardware (hw) or the OS (os) that performs the given action; in some cases (depending on the system), it could be one or the other, and hence we see hw/os.

hw     1  - Extract VPN from the address (easy: just take the top bits of the address)
hw     2  - Use the VPN to index the TLB
hw        if (the translation is in the cache)             // a TLB hit
hw     3a -   get the PPN from the TLB
hw     3b -   use it to form the physical address
hw     3c -   issue a request for the physical address to the memory system
hw        else (the translation is NOT in the cache)       // a TLB miss
hw/os  4a -   look up the translation in the page table (in memory)
hw/os  4b -   if it is *valid*,
hw/os           install the translation into the TLB
hw/os           retry the instruction (which will hopefully now be a TLB hit)
hw     4c -   if not valid, trap (invalid memory access)
os              terminate the process (clean up)

In the common case, we hope that most translations will be found in the TLB (a *TLB hit*) and thus the translation will be quite fast (done entirely in hardware). In the less common case, the translation won't be in the cache (a *TLB miss*), and the system will have to perform extra work: first consult the page table, then update the TLB, and finally try the instruction again.

[TLB: WHO HANDLES THE MISS?]

One question that we must answer: who handles a TLB miss? Two answers are possible: the hardware, or the software. In the olden days, the hardware had complex instruction sets (sometimes called *CISC*, for complex-instruction set computing), and the people who built the hardware didn't much trust those sneaky OS people. Thus, the hardware would handle the TLB miss entirely. To do this, the hardware had to know exactly where the page tables were located in memory, as well as their exact format; on a miss, the hardware would "walk" the page table itself, update the TLB, and retry the instruction (as in steps 4a, 4b, and 4c above). An example of an "older" architecture that has *hardware-managed TLBs* is the Intel x86 architecture, which uses a fixed *multi-level page table* (described below); the current page table is pointed to by the well-known CR3 register.

More modern architectures (e.g., the MIPS R10k or SPARC v9, both *RISC*, or reduced-instruction set, computers) have what are known as *software-managed TLBs*. On a TLB miss, the hardware simply raises a trap, which stops the processor from doing what it has been doing and instead jumps to a *trap handler*. The trap handler is code within the OS written with the express purpose of handling TLB misses. When run, the code looks up the translation in the page table, uses special "privileged" instructions to update the TLB, and then returns from the trap, which allows the hardware to try the instruction again (this time resulting in a TLB hit).
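To make this control flow concrete, here is a minimal sketch in C of the lookup-then-miss path described above. All of the names (translate, TlbEntry, page_table, and so on) are invented for illustration; the page table is modeled as a simple linear array, and replacement is a trivial round-robin choice. Real hardware performs the hit path with parallel comparator logic, and a real software miss handler would be written far more carefully (often in assembly).

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical parameters: 32-bit addresses, 4KB pages, a 32-entry TLB,
     * and a toy linear page table indexed directly by VPN. All names here
     * (TlbEntry, translate, page_table, ...) are invented for illustration. */
    #define PAGE_SHIFT   12
    #define OFFSET_MASK  ((1u << PAGE_SHIFT) - 1)
    #define TLB_SIZE     32
    #define NUM_PAGES    1024

    typedef struct { bool valid; uint32_t vpn, ppn; } TlbEntry;
    typedef struct { bool valid; uint32_t ppn; } PTE;

    static TlbEntry tlb[TLB_SIZE];
    static PTE      page_table[NUM_PAGES];
    static int      next_victim;   /* trivial round-robin replacement */

    /* Translate a virtual address, mimicking the hw/os steps listed above. */
    uint32_t translate(uint32_t vaddr) {
        uint32_t vpn = vaddr >> PAGE_SHIFT;           /* 1: extract the VPN  */

        for (int i = 0; i < TLB_SIZE; i++)            /* 2: search the TLB   */
            if (tlb[i].valid && tlb[i].vpn == vpn)    /* 3a-3c: a TLB hit    */
                return (tlb[i].ppn << PAGE_SHIFT) | (vaddr & OFFSET_MASK);

        /* A TLB miss: 4a, consult the page table (in a real system this is
         * either a hardware page-table walk or an OS trap handler). */
        if (vpn >= NUM_PAGES || !page_table[vpn].valid)
            abort();                                  /* 4c: invalid, trap   */

        /* 4b: install the translation into the TLB, then retry the access. */
        tlb[next_victim] = (TlbEntry){ true, vpn, page_table[vpn].ppn };
        next_victim = (next_victim + 1) % TLB_SIZE;
        return translate(vaddr);                      /* now a TLB hit       */
    }

The recursive call at the end plays the role of step 4b's "retry the instruction": once the translation has been installed, the same access is replayed and now hits in the TLB.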
Note that when you are running the TLB miss-handling code, you need to be extra careful not to cause more TLB misses to occur; otherwise, you will induce an infinite chain of TLB misses. Many solutions are possible: for example, one could keep the TLB miss handlers in unmapped physical memory (and thus they are not subject to translation), or reserve some entries in the TLB for permanently-valid translations and use some of those permanent translation slots for the handler code itself.

The big advantage of the software-managed approach is flexibility: the OS can use any data structure it wants to implement the page table, without any change in the hardware. The hardware is also greatly simplified by not having to worry about these intricate details of memory management.

[TLB: WHAT'S REALLY IN THERE?]

Let's look at the contents of the hardware TLB in more detail. A typical TLB might have 32 entries and be what is called *fully associative*. Basically, this just means that any given translation can be anywhere in the TLB, and that the hardware will search the entire TLB in parallel to find the desired translation. Here is a small TLB that has four entries:

[ VPN1 | PPN1 | some other bits ]
[ VPN2 | PPN2 | some other bits ]
[ VPN3 | PPN3 | some other bits ]
[ VPN4 | PPN4 | some other bits ]

Note that both the VPN and the PPN are present in each entry, as a translation could end up in any one of these locations. More interesting is the "some other bits" field, which is needed for various reasons, which we now describe.

The TLB has a *valid* bit, which says whether the entry holds a valid translation or not. Another common set of bits are the *protection* bits, which determine how a page can be accessed (as in the PTE). For example, code pages might be marked *read and execute*, whereas heap pages might be marked *read and write*.

One more interesting field to have in a TLB entry is known as an *address space identifier*, or ASID. The problem the ASID is there to solve is that the TLB is shared by many processes. For example, when one process (P1) is running, it accesses the TLB with translations that are valid for it; say the 0th page of P1 is mapped to physical page 10. Another process (P2) may also be ready in the system, and the OS might be context-switching between it and P1. Let us assume the 0th page of P2 is mapped to physical page 17. Thus, our TLB with two valid entries might look like this:

  VPN    PPN    valid   prot
[   0  |  10  |   1   |  rwx ]
[ ---- | ---- |   0   |  --- ]
[   0  |  17  |   1   |  rwx ]
[ ---- | ---- |   0   |  --- ]

As you can see from this diagram, the hardware has a problem: when a virtual address is generated with VPN=0, there are *two* entries that have what look like valid translations for that page. Without an ASID, the OS could deal with this by *flushing* the TLB on every context switch (such a flush instruction would be privileged, and would essentially just set all the valid bits in the TLB to 0). However, that is costly: it means that every time a process runs, the first thing it will spend its time doing is suffering from TLB misses in order to fill up the TLB with popular translations again. To avoid this cost, hardware designers introduced the ASID, which allows the hardware to differentiate translations across different processes. You can think of the ASID as a process identifier (PID), but it usually has fewer bits (say 6 for the ASID instead of the full 32 for a PID).
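To make these fields concrete, here is a small C sketch of a fully associative lookup that matches on both the VPN and the ASID and consults the valid and protection bits. The names, field widths, and PROT_* flags are invented for illustration and correspond to no particular architecture; real hardware compares all entries simultaneously with dedicated comparators, whereas the loop below only models that logic.

    #include <stdint.h>
    #include <stdbool.h>

    /* One TLB entry with the fields discussed above; the names, widths, and
     * PROT_* flags are invented and match no particular architecture. */
    #define TLB_SIZE   32
    #define PROT_READ  0x1
    #define PROT_WRITE 0x2
    #define PROT_EXEC  0x4

    typedef struct {
        bool     valid;
        uint32_t vpn;
        uint32_t ppn;
        uint8_t  prot;   /* some combination of PROT_READ/WRITE/EXEC */
        uint8_t  asid;   /* address space identifier (e.g., 6 bits)  */
    } TlbEntry;

    static TlbEntry tlb[TLB_SIZE];

    /* Fully associative lookup: the translation could be in any entry, so
     * every entry is checked (real hardware does this in parallel). An
     * access hits only if an entry is valid, its VPN matches, and its ASID
     * matches the currently running process, so two entries with the same
     * VPN but different ASIDs no longer conflict. */
    bool tlb_lookup(uint32_t vpn, uint8_t cur_asid, uint8_t access,
                    uint32_t *ppn) {
        for (int i = 0; i < TLB_SIZE; i++) {
            TlbEntry *e = &tlb[i];
            if (!e->valid || e->vpn != vpn || e->asid != cur_asid)
                continue;
            if ((e->prot & access) != access)
                return false;   /* real hardware would raise a protection fault */
            *ppn = e->ppn;
            return true;        /* TLB hit  */
        }
        return false;           /* TLB miss */
    }

With the ASID check in place, the two VPN=0 entries from the example above, one tagged with P1's ASID and one with P2's, can safely coexist in the TLB across a context switch.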
Our TLB, with ASIDs in place, would thus look like this:

  VPN    PPN    valid   prot   ASID
[   0  |  10  |   1   |  rwx  |  1  ]
[ ---- | ---- |   0   |  ---  | --- ]
[   0  |  17  |   1   |  rwx  |  2  ]
[ ---- | ---- |   0   |  ---  | --- ]

Note there are also cases in which two entries may point to the same *physical* page, such as this:

  VPN    PPN    valid   prot   ASID
[  10  | 101  |   1   |  r-x  |  1  ]
[ ---- | ---- |   0   |  ---  | --- ]
[  50  | 101  |   1   |  r-x  |  2  ]
[ ---- | ---- |   0   |  ---  | --- ]

This might occur, for example, when two processes *share* a page (a code page, say). In the example above, process P1 is sharing physical page 101 with process P2; P1 maps this page into the 10th page of its address space, whereas P2 maps it to the 50th page of its address space. Sharing of code pages (in binaries, or shared libraries) is useful as it saves space in physical memory.

[TLB: WRAP UP]

We have seen how hardware can help us make address translation faster. By providing a small, dedicated, on-chip TLB as an address-translation cache, most memory references will hopefully be handled *without* having to access the page table in main memory. Thus, in the common case, the performance of the program will be almost as if memory isn't being virtualized at all, an excellent achievement for an operating system.

[1] In fact, the first cache of any kind on a chip was a TLB. Later, people realized that caches for instructions and data would also be useful.