[FASTER PAGING WITH HARDWARE SUPPORT: THE TLB]

When we want to make things fast, the OS needs some help. And help usually comes from one place: the hardware. Here, to speed address translation, we are going to add what is called (for historical reasons [1]) a translation-lookaside buffer, or TLB. A TLB (part of the chip's memory-management unit, or MMU) is simply a hardware *cache* of popular virtual-to-physical address translations. On any memory reference, the hardware will look first in the TLB to see if the desired translation is held therein; if it is, the translation is done (quickly) *without* having to consult the page table (which has all the translations).

Thus, the hardware (and the OS) follows this approach when servicing a memory reference by a process to a given virtual address. Note that the left column tells us whether it is the hardware (hw) or the OS (os) that performs the given action; in some cases (depending on the system), it could be one or the other, and hence we see hw/os.

hw     1  - Extract VPN from the address (easy: just take the top bits of the address)
hw     2  - Use the VPN to index the TLB
hw        if (the translation is in the cache)             // a TLB hit
hw     3a -   get the PPN from the TLB
hw     3b -   use it to form the physical address
hw     3c -   issue a request for the physical address to the memory system
hw        else (the translation is NOT in the cache)       // a TLB miss
hw/os  4a -   look up the translation in the page table (in memory)
hw/os  4b -   if it is *valid*,
hw/os           install the translation into the TLB
hw/os           retry the instruction (which will hopefully now be a TLB hit)
hw     4c -   if not valid, trap (invalid memory access)
os              terminate the process (clean up)

In the common case, we hope that most translations will be found in the TLB (a *TLB hit*) and thus the translation will be quite fast (done entirely in hardware). In the less common case, the translation won't be in the cache (a *TLB miss*), and the system will have to perform extra work: first consult the page table, then update the TLB, and finally try the instruction again.

[TLB: WHO HANDLES THE MISS?]

One question that we must answer: who handles a TLB miss? Two answers are possible: the hardware, or the software. In the olden days, the hardware had complex instruction sets (sometimes called *CISC*, for complex-instruction set computing), and the people who built the hardware didn't much trust those sneaky OS people. Thus, the hardware would handle the TLB miss entirely. To do this, the hardware had to know exactly where the page tables were located in memory, as well as their exact format; on a miss, the hardware would "walk" the page table itself, update the TLB, and retry the instruction (as in steps 4a, 4b, and 4c above). An example of an "older" architecture that has *hardware-managed TLBs* is the Intel x86 architecture, which uses a fixed *multi-level page table* (described below); the current page table is pointed to by the well-known CR3 register.

More modern architectures (e.g., the MIPS R10k or SPARC v9, both *RISC*, or reduced-instruction set, computers) have what are known as *software-managed TLBs*. On a TLB miss, the hardware simply raises a trap, which stops the processor from doing what it has been doing and instead jumps to a *trap handler*. The trap handler is code within the OS written with the express purpose of handling TLB misses. When run, the code looks up the translation in the page table, uses special "privileged" instructions to update the TLB, and then returns from the trap, which allows the hardware to try the instruction again (this time resulting in a TLB hit).
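To make this control flow concrete, here is a minimal sketch in C of the lookup-then-miss path described above. All of the names (translate, TlbEntry, page_table, and so on) are invented for illustration; the page table is modeled as a simple linear array, and replacement is a trivial round-robin choice. Real hardware performs the hit path with parallel comparator logic, and a real software miss handler would be written far more carefully (often in assembly).

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical parameters: 32-bit addresses, 4KB pages, a 32-entry TLB,
     * and a toy linear page table indexed directly by VPN. All names here
     * (TlbEntry, translate, page_table, ...) are invented for illustration. */
    #define PAGE_SHIFT   12
    #define OFFSET_MASK  ((1u << PAGE_SHIFT) - 1)
    #define TLB_SIZE     32
    #define NUM_PAGES    1024

    typedef struct { bool valid; uint32_t vpn, ppn; } TlbEntry;
    typedef struct { bool valid; uint32_t ppn; } PTE;

    static TlbEntry tlb[TLB_SIZE];
    static PTE      page_table[NUM_PAGES];
    static int      next_victim;   /* trivial round-robin replacement */

    /* Translate a virtual address, mimicking the hw/os steps listed above. */
    uint32_t translate(uint32_t vaddr) {
        uint32_t vpn = vaddr >> PAGE_SHIFT;           /* 1: extract the VPN  */

        for (int i = 0; i < TLB_SIZE; i++)            /* 2: search the TLB   */
            if (tlb[i].valid && tlb[i].vpn == vpn)    /* 3a-3c: a TLB hit    */
                return (tlb[i].ppn << PAGE_SHIFT) | (vaddr & OFFSET_MASK);

        /* A TLB miss: 4a, consult the page table (in a real system this is
         * either a hardware page-table walk or an OS trap handler). */
        if (vpn >= NUM_PAGES || !page_table[vpn].valid)
            abort();                                  /* 4c: invalid, trap   */

        /* 4b: install the translation into the TLB, then retry the access. */
        tlb[next_victim] = (TlbEntry){ true, vpn, page_table[vpn].ppn };
        next_victim = (next_victim + 1) % TLB_SIZE;
        return translate(vaddr);                      /* now a TLB hit       */
    }

The recursive call at the end plays the role of step 4b's "retry the instruction": once the translation has been installed, the same access is replayed and now hits in the TLB.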
Note that when you are running the TLB miss-handling code, you need to be extra careful not to cause more TLB misses to occur; otherwise, you will induce an infinite chain of TLB misses. Many solutions are possible: for example, one could keep the TLB miss handlers in unmapped physical memory (and thus they are not subject to translation), or reserve some entries in the TLB for permanently-valid translations and use some of those permanent translation slots for the handler code itself.

The big advantage of the software-managed approach is flexibility: the OS can use any data structure it wants to implement the page table, without any change in the hardware. The hardware is also greatly simplified by not having to worry about these intricate details of memory management.

[TLB: WHAT'S REALLY IN THERE?]

Let's look at the contents of the hardware TLB in more detail. A typical TLB might have 32 entries and be what is called *fully associative*. Basically, this just means that any given translation can be anywhere in the TLB, and that the hardware will search the entire TLB in parallel to find the desired translation. Here is a small TLB that has four entries:

[ VPN1 | PPN1 | some other bits ]
[ VPN2 | PPN2 | some other bits ]
[ VPN3 | PPN3 | some other bits ]
[ VPN4 | PPN4 | some other bits ]

Note that both the VPN and the PPN are present in each entry, as a translation could end up in any one of these locations. More interesting is the "some other bits" field, which is needed for various reasons, which we now describe.

The TLB has a *valid* bit, which says whether the entry holds a valid translation or not. Another common set of bits are the *protection* bits, which determine how a page can be accessed (as in the PTE). For example, code pages might be marked *read and execute*, whereas heap pages might be marked *read and write*.

One more interesting field to have in a TLB entry is known as an *address space identifier*, or ASID. The problem the ASID is there to solve is that the TLB is shared by many processes. For example, when one process (P1) is running, it accesses the TLB with translations that are valid for it; say the 0th page of P1 is mapped to physical page 10. Another process (P2) may also be ready in the system, and the OS might be context-switching between it and P1. Let us assume the 0th page of P2 is mapped to physical page 17. Thus, our TLB with two valid entries might look like this:

  VPN    PPN    valid   prot
[   0  |  10  |   1   |  rwx ]
[ ---- | ---- |   0   |  --- ]
[   0  |  17  |   1   |  rwx ]
[ ---- | ---- |   0   |  --- ]

As you can see from this diagram, the hardware has a problem: when a virtual address is generated with VPN=0, there are *two* entries that have what look like valid translations for that page. Without an ASID, the OS could deal with this by *flushing* the TLB on every context switch (such a flush instruction would be privileged, and would essentially just set all the valid bits in the TLB to 0). However, that is costly: it means that every time a process runs, the first thing it will spend its time doing is suffering from TLB misses in order to fill up the TLB with popular translations again. To avoid this cost, hardware designers introduced the ASID, which allows the hardware to differentiate translations across different processes. You can think of the ASID as a process identifier (PID), but it usually has fewer bits (say 6 for the ASID instead of the full 32 for a PID).
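To make these fields concrete, here is a small C sketch of a fully associative lookup that matches on both the VPN and the ASID and consults the valid and protection bits. The names, field widths, and PROT_* flags are invented for illustration and correspond to no particular architecture; real hardware compares all entries simultaneously with dedicated comparators, whereas the loop below only models that logic.

    #include <stdint.h>
    #include <stdbool.h>

    /* One TLB entry with the fields discussed above; the names, widths, and
     * PROT_* flags are invented and match no particular architecture. */
    #define TLB_SIZE   32
    #define PROT_READ  0x1
    #define PROT_WRITE 0x2
    #define PROT_EXEC  0x4

    typedef struct {
        bool     valid;
        uint32_t vpn;
        uint32_t ppn;
        uint8_t  prot;   /* some combination of PROT_READ/WRITE/EXEC */
        uint8_t  asid;   /* address space identifier (e.g., 6 bits)  */
    } TlbEntry;

    static TlbEntry tlb[TLB_SIZE];

    /* Fully associative lookup: the translation could be in any entry, so
     * every entry is checked (real hardware does this in parallel). An
     * access hits only if an entry is valid, its VPN matches, and its ASID
     * matches the currently running process, so two entries with the same
     * VPN but different ASIDs no longer conflict. */
    bool tlb_lookup(uint32_t vpn, uint8_t cur_asid, uint8_t access,
                    uint32_t *ppn) {
        for (int i = 0; i < TLB_SIZE; i++) {
            TlbEntry *e = &tlb[i];
            if (!e->valid || e->vpn != vpn || e->asid != cur_asid)
                continue;
            if ((e->prot & access) != access)
                return false;   /* real hardware would raise a protection fault */
            *ppn = e->ppn;
            return true;        /* TLB hit  */
        }
        return false;           /* TLB miss */
    }

With the ASID check in place, the two VPN=0 entries from the example above, one tagged with P1's ASID and one with P2's, can safely coexist in the TLB across a context switch.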
Our TLB, with ASIDs in place, would thus look like this:

  VPN    PPN    valid   prot   ASID
[   0  |  10  |   1   |  rwx  |  1  ]
[ ---- | ---- |   0   |  ---  | --- ]
[   0  |  17  |   1   |  rwx  |  2  ]
[ ---- | ---- |   0   |  ---  | --- ]

Note there are also cases in which two entries may point to the same *physical* page, such as this:

  VPN    PPN    valid   prot   ASID
[  10  | 101  |   1   |  r-x  |  1  ]
[ ---- | ---- |   0   |  ---  | --- ]
[  50  | 101  |   1   |  r-x  |  2  ]
[ ---- | ---- |   0   |  ---  | --- ]

This might occur, for example, when two processes *share* a page (a code page, say). In the example above, process P1 is sharing physical page 101 with process P2; P1 maps this page into the 10th page of its address space, whereas P2 maps it to the 50th page of its address space. Sharing of code pages (in binaries, or shared libraries) is useful as it saves space in physical memory.

[TLB: WRAP UP]

We have seen how hardware can help us make address translation faster. By providing a small, dedicated, on-chip TLB as an address-translation cache, most memory references will hopefully be handled *without* having to access the page table in main memory. Thus, in the common case, the performance of the program will be almost as if memory isn't being virtualized at all, an excellent achievement for an operating system.

[1] In fact, the first cache of any kind on a chip was a TLB. Later, people realized that caches for instructions and data would also be useful.