MEMORY: TLBS, SMALLER PAGETABLES

Shivaram Venkataraman
CS 537, Spring 2023
- Project 3 is due Monday
- Project 1 grades
AGENDA / LEARNING OUTCOMES

Memory virtualization
- What are the challenges with paging?
- How do we go about addressing them?
RECAP
Goal: Eliminate requirement that address space is contiguous

Idea:
Divide address spaces and physical memory into fixed-sized pages

Example page size: 4KB
For each mem reference:

1. extract **VPN** (virt page num) from **VA** (virt addr)
2. calculate addr of **PTE** (page table entry)
3. read **PTE** from memory
4. extract **PFN** (page frame num)
5. build **PA** (phys addr)
6. read contents of **PA** from memory

**READ 0x1100**
PROS/CONS OF PAGING

Pros
No external fragmentation
- Any page can be placed in any frame in physical memory

Fast to allocate and free
- Alloc: No searching for suitable free space
- Free: Doesn’t have to coalesce with adjacent free space

Cons
Additional memory reference
- MMU stores only base address of page table

Storage for page tables may be substantial
- Simple page table: Requires PTE for all pages in address space
- Entry needed even if page not allocated?
STRATEGY: CACHE PAGE TRANSLATIONS

[Figure: CPU with a translation cache (TLB), connected over the memory interconnect to RAM, which holds the page table (PT)]

TLB Entry

<table>
<thead>
<tr>
<th>Tag (virtual page number)</th>
<th>Physical page number (page table entry)</th>
</tr>
</thead>
</table>

Fully associative

Any given translation can be anywhere in the TLB
Hardware will search the entire TLB in parallel
**TLB ACCESSES: SEQUENTIAL EXAMPLE**

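CPU's TLB

<table>
<thead>
<tr>
<th>Valid</th>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
<td>5</td>
</tr>
</tbody>
</table>

PTBR → P1 pagetable: 0 1 2 3

<table>
<thead>
<tr>
<th>Virt</th>
<th>Phys</th>
</tr>
</thead>
<tbody>
<tr>
<td>load 0x1000</td>
<td>load 0x0004, load 0x5000</td>
</tr>
<tr>
<td>load 0x1004</td>
<td>(TLB hit) load 0x5004</td>
</tr>
<tr>
<td>load 0x1008</td>
<td>(TLB hit) load 0x5008</td>
</tr>
<tr>
<td>load 0x100c</td>
<td>(TLB hit) load 0x500C</td>
</tr>
<tr>
<td>load 0x2000</td>
<td>…</td>
</tr>
<tr>
<td>load 0x2004</td>
<td></td>
</tr>
</tbody>
</table>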
Consider a processor with a 16-bit address space and 4KB page size. Assume the page table is at 0x2000 and each PTE is 4 bytes.

<table>
<thead>
<tr>
<th>Virtual Addresses</th>
<th>Memory accesses</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x3000: load 0x5320, %eax</td>
<td></td>
</tr>
<tr>
<td>0x3004: load 0x4004, %ebx</td>
<td></td>
</tr>
<tr>
<td>0x3008: mul %ecx, %eax, %ebx</td>
<td></td>
</tr>
<tr>
<td>0x300C: store %ebx, 0x5324</td>
<td></td>
</tr>
<tr>
<td>0x3010: load 0x5328, %ebx</td>
<td></td>
</tr>
</tbody>
</table>
TLB:

<table>
<thead>
<tr>
<th>Valid</th>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>2</td>
<td>6</td>
</tr>
<tr>
<td>0</td>
<td>7</td>
<td>23</td>
</tr>
<tr>
<td>0</td>
<td>2</td>
<td>5</td>
</tr>
<tr>
<td>0</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>89</td>
</tr>
</tbody>
</table>
TLB: POLICIES

How do we replace entries in the TLB?

How do we handle context switches?
PERFORMANCE OF TLB?

Miss rate of TLB: \( \frac{\text{\# TLB misses}}{\text{\# TLB lookups}} \)

Number of TLB lookups? \( = \) number of accesses to \( a \)

Number of TLB misses? \( = \) number of unique pages accessed

Miss rate?

Hit rate?

Would hit rate get better or worse with smaller pages?
Workload A

```c
int sum = 0;
for (int i = 0; i < 2048; i++) {
    sum += a[i];    // sequential: neighboring elements share a page
}
```

Workload B

```c
int sum = 0;
srand(1234);
for (int i = 0; i < 1000; i++) {
    sum += a[rand() % N];
}
srand(1234);    // same seed: the second loop replays the same indices
for (int i = 0; i < 1000; i++) {
    sum += a[rand() % N];
}
```
WORKLOAD ACCESS PATTERNS

Spatial Locality
Sequential Accesses

Temporal Locality
Repeated Random Accesses
**TLB REPLACEMENT POLICIES**

**LRU**: evict Least-Recently Used TLB slot when needed
LRU TROUBLES

Workload repeatedly accesses same offset (0x01) across 5 pages (strided access), but only 4 TLB entries.

What will TLB contents be over time?
How will TLB perform?
TLB REPLACEMENT POLICIES

LRU: evict Least-Recently Used TLB slot when needed

Random: Evict randomly chosen entry

Sometimes random is better than a “smart” policy!
CONTEXT SWITCHES

What happens if a process uses cached TLB entries from another process?

1. Flush TLB on each switch
   Costly → lose all recently cached translations

2. Track which entries are for which process
   – Address Space Identifier
   – Tag each TLB entry with an 8-bit ASID
TLB EXAMPLE WITH ASID

[Figure: PTBR points to the page table (PT); physical memory frames hold pages of P1 and P2]

Virtual

load 0x1444 ASID: 12

load 0x1444 ASID: 11

Physical

TLB:

<table>
<thead>
<tr>
<th>Valid</th>
<th>Virt</th>
<th>Phys</th>
<th>ASID</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>9</td>
<td>11</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>5</td>
<td>11</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>2</td>
<td>12</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>11</td>
</tr>
</tbody>
</table>

P1 pagetable (ASID 11): 1 5 4 ...

P2 pagetable (ASID 12): 6 2 3 ...
TLB PERFORMANCE

Context switches are expensive
Even with ASID, other processes “pollute” TLB

Architectures can have multiple TLBs
- 1 TLB for data, 1 TLB for instructions
- 1 TLB for regular pages, 1 TLB for “super pages”
HW AND OS ROLES

If H/W handles TLB Miss

CPU must know where pagetables are

- CR3 register on x86
- Pagetable structure fixed and agreed upon between HW and OS
- HW “walks” the pagetable and fills TLB

If OS handles TLB Miss:

“Software-managed TLB”

- CPU traps into OS upon TLB miss.
- OS interprets pagetables as it chooses
- Modify TLB entries with privileged instruction
TLB SUMMARY

Pages are great, but accessing page tables for every memory access is slow.

Cache recent page translations $\rightarrow$ TLB
- MMU performs TLB lookup on every memory access.

TLB performance depends strongly on workload:
- Sequential workloads perform well.
- Workloads with temporal locality can perform well.

In different systems, hardware or OS handles TLB misses.

TLBs increase cost of context switches:
- Either flush TLB on every context switch,
- or add an ASID to every TLB entry.
QUIZ 11: MORE TLBS

https://tinyurl.com/cs537-sp23-quiz11

1. What problem(s) can be solved by using ASIDs?

2. For a hardware-managed TLB miss, which of the following statements are true?

3. For a software-managed TLB miss, which of the following statements are true?
Disadvantages of Paging

Additional memory reference to page table → Very inefficient
- Page table must be stored in memory
- MMU stores only base address of page table

Storage for page tables may be substantial
- Simple page table: Requires PTE for all pages in address space
  Entry needed even if page not allocated?
SMALLER PAGE TABLES
WHY ARE PAGE TABLES SO LARGE?

Waste!
Many invalid PT entries

<table>
<thead>
<tr>
<th>PFN</th>
<th>valid</th>
<th>prot</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>1</td>
<td>r-x</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>-</td>
</tr>
<tr>
<td>23</td>
<td>1</td>
<td>rw-</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>-</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>-</td>
</tr>
<tr>
<td>28</td>
<td>1</td>
<td>rw-</td>
</tr>
<tr>
<td>4</td>
<td>1</td>
<td>rw-</td>
</tr>
</tbody>
</table>

...many more invalid...

how to avoid storing these?
AVOID SIMPLE LINEAR PAGE TABLES?

Use more complex page tables, instead of just big array
Any data structure is possible with software-managed TLB
  – Hardware looks for vpn in TLB on every memory access
  – If TLB does not contain vpn, TLB miss
    • Trap into OS and let OS find vpn->ppn translation
    • OS notifies TLB of vpn->ppn for future accesses
OTHER APPROACHES

1. Multi-level Pagetables
   – Page the page tables
   – Page the pagetables of page tables…

2. Inverted Pagetables
MULTILEVEL PAGE TABLES

Goal: Allow page table to be allocated non-contiguously

Idea: Page the page tables
– Creates multiple levels of page tables; outer level “page directory”
– Only allocate page tables for pages in use
– Used in x86 architectures (hardware can walk known structure)
Multilevel Page Tables

20-bit address:

- outer page (4 bits)
- inner page (4 bits)
- page offset (12 bits)

base of page directory
ADDRESS FORMAT FOR MULTILEVEL PAGING

30-bit address:

| outer page | inner page | page offset (12 bits) |

How should logical address be structured? How many bits for each paging level?

Goal?

- Each inner page table fits within a page
- PTE size * number PTE = page size
  
  Assume PTE size = 4 bytes
  
  Page size = 2^12 bytes = 4KB
  
  → # bits for selecting inner page =

Remaining bits for outer page:

- 30 - ___ - ___ = ___ bits
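
One way to work through the blanks, under the stated assumptions (4-byte PTEs, 4KB pages, each inner page table exactly fills one page):

\[ \frac{2^{12} \text{ bytes per page}}{2^{2} \text{ bytes per PTE}} = 2^{10} \text{ PTEs per inner page table} \;\Rightarrow\; 10 \text{ bits for the inner page} \]

\[ 30 - 10 - 12 = 8 \text{ bits for the outer page} \]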
# Multilevel Translation Example

## Page Directory

<table>
<thead>
<tr>
<th>PPN</th>
<th>valid</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x3</td>
<td>1</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>0x92</td>
<td>1</td>
</tr>
</tbody>
</table>

## Page of PT (@PPN: 0x3)

<table>
<thead>
<tr>
<th>PPN</th>
<th>valid</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x10</td>
<td>1</td>
</tr>
<tr>
<td>0x23</td>
<td>1</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>0x80</td>
<td>1</td>
</tr>
<tr>
<td>0x59</td>
<td>1</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
</tbody>
</table>

## Page of PT (@PPN: 0x92)

<table>
<thead>
<tr>
<th>PPN</th>
<th>valid</th>
</tr>
</thead>
<tbody>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>0x55</td>
<td>1</td>
</tr>
<tr>
<td>0x45</td>
<td>1</td>
</tr>
</tbody>
</table>

20-bit address:

- **Outer Page**: 4 bits
- **Inner Page**: 4 bits
- **Page Offset**: 12 bits

Translate 0x01ABC
**PROBLEM WITH 2 LEVELS?**

Problem: page directories (outer level) may not fit in a page

Solution:
- Split page directories into pieces
- Use another page dir to refer to the page dir pieces.

How large is the virtual address space with 4KB pages and 4-byte PTEs (each page table fits in a page)?

1 level: 4KB / 4 bytes → 1K entries per level
2 levels: 
3 levels:
FULL SYSTEM WITH TLBS

On a TLB miss, lookups with more levels are more expensive

Assume 3-level page table
Assume 256-byte pages
Assume 16-bit addresses
Assume ASID of current process is 211

<table>
<thead>
<tr>
<th>ASID</th>
<th>VPN</th>
<th>PFN</th>
<th>Valid</th>
</tr>
</thead>
<tbody>
<tr>
<td>211</td>
<td>0xbb</td>
<td>0x91</td>
<td>1</td>
</tr>
<tr>
<td>211</td>
<td>0xff</td>
<td>0x23</td>
<td>1</td>
</tr>
<tr>
<td>122</td>
<td>0x05</td>
<td>0x91</td>
<td>1</td>
</tr>
<tr>
<td>211</td>
<td>0x05</td>
<td>0x12</td>
<td>0</td>
</tr>
</tbody>
</table>

How many physical accesses for each instruction? (Ignore ops changing TLB)

(a) 0xAA10: movl 0x1111, %edi

(b) 0xBB13: addl $0x3, %edi

(c) 0x0519: movl %edi, 0xFF10
INVERTED PAGE TABLE

Only store entries for virtual pages w/ valid physical mappings

Naïve approach:
- Search through data structure \( <\text{ppn}, \text{vpn}+\text{asid}> \) to find match
- Too much time to search entire table

Better:
- Find possible matching entries by hashing \( \text{vpn}+\text{asid} \)
- Smaller number of entries to search for exact match

Managing inverted page table requires software-controlled TLB
Consider a virtual address space of 16KB with 64-byte pages.

1. How many bits will we have in our virtual address for this address space?

2. What is the total number of entries in the Linear Page Table for such an address space?

3. Consider a two-level page table now with a page directory. How many bits will be used to select the inner page assuming PTE size = 4 bytes?
### Quiz 12

Each 20-bit virtual address is split into an outer page index (4 bits), an inner page index (4 bits), and a page offset (12 bits). The valid column indicates whether an entry holds a valid mapping.

#### Page Directory

<table>
<thead>
<tr>
<th>PPN</th>
<th>Valid</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x3</td>
<td>1</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>0x10</td>
<td>1</td>
</tr>
<tr>
<td>0x23</td>
<td>1</td>
</tr>
<tr>
<td>0x80</td>
<td>1</td>
</tr>
<tr>
<td>0x59</td>
<td>1</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>0x92</td>
<td>1</td>
</tr>
</tbody>
</table>

#### Page of PT (@PPN: 0x3)

<table>
<thead>
<tr>
<th>PPN</th>
<th>Valid</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x10</td>
<td>1</td>
</tr>
<tr>
<td>0x23</td>
<td>1</td>
</tr>
<tr>
<td>0x80</td>
<td>1</td>
</tr>
<tr>
<td>0x59</td>
<td>1</td>
</tr>
</tbody>
</table>

#### Page of PT (@PPN: 0x92)

<table>
<thead>
<tr>
<th>PPN</th>
<th>Valid</th>
</tr>
</thead>
<tbody>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>-</td>
<td>0</td>
</tr>
<tr>
<td>0x55</td>
<td>1</td>
</tr>
<tr>
<td>0x45</td>
<td>1</td>
</tr>
</tbody>
</table>

### Translation Example

- Translate the 20-bit address 0xFEED0.

#### 20-bit address:

<table>
<thead>
<tr>
<th>Outer Page (4 bits)</th>
<th>Inner Page (4 bits)</th>
<th>Page Offset (12 bits)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Summary: Better Page Tables

Problem: Simple linear page tables require too much contiguous memory.

Many options for efficiently organizing page tables:
- If OS traps on TLB miss, OS can use any data structure:
  - Inverted page tables (hashing)
- If Hardware handles TLB miss, page tables must follow specific format:
  - Multi-level page tables used in x86 architecture
    - Each page table fits within a page.
NEXT STEPS

Project 3: In progress

Next class: Swapping!