- Project 1b done!

- Project 2a is out! Due next Friday, Feb 14, 10pm

- Reminder: Midterm 1 makeup

- Discussion section: Process API, Project 2a
AGENDA / LEARNING OUTCOMES

Memory virtualization
  What is paging and how does it work?
  What are some of the challenges in implementing paging?
RECAP
MEMORY VIRTUALIZATION

**Transparency:** Process is unaware of sharing

**Protection:** Cannot corrupt OS or other process memory

**Efficiency:** Do not waste memory or slow down processes

**Sharing:** Enable sharing between cooperating processes
ABSTRACTION: ADDRESS SPACE

- **Program Code**: The code segment where instructions live.
- **Heap**: The heap segment contains malloc'd data and dynamic data structures. It grows downward.
- **Stack**: The stack segment contains local variables, arguments to routines, return values, etc. It grows upward.

### Virtual Address Space

- **0KB to 1KB**: Program Code
- **1KB to 2KB**: Heap
- **2KB to 15KB**: (free)
- **15KB to 16KB**: Stack

### Physical Address Space

Figure: a 512KB physical memory marked at 64KB boundaries (0KB, 64KB, ..., 512KB). The operating system (code, data, etc.) starts at 0KB; Processes C, B, and A (code, data, etc.) each hold a region, and the remaining regions are free.
### Review: Segmentation

0x0010: `movl 0x1100, %edi`

0x0013: `addl $0x3, %edi`

%rip: 0x0010

<table>
<thead>
<tr>
<th>Seg</th>
<th>Base</th>
<th>Bounds</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0x4000</td>
<td>0xfff</td>
</tr>
<tr>
<td>1</td>
<td>0x5800</td>
<td>0xfff</td>
</tr>
<tr>
<td>2</td>
<td>0x6800</td>
<td>0x7ff</td>
</tr>
</tbody>
</table>

1. Fetch instruction at logical addr 0x0010
   - Physical addr: 0x4000 + 0x010 = 0x4010

2. Exec, load from logical addr 0x1100
   - Physical addr: 0x5800 + 0x100 = 0x5900

3. Fetch instruction at logical addr 0x0013
   - Physical addr: 0x4000 + 0x013 = 0x4013

4. Exec, no load
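
The translation steps above can be sketched in C. This is a minimal sketch, not a description of real hardware: it assumes 14-bit logical addresses whose top 2 bits select the segment and whose low 12 bits are the offset (which is what the worked addresses imply), and it treats the Bounds column as the largest valid offset. The names `struct segment`, `seg_table`, and `seg_translate` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical segment descriptor; values mirror the Seg/Base/Bounds table. */
struct segment {
    uint32_t base;
    uint32_t bounds;   /* assumed: largest valid offset in the segment */
};

static const struct segment seg_table[3] = {
    {0x4000, 0xfff},
    {0x5800, 0xfff},
    {0x6800, 0x7ff},
};

/* Translate a 14-bit logical address: the top 2 bits pick the segment,
 * the low 12 bits are the offset. Returns -1 on an invalid access. */
long seg_translate(uint32_t logical)
{
    assert(logical < (1u << 14));
    uint32_t seg = (logical >> 12) & 0x3;
    uint32_t offset = logical & 0xfff;
    if (seg >= 3 || offset > seg_table[seg].bounds)
        return -1;
    return (long)(seg_table[seg].base + offset);
}
```

For example, `seg_translate(0x1100)` selects segment 1 with offset 0x100 and yields 0x5900, matching step 2.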
FRAGMENTATION

Types of fragmentation
- **External**: Visible to allocator (e.g., OS)
- **Internal**: Visible to requester

Definition: Free memory that can’t be usefully allocated

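Figure (external fragmentation): a 64KB physical memory that is not compacted. The operating system is followed by allocated regions separated by gaps that are not in use.

Figure (internal fragmentation): an allocated block split into a useful part and a free part.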
PAGING
Goal: Eliminate requirement that address space is contiguous
- Eliminate external fragmentation
- Grow segments as needed

Idea:
Divide address spaces and physical memory into fixed-sized pages

Size: $2^n$, Example: 4KB
How to translate logical address to physical address?

- High-order bits of address designate page number
- Low-order bits of address designate offset within page

No addition needed; just append bits correctly!
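
A minimal sketch of the "append bits" idea, assuming 4KB pages (12 offset bits). The helper names `vpn_of`, `offset_of`, and `make_pa` are hypothetical; the point is that only shifts and masks are involved, with no addition of a base.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 12u                        /* assumed: 4 KB pages */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

/* High-order bits: page number. */
uint32_t vpn_of(uint32_t va)    { return va >> OFFSET_BITS; }

/* Low-order bits: offset within the page. */
uint32_t offset_of(uint32_t va) { return va & OFFSET_MASK; }

/* Build a physical address by placing the frame number above the offset. */
uint32_t make_pa(uint32_t ppn, uint32_t offset)
{
    assert(offset <= OFFSET_MASK);
    return (ppn << OFFSET_BITS) | offset;
}
```

For example, virtual address 0x1100 splits into page number 1 and offset 0x100.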
**ADDRESS FORMAT**

Given known page size, how many bits are needed in address to specify offset in page?

<table>
<thead>
<tr>
<th>Page Size</th>
<th>Low Bits (offset)</th>
</tr>
</thead>
<tbody>
<tr>
<td>16 bytes</td>
<td>4 bits</td>
</tr>
<tr>
<td>1 KB</td>
<td>10 bits</td>
</tr>
<tr>
<td>1 MB</td>
<td>20 bits</td>
</tr>
<tr>
<td>512 bytes</td>
<td>9 bits</td>
</tr>
<tr>
<td>4 KB</td>
<td>12 bits</td>
</tr>
</tbody>
</table>

\[
\log_2 (\text{Page Size}) = \text{number of bits in offset}
\]

The page size is a fixed configuration choice.

For reference: 1 MB = 1024 x 1 KB
Given number of bits in virtual address and bits for offset, how many bits for virtual page number?

<table>
<thead>
<tr>
<th>Page Size</th>
<th>Low Bits (offset)</th>
<th>Virt Addr Total Bits</th>
<th>High Bits (vpn)</th>
</tr>
</thead>
<tbody>
<tr>
<td>16 bytes</td>
<td>4</td>
<td>10</td>
<td>10 - 4 = 6</td>
</tr>
<tr>
<td>1 KB</td>
<td>10</td>
<td>20</td>
<td>20 - 10 = 10</td>
</tr>
<tr>
<td>1 MB</td>
<td>20</td>
<td>32</td>
<td>12</td>
</tr>
<tr>
<td>512 bytes</td>
<td>9</td>
<td>16</td>
<td>7</td>
</tr>
<tr>
<td>4 KB</td>
<td>12</td>
<td>32</td>
<td>20</td>
</tr>
</tbody>
</table>
Given number of bits for \( \text{vpn} \), how many virtual pages can there be in an address space?

<table>
<thead>
<tr>
<th>Page Size</th>
<th>Low Bits (offset)</th>
<th>Virt Addr Bits</th>
<th>High Bits (vpn)</th>
<th>Virt Pages</th>
</tr>
</thead>
<tbody>
<tr>
<td>16 bytes</td>
<td>4</td>
<td>10</td>
<td>6</td>
<td>\( 2^6 = 64 \)</td>
</tr>
<tr>
<td>1 KB</td>
<td>10</td>
<td>20</td>
<td>10</td>
<td>\( 2^{10} = 1024 \)</td>
</tr>
<tr>
<td>1 MB</td>
<td>20</td>
<td>32</td>
<td>12</td>
<td>\( 2^{12} = 4096 \)</td>
</tr>
<tr>
<td>512 bytes</td>
<td>9</td>
<td>16</td>
<td>7</td>
<td>\( 2^7 = 128 \)</td>
</tr>
<tr>
<td>4 KB</td>
<td>12</td>
<td>32</td>
<td>20</td>
<td>\( 2^{20} \approx 1 \text{ million} \)</td>
</tr>
</tbody>
</table>
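
The three tables can be reproduced with a few lines of C. This is a sketch under the assumption that the page size is a power of two; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of a power-of-two page size = number of offset bits. */
unsigned offset_bits(uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    unsigned bits = 0;
    while (page_size > 1) {
        page_size >>= 1;
        bits++;
    }
    return bits;
}

/* VPN bits = total virtual address bits minus offset bits. */
unsigned vpn_bits(unsigned va_bits, uint64_t page_size)
{
    unsigned off = offset_bits(page_size);
    assert(va_bits >= off);
    return va_bits - off;
}

/* Number of virtual pages = 2^(VPN bits). */
uint64_t num_virtual_pages(unsigned va_bits, uint64_t page_size)
{
    return 1ULL << vpn_bits(va_bits, page_size);
}
```

For example, `num_virtual_pages(32, 4096)` gives 2^20 pages, the last row of the table.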
How should OS translate VPN to PPN?

Number of bits in virtual address need not equal number of bits in physical address

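Figure: an address mapper takes (VPN, offset) and produces (PPN, offset), which locates the data in physical memory. The offset passes through unchanged.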
What is a good data structure?

Simple solution: Linear page table aka array

\[ \text{Size of a page table} = \text{Num entries} \times \text{size of entry} \]

Page table entry (PTE): 4 bytes, of which 20 bits hold the PFN
PER-PROCESS PAGETABLE

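Figure: processes P1, P2, and P3 each have their own page table mapping their virtual memory (Virt Mem) to physical memory (Phys Mem).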
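Match each description to an approach (candidates: Segmentation, Static Relocation, Base, Base+Bounds, Time Sharing):

<table>
<thead>
<tr>
<th>Description</th>
<th>Name of approach</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. one process uses RAM at a time</td>
<td>Time sharing</td>
</tr>
<tr>
<td>2. rewrite code and addresses before running</td>
<td>Static relocation</td>
</tr>
<tr>
<td>3. add per-process starting location to virt addr to obtain phys addr</td>
<td>Base</td>
</tr>
<tr>
<td>4. dynamic approach that verifies address is in valid range</td>
<td>Base + Bounds</td>
</tr>
<tr>
<td>5. several base+bound pairs per process</td>
<td>Segmentation</td>
</tr>
</tbody>
</table>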
Consider a 32-bit address space with 4 KB pages. Assume each PTE is 4 bytes.

How many bits do we need to represent the offset within a page?

\[ 12 \text{ bits} = \log_2 (4\text{KB}) \]

How many virtual pages will we have in this case?

\[ 20 \text{ bits VPN} \Rightarrow 2^{20} \approx 1 \text{ million virtual pages} \]

What will be the overall size of the page table?

\[ \text{Num Virt Pages} \times \text{size (PTE)} = 2^{20} \times 4 \text{ bytes} = 4 \text{ MB} \]
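
As a rough sketch, a linear page table for this configuration is just an array indexed by VPN, and its size falls out of `sizeof`. The names `pte_t`, `page_table`, `page_table_size`, and `pte_for` are hypothetical, and a real OS would allocate one such table per process rather than a single static array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VA_BITS     32u
#define OFFSET_BITS 12u                                /* 4 KB pages */
#define NUM_PAGES   (1u << (VA_BITS - OFFSET_BITS))    /* 2^20 virtual pages */

typedef uint32_t pte_t;                                /* 4-byte PTE */

/* One entry per virtual page, indexed directly by VPN. */
static pte_t page_table[NUM_PAGES];

/* Total bytes occupied by the table: entries times entry size. */
size_t page_table_size(void)
{
    return sizeof page_table;
}

/* Locate the entry that describes the page containing `va`. */
pte_t *pte_for(uint32_t va)
{
    uint32_t vpn = va >> OFFSET_BITS;
    assert(vpn < NUM_PAGES);
    return &page_table[vpn];
}
```

Here `page_table_size()` returns 4 MB, in line with the calculation above.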
WHERE ARE PAGE TABLES STORED?

Implication: Page tables are too large to keep inside the MMU, so store each page table in memory

Hardware finds page table base with register (e.g., CR3 on x86)

What happens on a context-switch?

Change contents of page table base register to newly scheduled process

Save old page table base register in PCB of descheduled process
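
The two context-switch steps can be sketched as follows. This only simulates the idea: `ptbr_reg` is a plain variable standing in for the hardware register (CR3 on x86), and `struct pcb` and `switch_address_space` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily trimmed process control block. */
struct pcb {
    int      pid;
    uint32_t ptbr;   /* saved physical address of this process's page table */
};

/* Stand-in for the hardware page table base register. */
static uint32_t ptbr_reg;

void switch_address_space(struct pcb *old, const struct pcb *next)
{
    assert(old && next);
    old->ptbr = ptbr_reg;     /* save base register of the descheduled process */
    ptbr_reg  = next->ptbr;   /* point hardware at the new process's table */
}
```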
What other info is in pagetable entries besides translation?

- valid bit
- protection bits
- present bit (needed later)
- reference bit (needed later)
- dirty bit (needed later)

Pagetable entries are just bits stored in memory

- Agreement between HW and OS about interpretation
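
One way to picture that agreement is a set of masks over a 32-bit word. The layout below is purely hypothetical (it is not the x86 format): the low bits hold flags and the top 20 bits hold the PFN.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t pte_t;

/* Hypothetical bit assignments, chosen only for illustration. */
#define PTE_VALID     (1u << 0)
#define PTE_PRESENT   (1u << 1)
#define PTE_WRITABLE  (1u << 2)   /* one example protection bit */
#define PTE_REFERENCE (1u << 3)
#define PTE_DIRTY     (1u << 4)
#define PTE_PFN_SHIFT 12u         /* PFN lives in the top 20 bits */

pte_t pte_make(uint32_t pfn, uint32_t flags)
{
    assert(pfn < (1u << 20));
    assert((flags >> PTE_PFN_SHIFT) == 0);
    return (pfn << PTE_PFN_SHIFT) | flags;
}

uint32_t pte_pfn(pte_t pte)    { return pte >> PTE_PFN_SHIFT; }
int pte_is_valid(pte_t pte)    { return (pte & PTE_VALID) != 0; }
int pte_is_dirty(pte_t pte)    { return (pte & PTE_DIRTY) != 0; }
```

Both the hardware and the OS would have to use the same masks for the entry to mean the same thing to each.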
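MEMORY ACCESSES WITH PAGING

14 bit addresses

0x0010: movl 0x1100, %edi

Assume PT is at phys addr 0x5000
Assume PTE's are 4 bytes
Assume 4KB pages
How many bits for offset? 12
How many bits for VPN? 2

Simplified view of page table (the PTE for vpn 1 is at 0x5004)

<table>
<thead>
<tr>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>80</td>
</tr>
<tr>
<td>3</td>
<td>99</td>
</tr>
</tbody>
</table>

Fetch instruction at logical addr 0x0010
Access page table to get ppn for vpn 0
Mem ref 1: read 0x5000
Learn vpn 0 is at ppn __
Fetch instruction at ______ (Mem ref 2)

Exec, load from logical addr 0x1100
Access page table to get ppn for vpn 1
Mem ref 3: read 0x5004
Learn vpn 1 is at ppn __
Movl from 0x0100 into reg (Mem ref 4)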
Memory Accesses with Paging

14 bit addresses

0x0010: movl 0x1100, %edi

Assume PT is at phys addr 0x5000
Assume PTE’s are 4 bytes
Assume 4KB pages
How many bits for offset? 12

Simplified view of page table

<table>
<thead>
<tr>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>2</td>
<td>80</td>
</tr>
<tr>
<td>3</td>
<td>99</td>
</tr>
</tbody>
</table>

Fetch instruction at logical addr 0x0010
Access page table to get ppn for vpn 0
Mem ref 1: ___0x5000___
Learn vpn 0 is at ppn 2
Fetch instruction at ___0x2010___ (Mem ref 2)

Exec, load from logical addr 0x1100
Access page table to get ppn for vpn 1
Mem ref 3: ___0x5004___
Learn vpn 1 is at ppn 0
Movl from ___0x0100___ into reg (Mem ref 4)
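
The four memory references can be checked with a short sketch. It assumes the simplified page table holds PPNs 2, 0, 80, 99 for VPNs 0 to 3 and that each PTE contains only the PPN; `pte_addr` and `translate` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PT_BASE     0x5000u     /* page table physical address */
#define PTE_SIZE    4u          /* bytes per PTE */
#define OFFSET_BITS 12u         /* 4 KB pages */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

/* PPN for VPN 0..3, as in the simplified page table. */
static const uint32_t page_table[4] = {2, 0, 80, 99};

/* Physical address of the PTE the hardware must read for `va`. */
uint32_t pte_addr(uint32_t va)
{
    return PT_BASE + (va >> OFFSET_BITS) * PTE_SIZE;
}

/* Physical address that `va` maps to. */
uint32_t translate(uint32_t va)
{
    uint32_t vpn = va >> OFFSET_BITS;
    assert(vpn < 4);            /* 14-bit addresses leave 2 VPN bits */
    return (page_table[vpn] << OFFSET_BITS) | (va & OFFSET_MASK);
}
```

`pte_addr(0x0010)` and `translate(0x0010)` give mem refs 1 and 2 (0x5000, 0x2010); the same calls on 0x1100 give mem refs 3 and 4 (0x5004, 0x0100).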
ADVANTAGES OF PAGING

No external fragmentation
   – Any page can be placed in any frame in physical memory

Fast to allocate and free
   – Alloc: No searching for suitable free space
   – Free: Doesn’t have to coalesce with adjacent free space

Simple to swap-out portions of memory to disk (later lecture)
   – Page size matches disk block size
   – Can run process when some pages are on disk
   – Add “present” bit to PTE
DISADVANTAGES OF PAGING

Internal fragmentation: Page size may not match size needed by process
  - Wasted memory grows with larger pages
  - Tension?

Additional memory reference to page table → Very inefficient
  - Page table must be stored in memory
  - MMU stores only base address of page table

Storage for page tables may be substantial
  - Simple page table: Requires PTE for all pages in address space
    Entry needed even if page not allocated?
For each mem reference:

1. extract **VPN** (virt page num) from **VA** (virt addr)
2. calculate addr of **PTE** (page table entry)
3. read **PTE** from memory
4. extract **PFN** (page frame num)
5. build **PA** (phys addr)
6. read contents of **PA** from memory into register

Which steps are expensive?
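
The six steps can be written out against a simulated memory to see where the cost lies. This is a sketch with several assumptions: a 64 KB byte array stands in for RAM, a PTE holds nothing but the PFN, and all names (`phys_mem`, `ptbr`, `mem_refs`, `mem_read32`, `mem_write32`, `load`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OFFSET_BITS 12u
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)
#define PTE_SIZE    4u
#define MEM_BYTES   (64u * 1024u)

static uint8_t  phys_mem[MEM_BYTES];  /* simulated RAM */
static uint32_t ptbr;                 /* page table base (physical address) */
static unsigned mem_refs;             /* counts reads that go to memory */

static uint32_t mem_read32(uint32_t pa)
{
    uint32_t v;
    assert(pa + sizeof v <= MEM_BYTES);
    memcpy(&v, &phys_mem[pa], sizeof v);
    mem_refs++;
    return v;
}

/* Setup helper for the simulation; not counted as a reference. */
void mem_write32(uint32_t pa, uint32_t v)
{
    assert(pa + sizeof v <= MEM_BYTES);
    memcpy(&phys_mem[pa], &v, sizeof v);
}

uint32_t load(uint32_t va)
{
    uint32_t vpn      = va >> OFFSET_BITS;                 /* step 1 */
    uint32_t pte_addr = ptbr + vpn * PTE_SIZE;             /* step 2 */
    uint32_t pte      = mem_read32(pte_addr);              /* step 3: memory */
    uint32_t pfn      = pte;                               /* step 4 */
    uint32_t pa       = (pfn << OFFSET_BITS)
                      | (va & OFFSET_MASK);                /* step 5 */
    return mem_read32(pa);                                 /* step 6: memory */
}
```

Steps 1, 2, 4, and 5 are register arithmetic; steps 3 and 6 each touch memory, so `mem_refs` grows by two for every `load`.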
```c
/* Fragment: assumes `a` is an int array of N elements declared elsewhere. */
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += a[i];
}
```

Assume ‘a’ starts at 0x3000
Ignore instruction fetches and access to ‘i’
STRATEGY: CACHE PAGE TRANSLATIONS

Figure: inside the CPU, the MMU consults a translation cache before going over the memory interconnect to RAM, shown for a stream of loads ending at load 0x100C.
TLB: TRANSLATION LOOKASIDE BUFFER
**Fully associative**

Any given translation can be anywhere in the TLB

Hardware will search the entire TLB in parallel
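
A software sketch of a fully associative lookup: hardware compares every entry at once, while the loop below checks them one after another. The 4-entry size and the names `struct tlb_entry`, `tlb_lookup`, and `tlb_insert` are hypothetical, and overwriting slot 0 when the TLB is full is only a placeholder for a real replacement policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 4   /* hypothetical size */

struct tlb_entry {
    bool     valid;
    uint32_t vpn;
    uint32_t ppn;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a hit and stores the translation in *ppn_out. */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn_out)
{
    assert(ppn_out);
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn_out = tlb[i].ppn;
            return true;
        }
    }
    return false;
}

/* Any slot will do, because a translation may live anywhere in the TLB. */
void tlb_insert(uint32_t vpn, uint32_t ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (!tlb[i].valid) {
            tlb[i] = (struct tlb_entry){true, vpn, ppn};
            return;
        }
    }
    tlb[0] = (struct tlb_entry){true, vpn, ppn};  /* full: placeholder policy */
}
```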
int sum = 0;
for (i = 0; i < 2048; i++){
    sum += a[i];
}

Assume following virtual address stream:
load 0x1000
load 0x1004
load 0x1008
load 0x100C
...

What will TLB behavior look like?
TLB ACCESSES: SEQUENTIAL EXAMPLE

**Virt**
- load 0x1000
- load 0x1004
- load 0x1008
- load 0x100c
- ...
- load 0x2000
- load 0x2004

CPU’s TLB

<table>
<thead>
<tr>
<th>Valid</th>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

PTBR points to the P1 pagetable (PT):

<table>
<thead>
<tr>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>5</td>
</tr>
<tr>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>3</td>
<td>3</td>
</tr>
</tbody>
</table>

PT

0 KB  PT
4 KB  PT
8 KB  P1
12 KB P2
16 KB P2
20 KB P1
24 KB P1
28 KB P2
### TLB Accesses: Sequential Example

#### CPU's TLB

<table>
<thead>
<tr>
<th>Valid</th>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
<td>5</td>
</tr>
<tr>
<td>1</td>
<td>2</td>
<td>4</td>
</tr>
</tbody>
</table>

#### PTBR

<table>
<thead>
<tr>
<th>P1 pagetable</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 5 4 ...</td>
</tr>
</tbody>
</table>

#### TLB Accesses

**Virt**
- load 0x1000
- load 0x1004
- load 0x1008
- load 0x100c
- load 0x2000
- load 0x2004

**Phys**
- load 0x0004 (TLB miss: read PTE for vpn 1)
- load 0x5000
- load 0x5004 (TLB hit)
- load 0x5008 (TLB hit)
- load 0x500C (TLB hit)
- load 0x0008 (TLB miss: read PTE for vpn 2)
- load 0x4000
- load 0x4004 (TLB hit)
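
The physical trace can be reproduced with a small simulation. It assumes the page table sits at physical address 0 with 4-byte PTEs (so the PTE for vpn 1 is at 0x0004), that P1 maps vpn 0, 1, 2 to ppn 1, 5, 4, and that the TLB starts empty and never needs to evict. `tlb_access` and the 4-entry TLB size are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSET_BITS 12u
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)
#define PTE_SIZE    4u
#define PTBR        0x0000u   /* assumed page table base */
#define TLB_ENTRIES 4         /* hypothetical size */

static const uint32_t p1_page_table[3] = {1, 5, 4};   /* VPN -> PPN */
static struct { bool valid; uint32_t vpn, ppn; } tlb[TLB_ENTRIES];
static size_t tlb_used;

/* Writes the physical addresses touched by one load into `out` and
 * returns how many there were: 1 on a TLB hit, 2 on a miss. */
size_t tlb_access(uint32_t va, uint32_t *out)
{
    uint32_t vpn = va >> OFFSET_BITS;
    uint32_t off = va & OFFSET_MASK;

    for (size_t i = 0; i < tlb_used; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            out[0] = (tlb[i].ppn << OFFSET_BITS) | off;   /* hit */
            return 1;
        }
    }

    assert(vpn < 3 && tlb_used < TLB_ENTRIES);
    out[0] = PTBR + vpn * PTE_SIZE;                       /* miss: read PTE */
    tlb[tlb_used].valid = true;
    tlb[tlb_used].vpn   = vpn;
    tlb[tlb_used].ppn   = p1_page_table[vpn];
    out[1] = (tlb[tlb_used].ppn << OFFSET_BITS) | off;    /* then the data */
    tlb_used++;
    return 2;
}
```

Feeding it 0x1000, 0x1004, ..., 0x2000, 0x2004 gives two references on the first touch of each page and one on every later access to it.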
Consider a processor with a 16-bit address space and 4 KB page size. Assume the page table is at 0x2000 and each PTE is 4 bytes.

<table>
<thead>
<tr>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>7</td>
</tr>
<tr>
<td>5</td>
<td>8</td>
</tr>
<tr>
<td>3</td>
<td>9</td>
</tr>
<tr>
<td>2</td>
<td>1</td>
</tr>
</tbody>
</table>

Virtual Addresses
- 0x3000: load 0x5320, %eax
- 0x3004: load 0x4004, %ebx
- 0x3008: mul %ecx, %eax, %ebx
- 0x300C: store %ebx, 0x5324
- 0x3010: load 0x5328, %ebx

Total number of memory accesses?
### Memory accesses

<table>
<thead>
<tr>
<th>Valid</th>
<th>VPN</th>
<th>PPN</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>2</td>
<td>6</td>
</tr>
<tr>
<td>0</td>
<td>7</td>
<td>23</td>
</tr>
<tr>
<td>0</td>
<td>2</td>
<td>5</td>
</tr>
<tr>
<td>0</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>89</td>
</tr>
</tbody>
</table>
Performance of TLB?

Miss rate of TLB: #TLB misses / #TLB lookups

#TLB lookups? number of accesses to a = 2048

#TLB misses?

  = number of unique pages accessed
  = 2048 / (elements of ‘a’ per 4K page)
  = 2K / (4K / sizeof(int)) = 2K / 1K
  = 2

Miss rate? = 2/2048 ≈ 0.1%

Hit rate? (1 - miss rate) ≈ 99.9%

Would hit rate get better or worse with smaller pages?
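
A small sketch for experimenting with that question. It assumes a sequential scan, one TLB lookup per element, a TLB that starts empty, and enough TLB entries that nothing is evicted, so each distinct page costs exactly one miss. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct pages touched by a sequential scan of an array,
 * which equals the number of TLB misses under the assumptions above. */
uint64_t seq_tlb_misses(uint64_t base, uint64_t n_elems,
                        uint64_t elem_size, uint64_t page_size)
{
    assert(n_elems > 0 && elem_size > 0 && page_size > 0);
    uint64_t first_page = base / page_size;
    uint64_t last_page  = (base + n_elems * elem_size - 1) / page_size;
    return last_page - first_page + 1;
}

/* Misses divided by lookups (one lookup per element). */
double seq_tlb_miss_rate(uint64_t base, uint64_t n_elems,
                         uint64_t elem_size, uint64_t page_size)
{
    return (double)seq_tlb_misses(base, n_elems, elem_size, page_size)
         / (double)n_elems;
}
```

With the array at 0x1000, 2048 four-byte elements, and 4 KB pages this gives 2 misses; shrinking the page size to 1 KB raises it to 8, so the hit rate gets worse.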
How can system improve hit rate given fixed number of TLB entries?

Increase page size:
Fewer unique page translations needed to access same amount of memory

TLB Reach: Number of TLB entries * Page Size
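
As a quick worked instance of the formula, assuming a hypothetical 64-entry TLB (`tlb_reach` is also a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

/* TLB reach in bytes: how much memory the TLB can map at once. */
uint64_t tlb_reach(uint64_t num_entries, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return num_entries * page_size;
}
```

With 64 entries, 4 KB pages reach 256 KB, while 2 MB pages reach 128 MB with the same number of entries.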
Workload Access Patterns

**Workload A**

```c
/* Fragment: assumes `a` is an int array with at least 2048 elements. */
int sum = 0;
for (int i = 0; i < 2048; i++) {
    sum += a[i];
}
```

Sequential array accesses almost always hit in TLB!

**Workload B**

```c
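/* Fragment: needs <stdlib.h>; assumes `a` is an int array of N elements. */
int sum = 0;
srand(1234);
for (int i = 0; i < 1000; i++) {
    sum += a[rand() % N];
}
srand(1234);  /* same seed, so the second loop repeats the same indices */
for (int i = 0; i < 1000; i++) {
    sum += a[rand() % N];
}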
```
WORKLOAD ACCESS PATTERNS

Spatial Locality: Sequential Accesses (Workload A)

Temporal Locality: Repeated Random Accesses (Workload B)
**Spatial Locality:** future access will be to nearby addresses

**Temporal Locality:** future accesses will be repeats of earlier accesses to the same data

What TLB characteristics are best for each type?

**Spatial:**
- Access same page repeatedly; need same vpn → ppn translation
- Same TLB entry re-used

**Temporal:**
- Access same address again in the near future
- Same TLB entry re-used in near future
- How near in future? How many TLB entries are there?
OTHER TLB CHALLENGES

How to replace TLB entries? LRU? Random?
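
One possible answer, sketched in software: LRU replacement using a logical clock. Real hardware would likely use a cheaper approximation; the 2-entry size and the names `tlb_lookup` and `tlb_insert_lru` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 2   /* hypothetical, small so eviction is easy to see */

struct tlb_entry {
    bool     valid;
    uint32_t vpn, ppn;
    uint64_t last_used;   /* logical time of the most recent use */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint64_t now;      /* logical clock, bumped on every access */

bool tlb_lookup(uint32_t vpn, uint32_t *ppn_out)
{
    assert(ppn_out);
    now++;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].last_used = now;   /* a hit refreshes recency */
            *ppn_out = tlb[i].ppn;
            return true;
        }
    }
    return false;
}

/* Fill a free slot if one exists, else evict the least recently used. */
void tlb_insert_lru(uint32_t vpn, uint32_t ppn)
{
    int victim = 0;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (!tlb[i].valid) {
            victim = i;
            break;
        }
        if (tlb[i].last_used < tlb[victim].last_used)
            victim = i;
    }
    tlb[victim] = (struct tlb_entry){true, vpn, ppn, ++now};
}
```

A random policy would instead pick `victim` with a random index, which avoids tracking `last_used` at all.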

What happens to the TLB on context switches? Handled by HW or OS?
Project 2a is out!

Discussion today: Process API, Project 2a

Next class: More TLBs and better pagetables!