## **MEMORY: PAGING AND TLBS**

Shivaram Venkataraman CS 537, Spring 2023

## **ADMINISTRIVIA**

- Project 2 done!
- Project 3 is out! Start early?

## AGENDA / LEARNING OUTCOMES

Memory virtualization

What is paging and how does it work? What are some of the challenges in implementing paging?

## RECAP

## MEMORY VIRTUALIZATION

Transparency: Process is unaware of sharing

Protection: Cannot corrupt OS or other process memory

Efficiency: Do not waste memory or slow down processes

Sharing: Enable sharing between cooperating processes

## **RECAP: WHAT IS IN ADDRESS SPACE?**



#### **REVIEW: SEGMENTATION**

Divide address space into logical segments

Each segment corresponds to logical entity in address space (code, stack, heap)

Each segment has separate base + bounds register

How does process designate a particular segment?

- Top bits of logical address select segment
- Low bits of logical address select offset within segment

## **EXAMPLE: SEGMENTATION**

0x0010: movl 0x1100, %edi

1. Fetch instruction at logical addr 0x0010. Physical addr:

%rip:0x0010

| Seg | Base   | Bounds |
|-----|--------|--------|
| 0   | 0x4000 | 0xfff  |
| 1   | 0x5800 | 0xfff  |
| 2   | 0x6800 | 0x7ff  |

2. Exec, load from logical addr 0x1100 Physical addr:
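The two lookups above can be checked with a small sketch. It assumes 14-bit logical addresses whose top 2 bits select the segment and whose low 12 bits are the offset, and it treats the bounds value as the largest legal offset. The names `seg_translate` and `seg_table` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SHIFT   12        /* assumption: offset is the low 12 bits */
#define OFFSET_MASK 0xfffu
#define NUM_SEGS    3

struct segment { uint32_t base; uint32_t bounds; };

/* Segment table from the example. */
static const struct segment seg_table[NUM_SEGS] = {
    { 0x4000, 0xfff },
    { 0x5800, 0xfff },
    { 0x6800, 0x7ff },
};

/* Translate a 14-bit logical address; returns -1 where hardware would trap. */
static int32_t seg_translate(uint32_t logical)
{
    assert(logical < (1u << 14));               /* 14-bit address space */
    uint32_t seg    = logical >> SEG_SHIFT;     /* top bits select the segment */
    uint32_t offset = logical & OFFSET_MASK;    /* low bits are the offset */

    /* Assumption: bounds holds the largest legal offset in the segment. */
    if (seg >= NUM_SEGS || offset > seg_table[seg].bounds)
        return -1;
    return (int32_t)(seg_table[seg].base + offset);
}
```

Under these assumptions the instruction fetch at 0x0010 lands in segment 0 and the load from 0x1100 lands in segment 1.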

## QUIZ 8! https://tinyurl.com/cs537-sp23-quiz8



| Segment | Base   | Bounds | RW  |
|---------|--------|--------|-----|
| 0       | 0x2000 | 0x6ff  | 10  |
| 1       | 0x0000 | 0x4ff  | 11  |
| 2       | 0x3000 | 0xfff  | 11  |
| 3       | 0x0000 | 0x000  | 00  |

Remember:

1 hex digit $\rightarrow$ 4 bits

Translate logical (in hex) to physical

0x0240:

0x1108:

0x265c:

0x3002:

## HOW DOES THIS LOOK IN X86

- Stack Segment (SS): Pointer to the stack
- Code Segment (CS): Pointer to the code
- Data Segment (DS): Pointer to the data
- Extra Segment (ES): Pointer to extra data
- F Segment (FS): Pointer to more extra data
- G Segment (GS): Pointer to still more extra data

## NOTE: HOW DO STACKS GROW?



Stack goes $16K \rightarrow 12K$ in the virtual address space; in physical memory it is $28K \rightarrow 24K$

Segment base is at 28K

Virtual address 0x3C00 = 15K
→ top 2 bits (0x3) select the segment, offset is 0xC00 = 3K

How do we make the CPU translate that?

Negative offset = offset - max segment size = 3K - 4K = -1K

Add to base = 28K - 1K = 27K
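The negative-offset arithmetic can be sketched as below. The sketch assumes a 12-bit offset, so the maximum segment size is 4K. The function name `translate_grows_down` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KB           1024
#define MAX_SEG_SIZE (4 * KB)   /* 12-bit offset, so a segment spans at most 4K */

/* Physical address for a segment that grows toward lower addresses. */
static int32_t translate_grows_down(int32_t base, uint32_t vaddr)
{
    int32_t offset   = (int32_t)(vaddr & 0xfff);  /* low 12 bits, e.g. 0xC00 = 3K */
    int32_t negative = offset - MAX_SEG_SIZE;     /* e.g. 3K - 4K = -1K */
    assert(negative < 0);                         /* always below the base */
    return base + negative;                       /* e.g. 28K - 1K = 27K */
}
```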

## **ADVANTAGES OF SEGMENTATION**

Stack and heap can grow independently

- Heap: If no data on free list, dynamic memory allocator requests more from OS (e.g., UNIX: malloc calls sbrk())
- Stack: OS recognizes reference outside legal segment, extends stack implicitly

Different protection for different segments

- Enables sharing of selected segments
- Read-only status for code

Supports dynamic relocation of each segment

## **DISADVANTAGES OF SEGMENTATION**

Each segment must be allocated contiguously

May not have sufficient physical memory for large segments?

**External Fragmentation**

(Figure: non-compacted physical memory from 0KB to 64KB, with the operating system at the low end and allocated regions interleaved with regions not in use.)

## PAGING


Goal:

- Eliminate requirement that address space is contiguous
- Eliminate external fragmentation
- Grow segments as needed

Idea:

Divide address spaces and physical memory into fixed-sized pages

Size: $2^n$, Example: 4KB



## **TRANSLATION OF PAGE ADDRESSES**

How to translate logical address to physical address?

- High-order bits of address designate page number
- Low-order bits of address designate offset within page



No addition needed; just append bits correctly!
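A minimal sketch of the split-and-append idea, assuming 4KB pages (12 offset bits). The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 12                          /* assumption: 4KB pages */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

/* High-order bits: virtual page number. */
static uint32_t vpn_of(uint32_t vaddr)    { return vaddr >> OFFSET_BITS; }

/* Low-order bits: offset within the page. */
static uint32_t offset_of(uint32_t vaddr) { return vaddr & OFFSET_MASK; }

/* Build the physical address by placing the PPN above the unchanged offset. */
static uint32_t make_paddr(uint32_t ppn, uint32_t offset)
{
    assert(offset <= OFFSET_MASK);
    return (ppn << OFFSET_BITS) | offset;       /* shift and OR, no adder */
}
```

Because the page size is a power of two, the offset passes through untouched and only the page number is replaced.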

## **ADDRESS FORMAT**

Given known page size, how many bits are needed in address to specify offset in page?

| Page Size | Low Bits (offset) |
|-----------|-------------------|
| 16 bytes  |                   |
| 1 KB      |                   |
| 1 MB      |                   |
| 512 bytes |                   |
| 4 KB      |                   |

## **ADDRESS FORMAT**

Given number of bits in virtual address and bits for offset, how many bits for virtual page number?

| Page Size | Low Bits (offset) | Virt Addr Bits | High Bits (vpn) |
|-----------|-------------------|----------------|-----------------|
| 16 bytes  | 4                 | 10             |                 |
| 1 KB      | 10                | 20             |                 |
| 1 MB      | 20                | 32             |                 |
| 512 bytes | 9                | 16                      |                |
| 4 KB      | 12               | 32                      |                |

## **ADDRESS FORMAT**

Given number of bits for vpn, how many virtual pages can there be in an address space?

| Page Size | Low Bits (offset) | Virt Addr Bits | High Bits (vpn) | Virt Pages |
|-----------|-------------------|----------------|-----------------|------------|
| 16 bytes  | 4                 | 10             | 6               |            |
| 1 KB      | 10                | 20             | 10              |            |
| 1 MB      | 20                | 32             | 12              |            |
| 512 bytes | 9                 | 16             | 7               |            |
| 4 KB      | 12                | 32             | 20              |            |
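The three address-format tables follow one rule: offset bits = log2(page size), VPN bits = address bits minus offset bits, and virtual pages = 2^(VPN bits). A small sketch, with hypothetical helper names, that can be used to check any row:

```c
#include <assert.h>
#include <stdint.h>

/* Offset bits = log2(page size); the page size must be a power of two. */
static unsigned offset_bits(uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    unsigned bits = 0;
    while ((page_size >> bits) > 1)
        bits++;
    return bits;
}

/* VPN bits = total virtual address bits minus offset bits. */
static unsigned vpn_bits(unsigned vaddr_bits, uint64_t page_size)
{
    return vaddr_bits - offset_bits(page_size);
}

/* Number of virtual pages = 2^(VPN bits). */
static uint64_t num_virtual_pages(unsigned vaddr_bits, uint64_t page_size)
{
    return 1ULL << vpn_bits(vaddr_bits, page_size);
}
```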

# VIRTUAL $\rightarrow$ PHYSICAL PAGE MAPPING



How should OS translate VPN to PPN?

## PAGETABLES


What is a good data structure?

Simple solution: Linear page table aka array



| Bits  | 31–12 | 11–9 | 8 | 7   | 6 | 5 | 4   | 3   | 2   | 1   | 0 |
|-------|-------|------|---|-----|---|---|-----|-----|-----|-----|---|
| Field | PFN   |      | G | PAT | D | A | PCD | PWT | U/S | R/W | P |

## **PER-PROCESS PAGETABLE**



## FILL IN PAGETABLE



# QUIZ 9

https://tinyurl.com/cs537-sp23-quiz9



#### Description

- 1. one process uses RAM at a time
- 2. rewrite code and addresses before running
- 3. add per-process starting location to virt addr to obtain phys addr
- 4. dynamic approach that verifies address is in valid range
- 5. several base+bound pairs per process

#### Name of approach

Candidates: Segmentation, Static Relocation, Base, Base+Bounds, Time Sharing

## QUIZ 9: HOW BIG IS A PAGETABLE?

Consider a **32-bit** address space with 4 KB pages. Assume each PTE is 4 bytes

How many bits do we need to represent the offset within a page?

How many virtual pages will we have in this case?

What will be the overall size of the page table?
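One way to work through these three questions, assuming a linear (single-level) page table with one PTE per virtual page:

$$
\begin{aligned}
\text{offset bits} &= \log_2(4\,\text{KB}) = \log_2(2^{12}) = 12 \\
\text{virtual pages} &= 2^{32-12} = 2^{20} \approx 1\text{ million} \\
\text{page table size} &= 2^{20} \times 4\ \text{bytes} = 2^{22}\ \text{bytes} = 4\ \text{MB}
\end{aligned}
$$

That is 4 MB per process, since each process has its own page table.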

## WHERE ARE PAGETABLES STORED?

Implication: Store each page table in memory

Hardware finds page table base with register (e.g., CR3 on x86)

What happens on a context-switch?

Change contents of page table base register to newly scheduled process

Save old page table base register in PCB of descheduled process

# **OTHER PAGETABLE INFO**

What other info is in pagetable entries besides translation?

- valid bit
- protection bits
- present bit (needed later)
- reference bit (needed later)
- dirty bit (needed later)

Pagetable entries are just bits stored in memory

Agreement between HW and OS about interpretation
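To make "just bits stored in memory" concrete, here is a sketch that packs and unpacks a PTE using the bit positions of the 32-bit x86 layout (P in bit 0, R/W in bit 1, U/S in bit 2, A in bit 5, D in bit 6, PFN in bits 31–12). The helper names are hypothetical. x86 has no separate valid bit, so the present bit stands in for it here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the 32-bit x86 PTE layout. */
#define PTE_P         (1u << 0)   /* present */
#define PTE_RW        (1u << 1)   /* writable (protection) */
#define PTE_US        (1u << 2)   /* user-accessible (protection) */
#define PTE_A         (1u << 5)   /* accessed, i.e. the reference bit */
#define PTE_D         (1u << 6)   /* dirty */
#define PTE_PFN_SHIFT 12          /* PFN lives in bits 31..12 */

static uint32_t pte_pfn(uint32_t pte)  { return pte >> PTE_PFN_SHIFT; }
static bool pte_present(uint32_t pte)  { return (pte & PTE_P) != 0; }
static bool pte_writable(uint32_t pte) { return (pte & PTE_RW) != 0; }
static bool pte_dirty(uint32_t pte)    { return (pte & PTE_D) != 0; }

/* Pack a frame number and flag bits into one 32-bit entry. */
static uint32_t make_pte(uint32_t pfn, uint32_t flags)
{
    assert(pfn < (1u << 20) && flags < (1u << PTE_PFN_SHIFT));
    return (pfn << PTE_PFN_SHIFT) | flags;
}
```

The OS writes entries with something like `make_pte`, and the hardware reads the same bits during translation. Both sides must agree on the layout.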

## **MEMORY ACCESSES WITH PAGING**

14-bit addresses

0x0010: movl 0x1100, %edi

Assume:

- PT is at phys addr 0x5000
- PTEs are 4 bytes
- 4KB pages

How many bits for offset? 12

Fetch instruction at logical addr 0x0010

Access page table to get ppn for vpn 0

Mem ref 1: \_\_\_\_

Learn vpn 0 is at ppn \_\_\_\_

Fetch instruction at \_\_\_\_\_ (Mem ref 2)

Simplified view of page table



## **MEMORY ACCESSES WITH PAGING**

14-bit addresses

0x0010: movl 0x1100, %edi

Assume:

- PT is at phys addr 0x5000
- PTEs are 4 bytes
- 4KB pages

How many bits for offset? 12

Exec, load from logical addr 0x1100

Access page table to get ppn for vpn 1

Mem ref 3: \_\_\_\_

Learn vpn 1 is at ppn \_\_\_\_\_

Movl from \_\_\_\_\_ into reg (Mem ref 4)

Simplified view of page table (figure; visible entries: 0, 80, 99)

## **MEMORY ACCESSES WITH PAGING**

14-bit addresses

0x0010: movl 0x1100, %edi

Assume:

- PT is at phys addr 0x5000
- PTEs are 4 bytes
- 4KB pages

How many bits for offset? 12

Simplified view of page table

Fetch instruction at logical addr 0x0010

Access page table to get ppn for vpn 0

Mem ref 1: \_\_\_\_0x5000\_\_\_\_

Learn vpn 0 is at ppn 2

Fetch instruction at \_\_\_\_0x2010\_\_\_ (Mem ref 2)

Exec, load from logical addr 0x1100

Access page table to get ppn for vpn 1

Mem ref 3: \_\_\_\_0x5004\_\_\_\_

Learn vpn 1 is at ppn 0

Movl from \_\_\_\_0x0100\_\_\_\_ into reg (Mem ref 4)

## **PROS/CONS OF PAGING**

No external fragmentation

Any page can be placed in any frame in physical memory

Fast to allocate and free

- Alloc: No searching for suitable free space
- Free: Doesn't have to coalesce with adjacent free space

Internal fragmentation

- Page size may not match process needs
- Wasted memory grows with larger pages

Additional memory reference to page table

- Page table must be stored in memory
- MMU stores only base address of page table

Storage for page tables may be substantial

- Requires PTE for all pages in address space
- Entry needed even if page not allocated?

### **SUMMARY: PAGE TRANSLATION STEPS**

For each mem reference:

- 1. extract **VPN** (virt page num) from **VA** (virt addr)
- 2. calculate addr of **PTE** (page table entry)
- 3. read **PTE** from memory
- 4. extract **PFN** (page frame num)
- 5. build PA (phys addr)
- 6. read contents of **PA** from memory into register
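The six steps can be sketched over a simulated physical memory. This is an illustration under stated assumptions, not a description of real MMU hardware: 4KB pages, 4-byte PTEs, and a simplified PTE that holds only the PFN. The names `phys_mem`, `mem_read32`, `mem_write32`, `translate`, and `load` are made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OFFSET_BITS 12                          /* 4KB pages */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)
#define PTE_SIZE    4
#define MEM_SIZE    0x6000

static uint8_t  phys_mem[MEM_SIZE];   /* simulated physical memory */
static unsigned mem_refs;             /* number of simulated memory reads */

static uint32_t mem_read32(uint32_t paddr)
{
    uint32_t v;
    assert(paddr + sizeof v <= MEM_SIZE);
    memcpy(&v, &phys_mem[paddr], sizeof v);
    mem_refs++;
    return v;
}

static void mem_write32(uint32_t paddr, uint32_t v)
{
    assert(paddr + sizeof v <= MEM_SIZE);
    memcpy(&phys_mem[paddr], &v, sizeof v);
}

/* Steps 1-5: return the physical address for vaddr. */
static uint32_t translate(uint32_t ptbr, uint32_t vaddr)
{
    uint32_t vpn      = vaddr >> OFFSET_BITS;       /* 1. extract VPN */
    uint32_t pte_addr = ptbr + vpn * PTE_SIZE;      /* 2. address of PTE */
    uint32_t pte      = mem_read32(pte_addr);       /* 3. read PTE (memory!) */
    uint32_t pfn      = pte;                        /* 4. simplified PTE is just the PFN */
    return (pfn << OFFSET_BITS) | (vaddr & OFFSET_MASK);  /* 5. build PA */
}

/* Step 6: the load itself, a second memory read. */
static uint32_t load(uint32_t ptbr, uint32_t vaddr)
{
    return mem_read32(translate(ptbr, vaddr));
}
```

With the page table from the earlier example (PT at 0x5000, vpn 0 → ppn 2, vpn 1 → ppn 0), `translate(0x5000, 0x0010)` gives 0x2010. Steps 3 and 6 are the ones that touch memory, so every load costs two memory references.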

Which steps are expensive?

### **EXAMPLE: ARRAY ITERATOR**

```
int sum = 0;
for (i=0; i<N; i++){
    sum += a[i];
}
```

Assume 'a' starts at 0x3000. Ignore instruction fetches and access to 'i'.

Virtual address stream: load 0x3000, load 0x3004, load 0x3008, load 0x300C

What physical addresses?

- load 0x100C, load 0x7000
- load 0x100C, load 0x7004
- load 0x100C, load 0x7008
- load 0x100C, load 0x700C
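These addresses are consistent with a page table based at 0x1000, 4-byte PTEs, 4KB pages, and a mapping of VPN 3 to PPN 7. Those values are inferred from the addresses, not stated on the slide. Under those assumptions, the second access (virtual 0x3004) works out as:

$$
\begin{aligned}
\text{VPN} &= \texttt{0x3004} \gg 12 = 3, \qquad \text{offset} = \texttt{0x004} \\
\text{PTE addr} &= \texttt{0x1000} + 3 \times 4 = \texttt{0x100C} \\
\text{PA} &= (7 \ll 12) \mid \texttt{0x004} = \texttt{0x7004}
\end{aligned}
$$

Every element lies on the same virtual page, so the same PTE at 0x100C is re-read on every iteration.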

#### **STRATEGY: CACHE PAGE TRANSLATIONS**



## **TLB: TRANSLATION LOOKASIDE BUFFER**

### **TLB ORGANIZATION**

#### **TLB Entry**

- Tag (virtual page number)
- Physical page number (page table entry)



#### Fully associative

Any given translation can be anywhere in the TLB

Hardware will search the entire TLB in parallel
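A software sketch of a fully associative lookup. Hardware compares all tags at once; the loop below checks them one at a time but returns the same answer. The 4-entry size and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 4   /* hypothetical size, chosen for illustration */

struct tlb_entry {
    bool     valid;
    uint32_t vpn;   /* tag */
    uint32_t ppn;   /* cached translation */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true and sets *ppn on a hit; returns false on a miss. */
static bool tlb_lookup(uint32_t vpn, uint32_t *ppn)
{
    assert(ppn != NULL);
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;
        }
    }
    return false;
}

/* Fully associative: any slot may hold any translation. */
static void tlb_insert(size_t slot, uint32_t vpn, uint32_t ppn)
{
    assert(slot < TLB_ENTRIES);
    tlb[slot] = (struct tlb_entry){ .valid = true, .vpn = vpn, .ppn = ppn };
}
```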

## ARRAY ITERATOR (W/ TLB)

Code: `int sum = 0; for (i = 0; i < 2048; i++) { sum += a[i]; }`

Assume 'a' starts at 0x1000. Ignore instruction fetches and access to 'i'.

Assume following virtual address stream: load 0x1000, load 0x1004, load 0x1008, load 0x100C, ...

What will TLB behavior look like?

## **TLB ACCESSES: SEQUENTIAL EXAMPLE**




## QUIZ 10: TLBS

https://tinyurl.com/cs537-sp23-quiz10

Consider a processor with a 16-bit address space and 4 KB page size. Assume the page table is at 0x2000 and each PTE is 4 bytes.



| VPN | PPN |
|-----|-----|
| 4   | 7   |
| 5   | 8   |
| 3   | 9   |
| 2   | 1   |

Virtual Addresses

- 0x3000: load 0x5320, %eax
- 0x3004: load 0x4004, %ebx
- 0x3008: mul %ecx, %eax, %ebx
- 0x300C: store %ebx, 0x5324
- 0x3010: load 0x5328, %ebx

Memory accesses



Total number of memory accesses

## QUIZ 10: TLBS

#### Simplified view of the PT

| VPN | PPN |
|-----|-----|
| 4   | 7   |
| 5   | 8   |
| 3   | 9   |
| 2   | 1   |

Virtual Addresses

- 0x3000: load 0x5320, %eax
- 0x3004: load 0x4004, %ebx
- 0x3008: mul %ecx, %eax, %ebx
- 0x300C: store %ebx, 0x5324
- 0x3010: load 0x5328, %ebx

| Valid | VPN | PPN |
|-------|-----|-----|
| 0     | 2   | 6   |
| 0     | 7   | 23  |
| 0     | 2   | 5   |
| 0     | 3   | 2   |
| 0     | 1   | 89  |

#### Memory accesses

## PERFORMANCE OF TLB?

Miss rate of TLB: #TLB misses / #TLB lookups

#TLB lookups? = number of accesses to a = 2048

#TLB misses?

- = number of unique pages accessed
- = 2048 / (elements of 'a' per 4K page)
- = 2048 / 1024 = 2 (with 4-byte ints, a 4K page holds 1024 elements)

Would hit rate get better or worse with smaller pages?

Miss rate? = 2/2048 = 0.1%

Hit rate? (1 – miss rate) = 99.9%
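The counts can be reproduced with a tiny simulation. It assumes 4KB pages and 4-byte array elements, and it uses a one-entry TLB, which is enough for a purely sequential scan. The names `scan_array` and `miss_rate` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define ELEM_SIZE 4u      /* assumption: sizeof(int) == 4 */

struct tlb_stats { unsigned lookups; unsigned misses; };

/* Count TLB lookups and misses for a sequential scan of n elements
 * starting at virtual address `start`. */
static struct tlb_stats scan_array(uint32_t start, unsigned n)
{
    struct tlb_stats s = { 0, 0 };
    bool     valid = false;     /* one cached translation is enough here */
    uint32_t cached_vpn = 0;

    for (unsigned i = 0; i < n; i++) {
        uint32_t vpn = (start + i * ELEM_SIZE) / PAGE_SIZE;
        s.lookups++;
        if (!valid || cached_vpn != vpn) {   /* first touch of a new page */
            s.misses++;
            valid = true;
            cached_vpn = vpn;
        }
    }
    return s;
}

static double miss_rate(struct tlb_stats s)
{
    assert(s.lookups > 0);
    return (double)s.misses / s.lookups;
}
```

Starting the array at 0x1004 instead of 0x1000 makes the scan straddle three pages instead of two, so alignment can change the miss count.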

## **TLB PERFORMANCE**

How can system improve hit rate given fixed number of TLB entries?

Increase page size:

Fewer unique page translations needed to access same amount of memory

TLB Reach: Number of TLB entries \* Page Size

#### **WORKLOAD ACCESS PATTERNS**

#### Workload A

```
int sum = 0;
for (i=0; i<2048; i++) {
    sum += a[i];
}
```

Sequential array accesses almost always hit in TLB!

#### Workload B

```
int sum = 0;
srand(1234);
for (i=0; i<1000; i++) {
    sum += a[rand() % N];
}
srand(1234);
for (i=0; i<1000; i++) {
    sum += a[rand() % N];
}
```

## **WORKLOAD ACCESS PATTERNS**



### WORKLOAD LOCALITY

Spatial Locality: future access will be to nearby addresses

Temporal Locality: future access will be repeats to the same data

What TLB characteristics are best for each type?

Spatial:

- Access same page repeatedly; need same vpn  $\rightarrow$  ppn translation
- Same TLB entry re-used

Temporal:

- Access same address near in future
- Same TLB entry re-used in near future
- How near in future? How many TLB entries are there?

## **OTHER TLB CHALLENGES**

How to replace TLB entries? LRU? Random?
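As one possible answer, here is a sketch of LRU replacement in a deliberately tiny two-entry TLB. It tracks recency with a software timestamp, which is an illustrative choice and not how any particular hardware does it. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 2   /* hypothetical, tiny on purpose */

struct lru_tlb {
    uint32_t vpn[TLB_ENTRIES];
    uint64_t last_used[TLB_ENTRIES];  /* 0 means the slot is empty */
    uint64_t clock;                   /* incremented on every access */
    unsigned hits, misses;
};

/* Touch `vpn`: on a miss, evict the least recently used slot. */
static void lru_access(struct lru_tlb *t, uint32_t vpn)
{
    assert(t != NULL);
    size_t victim = 0;
    t->clock++;
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (t->last_used[i] != 0 && t->vpn[i] == vpn) {
            t->last_used[i] = t->clock;   /* hit: refresh recency */
            t->hits++;
            return;
        }
        if (t->last_used[i] < t->last_used[victim])
            victim = i;                   /* oldest (or empty) slot so far */
    }
    t->misses++;
    t->vpn[victim] = vpn;                 /* replace the LRU entry */
    t->last_used[victim] = t->clock;
}
```

On the VPN trace 1, 2, 1, 3, 2 this gives one hit (the second access to page 1) and four misses: page 3 evicts page 2, and the return to page 2 then evicts page 1.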

TLB on context switches? HW or OS?

## **NEXT STEPS**

Project 3 is out!

Next class: More TLBs and better pagetables!