#### ADVANCED TOPICS: VIRTUAL MACHINES

Shivaram Venkataraman

CS 537, Fall 2024

#### **ADMINISTRIVIA**

Project 5 happened?

Project 6 – last project!

- Early deadline this week!
- Final deadline end of next week

Shivaram office hours

- TODAY at 1pm!

#### AGENDA / LEARNING OUTCOMES

How to virtualize a machine underneath the OS?

#### PERSISTENCE RECAP

- Managing I/O devices is a significant part of the OS
- Disk drives, SSDs (pages, blocks)
- File systems: OS-provided API to access disk
- Simple FS: FS layout with superblock, bitmaps, inodes, data blocks
- Fast File System: key idea is to put inode & data close together (namespace locality)
- FSCK, journaling: handling/preventing data inconsistencies
- Log-Structured File System: organize data based on writes


#### VIRTUAL MACHINES



#### VIRTUAL MACHINE USE CASES

Share mainframe systems (1970s)

Personal computing (1990s/2000s)

#### Cloud Computing

- Consolidate multiple tenants running different OS
- Strong Isolation

- Tenants do not interfere with each other

Datacenter

Run applications that only exist for specific OS

Testing, Debugging

- Check that software works on a different OS

#### **DEFINITIONS**

A virtual machine is a complete compute environment with its own isolated processing capabilities, memory, and communication channels.



#### VIRTUAL MACHINE MONITORS

Bare-metal Hypervisor (type-1): direct control of all resources

Hosted Hypervisor (type-2): operates as part of or on top of an existing host OS

KVM → part of Linux



#### **GOALS**

• Equivalence – The exposed resource is equivalent to the underlying computer.

• Safety – Virtual machines must be isolated from each other as well as from the hypervisor.

• Performance – The virtual system must show at worst a minor decrease in speed.

#### CAN WE VIRTUALIZE? (POPEK GOLDBERG 1974)

The processor's system state, called the processor status word (PSW), consists of the tuple (M, B, L, PC):

- the execution level $M = \{s, u\}$ (supervisor or user mode)
- the segment register (B, L), where B is the base and L the limit (segmented memory model)
- the current program counter (PC), a virtual address

A virtual machine monitor may be constructed if the set of sensitive instructions for a computer is a subset of the set of privileged instructions.

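> {control-sensitive} ∪ {behavior-sensitive} ⊆ {privileged}.
>
> - Control-sensitive: instructions which can change system state
> - Behavior-sensitive: instructions whose behavior can be modified by system state
> - Privileged: instructions which can only be run in kernel mode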
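The subset condition above can be checked mechanically. The following is a minimal sketch using a toy ISA; the instruction names (`set_mode`, `read_mode`, and so on) are hypothetical and not taken from any real architecture.

```python
def is_virtualizable(control_sensitive, behavior_sensitive, privileged):
    """Check the Popek-Goldberg condition: sensitive must be a subset of privileged.

    Returns (ok, offenders), where offenders are the sensitive instructions
    that do NOT trap when run in user mode.
    """
    sensitive = set(control_sensitive) | set(behavior_sensitive)
    offenders = sensitive - set(privileged)
    return (not offenders, offenders)


# Toy ISA (hypothetical instruction names, for illustration only).
toy_control = {"set_mode", "load_segment"}   # can change system state
toy_behavior = {"read_mode"}                 # behavior depends on system state
toy_privileged = {"set_mode", "load_segment", "read_mode", "halt"}

ok, offenders = is_virtualizable(toy_control, toy_behavior, toy_privileged)

# If read_mode were allowed in user mode, the condition would fail.
ok2, offenders2 = is_virtualizable(
    toy_control, toy_behavior, toy_privileged - {"read_mode"}
)
print(ok, ok2, offenders2)
```

A single sensitive instruction that does not trap is enough to break the condition, because the VMM never gets a chance to intercept it.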

#### VIRTUALIZING THE CPU

#### Limited Direct Execution

How to handle privileged instructions (e.g., traps for system calls)?

Enter the VMM, which emulates how this happens on real hardware.
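A rough sketch of this trap-and-emulate idea follows. It assumes a toy CPU with three hypothetical privileged instructions; the class and instruction names are made up for illustration and do not model any real hardware.

```python
PRIVILEGED = {"set_trap_table", "return_from_trap", "disable_interrupts"}


class PrivilegedInstructionTrap(Exception):
    """Raised by the toy CPU when user-mode code runs a privileged instruction."""

    def __init__(self, instr, operand):
        super().__init__(instr)
        self.instr = instr
        self.operand = operand


class ToyCPU:
    def __init__(self):
        self.mode = "user"  # the guest OS always runs deprivileged

    def execute(self, instr, operand=None):
        if instr in PRIVILEGED and self.mode != "kernel":
            raise PrivilegedInstructionTrap(instr, operand)
        return f"ran {instr} directly"


class ToyVMM:
    def __init__(self, cpu):
        self.cpu = cpu
        # Virtual CPU state that the guest OS believes it controls.
        self.vcpu = {"mode": "kernel", "trap_table": None}
        self.log = []

    def run_guest(self, instr, operand=None):
        try:
            # Limited direct execution: unprivileged instructions run natively.
            self.log.append(self.cpu.execute(instr, operand))
        except PrivilegedInstructionTrap as trap:
            self.emulate(trap.instr, trap.operand)

    def emulate(self, instr, operand):
        # Apply the instruction's effect to the virtual state only.
        if instr == "set_trap_table":
            self.vcpu["trap_table"] = operand
        elif instr == "return_from_trap":
            self.vcpu["mode"] = "user"
        self.log.append(f"VMM emulated {instr}")


vmm = ToyVMM(ToyCPU())
vmm.run_guest("add")
vmm.run_guest("set_trap_table", 0x1000)
print(vmm.log)
```

The guest's trap table is recorded in the VMM's virtual state, so the VMM knows where to jump in the guest OS when an application later makes a system call.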



#### **BEFORE: SYSTEM CALL FLOW**



Transfer control to trap handler. Execute appropriate syscall routine

#### **NEW: SYSTEM CALL**



#### USER MODE, KERNEL MODE?

MIPS architecture:

- Guest OS runs in "supervisor" mode
- No privileged instructions, some extra memory

Run Guest OS in user mode

How to protect Guest OS data structures?

- Guest OS → supervisor mode
- VMM → kernel mode
- Guest OS data structures go in the extra memory available in supervisor mode

#### **QUIZ 20**

Log structured SSD consisting of 3 blocks and 10 pages per block. Each page holds a single character.

For each page, the diagram shows its state (i, v, or E), the data stored, and an indicator of whether the page is currently live (i.e., has a mapping in the FTL).

- read(page#) -- if page is live returns the character at the page, otherwise error
- write(page#,char) -- writes character to logical page #
- erase(page#) -- removes logical page # from the FTL mapping





If a write(0,'q') is now performed by the OS on the SSD state from the last question, what underlying SSD operations must be performed in order to accomplish this write?



#### VIRTUALIZING MEMORY

Challenge: Who manages physical memory allocation?

How do we share physical memory across Guest OSes?



Page tables: Virtual → Physical. Extra level of indirection!

Guest OS page table (Virtual Address Space → "Physical Memory"):

- VPN 0 to PFN 10
- VPN 2 to PFN 03
- VPN 3 to PFN 08

VMM mapping ("Physical Memory" → **Machine Memory**):

- PFN 03 to MFN 06
- PFN 08 to MFN 10
- PFN 10 to MFN 05
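The two levels of mapping compose into a single Virtual → Machine translation. A small sketch using the page numbers from this slide (the dictionary and function names are illustrative):

```python
# Guest OS page table: VPN -> PFN (what the guest believes is physical memory).
guest_page_table = {0: 10, 2: 3, 3: 8}

# VMM mapping: PFN -> MFN (where the page really lives in machine memory).
vmm_pfn_to_mfn = {3: 6, 8: 10, 10: 5}


def translate(vpn):
    """Follow both levels of indirection: VPN -> PFN -> MFN."""
    pfn = guest_page_table[vpn]
    return vmm_pfn_to_mfn[pfn]


for vpn in sorted(guest_page_table):
    print(f"VPN {vpn} -> PFN {guest_page_table[vpn]} -> MFN {translate(vpn)}")
# VPN 0 -> PFN 10 -> MFN 5
# VPN 2 -> PFN 3 -> MFN 6
# VPN 3 -> PFN 8 -> MFN 10
```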



#### **BEFORE: SOFTWARE TLB HANDLER**





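#### **NEW: SOFTWARE TLB HANDLER**

TLB miss (2 traps):

1. The TLB miss traps into the VMM, which calls the guest OS TLB miss handler.
2. The OS walks its page table to get Virtual → Physical.
3. The OS tries to update the TLB using a privileged instruction, which traps into the VMM.
4. The VMM trap handler translates Physical → Machine and updates the TLB.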
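The miss path can be sketched as below, counting one trap for the miss itself and one for the guest's privileged TLB update. This is a simplified model with hypothetical class names; it ignores the guest's return-from-trap and any other instructions.

```python
class TlbVMM:
    def __init__(self, pfn_to_mfn):
        self.pfn_to_mfn = pfn_to_mfn  # PFN -> MFN, owned by the VMM
        self.tlb = {}                 # hardware TLB contents: VPN -> MFN
        self.traps = 0

    def on_tlb_miss(self, guest, vpn):
        self.traps += 1                    # trap 1: the TLB miss itself
        guest.tlb_miss_handler(self, vpn)  # VMM calls the guest OS handler

    def on_privileged_tlb_update(self, vpn, pfn):
        self.traps += 1                    # trap 2: guest tried to update the TLB
        # Install Virtual -> Machine instead of Virtual -> Physical.
        self.tlb[vpn] = self.pfn_to_mfn[pfn]


class GuestOS:
    def __init__(self, page_table):
        self.page_table = page_table  # VPN -> PFN

    def tlb_miss_handler(self, vmm, vpn):
        pfn = self.page_table[vpn]
        # The guest runs deprivileged, so its TLB update ends up in the VMM.
        vmm.on_privileged_tlb_update(vpn, pfn)


def access(vmm, guest, vpn):
    """Return the MFN for vpn, taking the miss path if it is not cached."""
    if vpn not in vmm.tlb:
        vmm.on_tlb_miss(guest, vpn)
    return vmm.tlb[vpn]


vmm = TlbVMM({3: 6, 8: 10, 10: 5})
guest = GuestOS({0: 10, 2: 3, 3: 8})
first = access(vmm, guest, 2)
traps_after_miss = vmm.traps
second = access(vmm, guest, 2)  # TLB hit: no further traps
print(first, traps_after_miss, vmm.traps)
```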

#### TLB MISS OVERHEADS

Extra trap into VMM for Physical  $\rightarrow$  Machine mapping

Avoid it by using a software "TLB" in the VMM to cache Virtual → Physical

VMM maintains a shadow page table of Virtual → Machine for every guest OS

Trap when OS tries to update PTE (e.g., lcr3)

- The guest OS updates a PTE in its page table
- The update traps into the VMM
- The VMM creates (or updates) the shadow PT
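A minimal sketch of shadow page table maintenance, assuming every guest PTE update traps into the VMM. The class and method names are hypothetical, and the page numbers for guest "A" reuse the earlier memory example.

```python
class ShadowPageTables:
    def __init__(self, pfn_to_mfn):
        self.pfn_to_mfn = pfn_to_mfn               # guest -> {PFN: MFN}
        self.shadow = {g: {} for g in pfn_to_mfn}  # guest -> {VPN: MFN}
        self.traps = 0

    def on_pte_update(self, guest, vpn, pfn):
        """Called when a guest's PTE write traps; pfn=None means unmap."""
        self.traps += 1
        if pfn is None:
            self.shadow[guest].pop(vpn, None)
        else:
            # Store the composed Virtual -> Machine mapping.
            self.shadow[guest][vpn] = self.pfn_to_mfn[guest][pfn]

    def translate(self, guest, vpn):
        # One lookup, with no call into the guest OS.
        return self.shadow[guest][vpn]


spt = ShadowPageTables({"A": {3: 6, 8: 10, 10: 5}, "B": {0: 7}})
spt.on_pte_update("A", 0, 10)
spt.on_pte_update("A", 2, 3)
spt.on_pte_update("B", 0, 0)
print(spt.translate("A", 0), spt.translate("A", 2), spt.translate("B", 0))
```

The cost moves from TLB misses to page table updates: each guest PTE write now traps, but a later miss can be served from the shadow table directly.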

#### SO, CAN WE VIRTUALIZE X86?

Some unprivileged instructions expose hardware / processor state.

Table 2.2: List of sensitive, unprivileged x86 instructions

| Group                                | Instructions                           |
|--------------------------------------|----------------------------------------|
| Access to interrupt flag             | pushf, popf, iret                      |
| Visibility into segment descriptors  | lar, verr, verw, lsl                   |
| Segment manipulation instructions    | `pop <seg>`, `push <seg>`, `mov <seg>` |
| Read-only access to privileged state | sgdt, sldt, sidt, smsw                 |
| Interrupt and gate instructions      | fcall, longjump, retfar, str, `int <n>` |
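To see why an instruction such as `popf` is a problem, here is a simplified model of how it treats the interrupt flag. It covers only the IF bit and assumes protected mode; the function is a sketch, not a full description of the instruction.

```python
IF_BIT = 1 << 9  # position of the interrupt-enable flag (IF) in EFLAGS


def popf(current_eflags, popped_value, cpl, iopl=0):
    """Simplified popf: return the new EFLAGS. Note that it never traps."""
    if cpl <= iopl:
        # Sufficiently privileged: IF is taken from the popped value.
        return popped_value
    # Not privileged enough: the IF change is silently dropped.
    return (popped_value & ~IF_BIT) | (current_eflags & IF_BIT)


# A native kernel at CPL 0 disables interrupts by popping a value with IF clear.
native = popf(IF_BIT, 0, cpl=0)

# A guest kernel deprivileged to CPL 3 runs the same instruction.
guest = popf(IF_BIT, 0, cpl=3)

print(bool(native & IF_BIT), bool(guest & IF_BIT))  # False True
```

The guest believes interrupts are disabled, but nothing changed and no trap occurred, so a trap-and-emulate VMM never sees the request.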

#### PARA VIRTUALIZATION, X86 EXTENSIONS

So far: No change to the guest OS. No changes to the hardware.

Downside: Overheads can be quite high?

Para virtualization

Can we make (small?) modifications to the guest OS for efficiency?

Hardware

Instruction set extensions (Intel, AMD)

Xen (early 2000s): make x86 virtualization friendly by updating the guest OS

Modify guest OS: simply undefine all of the 17 non-virtualizable instructions!

Alternate interrupt architecture

| Category          | Aspect       | Paravirtualized interface |
|-------------------|--------------|---------------------------|
| Memory Management | Segmentation | Cannot install fully privileged segment descriptors and cannot overlap with the top end of the linear address space. |
| Memory Management | Paging       | Guest OS has direct read access to hardware page tables, but updates are batched and validated by the hypervisor. A domain may be allocated discontinuous machine (aka host-physical) pages. |
| CPU               | Protection   | Guest OS must run at a lower privilege level than Xen. |
| CPU               | Exceptions   | Guest OS must register a descriptor table for exception handlers with Xen. Aside from page faults, the handler remains the same. |
| CPU               | System calls | Guest OS may install a "fast" handler for system calls, allowing direct calls from an application into its guest OS and avoiding indirection through Xen on every call. |
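The "batched and validated" paging row can be illustrated with a toy hypervisor. This is only a sketch: `ToyHypervisor` and `apply_batch` are hypothetical names, not Xen's interface, and rejecting the whole batch on one bad entry is a design choice made here for simplicity.

```python
class ToyHypervisor:
    def __init__(self, frames_owned):
        self.frames_owned = frames_owned  # domain -> set of MFNs it may map
        self.page_tables = {dom: {} for dom in frames_owned}  # domain -> {VPN: MFN}
        self.hypercalls = 0

    def apply_batch(self, domain, updates):
        """Validate, then apply, a batch of (vpn, mfn) page table updates."""
        self.hypercalls += 1  # one entry into the hypervisor for the whole batch
        allowed = self.frames_owned[domain]
        for vpn, mfn in updates:
            if mfn not in allowed:
                raise PermissionError(f"{domain} does not own machine frame {mfn}")
        for vpn, mfn in updates:
            self.page_tables[domain][vpn] = mfn


hv = ToyHypervisor({"dom1": {5, 6, 10}, "dom2": {7}})
hv.apply_batch("dom1", [(0, 5), (2, 6), (3, 10)])
print(hv.hypercalls, hv.page_tables["dom1"])
```

Batching amortizes the cost of entering the hypervisor over many updates, while validation keeps one domain from mapping frames that belong to another.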

#### INTEL VT-X EXTENSIONS

True hardware support meeting the Popek / Goldberg criteria

Do not change the semantics of individual instructions; instead, duplicate the entire visible state and introduce a new mode of execution: the root mode.

- Hypervisor is in root mode, Guest OS in non-root mode.
- Special new instructions for detecting mode (only available in root mode, otherwise a trap is caused).
- New mode only used for virtualization
- Each mode has its own address space (CR3 → each mode has its own PTBR)
- Each mode has its own interrupt flag



Next class: Multi-CPU scheduling

Thanksgiving break!