* Source of increased TLB misses experienced by a workload in Disco?
  - OS code/data are remapped into mapped space --> extra TLB misses
  - MIPS supports ASIDs, but Disco flushes the TLB on a virtual CPU switch because it does not virtualize ASIDs

**DISCO: Running Commodity Operating Systems on Scalable Multiprocessors**
==========================================================================

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

# Goals: provide system software that runs on a scalable shared-memory multiprocessor, with the focus on scalability

# Approach: VMM
+ small code base, hence less buggy
+ scalable
+ fault containment
+ flexible: supports a wide variety of workloads
  > specialized guest OSes, etc.
  > sharing across virtual machines

# Challenges
- Overhead:
  + time overhead: the monitor adds a level of indirection
    ~ additional exception processing and instruction execution, e.g.:
    ~ privileged instructions cannot be executed directly, but need emulation
    ~ accesses to I/O devices must be intercepted and remapped by the monitor
  + space overhead
    ~ additional memory to run multiple independent machines
    ~ code and data of each OS and app are replicated across virtual machines
    ~ large memory structures are also replicated, like the file system buffer cache
- Resource management:
  + now there is a layer between the OSes and the hardware, so information is lost: a semantic gap between the apps and the monitor
  + lack of information available to the monitor to make good policy decisions
    ~ an idle loop or lock busy-waiting looks like important computation
      Implication: the monitor may schedule resources for useless computation
    ~ zero paging, and its implications
    ~ the monitor does not know when a page is no longer active, so it cannot reclaim that page
- Communication and sharing: e.g. how can apps in different OSes explicitly share resources? Without support from the monitor it is difficult (e.g. the monitor can provide a monitor call for apps to set sharing up explicitly)

# Disco seeks to reduce those overheads
- Deal with time overhead: fast translation on a TLB miss
  + second-level software TLB for caching translations
  + ease TLB invalidation with backmap pointers and the memmap
- Deal with memory space overhead
  + sources:
    > NFS file client / server
    > kernel code, file system buffer cache
  + solution:
    > copy-on-write disk by intercepting DMA (sketched right after this section)
      - the first VM loads kernel code by DMA
      - Disco keeps track of the disk-sector-to-memory mapping
      - when a second VM requests the same disk sector, Disco finds it already in memory; all it needs to do is change the mapping of that virtual machine's physical address
    > share pages between NFS client and server
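A minimal sketch of the COW-disk sharing idea above, assuming made-up names (`dma_read_sector`, `sector_to_mpage`) and a toy flat-array "machine memory"; it only illustrates the bookkeeping, not Disco's actual code: once a disk sector is resident, a second VM's DMA request is satisfied by remapping instead of re-reading the disk.

```c
/* Hypothetical sketch of copy-on-write disk sharing via DMA interception.
 * All names and the flat-array "hardware" are simplifications, not Disco code. */
#include <stdio.h>

#define NUM_SECTORS    64
#define NUM_MACH_PAGES 32
#define NO_PAGE        -1

static char machine_page[NUM_MACH_PAGES][512]; /* machine memory, 1 sector per page */
static int  next_free_page = 0;

/* Disco's bookkeeping: which machine page (if any) already holds a given sector. */
static int sector_to_mpage[NUM_SECTORS];

/* Per-VM "physical"-to-machine mapping for the page backing this DMA buffer. */
static void map_guest_page(int vm, int ppage, int mpage, int read_only) {
    printf("VM%d: phys page %d -> machine page %d%s\n",
           vm, ppage, mpage, read_only ? " (read-only, COW)" : "");
}

/* Simulated disk read (in reality a real DMA transfer). */
static void disk_read(int sector, char *dst) {
    snprintf(dst, 512, "contents of sector %d", sector);
}

/* Intercepted DMA read request from a VM. */
static void dma_read_sector(int vm, int ppage, int sector) {
    if (sector_to_mpage[sector] != NO_PAGE) {
        /* Sector already in memory: share the machine page, no disk I/O. */
        map_guest_page(vm, ppage, sector_to_mpage[sector], 1);
        return;
    }
    int mpage = next_free_page++;          /* allocate a machine page */
    disk_read(sector, machine_page[mpage]);
    sector_to_mpage[sector] = mpage;       /* remember it for later requests */
    map_guest_page(vm, ppage, mpage, 1);
}

int main(void) {
    for (int i = 0; i < NUM_SECTORS; i++) sector_to_mpage[i] = NO_PAGE;
    dma_read_sector(0, 10, 7);  /* VM0 reads sector 7: goes to disk */
    dma_read_sector(1, 22, 7);  /* VM1 reads sector 7: shares the same machine page */
    return 0;
}
```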
# What needs to be changed
- Why changes?
  + for virtualization purposes
  + for optimization purposes
- Changes for virtualization
  + MIPS bypasses the TLB for the unmapped kernel segment
    ==> hence relink OS code/data to a mapped region (by changing the layout)
  + change device drivers to match Disco's *monitor call* interface
    (why have a monitor call? to reduce the complexity of emulating all I/O instructions)
- Changes for optimization
  + rewrite frequently used privileged instructions that access privileged registers as non-trapping load/store instructions to a special memory address
    ==> reduces the overhead of trapping to Disco
  + bridge the semantic gap by adding monitor calls that pass high-level resource-utilization info to Disco, so that Disco can make better decisions (see the hint sketch after the Challenges list below)
    > zeroed-page request (the page is cleared by Disco)
    > inform Disco that a page is free, so Disco can reclaim it
    > when running the idle loop, inform Disco, so that Disco can deschedule the virtual CPU

$$$$$$$$$$$$$$$$$$$$$$$$$$

# *Problem*: extending a modern OS to run efficiently on a large-scale shared-memory machine, i.e. achieving *scalability*
- Alternative 1: modify an existing OS
  + costly (an OS contains millions of lines of code)
  + buggy (due to the large modifications)
  + late delivery
  + incompatibility
- Alternative 2: Virtual Machine Monitor
  + small piece of code
  + less effort to implement and get right
  + fewer bugs and fewer incompatibility problems

# *Benefits* of a VMM:
- supports a wide variety of workloads (specialized OSes, etc.)
- opportunity to employ global policies like:
  + load balancing
  + moving memory between VMs (to avoid paging to disk)
- scalability:
  + the VM is the unit of scalability
  + only the VMM and distributed system protocols need to scale to the machine size
- fault containment: failure of one VM does not affect the others
- solves the Non-Uniform Memory Access problem by:
  + careful page placement
  + page migration
  + page replication
- apps in different VMs can share memory by setting up a shared region via a monitor call

# *Challenges*
- Overhead of hardware virtualization
  + Time overheads:
    > emulating (instead of directly executing) privileged instructions (TLB writes, shadow page table maintenance, ...)
  + Memory overhead:
    > replication of the file system buffer cache and of OS and app code and data in each VM
- Resource management: the gap between guest OS and VMM makes good global policies difficult
  + an idle loop or busy-waiting on a lock looks like useful work
  + double paging: the VMM does not know which pages of a guest OS are no longer active
- Difficult communication and sharing (?)
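A toy sketch of the "hint" monitor calls mentioned under Changes for optimization, showing how a free-page hint and a zeroed-page request close the reclamation gap noted in the Challenges list. All names (`monitor_call_page_free`, etc.) and the page-state model are my own illustration, not Disco's interface.

```c
/* Toy sketch of semantic-gap "hint" monitor calls; names and the page-state
 * model are made up for illustration, not Disco's actual interface. */
#include <stdio.h>
#include <string.h>

#define PAGES 8

enum page_state { IN_USE, FREED_BY_GUEST, RECLAIMED };
static enum page_state state[PAGES];
static char memory[PAGES][4096];

/* Guest OS hint: "I no longer need this page."  Without this call the monitor
 * only sees loads/stores and cannot tell a free page from a live one. */
static void monitor_call_page_free(int page) {
    state[page] = FREED_BY_GUEST;
}

/* Guest OS request: "give me a zeroed page" -- the monitor can hand back any
 * reclaimed page after clearing it, instead of the guest zeroing a fresh one. */
static int monitor_call_request_zeroed_page(void) {
    for (int p = 0; p < PAGES; p++) {
        if (state[p] == FREED_BY_GUEST || state[p] == RECLAIMED) {
            memset(memory[p], 0, sizeof memory[p]);
            state[p] = IN_USE;
            return p;
        }
    }
    return -1; /* nothing reclaimable */
}

/* Monitor-side policy: reclaim hinted pages under memory pressure. */
static int reclaim_free_pages(void) {
    int reclaimed = 0;
    for (int p = 0; p < PAGES; p++)
        if (state[p] == FREED_BY_GUEST) { state[p] = RECLAIMED; reclaimed++; }
    return reclaimed;
}

int main(void) {
    monitor_call_page_free(3);
    monitor_call_page_free(5);
    printf("reclaimed %d pages\n", reclaim_free_pages());
    printf("guest got zeroed page %d\n", monitor_call_request_zeroed_page());
    return 0;
}
```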
# *Virtualizing CPU*
- use direct execution:
  + i.e. set the real machine's registers to those of the virtual CPU
  + jump to the current PC of that virtual CPU
  hence, most instructions run at the same speed as on bare hardware
  + when switching/multiplexing between VMs, do a machine switch
- for each virtual CPU, keep a data structure (much like a process table entry) with:
  + saved registers
  + TLB contents
  + privileged registers
- challenge: emulating privileged instructions such as TLB modifications
  + the guest OS runs in supervisor mode
    > it can use additional memory to store its own data structures
    > but it cannot directly execute privileged instructions (it traps to Disco for them)
- scheduling
  + virtual CPUs are time-shared across the physical CPUs
  + affinity scheduling to increase data locality

# *Virtualizing Memory*
- add another level of address translation: from physical pages to machine pages
- hence there are 2 levels of mapping:
  + from virtual address (app) to physical address (the VM's view)
  + from physical address to real machine address
- hence, for each VM, Disco keeps a *pmap* structure
  + each entry contains the physical-to-machine mapping used to build the correct TLB entry
  + plus a backmap to the virtual addresses, used to ease TLB invalidation (when a page is taken away from the VM by Disco)
- Problems specific to MIPS:
  + MIPS bypasses the TLB for the unmapped segment
    ==> need to relink OS code/data to a mapped region (this adds TLB misses)
  + MIPS uses ASID-tagged TLB entries
    > on a context switch (between processes of one VM), the TLB is not flushed
    > on a machine switch (between VMs), Disco flushes the TLB
      - eliminates the complexity of virtualizing address space IDs (ASIDs)
      - but adds more TLB misses
- Time overheads occur on:
  + TLB misses
    ==> use caching: cache recently used virtual-to-machine translations in a second-level software TLB
    On a TLB miss:
    1. Process: loads from memory, TLB miss, trap to Disco
    2. Disco: TLB miss handler
       - look in the second-level software TLB
       - if found, put the cached mapping in the TLB, return to the process
       - otherwise, call the OS-specific TLB handler
    3. OS TLB handler:
       - look up the VPN-to-PFN mapping in the page table
       - if found, get the PFN and update the TLB with a privileged instruction, which traps to Disco
    4. Disco: trap handler
       - sees somebody trying to update the TLB with a VPN-to-PFN mapping
       - looks in the pmap to find the corresponding PFN-to-MFN mapping
       - updates the TLB with the VPN-to-MFN entry
       - jumps back to the OS
    5. OS TLB handler:
       - returns from the trap
    6. Disco: trap handler
       - does the real return to the process
  + TLB invalidation
    ==> backmaps

# NUMA Memory Management
- for performance only; correctness is already ensured by the cache-coherent hardware
- page replication
  + replicate a read-shared page to the nodes that access it most
  + how?
    > downgrade the TLB entries (mark them read-only)
    > copy the page to the local node
    > update the TLB entries
- page migration (see the sketch after this section)
  + migrate a page to the node that updates it most
  + how?
    > invalidate the TLB mappings on the source node, i.e. every TLB entry currently pointing at this machine page
      (use the memmap to find the pmap entries; from the pmap, use the backmap to find the virtual addresses that map to this machine page, and use those to invalidate the mappings)
    > copy the page to the local machine page
    > (change the TLB mapping and the physical-to-machine mapping)
      Do I really need to update the TLB again right away? Not necessarily; I think it will be updated on later use.
      Why? Because re-establishing the TLB entries eagerly is costly: you would need to update the pmap structure, the second-level software TLB, and the hardware TLB.
- memmap: maps each machine page back to its users, to ease invalidation
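The page-migration steps, as a rough sketch. The memmap/backmap layout and function names below are simplified stand-ins for Disco's structures; the point is only that the back-pointers make the TLB shootdown cheap and that the refill can be done lazily.

```c
/* Simplified sketch of NUMA page migration driven by memmap back-pointers.
 * Data layout and function names are my own illustration, not Disco's. */
#include <stdio.h>
#include <string.h>

#define MAX_MAPPERS 4

struct tlb_entry { int valid; int vpn; int mpage; };

/* One back-pointer: which VM/TLB entry currently maps this machine page. */
struct backmap { int vm; struct tlb_entry *tlb; };

/* memmap entry: for each machine page, who maps it (plus the data itself). */
struct memmap_entry {
    int n;
    struct backmap mappers[MAX_MAPPERS];
    char data[4096];
};

static struct memmap_entry memmap[16];

/* Migrate machine page 'src' to a page 'dst' local to the node using it most. */
static void migrate_page(int src, int dst) {
    struct memmap_entry *from = &memmap[src], *to = &memmap[dst];

    /* 1. Use the back-pointers to shoot down every TLB entry mapping 'src'. */
    for (int i = 0; i < from->n; i++)
        from->mappers[i].tlb->valid = 0;

    /* 2. Copy the contents to the destination (local) machine page. */
    memcpy(to->data, from->data, sizeof to->data);

    /* 3. Move the back-pointers; the physical-to-machine (pmap) mapping would
     *    be updated here too, so later TLB misses find the new machine page. */
    to->n = from->n;
    memcpy(to->mappers, from->mappers, sizeof from->mappers);
    from->n = 0;
    /* No eager TLB refill: entries are re-established lazily on the next miss. */
}

int main(void) {
    static struct tlb_entry vm0_tlb = { 1, 0x123, 2 };
    strcpy(memmap[2].data, "hot page");
    memmap[2].n = 1;
    memmap[2].mappers[0] = (struct backmap){ .vm = 0, .tlb = &vm0_tlb };

    migrate_page(2, 9);
    printf("tlb valid after migration: %d, page 9 holds: %s\n",
           vm0_tlb.valid, memmap[9].data);
    return 0;
}
```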
# Virtualizing I/O
- copy-on-write disk
  + for sharing
  + identify sharing opportunities by intercepting DMA
- virtual network interface
  + global disk cache

$$$$$$$$$$$$$$$$$$$$$$$$$$

# 0. Short summary
===============
- pmap: the per-VM physical-to-machine mapping (a kind of shadow page table) for fast translation to machine addresses
- smart memory placement to reduce cache-miss latency:
  + dynamic page migration
  + page replication
- main goal: scalability
  + but the solution is a VMM,
  + trying to minimize the modification effort

# 1. Introduction
=================
- problem: existing OSes do not scale on shared-memory multiprocessors; why?
  + they need extensive modifications
  + their large size and complexity make modification hard
- solution: a virtual machine monitor
  + allows guest OSes to efficiently cooperate and share resources
  + prototyped for a NUMA architecture (FLASH)
- features: many features to eliminate the problems of old VMMs
  + minimize the overhead of virtual machines
  + enhance resource sharing between virtual machines (sharing disk and memory)
  + global buffer cache: transparently shared among virtual machines

# 2. Problem Description
========================
- need to provide system software for shared-memory NUMA multiprocessors
- current OSes do not scale and would need extensive modification

# 3. A Return to Virtual Machine Monitors
- VMM:
  + does not need significant changes to existing OSes
  + virtualizes all resources of the machine, exporting a conventional hardware interface to the OS
  + uses global policies to manage all resources of the machine
    > virtual processors are scheduled on physical processors for load balancing
    > memory is moved between virtual machines
- a VMM can be reliable and require a small implementation effort
  + because the VMM is a small piece of code
- satisfies a variety of workloads: via specialized OSes
- *scalability*
- fault containment: the virtual machine is the unit of fault containment
- solves the NUMA memory management issue with:
  + careful page placement
  + page migration
  + page replication
- Challenges:
  + Overheads:
    > additional exception processing
    > additional instruction execution
    > more memory for virtualizing the hardware
    > replication of OS and app memory
  + resource management:
    > lack of information for the monitor to make good policy decisions
      e.g. an idle loop or busy-waiting for a lock looks like useful work from the monitor's view
      e.g. a page unused by a VM is invisible to the monitor
  + difficult communication and sharing

# 4. Disco: A Virtual Machine Monitor
NOTE: this is the most important part
- for a NUMA architecture:
  + each node has processors, memory, and I/O devices
  + the software sees an SMP view (but the hardware is NUMA)

# 4.1 Disco's Interface: the abstractions exported to the layer above
- Processors:
  + abstraction of a MIPS R10000 processor
  + does not support complete virtualization of the kernel virtual address space (what is this? presumably the unmapped segment issue; see 4.2.2)
  + emulates all instructions, the memory management unit, and interrupt enable/disable
- Physical Memory:
  + abstraction of main memory as a contiguous physical address space
  + deals with the NUMA nature of the machine:
    > dynamic page migration
    > page replication
    ==> making memory access closer to uniform
- I/O Devices:
  + each VM has its own I/O devices: disks, network interfaces, timer, clock, ...
  + each VM believes it has exclusive access to its I/O devices
    ==> Disco intercepts and emulates all communication to/from I/O devices
  + Virtual disks (a rough per-VM descriptor sketch follows this section)
    > can be accessed by any VM
    > private or shared
    > persistent or non-persistent
  + HOW ABOUT THE NETWORK???
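To keep the exported abstractions in one place, here is a hypothetical per-VM descriptor; the field names are mine, the paper only enumerates the abstractions (virtual CPUs, a contiguous physical memory, virtual disks, a virtual network interface).

```c
/* Hypothetical per-VM descriptor summarizing the interface Disco exports.
 * Field names are my own; the paper only enumerates the abstractions. */
#include <stdint.h>
#include <stdbool.h>

struct vcpu_state {            /* abstraction of a MIPS R10000 processor */
    uint64_t regs[32];         /* general-purpose registers */
    uint64_t pc;
    uint64_t priv_regs[16];    /* privileged registers (emulated) */
    /* TLB contents would also be saved per virtual CPU */
};

struct virtual_disk {
    bool shared;               /* accessible by any VM, or private */
    bool persistent;           /* persistent, or discarded at end of session */
    uint64_t num_sectors;
};

struct virtual_machine {
    struct vcpu_state *vcpus;  /* one or more virtual CPUs */
    int num_vcpus;
    uint64_t phys_mem_bytes;   /* contiguous "physical" address space */
    struct virtual_disk *disks;
    int num_disks;
    bool has_virtual_nic;      /* virtual network interface on the virtual subnet */
};

int main(void) { return 0; }
```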
# 4.2 Implementation of Disco
- implemented as a multi-threaded shared-memory program
- pays attention to:
  + NUMA memory placement
  + cache-aware data structures
  + interprocessor communication patterns
- improves NUMA locality by:
  + replicating Disco's code into each node's local memory, so instruction misses are satisfied locally
  + partitioning machine-wide data structures to increase locality
- few locks on shared data structures --> better performance
- communicates mostly through shared memory

# 4.2.1 Virtual CPUs
- for most instructions, use direct execution
  + i.e. set the real machine's registers to those of the virtual CPU
- emulate privileged operations:
  + TLB updates
  + direct access to I/O devices and physical memory
- for each virtual CPU, Disco keeps:
  + registers and other state
  + TLB contents
  + privileged registers
- Disco runs in kernel mode, the OS runs in supervisor mode, and apps run in user mode
  + when the OS needs to execute a privileged instruction, it traps to Disco
- scheduling:
  + virtual CPUs are time-shared across the physical processors
  + cooperates with memory management for affinity

# 4.2.2 Virtual Physical Memory
- 2 levels of mapping:
  + virtual address to physical address (in the virtual machine's mind)
  + physical address to machine address (the *real* physical address)
- pmap: a per-VM structure, like a shadow page table, mapping physical addresses to machine addresses
  + on a TLB miss, the OS tries to update the TLB entry, which traps to Disco
  + Disco translates the physical address to the machine address (using the pmap) and updates the TLB entry
  + a pmap entry contains the real mapping plus a back pointer to the virtual address
    ==> eases TLB invalidation when Disco takes a page away from the virtual machine
- on MIPS, the OS can live in the unmapped memory region
  ==> Disco has to relink the OS code/data to the mapped region
  ==> but this increases TLB misses
  Solution: a second-level software TLB (refill path sketched after 4.2.3)
- the TLB is tagged with an ASID (address space ID):
  + on a context switch within a VM, no need to flush the TLB (the ASIDs still apply)
  + on a VM switch, the TLB must be flushed (ASIDs are not virtualized)

# 4.2.3 NUMA Memory Management
- cache coherence is handled in hardware
- leverage the cache-miss counting facility of FLASH
  + once FLASH detects a hot page (one taking many remote cache misses),
    Disco chooses between migrating and replicating it based on the cache-miss counters
- reduce cache-miss latency on NUMA by:
  + page migration: migrate a page that is mostly written by one node to that node
  + page replication: copy a read-shared page to the nodes that read it
- how to migrate a page: transparently change the physical-to-machine mapping
  + invalidate any TLB entries mapping the old machine page
  + copy the data to the local machine page
  + update the TLB
- how to replicate:
  + downgrade the TLB entries to read-only
  + copy the data to the local machine page
  + update the TLB
- memmap: maps each machine page to the list of VMs (and virtual addresses) that use it
  ==> eases the TLB shootdowns needed for migration and replication
- PAGE FAULT: what happens???
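A rough model of the 4.2.2 refill path: check the second-level software TLB first; otherwise take the guest's virtual-to-physical mapping and rewrite it through the pmap. Table sizes and names are illustrative only, not Disco's structures.

```c
/* Sketch of Disco-style TLB refill: second-level software TLB first, then the
 * guest's virtual->physical mapping rewritten through the pmap.  Tables,
 * sizes, and names are illustrative only. */
#include <stdio.h>

#define NPAGES 16
#define NOT_MAPPED -1

/* Guest page table: virtual page -> "physical" page (the guest's view). */
static int guest_page_table[NPAGES];

/* pmap: the guest's physical page -> real machine page. */
static int pmap[NPAGES];

/* Second-level software TLB: caches virtual -> machine directly. */
static int l2_stlb[NPAGES];

/* Hardware TLB insert (here just a print). */
static void tlb_write(int vpn, int mpn) {
    printf("hw TLB: vpn %d -> machine page %d\n", vpn, mpn);
}

static void handle_tlb_miss(int vpn) {
    /* 1. Fast path: recently used virtual-to-machine translation. */
    if (l2_stlb[vpn] != NOT_MAPPED) {
        tlb_write(vpn, l2_stlb[vpn]);
        return;
    }
    /* 2. Slow path: forward the miss to the guest OS, which produces a
     *    virtual-to-physical mapping via a (trapped) TLB write. */
    int ppn = guest_page_table[vpn];
    if (ppn == NOT_MAPPED) { printf("guest page fault\n"); return; }

    /* 3. The monitor rewrites physical -> machine using the pmap, fills the
     *    hardware TLB, and caches the combined translation. */
    int mpn = pmap[ppn];
    l2_stlb[vpn] = mpn;
    tlb_write(vpn, mpn);
}

int main(void) {
    for (int i = 0; i < NPAGES; i++) {
        guest_page_table[i] = NOT_MAPPED;
        l2_stlb[i] = NOT_MAPPED;
        pmap[i] = i + 100;          /* arbitrary machine page numbers */
    }
    guest_page_table[5] = 3;
    handle_tlb_miss(5);             /* slow path: guest table + pmap */
    handle_tlb_miss(5);             /* fast path: second-level software TLB */
    return 0;
}
```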
# 4.2.4 Virtual I/O Devices
- intercept all accesses to I/O devices from the virtual machines and eventually forward them to the physical devices
- However, Disco adds special device drivers to the operating systems. Disco's devices define *a monitor call* used by these drivers to pass all their arguments in a single trap
  (i.e. the device drivers in the OS need to be modified to use this monitor call interface)
  Why do this?
  - Reduce overhead: avoid multiple traps to Disco, which are very costly
  - Reduce the complexity of interacting with specific devices
  - Hence, that complexity is left to the virtual machine monitor (Disco)
  - Disco uses IRIX's original device drivers. "Disco replaces the old device drivers (with many privileged instructions) with new versions that call directly into Disco" (Andrea)
- Disco intercepts DMA, because it needs to translate physical addresses to machine addresses.
- For devices accessed by a single VM, Disco only needs to guarantee exclusivity and to translate the DMA's physical addresses to machine addresses
  ==> in this case, Disco does NOT virtualize the I/O resource (it only translates addresses)
- Since DMA is intercepted, there is a chance to share disk and memory resources

*Copy-on-write* DISK
- because Disco intercepts DMA, there is a chance for sharing
  + share kernel code, the global buffer cache, etc.
- use copy-on-write (write path sketched after the Virtual Network Interface notes)
  + on a write, log the modified sectors to a special private partition, so that the copy-on-write disk itself is never modified
  + this only applies to non-persistent disks, and Disco tries to keep the modified sectors in memory as much as possible
  + hence, at the end of the session, all changes are discarded
  Why do this? To provide isolation between virtual machines when they issue writes.
- for persistent disks:
  + only a single virtual machine can mount one at a given time
  + hence, changes go directly to that disk as normal
  As a result, Disco does NOT need to virtualize the layout of the disk.

*Virtual Network Interface*
- VMs communicate over a virtual subnet
- try to avoid copying/replication
- properly aligned message fragments that span a complete page are remapped rather than copied (this makes sharing possible)
- leverage copy-on-write:
  + an NFS server and client can share a page
  + this creates a global disk cache
  + however, sharing can hurt locality (the shared page may be remote), causing more cache misses
    ==> replication can address this
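The non-persistent COW-disk write path, sketched: writes land in a private per-VM log, reads check the log before the shared base image, and the whole log is dropped at the end of the session. Structures and names are mine, not Disco's on-disk format.

```c
/* Sketch of a non-persistent copy-on-write disk: writes are logged privately
 * per VM; the shared base image is never modified.  Illustration only. */
#include <stdio.h>
#include <string.h>

#define SECTOR 32
#define NUM_SECTORS 8
#define NUM_VMS 2

static char base_image[NUM_SECTORS][SECTOR];            /* shared, read-only */

struct sector_log {                                     /* per-VM private log */
    int  modified[NUM_SECTORS];
    char data[NUM_SECTORS][SECTOR];
};
static struct sector_log vm_log[NUM_VMS];

static void cow_write(int vm, int sector, const char *buf) {
    /* Never touch the base image; keep the modified sector in the VM's log. */
    vm_log[vm].modified[sector] = 1;
    memcpy(vm_log[vm].data[sector], buf, SECTOR);
}

static const char *cow_read(int vm, int sector) {
    /* Logged copy wins; otherwise fall back to the shared base image. */
    if (vm_log[vm].modified[sector])
        return vm_log[vm].data[sector];
    return base_image[sector];
}

static void end_of_session(int vm) {
    /* Non-persistent: all of the VM's changes are simply discarded. */
    memset(&vm_log[vm], 0, sizeof vm_log[vm]);
}

int main(void) {
    strcpy(base_image[0], "original");
    char buf[SECTOR] = "vm1 change";
    cow_write(1, 0, buf);
    printf("VM0 sees: %s\n", cow_read(0, 0));            /* original */
    printf("VM1 sees: %s\n", cow_read(1, 0));            /* vm1 change */
    end_of_session(1);
    printf("VM1 after session: %s\n", cow_read(1, 0));   /* original again */
    return 0;
}
```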
# 4.3 Running Commodity Operating Systems
- the OS needs to be changed a bit, because MIPS lets the kernel access memory through the unmapped segment
  - now the kernel needs to be relinked to the mapped region
this section is really important, because it addresses both the performance and the resource management (semantic gap) problems
- Why changes?
  + for virtualization purposes
  + for optimization purposes
- Changes for virtualization
  + MIPS bypasses the TLB for the unmapped kernel segment
    ==> hence relink OS code/data to a mapped region (by changing the layout)
  + change device drivers to match Disco's *monitor call* interface
    (why have a monitor call? to reduce the complexity of emulating all I/O instructions)
    Why do this?
    - Reduce overhead: avoid multiple traps to Disco, which are very costly
    - Reduce the complexity of interacting with specific devices
    - Hence, that complexity is left to the virtual machine monitor (Disco)
    - Disco uses IRIX's original device drivers.
- Changes to the Hardware Abstraction Level, for optimization
  + rewrite frequently used privileged instructions that access privileged registers as non-trapping load/store instructions to a special memory address
    ~ reduces the overhead of trapping to Disco
    ~ which instructions? synchronization routines, trap handlers, ...
  + bridge the semantic gap by adding monitor calls that pass high-level resource-utilization info to Disco, so that Disco can make better decisions
    > zeroed-page request (the page is cleared by Disco)
    > inform Disco that a page is free, so Disco can reclaim it
    > when running the idle loop, inform Disco, so that Disco can deschedule the virtual CPU
      But Disco actually detects that a CPU is idle by checking whether it is in the reduced power consumption mode (used by the idle loop)
- Other changes:
  + changes to bcopy and mbuf to leverage remap-based page sharing and copy-on-write

# 6. Related Works
- Disco differs from other VMMs:
  + supports NUMA: page migration, replication
  + COW disks, global buffer cache

DISCUSSION:
- if you have 2 VMs accessing two different disks, there is no way to detect that they are reading the same contents
  VMware's solution: content-based page sharing (idea sketched at the end of these notes)
- I need to reread this paper, pay attention to the experiment part, probably read the ACM Transactions version
- read Remzi's notes about VM monitors
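The content-based sharing idea from the discussion (the VMware-style answer to two VMs reading identical data from different disks), sketched only at the level of the idea, not any real implementation: hash page contents, compare on a hash match, and collapse identical pages onto one copy.

```c
/* Toy sketch of content-based page sharing (the VMware-style idea from the
 * discussion): pages with identical contents are collapsed onto one copy.
 * The hash and structures here are illustrative only. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_PAGES 8

static char pages[MAX_PAGES][PAGE_SIZE];
static int  shared_with[MAX_PAGES];   /* -1, or index of the canonical copy */

static unsigned long hash_page(const char *p) {
    unsigned long h = 5381;                 /* simple djb2-style hash */
    for (int i = 0; i < PAGE_SIZE; i++)
        h = h * 33 + (unsigned char)p[i];
    return h;
}

/* Scan pages; if two have equal hashes AND equal contents, share them. */
static int dedupe(int npages) {
    int shared = 0;
    for (int i = 0; i < npages; i++) {
        if (shared_with[i] != -1) continue;
        unsigned long hi = hash_page(pages[i]);
        for (int j = i + 1; j < npages; j++) {
            if (shared_with[j] != -1) continue;
            if (hi == hash_page(pages[j]) &&
                memcmp(pages[i], pages[j], PAGE_SIZE) == 0) {
                shared_with[j] = i;   /* j's users now map (read-only) to i */
                shared++;
            }
        }
    }
    return shared;
}

int main(void) {
    memset(shared_with, -1, sizeof shared_with);
    strcpy(pages[0], "same kernel text");
    strcpy(pages[1], "same kernel text");   /* read from a different disk */
    strcpy(pages[2], "something else");
    printf("collapsed %d duplicate pages\n", dedupe(3));
    return 0;
}
```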