**MEMORY RESOURCE MANAGEMENT in VMWARE ESX Server**
===================================================

DIG MORE:
- structure of the page table entry
- vs. the sharing mechanism of DISCO
  + In DISCO, OSes need to be modified, e.g. hooking OS routines such as bcopy
  + in ESX Server: no need to modify the OS

# 0. Take away

*A couple of challenges with virtual machines*
- the semantic gap (i.e. the VMM does not know which pages the OS would prefer to flush out), hence the chance of double paging
  What is double paging: 1) the VMM takes a page away from a guest OS, 2) the OS later wants to swap out that page, but first has to page it back in just to page it out.
- how to exploit sharing to reduce memory overcommitment, hence accommodate more OSes on top of the VMM (since different VMs/OSes may have different characteristics/importance)
- how to make memory management efficient, because the VMM adds another level of indirection, hence slows things down
- how to avoid modifying OS code

*Solution in this paper*
1) ballooning ==> solves the semantic gap (the VMM just asks; the OS picks the pages to evict)
2) content-based sharing ==> share pages based on their content (similar to DISCO, but with no OS modification)
3) share-based allocation: an OS gains an allocation proportional to its shares, but again solve the semantic gap (idle vs. active memory) by
   + idle-memory sampling
   + an idle memory tax
4) allocation policy
5) memory remapping (??? why do we need this — see # 6 below)

NOTE: we see a separation of mechanism vs. policy here:
1, 2 (perhaps 5) are mechanisms; the rest is policy

# 1. The Basics, and the Problem

THE CRUX: how do we virtualize memory in a virtual machine, in general?

Well, apps see virtual memory
OSes see "physical" memory and maintain the mapping from virtual --> "physical"
the *VMM* maintains the mapping from "physical" --> machine (PPN -> MPN)
There is a "shadow page table" mapping directly from virtual to machine, to speed up translation (see the sketch below)
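Below is a minimal sketch (Python, just to illustrate the idea) of the two mapping levels and how a shadow page table caches the composed virtual-to-machine translation. All names (`GuestOS`, `VMM`, `pmap`, `shadow`) are illustrative placeholders, not ESX's actual data structures; a real shadow page table is a hardware page table, not a dictionary.

```python
# Sketch: two-level address translation with a shadow page table (hypothetical names).

class GuestOS:
    def __init__(self):
        self.page_table = {}   # VPN -> PPN ("physical" page, as seen by the guest)

class VMM:
    def __init__(self, guest):
        self.guest = guest
        self.pmap = {}         # PPN -> MPN (machine pages, the real RAM)
        self.shadow = {}       # VPN -> MPN, composed mapping installed for the hardware

    def translate(self, vpn):
        if vpn in self.shadow:                # fast path: shadow entry already exists
            return self.shadow[vpn]
        ppn = self.guest.page_table[vpn]      # guest mapping: VPN -> PPN
        mpn = self.pmap[ppn]                  # VMM mapping:   PPN -> MPN
        self.shadow[vpn] = mpn                # cache the composition VPN -> MPN
        return mpn

# usage (illustrative numbers)
guest = GuestOS()
guest.page_table[7] = 3        # guest believes VPN 7 lives in PPN 3
vmm = VMM(guest)
vmm.pmap[3] = 42               # the VMM backs PPN 3 with MPN 42
assert vmm.translate(7) == 42  # shadow table now maps VPN 7 directly to MPN 42
```

The key point: once the shadow entry exists, translation goes VPN -> MPN in one step; when the VMM reclaims or remaps a machine page, the affected shadow entries must be invalidated so the next access re-derives the mapping.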
But: we want to run lots of guest OSes for server consolidation
==> hence machine memory gets *over-committed*

What should the VMM do?
==> paging: pick a page, and *swap it out*

But we have a *problem with paging*
==> How do we know *which* page to page out? The VMM generally doesn't know this (i.e. what policy the OS is using, which pages it prefers least)
==> Worse, even if it did know, we can still hit the "double paging" problem
E.g. of double paging: the VMM wants to reclaim a page from OS1, picks one, pages it out, and gives the machine page to OS2. Then OS1 runs, decides to page, and it is possible that OS1 picks that same page to page out, hence OS1 *touches that page* (pages it back in) only to page it out again

There is a semantic gap here, and we don't want to modify the guest OS.
Then how do we solve the problem?

# 2. Solution: Ballooning

- Can't change the OS, but *can* load a driver into it, called a balloon driver
- When ESX Server wants memory back, the driver *inflates*
  + it allocates pinned pages inside the guest
    > if there is plenty of free memory, pages from the free list are given to the driver
    > if not, the OS pages some out and gives them to the driver (this is a coerced action, but the OS makes the decision)
  + the list of "physical" pages is passed to the VMM
  + the VMM reuses the corresponding machine pages for another OS
- The OS should not touch the physical pages allocated to the balloon driver
  + why? because the corresponding machine pages may have been given to another OS ==> conflict
  + what if it does touch them? The VMM already annotates the page table entry, so there will be a trap if the OS touches the page ==> crash, reboot
  + this is OK, we can rely on h/w protection
- ESX Server may also deflate the balloon; machine pages handed back must be zeroed to avoid leaking information between VMs

*Disadvantages of ballooning*
- the balloon driver may be uninstalled, explicitly disabled, or unavailable while a guest OS is booting
- it may not be fast enough to satisfy current system demands
- upper bounds on balloon sizes may be imposed by various guest OS limitations

*When ballooning is not possible or insufficient*
- choose a VM, choose pages to swap out, as in the traditional case
- no guest OS involvement
- pick randomly

# 3. Sharing Memory

- Goal: share pages across OSes when possible
- *How*:
  + DISCO: modify the OS; IRIX is changed to remap (not copy) when the same page is read from disk — just remap
  + ESX Server: use the *contents* of the page
    > no need to modify the OS
    > may find new opportunities for sharing (e.g. zero-filled pages, or identical pages that never came from the same disk block)

But *Question*: how do we find pages with the same content?
- Option 1: compare all pages --> too costly (O(n*n))
- Option 2: hashing plus a scanning pass (see the sketch below)
  The hash table contains all copy-on-write (COW) pages (and "hint" pages)
  Scan: compute the hash of a page, check the table
  - if the hash matches: do a full compare; if it matches again: use COW remapping
  - if no match: turn it into a COW page?
    + no, too costly, because every subsequent write would trigger a copy
    + instead, record it as a *hint* page
      > recompute its hash upon a later match; if it is a real match, then mark it as COW
      (why recompute the hash? because the page might have changed since the last calculation)

*How do we pick the pages to scan for sharing?*
- Current implementation: randomly
- Additional optimization: attempt to share a page before paging it out to disk (because we save the I/O if that page can be shared)
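A minimal sketch of the scan-and-hash flow described above, assuming an in-memory table keyed by a content hash. `FrameTable`, `read_page`, `mark_cow`, and `remap` are hypothetical helpers, not ESX's API; the real system uses its own hash function and per-frame metadata.

```python
import hashlib

def page_hash(data: bytes) -> str:
    # Stand-in content hash; ESX uses its own hashing scheme.
    return hashlib.sha1(data).hexdigest()

class FrameTable:
    def __init__(self):
        self.shared = {}   # hash -> MPN of a copy-on-write (COW) shared page
        self.hints = {}    # hash -> MPN of a candidate ("hint") page

    def scan_page(self, mpn: int, read_page, mark_cow, remap):
        data = read_page(mpn)
        h = page_hash(data)

        if h in self.shared:
            # Hash matched a COW page: confirm with a full compare, then remap.
            if read_page(self.shared[h]) == data:
                remap(mpn, self.shared[h])      # free mpn, point at the shared copy
                return
        elif h in self.hints:
            other = self.hints.pop(h)
            # The hint page may have changed since it was hashed, so rehash it first.
            if page_hash(read_page(other)) == h and read_page(other) == data:
                mark_cow(other)                  # promote the hint page to a COW page
                self.shared[h] = other
                remap(mpn, other)
                return

        # No match: don't mark COW yet (every write would cost a copy); record a hint only.
        self.hints[h] = mpn
```

The full compare after a hash hit guards against hash collisions; keeping unmatched pages as hints (rather than marking them COW immediately) avoids paying a copy on every write to a page that may never find a partner.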
# 4. Share-based Allocation

- Question: when guest OSes compete for memory, who gets the pages?
- Solution: use tickets (i.e. shares)
  + compute: shares / # allocated pages (S/P) = p
  + revoke from the OS that has the smallest (fewest) p
  ==> intuition: when revoking, take from the guest OS that has
    > few shares (S small)
    > a large number of allocated pages
- Problem: *idle memory*
  + again, a semantic gap
  + an idle but *rich* client (guest OS) could hoard memory while a poor but active one thrashes
- Solution: idle memory tax
  + idea: take pages from idle clients first
  + assume f is the fraction of active pages for a client; now p is recomputed as
    p = S / P'   where   P' = P * (f + (1-f) * 1/(1-t))
    k = 1/(1-t) is the idle page cost (the price you pay for idle pages), where t is the tax rate
    > t = 0, i.e. no tax ==> same as standard share-based allocation
    > t --> 1 ==> more aggressive reclamation of idle memory
    default t = 3/4, so k = 4: each idle page counts as 4 pages when computing the ratio
- Problem: how do we determine idle memory?
  + we could rely on OS stats (the driver could query them and pass them down) ==> not practical... why?
    + different guest OSes expose diverse activity metrics
    + those metrics tend to focus on per-process working sets (not the overall VM)
    + guest OS monitoring typically relies on access bits in page table entries, which are bypassed by DMA from devices
  + Solution: statistical sampling
    1. pick n pages (say 100), invalidate their mappings
    2. on access, take the fault, re-establish the mapping, and increment a counter t (just the sample count, not the tax rate)
    3. use a sampling period, say 30 seconds
    ==> t/n: estimate of f, the fraction of pages in use
  Questions:
    1) How do we pick the size n and the sampling period? E.g. if the period is too long, then probably all pages get accessed
    2) How do we pick which n pages? Randomly... why? If we always picked, say, 100 code pages, they would be executed all the time, so the sample would not reflect overall memory usage

# 5. Allocation policy (how to allocate memory to a guest OS)

- parameters: each guest OS has 3 parameters:
  + min: guaranteed allocated memory (i.e. machine memory)
  + max: the amount of "physical" memory the guest is configured with
  + shares: the number of memory shares (drives the proportional allocation)
- admission control: sufficient resources are required before a VM is powered on (min + overhead memory). What are those overheads?
  > additional data structures such as the pmap and shadow page tables
  > the VM graphics frame buffer
- dynamic reallocation:
  + memory allocations are recomputed dynamically in response to events (e.g. system-wide or per-VM parameter changes, VM boot, shutdown)
  + the VMM maintains a minimum amount of free memory, with 4 *thresholds*:
    > high 6% ==> all OK
    > soft 4% ==> use balloons to reclaim
    > hard 2% ==> forcible paging
    > low 1%  ==> keep paging, and halt VMs that are above their target allocations

# 6. I/O page remapping

- Some processors support addressing "high" memory above the 4GB boundary
- I/O involving high memory requires copying the data through a temporary bounce buffer in "low" memory, which is costly
- Problem: a page in a guest OS (configured with less than 4GB of "physical" memory) can be mapped to a machine page that resides in high memory; I/O to this page is costly because of the extra copy
- Solution:
  1) track hot pages in high memory that are involved in repeated I/O operations
  2) remap those hot pages to low memory
  Question: how do we know which page is hot? Use a counter (see the sketch below)
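A minimal sketch of the hot-page counter idea, assuming a per-MPN I/O counter and an arbitrary cutoff. `HIGH_MEM_BOUNDARY`, `HOT_IO_THRESHOLD`, and `remap_to_low_memory` are illustrative names and values, not ESX's actual interfaces.

```python
from collections import defaultdict

HIGH_MEM_BOUNDARY = 4 << 30   # 4 GB: I/O to pages above this needs a bounce buffer
HOT_IO_THRESHOLD = 16         # illustrative cutoff, not ESX's actual value

class IORemapper:
    def __init__(self, remap_to_low_memory):
        self.io_count = defaultdict(int)           # MPN -> number of recent I/Os
        self.remap_to_low_memory = remap_to_low_memory

    def on_io(self, mpn: int, machine_addr: int):
        # Only high-memory pages pay the bounce-buffer copy cost.
        if machine_addr < HIGH_MEM_BOUNDARY:
            return
        self.io_count[mpn] += 1
        if self.io_count[mpn] >= HOT_IO_THRESHOLD:
            # Page is "hot": remap it below 4 GB so future I/O avoids the copy.
            self.remap_to_low_memory(mpn)
            del self.io_count[mpn]
```

The counter identifies pages that repeatedly go through the bounce buffer; remapping such a page once saves a copy on every subsequent I/O to it.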