MY NOTES, 2nd REVISION
======================

* What are the real problems this paper is trying to solve?
  - portability of a memory management system
  - can run on any *paged* uniprocessor or multiprocessor system
  - regardless of the type of hardware support:
    + per-task page table
    + inverted page table
    + hybrid: segment & page tables, and so forth

* What are the major features that Mach supports?
  - large, sparse virtual address space (we will see how this is obtained)
  - copy-on-write virtual copy operations
    + mechanism: shadow memory objects
    + problem: object chains. Solution?
  - copy-on-write and read-write memory sharing between tasks
    + shadow objects (for COW)
    + sharing maps (for read/write sharing)
  - memory-mapped files
    + the pager is the file system
    + indeed, a memory object in Mach resembles a memory-mapped file in Unix
  - user-provided backing store objects and pagers
    + why useful: the user app knows best, end-to-end
    + user apps know best about page replacement policy
      ~ e.g. database: sequential scan --> MRU is best

* Mach provides a machine-independent virtual memory management system. What are the challenges? Assumptions? What is Mach's high-level idea for tackling this problem?
  - The big challenge is to build a portable memory management system, regardless of underlying hardware support,
    + while facing the performance penalty of *the curse of generality*
  - High-level idea:
    + make as few assumptions about the underlying hardware support as possible
    + separate software management from hardware support
      ~ split the system data structures into two parts:
        machine-INDEPENDENT data structures
        machine-DEPENDENT data structures: small, easy to change
  - So what assumptions remain?
    + paged architecture
    + able to handle and recover from page faults

* What does the address space of a Mach task look like?
  - Since Mach makes few assumptions about hardware memory support, an address space in Mach consists of an ordered collection of mappings to memory objects.
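The machine-independent/machine-dependent split above can be sketched in a few lines: the machine-independent VM code handles a fault by consulting its own tables, then installs the result through a small fixed pmap interface. This is a minimal Python simulation with invented names (`Pmap`, `handle_fault`), not the paper's actual interface:

```python
class Pmap:
    """Machine-dependent side (hypothetical sketch): only the mapping
    state the hardware needs. A real pmap may be incomplete, since the
    machine-independent code can reconstruct mappings at fault time."""
    def __init__(self):
        self.mappings = {}                    # va -> (pa, prot)

    def enter(self, va, pa, prot):            # validate one mapping
        self.mappings[va] = (pa, prot)

    def remove(self, va):                     # invalidate one mapping
        self.mappings.pop(va, None)

    def translate(self, va):                  # None == hardware would fault
        return self.mappings.get(va)

def handle_fault(pmap, address_map, va):
    """Machine-independent fault path: look the address up in the
    machine-independent tables, then install it through the pmap."""
    for start, end, pa_base, prot in address_map:
        if start <= va < end:
            pmap.enter(va, pa_base + (va - start), prot)
            return True
    return False                              # address not mapped in this task
```

Porting then means rewriting only the `Pmap` side for the new hardware, which is the paper's portability claim.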
  - A typical task address space consists of five mappings:
    + code memory object
    + heap
    + stack
    + initialized data
    + uninitialized data
  - You can think of these mappings as the descriptor segment in a Multics process; a memory object is similar to a Multics segment, but more general.
  - This type of mapping enables a sparse, large virtual address space.

* List and briefly describe the data structures in the Mach memory management system.
  - Resident page table:
    + one table system-wide
    + keeps track of information about machine-independent pages
    + each entry may be linked into several lists at the same time:
      ~ a memory object list:
        > tells which memory object the page belongs to
        > speeds up deallocation and virtual copy operations
        Note: a memory object may additionally contain pages in backing store
      ~ a memory allocation queue
      ~ an object/offset hash bucket:
        > for fast lookup of the physical page associated with a memory object/offset pair
        NOTE: byte offsets into memory objects are used, to avoid a dependency on hardware page size
  - Address map:
    + per task
    + represents a task's address space
    + contains a sorted linked list of address map entries, each mapping a contiguous virtual range to a contiguous area of a memory object
    NOTE: sharing is easy: to share the same memory object, two tasks can each have a map entry pointing to that memory object.
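The sharing note above can be made concrete with a small simulation. Two tasks' address maps each hold an entry pointing at the same memory object, so a write through one task's mapping is visible through the other's. This is a simplified sketch with invented names (it ignores sharing maps, protection, and paging):

```python
class MemoryObject:
    """A memory object as a *repository* of data, indexed by byte."""
    def __init__(self, size):
        self.data = bytearray(size)

class MapEntry:
    """One address map entry: a virtual range mapped to a byte offset
    within a memory object."""
    def __init__(self, start, size, obj, offset=0):
        self.start, self.size = start, size
        self.obj, self.offset = obj, offset

def _lookup(amap, va):
    for e in amap:                       # a real address map is sorted; a
        if e.start <= va < e.start + e.size:  # linear scan suffices here
            return e
    raise MemoryError("address not mapped")

def vm_write(amap, va, value):
    e = _lookup(amap, va)
    e.obj.data[e.offset + (va - e.start)] = value

def vm_read(amap, va):
    e = _lookup(amap, va)
    return e.obj.data[e.offset + (va - e.start)]
```

For example, two tasks can map the same object at different virtual addresses, and a byte written by task 1 is read back by task 2 at the corresponding offset.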
    + does not need to worry about backing store
    + each entry contains protection and inheritance attributes
  - Memory object:
    + a collection of data provided and managed by a server (could be a user task or a kernel task) which can be mapped into the address space of a task
    + logically, a memory object is just a *repository* of data, indexed by byte
      ~ resembles a Unix file (just a byte-interface object)
    + physically, the data of a memory object can reside both in memory and in backing store
      ~ the kernel manages in-memory pages
      ~ pagers manage page-in and page-out to backing store
      ~ the kernel and pagers communicate through a well-defined interface
    + memory object data structure:
      ~ a unique paging_name
      ~ 2 ports: for kernel and pager communication
      ~ a list of pages currently in primary memory
  - pmaps = physical address maps:
    + (on the VAX it is a page table; on x86, a set of allocated segment registers)
    + contain machine-dependent code (page-level operations)
    + implement the machine-independent/machine-dependent interface
    + need not keep full accounting (such as all currently valid mappings)
      ~ the info can be reconstructed at fault time by the machine-independent code
      ~ optimizations can be made
        > e.g. delay invalidate operations
    + kernel mappings must always be complete and accurate

* Why do we say that primary memory in Mach is treated as a cache for the contents of virtual memory objects?
  - The way a task sees memory is through its address map: just a set of entries mapping into contiguous ranges of memory objects.
  - To a task, a memory object resembles a Unix file, or a *repository* of data, with read/write operations.
  - In terms of implementation, a memory object's data can be in main memory or in backing store. This resembles a Unix file, where part of the file's content may be in the in-memory buffer cache.

* What is the benefit of a separate pager for each memory object?
  - reduces kernel overhead
  - more important, apps know best and have the best strategy for page replacement, which can improve performance.
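The "database scan --> MRU best" point can be illustrated with a small simulation (illustrative parameters, not from the paper): repeated sequential scans over a file slightly larger than memory are the worst case for LRU, because the page needed next is always the one just evicted, while MRU eviction keeps most of the file resident.

```python
def count_faults(pages, frames, evict):
    """Simulate a page cache; `evict(resident)` picks a victim.
    `resident` is kept in LRU order: index 0 least recent."""
    resident, faults = [], 0
    for p in pages:
        if p in resident:
            resident.remove(p)
            resident.append(p)           # refresh recency on a hit
        else:
            faults += 1
            if len(resident) >= frames:
                resident.remove(evict(resident))
            resident.append(p)
    return faults

def lru(resident): return resident[0]    # evict least recently used
def mru(resident): return resident[-1]   # evict most recently used

scan = list(range(10)) * 5               # 5 sequential passes over 10 pages
```

With 8 frames, LRU faults on every single reference of the cyclic scan, while MRU faults far less often; a kernel-fixed LRU-like policy cannot exploit this, but a user-provided pager can.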
  - Example: ...

* Describe where garbage collection is done in Mach.
  - First, each memory object has a refcount, the number of address map entries currently referencing the object. When the refcount reaches 0, the kernel garbage collects the memory object.
  - Second, repeated copy-on-write can create a lot of intermediate shadow objects. The kernel garbage collects these objects when it finds that an intermediate shadow is no longer needed.

* Mach on multiprocessors. One of the challenges is handling TLB inconsistency. Why does the problem arise, and how does Mach handle it?
  - Why the problem? In short: lack of hardware support.
    + None of the multiprocessors running Mach support TLB consistency.
    + To guarantee such consistency when changing virtual mappings, the kernel must determine which processors have an old mapping in a TLB and cause it to be flushed.
    + Unfortunately, it is impossible to reference or modify a TLB on a remote CPU on any of the multiprocessors which run Mach.
  - Solutions, employed in different settings:
    + simply allow temporary inconsistency
    + forcibly interrupt all CPUs so that their TLBs may be flushed
      ~ used when the change is critical and must propagate at all costs
    + postpone use of a changed mapping until all CPUs have taken a timer interrupt (and had a chance to flush their TLBs)

* What are the weaknesses of Mach?
  - Copy-on-write through shadow memory objects is complex
  - sometimes it is hard to detect and garbage collect intermediate shadows that are no longer needed

* Mach tries to make memory management portable, i.e. able to run on many kinds of hardware with few hardware-support assumptions. You learned from Exokernel that to improve performance, the OS should expose hardware to applications; otherwise, the OS suffers from the curse of generality. But Mach seems to provide a general solution that works for all types of hardware support. Why (at least as shown in the paper) is performance overhead not a problem in Mach?
  - Mach supports a general VM interface, regardless of underlying hardware support. Mach is also a microkernel, where almost every operation is done by message passing.
  - But memory management in Mach is still efficient because:
    + virtual memory management is integrated with the message-oriented communication facility. This integration allows large amounts of data, including whole files and even whole address spaces, to be sent in a single message with the efficiency of simple memory remapping
    + more important, each memory object in Mach is managed by a pager, so apps are free to choose the best page-in/page-out policy for their needs.

* Give a simple example of a shared mapping in Mach.
  - 2 tasks want to access a memory object in read/write sharing mode
  - i.e. the tasks can read or write that object without triggering a copy-on-write (for example, two tasks reading the same code, or sharing a memory-mapped file)
  - Let's call this memory object M.
  - In each task's address map, there will be an entry mapped to a *sharing map SM*.
  - So the address map of task 1 looks like: { ... entryX --> SM ... }
  - The address map of task 2 looks like: { ... entryY --> SM ... }
  - And the content of sharing map SM is: SM --> M

&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

**Machine-independent Virtual Memory management**
**for paged Uniprocessor and Multiprocessor Arch**
==================================================

# 0.
Take away:

*Main contributions:
  - portability:
    + separation of machine-independent and machine-dependent memory management code
    + fixed, simple hardware interface (16 required pmap routines)
  - integration: memory management and communication (message passing)
    + makes message passing efficient
*Minor contributions:
  - capability:
    + advanced memory management in software
  - customization:
    + user-defined memory handlers
*Strengths:
  - Porting requires altering only a single module, and has been successfully accomplished by a novice C programmer
  - Sharing facilities improve parameter-passing possibilities
  - External memory objects with paging services support distributed systems as well as arbitrary backing storage
  - Message passing is efficient because of the integration
*Weaknesses:
  - Shadow chains can be lengthy and redundant. Kernel garbage collection may be incomplete and requires "complex" locking on multiprocessors (suggesting overhead and possible contention).
  - Paging services at user level require context switching away from the kernel fault handler and then back again
  - TLB consistency issues

# 1. Mach OS Basics: Five abstractions
- task: resource allocation unit
  + think of a UNIX process
  + paged virtual address space
- thread: CPU utilization unit
  + each has an independent PC within its task
- port: protected queue for messages
  + like object references
  + an operation on an object is done by sending a message to its port
- message:
  + typed collection of data objects used in communication
  + any size (you will see the integration with memory management)
- memory object: managed data collection
  + can be mapped into a task's address space

*Problem*: message passing is generally slow --> how does Mach make it fast?
Solution: integration of message passing with virtual memory management:
  + a whole address space can be sent in a single message (with the efficiency of simple memory remapping)

# 2.
Virtual memory features:
- each task has address ranges mapping to memory objects
- supports standard operations:
  + allocate/deallocate
  + set protection
  + inheritance (when creating a child)
  ...
- copy-on-write sharing between (unrelated) tasks
- read/write sharing with a child task (through the inheritance attribute)
  + shared: read and write
  + copy: copy by value
  + none: page is not passed to the child
- protection is per page
- "Pager" tasks associated with memory objects control faults and page-outs

# 3. Data structures
- separation of machine-independent and machine-dependent structures
- memory paging:
  + physical page != hardware page
    > physical page size is a boot-time parameter
    > must be a power-of-2 multiple of the hardware size
  + physical pages != memory object pages
    > object handlers may use their own policies (???)
  + physical pages = machine-independent pages (virtual pages)
- Four basic data structures:
  1. resident page table:
     + indexed by physical page number (i.e. machine-independent pages)
  2. address maps:
     + virtual ranges to object regions
  3. memory objects:
     + backing storage, managed by kernel or user
  4. pmap:
     + *machine-dependent mapping data structure*
  Note: as we can see, the implementation is split between machine-*independent* and machine-*dependent* sections.
- Dependent code: implements only the operations necessary to create, update and manage the hardware-required mapping data structure
- Independent code: implements VM operations, maintains VM information

# 3.1.
Resident page table
- each entry keeps information about a physical (machine-independent) page
  + modified and reference bits
  + pointers (to link it into several lists, see below)
  + etc.
- each entry can be linked into 3 lists:
  + object list: links the entries belonging to each memory object (to speed up object deallocation and virtual copy operations)
  + allocation queue: for free, reclaimable and allocated pages (used by the Mach paging daemon)
  + hash buckets keyed on object & offset:
    > maintain the object/offset --> physical page mapping
    > byte offsets are used to avoid tying the implementation to, i.e. making it aware of, the hardware page size; physical pages (boot-time configurable) are independent of the hardware page size
Note: this is NOT like a "normal" page table (which contains VPN --> PFN mappings). It is more like a status table keeping information about each physical page (in some sense, it is like an inverted page table).

# 3.2. Address maps (per task)
- doubly linked list of entries sorted by virtual address
- each address map entry:
  + maps a virtual address range to byte offsets within a memory object
  + contains inheritance and protection attributes
- last-fault "hint" pointers may be kept
  + for fast lookup on faults (these hints allow the address map list to be searched starting from the last entry found for a fault of a particular type)
- like a normal page table for each task (i.e. process)
  + each entry is like a mapping provided by segmentation
  ==> hence, an address map is small
  For example, an address map might look like this:
    0 - 15: code memory object
    16 - 31: heap memory object
    32 - 47: stack memory object
Note: this is a level of indirection: because we don't know the details of the hardware page size, we just map a range of virtual addresses to a memory object, which in turn is an abstraction of physical memory. Hence, address maps do not need to keep track of backing store (page-in, page-out, etc.); that task is done by the memory object implementation.

# 3.3.
Memory Object
- reference counters allow garbage collection
  + a cache is maintained for rapid reuse
  + pagers may request caching
- the kernel manages pages in primary memory
- the pager handles store and fetch of pages NOT in primary memory
- standard ports and a fixed interface govern pager--kernel communication

*Question*: is this (having separate pagers, possibly outside the kernel, handle page faults) a good idea?
2) I assume there will be a mapping to keep track of which pages belong to which memory object (isn't each memory object now like a real page table?)
  - ok, found it: the kernel maintains this mapping

# 3.4. Sharing
- Copy-on-write: a "shadow object" is created
  + if there is no write, the shadow object contains only a link to the source (i.e. the original)
  + on write, it contains the updates (the modified pages) and the link (for the unmodified ones)
  + a chain develops with repeated copies
  + the kernel collects unneeded intermediates
  ==> this is not appropriate for read/write sharing (why?)
- Read/write: indirection via a "sharing map"
  + equivalent to an address map entry
  + pointed to by the address maps of the sharing tasks (hence an address map entry can point to either a sharing map or a memory object)

*Question*???: Sankar, I don't understand this much (the sharing map)

Note:
- copy-on-write leads to shadow objects ==> chains can form; a garbage collector can optimize by detecting unnecessary shadow objects
- read/write sharing leads to sharing maps (see Leo's picture)

# 3.6: Machine dependence: the pmaps
- pmaps = physical address maps
  + (on the VAX it is a page table; on x86, a set of allocated segment registers)
  + contain machine-dependent code (page-level operations)
  + implement the machine-independent/machine-dependent interface
- need not keep full accounting (such as all currently valid mappings)
  + the info can be reconstructed at fault time by the machine-independent code
  + optimizations can be made
    > delay invalidate operations
- kernel mappings must always be complete and accurate

# 4. Porting Mach VM
- not important

# 5.
Dealing with uniprocessors/multiprocessors
This section describes how Mach VM deals with different hardware support for memory (to illustrate Mach's PORTABILITY).

# 5.1. Uniprocessor
- VAX: supports a large page table per task
  Mach: keeps page tables in memory, constructing only those parts needed for current use
- RT: does not use a per-task page table, but instead an inverted page table (mapping physical page --> virtual page)
  + drawback: difficult to share pages; physical pages shared by multiple tasks can cause extra page faults, with each page being mapped and remapped for the last task which referenced it
  ==> Mach: treats the inverted page table like a cache for the TLB
- Sun: combination of segmentation and page tables (i.e. a page table per segment)

# 5.2. Multiprocessor
- TLB cache consistency, handled by Mach differently in different settings:
  + simply allow temporary inconsistency
  + forcibly interrupt all CPUs so that their TLBs may be flushed
  + postpone use of a changed mapping until all CPUs have taken a timer interrupt
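The third strategy above (postponing use of a changed mapping) can be sketched with per-CPU interrupt counters: after a mapping change, record each CPU's current counter, and do not reuse the old physical page until every CPU has taken at least one timer interrupt past its snapshot (and therefore flushed its TLB). A minimal Python simulation with invented names, assuming a machine that flushes the TLB on every timer interrupt:

```python
class Cpu:
    def __init__(self):
        self.epoch = 0        # bumped on every timer interrupt
        self.tlb = {}         # va -> pa; stale entries may linger here

    def timer_interrupt(self):
        self.epoch += 1
        self.tlb.clear()      # assumed: this machine flushes the TLB here

def mapping_changed(cpus):
    """Snapshot every CPU's epoch at the moment a mapping is changed."""
    return [c.epoch for c in cpus]

def safe_to_reuse(cpus, snapshot):
    """A stale translation may survive on any CPU that has not taken a
    timer interrupt since the snapshot."""
    return all(c.epoch > e for c, e in zip(cpus, snapshot))
```

This avoids interrupting remote CPUs (which the paper notes is sometimes impossible or too costly) at the price of delaying page reuse by up to one timer tick.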