# Exokernel: An OS Architecture for Application-Level Resource Management

# 0. Short summary
- The application knows best; fixed abstractions at the kernel level therefore limit app performance and flexibility.
- The question: how to maximize application freedom?
- Solution: push the abstraction (a low-level interface) as close to the hardware as possible and give the application the freedom to do its own resource management, i.e., move resource management out of the kernel.
- How do they do it? Separate protection from management: export a very low-level interface to hardware resources and securely multiplex access for protection.

This whole paper is about why application-level resource management is good, how the exokernel provides a protected low-level interface, and how it exports information to applications, so that each app can decide what is best for itself.

# 1. Introduction
=================
- Traditional OS:
  + hides information about machine resources behind high-level abstractions (processes, files, IPC, address spaces)
  + hard to do domain-specific optimization (e.g., LRU is not good for databases)
  + discourages changes to the implementation of existing abstractions
  + restricts the flexibility of application builders, since new abstractions can only be added by awkward emulation on top of existing ones (one e.g.: user-level threads implemented on top of a process)
- Solution: *application-level resource management*
  + move resource management to the untrusted application level
  + securely multiplex hardware resources to library OSes (libOSes) above
  + each libOS implements the policies, using the interface the exokernel provides
- Why does this help?
  + increases application performance: end-to-end argument; the app knows best, and apps get more control over how resources are used
    ~ e.g., LRU in memory management hurts certain database workloads
  + low-level primitives can be implemented efficiently
    ~ e.g., secure bindings mostly require a table to track ownership
    ~ highly reliable and easy to maintain, because the kernel stays small
  + more flexibility for implementors of high-level abstractions
- How do they do it?
  + goal: separate protection from management
  + export hardware resources rather than emulating them (as in a virtual machine); VMs have high overhead (time, space, etc.)
    ~ secure bindings: an app forms a secure binding to a resource; ownership is tracked and checked on every access
    ~ visible revocation: why? because apps need to participate in the revocation protocol in order to behave correctly and efficiently (e.g., when the OS wants to revoke a page, the app needs to know that)
    ~ abort protocol: if an app fails to respond to revocation in time, the OS forcibly breaks the binding
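To make this concrete before the details: below is a minimal C sketch of what an exokernel-style interface could look like. Every name here (`exo_alloc_page`, `exo_bind_page`, `exo_on_revoke`, ...) is invented for these notes; the real Aegis primitives (section 5) differ.

```c
/* Hypothetical sketch of an exokernel-style interface: low-level,
 * resource-exposing calls plus upcalls for visible revocation.
 * All names are invented for illustration, not Aegis's real API. */

typedef unsigned long ppn_t;   /* physical page number */
typedef unsigned long cap_t;   /* capability guarding a resource */

/* Allocation exposes physical names: the caller asks for a page and
 * learns exactly which physical page it got. */
int  exo_alloc_page(ppn_t *out_page, cap_t *out_cap);

/* Secure binding: authorization is checked once, at bind time;
 * later accesses only need a cheap ownership-table lookup. */
int  exo_bind_page(ppn_t page, cap_t cap, void *vaddr);

/* Visible revocation: the kernel asks the libOS to give a page back,
 * instead of silently paging it out behind the app's back. */
typedef void (*revoke_handler_t)(ppn_t page);
void exo_on_revoke(revoke_handler_t handler);

/* Abort protocol: if the libOS does not respond in time, the kernel
 * breaks the binding itself and records the loss for the libOS. */
void exo_force_revoke(ppn_t page);
```

The design choice to notice: there is no `read_file()` or `fork()` here; everything above the page/capability level lives in the libOS.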
# 2. Motivation for Exokernels
==============================
- Traditional OSes centralize resource management in a set of fixed abstractions that cannot be changed by untrusted software.
- Problems with fixed high-level abstractions (3 points):
  + they hurt application performance: no single fixed abstraction is good for every app ==> one size does not fit all
    ~ e.g., databases vs. LRU paging
    ~ e.g., Pilot: initial I/O access is done through memory-mapped files only; this is bad for reading and writing extremely large files
  + they hide information from applications:
    ~ hidden info: low-level exceptions (page faults), timer interrupts, access to raw I/O devices, etc.
    ~ hence, it is difficult for an app to implement its own resource management
    ~ example: a user-level thread library lacks system integration and does not know when a page fault occurs (as in the Scheduler Activations paper)
  + they limit the functionality of apps:
    ~ because the fixed abstractions are the only interface apps can use
    ~ and changes to these abstractions occur rarely
- End-to-end argument:
  + apps know more about the goals of their resource-management decisions
  + hence, the OS should give as much control as possible to apps
  + Exokernel: securely multiplexes and exports physical resources through low-level primitives
  + a libOS implements high-level abstractions on top of those low-level primitives, fitted to its app's functionality and performance goals
    ~ e.g., the page-table structure can vary among libOSes: an app can select a library with the particular page-table implementation that best suits its needs
- LibOS:
  + implements the high-level abstractions
    ~ simpler and more specialized than an in-kernel implementation, because it need not multiplex a resource among competing apps with widely different demands
  + not trusted by the exokernel, but free to trust its app
    ~ e.g., if an app passes a wrong argument to its libOS, only that app is affected
  + runs in the app's address space ==> minimizes the number of kernel crossings
  + can provide portability and compatibility
- An exokernel can provide backward compatibility (i.e., when you want to run an existing OS rather than a libOS on top of the exokernel) by:
  + binary emulation of the OS and its programs
  + implementing the OS's hardware abstraction layer on top of the exokernel
  + re-implementing the OS's abstractions on top of the exokernel

# 3. Exokernel Design
=====================
- Goal: give libOSes freedom in managing physical resources, while protecting them from each other.
- Solution: separate protection from management through a low-level interface:
  + track ownership of resources (for secure bindings)
  + ensure protection by guarding all resource usage
  + revoke access to resources
- How:
  + secure bindings: a libOS can securely bind to machine resources
  + visible revocation: libOSes participate in a revocation protocol
  + abort protocol: the exokernel can break a libOS's secure bindings by force

# 3.1 Design principles
- Securely expose hardware resources, including privileged instructions:
  + examples:
    ~ TLB, physical memory, CPU, disk
    ~ exceptions, interrupts, cross-domain calls
  + avoid resource management:
    ~ manage resources only to the extent required by protection
    ~ hence let apps manage resources themselves; more flexible
- Expose allocation:
  + allow a libOS to explicitly request physical resources
  + e.g.:
    ~ a libOS can ask for specific physical pages
    ~ traditionally, an app asks for a page and the OS decides which page to hand out, so the app never knows where the page is physically located
  + implications:
    ~ no resource is implicitly allocated
    ~ seems to get rid of a level of indirection
  + but now apps need to be aware of physical resources; how? expose names
- Expose names:
  + export physical names --> efficient, because there is no indirection
  + expose bookkeeping data structures:
    ~ e.g., free lists, disk arm position, and cached TLB entries
    ~ apps can tailor their allocation requests to the available resources (see the sketch after this subsection)
- Expose revocation:
  + allow a libOS to choose which instance of a specific resource to relinquish
  + the libOS knows best, so it can deal with revocation efficiently

*Policy*:
+ implemented mostly in the libOS (management of resources)
+ but what if a malicious libOS tries to grab all resources from the others?
+ hence, the exokernel enforces some policy to arbitrate among libOSes
  ~ e.g., allocation and revocation of resources
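A minimal sketch of what "expose allocation" plus "expose names" could look like from a libOS's point of view. The structures and calls (`exo_freelist`, `exo_alloc_specific_page`) are hypothetical illustrations of the principle, not the paper's actual interface.

```c
#include <stddef.h>

/* Hypothetical sketch: a libOS reads an exposed free list and asks for a
 * specific physical page, instead of receiving an anonymous one. */

typedef unsigned long ppn_t;
typedef unsigned long cap_t;

/* Bookkeeping structure the exokernel exposes read-only to applications. */
struct exo_freelist {
    size_t nfree;
    ppn_t  pages[1024];
};

extern const struct exo_freelist *exo_freelist;   /* mapped read-only */

/* Request one particular physical page; fails if it is already owned. */
int exo_alloc_specific_page(ppn_t page, cap_t *out_cap);

/* The libOS tailors its request to the exposed state, e.g. picking a page
 * whose "color" (low bits of the frame number) fits its cache layout. */
static int alloc_page_with_color(unsigned color, ppn_t *out, cap_t *cap)
{
    for (size_t i = 0; i < exo_freelist->nfree; i++) {
        ppn_t p = exo_freelist->pages[i];
        if ((p & 0x7) == color && exo_alloc_specific_page(p, cap) == 0) {
            *out = p;
            return 0;
        }
    }
    return -1;   /* no suitable page free */
}
```

Note how the policy (page coloring) lives entirely in application-level code; the kernel only checks that the requested page is actually free.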
# 3.2 Secure bindings
- Decouple authorization from the actual use of a resource, i.e., check authorization at bind time, not at every use.
- Purpose: securely multiplexing resources.
- Improves performance. Why?
  + the protection check is a simple operation, hence quick
  + management is decoupled from protection: the kernel understands resource semantics at bind time, then efficiently checks access (without knowing the semantics) at access time; how? via ownership tables ==> fast
- Primitives are needed to express these protection checks; they are implemented using 3 techniques:

*Hardware support*
------------------
- Allows secure bindings to be couched as low-level protection operations, so that later operations can be checked efficiently without recourse to high-level authorization info.
  + capabilities for physical pages:
    ~ a file server buffers data in memory
    ~ the file server lets apps access the data by handing out capabilities to the pages
    ~ the exokernel enforces capability checking without needing any authorization information from the file server
  + a frame buffer that associates an ownership tag with each pixel:
    ~ apps can access the frame-buffer hardware directly
    ~ because the hardware checks the ownership tag when I/O takes place
  + we see something similar with tagged TLBs, or TLB entries themselves

*Software caching*
------------------
- Cache bindings inside the exokernel (where the hardware cannot accommodate them).
- E.g., a software TLB can be viewed as a cache of frequently-used secure bindings.

*Downloading code into the kernel*
----------------------------------
- The downloaded code is invoked on every resource access or event:
  + to determine ownership and the actions the kernel should perform
  + e.g., a packet filter:
    ~ interprets the semantics of a packet, and thus knows the destination app
    ~ otherwise, the kernel would have to poll every app in the system
- Avoids expensive crossings: the code can run even when the app is not scheduled.
- But risky: what if the code is malicious?
- Solutions:
  + type-safe languages
  + sandboxing

Below are examples:

*Multiplexing physical memory*
- A libOS requests a physical page; the exokernel creates a binding, recording ownership and the read/write access capabilities.
- Every access to a physical page is guarded and checked for valid capabilities.
- A large software TLB can be used to improve performance.
- Using capabilities, an app can grant access to other apps --> easy sharing.
Note: if the underlying hardware defines a page-table interface, the implementation may differ, but the principles do not change: privileged machine operations such as TLB loads and DMA must be guarded by the exokernel.
How to break a secure binding: flush all of its TLB mappings.

*Multiplexing the network*
- Hard: message interpretation depends on the protocol.
- Solution: download packet filters into the kernel. Can this code be trusted? :)
  + a filter is the implementation of a secure binding for application code
  + improves performance:
    ~ avoids kernel crossings
    ~ a packet filter can run even when its app is not scheduled, since the execution of downloaded code can be readily bounded
    ~ the kernel can use packet filters to demultiplex messages irrespective of which application is scheduled (otherwise it would have to schedule each potential consumer)
- Problems:
  + the code may not be trustworthy, hence the kernel needs to:
    ~ bound its runtime
    ~ insert runtime checks to cope with wild memory references and unsafe operations
  + a filter can "lie" and accept packets destined for another process
    ~ here, trust is still needed (see the filter sketch below)
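To make "downloading code" concrete, here is a toy packet filter in C sketching the kind of predicate a libOS might hand the kernel: it claims only UDP packets addressed to one port. Real exokernel filters (and DPF, section 5.6) are expressed in a restricted language the kernel can check and bound; this plain-C version only illustrates the demultiplexing logic, and `MY_PORT` is an arbitrary example value.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy packet filter: "accept IPv4/UDP packets whose destination port is
 * MY_PORT". A real downloaded filter would be written in a restricted,
 * checkable language so the kernel can bound its runtime and memory
 * accesses. */

#define MY_PORT 7000u

static int my_filter(const uint8_t *pkt, size_t len)
{
    if (len < 14 + 20 + 8)                   /* Ethernet + IPv4 + UDP hdrs */
        return 0;
    if (pkt[12] != 0x08 || pkt[13] != 0x00)  /* EtherType must be IPv4 */
        return 0;
    const uint8_t *ip = pkt + 14;
    if ((ip[0] >> 4) != 4 || ip[9] != 17)    /* IP version 4, protocol UDP */
        return 0;
    size_t ihl = (ip[0] & 0x0f) * 4;         /* IP header length in bytes */
    if (len < 14 + ihl + 8)
        return 0;
    const uint8_t *udp = ip + ihl;
    uint16_t dport = (uint16_t)((udp[2] << 8) | udp[3]);
    return dport == MY_PORT;                 /* 1 = deliver to this app */
}
```

The kernel runs such filters on every arriving packet; the first one that returns nonzero identifies the owning application, with no need to schedule anyone first.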
- Application-specific safe handlers (ASHes):
  + participate in message processing
  + associated with a packet filter; run upon packet reception
  + can initiate a message --> reduces round-trip latency (by decoupling sending from scheduling)

# 3.3 Visible Resource Revocation
- Invisible revocation (as in traditional OSes):
  + fast, because there is no application involvement
  + but the libOS cannot guide deallocation; it has no idea which resources are scarce --> hurts the performance of the running app
  + good when revocation is very frequent (because it is fast)
- Visible revocation:
  + the libOS participates in a revocation protocol
  + good for app performance, because the libOS guides the deallocation
  + good when revocation is not very frequent (since it requires involvement)
Example: the exokernel revokes physical page "5"; since this is visible, the libOS knows it and updates any of its table entries that refer to this page.

# 3.4 Abort protocol
- The exokernel revokes resources forcibly if the libOS does not respond to revocation.
- Abort:
  + kill the libOS? No: it is hard to detect when the real-time bound has been violated
  + instead, break all existing secure bindings to the resource and inform the libOS
- A repossession vector is used to inform the libOS --> the libOS takes whatever action is appropriate.
- What about stateful resources, say a dirty page?
  + the libOS can give the exokernel the names of, and capabilities for, the disk blocks to use as backing store
- Each libOS holds a small number of resources that will never be repossessed.
  Why? To guarantee the vital resources a libOS needs to store its bootstrap information.

# 4. Status and Experimental Methodology
========================================
Not that important.

# 5. Aegis: an Exokernel
========================
The detailed implementation; how they do it.

# 5.1.1 Processor Time Slices
- The CPU is represented as a linear vector, where each element corresponds to a time slice.
- Scheduling round-robins through the vector of time slices.
- A timer interrupt tells the app of an imminent context switch, so the app is responsible for its own context switch: saving registers, releasing locks, etc. ==> more app freedom over context switching.
- Position in the vector is used to trade off latency for throughput; a libOS can allocate appropriate slices:
  + for example, a long-running scientific application could allocate contiguous time slices to minimize context-switch overhead, while an interactive application could allocate several equidistant time slices to maximize responsiveness.
- Fairness: bound the time an app may spend saving its context.

# 5.1.2 Processor Environment
- A structure storing the information needed to deliver events to an app:
  + exceptions
  + interrupts
  + protected control transfers
  + addressing context: guaranteed mappings

# 5.2 Base Costs
Aegis's primitive operations are fast (compared to Ultrix). Why?
- Aegis does not map its own data structures; it uses physical addresses directly (no page tables for them, hence no TLB exceptions) ==> removes a level of indirection, hence faster.

# 5.3 Exceptions
- The exokernel dispatches exceptions to the application.
- The application resumes execution immediately after handling an exception, without entering the kernel ==> fewer kernel crossings.
- But how can that work? The app needs direct access to the exception state --> all registers that are saved must live in user-accessible memory locations (see the sketch below).
- Again, the exokernel does not map its data structures --> hence fast.
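A hypothetical sketch of application-level exception handling in the style of section 5.3. The save-area layout and the names (`exc_state`, `exo_set_exc_handler`, `resume_at`) are invented; the point is that the exception state lives in user-accessible memory, so the handler can resume without a system call.

```c
#include <stdint.h>

struct exc_state {
    uintptr_t epc;        /* faulting program counter */
    uintptr_t badvaddr;   /* faulting address, for memory exceptions */
    uintptr_t scratch[3]; /* the few registers the kernel saved for us */
};

static struct exc_state exc_area;   /* lives in *app* memory, by design */

/* Hypothetical call: register the save area and the exception entry point. */
extern void exo_set_exc_handler(struct exc_state *area, void (*entry)(void));

/* A few instructions of assembly that reload the scratch registers from
 * the save area and jump to epc; possible precisely because nothing
 * needed for resumption is hidden inside the kernel. */
extern void resume_at(const struct exc_state *state);

static void handle_fault(uintptr_t addr)
{
    (void)addr;   /* app-specific policy: map a page, grow a heap, ... */
}

static void exc_entry(void)
{
    handle_fault(exc_area.badvaddr);  /* application policy runs here    */
    resume_at(&exc_area);             /* return to the faulting PC with  */
                                      /* no kernel crossing on the way   */
}
```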
# 5.4 Address Translation
- Provides guaranteed mappings: their TLB misses are handled by Aegis itself.
- An app's address space is divided into two segments:
  + the first for normal application data and code
  + the second for libOS code and data structures
- On a TLB miss:
  + if the address is in the first segment, dispatch to the application; otherwise Aegis checks whether the entry is a guaranteed mapping, and if so Aegis updates the TLB itself (the software TLB is also checked at this point)
  + otherwise, Aegis dispatches the exception to the libOS
  + the libOS looks up its page table; if the mapping is not found --> segfault
  + if found, the libOS installs the TLB entry by calling an Aegis primitive (why? because the app is untrusted)
  + Aegis checks for a valid capability and then updates the TLB
- A software TLB is used as an optimization.
NOTE: this is a MIPS machine, so software fills the TLB. How could this be implemented on x86, where the TLB is filled by hardware?

# 5.5 Protected Control Transfers
- The substrate for implementing IPC abstractions.
- What it does:
  + changes the PC to an agreed-upon value in the callee
  + donates the current time slice to the callee
  + installs the required elements of the callee's processor context
  (seems similar to LRPC)
- 2 types:
  + synchronous: donates the current and all future time slices
  + asynchronous: donates only the remainder of the current time slice

# 5.6 Dynamic Packet Filter (DPF)
- Message demultiplexing: determine which application a message should be delivered to.
- DPF filters are dynamically code-generated and inserted into the kernel.
- They are checked for safety (again, remember: an ASH is an additional handler, e.g., to initiate a reply message).
NOTE: with DPF and ASHes, message demultiplexing/sending is decoupled from the scheduling of applications.

# 5.7 Summary
Exokernel primitives are fast. Why?
- Keeping track of ownership is simple, hence fast.
- The kernel provides only the minimal functionality of multiplexing --> small and lean.
- It does not map its own data structures --> gets rid of indirection.
- A software TLB is used as an optimization.
- DPF uses dynamic code generation --> secure bindings for the network are efficient.

# 6. ExOS: A Library Operating System
Demonstrates that basic system abstractions can be implemented at application level in a direct and efficient manner.

# 6.1 IPC Abstraction
Shows that pipes and lrpc in ExOS are far faster than in Ultrix. Why? Because Ultrix is built around a set of fixed high-level abstractions, new primitives can be added only by emulating them on top of existing ones; specifically, an lrpc implementation must use pipes or signals to transfer control, and the cost of such emulation is high.
* The pipe implementation uses a shared-memory circular buffer. Writes to a full buffer and reads from an empty one cause the current process to yield its time slice to the reader or writer of the buffer, respectively. (See the sketch below.)
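A minimal sketch of such a shared-memory circular-buffer pipe, assuming a single reader and a single writer and a hypothetical `exo_yield_to()` primitive that donates the current time slice to a named process. ExOS's real implementation differs in detail (and real code would need memory barriers around the ring indices).

```c
#include <stddef.h>

#define RING_SIZE 4096   /* power of two, so % is cheap */

struct pipe_ring {
    volatile size_t head;     /* next byte the reader will consume */
    volatile size_t tail;     /* next byte the writer will fill    */
    int reader, writer;       /* peer environment ids              */
    char buf[RING_SIZE];      /* mapped into both address spaces   */
};

extern void exo_yield_to(int env);   /* hypothetical exokernel yield */

static void pipe_write(struct pipe_ring *p, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Buffer full: donate our slice to the reader so it can drain. */
        while (p->tail - p->head == RING_SIZE)
            exo_yield_to(p->reader);
        p->buf[p->tail % RING_SIZE] = src[i];
        p->tail++;
    }
}

static void pipe_read(struct pipe_ring *p, char *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Buffer empty: donate our slice to the writer so it can fill. */
        while (p->tail == p->head)
            exo_yield_to(p->writer);
        dst[i] = p->buf[p->head % RING_SIZE];
        p->head++;
    }
}
```

The whole "pipe" is just user-level code plus one shared mapping; the only kernel involvement is the yield, which explains why it beats Ultrix's in-kernel pipes.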
# 6.2 Application-level Virtual Memory

# 6.3 ASHes
- ASHes are untrusted application-level message handlers that are downloaded into the kernel, made safe by a combination of code inspection [18] and sandboxing, and executed upon message arrival.
- Downloading code thus allows applications to decouple latency-critical operations such as message replies from process scheduling. The issues in other contexts (e.g., disk I/O) are similar.
- See Figure 2: without ASHes, since processes are scheduled in round-robin order, round-trip latency increases linearly with the number of active processes. On Ultrix, the increase in latency was more erratic, ranging from 0.5 to 4.5 milliseconds with 10 active processes.
- The exact rate at which latency increases depends on the algorithm used to schedule processes, but the implication is clear: decoupling actions such as message reception from the scheduling of a process can dramatically improve performance.

# 7. Extensibility with ExOS
Shows that implementing high-level abstractions at the libOS level can improve performance.

# 7.1 Extensible RPC
- Most RPC systems do not trust the server to save and restore registers.
- tlrpc: a trusted LRPC that trusts the server to save and restore the callee-saved registers.
- Hence faster. Why? Because the server can save and restore registers directly, without crossing into the kernel.

# 7.2 Extensible Page-table Structures
- ExOS also supports inverted page tables.
- Applications with a dense address space can use linear page tables, while applications with a sparse address space can use inverted ones.
- Hence, improved performance.

# 7.3 Extensible Schedulers
- Leverage the exokernel's yield primitive (which donates the current time slice to another process).
- The ExOS implementation maintains a list of processes for which it is responsible, along with the proportional share of its time slice(s) each is to receive. On every time-slice wakeup, the scheduler calculates which process should run and yields to it directly. (A sketch follows the questions below.)
Note: I bet there will be a question about exokernels and lottery scheduling.

# Questions:
1) What are the exokernel's pros and cons? TODO: compare it to VMs and to microkernels.
2) Vs. VMs
- A virtual machine virtualizes/EMULATES the whole machine:
  + the base machine can be complicated --> hence expensive and difficult
  + apps are hidden from the *actual* resources --> hence app performance may suffer
  + VMs don't expose allocation/revocation, ...
3) Vs. microkernels
- Both aim to increase extensibility, but the exokernel:
  + pushes the kernel interface much closer to the hardware, hence greater flexibility
  + avoids shared servers, because they limit extensibility (e.g., it is difficult to change the buffer-management policy of a shared file server)
  + allows apps to define their own virtual memory and IPC abstractions
  + supports visible revocation, the abort protocol, ...
In summary, the exokernel gives apps greater control over resources.
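As promised in 7.3, a toy application-level proportional-share scheduler built on the same hypothetical `exo_yield_to()` primitive as the pipe sketch. The stride-style bookkeeping is a generic illustration of my own, just to show that scheduling policy can live entirely inside a libOS; it is not ExOS's actual code.

```c
/* Toy proportional-share scheduler running at application level. */

#define NPROC   3
#define STRIDE1 (1 << 20)            /* large constant for stride math */

struct sched_entry {
    int      env;                    /* environment id to yield to */
    unsigned tickets;                /* its proportional share     */
    unsigned pass, stride;           /* stride-scheduling state    */
};

static struct sched_entry run[NPROC] = {
    { .env = 101, .tickets = 50 },   /* example shares: 50/30/20 */
    { .env = 102, .tickets = 30 },
    { .env = 103, .tickets = 20 },
};

extern void exo_yield_to(int env);   /* hypothetical exokernel yield */

/* Called on every time-slice wakeup for a slice this libOS owns:
 * pick the entry with the smallest pass value and donate the slice. */
void on_timeslice_wakeup(void)
{
    static int init = 0;
    if (!init) {
        for (int i = 0; i < NPROC; i++)
            run[i].stride = STRIDE1 / run[i].tickets;
        init = 1;
    }
    int best = 0;
    for (int i = 1; i < NPROC; i++)
        if (run[i].pass < run[best].pass)
            best = i;
    run[best].pass += run[best].stride;   /* fewer tickets => bigger jump */
    exo_yield_to(run[best].env);
}
```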