* Key points:
- fast communication: fixed-size messages (kernel support), request-response model...
- naming:
  + each object manager implements names for its own set of objects
  + names are self-descriptive: given a name, the client knows who to contact
  + stale detection: on use

Notes about the previous V-kernel paper in the list: http://pages.cs.wisc.edu/~sschang/OS-Qual/distOS/Vkernel.htm

**SUMMARY**
===========
How about a node crash and restart? Security in V Kernel? RPC, NFS, AFS?

*The minimalist philosophy and goal*
------------------------------------
- network-transparent abstraction of address spaces (seems like: process A is not aware that process B runs on a different machine)
- lightweight processes, and IPC
- Goal: high performance (Nucleus: flexible, extensible OSes)
  + fast exchange of significant amounts of data over the network: e.g. file access
  + protocols are the crux, hence the question: how to design protocols that provide high performance

FAST IPC
--------
Achieved by: (similar to RPC)
- simple and basic IPC primitives
- a transport protocol designed to support these primitives
- optimizing for common cases

Example 1: the client actions of sending a request and receiving the response are combined into a single Send primitive
- one kernel operation for the common case of remote procedure call
- reduces rescheduling overhead and simplifies buffering. Why? Because request data can be left in the client's buffer, and response data can be delivered directly into this buffer... similar to Nucleus
- simplifies the transport-level protocol: no need for explicit acks
  + the reply is the ack for the request, and authorizes a new request
  + no explicit connection setup or teardown
  + communication state is updated on request --> opportunity for state caching
  + fixed-size messages --> easy to process

Example 2: two types of messages: short 32-byte messages and data segment messages
- optimize handling of short messages in the kernel, because this is the common case
  + reduces the overhead of preparing a message with kernel support (see Figure 5)
  + 32-byte message (app) --> processor registers --> process descriptor (kernel) --> network output queue

Interesting, faster IPC now has an implication: "with the level of performance of interprocess communication we have achieved the system performance appears more dependent on other factors, such as the effectiveness of local and remote file caching. For example, with only a 15.1 millisecond difference between accessing an 8 kilobyte block locally versus remotely, it is *faster to access a copy of the block at a remote server that has the data in its RAM cache than to read it from a local disk...*"
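A minimal sketch (mine, not the paper's code) of the combined request-response Send: a hypothetical 32-byte fixed-size message, with the reply delivered back into the same client buffer. The `Message` layout, the op codes, and the local stand-in server are assumptions for illustration; the real primitive traps into the kernel and may cross the network.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Fixed-size short message: 32 bytes in total. */
typedef struct {
    uint32_t op;       /* operation code chosen by the client */
    uint32_t status;   /* filled in by the server on reply    */
    uint8_t  data[24]; /* small inline payload                */
} Message;

/* Stand-in for a server process: it receives the request and overwrites
 * the same buffer with the reply (the reply also acts as the ack). */
static void server_handle(Message *msg)
{
    if (msg->op == 1) {                 /* hypothetical "echo" operation */
        msg->status = 0;
        memcpy(msg->data, "pong", 5);
    } else {
        msg->status = 1;                /* unknown operation */
    }
}

/* Stand-in for the kernel Send(): the request and the reply share one
 * buffer, so no kernel-side message queueing or extra copy is modelled. */
static void Send(Message *msg)
{
    server_handle(msg);                 /* client "blocks" until the reply */
}

int main(void)
{
    Message m = { .op = 1 };
    memcpy(m.data, "ping", 5);
    Send(&m);                           /* one call = request + reply */
    printf("status=%u data=%s\n", (unsigned)m.status, (const char *)m.data);
    return 0;
}
```

The point of the sketch is that one call and one buffer cover the whole RPC-style interaction, which is why no kernel-side queueing or extra copies are needed in the common case.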
Why synchronous IPC?
--------------------
Pros
- The synchronous request-response model makes programming easy due to its similarity to a procedure call
- The distinction between small messages (Send, Receive, Reply, ...) and separate data transfer (MoveTo, MoveFrom, ...) is good
- Synchronous communication (stop-and-wait) with small fixed messages makes buffering easy -> leads to a small kernel
- Direct copy between user spaces: no extra copies between user & kernel space

Cons
- The message is even smaller than the minimum Ethernet packet -> leads to padding -> inefficient use of bandwidth
- Stop-and-wait: reduces parallelism
- Separate data transfer commands -> increase the number of operations: Send(msg) -> MoveTo -> Reply
  + Reason why ReceiveWithSegment & ReplyWithSegment were introduced

Process groups and multicast communication (Figure 6)
------------------------------------------

Naming
------
an interesting topic

UIO interface
-------------

Services built based on the Naming protocol and the UIO interface
---------------------------------------------------------

**THE V DISTRIBUTED SYSTEM**
============================

# 0. Takeaway
--------------
- an OS for a cluster of workstations
  + performance of IPC is the key issue
  + protocols and interfaces (not software modules) define the system
  + the distributed kernel provides the basis for the distributed system
- IPC
- naming protocol
- UIO interface
- what else???

# 1. Introduction
-----------------
- V distributed system:
  + designed for a cluster of workstations connected by a high-speed network
  + includes:
    > a small "distributed" kernel
    > service modules
    > libraries
- Why the V kernel? How to design an OS for a cluster of workstations?
  + personal computer approach:
    > fragments the hardware and software base ???
    > wastes hardware resources
    > difficult to manage
  + mainframe approach: (is this the time-sharing approach?)
    > less extensible, less reliable, less cost effective
- V kernel design philosophy
  + high-speed communication is the most critical facility
    > high performance
    > simpler, no need for sophisticated techniques to improve performance
  + the protocol is the crux, not the software
    > challenge: design protocols that are fast and general, and provide the performance, reliability, and security required by the system
  + a small kernel acts like a software backplane, i.e. it implements the basic protocols and provides network transparency, on which upper-level services/modules can run --> we see a lot of this design, don't we?
- Challenges in a distributed system
  + shared state: how to maintain consistency at low cost
  + group support: sharing, communication
- retains the conventional programming model; the underlying implementation is different in order to support transparency
  + e.g. get byte: first go to the local buffer; if not there --> IPC call (see the sketch below)
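A small sketch (mine, assuming a hypothetical buffer size and a stubbed `fill_from_server()` in place of the real IPC read) of that "conventional interface, different implementation" idea: a getc-style call is satisfied from a local buffer and only falls back to a remote read when the buffer is empty.

```c
#include <stdio.h>

#define BUF_SIZE 8   /* deliberately small, to force several refills */

static char buf[BUF_SIZE];
static int  buf_pos = 0, buf_len = 0;

/* Stand-in for the remote read: in V this would be an IPC request to a
 * file server implementing the UIO interface; here we copy from a string. */
static int fill_from_server(char *dst, int max)
{
    static const char *remote = "hello from the (pretend) file server\n";
    static int off = 0;
    int n = 0;
    while (n < max && remote[off] != '\0')
        dst[n++] = remote[off++];
    return n;                            /* 0 means end of file */
}

/* getc()-style call: the common case touches only the local buffer. */
static int my_getc(void)
{
    if (buf_pos == buf_len) {            /* local buffer exhausted */
        buf_len = fill_from_server(buf, BUF_SIZE);
        buf_pos = 0;
        if (buf_len == 0)
            return EOF;
    }
    return (unsigned char)buf[buf_pos++];
}

int main(void)
{
    int c;
    while ((c = my_getc()) != EOF)
        putchar(c);
    return 0;
}
```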
# 2. THE V KERNEL AS A SOFTWARE BACKPLANE
-----------------------------------------
- the V kernel provides basic network-transparent abstractions:
  + address spaces
  + lightweight processes (why lightweight?)
  + interprocess communication
==> this is not new (like NUCLEUS), but the crux is: *how to make it fast?*

# 2.1 Interprocess communication
- make it fast by:
  + simple and basic primitives --> easy to implement efficiently
  + a transport protocol supporting those primitives
  + making the common case fast (e.g. fixed-size messages)
  + efficiently structuring the kernel
- simple primitives
  + sending the request and receiving the response are combined in the Send primitive
    ==> reduces scheduling overhead and simplifies the buffering
  + fixed-size messages, i.e. 32 bytes
    ==> easy to handle
    ==> works if we have a lot of short messages
- optimized transport protocol:
  + no explicit connection setup or teardown
  + the response is the ack for the request
  + communication state is updated on request --> opportunity for state caching
  + fixed-size messages --> easy to process
- structuring the kernel to minimize communication cost
  + the process descriptor contains a VMTP template header, initialized when the process is created --> speeds up preparation for Send, no need to allocate...
  + a fixed-size message (32 bytes) goes from the app --> registers --> process descriptor --> queued for network transmission
    ==> no need to queue messages in the kernel
  + local IPC is much faster because there is no network transmission

# 2.2 Process Groups and Multicast Communication
- process group: a group of processes identified by a group identifier
- a group can have any number of members, on any number of hosts
- a process can belong to multiple groups
- group operations:
  + send to a group of processes
  + receive multiple responses
  + well-known GIDs can be used
- example usage:
  + map character-string names in the naming protocol
  + send clock synchronization information
  + distribute load information as part of distributed scheduling
  + atomic transactions
- NOTE: more multicast means more complexity and less performance ==> still used because it is useful in some cases
- a message can have a *qualifier* to restrict the recipients

# 2.3 Kernel Servers
- kernel modules: time, process, memory, communication, device management
  + replicated on each host
  + accessed using the standard IPC interface
- benefits:
  + operations on local objects are fast (make the common case fast)
  + simpler implementation since each instance only manages local objects
  + clients access kernel servers the same way as other servers, because of the common IPC
    ~ allows run-time libraries supporting higher-level protocols
    ~ minimizes the kernel mechanism for accessing remote kernel servers
  + avoids additional kernel traps
  + separation of IPC from kernel services
    ~ IPC performance is independent, hence can be tuned independently
  + extensible: more kernel servers can be added easily
- there should be some mechanism to make a module known to the whole system:
  + on module creation, register with the IPC system

# 2.3.1 Time
- get, set time
- delay, wake up
- synchronization of time is implemented outside the kernel
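A toy sketch (mine; the `Request` layout, the server ids, and `SendTo()` are assumptions) of the "kernel servers behind the ordinary IPC interface" point: the time server is just another receiver reached by a server id, so the client-side code looks the same whether the server happens to be the local kernel module or a remote process.

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

typedef struct {
    uint32_t op;      /* 1 = get time (hypothetical op code) */
    uint32_t status;  /* 0 = ok                              */
    int64_t  value;   /* reply payload                       */
} Request;

typedef void (*Server)(Request *);

/* Stand-in for the kernel's time module, written as an ordinary server. */
static void time_server(Request *r)
{
    if (r->op == 1) { r->value = (int64_t)time(NULL); r->status = 0; }
    else            { r->status = 1; }
}

/* One handler table per host; the index plays the role of a server id. */
static Server servers[] = { time_server };

/* The client sends to a server id; the local case is a direct call,
 * the remote case would put the same message on the network instead. */
static void SendTo(unsigned server_id, Request *r)
{
    servers[server_id](r);
}

int main(void)
{
    Request r = { .op = 1 };
    SendTo(0, &r);                       /* 0 = hypothetical time-server id */
    printf("status=%u unix_time=%lld\n", (unsigned)r.status, (long long)r.value);
    return 0;
}
```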
# 2.3.2 Process management
- create, destroy, query, modify, and *migrate* processes
- how to make it fast:
  1. separate process initiation from address space creation/initialization
     ==> creating a process = allocating and initializing a process descriptor (memory management takes care of address space creation)
  2. simplify process termination
     + the kernel maintains few resources
     + a process-level server takes care of most of the work (checking for unused resources)
     + the kernel does not inform servers when a process terminates (servers need to figure it out on their own: checking, garbage collection)
     ==> move the complexity/work to the upper-level server
  3. simplify scheduling:
     + only priority-based scheduling is implemented at the kernel level
     + higher-level policy is implemented outside the kernel
     + each scheduler is in charge of its local ready queue
     + process migration is used to ensure load balancing (i.e. a higher-priority process is guaranteed to execute first)
  4. simplify exception handling:
     + exceptions are not handled in the kernel but in an exception server
- how to migrate:
  + query the process info on one processor
  + start a new process with the same info on another
  + freeze: simply reduce the process's priority
- the V kernel supports threads (lightweight processes) --> but it seems like they have trouble with mapping user threads to kernel threads ==> reduces parallelism (READ MORE about this part???)

# 2.3.3 Memory Management
- protection is supported by VM
- regions: like a memory-mapped region
- on a page fault: send a read request for the data block to the server...
- problem of consistency: a block may be stored in multiple page-frame caches
  + solution: ownership protocol + locking manager
- demand paging:
  + program execution:
    > create an address space descriptor
    > bind the program file to this address space
    > pages on disk are brought into memory on demand
    ==> no special program loading mechanism
- supports file-like read/write using the UIO interface ==> used by the debugger
- sped up using file caching and a process-level cache directory (which caches open files)
- a single cache for both file and program pages ==> eliminates the overhead of copying between caches, duplicating data, ...
- mapped I/O and read/write directly use the kernel access path, hence fast
NOTE: the file system (disk allocation, directory management, access control, ...) is implemented at user level

# 2.3.4 Device Management
- the device server module interfaces between clients and driver modules
- it implements the UIO interface, hence clients can use this to access devices
- the kernel controls some privileged operations (like DMA)
  + to protect itself from malicious processes

# 2.3.5 Kernel Design Mistakes
- "logical host" subfield in the process identifier
  + simplifies allocation of PIDs and mapping of a PID to the right host
  + but restricts process migration: all processes in a logical host need to be migrated together (??? WHY)
  + complexity in handling multiple logical hosts per physical host
- GetPID using multicast:
  + mostly used to locate the name server
- "local groups": the set of processes in the same group that are local to one host
  + but with migration, this notion is not true any more
  + hence "local groups" had to be removed, which simplified the code
- a couple of other mistakes, which I CANNOT UNDERSTAND???
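A tiny illustration (entirely mine; the 16/16 bit split is an assumption, not the paper's actual layout) of why the "logical host" subfield restricts migration: messages are routed by the subfield alone, so every process sharing it has to move as a unit.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical split: high 16 bits name the logical host, low 16 bits
 * the process within it.  The real V pid layout may differ. */
#define LOGICAL_HOST(pid) ((unsigned)((uint32_t)(pid) >> 16))
#define LOCAL_INDEX(pid)  ((unsigned)((uint32_t)(pid) & 0xffffu))

int main(void)
{
    uint32_t a = (7u << 16) | 1;   /* two processes on logical host 7 */
    uint32_t b = (7u << 16) | 2;

    printf("a: host=%u index=%u\n", LOGICAL_HOST(a), LOCAL_INDEX(a));
    printf("b: host=%u index=%u\n", LOGICAL_HOST(b), LOCAL_INDEX(b));

    /* Message routing only looks at LOGICAL_HOST(pid), so moving `a`
     * alone would break delivery for `b`: the whole logical host has
     * to migrate as a unit. */
    return 0;
}
```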
# 3. IO
-------
- just a higher-level protocol used with the IPC facilities
- I/O is implemented in application-level modules (not in the kernel)
- uses a unified interface (for extensibility, because of the distributed system)
  + UIO is used as the system-level I/O interface
  + a UIO object is created, and clients read/write this object (like a file)
  + the UIO interface specifies the semantics of the operations and the mapping of parameters to IPC messages
  + high-level calls like getc/putc are implemented on top of UIO using a run-time library
- the UIO interface is different from conventional ones
  + block-oriented instead of byte streams
  + "stateful":
    ~ a UIO object must be created prior to operation,
    ~ must be reclaimed at the end,
    ~ must be recovered on failure
    ==> hence, supports atomic transactions
  + 3 kinds of functionality:
    > compulsory: read-only, write-only streams
    > optional: extensible functions can be added
    > exceptional: invoke specialized I/O operations, such as device-specific ones
- separation of the system-level interface from the application interface, hence fast:
  + adding a run-time (app-level) library reduces the frequency of remote calls
  + migrating processing load from the server (kernel) to the client by adding a run-time library improves overall performance
  + memory is cheaper, hence offloading functionality from servers to application libraries makes sense

# 4. NAMING (may be important...)
-----------
- need to provide an efficient naming structure in a distributed system
- 3-level model:
  + character-string names: user level
  + object IDs: OS level (processes, address spaces, communication ports, fds, ...)
  + entity IDs

# 4.1 Character-string names
- decentralized approach
- each object manager implements names for its own set of objects
- a name is generally mapped as part of an operation on the object
  + the file server implements its own directory system
  + no need to communicate with a name server
  + given the name, the client knows which object manager to contact
    Hence, the name should be self-descriptive (e.g. a V kernel process ID contains the host id, hence the client knows which host to send to...)
- advantages:
  + consistency between objects and the directory entries for those objects is simplified, because both are implemented by the same server
  + the object directory is replicated to the same degree as the objects being named, because the directory is replicated when an object manager is replicated
    ==> even if a name server dies, the client still has access to the object manager (why? because the object manager itself understands the name)
  + extensible, can incorporate "foreign" services

So at a high level, each object manager implements names for its own set of objects, and there should be some form of name server associated... which they call a directory... in the V kernel, there are a lot of individual object-manager (naming) directories ==> need a mechanism for a *system-wide name space and directory system*

The V naming protocol
----------------------
- each object manager mounts its object directory into the global name space
  + picks a unique global name prefix for the object directory
  + adds itself (the manager) to the name-handling (process) group
- how to find the appropriate object manager for a given character-string name:
  + the client *multicasts* using the QueryName operation
  + name lookup fails if either the network or the object manager in charge fails
- the name binding (prefix --> object manager) is cached at the client
- how to detect stale data: on use
  + implication: different clients may have different mappings for a prefix
- what happens if a multicast request is never answered?
==> combine the decentralized approach with a *resilient global naming system*
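A client-side sketch (mine; the manager table, the `Binding` struct, and `query_name_multicast()` stand in for the real QueryName multicast) of the prefix cache described above: resolve a name from the cache when possible, multicast only on a miss, and cache the answer.

```c
#include <stdio.h>
#include <string.h>

typedef struct { const char *prefix; int manager_id; } Binding;

/* Stand-in for the object managers that would answer the multicast. */
static const Binding managers[] = {
    { "/fileserver1/", 10 },
    { "/printserver/", 20 },
};

static Binding cache[8];          /* per-client prefix cache */
static int     cache_len = 0;

/* Miss path: "multicast" a QueryName and cache whoever answers. */
static int query_name_multicast(const char *name)
{
    for (unsigned i = 0; i < sizeof managers / sizeof managers[0]; i++)
        if (strncmp(name, managers[i].prefix, strlen(managers[i].prefix)) == 0) {
            cache[cache_len++] = managers[i];
            return managers[i].manager_id;
        }
    return -1;                    /* no manager replied */
}

/* Common case: the binding is already cached, no multicast at all. */
static int resolve(const char *name)
{
    for (int i = 0; i < cache_len; i++)
        if (strncmp(name, cache[i].prefix, strlen(cache[i].prefix)) == 0)
            return cache[i].manager_id;
    return query_name_multicast(name);
}

int main(void)
{
    printf("-> manager %d\n", resolve("/fileserver1/etc/passwd")); /* multicast */
    printf("-> manager %d\n", resolve("/fileserver1/tmp/x"));      /* cache hit */
    return 0;
}
```

Stale bindings are not modelled here; in the protocol described above, a manager that no longer owns the prefix would reject the request on use, and the client would drop the cache entry and multicast again.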
Vs. a shared-memory multiprocessor machine:
- analogous in the sense that:
  + there is primary-site storage for bindings
  + plus a cache at the client
- different in the way consistency is handled
  + SMP: by hardware
  + in the V kernel: by software, detecting stale data (i.e. cache entries) on use

# 4.2 Object ID
- a character-string name is mapped to an object
- the object ID is used in subsequent operations,
  + to avoid the overhead of lookup and character-string handling each time
- an object ID contains:
  + manager-id: the IPC identifier specifying the object's manager
    ~ hence efficient: the client can use this directly
  + local-object-id: specifies the object relative to this object manager
- used for: open files, address spaces, contexts or directories
- an object ID's lifetime does not exceed the lifetime of the server's entity ID (that is, the object manager's ID; when an object manager crashes and restarts, it is assigned a new entity ID)
  + so the unique ID is the mechanism to detect a crash and reboot (see the sketch at the end of these notes)
Tricky example:
- a user presents a file identifier some time after the file has been deleted
- the file server would have to avoid reusing this identifier to avoid the confusion
- for a group of object manager instances, use a group ID + an individual ID to find a particular instance
  + avoids multicast
  + can bind to another instance if:
    ~ the object manager migrated
    ~ the object manager crashed

# 4.3 Entity, Process, and Group ID
- entity ID
  + fixed length
  + used to identify:
    > processes
    > groups of processes
    > transport-level endpoints
  + host-address independent: a migrated process does not have to change its entity ID
    > requires a mapping from entity ID to host address ==> this is done using a cache and multicast mechanism
    > allocation problem: requires uniqueness, hence kernels must cooperate
    > IDs cannot be reused too quickly, to avoid confusion
This is the problem of naming in a distributed system

# 5. V SERVICES
---------------
Mostly implemented based on the naming protocol and the UIO interface (does this imply that the naming protocol and UIO interface are the important ones?)
- pipe
- internet server (TCP/IP implementation)
- file server
- printer server
- team server
If there is a qual question: build an XYZ server on top of the V kernel?
- remember the naming protocol
- and the UIO interface

# 6. THREE CLASSES OF V APPLICATIONS
-----------------------------------
- attempt to handle all three classes of applications:
  + time sharing
  + batch processing
  + real-time control
The CRUX: HOW???
- time sharing:
  + a multi-user cluster can be shared among users
  + a user sitting at a workstation can run a program on the local machine or run it transparently on other nodes
  + distribute load by running each program on the least loaded node ==> load distribution
- what in the V kernel supports time sharing:
  + fast, network-transparent IPC
  + address space protection
  + guest programs are executed with lower priority ==> minimizes interference
  + migration of processes ==> load balancing

# 6.1 Distributed Parallel Machine
- multi-satellite star model
  + may suffer from communication overheads
- distributed real-time control
  + problem: shared state
  + solution: each node has a timer and periodically exchanges info

# 6.2 Distributed real-time control
- the V kernel has
  + datagram messages
  + prioritized message delivery
  + priority-based scheduling
  + an accurate time service
CRUX: isn't a one-size-fits-all OS not good?
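Coming back to sections 4.2/4.3: a minimal sketch (mine; the field widths and the `validate()` helper are assumptions) of how embedding the issuing manager's entity ID inside an object ID lets clients detect that the manager crashed and restarted, since the restarted manager gets a fresh entity ID and old object IDs stop validating.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t manager_entity;  /* entity id of the manager that issued it */
    uint32_t local_object;    /* object id relative to that manager      */
} ObjectId;

static uint32_t current_entity = 1001;      /* the manager's entity id "now" */

/* A stale id (issued before a crash/restart) carries an old entity id,
 * so it is rejected instead of silently naming a reused local id. */
static bool validate(ObjectId id)
{
    return id.manager_entity == current_entity;
}

int main(void)
{
    ObjectId file = { current_entity, 42 }; /* e.g. an open-file id */
    printf("before restart: valid=%d\n", validate(file));

    current_entity = 1002;                  /* the manager crashes and restarts */
    printf("after restart:  valid=%d\n", validate(file));
    return 0;
}
```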