**LIGHTWEIGHT REMOTE PROCEDURE CALL**
=====================================

SUMMARY
-------
- The whole idea: MAKE THE COMMON CASE FAST, i.e. RPC on the local machine
- Context: small-kernel operating systems (i.e. microkernels)
- LRPC: fast + safety and transparency
- What is the COMMON CASE:
  1. Local (same-machine, cross-domain) call
     + simple control transfer: use the same thread
     + simple data transfer: use a shared stack to pass arguments and return results
     + simple stubs: optimized for local transfer
     + designed for concurrency: avoid shared data structures, use domain caching to exploit multiprocessors
  2. Arguments are small, fixed-size, and often known at compile time
     + client and server share an A-stack
     + (the RPC paper's "simple call" and the V kernel's 32-byte messages use the same idea of optimizing for the common case)
- How to make it fast:
  1. Do everything possible in advance
     e.g. allocating A-stacks: argument and return-value sizes are known beforehand, so the A-stack size is known
     e.g. setting up dispatch (no dynamic dispatch in the server)
  2. Remove unnecessary copies
     - Arguments and results are placed on the A-stack, which both client and server can access
     - What about by-reference objects?
       + they are copied onto the A-stack by the client stub
       + PRINCIPLE: the client stub does the work, the kernel verifies
- Note: LRPC still handles the uncommon case (a real remote RPC)
  + detected at bind time, by a bit in the *Binding Object*
- Some issues:
  1. Copy safety:
     + since client and server share the A-stack, the client can overwrite the A-stack while the server is accessing it
     + SOLUTION:
       ~ the server copies/verifies data only when needed
       ~ not needed for opaque parameters (e.g. buffers)
       ~ more efficient to have the stubs do the copying than the kernel
         > the server can integrate validity checks with copying
       ~ adds at most one extra copy (on top of the initial one)
         > from the A-stack to the server's stack
  2. Reliability:
     + what if the server crashes?
       ~ the client is in charge of recovery
- Some limitations:
  1. Relies on a separate argument stack pointer to avoid copying to / changing protection on the execution stack
  2. Assumes no per-thread application state (don't get it, Mike)

BINDINGS
--------
1. Server exports an interface to the LRPC clerk
2. Client binds to the interface by calling into the kernel
3. Kernel calls into the server's LRPC clerk
4. LRPC clerk returns a PD list to the kernel
   NOTE: each PD contains the entry address in the server domain of the corresponding procedure
   IMPLICATION: the kernel can call directly into the server, no dynamic dispatch
5. For each PD, the kernel:
   + allocates A-stacks shared by client and server (for arguments and return results)
   + allocates a *linkage record* for each A-stack
     > containing the caller's return address
   NOTE:
   1. Make things fast by doing everything in advance (at bind time)
   2. A-stacks avoid unnecessary copies (but also raise some issues)
6. Kernel returns to the client a *Binding Object* and an A-stack list for each procedure
   NOTE: the Binding Object is like a capability
   + the client must present it to the kernel to make a call
   + unforgeable

CALLING
-------
i.   Client stub grabs an A-stack off the queue
ii.  Pushes arguments onto the A-stack
iii. Passes the A-stack, Binding Object, and procedure identifier in registers to the kernel
iv.  Kernel (see the sketch after the notes below):
     1. Verifies the Binding Object and procedure identifier
     2. Locates the procedure descriptor
     3. Verifies the A-stack and locates its linkage record
     4. Verifies ownership of the A-stack
     5. Records the caller's return address in the linkage record
     6. Pushes the linkage record onto the thread (so calls can nest)
     7. Finds an execution stack (E-stack) in the server's domain (from a pool)
     8. Updates the thread to point at the E-stack
     9. Changes the processor's address space
     10. Calls into the server stub at the address in the PD
NOTE:
1. This works only for calling conventions that use a separate argument pointer
   + otherwise, arguments would have to be copied to the E-stack (client --> kernel --> E-stack)
2. For reference parameters:
   PRINCIPLE: the client stub copies by-reference objects onto the A-stack
   The server stub:
   + creates a reference to the data
   + places that reference on the E-stack
   WHY? don't trust the client; what if the client passes a bad address?
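A minimal C sketch of the kernel-side call path above, assuming invented names (`lrpc_call`, `lookup_linkage`, `grab_estack`, ...) that stand in for kernel primitives; it is illustrative only, not the paper's actual code.

```c
/* Hypothetical sketch of the kernel-side LRPC call path (not the paper's code).
 * The extern functions stand in for kernel primitives assumed to exist. */
#include <stddef.h>

struct domain;                                            /* opaque domain handle */

struct pd      { void (*entry)(void *astack); size_t astack_size; };
struct linkage { void *return_addr; void *client_sp; int in_use; };
struct binding { int valid; int npd; struct pd *pds; struct domain *server; };

extern struct linkage *lookup_linkage(struct binding *b, void *astack);
extern void *caller_return_address(void);
extern void *caller_stack_pointer(void);
extern void  push_linkage(struct linkage *lk);            /* onto the per-thread stack */
extern void *grab_estack(struct domain *server);          /* E-stack from server pool  */
extern void  set_stack_pointer(void *sp);
extern void  switch_address_space(struct domain *d);      /* change VM registers       */

int lrpc_call(struct binding *b, int proc_id, void *astack)
{
    if (!b || !b->valid || proc_id < 0 || proc_id >= b->npd)
        return -1;                                    /* 1. verify binding + procedure id   */
    struct pd *pd = &b->pds[proc_id];                 /* 2. locate the procedure descriptor */
    struct linkage *lk = lookup_linkage(b, astack);   /* 3. verify A-stack, find linkage    */
    if (!lk || lk->in_use)
        return -1;                                    /* 4. ownership check                 */
    lk->return_addr = caller_return_address();        /* 5. record caller's return address  */
    lk->client_sp   = caller_stack_pointer();
    lk->in_use      = 1;
    push_linkage(lk);                                 /* 6. push linkage so calls can nest  */
    void *estack = grab_estack(b->server);            /* 7. find an E-stack                 */
    set_stack_pointer(estack);                        /* 8. thread now runs on the E-stack  */
    switch_address_space(b->server);                  /* 9. change the address space        */
    pd->entry(astack);                                /* 10. upcall to the address in the PD */
    return 0;
}
```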
RETURN
------
- the server stores the result on the A-stack
- no extra checking/verification by the kernel is needed, because the caller's return address was stored in the linkage record at call time
- hence, fast and simple

LRPC ON MULTIPROCESSORS
-----------------------
PROBLEM: the context switch on an LRPC (client domain to server domain) is expensive
- code to update the hardware's virtual memory registers (page tables, for example)
- TLB misses
SOLUTION:
- cache domain contexts on idle processors
- on a call to the server, search for an idle processor already in the context of the server domain, then switch the calling thread to that processor
- on return, do the same thing, but search for a processor in the context of the client domain
GAIN: low latency and high call throughput

COPY SAFETY
-----------
PROBLEM: client and server share the A-stack, so it is possible for the client to corrupt the A-stack while the server executes.
SOLUTION:
1. The server copies/verifies data only when needed
2. Not needed for opaque parameters (e.g. buffers)
   e.g. write buffers: the server doesn't care about the content
3. The server can integrate validity checks with copying
   e.g. for a type-safe language
4. Adds at most one extra copy (on top of the initial one):
   the server can simply copy the argument to the E-stack

THE UNCOMMON CASE
-----------------
1. Real remote calls
   + identified at bind time, by a bit in the Binding Object (see the stub sketch below)
2. A-stacks
   + arguments too large to fit in an A-stack --> use a memory segment
   + too few A-stacks:
     ~ the caller can wait
     ~ or more can be allocated, but validation during calls then takes more time
3. Reliability: what if the client/server crashes (or terminates)?
   + PRINCIPLE: clean up bindings and shared structures
     ~ revoke Binding Objects: hence no more in-calls and out-calls
     ~ invalidate any active linkage records
   + if a server terminates, its active calls must return to the client domain
     ~ scan the domain's threads, looking for active threads that were running on behalf of an LRPC call
     ==> restart each such thread in the client with a *call-failed* exception
   + if a client terminates with an outstanding LRPC call, that call, when finished, must not be allowed to return to its originating domain
     ~ when a (server) thread returns from an LRPC call, it follows the linkage record back to the client
     ~ if it finds an invalidated linkage record, it knows the client has terminated
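A small hypothetical sketch of how a client stub might branch between the common local case and the uncommon cross-machine case; the `remote` bit cached in the binding and the helpers (`grab_astack`, `lrpc_trap`, `full_rpc`) are invented names for illustration.

```c
/* Hypothetical client stub for a procedure with two small, fixed-size arguments. */
#include <stddef.h>
#include <stdint.h>

struct binding { int remote; /* cached at bind time: is the server off-machine? */ };

extern struct binding *the_binding;              /* obtained at bind time                */
extern int32_t *grab_astack(int proc_id);        /* pop an A-stack (LIFO queue in stub)  */
extern int lrpc_trap(struct binding *b, int proc_id, void *astack);
extern int full_rpc(struct binding *b, int proc_id, const void *args, size_t len);

int stub_set_priority(int which, int prio)
{
    if (the_binding->remote) {                   /* uncommon case: real cross-machine RPC */
        int32_t args[2] = { which, prio };
        return full_rpc(the_binding, 0, args, sizeof args);
    }
    /* common case: push the small, fixed-size arguments onto a shared
     * A-stack and trap to the kernel */
    int32_t *astack = grab_astack(0);
    astack[0] = which;
    astack[1] = prio;
    return lrpc_trap(the_binding, 0, astack);
}
```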
--------------------------------------------------------------------------------

*Motivation*
------------
- Monolithic kernels are hard to debug, modify, and validate.
- One alternative is fine-grained protection with capabilities. This approach:
  + relies on protected procedure calls (protected means: trap to the kernel, where the call may be verified)
  + is hard to implement efficiently (think of Hydra, which has extra functionality but is not optimized for the common case, hence is complex and slow)
- The other alternative: small kernels
  + borrow the idea of large-grained protection domains (based on machine boundaries)
  + rely on user-space servers for most functionality
  + use message passing (RPC) for inter-domain communication, hence
    > modular structure
    > easier system design, implementation, and maintenance
    > reliability, fault isolation, ...
  + use a distributed programming model.
    RPC is used for communication across domains
    ==> the PROBLEM: most calls are between domains on the same machine, and their parameters are small and fixed-size;
    using full RPC for them causes a lot of overhead

*WHY USE RPC FOR STRUCTURING SYSTEMS*
-------------------------------------
1. Easy to use compared to the alternatives - the compiler handles most of the details
2. Easy to build a protected subsystem
3. Allows moving components out of the kernel, if it is fast enough
4. More reliable
5. Easier to extend
6. Faster RPC makes it possible to structure systems differently; this brings up the issue of evaluating new capabilities

*Problem with RPC*
------------------
High overhead (we can do a lot better if the inter-domain call is on the same machine)
- stub overhead:
  ~ curse of generality: the same stubs are used for both cross-machine and cross-domain calls
  ~ hence, expensive
- message buffer overhead: RPC can require 4 copy operations
  (2 on call, 2 on return: domain 1 <-> kernel <-> domain 2)
  (LRPC uses a shared A-stack, which reduces the number of copies; see the sketch after this list)
- access validation: the kernel validates the message sender on call and again on return
  (LRPC: the kernel validates once, on call only; also, the server validates arguments, true?)
- message transfer: the sender must enqueue the message, which must later be dequeued by the receiver
  (LRPC is different? How?)
- scheduling: from the programmer's point of view there is one abstract thread, but the underlying control transfer mechanism involves *concrete threads*, fixed in their own domains, signaling one another.
  The abstraction comes at a cost: slow performance (the scheduler must block the client thread and select one of the server's threads for execution)
  (LRPC uses one thread for the control transfer, i.e. the client's thread executes the requested procedure in the server's domain ==> direct context switch)
- context switch: there must be a virtual memory context switch from the client's domain to the server's on call, and then back again on return
  (LRPC: the switch still happens unless the domain is cached on an idle processor; see the multiprocessor notes)
- dispatch: a receiver thread must interpret the message and dispatch a thread to execute it
  (LRPC: the kernel calls directly to an address in the server, hence no dynamic dispatch at the server)
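To make the "four copies vs. one" point concrete, here is a toy C sketch; the buffers and function names are invented for illustration and do not correspond to any real RPC implementation.

```c
/* Toy illustration of copy counts on the call path (all names are hypothetical). */
#include <string.h>

#define ARG_SIZE 32

/* Message-based RPC: four copies of the argument data on the call path alone. */
void rpc_call_path(const char client_args[ARG_SIZE])
{
    char client_msg[ARG_SIZE], kernel_msg[ARG_SIZE];
    char server_msg[ARG_SIZE], server_stack[ARG_SIZE];

    memcpy(client_msg,   client_args, ARG_SIZE);  /* 1. client stub stack -> RPC message   */
    memcpy(kernel_msg,   client_msg,  ARG_SIZE);  /* 2. client domain     -> kernel        */
    memcpy(server_msg,   kernel_msg,  ARG_SIZE);  /* 3. kernel            -> server domain */
    memcpy(server_stack, server_msg,  ARG_SIZE);  /* 4. server message    -> server stack  */
    (void)server_stack;
}

/* LRPC: the client stub writes the arguments once, into the shared A-stack,
 * which the server reads directly (plus at most one optional copy to the
 * E-stack if the server must protect itself against concurrent changes). */
void lrpc_call_path(const char client_args[ARG_SIZE], char *shared_astack)
{
    memcpy(shared_astack, client_args, ARG_SIZE);  /* the only required copy */
}
```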
Some RPC implementations mitigate some of the problems above by:
- allocating messages out of a region specially mapped into both kernel and user domains (thus getting rid of the copy to an intermediate kernel buffer)
- handoff scheduling, i.e. direct context switch
- passing a few small arguments in registers --> eliminating buffer copying
- trading security for performance (as in SRC RPC):
  + using globally shared memory for messages, removing access validation
Overhead still remains in:
- heavyweight stubs and run-time support
- dynamic buffer management
- multilevel dispatch
- interaction with global scheduling state

*LRPC high level*
-----------------
- Optimize for the common case (i.e. the local, cross-domain call)
  + simple control transfer: use the same thread
  + simple data transfer: use a shared stack to store arguments and return results
  + simple stubs: optimized for local transfer
  + design for concurrency: avoid shared data structures, use domain caching to make use of multiprocessors
- LRPC is a combination of *protected* and *remote* procedure call
  + in terms of execution model: protected procedure call
    > the call to the server is made through a kernel trap
    > the kernel validates the call, creates a call linkage, and dispatches the concrete thread to the server
    > the procedure returns through the kernel back to the point of the client's call
    > NOTE: the linkage contains the return address in the client
  + in terms of semantics: remote procedure call
    > servers export interfaces
    > clients bind to interfaces
    > clients and servers reside in large-grained protection domains

*LRPC details: Binding*
-----------------------
Three data structures (sketched in code after this section):
1) Procedure Descriptor (PD)
2) Linkage record
3) Binding Object
Conceptually like traditional RPC binding, but the details differ because of the high degree of cooperation among client, kernel, and server.
- The server exports its interface to a clerk in the RPC runtime. The clerk waits for imports from clients.
- The client imports an interface by calling the kernel; the kernel notifies the clerk. The clerk replies to the kernel (which passes the result on to the client) with a *Procedure Descriptor* (PD) for each procedure in the interface.
- A PD includes:
  + the entry address in the server domain (in real RPC this is an index into a table that stores the address - why? security reasons...)
  + the number of simultaneous calls allowed (n)
  + the size of the A-stack (in which arguments and return values are stored)
- The kernel allocates A-stacks (as many as the number of simultaneous calls) and maps them shared by both domains
  (Note: procedures in the same interface whose A-stacks are of similar size can share A-stacks, reducing the storage needed)
- The kernel creates a *linkage record*, storing the caller's return address (known only to the kernel), for each A-stack
- The kernel returns a *Binding Object* and the A-stack list to the client.
  + The client must present the Binding Object to the kernel for validation each time it calls.
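A minimal sketch of the binding-time data structures described above; the field names and layout are assumptions for illustration, not the paper's definitions.

```c
/* Hypothetical declarations of the three binding-time structures. */
#include <stddef.h>

struct procedure_descriptor {
    void   (*entry)(void *astack);   /* entry address in the server domain           */
    int      max_simultaneous_calls; /* how many A-stacks to allocate for this proc  */
    size_t   astack_size;            /* room for arguments plus return values        */
};

struct linkage_record {              /* one per A-stack, known only to the kernel    */
    void *caller_return_addr;
    void *caller_stack_pointer;
    int   in_use;
};

struct binding_object {              /* returned to the client; acts as a capability */
    unsigned long token;             /* unforgeable value checked on every call      */
    int           remote;            /* set if the server is actually off-machine    */
    struct procedure_descriptor *pd_list;
    int                          npd;
};
```

At bind time the kernel would allocate `max_simultaneous_calls` A-stacks of `astack_size` bytes per PD, map them into both domains, and pair each A-stack with a linkage record.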
*LRPC details: Calling*
-----------------------
QUESTION: How do they know whether a call is local or remote?
A: at bind time, a bit of information is cached
- The client stub sets up an A-stack (puts the arguments on it, etc.) and traps into the kernel
  (the A-stacks given to the client at bind time are managed as a LIFO queue by the client stub)
- The kernel:
  + verifies the Binding Object and procedure identifier, and locates the PD
  + verifies the A-stack and locates its linkage record
  + ensures the A-stack/linkage record are unused
  + records the caller's return address and stack pointer in the linkage record
  + pushes the linkage record onto a per-thread stack
    (Why? To allow a thread to be involved in more than one cross-domain call)
  + locates an execution stack (E-stack) in the server's domain
  + updates the thread's stack pointer to use the E-stack
  + changes the virtual memory registers (isn't this a VM context switch?)
  + performs an upcall into the server using the address in the PD
- The server:
  + returns by trapping to the kernel
  + no need to verify the returning thread's right to transfer back to the calling domain, because it was granted at call time
  + no verification of data structures (it is implicit in the return)
  + no explicit message passing (since the result is on the A-stack)
Notes:
- Since the A-stack is shared, the server can directly access the arguments without an extra copy operation. In the same way, the client finds the return result on the A-stack.
- This works only in a language/calling convention that uses a separate argument pointer (if arguments were required to be on the E-stack, this optimization would not be possible).
- Call by reference requires a local reference - why? To prevent the caller from passing a bad address??? (needs explanation)
- How is protected transfer obtained? Through the E-stack: since the E-stack is privately mapped, only the kernel and server know it.
  (In contrast, RPC obtains this safety by implication, i.e. separate stacks derive from separate threads.)
- E-stacks are dynamically associated with A-stacks
  + the association is made on the first call with a given A-stack
  + E-stacks are reclaimed when the supply of E-stacks runs low
  + why not static? Static association would be faster, but E-stacks can be large, so they must be managed conservatively

*LRPC details: stub optimization*
---------------------------------
LRPC stubs blur the boundaries of the traditional RPC layers (see the sketch after this section)
- the kernel directly calls server stubs, hence no message examination and dispatch
- since the kernel primes the E-stack with the initial call frame expected by the server's procedure, the server stub can branch to the first instruction of the procedure
==> Hence, a simple LRPC needs only one formal procedure call (into the client stub) and two returns (one out of the server procedure and one out of the client stub)
Why? Because the kernel sets up everything and records the server's address in the PD, so it can call the server directly. Hence, no dynamic dispatch at the server (unlike normal RPC).
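A minimal sketch of what a generated server stub could look like under the stub optimization described above; the procedure `add`, the trap `lrpc_return`, and the A-stack layout are hypothetical.

```c
/* Hypothetical server stub for a procedure   int add(int a, int b);
 * The kernel upcalls this stub directly at the address recorded in the PD,
 * passing the shared A-stack, so no message parsing or dispatch table is needed. */
#include <stdint.h>

extern int  add(int a, int b);            /* the real server procedure        */
extern void lrpc_return(void);            /* trap back to the kernel          */

void add_server_stub(void *astack)
{
    int32_t *args = astack;
    int32_t  a = args[0];                 /* arguments read straight off the  */
    int32_t  b = args[1];                 /* shared A-stack (no unmarshaling) */
    args[0] = add(a, b);                  /* result goes back on the A-stack  */
    lrpc_return();                        /* return through the kernel        */
}
```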
*LRPC details: LRPC on multiprocessors*
---------------------------------------
- minimize the use of shared data structures on the critical domain-transfer path
  + each A-stack queue has its own lock
  + no other locking occurs
  ==> hence, little interference when calls occur simultaneously
- protection domains (processes) are cached on idle processors
  + this reduces overhead by eliminating the cost of updating the virtual memory hardware and flushing the TLB
  + when a call is made, the kernel checks for a processor idling in the context of the server domain; if one is found, the kernel exchanges the processors of the calling and idling threads, placing the calling thread on the processor where the server domain's context is already loaded, so the called server procedure can execute without a context switch (i.e. without changing VM registers).
    The idling thread continues, but in the context of the client's domain, on the client's original processor.
  + a context switch is performed only when the domain is not cached
    --> on a *domain miss* (i.e. an idling processor in that domain is needed but not found), the kernel increments a counter for that domain for scheduling purposes (making idle processors spin in the domains showing the most LRPC activity)
  + this is a generalization of the technique of caching blocked threads (to reduce wakeup latency)
- protection domain caching vs. a *tagged TLB*
  + on multiprocessors, they achieve the same result for commonly called servers
  + on a single processor, a tagged TLB still requires the mapping hardware registers to be modified (domain caching does not)
  + domain caching also preserves per-processor locality across calls

*LRPC details: Argument copying*
--------------------------------
Mike's notes: Copying Safety
- Normal RPC makes copies of arguments
  + many times - up to 4 times
  + QUESTION: What is the benefit? A: Ensures COW semantics; client changes can't corrupt the server
- LRPC uses shared stacks accessible to both processes
  + the client can overwrite the A-stack while the server accesses it
  + Solution:
    1. The server copies/verifies data only when needed
    2. The destination address for return values is private to the client; there is no benefit in having the kernel, rather than the server, write to it
       Q: What would the effect of doing it wrong be? A: returning an incorrect value
    3. Not needed for opaque parameters (e.g. buffers)
    4. More efficient to have the stubs do the copying than the kernel:
       the server can integrate validity checks with copying
    5. Adds at most one extra copy (on top of the initial one)
    6. COMMENT: More like a system call, where the kernel validates parameters
    7. ISSUE: Complicates the server
       QUESTION: What does this mean for safety/security?
Safety:
- a pairwise A-stack acts like a private channel between the client and server
- this provides protection from third-party domains
Argument copying:
- copying is performed in the stubs, not in the kernel. Why do this?
  1) it is faster, of course
  2) for type safety: a client can pass a bad argument and crash the server, so the server needs to check arguments before using them.
     Folding this check into the copy operation can result in less work than if the value is copied by the message system (4 steps) and then checked by the server stub.
- in LRPC, an argument starting on the client stub's stack is copied once, from the client stub's stack to the shared A-stack
- in RPC, there are 4 copy operations:
  1) from the client stub's stack to the RPC message
  2) from the RPC message in the client's domain to the kernel
  3) from the message in the kernel's domain to the message in the server's domain
  4) from the message in the server's domain to the server's stack
  QUESTION: What is the benefit? A: Ensures COW semantics; client changes can't corrupt the server
- The problem with LRPC: *asynchronous changes because of shared memory*,
  i.e. it is possible for a client or server to change the values of arguments in the A-stack once control has transferred across domains
  ==> the client can overwrite the A-stack while the server accesses it
  This does not happen with RPC, thanks to the extra copy operations.
  Solution (see the sketch below):
  - for immutable arguments, make an extra copy:
    1) from the client stub's stack to the A-stack
    2) from the A-stack to the server's stack
  - for mutable arguments, e.g. a write with an array of bytes (where the server does not care about the values in the array), stay with the single copy
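A small hypothetical sketch of the "copy, then validate" pattern for an immutable argument; the A-stack layout, `handle_request`, and all names are illustrative assumptions.

```c
/* Hypothetical server-stub fragment: copy an immutable argument out of the
 * shared A-stack before validating it, so a concurrent change by the client
 * cannot invalidate a check the server has already made. */
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 64

extern int handle_request(int32_t opcode, const char *name);

int request_server_stub(void *astack)
{
    const volatile char *shared = astack;   /* the client may modify this concurrently */

    /* 1. Copy the arguments from the shared A-stack into private locals on
     *    the E-stack - the "at most one extra copy".                         */
    int32_t opcode;
    char    name[NAME_LEN];
    for (size_t i = 0; i < sizeof opcode; i++)
        ((char *)&opcode)[i] = shared[i];
    for (size_t i = 0; i < NAME_LEN; i++)
        name[i] = shared[sizeof opcode + i];

    /* 2. Validate the private copies; later changes to the A-stack by the
     *    client can no longer affect what the server actually checked.       */
    name[NAME_LEN - 1] = '\0';
    if (opcode < 0)
        return -1;

    return handle_request(opcode, name);
}
```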
*Reliability*: from Mike's notes
--------------------------------
- What do you do if a server thread crashes?
- Question: what is the key problem?
  A: the client's thread has been taken over by the server; you can't just time out, because the server is actively using it
- Solution: duplicate the client thread's state into a new thread

*The Uncommon cases*
--------------------
- LRPC still supports cross-machine RPC
  + detected in the first instruction of the client stub
- A-stack size and number:
  + the size can be static (known at compile time), or the size of an Ethernet packet (when the argument size is unknown)
  + when arguments are too large to fit in an A-stack, a large out-of-band memory segment is used
  + A-stacks allocated at bind time are contiguous (for quick validation via a simple range check). If the pre-allocated A-stacks turn out to be too few, the client can wait for one to become available, or more can be allocated; a newly allocated A-stack may not be contiguous with the previous ones, so its validation may take more time.
- Domain termination (need to REREAD this section)
  To maintain procedure call semantics:
  - if the terminating domain is a server handling an LRPC request, the call, completed or not, must return to the client domain
  - if the terminating domain is a client with a currently outstanding LRPC request to another domain, the outstanding call, when finished, must NOT be allowed to return to its originating domain
  ==> hence the solution (see the sketch below):
  1) revoke the Binding Objects: this prevents any more out-calls from the domain and prevents other domains from making any more in-calls
  2) threads running on behalf of the terminating domain are returned to the client domain (raising a call-failed exception)
  3) linkage records of the terminating domain are invalidated
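A rough sketch of the cleanup steps on domain termination, assuming some invented kernel bookkeeping helpers (`domain_bindings`, `all_threads`, etc.); it is illustrative, not the paper's algorithm.

```c
/* Hypothetical domain-termination cleanup: revoke bindings, rescue threads,
 * invalidate linkage records. All names and structures are illustrative. */
#include <stddef.h>

struct binding { int valid; };
struct linkage { int valid; };
struct thread  { struct thread *next; int running_lrpc_for_domain; };

extern struct binding *domain_bindings(int domain, int *n);
extern struct linkage *domain_linkages(int domain, int *n);
extern struct thread  *all_threads(void);
extern void restart_in_client_with_call_failed(struct thread *t);

void domain_terminate(int domain)
{
    int n;

    /* 1. Revoke Binding Objects: no more in-calls or out-calls. */
    struct binding *b = domain_bindings(domain, &n);
    for (int i = 0; i < n; i++)
        b[i].valid = 0;

    /* 2. Threads currently running in this domain on behalf of an LRPC call
     *    are restarted in their client with a call-failed exception.        */
    for (struct thread *t = all_threads(); t != NULL; t = t->next)
        if (t->running_lrpc_for_domain == domain)
            restart_in_client_with_call_failed(t);

    /* 3. Invalidate the domain's linkage records, so an outstanding call
     *    that later tries to return here discovers the client is gone.      */
    struct linkage *lk = domain_linkages(domain, &n);
    for (int i = 0; i < n; i++)
        lk[i].valid = 0;
}
```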