**LIGHTWEIGHT REMOTE PROCEDURE CALL**
=====================================

SUMMARY
-------
- The whole idea: MAKE THE COMMON CASE FAST, i.e. RPC on the local machine
- Context: small-kernel operating systems (i.e. microkernels)
- LRPC: fast + safety and transparency
- What is the COMMON CASE:
  1. Local (same-machine, cross-domain) call
     + simple control transfer: use the same thread
     + simple data transfer: use a shared stack to pass arguments and return results
     + simple stubs: optimized for local transfer
     + designed for concurrency: avoid shared data structures, use domain caching to exploit multiprocessors
  2. Arguments are small, fixed-size, and often known at compile time
     + client and server share an A-stack
     + (the RPC paper's "simple call" and the V kernel's 32-byte messages use the same idea of optimizing for the common case)
- How to make it fast:
  1. Do everything possible in advance
     e.g. allocating A-stacks: argument and return-value sizes are known beforehand, so the A-stack size is known
     e.g. setting up dispatch (no dynamic dispatch in the server)
  2. Remove unnecessary copies
     - Arguments and results are placed on the A-stack, which both client and server can access
     - What about by-reference objects?
       + they are copied onto the A-stack by the client stub
       + PRINCIPLE: the client stub does the work, the kernel verifies
- Note: LRPC still handles the uncommon case (a real remote RPC)
  + detected at bind time, by a bit in the *Binding Object*
- Some issues:
  1. Copy safety:
     + since client and server share the A-stack, the client can overwrite the A-stack while the server is accessing it
     + SOLUTION:
       ~ the server copies/verifies data only when needed
       ~ not needed for opaque parameters (e.g. buffers)
       ~ more efficient to have the stubs do the copying than the kernel
         > the server can integrate validity checks with copying
       ~ adds at most one extra copy (on top of the initial one)
         > from the A-stack to the server's stack
  2. Reliability:
     + what if the server crashes?
       ~ the client is in charge of recovery
- Some limitations:
  1. Relies on a separate argument stack pointer to avoid copying to / changing protection on the execution stack
  2. Assumes no per-thread application state (don't get it, Mike)

BINDINGS
--------
1. Server exports an interface to the LRPC clerk
2. Client binds to the interface by calling into the kernel
3. Kernel calls into the server's LRPC clerk
4. LRPC clerk returns a PD list to the kernel
   NOTE: each PD contains the entry address in the server domain of the corresponding procedure
   IMPLICATION: the kernel can call directly into the server, no dynamic dispatch
5. For each PD, the kernel:
   + allocates A-stacks shared by client and server (for arguments and return results)
   + allocates a *linkage record* for each A-stack
     > containing the caller's return address
   NOTE:
   1. Make things fast by doing everything in advance (at bind time)
   2. A-stacks avoid unnecessary copies (but also raise some issues)
6. Kernel returns to the client a *Binding Object* and an A-stack list for each procedure
   NOTE: the Binding Object is like a capability
   + the client must present it to the kernel to make a call
   + unforgeable

CALLING
-------
i.   Client stub grabs an A-stack off the queue
ii.  Pushes arguments onto the A-stack
iii. Passes the A-stack, Binding Object, and procedure identifier in registers to the kernel
iv.  Kernel (see the sketch after the notes below):
     1. Verifies the Binding Object and procedure identifier
     2. Locates the procedure descriptor
     3. Verifies the A-stack and locates its linkage record
     4. Verifies ownership of the A-stack
     5. Records the caller's return address in the linkage record
     6. Pushes the linkage record onto the thread (so calls can nest)
     7. Finds an execution stack (E-stack) in the server's domain (from a pool)
     8. Updates the thread to point at the E-stack
     9. Changes the processor's address space
     10. Calls into the server stub at the address in the PD
NOTE:
1. This works only for calling conventions that use a separate argument pointer
   + otherwise, arguments would have to be copied to the E-stack (client --> kernel --> E-stack)
2. For reference parameters:
   PRINCIPLE: the client stub copies by-reference objects onto the A-stack
   The server stub:
   + creates a reference to the data
   + places that reference on the E-stack
   WHY? don't trust the client; what if the client passes a bad address?
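A minimal C sketch of the kernel-side call path above, assuming invented names (`lrpc_call`, `lookup_linkage`, `grab_estack`, ...) that stand in for kernel primitives; it is illustrative only, not the paper's actual code.

```c
/* Hypothetical sketch of the kernel-side LRPC call path (not the paper's code).
 * The extern functions stand in for kernel primitives assumed to exist. */
#include <stddef.h>

struct domain;                                            /* opaque domain handle */

struct pd      { void (*entry)(void *astack); size_t astack_size; };
struct linkage { void *return_addr; void *client_sp; int in_use; };
struct binding { int valid; int npd; struct pd *pds; struct domain *server; };

extern struct linkage *lookup_linkage(struct binding *b, void *astack);
extern void *caller_return_address(void);
extern void *caller_stack_pointer(void);
extern void  push_linkage(struct linkage *lk);            /* onto the per-thread stack */
extern void *grab_estack(struct domain *server);          /* E-stack from server pool  */
extern void  set_stack_pointer(void *sp);
extern void  switch_address_space(struct domain *d);      /* change VM registers       */

int lrpc_call(struct binding *b, int proc_id, void *astack)
{
    if (!b || !b->valid || proc_id < 0 || proc_id >= b->npd)
        return -1;                                    /* 1. verify binding + procedure id   */
    struct pd *pd = &b->pds[proc_id];                 /* 2. locate the procedure descriptor */
    struct linkage *lk = lookup_linkage(b, astack);   /* 3. verify A-stack, find linkage    */
    if (!lk || lk->in_use)
        return -1;                                    /* 4. ownership check                 */
    lk->return_addr = caller_return_address();        /* 5. record caller's return address  */
    lk->client_sp   = caller_stack_pointer();
    lk->in_use      = 1;
    push_linkage(lk);                                 /* 6. push linkage so calls can nest  */
    void *estack = grab_estack(b->server);            /* 7. find an E-stack                 */
    set_stack_pointer(estack);                        /* 8. thread now runs on the E-stack  */
    switch_address_space(b->server);                  /* 9. change the address space        */
    pd->entry(astack);                                /* 10. upcall to the address in the PD */
    return 0;
}
```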
RETURN
------
- the server stores the result on the A-stack
- no extra checking/verification by the kernel is needed, because the caller's return address was stored in the linkage record at call time
- hence, fast and simple

LRPC ON MULTIPROCESSORS
-----------------------
PROBLEM: the context switch on an LRPC (client domain to server domain) is expensive
- code to update the hardware's virtual memory registers (page tables, for example)
- TLB misses
SOLUTION:
- cache domain contexts on idle processors
- on a call to the server, search for an idle processor already in the context of the server domain, then switch the calling thread to that processor
- on return, do the same thing, but search for a processor in the context of the client domain
GAIN: low latency and high call throughput

COPY SAFETY
-----------
PROBLEM: client and server share the A-stack, so it is possible for the client to corrupt the A-stack while the server executes.
SOLUTION:
1. The server copies/verifies data only when needed
2. Not needed for opaque parameters (e.g. buffers)
   e.g. write buffers: the server doesn't care about the content
3. The server can integrate validity checks with copying
   e.g. for a type-safe language
4. Adds at most one extra copy (on top of the initial one):
   the server can simply copy the argument to the E-stack

THE UNCOMMON CASE
-----------------
1. Real remote calls
   + identified at bind time, by a bit in the Binding Object (see the stub sketch below)
2. A-stacks
   + arguments too large to fit in an A-stack --> use a memory segment
   + too few A-stacks:
     ~ the caller can wait
     ~ or more can be allocated, but validation during calls then takes more time
3. Reliability: what if the client/server crashes (or terminates)?
   + PRINCIPLE: clean up bindings and shared structures
     ~ revoke Binding Objects: hence no more in-calls and out-calls
     ~ invalidate any active linkage records
   + if a server terminates, its active calls must return to the client domain
     ~ scan the domain's threads, looking for active threads that were running on behalf of an LRPC call
     ==> restart each such thread in the client with a *call-failed* exception
   + if a client terminates with an outstanding LRPC call, that call, when finished, must not be allowed to return to its originating domain
     ~ when a (server) thread returns from an LRPC call, it follows the linkage record back to the client
     ~ if it finds an invalidated linkage record, it knows the client has terminated
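A small hypothetical sketch of how a client stub might branch between the common local case and the uncommon cross-machine case; the `remote` bit cached in the binding and the helpers (`grab_astack`, `lrpc_trap`, `full_rpc`) are invented names for illustration.

```c
/* Hypothetical client stub for a procedure with two small, fixed-size arguments. */
#include <stddef.h>
#include <stdint.h>

struct binding { int remote; /* cached at bind time: is the server off-machine? */ };

extern struct binding *the_binding;              /* obtained at bind time                */
extern int32_t *grab_astack(int proc_id);        /* pop an A-stack (LIFO queue in stub)  */
extern int lrpc_trap(struct binding *b, int proc_id, void *astack);
extern int full_rpc(struct binding *b, int proc_id, const void *args, size_t len);

int stub_set_priority(int which, int prio)
{
    if (the_binding->remote) {                   /* uncommon case: real cross-machine RPC */
        int32_t args[2] = { which, prio };
        return full_rpc(the_binding, 0, args, sizeof args);
    }
    /* common case: push the small, fixed-size arguments onto a shared
     * A-stack and trap to the kernel */
    int32_t *astack = grab_astack(0);
    astack[0] = which;
    astack[1] = prio;
    return lrpc_trap(the_binding, 0, astack);
}
```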
--------------------------------------------------------------------------------

*Motivation*
------------
- Monolithic kernels are hard to debug, modify, and validate.
- One alternative is fine-grained protection with capabilities. This approach:
  + relies on protected procedure calls (protected means: trap to the kernel, where the call may be verified)
  + is hard to implement efficiently (think of Hydra, which has extra functionality but is not optimized for the common case, hence is complex and slow)
- The other alternative: small kernels
  + borrow the idea of large-grained protection domains (based on machine boundaries)
  + rely on user-space servers for most functionality
  + use message passing (RPC) for inter-domain communication, hence
    > modular structure
    > easier system design, implementation, and maintenance
    > reliability, fault isolation, ...
  + use a distributed programming model.
    RPC is used for communication across domains
    ==> the PROBLEM: most calls are between domains on the same machine, and their parameters are small and fixed-size;
    using full RPC for them causes a lot of overhead

*WHY USE RPC FOR STRUCTURING SYSTEMS*
-------------------------------------
1. Easy to use compared to the alternatives - the compiler handles most of the details
2. Easy to build a protected subsystem
3. Allows moving components out of the kernel, if it is fast enough
4. More reliable
5. Easier to extend
6. Faster RPC makes it possible to structure systems differently; this brings up the issue of evaluating new capabilities

*Problem with RPC*
------------------
High overhead (we can do a lot better if the inter-domain call is on the same machine)
- stub overhead:
  ~ curse of generality: the same stubs are used for both cross-machine and cross-domain calls
  ~ hence, expensive
- message buffer overhead: RPC can require 4 copy operations
  (2 on call, 2 on return: domain 1 <-> kernel <-> domain 2)
  (LRPC uses a shared A-stack, which reduces the number of copies; see the sketch after this list)
- access validation: the kernel validates the message sender on call and again on return
  (LRPC: the kernel validates once, on call only; also, the server validates arguments, true?)
- message transfer: the sender must enqueue the message, which must later be dequeued by the receiver
  (LRPC is different? How?)
- scheduling: from the programmer's point of view there is one abstract thread, but the underlying control transfer mechanism involves *concrete threads*, fixed in their own domains, signaling one another.
  The abstraction comes at a cost: slow performance (the scheduler must block the client thread and select one of the server's threads for execution)
  (LRPC uses one thread for the control transfer, i.e. the client's thread executes the requested procedure in the server's domain ==> direct context switch)
- context switch: there must be a virtual memory context switch from the client's domain to the server's on call, and then back again on return
  (LRPC: the switch still happens unless the domain is cached on an idle processor; see the multiprocessor notes)
- dispatch: a receiver thread must interpret the message and dispatch a thread to execute it
  (LRPC: the kernel calls directly to an address in the server, hence no dynamic dispatch at the server)
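To make the "four copies vs. one" point concrete, here is a toy C sketch; the buffers and function names are invented for illustration and do not correspond to any real RPC implementation.

```c
/* Toy illustration of copy counts on the call path (all names are hypothetical). */
#include <string.h>

#define ARG_SIZE 32

/* Message-based RPC: four copies of the argument data on the call path alone. */
void rpc_call_path(const char client_args[ARG_SIZE])
{
    char client_msg[ARG_SIZE], kernel_msg[ARG_SIZE];
    char server_msg[ARG_SIZE], server_stack[ARG_SIZE];

    memcpy(client_msg,   client_args, ARG_SIZE);  /* 1. client stub stack -> RPC message   */
    memcpy(kernel_msg,   client_msg,  ARG_SIZE);  /* 2. client domain     -> kernel        */
    memcpy(server_msg,   kernel_msg,  ARG_SIZE);  /* 3. kernel            -> server domain */
    memcpy(server_stack, server_msg,  ARG_SIZE);  /* 4. server message    -> server stack  */
    (void)server_stack;
}

/* LRPC: the client stub writes the arguments once, into the shared A-stack,
 * which the server reads directly (plus at most one optional copy to the
 * E-stack if the server must protect itself against concurrent changes). */
void lrpc_call_path(const char client_args[ARG_SIZE], char *shared_astack)
{
    memcpy(shared_astack, client_args, ARG_SIZE);  /* the only required copy */
}
```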
Some RPC implementations mitigate some of the problems above by:
- allocating messages out of a region specially mapped into both kernel and user domains (thus getting rid of the copy to an intermediate kernel buffer)
- handoff scheduling, i.e. direct context switch
- passing a few small arguments in registers --> eliminating buffer copying
- trading security for performance (as in SRC RPC):
  + using globally shared memory for messages, removing access validation
Overhead still remains in:
- heavyweight stubs and run-time support
- dynamic buffer management
- multilevel dispatch
- interaction with global scheduling state

*LRPC high level*
-----------------
- Optimize for the common case (i.e. the local, cross-domain call)
  + simple control transfer: use the same thread
  + simple data transfer: use a shared stack to store arguments and return results
  + simple stubs: optimized for local transfer
  + design for concurrency: avoid shared data structures, use domain caching to make use of multiprocessors
- LRPC is a combination of *protected* and *remote* procedure call
  + in terms of execution model: protected procedure call
    > the call to the server is made through a kernel trap
    > the kernel validates the call, creates a call linkage, and dispatches the concrete thread to the server
    > the procedure returns through the kernel back to the point of the client's call
    > NOTE: the linkage contains the return address in the client
  + in terms of semantics: remote procedure call
    > servers export interfaces
    > clients bind to interfaces
    > clients and servers reside in large-grained protection domains

*LRPC details: Binding*
-----------------------
Three data structures (sketched in code after this section):
1) Procedure Descriptor (PD)
2) Linkage record
3) Binding Object
Conceptually like traditional RPC binding, but the details differ because of the high degree of cooperation among client, kernel, and server.
- The server exports its interface to a clerk in the RPC runtime. The clerk waits for imports from clients.
- The client imports an interface by calling the kernel; the kernel notifies the clerk. The clerk replies to the kernel (which passes the result on to the client) with a *Procedure Descriptor* (PD) for each procedure in the interface.
- A PD includes:
  + the entry address in the server domain (in real RPC this is an index into a table that stores the address - why? security reasons...)
  + the number of simultaneous calls allowed (n)
  + the size of the A-stack (in which arguments and return values are stored)
- The kernel allocates A-stacks (as many as the number of simultaneous calls) and maps them shared by both domains
  (Note: procedures in the same interface whose A-stacks are of similar size can share A-stacks, reducing the storage needed)
- The kernel creates a *linkage record*, storing the caller's return address (known only to the kernel), for each A-stack
- The kernel returns a *Binding Object* and the A-stack list to the client.
  + The client must present the Binding Object to the kernel for validation each time it calls.
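A minimal sketch of the binding-time data structures described above; the field names and layout are assumptions for illustration, not the paper's definitions.

```c
/* Hypothetical declarations of the three binding-time structures. */
#include <stddef.h>

struct procedure_descriptor {
    void   (*entry)(void *astack);   /* entry address in the server domain           */
    int      max_simultaneous_calls; /* how many A-stacks to allocate for this proc  */
    size_t   astack_size;            /* room for arguments plus return values        */
};

struct linkage_record {              /* one per A-stack, known only to the kernel    */
    void *caller_return_addr;
    void *caller_stack_pointer;
    int   in_use;
};

struct binding_object {              /* returned to the client; acts as a capability */
    unsigned long token;             /* unforgeable value checked on every call      */
    int           remote;            /* set if the server is actually off-machine    */
    struct procedure_descriptor *pd_list;
    int                          npd;
};
```

At bind time the kernel would allocate `max_simultaneous_calls` A-stacks of `astack_size` bytes per PD, map them into both domains, and pair each A-stack with a linkage record.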
*LRPC details: Calling*
-----------------------
QUESTION: How do they know whether a call is local or remote?
A: at bind time, a bit of information is cached
- The client stub sets up an A-stack (puts the arguments on it, etc.) and traps into the kernel
  (the A-stacks given to the client at bind time are managed as a LIFO queue by the client stub)
- The kernel:
  + verifies the Binding Object and procedure identifier, and locates the PD
  + verifies the A-stack and locates its linkage record
  + ensures the A-stack/linkage record are unused
  + records the caller's return address and stack pointer in the linkage record
  + pushes the linkage record onto a per-thread stack
    (Why? To allow a thread to be involved in more than one cross-domain call)
  + locates an execution stack (E-stack) in the server's domain
  + updates the thread's stack pointer to use the E-stack
  + changes the virtual memory registers (isn't this a VM context switch?)
  + performs an upcall into the server using the address in the PD
- The server:
  + returns by trapping to the kernel
  + no need to verify the returning thread's right to transfer back to the calling domain, because it was granted at call time
  + no verification of data structures (it is implicit in the return)
  + no explicit message passing (since the result is on the A-stack)
Notes:
- Since the A-stack is shared, the server can directly access the arguments without an extra copy operation. In the same way, the client finds the return result on the A-stack.
- This works only in a language/calling convention that uses a separate argument pointer (if arguments were required to be on the E-stack, this optimization would not be possible).
- Call by reference requires a local reference - why? To prevent the caller from passing a bad address??? (needs explanation)
- How is protected transfer obtained? Through the E-stack: since the E-stack is privately mapped, only the kernel and server know it.
  (In contrast, RPC obtains this safety by implication, i.e. separate stacks derive from separate threads.)
- E-stacks are dynamically associated with A-stacks
  + the association is made on the first call with a given A-stack
  + E-stacks are reclaimed when the supply of E-stacks runs low
  + why not static? Static association would be faster, but E-stacks can be large, so they must be managed conservatively

*LRPC details: stub optimization*
---------------------------------
LRPC stubs blur the boundaries of the traditional RPC layers (see the sketch after this section)
- the kernel directly calls server stubs, hence no message examination and dispatch
- since the kernel primes the E-stack with the initial call frame expected by the server's procedure, the server stub can branch to the first instruction of the procedure
==> Hence, a simple LRPC needs only one formal procedure call (into the client stub) and two returns (one out of the server procedure and one out of the client stub)
Why? Because the kernel sets up everything and records the server's address in the PD, so it can call the server directly. Hence, no dynamic dispatch at the server (unlike normal RPC).
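A minimal sketch of what a generated server stub could look like under the stub optimization described above; the procedure `add`, the trap `lrpc_return`, and the A-stack layout are hypothetical.

```c
/* Hypothetical server stub for a procedure   int add(int a, int b);
 * The kernel upcalls this stub directly at the address recorded in the PD,
 * passing the shared A-stack, so no message parsing or dispatch table is needed. */
#include <stdint.h>

extern int  add(int a, int b);            /* the real server procedure        */
extern void lrpc_return(void);            /* trap back to the kernel          */

void add_server_stub(void *astack)
{
    int32_t *args = astack;
    int32_t  a = args[0];                 /* arguments read straight off the  */
    int32_t  b = args[1];                 /* shared A-stack (no unmarshaling) */
    args[0] = add(a, b);                  /* result goes back on the A-stack  */
    lrpc_return();                        /* return through the kernel        */
}
```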
*LRPC details: LRPC on multiprocessors*
---------------------------------------
- minimize the use of shared data structures on the critical domain-transfer path
  + each A-stack queue has its own lock
  + no other locking occurs
  ==> hence, little interference when calls occur simultaneously
- protection domains (processes) are cached on idle processors
  + this reduces overhead by eliminating the cost of updating the virtual memory hardware and flushing the TLB
  + when a call is made, the kernel checks for a processor idling in the context of the server domain; if one is found, the kernel exchanges the processors of the calling and idling threads, placing the calling thread on the processor where the server domain's context is already loaded, so the called server procedure can execute without a context switch (i.e. without changing VM registers).
    The idling thread continues, but in the context of the client's domain, on the client's original processor.
  + a context switch is performed only when the domain is not cached
    --> on a *domain miss* (i.e. an idling processor in that domain is needed but not found), the kernel increments a counter for that domain for scheduling purposes (making idle processors spin in the domains showing the most LRPC activity)
  + this is a generalization of the technique of caching blocked threads (to reduce wakeup latency)
- protection domain caching vs. a *tagged TLB*
  + on multiprocessors, they achieve the same result for commonly called servers
  + on a single processor, a tagged TLB still requires the mapping hardware registers to be modified (domain caching does not)
  + domain caching also preserves per-processor locality across calls

*LRPC details: Argument copying*
--------------------------------
Mike's notes: Copying Safety
- Normal RPC makes copies of arguments
  + many times - up to 4 times
  + QUESTION: What is the benefit? A: Ensures COW semantics; client changes can't corrupt the server
- LRPC uses shared stacks accessible to both processes
  + the client can overwrite the A-stack while the server accesses it
  + Solution:
    1. The server copies/verifies data only when needed
    2. The destination address for return values is private to the client; there is no benefit in having the kernel, rather than the server, write to it
       Q: What would the effect of doing it wrong be? A: returning an incorrect value
    3. Not needed for opaque parameters (e.g. buffers)
    4. More efficient to have the stubs do the copying than the kernel:
       the server can integrate validity checks with copying
    5. Adds at most one extra copy (on top of the initial one)
    6. COMMENT: More like a system call, where the kernel validates parameters
    7. ISSUE: Complicates the server
       QUESTION: What does this mean for safety/security?
Safety:
- a pairwise A-stack acts like a private channel between the client and server
- this provides protection from third-party domains
Argument copying:
- copying is performed in the stubs, not in the kernel. Why do this?
  1) it is faster, of course
  2) for type safety: a client can pass a bad argument and crash the server, so the server needs to check arguments before using them.
     Folding this check into the copy operation can result in less work than if the value is copied by the message system (4 steps) and then checked by the server stub.
- in LRPC, an argument starting on the client stub's stack is copied once, from the client stub's stack to the shared A-stack
- in RPC, there are 4 copy operations:
  1) from the client stub's stack to the RPC message
  2) from the RPC message in the client's domain to the kernel
  3) from the message in the kernel's domain to the message in the server's domain
  4) from the message in the server's domain to the server's stack
  QUESTION: What is the benefit? A: Ensures COW semantics; client changes can't corrupt the server
- The problem with LRPC: *asynchronous changes because of shared memory*,
  i.e. it is possible for a client or server to change the values of arguments in the A-stack once control has transferred across domains
  ==> the client can overwrite the A-stack while the server accesses it
  This does not happen with RPC, thanks to the extra copy operations.
  Solution (see the sketch below):
  - for immutable arguments, make an extra copy:
    1) from the client stub's stack to the A-stack
    2) from the A-stack to the server's stack
  - for mutable arguments, e.g. a write with an array of bytes (where the server does not care about the values in the array), stay with the single copy
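A small hypothetical sketch of the "copy, then validate" pattern for an immutable argument; the A-stack layout, `handle_request`, and all names are illustrative assumptions.

```c
/* Hypothetical server-stub fragment: copy an immutable argument out of the
 * shared A-stack before validating it, so a concurrent change by the client
 * cannot invalidate a check the server has already made. */
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 64

extern int handle_request(int32_t opcode, const char *name);

int request_server_stub(void *astack)
{
    const volatile char *shared = astack;   /* the client may modify this concurrently */

    /* 1. Copy the arguments from the shared A-stack into private locals on
     *    the E-stack - the "at most one extra copy".                         */
    int32_t opcode;
    char    name[NAME_LEN];
    for (size_t i = 0; i < sizeof opcode; i++)
        ((char *)&opcode)[i] = shared[i];
    for (size_t i = 0; i < NAME_LEN; i++)
        name[i] = shared[sizeof opcode + i];

    /* 2. Validate the private copies; later changes to the A-stack by the
     *    client can no longer affect what the server actually checked.       */
    name[NAME_LEN - 1] = '\0';
    if (opcode < 0)
        return -1;

    return handle_request(opcode, name);
}
```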
*Reliability*: from Mike's notes
--------------------------------
- What do you do if a server thread crashes?
- Question: what is the key problem?
  A: the client's thread has been taken over by the server; you can't just time out, because the server is actively using it
- Solution: duplicate the client thread's state into a new thread

*The Uncommon cases*
--------------------
- LRPC still supports cross-machine RPC
  + detected in the first instruction of the client stub
- A-stack size and number:
  + the size can be static (known at compile time), or the size of an Ethernet packet (when the argument size is unknown)
  + when arguments are too large to fit in an A-stack, a large out-of-band memory segment is used
  + A-stacks allocated at bind time are contiguous (for quick validation via a simple range check). If the pre-allocated A-stacks turn out to be too few, the client can wait for one to become available, or more can be allocated; a newly allocated A-stack may not be contiguous with the previous ones, so its validation may take more time.
- Domain termination (need to REREAD this section)
  To maintain procedure call semantics:
  - if the terminating domain is a server handling an LRPC request, the call, completed or not, must return to the client domain
  - if the terminating domain is a client with a currently outstanding LRPC request to another domain, the outstanding call, when finished, must NOT be allowed to return to its originating domain
  ==> hence the solution (see the sketch below):
  1) revoke the Binding Objects: this prevents any more out-calls from the domain and prevents other domains from making any more in-calls
  2) threads running on behalf of the terminating domain are returned to the client domain (raising a call-failed exception)
  3) linkage records of the terminating domain are invalidated
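A rough sketch of the cleanup steps on domain termination, assuming some invented kernel bookkeeping helpers (`domain_bindings`, `all_threads`, etc.); it is illustrative, not the paper's algorithm.

```c
/* Hypothetical domain-termination cleanup: revoke bindings, rescue threads,
 * invalidate linkage records. All names and structures are illustrative. */
#include <stddef.h>

struct binding { int valid; };
struct linkage { int valid; };
struct thread  { struct thread *next; int running_lrpc_for_domain; };

extern struct binding *domain_bindings(int domain, int *n);
extern struct linkage *domain_linkages(int domain, int *n);
extern struct thread  *all_threads(void);
extern void restart_in_client_with_call_failed(struct thread *t);

void domain_terminate(int domain)
{
    int n;

    /* 1. Revoke Binding Objects: no more in-calls or out-calls. */
    struct binding *b = domain_bindings(domain, &n);
    for (int i = 0; i < n; i++)
        b[i].valid = 0;

    /* 2. Threads currently running in this domain on behalf of an LRPC call
     *    are restarted in their client with a call-failed exception.        */
    for (struct thread *t = all_threads(); t != NULL; t = t->next)
        if (t->running_lrpc_for_domain == domain)
            restart_in_client_with_call_failed(t);

    /* 3. Invalidate the domain's linkage records, so an outstanding call
     *    that later tries to return here discovers the client is gone.      */
    struct linkage *lk = domain_linkages(domain, &n);
    for (int i = 0; i < n; i++)
        lk[i].valid = 0;
}
```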