
Lightweight Remote Procedure Call

Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Lightweight Remote Procedure Call. ACM Transactions on Computer Systems 8(1), February 1990, pp. 37-55.

Reviews due Tuesday, 2/23

Comments

1. Summary
This paper describes Lightweight Remote Procedure Call (LRPC), a communication mechanism optimized for communication across protection domains on the same machine. LRPC borrows the execution model of a protected procedure call and combines it with programming and protection models similar to RPC's, creating a faster RPC for local communication within a machine.

2. Problem
A monolithic kernel is hard to debug, modify, and validate. As an alternative to fine-grained protection with capabilities, which is hard to implement efficiently, small-kernel OSes borrow the large-grained protection and programming models of RPC from distributed computing. Although RPC provides modular structure, failure isolation, and so on, it incurs a huge overhead due to stub execution, multiple message copies, access validation, context switching, etc. This is undesirable, especially for communication between domains on the same machine. To avoid this overhead, many developers package logically separate entities into a single domain, compromising safety and modularity for performance.

3. Contributions
The main contribution of this paper is optimizing RPC performance for the common case, namely communication across protection domains on the same machine, by identifying and avoiding the overheads of RPC mechanisms that were designed for cross-machine communication. i) LRPC does work in advance, removing it from the critical path: since the size of a procedure's argument stack is known in advance, the kernel pairwise-allocates A-stacks at bind time, mapped read-write and shared by both domains, along with a linkage record used to record the caller's return address. ii) LRPC avoids unnecessary message copying by using the shared A-stack mapped into both the server and client domains; access validation by the kernel on return is avoided by using the address saved in the linkage record. iii) LRPC makes control transfer simple and efficient by having the client's thread execute the requested procedure in the server's domain. iv) LRPC avoids dynamic dispatch at the server: the Binding Object contains the entry address of the server's procedure, so a simple LRPC needs only one formal procedure call (into the client stub) and two returns (one out of the server procedure and one out of the client stub). v) By caching protection domains on idle processors, LRPC reduces context-switch overhead, including the cost of updating virtual memory registers and the resulting TLB misses. LRPC also handles the uncommon cases of cross-machine RPC and RPCs with large parameters correctly.
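To make these pieces concrete, here is a rough C sketch of the per-call bookkeeping the review describes (field and type names are my own, not taken from the paper's Modula2+ implementation):

    /* Hypothetical LRPC bookkeeping structures (invented names). */
    #include <stddef.h>

    struct procedure_descriptor {       /* one per procedure in an interface    */
        void  (*entry)(void *astack);   /* entry address in the server domain   */
        size_t  astack_size;            /* argument-stack size, known up front  */
        int     num_astacks;            /* how many A-stacks to preallocate     */
    };

    struct linkage_record {             /* filled in by the kernel at call time */
        void *caller_return_addr;       /* where the client resumes on return   */
        void *caller_stack_ptr;         /* client stack pointer to restore      */
    };

    struct binding_object {             /* capability handed back to the client */
        int server_domain;              /* which domain the client may call     */
        int remote;                     /* set when the binding is cross-machine */
    };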

4. Evaluation
The authors motivate their work with measurements on then-contemporary OSes (one of them a monolithic kernel) showing that only a small fraction of RPCs are used for remote communication and that most RPCs pass only small, fixed-size parameters. They further show how RPC performance suffers for cross-domain communication within a single machine and quantify the overhead involved; this really helps motivate the case for LRPC. The authors then measure the time taken for LRPC calls with different argument and result sizes and show that LRPC, with and without domain caching, performs better than the prior RPC system for same-machine communication. I really liked how the authors were able to reason about their performance by breaking the times down into individual overheads. By further showing call throughputs, they demonstrate that their performance optimizations for multiprocessors are effective. It would have been nice to see performance measurements for the uncommon cases too, especially for large parameters allocated in a separate memory segment. Further, the authors could have evaluated the overall effectiveness and practicality of their system by running a few larger benchmarks.

5. Confusion
Is the LRPC concept used in any contemporary OSes?

1. Summary
This paper describes a new communication facility known as Lightweight Remote Procedure Call (LRPC). The purpose of LRPC is to provide efficient communication between protection domains on the same machine. The authors give a detailed motivation to make the case for such a facility, follow it up with implementation details, and finally evaluate LRPC against an optimized RPC, showing that it is about three times faster. The authors also show that cross-domain activity is very frequent and mostly involves passing small data structures.
2. Problem
Prior to this work, cross-domain communications were treated as instances of RPC and incurred the corresponding RPC overheads. To dodge this cost, various subsystems were placed in the same protection domain; though this eliminated the RPC overheads, it sacrificed security for performance. The authors want a solution that ensures security is not sacrificed to achieve performance.
3. Contribution
The main aim of the proposed solution is to perform better than conventional RPC systems for cross-domain communication without sacrificing safety for performance. Four key design techniques contribute to LRPC's better performance. Firstly, simple control transfer is achieved by having the client's thread execute the requested procedure in the server's domain: the kernel switches the address space of the client's thread rather than signaling a separate server thread. Secondly, LRPC achieves simple data transfer through shared argument stacks known as A-stacks; the data is copied just once onto the shared A-stack, whereas normal RPC involves additional copying. Thirdly, LRPC has simple, highly optimized stubs generated in assembly language; on a transfer, the kernel can directly invoke the server stub, which eliminates the overhead of intermediate message examination and dispatch (although LRPC stubs behave more like conventional RPC stubs when dealing with complex parameters). Finally, LRPC benefits from the speedup potential of a multiprocessor by avoiding shared data structures on the critical domain-transfer path. Another feature that caught my eye was the use of domain caching on idle processors to reduce context-switch overheads.
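As a rough illustration of the copy-once idea (my own C sketch with invented runtime calls, not the authors' generated stub), a client stub for a hypothetical Add procedure might do little more than this:

    /* Hypothetical client stub: the arguments are written once, directly into
     * the A-stack shared with the server; no message is built or re-copied. */
    struct add_args { int a, b, result; };

    extern void *lrpc_get_astack(int binding, int proc);          /* assumed runtime call */
    extern void  lrpc_trap(int binding, int proc, void *astack);  /* assumed kernel trap  */

    int Add(int binding, int a, int b)
    {
        struct add_args *args = lrpc_get_astack(binding, 0);
        args->a = a;                    /* the only copy of the arguments          */
        args->b = b;
        lrpc_trap(binding, 0, args);    /* client thread runs on in the server     */
        return args->result;            /* result comes back on the same A-stack   */
    }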
4. Evaluation
The authors evaluate the performance of LRPC on the C-VAX Firefly. They run four tests and compare the performance of LRPC with Taos RPC; the results clearly show that LRPC achieves a 3x performance gain over Taos RPC. Next, the authors provide details about the major causes of overhead in LRPC through a breakdown of the time, which brings out the fact that TLB misses contribute significantly to the observed overhead. Lastly, the authors evaluate the performance of LRPC on multiprocessors, and the results show that the throughput of LRPC is higher than that of SRC RPC. Even though the authors have done a fairly extensive evaluation of the proposed mechanism, a few other aspects could have been evaluated: firstly, the authors do not measure the benefits of domain caching; secondly, a comparison of LRPC with other IPC mechanisms such as pipes would have been ideal.
5. Confusion
I did not quite understand how parameter copying could be avoided. A discussion about the same would be of great help.

1. summary
This paper presents LRPC (Lightweight Remote Procedure Call), a lightweight communication facility that combines the control transfer and communication model of capability systems with the programming semantics and large grained protection model of RPC systems to handle communication between protection domains on a single machine.

2. Problem
In existing RPC systems, cross-domain calls are implemented in terms of the facilities required by cross-machine calls, with significant overheads compared to simple procedure calls. However, the common case for communication is between domains on the same machine rather than across machines, and most communication is simple, involving few arguments and little data, since complex data is often hidden behind abstractions. Thus, the authors propose LRPC as a lightweight solution.

3. Contributions
The overall contribution of the paper is that it proposes and implements a mechanism that is much lighter than RPC while preserving RPC's programming model and the protection of separate domains. Here are some specific contributions:
* Control transfer mechanism: LRPC lets the client thread execute the requested procedure in the server's domain, minimizing the scheduling and context-switch overheads of RPC (a sketch of the kernel's call path follows this list).
* The shared argument stack (A-stack) is an efficient model for data transfer between the client and the server, and it stays secure because the kernel validates it during the call.
* LRPC generates highly optimized stubs for communication.
* LRPC minimizes the use of shared data structures and also caches domains on idle processors, providing higher call throughput.
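The sketch referenced above is a schematic C rendering (all names invented) of the kernel's work on the call path, following the paper's description:

    /* Schematic kernel call path (invented names; steps follow the paper's text). */
    struct thread { void *pc, *sp; };
    struct binding_object { int server_domain; };
    struct procedure_descriptor { void (*entry)(void *); };
    struct linkage_record { void *caller_return_addr, *caller_stack_ptr; };

    extern void verify_binding(struct thread *, struct binding_object *);
    extern struct procedure_descriptor *lookup_pd(struct binding_object *, int proc);
    extern void verify_astack(struct procedure_descriptor *, void *astack);
    extern struct linkage_record *linkage_for(void *astack);
    extern void *find_estack(int domain);
    extern void load_address_space(int domain);
    extern void upcall(void (*entry)(void *), void *astack);

    void lrpc_call(struct thread *t, struct binding_object *bo,
                   int proc, void *astack)
    {
        verify_binding(t, bo);                      /* may the caller use this binding?  */
        struct procedure_descriptor *pd = lookup_pd(bo, proc);
        verify_astack(pd, astack);                  /* A-stack really belongs to this PD */

        struct linkage_record *lr = linkage_for(astack);
        lr->caller_return_addr = t->pc;             /* remember where to come back to    */
        lr->caller_stack_ptr   = t->sp;

        t->sp = find_estack(bo->server_domain);     /* run on a private execution stack  */
        load_address_space(bo->server_domain);      /* switch virtual memory context     */
        upcall(pd->entry, astack);                  /* enter the server stub directly    */
    }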

4. Evaluation
The authors first build up their argument for LRPC by showing that RPC has several overheads and that most communication happens across domains within a single machine, with few arguments and small argument sizes. For that they examine and benchmark three operating systems: the V System, Taos, and UNIX+NFS, and clearly validate their claims. That also makes the overall design of LRPC very intuitive to grasp.
For the evaluation of LRPC, they compare the performance of LRPC against Taos RPC on a C-VAX Firefly with four different tests, measuring the time to complete each procedure call between a client and server in different domains. The tests show that LRPC is around three times faster than Taos RPC. The authors also evaluate their decision to avoid locking shared data (to remove contention) and show that doing so achieves almost five times more throughput than the RPC equivalent.
Overall their evaluation defends the argument they make for LRPC: that it is lightweight and much more efficient for cross-domain communication. They also provide a detailed breakdown of specific overheads in LRPC. However, their tests only include simple calls with few arguments; it would be interesting to see similar numbers and figures for more complex and recursive data structures, such as linked lists and binary trees, as arguments. Another test on cross-machine communication would also provide more insight into its performance relative to RPC.

5. Confusion
How does the client discover procedures in the server? In RPC, for example, a database of servers and their procedures is maintained. Here it is not clear whether the "clerk" handles that (it looks more like a "runtime" equivalent).

Summary
LRPC is a communication facility designed and optimized for communication between protection domains on the same machine. LRPC combines the control transfer and communication model of capability systems with the programming semantics and large-grained protection model of RPC. It achieves a factor-of-three performance improvement over more traditional approaches.

Problem
In contemporary small-kernel operating systems, existing RPC systems incur an unnecessarily high cost when used for the communication that predominates: between protection domains on the same machine. This cost leads system designers to coalesce weakly related subsystems into the same protection domain, trading safety for performance. By reducing the overhead of same-machine communication, LRPC encourages both performance and safety. Using measurements from three contemporary operating systems, the authors observed that only a small fraction of RPCs were truly remote and that calls rarely passed large or complex arguments. Hence, a simple cross-domain call mechanism was designed.

Contribution
i) Simple control transfer: In traditional RPC implementations, the control transfer mechanism involves concrete threads fixed in their own domains signaling one another, which adds a level of indirection. In LRPC, the client's thread executes the requested procedure in the server's domain.

ii) Simple data transfer: A shared argument stack accessible to both client and server eliminates redundant data copying. The data is copied to the A-stack just once, after which no further copying is needed, whereas a cross-domain message transfer in conventional RPC requires four copy operations. Protection is maintained using the Binding Object.

iii) Simple stubs: Since the communication stays within one machine, stub generation is highly optimized and does not involve any marshalling.

iv) Design for concurrency: Multiple processors are used to reduce LRPC latency by caching domain contexts on idle processors, which reduces the cost of a context switch during an LRPC (see the sketch after point v). The high cost of frequent domain crossing could also be reduced by a TLB that includes a process tag; however, domain caching performs even better than a tagged TLB, because a domain switch with a tagged TLB still requires the virtual memory registers to be modified on the critical transfer path, while domain caching avoids even that.

v) The data structures and control sequences of LRPC were designed to minimize TLB misses after a context switch. A context switch requires TLB invalidation on the CVAX, so this layout helps reduce that overhead.
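The sketch referenced in point iv, my own C-style rendering of the domain-caching idea (all names invented): if some processor is already idling in the server's context, the caller trades processors with it instead of reloading the virtual memory registers.

    /* Sketch: domain caching on idle processors (invented names). */
    struct thread;
    extern int  find_idle_cpu_in_domain(int domain);
    extern void exchange_processors(struct thread *caller, int cpu);

    int try_domain_cached_call(struct thread *caller, int server_domain)
    {
        int cpu = find_idle_cpu_in_domain(server_domain);  /* assumed lookup        */
        if (cpu < 0)
            return 0;                     /* fall back to the normal context switch */
        exchange_processors(caller, cpu); /* caller now runs on cpu, already loaded
                                             with the server's context; the idle
                                             thread takes the caller's old processor,
                                             still in the client's context, ready
                                             for the return path                    */
        return 1;
    }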


Evaluation
To evaluate the performance of LRPC, four tests were used, run on the C-VAX Firefly using LRPC and Taos RPC. One of the tests is a Null call, against which the overhead of LRPC is measured; three others, named Add, BigIn, and BigOut, represent calls with typical parameter sizes. LRPC is seen to be about 3x better than Taos, and LRPC/MP (with domain caching on a multiprocessor) is faster still.
A cost breakdown of the Null call is compared against the theoretical minimum cross-domain call cost; it shows that approximately 18 microseconds are spent in the client stub and 3 microseconds in the server's, with the other 27 microseconds going to kernel overhead, binding validation, and linkage management. This breakdown gives a very clear understanding of the call and of where the overheads are.
To evaluate the benefit of avoiding locks on shared data during call and return on a shared-memory multiprocessor, call throughput is measured as a function of the number of processors simultaneously making calls, with domain caching disabled. For LRPC, four processors give a speedup of close to 3.7, whereas the benefit of Taos RPC levels off at two processors because of a global lock that is held during a large part of each RPC. These tests demonstrate the effectiveness of LRPC.
But the tests cover only varying argument sizes; measurements with further variation in the complexity of data structures are not shown. Also, since a shared A-stack is allocated instead of copying, what are the memory impacts of sharing A-stacks and maintaining E-stacks? The system has also been verified on only one hardware platform and one operating system; in my opinion it should have been tested on more.


Confusion
How does the exchange of processors between the calling thread and an idle thread happen when a thread is sitting idle?
Which part of the control transfer is protected by a lock in the shared-memory multiprocessor case? The mechanism is unclear.

1. Summary
The article describes the design and implementation of Lightweight Remote Procedure Call (LRPC), which follows the transfer, communication, protection, and programming semantics of RPC but is optimized for communication between processes on the same machine.

2. Problem
The article explores the problems of inter-process communication among processes running on the same machine. Large monolithic kernels had just two levels of protection between the kernel and user space, leaving the hardware directly exposed to any software running inside the kernel. Contemporary small kernels that offered fine-grained protection with protected procedure calls were suggested as a solution, but programmers found it difficult to build efficient systems with such fine-grained protection. So many small-kernel operating systems borrowed the large-grained protection and programming semantics of distributed systems and used RPC to communicate between processes or protection domains. This approach is convenient for both local and remote communication, but carries a huge overhead for local communication. The authors show that in long-running experiments only about 3% of communication crosses the machine boundary, so the traditional RPC mechanisms adopted by small kernels are not optimized for the common case of communication within the local system.

3. Contributions
The main contribution of this work is the design, implementation, and evaluation of LRPC, which follows RPC semantics while achieving a significant performance gain for cross-domain communication. The gain comes from three simplifications: control transfer is simplified by using pre-allocated execution stacks in the server domain, so the kernel merely updates the user thread's stack pointer to an E-stack and starts executing the procedure exported by the server; data transfer is simplified by using a shared argument stack, which can avoid a redundant copy when the calling convention allows a separate argument pointer; and a highly optimized stub generator emits client and server stubs in assembly language for better performance. LRPC also introduces an optimization for multiprocessors that caches domain contexts on idle processors to avoid full context switches.
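The separate-argument-pointer point can be made concrete with a small sketch (my own C rendering with invented names, not the actual Modula2+ machinery): the server-side stub hands the procedure a pointer to the arguments sitting on the shared A-stack, so nothing needs to be copied onto the private E-stack.

    /* Sketch: the server reads arguments in place on the shared A-stack while
     * its locals live on the private E-stack (invented names). */
    struct add_args { int a, b, result; };

    static void add_impl(struct add_args *args)   /* the real server procedure    */
    {
        int sum = args->a + args->b;              /* locals live on the E-stack   */
        args->result = sum;                       /* result written back in place */
    }

    void add_server_stub(void *astack)            /* upcalled by the kernel       */
    {
        add_impl((struct add_args *)astack);      /* no copy onto the E-stack     */
    }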

4. Evaluation
The article evaluates the cross-domain performance of LRPC against the existing RPC mechanism in Taos by passing typical argument sizes to the procedures Add, BigIn, and BigOut. The evaluation shows about a 3x improvement on a single-processor system and better scaling than traditional RPC as the number of processors increases. It also shows that LRPC adds only a small overhead over the theoretical minimum time for a cross-domain call. The article claims that LRPC can also serve cross-machine calls by branching from the LRPC stub to an RPC stub, but does not evaluate that path against traditional RPC mechanisms. The interface writer is at liberty to determine the number of A-stacks allocated for an interface (the stub generator otherwise uses a default), so it would have been good to evaluate LRPC for different interfaces with the default number of A-stacks and then with more and fewer, to highlight the trade-off between memory use and speed. One more critical point: the platform on which the authors test LRPC allows an argument pointer separate from a thread's execution stack, which helps avoid a redundant copy; it would have been good to see performance numbers for an experiment where this feature is not used, because some architectures and calling conventions do not allow a separate argument pointer.

5. Confusion
How does LRPC switch between inter-domain and inter-machine communication? Won't the decision have to be made during binding phase?

Summary
The authors in this paper discuss a lightweight remote procedure call designed and specialized for communication between the protection domains on the same machine.

Problem
The problem this paper attempts to solve is optimizing traditional RPC for the common case of inter-domain calls, rather than just reusing the techniques built for inter-machine calls. It is observed that most remote procedure calls occur between domains on a single machine rather than between machines, most use fixed-size parameters known at compile time, and most transfer a small number of bytes.

Contributions
The paper demonstrates that the majority of remote procedure calls use simple arguments, transfer a small number of bytes, and occur between domains on a single machine, so the traditional RPC approach carries real, avoidable overhead. A lightweight remote procedure call technique is introduced that shows much of this overhead can be eliminated. The overhead of conventional RPC lies in stubs, message buffers, message transfer, dispatch, access validation, and context switching: arguments move from the stub's stack into a message in the client domain, pass through the kernel and server domains as intermediaries, and are finally pushed onto the server's stack. For cross-domain calls the authors instead present a separate argument stack, associated with an execution stack for the server stub; both domains share this argument stack, so a single copy of the arguments suffices. Because this mechanism minimizes shared data structures, even the stubs can be optimized.

Evaluation
The authors first evaluated existing systems, including the Taos operating system, on uniprocessor and multiprocessor hardware to show that the majority of remote procedure calls are inter-domain, and with the size distribution they showed that these calls are mostly simple calls carrying small amounts of data. LRPC is shown to be about 3x faster than Taos RPC. They provide a breakdown of the overheads of a Null cross-domain call, which compares well with the minimal possible cost and shows that linkage management and binding validation are among the most time-consuming parts. A comparative study with other procedure-call mechanisms could also have been presented, as could evaluation results for the stub optimizations. They also do not discuss the additional memory required for maintaining E-stacks or the memory saved by sharing arguments in a single A-stack. More experiments could have been run for benchmarking.

Confusion
The explanation of argument sharing on stacks could be clarified, as could how security is affected by LRPC.

1) Summary: Since RPC was not optimized for the common case of calls between protection domains on the same machine (cross-domain), where the data transferred is small and often fixed in size, the authors propose Lightweight Remote Procedure Call (LRPC), which combines the idea of protected procedure calls from capability systems with the programming semantics and large-grained protection model of RPC. The evaluation shows that LRPC has far less overhead for cross-domain communication and outperforms RPC by a factor of three.

2) Problem: System designers had opted to coalesce subsystems into large-grained protection domains, trading safety for performance. The authors therefore set out to design a lightweight alternative to RPC for the common case of communication, which is intra-machine but cross-domain, with a few fixed-size arguments. LRPC tries to ameliorate the overheads inherent in RPC (stub execution, message passing and buffering, access validation, dispatching, etc.).

3) Contributions: The authors first perform experiments showing that a] RPC is mostly intra-machine, b] data transferred in RPC is mostly small and fixed in size, and c] RPC overhead can be attributed to stubs, message buffers, access validation, message transfer, scheduling, context switching, and dispatching. Hence the authors propose LRPC, which encompasses: a] simple control transfer - the client thread runs the requested procedure in the server's domain, transferring to the server's execution stack after kernel validation; b] simple data transfer - a shared argument stack (A-stack), accessible to both client and server, reduces the overhead of data copying; c] highly optimized stubs; d] binding - a clerk registers the interface under the server's name and awaits client imports; the Procedure Descriptor List (PDL) has one PD per procedure in the interface, containing the entry address, the number of simultaneous calls, and the A-stack size; the kernel allocates that many A-stacks along with the linkage records that hold return addresses; e] calling - the client initiates an LRPC domain transfer by calling its stub procedure, which places the A-stack, Binding Object, and procedure identifier in registers and traps to the kernel; the kernel then validates the A-stack, stores the return address and current stack pointer in the linkage record, updates the thread's stack pointer to an E-stack, reloads the processor's virtual memory registers with those of the server domain, and calls the server stub. LRPC is also shown to perform well on multiprocessor systems, which can be attributed to its minimal use of shared data structures, minimized locking costs, and caching of domains on idle processors, which reduces context-switch overhead.
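A rough sketch of the binding step in item d] (my own C rendering, all names invented; the real clerk/kernel interaction is more involved than this):

    /* Sketch of binding (invented names): the server's clerk has exported a
     * Procedure Descriptor List; the kernel preallocates shared A-stacks and
     * linkage records and returns a Binding Object to the client. */
    #include <stddef.h>

    struct pd  { void (*entry)(void *); size_t astack_size; int num_calls; };
    struct pdl { int count; struct pd *pd; int server; };
    struct binding_object;

    extern struct pdl *clerk_lookup(const char *interface_name);
    extern void *map_shared_rw(int client, int server, size_t size);
    extern void  attach_linkage_record(void *astack);
    extern struct binding_object *grant_binding_object(int client, int server);

    struct binding_object *lrpc_bind(int client, const char *interface_name)
    {
        struct pdl *pdl = clerk_lookup(interface_name);  /* served by the server's clerk */
        for (int i = 0; i < pdl->count; i++) {
            struct pd *p = &pdl->pd[i];
            for (int j = 0; j < p->num_calls; j++) {
                void *astack = map_shared_rw(client, pdl->server, p->astack_size);
                attach_linkage_record(astack);           /* used later for the return    */
            }
        }
        return grant_binding_object(client, pdl->server); /* capability for future calls */
    }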

4) Evaluation: The authors carry out a reasonable amount of evaluation to validate their claims. They first measure the frequency of inter-machine calls, the size and complexity of parameters, and RPC's performance on cross-domain calls. They then use four benchmarks to evaluate LRPC on a single processor, on multiprocessors, and against Taos. LRPC showed a threefold improvement (in the time taken for a Null call) compared to conventional RPC on Taos, and performed better on multiprocessors than on a single processor. On single-processor machines, roughly 50% of the overhead was added on top of the kernel trap time, and about 25% of the total time went to TLB misses during the context switch. Avoiding a global lock on shared data also showed improvement over global locking: LRPC's throughput increased nearly linearly with the number of processors, whereas RPC's performance leveled off after two processors. Even though the authors carried out relevant evaluation, a few aspects are missing: i] the time and space overhead of the additional data structures (A-stacks, E-stacks, and the PDL); ii] the novel idea of domain caching used to optimize multiprocessor behavior is not evaluated; iii] security testing and results, since shared locks were limited and the PDL is used for access validation; and iv] evaluation on other hardware architectures.

5) Confusion: i] Is an A-stack allocated per client call or is it one A-stack for one protection domain that can in turn have multiple clients? ii] what does caching domains actually mean?

Summary 
This paper talks about the design and implementation of an optimized lightweight RPC (LRPC) mechanism for communication between protection domains on the same machine for small-kernel operating systems. The authors demonstrate that such a communication facility can achieve a three-fold performance gain without sacrificing safety.

Problem           
The overhead of using traditional RPC for communication between independent modular components running in disjoint domains within a small-kernel operating system has often resulted in poor performance, or forced programmers to trade safety for performance. Much of the overhead comes from marshalling and unmarshalling call arguments as messages, access validation, message flow control, and the need for a dispatcher on the server side. The authors claim that much of this overhead can be avoided by carefully redesigning the RPC facility for the single-machine case.

Contributions
According to me, the following are the unique contributions of this paper: 
(1) A novel implementation of LRPC, which is similar to a protected procedure call, but provides programming semantics and coarse-grained protection model of a typical RPC.
(2) The concept of a shared argument stack (A-stack), in which the client copies the arguments and then the server directly operates on the A-stack to avoid additional data copying overheads.
(3) Delegating the responsibility of access validation to kernel and use of Binding Objects to verify the identity and access rights of the caller.
(4) Separation of argument-stack (A-stack) from the privately mapped execution-stack (E-stack) that enable threads to cross safely between protection domains. This A-stack/E-stack separation provides both safety and performance.
(5) Minimized use of shared data structures, where only one lock on the A-stack queue is needed, providing higher call throughput and lower call latency on shared-memory multiprocessors.
(6) Caching of domains on idle processors to reduce context-switch overhead in case of multiprocessors.

Evaluation       
The authors have evaluated the performance of LRPC versus Taos RPC on a C-VAX Firefly, using four different tests that measure the average time in microseconds to complete a procedure call between a client and a server running in different domains. For all of these tests, LRPC is shown to be almost three times faster than RPC. The authors also demonstrate that their decision to avoid locks on shared data (to remove contention) achieves higher throughput on a shared-memory multiprocessor than a typical RPC: LRPC on four processors completes about 23,000 calls per second, whereas SRC RPC levels off with two processors at about 4,000 calls per second. Finally, the authors discuss how LRPC gracefully handles the uncommon cases of variable and large A-stack sizes and premature domain termination.
In my opinion, the authors present a fair evaluation of the proposed LRPC mechanism, along with a detailed breakdown of the exact stub and kernel-transfer overheads above the minimum possible. However, much of their evaluation uses simple arguments, and it would have been interesting to see how gracefully LRPC handles complex, recursive data types. Additionally, memory management of A-stacks and E-stacks seems like an extra housekeeping burden for the kernel, and it would have been good if the paper had presented some numbers for this under varying inter-domain communication workloads.

Confusion 
The paper mentions that a call linkage record can be quickly located given any address in the corresponding A-stack. How exactly does this happen, and what data structures are used for this purpose?

1. Summary
This paper describes a new mechanism for procedure calls between different protection domains on microkernels. It uses a region of shared memory to pass arguments, leading to a three-times improvement in performance, and introduces new mechanisms for thread failure.

2. Problem
When this paper was written, many small-kernel operating systems used the same mechanism both for remote procedure calls and for procedure calls across protection domains on one machine. This often involved copying the arguments four times: from the client stub's stack into an RPC message, from the client's domain into the kernel, from the kernel into the server's domain, and from that message onto the server's stack. In addition, arguments were marshalled in the most general way possible. This created flexibility, allowing complex data structures to be passed, but such structures were rarely used in practice, and the general mechanism greatly diminished the efficiency of marshalling the simple byte sequences that were passed far more frequently.

3. Contributions
This paper contributes a new implementation of RPC. Like traditional RPC, it uses client and server stubs and requires a trap into the kernel. Unlike other RPC systems, the runtime for the remote procedure call is all on the server's side: the stub code handles everything for the client. This runtime maintains a descriptor for each procedure, including the size of the procedure's argument stack.

When a client calls a remote procedure, the kernel uses this size to allocate shared memory for A-stacks, which are located in the server's protection domain and are mapped to be read-write by both the client and the server. Once all steps have been completed, the kernel returns a Binding Object to the client, which is similar to a capability in allowing the client access to the server's interface.

The A-stack, rather than messages, is used to pass parameters from the client to the server. The client pushes the supplied arguments into this region of memory. This avoids the need to marshal the arguments, which can then be directly read by the server, so long as the calling conventions allow a separate argument pointer. The A-stacks are managed in a LIFO queue and can be re-used by procedures in the same interface.
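A minimal sketch of such a LIFO free list (my own C rendering, invented names; per the paper this queue is essentially the only shared structure on the call path that needs a lock):

    /* Sketch: LIFO free list of A-stacks for an interface (invented names). */
    struct astack_node { struct astack_node *next; /* A-stack memory follows */ };

    typedef struct { volatile int held; } lock_t;   /* placeholder lock type  */
    extern void acquire(lock_t *);
    extern void release(lock_t *);

    struct astack_queue {
        struct astack_node *top;
        lock_t              lock;      /* short critical section: push/pop only */
    };

    void *astack_pop(struct astack_queue *q)
    {
        acquire(&q->lock);
        struct astack_node *n = q->top;
        if (n) q->top = n->next;
        release(&q->lock);
        return n;                      /* NULL means all A-stacks are in use */
    }

    void astack_push(struct astack_queue *q, struct astack_node *n)
    {
        acquire(&q->lock);
        n->next = q->top;
        q->top  = n;
        release(&q->lock);
    }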

The stubs for this procedure are generated as assembly code in the vast majority of cases, rather than in a higher-level language as in prior procedure call mechanisms. The stubs are not portable, but need not be so long as the stub generators are portable, as the authors expect.

Unlike a true remote procedure call, the call executes on the client's own thread inside the server's domain, so if the server domain crashes, the client's thread is stranded there. The paper describes a technique to recover from this: when a server domain terminates, the client domain is given a new thread initialized in an error state, as if the server had just returned with an error, and every Binding Object for that server is revoked.
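A schematic version of that recovery path (my own simplified sketch, invented names; the paper's handling of outstanding calls is more careful than this):

    /* Sketch: when a server domain terminates, revoke its bindings and give
     * each client thread stranded in the dead domain a way to "return". */
    struct domain;
    struct linkage_record { int client_domain; void *caller_return_addr, *caller_stack_ptr; };

    extern void revoke_all_binding_objects(struct domain *);
    extern struct linkage_record *next_active_linkage(struct domain *);  /* NULL when done */
    extern void spawn_thread_at(int domain, void *pc, void *sp, int status);
    enum { CALL_FAILED = -1 };

    void on_server_domain_terminate(struct domain *server)
    {
        revoke_all_binding_objects(server);       /* future calls fail at the trap */
        struct linkage_record *lr;
        while ((lr = next_active_linkage(server)) != NULL) {
            /* Give the stranded client a fresh thread that looks as if the call
             * had just returned with an error status. */
            spawn_thread_at(lr->client_domain, lr->caller_return_addr,
                            lr->caller_stack_ptr, CALL_FAILED);
        }
    }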

4. Evaluation
This paper evaluates this work on several microbenchmarks. The authors ran a remote call of a procedure that does nothing, which reveals the extra time taken by the remote procedure call mechanism. They also ran tests of procedure calls with small arguments and of procedure calls with large arguments. They found that all were over three times as fast. This seems to evaluate well their claims of speedup, although larger test cases would have been nice.

The authors also make claims about the portability and flexibility of their approach. They give no evidence that the approach is actually as portable as they claim, and they seemingly evaluate flexibility only by how well the code handles uncommon cases. This seems lacking.

To establish the problem that they are trying to solve, the authors also evaluated how remote procedure calls are used in contemporary systems. They measured activity on a networked multiprocessor workstation connected to a remote file server. In doing so, they found that only five percent of remote procedure calls were actually to remote machines and that very few procedure calls required the marshalling of complex arguments.

5. Confusion
I am still confused by how allowing the client access to memory in the server's domain does not introduce security vulnerabilities.

1. Summary
The Lightweight Remote Procedure Call mechanism tries to address the inefficiencies of normal RPC usage in the common case where small-footprint calls are made on the same machine. This is done by combining the control transfer and communication model of capability systems with the programming semantics and large-grained protection model of RPC. The Taos implementation on the Firefly multiprocessor avoids needless scheduling, excessive run-time indirection, unnecessary access validation, redundant copying, and lock contention.

2. Problem
Small-kernel OSes aim to place components in separate domains, with an RPC-like mechanism used for communication. But this one-size-fits-all approach doesn't account for the real usage pattern: although the bulk of communication is between domains on the same machine and involves small, simple parameters, the overhead of the heavyweight machinery (marshaling, access validation, scheduling, dispatch, etc.) is still paid.

3. Contribution
This paper points to a real mismatch between design and usage pattern. The authors provide a communication mechanism designed for concurrency that maintains domain protection at a lower cost by dispatching a shared argument stack directly into the server domain. Linkage records are used to facilitate the transfer of control. Security is enforced by Binding Objects, which are capabilities obtained at bind time. The main technique is to optimize the common execution path: the LRPC caller still calls into a stub, but instead of a complicated dispatch mechanism, the stub simply uses the A-stack (shared with the callee) to pass arguments, and no unnecessary data copying is done. A-stacks are preallocated. Execution stacks on the server are not associated with A-stacks at bind time but dynamically, so an explosion of E-stack space is avoided. E-stacks are "activated" directly by the kernel, with no dispatching overhead. A-stacks can be shared between procedures in the same interface that have similar size requirements. Stubs are generated in assembly.
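A sketch of the lazy E-stack association mentioned above (my own C rendering, invented names):

    /* Sketch: E-stacks are bound to A-stacks at call time, not at bind time. */
    struct domain;
    extern void *estack_already_associated(void *astack);
    extern void *pop_free_estack(struct domain *);
    extern void *allocate_estack(struct domain *);
    extern void  associate(void *astack, void *estack);

    void *estack_for_call(struct domain *server, void *astack)
    {
        void *e = estack_already_associated(astack);  /* reuse if still attached */
        if (e == NULL) {
            e = pop_free_estack(server);              /* grab one from the pool  */
            if (e == NULL)
                e = allocate_estack(server);          /* grow the pool if empty  */
            associate(astack, e);
        }
        return e;   /* the kernel points the thread's stack pointer here */
    }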

4. Evaluation
The performance evaluation tests four workloads on the C-VAX Firefly using LRPC and Taos RPC. LRPC is about three times faster than conventional Taos RPC, and LRPC on a multiprocessor performs slightly better than on a single processor.
The shared-stack optimization only holds for their language environment; it would not work for the more common case of C-like languages, and in my opinion this is not a good basis for the stack optimization. The mechanism fits well into a system that already uses RPC, where communication happens only through arguments and return values; it is unlikely that one could take it a step further, where a local LRPC caller and callee would also share global data. In a way, they are not comparing like with like: the assembly-generated stubs increase LRPC's performance while the Modula stubs decrease RPC's, which blurs the comparison of the rest of the mechanisms (it would be more accurate if assembly-generated stubs were also used for the old RPC). There is an upfront cost paid at bind time for setting up the A-stacks. Multiple A-stacks allow for concurrency, but the concurrency limit is fixed at bind time. In the uncommon case, RPC-like overhead (marshaling, Modula stubs, etc.) is still involved, so LRPC maintains two coexisting mechanisms.

5. Question
How is tagged TLB slower than domain caching?

1. Summary
This paper describes the motivation, design, implementation, and performance of Lightweight Remote Procedure Call, a communication facility that combines elements of capability systems for protection with the semantics of RPC, optimized for the common case of cross-domain communication on the same machine. It has much simpler control transfer, data transfer, and stubs, and can be further optimized on multiprocessors.

2. Problem
In the paper the authors show that most of the communication traffic in operating systems is between protection domains on the same machine (cross-domain) and that this communication is simple, involving handles to data structures and small value parameters. The authors argue that conventional RPC violates a basic tenet of system design by not isolating the common, much simpler case of cross-domain communication. Because of this overhead, small-kernel OSes suffered either a loss in performance or a weakening of their protective structure. The authors perform experiments estimating the frequency of cross-domain communication at around 95% or more on three systems, with most of these procedures taking small, simple parameters. This is the motivation for designing LRPC.

3. Contributions
The main contribution of LRPC is that it achieves a significant speedup for cross-domain RPC communication while retaining safety and transparency. The execution model of LRPC is borrowed from the protected procedure call. Binding is similar to RPC: a Binding Object is given to the client and acts as authentication for calls into the exported interface. LRPC has a simple control transfer mechanism that lets the client's thread execute the procedure in the server's domain. Data transfer is also much simpler, using an argument stack shared between client and server to avoid redundant copying. Because of the simple data and control transfer model, highly optimized stubs can be generated, usually in assembly language. LRPC is also designed to benefit from multiprocessors by caching domain contexts on idle processors.

4. Evaluation
The authors evaluate the LRPC mechanism by comparing it with the existing RPC mechanism using procedures with small, simple arguments: Null, Add, BigIn, and BigOut. LRPC is faster than RPC by a factor of three on these procedures even without the multiprocessor optimization. The authors also break down the overhead LRPC incurs in the stubs and the kernel transfer, which they attribute largely to TLB misses. A plot of throughput versus the number of processors is presented, showing a speedup of about 3.7. Although the authors discuss extending the technique to uncommon cases, they do not evaluate LRPC on inter-machine communication, nor on procedures with complex arguments; it would have been interesting to know how LRPC compares with RPC in those cases.

5. Confusion
How is the binding mechanism for LRPC handled by kernel? How does it work in inter-machine communication?

1. Summary
This paper describes the design and implementation of Lightweight Remote Procedure Call (LRPC), which is specifically optimized for communication between processes on the same machine, while mimicking the mechanisms and concepts of RPC.
2. Problem
While the classic RPC model might be good for cross-domain communications on different machines, it incurs high unnecessary overhead when the server happens to be on the same machine. The LRPC designed in this paper addresses this problem by optimizing away the overheads when the target server is known to be local.
3. Contributions
The optimization of LRPC over RPC in local communication cases can be divided into four aspects:
i. Simple control transfer. Given the binding object acquired at binding time, when the client calls a server procedure the kernel verifies the binding object and the procedure, sets up the argument and execution stacks, and directly transfers the client thread into the server's domain; the cost is close to that of a normal procedure call plus a context switch.
ii. Simple data transfer. This is done by pairwise-allocating argument stacks that contain only the procedure call parameters. The shared argument stack avoids the redundant data copying that RPC incurs when arguments are packaged into messages in the stubs and unpackaged again on the other side.
iii. Simple stubs. Since control transfer and data transfer are greatly simplified, and the stubs no longer have to package arguments into messages, LRPC's stub generator emits stub code directly at the assembly level rather than as source code, producing minimal, highly optimized stubs.
iv. Design for concurrency. LRPC further uses multiprocessor hardware to reduce call latency and increase throughput, caching server domain contexts on idle processors so that much of the context-switch cost during control transfer is avoided.
4. Evaluation
The paper compares LRPC's performance against normal RPC on the Taos system using four different cross-domain calls. The results show that LRPC improves performance by roughly a factor of three, and a closer inspection shows that its overhead comes close to the lower bound for making a safe cross-domain call. The paper also mentions several special cases which may affect the performance of LRPC.
I would suggest that future work include tests of the memory overhead of maintaining A-stacks and E-stacks, and of whether the kernel suffers when cross-domain calls are highly frequent, which is exactly the case this paper aims to optimize. While the design of the calling process seems neat, we do need to consider the pressure on the kernel to maintain the exported procedure lists and the various stacks when usage is intensive.
5. Confusion
The paper also mentions that a process tag may help to reduce the cost of domain crossing. How does that work exactly?

summary~
In this paper, the authors present the design of lightweight RPC, which aims to reduce the overhead of communication among protection domains on the same machine without sacrificing security.

problem~
Small-kernel OSes realize advantages such as modular structure, ease of design, and failure isolation by placing the different components of the system in different protection domains and relying on message passing for communication between those components.
But conventional RPC implementations fail to exploit the fact that most communication traffic within the OS is between domains on the same machine, rather than between domains on separate machines. As a result, local communication is handled like remote communication, which imposes significant overhead and hurts the performance of small-kernel OSes. To compensate, unrelated components get packed into the same domain at the price of security and modularity.

contributions~
One major contribution is the authors' analysis of the use and performance of existing RPC implementations on three contemporary OSes. One aspect of the analysis shows that the majority of RPC calls are actually local, which clearly identifies the problem: the current systems fail to recognize the common case and fail to make it fast.
Having identified the problem, they incorporate techniques from protected procedure calls into their design and remove the parts of the existing RPC design that are not needed for local calls. Together these changes improve performance by a factor of three over the existing system in their benchmarks. Notable features of the design include (1) the use of a shared A-stack to avoid copying parameters during parameter passing in local calls, (2) a new scheduling technique that exploits the growing number of processors available to the system to reduce context-switch overhead, (3) a simplified control transfer process with the help of capabilities, and (4) assembly-based stubs that improve performance; I feel that remote calls would also benefit from this last one.

evaluation~
Besides the studies of existing systems mentioned before, they also did a detailed evaluation on their SRC Firefly system. I feel their evaluation effectively validates their design goals. Specifically, the null cross-domain call shows the effectiveness of the design in minimizing control transfer and binding overhead, and the BigIn and BigInOut tests justify the design effort spent on parameter passing. Finally, they run tests on a multiprocessor setup to exercise the idle-processor optimization. Beyond this they go further, reasoning about the breakdown of the measurements and about how their implementation would perform in some of the uncommon cases.

confusion~
The A-stack seems closely tied to the calling convention (ABI) of Modula2+; how would the system handle the ABI of other programming languages?

Summary:
The paper describes the design and implementation of the LRPC communication mechanism, designed for communication between protection domains on the same machine. It discusses the advantages of splitting monolithic kernels into protection domains and using LRPC for communication, measures the performance of LRPC, and compares it to cross-domain RPC.

Problem:
Although RPC is an effective mechanism for communication across systems and between processes, the traditional RPC approach treated local cross-domain communication the same as remote communication, which led to very poor performance. RPC for inter-domain communication within the same machine can be optimized further by using protected procedure calls, some kernel support, shared memory, and so on. This avoids having to place all the subsystems in a single protection domain while still providing strong protection without additional communication overhead.

Contribution:
The paper talks about the advantages of having multiple protection domains instead of a single monolithic kernel, such as modularity, ease of design, failure isolation, and easy distribution, combining the large-grained protection model of RPC with the small-kernel approach. Most inter-domain communication happens within the same system and is simple.
It demonstrates the performance of LRPC on the Taos OS for cross-domain communication in comparison with traditional RPC and attributes its gains to the shared argument stack, simple control transfer, and optimized stub generation. It also shows that cross-machine calls are typically rare in the common case and that most data transfers across domains are small.
The paper breaks a cross-domain RPC call into its components, which include stub execution, message buffering, access validation, message transfer, scheduling, context switching, and dispatching overheads. Several other systems, such as DASH, Mach, and Taos, have optimized one or more of these.
It describes the design of LRPC, which involves 1) binding of clients to interfaces exported by servers, 2) making a call, which involves calling the stub, obtaining an A-stack, populating it, and trapping into the kernel, and 3) verifying the binding, executing in the server's domain, and returning via another trap. It also describes optimizations for multiprocessors that reduce context-switch overhead and avoid argument copying.
Overall I think the paper shows the effectiveness of these mechanisms for the small-kernel approach with multiple protection domains, but it does not make a really strong case against single monolithic kernels. Moreover, the paper does not claim LRPC to be especially efficient for cross-machine communication, which further narrows its use case.

Evaluation:
The authors evaluate the Taos operating system, the Unix+NFS system, and the V System to show the drawbacks of the RPC model. As mentioned in the contribution section, the paper notes the common case in which cross-machine communication is significantly rarer than cross-domain communication, but it does not give a clear indication of the exact workloads used and their breakdown. It also provides a breakdown of the components contributing to overhead in LRPC compared with RPC, which was pretty useful: Null calls were used to compute the individual overheads, showing that much of the time is spent in linkage management and binding. The paper explains the advantage of domain caching on multiprocessors theoretically, which makes sense, but does not provide evaluation results for such a critical optimization. The paper also mentions the advantage of not using locks for shared data and its impact on the scalability of the system.

Doubts:
Why do we need to copy arguments when control is anyway transferred to the server and both the server and client are running on the same machine, thus sharing memory?

Summary
The paper presents Lightweight Remote Procedure Call, a communication facility designed and optimized for cross-protection-domain communication on the same machine. LRPC combines the control transfer and communication model of capability systems with the programming semantics and large-grained protection model of RPC. LRPC improves performance by a factor of three over more traditional message-exchange approaches; in fact it comes very close to the lower bound on cost imposed by conventional hardware.
The problem
Monolithic kernels are insulated from user programs, but few protection boundaries exist within the OS itself, which makes it difficult to modify, debug, and validate. Capability systems consist of fine-grained objects sharing an address space but each with its own protection domain; these offer flexibility, modularity, and protection. It was observed that the most common case of communication is between domains on the same machine rather than across machines, and that the majority of communication is simple, involving few arguments and little data, since complex data is often hidden behind abstractions. Conventional RPC can handle both local and remote calls, but it fails to exploit the fact that cross-domain calls are significantly simpler than their cross-machine counterparts. This incurs high costs both in performance and in structure (packing logically separate entities into a single domain increases its size and complexity). Hence the authors were motivated to develop a lightweight remote procedure call.

Contributions

Simple control transfer: Conventional RPC involves multiple threads that must signal one another and switch context. In LRPC, the kernel changes the address space for the caller's thread and lets it continue to run in the server's domain.
Simple data transfer: LRPC removes the overhead of message passing by using pre-allocated shared argument stacks for communicating arguments and results. This also reduces argument-copying overhead, since the caller copies the data onto the A-stack just once and no further copies are required.
Simple stubs: Since caller and callee are on the same machine and architecture, no marshaling is required.
Design for concurrency: Idle processors in multiprocessor machines cache domain contexts to reduce context-switch overhead; counters are used to keep the highest-activity LRPC domains active on idle processors.

Evaluation
The performance evaluation tests four workloads on the C-VAX Firefly using LRPC and Taos RPC, comparing LRPC on a single processor, LRPC on a multiprocessor, and conventional Taos RPC. LRPC is about three times faster than conventional Taos RPC, and LRPC on a multiprocessor performs slightly better than on a single processor. I really like the measurements for the null cross-domain call along with a complete breakdown of the overheads; it helped in understanding the overhead of an LRPC call over the lower-bound cost. The best part of the evaluation is that the authors explain the reasons behind their improved performance throughout the paper. However, there are certain loopholes in the evaluation. Firstly, the Firefly does not support pairwise shared memory, which is what A-stacks call for. Secondly, I would have liked a comparative analysis of LRPC with other procedure-call methods, such as a port-based one. Thirdly, an analysis of the E-stack memory footprint would have been useful. Also, I feel the number of workloads considered is not sufficient; a variety of benchmarks should have been considered instead of a few similar ones with varying argument sizes.

Confusion
Why is tagged TLB slower than domain caching?
In a multiprocessor system, how can the idle thread of the server be exchanged into the client's domain? What if it wakes up there?

Summary
The paper points out the overheads associated with intra-machine RPC calls and proposes a new design called Lightweight Remote Procedure Call (LRPC). It elaborates on the implementation on the Firefly and evaluates it against RPC to show that performance increases about threefold on average.

Problems
Remote Procedure Calls (RPC) were not designed with inter-domain calls in mind and hence incur additional execution overhead and memory for this case. Inter-machine calls also form only a small fraction of calls in real systems, and for most calls the data structures are not complex but just small, simple values. There is also unnecessary copying between the kernel and the RPC stubs for data that never leaves the machine. Thus, designing a procedure-call mechanism around the most common use case makes more sense than the conventional RPC mechanism.

Contribution
(i) A-Stacks : The arguments stacks help in sharing the data between the client and server. They also help in reducing the number of copy operation required for execution from 4 in conventional RPC to 1 in LRPC. This saves execution time and is reflected in performance improvement. The design also selectively checks for data types if needed while executing parameter copying. The size of A stack is determined at compile time for fixed argument calls and limited to ethernet packet size for variable arguments.
(ii) The paper makes small changes to the control flow of a conventional RPC. For example, it has added an extra bit to the binding object to determine local or remote RPC. Binding is done by kernel as opposed to an external database (eg. GrapeVine) as in case of RPC. The binding is done at procedure call granularity and A stacks are assigned at this level too. The server domain now maintains an execution stack (E-Stack) as a local scratchpad for storing temporary results and for copying data pointed by a reference pointer in A-Stack. Also, in LRPC, the calls are not executed in separate server thread. Instead, the client’s protection domain is switched to server’s protection domain and executed with fewer threads.
(iii) A few changes to the stubs have been made in this work. A stub generator is written per architecture, and the stubs it produces are lightweight, highly optimized assembly code; porting involves changing the stub generator for different hardware. The authors also reuse idle processors in multiprocessor systems by caching domain contexts on them, avoiding the full latency of a context switch and improving performance marginally.

Evaluation
The authors have evaluated the existing systems to show that only a fraction of RPC calls are inter-machine and majority of them are inter-domain. And the trend is observable in various kernel types. Also, they have presented the RPC size distribution to show that most of the calls have simple and small byte size packets. These evaluations are critical to back their motivation.
The LRPC implementation has been evaluated by executing four different RPC routines and comparing the timing overhead with Taos RPC. LRPC sustains greater throughput than RPC, which reflects the support for greater concurrency in the design, as the authors claim. The work also tabulates the breakdown of execution time within LRPC. The evaluation is fairly complete, as the authors back their claims of faster execution, simpler control transfer, simpler stubs, and the ability to sustain concurrency. However, the evaluation does not discuss the additional memory required for maintaining E-stacks or the memory saved by sharing a single A-stack; these would have added to the case for the proposed design. It also overlooks the effort needed to port the system to different architectures.

Confusions
Parameter passing and type checking was not explained clearly.
Is the fraction of inter-domain and inter-machine calls still the same in present day systems too?

Summary
This paper provides the design and implementation of a new communication facility named Lightweight Remote Procedure Call (LRPC). The aim is to optimize the existing RPC system for communication between different domains on the same machine. The new design is then evaluated against the traditional one.

Problem
Although RPCs facilitate communication between processes on different machines well, they incur a heavy time cost when used between processes in different protection domains on the same machine. The authors address these problems in traditional RPC mechanisms. They also found that the majority of communication happens on the same machine rather than across machines, so the aim was to optimize the existing RPC mechanism by eliminating some of the complex steps that aren't necessary for same-machine communication, without sacrificing safety/protection.

Contributions
1. The paper shows that the majority of RPCs are made on the same machine and hence how important it is to optimize RPC performance for that case.
2. LRPC reduces context-switching and copying overheads by using privately mapped E-stacks in the server domain and A-stacks shared between the client and server domains; the client only needs to copy its arguments once, into the A-stack. This provides simple control transfer and data transfer.
3. The stubs are optimized with a stub generator that produces stub code in assembly language, boosting performance. This is possible because of the simple data and control transfer described above.
4. LRPC exploits multiprocessors and achieves good scaling by minimizing the use of shared data structures and by reducing context-switch latency through caching domain contexts on idle processors.
5. Since E-stacks are privately mapped in the server domain, protection is preserved.
6. A call to a remote machine can easily be distinguished from a local one by a single bit in the Binding Object, providing a simple way to fall back to RPC for cross-machine communication.

Evaluation
The performance of LRPC was evaluated using four tests on a C-VAX Firefly and compared with Taos RPC on the same system. On each of these tests, LRPC shows improved performance by cutting down latency. The authors also show the reduction in overhead of LRPC over RPC by providing a breakdown of the time spent in each call; the kernel takes most of the time in LRPC for binding validation and linkage management. The paper also shows the effectiveness of LRPC over RPC on multiprocessor systems, with improved scalability as the number of processors grows. The overall evaluation seemed good and provides a solid analysis of performance, but the authors do not really account for the implementation costs of LRPC, such as the portability of the stub generators (since they generate assembly, they need to be modified for different machines).

Confusion
Did other IPC mechanisms like pipes and sockets exist at this time?

Summary:
The paper describes the design and implementation of a lightweight RPC mechanism for calls that cross protection domains but stay on the same machine. The paper claims that this is the most common mode of RPC and that a lot of traditional RPC overheads can be avoided by exploiting the fact that protection domains within a machine can share memory.

Problem:
Traditional RPC includes a lot of message-copying, marshalling, and flow-control overhead because it assumes that the caller and the callee are on different machines. The paper demonstrates that this is usually not the case and proposes an LRPC design with minimal overhead.

Contributions:
The paper surveys the use of RPCs on various production machines running different operating systems (either directly or by citing the literature) and classifies calls by whether they cross machines over a network or stay on the same machine across protection domains. This was used to motivate a common-case solution to the generally high-overhead RPC.
When an RPC call is made to a server on the same machine, the arguments for the remote procedure are provided in A-stacks, which are allocated at bind time. These A-stacks are mapped into both the client and server protection domains, thereby obviating the need for argument copies. Additionally, LRPC doesn't require a receiver thread in the server protection domain to run the server code; the client thread is switched into the server protection domain to run it. LRPC thus doesn't require two concrete threads and avoids locking shared data on the call path, greatly reducing contention overheads.
LRPC stubs are lightweight, as marshalling of data is usually not required; the overhead is very low since a stub involves just a few move and trap instructions.
The LRPC stub generator can take hints from the interface about whether parameter copying matters; the stubs thus generated avoid copying unnecessary parameters into the A-stack.
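To give a flavor of how little a client stub has to do, here is an illustrative C sketch. It is only a model: the real stubs are generated in assembly, and kernel_trap() here merely simulates the kernel and server side of the call in user space.

    #include <stdio.h>
    #include <string.h>

    /* Pretend kernel: upcalls the "server", which reads the A-stack,
     * does the work, and leaves the result at the top of the same stack. */
    static void kernel_trap(int proc_id, char *astack) {
        (void)proc_id;
        int a, b;
        memcpy(&a, astack, sizeof a);
        memcpy(&b, astack + sizeof a, sizeof b);
        long r = (long)a + b;
        memcpy(astack, &r, sizeof r);
    }

    /* The client stub: a couple of "moves" into the A-stack, a trap,
     * and a read of the result off the same A-stack. */
    static long add_stub(char *astack, int a, int b) {
        memcpy(astack, &a, sizeof a);
        memcpy(astack + sizeof a, &b, sizeof b);
        kernel_trap(1, astack);
        long r;
        memcpy(&r, astack, sizeof r);
        return r;
    }

    int main(void) {
        char astack[64];
        printf("%ld\n", add_stub(astack, 2, 3));
        return 0;
    }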

Evaluation:
The paper does a best-case analysis of RPC within the same machine by including just the cost of the procedure calls, traps, and context switches, which are unavoidable. The paper also demonstrates that the overhead of LRPC stubs and of the kernel's data-structure management is roughly ten times lower than that of traditional RPC. Combined with the earlier observation that cross-machine RPCs are rare, this promises a large improvement over existing RPC designs. The paper also evaluates the concurrency of LRPC by showing the scalability in call throughput as the number of processors on the same machine grows. Many LRPC features and optimizations, such as the separate argument and execution stacks, are enabled by the operating system and the programming environment. In my opinion, LRPC is so specialized for the operating system it runs on that it may not deliver the same performance elsewhere, and the paper has not evaluated LRPC on other systems or hardware.
Confusion:
More details on type checking and parameter passing.
If most RPCs have small-sized arguments, how is the need for RPC justified?

Summary
The paper presents the implementation of Lightweight Remote Procedure Call, which is designed for communication between protection domains on the same machine. LRPC is faster than normal RPC thanks to its shared argument stacks, simple stubs, and use of multiprocessors through domain caching.
Problem
Most RPC calls are between domains on the same machine, and the parameters are small and of fixed size, yet RPC imposes a lot of overhead on them. The authors propose a lightweight Remote Procedure Call that reduces this overhead.
Contributions
The major contribution of this paper is optimizing RPC for local cross-domain calls by implementing the following features:
a] Faster binding. The client binds to an interface by making an import call via the kernel and waiting. The kernel interacts with the server's clerk, sets up the argument stacks (contiguous memory shared by both domains), and returns to the client a Binding Object, which is the client's key for accessing the server's interface.
b] Use of a single shared stack for transferring data between the client and server. When the client makes a call it only needs to copy the arguments into the A-stack, which the server reads directly. The caller's return address is recorded by the kernel in a linkage record, which allows a fast return to the client. This reduces the number of copies made compared to normal RPC.
c] Simple stubs. Server stubs are invoked directly from the kernel with no intermediate message examination; the client copies the data into the A-stack, the kernel verifies the binding, and no dispatch is needed by the server.
d] Utilization of multiprocessors: domain contexts are cached on idle CPUs, reducing context-switch overhead. When a call is made and an idle processor holding the server's context is found, the kernel exchanges the processors of the calling and idle threads, removing the context-switch cost.
e] LRPC also increases throughput by minimizing the use of shared data structures.
Evaluations
The authors have done a thorough evaluation, clearly justifying why LRPC performs better for local cross-domain calls. The number of kernel traps and context switches compared to RPC for a null call is reduced significantly, which is a great performance boost. The authors also show that the time taken by LRPC, with and without domain caching, is far better than normal RPC for varying argument sizes. The evaluation also shows that throughput increases once locks on shared data structures are removed. The overall performance improvement is about 3.7 times over normal RPC. Overall I feel the authors have properly evaluated the system, including explaining at the end how LRPC behaves in uncommon cases.
The only evaluation I find missing is the performance of LRPC when the argument size is huge and there is not enough contiguous memory available to allocate to the argument stack. I believe the performance improvement in this case would not be the same, since the A-stack data would then be spread across memory.
Confusions
I am not clear about how exactly domain termination is handled.

Summary
The paper discusses the design, implementation and evaluation of the Lightweight Remote Procedure Call (LRPC) mechanism for small-kernel operating systems that used the large-grained protection model of RPC and interacted mainly through local calls with small-sized arguments. The LRPC design is built on considerations of simple control transfer, simple data transfer (through a shared argument stack), simple optimized stubs, and a concurrent design (to facilitate multiprocessor execution), which allow it to provide a level of performance higher than traditional RPC for cross-domain (local) calls while retaining the security and transparency aspects of traditional RPC.

Problem
Small-kernel operating systems that used the large-grained protection and programming models of distributed computing environments to achieve various modularity benefits suffered from performance degradation because most communication was local and simple. Since local (cross-domain) communication was treated in the same vein as remote (cross-machine) communication by traditional RPC, it led to a loss of performance due to various RPC overheads (stub, message buffer, access validation, dispatch, etc.) and/or a loss of structure, as the high RPC overheads forced designers to coalesce logically separate subsystems into the same protection domain, trading off security for performance. Earlier attempts either did not consider the full scope of optimizations (DASH, Mach, Taos) or compromised security for higher performance (SRC RPC).

Contribution
LRPC solves the performance problem for cross-domain calls without compromising on safety. The execution model of LRPC is borrowed from protected procedure call, while its programming semantics and large-grained protection model have been taken from RPC. LRPC binding mechanism relies on the clerk module for exporting interfaces and for accepting incoming interface bind requests from clients. The Procedure Descriptor List (PDL) and Procedure Descriptors store server domain entry points, maximum number of calls allowed from a client and the size of its argument stack. For every client-server pair, the kernel allocates an argument stack (A-stack). The A-stack is used as a shared memory between the client and server through which call arguments and return values can be directly shared without intermediate kernel copying. The linkage record maintained by the kernel is used to identify caller's return address, while the Binding Object is the client's key to accessing the server's interface. LRPC calling occurs with the client providing the server with the populated A-stack and dispatching its own concrete thread of execution on the server. The server provides an execution-stack (E-stack) to the dispatched thread for use during its execution. LRPC also uses the domain caching optimization on multi-core architectures to keep frequently used server domains cached on idle processors to allow reduced context-switch overhead. The authors were also able to identify various argument copying overheads in traditional RPC which did contribute to added security, and consequently removed them in the LRPC implementation. Uncommon cases such as remote calls, domain terminations, and managing A-stack numbers and sizes have also been factored into the implementation.
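A possible C rendering of those binding-time structures is sketched below. Field names and the small main() are purely illustrative and not taken from the paper or its code; they only mirror what the review describes (PDL, PD, A-stack, linkage record, Binding Object with its remote bit).

    #include <stdio.h>
    #include <stddef.h>

    typedef struct {
        void  *entry_addr;        /* server procedure entry point           */
        int    max_simul_calls;   /* simultaneous calls permitted           */
        size_t astack_size;       /* size of the shared argument stack      */
    } ProcedureDescriptor;

    typedef struct {
        int                  nprocs;
        ProcedureDescriptor *pd;  /* one PD per procedure in the interface  */
    } ProcedureDescriptorList;

    typedef struct {
        char  *base;              /* mapped read/write in client and server */
        size_t size;
    } AStack;

    typedef struct {
        void *caller_return_addr; /* recorded by the kernel at call time    */
        void *caller_sp;
    } LinkageRecord;

    typedef struct {              /* the client's key to the interface      */
        ProcedureDescriptorList *pdl;
        AStack                  *astacks;   /* preallocated at bind time    */
        LinkageRecord           *linkages;
        int                      is_remote; /* one bit: fall back to RPC    */
    } BindingObject;

    int main(void) {
        ProcedureDescriptor pd = { .entry_addr = NULL,
                                   .max_simul_calls = 5,
                                   .astack_size = 256 };
        printf("PD: %d simultaneous calls, A-stack of %zu bytes\n",
               pd.max_simul_calls, pd.astack_size);
        return 0;
    }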

Evaluation
The authors first show the deficiencies in the RPC model by studying the V System, Taos (the Firefly operating system), and UNIX+NFS systems. Through cited research and their own experimentation, they demonstrate that most RPC-based communication is local (cross-domain calls dominate cross-machine calls) and simple (most call arguments are below 200 bytes). An issue with this evaluation is that it does not mention the workload characteristics used. The null cross-domain call overhead of different RPC systems was also measured and attributed to different overhead components.

LRPC performance was evaluated on the Firefly machine. The three systems used for analysis were LRPC with the domain caching optimization, LRPC without it, and plain RPC (Taos). Even without domain caching, LRPC call execution time outperformed Taos by a factor of three. A breakdown of call execution times for a cross-domain procedure call revealed lower call overhead for LRPC, with the remaining overhead mainly centered around binding validation and linkage management. Another analysis compared the scaling of the LRPC design (without domain caching) over RPC as the number of processors increased. LRPC achieved almost perfect scaling due to its avoidance of locking shared data during call and return, whereas RPC suffered from poor scaling due to high contention for shared-memory structures on the critical control path.

Overall, I believe that LRPC was sufficiently evaluated to demonstrate its benefits. However, the authors should have also evaluated LRPC bind times against those of RPC. The memory overhead introduced by preemptively allocating A-stacks could have also been studied to empirically guide their default A-stack list size (Their choice of creating 5 A-stacks at binding time seems ad-hoc).

Question / Confusion
1. The argument passing part of the LRPC design was not entirely clear.
2. In both LRPC and RPC (previous paper), security is defined as a major objective, but has not been objectively evaluated. As a system designer, how can I evaluate the robustness of my design objectively?

1. Summary
This paper presents LRPC, a communication facility designed and optimized for communication between protection domains on the same machine. LRPC (Lightweight RPC) combines the control transfer and communication model of capability systems with the programming semantics and large-grained protection of RPC. LRPC is a viable communication alternative for small kernels because it does not force a performance-safety trade-off; its implementation saw a threefold speedup compared to the traditional message-passing-based RPC.

2. Problem
In large monolithic kernels, there are no protection boundaries within the OS itself. This makes these systems hard to profile, debug, modify, validate and makes hardware vulnerable to complicated and large OS software.
Capabilities provide fine-grained protection and protected procedure calls, and a global address space makes parameter passing easy. However, such systems were hard to implement efficiently. RPC provides efficient and convenient communication with large-grained protection, and conventional small kernels adopted this model, with threads talking via RPC. However, a large majority of messages are cross-domain rather than cross-machine, and have simple parameters, which means that conventional RPC is not optimal for the common case. Local communication is treated as an instance of remote communication, leading to sub-optimal performance and structure. Hence, designers coalesced logically distinct cooperating subsystems into the same domain to maximize performance at the cost of design and safety.

3. Contributions
LRPC optimizes for the common case, cross-domain calls, while removing the need to trade-off performance and safety.
Execution model is similar to protected procedure calls. The client's concrete thread is switched into the server domain.
Programming semantics and protection granularity are borrowed from RPC.
Binding :- Procedure Descriptor Lists, linkage records, A-stacks, Binding Object.
Calling :- Client Thread runs on privately mapped E-stack after kernel trap.
Stub Generation :- in assembly language for the common case. Defaults to SRC RPC stub generation for complex/heavy params. The choice occurs at compile time.
Argument Copying :- Less copying, avoided where possible.
MultiProcessor Systems :-
LRPC avoids locking shared data during calls, which removes contention on multi-processors.
Context switching, which is a major factor in LRPC latency, is reduced by caching domain contexts on idle processors. When a call is made, the calling thread and the idle thread exchange processors. While this is not a new idea, it has been generalized by caching protection domains here.
Uncommon Cases :-
Domain Termination is a problem not present in traditional RPC, but is handled reasonably here.

4. Evaluation
Implemented on the Taos OS on the C-VAX Firefly multiprocessor workstation. LRPC vs Taos SRC RPC.
Justification of Design Choices :
Three OSes were examined to determine frequency/ratio of cross-domain and cross-machine calls. Under real-life workloads, 80% of calls on Taos had fixed size parameters known at compile time. Over 75% did not need marshalling, simple byte-copying sufficed.
Evaluation :
Three-fold speedup on synthetic workloads (mean / average of 100,000 calls).
Contention avoidance in Multiprocessor systems is proved by comparing the throughputs obtained in a single processor system and a multiprocessor system.
My Opinion :
Profiling was done on the Null LRPC, which gives us insight into the internal latencies.
Graphs and Tables are useful and well-presented.
Evaluation was done on synthetic workloads, would have been nice to have realistic workloads too.
Evaluation of uncommon cases is a good thing to have, to evaluate bad/worst case behaviour.
Portability to other systems difficult? Authors mention that LRPC would have been painful on systems other than the Firefly.

5. Confusion

A-stacks and E-stacks, and Argument copying avoidance.

1.Summary:
This paper is about the design and implementation of Lightweight Remote Procedure Call(LRPC) for communication between protection domains on the same machine for better performance and security. This was implemented on Taos operating system of the Firefly Multiprocessor workstation.

2.Problem:
The RPC system incurred high costs on small-kernel operating systems for inter-domain communication. This led system designers to merge weakly related subsystems into a single protection domain, which hurt safety. On average, far more inter-domain procedure calls were observed than inter-machine ones. The LRPC mechanism therefore provides communication between domains on the same machine with better performance, without compromising safety.

3.Contributions:
Following are some of the key contributions:
1) Since the communication is on the same machine, the client's thread itself executes the required procedure in the server's domain through a simple control transfer. This reduces the number of context switches required during the call compared to RPC.
2) A shared argument stack between the client and server domains is used for argument passing. This reduces the cost of parameter passing, locking, and message copying. An execution stack is created in the server domain to run the requested procedure.
3) Security is maintained by verifying the client's Binding Object before it can access the server interface.
4) Stubs are light-weight and require no intermediate message examination and dispatch.
5) The latency due to context switch between domains can be reduced in multiprocessor systems by caching domain contexts on idle processors.

4.Evaluations:
The authors have evaluated LRPC on a C-VAX Firefly machine with four tests of different argument sizes and compared it with Taos RPC in terms of execution time. They show that LRPC is three times faster than the conventional RPC. They also examine the LRPC overhead by running a single-processor null test, where the overhead is mainly due to binding validation and linkage management; despite optimizing data structures to reduce TLB pressure, 43 TLB misses still occur during a null call. They demonstrate the efficiency of domain caching on the multiprocessor C-VAX and how LRPC handles uncommon cases such as distinguishing cross-domain from cross-machine calls, A-stack management, and failure handling. The authors could additionally have measured the overhead of creating and associating A-stacks and E-stacks on the fly. Overall, a good evaluation of the system has been presented.

5.Confusion:
I did not completely understand how unnecessary argument/parameter copying is being avoided as an optimization in the system.

1. Summary
This paper presents the design of LRPC, a mechanism to speed up the procedure calls between protection domains on the same machine. The main idea is to provide fast inter-domain communication without compromising safety. The work combines the control transfer and communication model of capability systems with protection model of RPC.
2. Problem
The main observations made by the authors are - most of the communication happens between protection domains on the same machine (not across machines); majority of these procedure calls involve passing of small sized simple data structures. However, traditional generic RPC mechanisms treat these simple inter-domain intra-machine procedure calls same as the inter-machine calls, and thus incur the various overheads associated with RPC. On the other hand, if the communicating entities are made part of the same protection domain to avoid RPC overheads, it compromises on security, and makes the kernel big and complex which is against the philosophy of building small-sized operating systems. This is an undesirable trade-off that LRPC intends to eliminate.
3. Contributions
The authors made a very interesting observation regarding the majority of procedure calls being simple and intra-machine in nature. Subsequently they identify various overheads incurred in traditional RPC and then develop LRPC to optimize for the common case. They provide an LRPC runtime library to specialize the binding in the case when interfaces are exported (by the server) and imported (by the client) on the same machine. LRPC eliminates the usual argument marshalling and exploits shared memory to directly pass the small-sized arguments between client and server. This is achieved by bind-time allocation of the A-stack/linkage and run-time allocation of the E-stack. LRPC has stubs that are invoked directly by the kernel without the overheads observed in RPC, and it further makes the stubs lightweight by coding them in assembly. However, for inter-machine and complex procedure calls, traditional RPC stubs are used. Another innovative part of the design exploits idle cores in multiprocessors: instead of doing a context switch on the same core, LRPC can benefit from thread migration to an idle core if the target protection domain context is already cached there. They also have support to encourage such scenarios by tracking the usage of server domains and using idle cores to keep the most popular server domains' contexts warmed up. Finally, LRPC optimizes argument passing for short-to-medium argument sizes by reducing the overheads of copying the arguments multiple times - LRPC copies an argument only once compared to naive RPC that does the copy operation four times. All these optimizations respect the inter-domain design approach for security.
4. Evaluation
LRPC was evaluated on a C-VAX Firefly system running Taos. They ran four tests to compare the performance of LRPC/MP, LRPC, and Taos RPC for varying argument sizes, ranging from zero to 200 bytes. These tests clearly show a 3x improvement in LRPC compared to Taos RPC. Further, the authors have done a very good job of presenting the breakdown of LRPC overhead over the expected theoretical minimum, which highlights the severity of the penalty imposed by TLB misses due to context switches. The evaluation also highlights the low-contention locking in LRPC by showing how LRPC performance scales well with multiprocessors (tested on a 5-processor machine) while RPC saturates much sooner. As mentioned, a scalability study on bigger multiprocessors would be interesting.
There are some more studies the authors could have carried out for completeness. The benefit of the multiprocessor optimization that exploits domain caching has not been evaluated well, which would help in understanding the benefit of thread migration over a context switch. Similarly, a more exhaustive study of the impact of argument size on performance would help better understand the speedup provided by the argument-copying optimizations. Lastly, the authors claim that the minimal indirection in LRPC stubs to choose between LRPC and RPC does not impact the performance of inter-machine RPCs; a small set of experiments to back this claim would have been more convincing.
5. Confusion
Is there a global Grapevine-like database that tracks all the exported interfaces in addition to the machine-specific database accessed by the LRPC runtime library (clerk)? How does the binding happen across machines?
The paper talks about no restriction of synchronous termination of all threads and thread capturing towards the end. That part is not very clear.

1. Summary
The authors discuss an optimized remote procedure call designed for communication between protection domains on the same machine. They realized that existing RPC calls incur significant overhead that is not required in this use case. They thus implement a lightweight version of remote procedure calls to handle this case and switch to the standard RPC mechanism for calls to a remote server.

2. Problem
Remote Procedure Calls incur significant overhead in terms of copying arguments multiple times, context switching and scheduling and dispatching a thread in the receiver domain. Apart from this, the RPC interface is suited for a wide variety of message sizes leading to inefficiencies. The authors observed that most RPC calls are cross domain rather than cross machine. This fact can be exploited to optimize a lot of these overheads. A lightweight RPC call can also alleviate the current issues with adoption of Capability Systems supporting fine grained protection.

3. Contribution
The initial insight from the authors regarding the percentage of RPC calls that end up at a server on the same machine laid the foundation of this work. They also provide an interesting analysis of the size and complexity of the parameters being passed. Once the common case is established, the paper provides many optimizations to improve local RPC calls. The binding mechanism is modified to help identify whether an RPC call is to the same machine. For calls to the same machine, the kernel allocates A-stacks (argument stacks) for each procedure in the interface; each procedure gets a list of A-stacks so that multiple threads can call it concurrently. These stacks are mapped into both domains and are used to communicate arguments and return values. To call a procedure, the kernel locates an execution stack (E-stack) in the server domain and updates the current thread's stack pointer to run off this new E-stack. It reloads the processor's virtual memory registers with those of the server domain and starts executing from the address specified during binding. This reduces the scheduling and context-switching overheads and also handles the case where the server does not have a spare thread to dispatch. Because the server runs directly off the shared A-stack for its arguments, the copying overhead of the usual RPC call is also reduced. To optimize for the multicore scenario, where we can expect to have idle cores, the authors suggest caching popular server domains on these idle cores so that the TLB and cache warm-up overheads are reduced. The caller thread can then "migrate" to such a core in order to run the corresponding procedure with the required protection-domain switch, avoiding the usual context-switch overhead. Apart from the performance optimization of local RPC calls, this paper can also help push an OS design philosophy that favors multiple simple protection domains over a single monolithic kernel.

4. Evaluation
In the initial evaluation used to provide motivation, I would have preferred seeing some more description of the type of workloads running on the Taos system. To evaluate the final prototype, they run four tests on the C-VAX Firefly using LRPC and Taos RPC; these tests represent RPC calls with varying parameter sizes. The authors compare their design to the highly optimized DEC SRC RPC on Taos. The four test results show that LRPC is about 3x faster than Taos RPC. They provide a break-up of the latency components for LRPC and show the overheads are minimal. The optimizations provided by the authors are really good, and it would have been interesting to see the impact of each optimization individually. Some notion of the footprint of the E-stack would also help provide a better analysis. Finally, the authors evaluate the scalability of this method and the corresponding benefit of avoiding locks for shared data. Apart from this, the authors discuss some uncommon cases and their impact on the system: since cross-machine RPCs should only see the penalty of an added branch instruction, they do not expect a noticeable performance impact there, and they discuss scenarios where parameter sizes are larger than the A-stack size, as well as domain termination, and methods to handle them.

5. Questions
The paragraph discussing how argument copying can be avoided when the actual parameter value is not important to the server was difficult to understand.

1. Summary: This paper is about lightweight RPC, an implementation of RPC specialized to accelerate inter-domain calls on the local machine by making use of optimizations such as a shared argument stack. The authors compare the performance of their implementation with some of the most optimized RPC implementations and justify their motivation.
2. Problem: Contemporary operating systems provided only a shallow protection hierarchy, leading to the development of capability-based systems, which established a client-server model among the modules of the kernel (like Mach) and used message passing to interact. But the fine-grained protection provided by these capability systems demanded a radical change in application development and impeded their easy use. With the advent of RPC and its large-grained protection model, these messages could be replaced by RPC calls, making the system scalable too. But these RPC calls were unable to differentiate between cross-domain calls and cross-machine calls, doing unnecessary protection checks and marshalling in the former case. The authors call inter-domain communication the 'common case' and show that it constitutes 97% of all calls in some systems. They identify this problem and provide an optimized implementation of RPC for this common case, arguing that the latency added to actual inter-machine calls by their implementation is negligible.
3. Contribution: Some of the highlights of their work include:
i. simple data transfer with the use of A-stacks to decrease unnecessary copies and E-stacks for execution. They also add optimizations like A-stack sharing among procedures in the same interface, A-stack size calculation for fixed-size arguments, etc.
ii. simple control flow with the use of PDLs to facilitate easy management of stacks, and the use of Binding Objects and linkage records to avoid unnecessary access checks. This also includes a direct switch into the server's domain.
iii. optimizations on multiprocessors by using domain caching
Remarkably, all of this optimization is done while maintaining compatibility with generic RPC calls: in the first instruction of the client stub, the call can be identified as cross-machine, and a different execution path is followed. They also use an identical binding methodology to keep LRPC similar to RPC, so both the server and the client export/import interfaces as in RPC. Finally, all of this is done while maintaining protection (via careful use of the A-stack, E-stack, and explicit argument copying), unlike efforts like SRC RPC, which lose out on safety and protection. Thus, the authors achieved what they targeted!
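A tiny sketch of what that first check might look like follows. All names here are made up for illustration; the real stubs are generated assembly and the "remote" indicator is a single bit in the Binding Object, so the common local case pays only one extra branch.

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical binding object with the one-bit local/remote flag. */
    struct binding { bool remote; /* ... rest of the binding object ... */ };

    static int do_lrpc(struct binding *b) { (void)b; puts("local LRPC path"); return 0; }
    static int do_rpc (struct binding *b) { (void)b; puts("cross-machine RPC path"); return 0; }

    /* First thing the stub does: test the bit and branch. */
    static int call(struct binding *b) {
        return b->remote ? do_rpc(b) : do_lrpc(b);
    }

    int main(void) {
        struct binding local = { .remote = false };
        struct binding far   = { .remote = true  };
        call(&local);
        call(&far);
        return 0;
    }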
4. Evaluation: They do provide results showing that their implementation is faster, by measuring LRPC overhead relative to the minimum time for a null call. They also compare the performance of RPC and LRPC, with and without the domain caching optimization, on Taos. But these evaluations did not implement the pairwise shared memory required for A-stacks, which matters because it might affect the latency. I would also have liked to see the latency caused by E-stacks, since they are central to safety in LRPC, and the isolated effect of domain caching, since it is central to their multiprocessor optimization. The effect of the optimization on latency for cross-machine RPC calls is also missing.
5. Confusion: What are out-of-band memory segments? Does their system have a single virtual address space? If so, what does a context mean? Why would you need a tagged TLB for such a system? How is this slower than domain caching?

1. Summary To improve the efficiency of coarse-grained protection in microkernel architectures, the authors present a lightweight version of RPC which eliminates the message copying behavior of network-oriented systems in favor of a kernel-mediated function prologue and epilogue, combined with a capability-based security mechanism.

2. Problem Protection and isolation are challenging to implement correctly and efficiently in systems. By eliminating the need to transfer control between different components, monolithic systems are comparatively efficient. However, since any encapsulation between components occurs purely at the language level, modification and validation are potentially difficult. By isolating kernel components into separate protection domains which interact via well-defined interfaces, security and correctness become easier to maintain across code changes. RPC presents a straightforward means of communication between separate components, but incurs a significant overhead, as messages must be marshaled, queued, and passed across an interface designed for client-server interactions on a network. Once an RPC is received by the server, the RPC runtime must select and ready a thread for scheduling and execution, and the results must be copied back through the kernel buffer. This overhead significantly increases the latency of communication across protection domains.

3. Contribution The Lightweight RPC framework optimizes for the most common case of cross-domain procedure invocation - the invocation of a procedure on the same machine as the caller. To improve functionality, the authors clarify the central notion of an RPC: instead of treating it as a form of IPC, they reformulate it as a transfer of control. Rather than copying parameters in and out of an in-kernel buffer, and then scheduling and executing a procedure on a different server process, lightweight RPC functions more like a standard procedure invocation. More specifically, the kernel maintains a queue of argument and execution stacks for each procedure; when a client invokes an LRPC, the client stub dequeues an argument stack, pushes the procedure arguments onto it, and hands control to the kernel. The kernel then verifies the caller's capability, a "Binding Object", locates a free execution stack, and performs a standard function prologue using the A/E stacks within the server domain, context switching to the server state and then upcalling into the server. Upon return, the server stub places the return value on the A-stack and performs a standard function epilogue. Functionally, this is identical to a standard procedure invocation, except with an intermediate jump/return through the kernel for security, stack management, and context switching. To ensure low-latency calls, the kernel statically preallocates a number of A-stacks for each procedure as well as a number of E-stacks for each domain. Moreover, to reduce the cost of context switches and cold TLBs, the kernel attempts to cache frequently used domain contexts on idle CPUs.

4. Evaluation I liked that the authors begin the paper by empirically justifying the need for their work; they show that the vast majority of RPCs in real systems are within the same machine, that parameter sizes are generally small, and that existing RPC frameworks incur at least 700% overhead beyond that of a standard procedure invocation. I liked that they chose from a diverse set of systems and examined a "no-op" RPC call; it certainly appears that they were not constructing straw-man situations to justify their work. To evaluate their work, the authors implement the new version of RPC on the Taos operating system and compare it with the previous RPC mechanism across a variety of argument sizes, showing an impressive speedup, with the only remaining performance penalties being attributable to the context switch and the cost of a cold TLB in the server context. They also show that the idle-CPU-based caching provides a modest improvement in performance. However, I was disappointed that, despite discussing strategies for managing A- and E-stacks, they did not examine whether stack contention affected performance in a significant manner.

5. Confusion
I was not sure how the Binding Object was implemented, or how the kernel would prevent capability forgery attacks. Additionally, I don't understand the utility or effectiveness of caching server domains on idle CPUs - this functionality seems contingent on workloads that waste hardware.

1. Summary
The paper introduces LRPC, a communication facility specifically designed for calls between protection domains on the same machine. The authors show that LRPC performs well for the majority of cross-domain procedure calls by avoiding needless scheduling through a single thread of control, avoiding redundant copying by sharing an argument stack, limiting lock usage to reduce shared-memory contention, and simplifying access validation and stubs.
2. Problem
Monolithic kernels are difficult to modify, debug, or validate due to the lack of strong firewalls, which is worsened by a shallow protection hierarchy. Fine-grained protection systems are difficult to implement efficiently. Alternatively, in large-grained RPC-based small-kernel OSes, the high cost of RPC pushes logically separate entities to be bundled together in a single domain, increasing its size and complexity. Also, per the authors' observation, the majority of RPCs are cross-domain, and complex parameters are rarely passed in local operations, so the RPC machinery for marshalling complex arguments is not worth it in the common case.
3. Contributions
LRPC borrows from capability systems the idea of a Binding Object, which acts as the client's capability for accessing the server's interface through the kernel. LRPC uses a limited number of lightweight, privately mapped E-stacks, reclaimed by the kernel as needed, instead of paying the scheduling overhead of dispatching a separate server thread. With this, server stubs can be invoked directly by the kernel without intermediate message examination or dispatch, reducing the number of procedure calls and returns. The simple stub generator produces stubs directly in assembly language, improving performance; complicated data structures still require marshalling, which is handled by falling back to the existing stub machinery, with the choice made at compile time. LRPC exploits multiprocessors to increase throughput and decrease latency by minimizing the use of shared data structures on domain transfer, limiting locks, and reducing context switches by caching domains on idle processors. Parameter copying is avoided when correctness and safety are still ensured, for example when a parameter's exact value or immutability does not matter to the server. Transparency is preserved: the Binding Object has a bit indicating whether the call is to a remote server, selecting LRPC or RPC accordingly.
4. Evaluation
The authors evaluated performance by running tests on a C-VAX Firefly using LRPC and Taos RPC. The main problem with this configuration is that the Firefly does not support the pairwise shared memory primarily used for allocating A-stacks, so the safety gained with it is not evaluated. From the table, LRPC is faster than RPC on a uniprocessor and is further improved by the idle-processor optimization. For the null call, most of the overhead is imposed by the kernel for binding validation and linkage management, with the return path much cheaper. The system proved scalable, achieving close to maximum speedup on the multiprocessor by avoiding locking of shared data during domain transfer. Even with efficient data structures and control sequences, TLB misses still account for much of the remaining delay. The authors do not evaluate the effort needed to port the stub generator to a different machine architecture, given its use of assembly language.
5. Confusion
How does server exactly handle domain termination?

1. Summary
This paper talks about a lightweight RPC mechanism for optimizing the common case of communication among protection domains on a single machine. The idea is borrowed from protected procedure calls, which avoid conventional overheads such as copying arguments in and out of the kernel and dispatching threads on the server.

2. Problem
A micro-kernel or a distributed OS involves subsystems that need to communicate efficiently across protection or machine boundaries. RPCs were a popular approach employed for this due to their simple semantics. However, most micro-kernels and distributed systems try to service requests on a local machine making inter-machine crossings rare. Thus the overheads for a generic RPC are not justified in the common case.

3. Contributions
The authors perform a detailed study of the RPC mechanism on three different systems; they observe that (1) only a small fraction of procedure calls are transferred over the network (0.6 to 3%) and (2) most arguments to RPCs are small and involve fixed-size data structures.
The authors build upon the idea of a protected procedure call to effectively execute the calling thread in the protection domain of the server. Key components of their design are as follows:
1- For a binding request, the server replies back with a procedure descriptor list (PDL) to the kernel. The kernel then allocates a set of argument stacks (A-stacks) and a linkage record, and returns a capability, the Binding Object, to the requesting client domain.
2- During a call, the client stub copies the arguments to an A-stack and invokes the kernel with the Binding Object. After simple access checking, the kernel, being aware of the entry point of each procedure, does an upcall directly to the server stub. This differs from a conventional RPC, which would make an additional pass through the scheduler and a thread dispatcher.
3- The A-stacks are pairwise read/write shared between the client and server domains; this provides isolation among two or more clients and avoids the two copies to and from the kernel address space. Also, using a language that supports a separate argument stack, the server function can directly use the A-stack for arguments and the return value (a rough analogue of this sharing is sketched after this list).
4- LRPC uses optimized assembly based stubs. The client stubs use a fast path for simple data structures which is representative of the common case.
5- The authors also optimize for concurrency on multiprocessor systems. The A-stack list is the only shared resource and is protected by a simple lock. Idle processors are also used to cache the server domain with a pre-warmed TLB, which reduces the latency of switching to the server protection domain and avoids TLB misses.
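Below is a rough user-level analogue of a pairwise-shared A-stack, using POSIX shared memory purely for illustration: one region mapped read/write by exactly two parties, so bytes written by the "client" are visible to the "server" without any intermediate copy. (Note that the Firefly itself did not support true pairwise shared memory, as other parts of the evaluation discussion point out; this is only a sketch of the idea.)

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* One page shared read/write between the two "domains". */
        char *astack = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (astack == MAP_FAILED) return 1;

        strcpy(astack, "argument bytes");      /* "client" fills the A-stack once */

        pid_t pid = fork();
        if (pid == 0) {                        /* "server" reads it with no copy  */
            printf("server read: %s\n", astack);
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }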

4. Evaluation
The authors perform a live study to characterize RPC calls on their SRC Firefly computer system. They also calculate the theoretical minimum cross-domain call time, which gives a good idea of the overheads; it would have been enlightening to see how much of that overhead comes from complex stubs versus the server-side dispatcher, and this was one analysis worth seeing. The authors compare their design to the highly optimized DEC SRC RPC on Taos, which shows how significant their idea actually is. They provide a break-up of the latency components for LRPC and show the overheads are minimal. It would be interesting to see that LRPC does not hurt performance in the less common case. In addition, they show how their RPC system is not bottlenecked by concurrent accesses in a multiprocessor system. An analysis of how large the E-stack memory footprint is would also be interesting.

5. Confusion
What happens when a function runs out of E-stacks due to multiple concurrent requestors from different protection domains?

Summary
The paper describes LRPC (Lightweight RPC), a communication facility designed and optimized for communication between protection domains on the same machine. Using the control transfer and communication model of capability systems along with the programming semantics and large-grained protection model of RPC, LRPC presents a very efficient design for cross-domain calls while ensuring safety.
Problem
Small-kernel OSes borrowed the idea of large-grained protection from RPC to ensure safety: components of the OS are separated into disjoint domains that communicate with each other via messages. In RPC, local communication is treated as an instance of remote communication, and simple operations are handled the same way as complex ones. This results in very high overhead when the communication happens between protection domains within the same machine. Designers started trading safety for performance by coalescing multiple subsystems into a single protection domain. The authors performed measurements showing that cross-domain communication on the same machine dominates cross-machine communication.
Hence, there is a need to make the common case simple so that performance is not impacted, while at the same time safety is never compromised.
Contribution
The major contribution of this paper is to combine the capability system and RPC system to design an efficient method for cross-domain communication on the same machine. In order to achieve this - 4 techniques are described.
1. A simple control transfer adapted from the capability system, where the client thread executes the procedure call in the server's domain. This is done by use of A-stack, linkage record, Binding Object and E-stack.
2. A simple data transfer where a shared space (A-stack) is used to allow sharing between the client and server. This eliminates the redundant data copying that RPC used to do. Moreover, the pairwise allocation of A-stack in the client and server domain ensures safety and correctness.
3. A much simpler stub is needed now, as marshalling/unmarshalling is not required except in scenarios involving complex data. Also, the server stubs are directly invoked by the kernel on a control transfer, reducing the cost of the crossover.
4. LRPC is designed with multiprocessor systems in mind and hence avoids shared data that requires locking for synchronization. LRPC caches domains on idle processors and thus helps reduce context-switch overheads by exchanging the processors of the calling and idle threads.
Evaluation
In the earlier part of the paper, the authors use measurements to show that existing RPC systems have a problem and build the entire premise of LRPC on those measurements. Later, they discuss four tests to show that LRPC indeed performs better and that its design goals are met. The four test results show that LRPC is about 3x faster than Taos RPC. They also show the gain obtained by enabling the idle-processor optimization; much of the time the system will be loaded, but when an idle processor is present a higher performance gain can be expected.
Another interesting piece of evaluation is the cost break-up of a single serial null LRPC call. This breakup clearly shows how much time is spent in each operation. A comparison of this simple LRPC with a complex LRPC would have made this more interesting, as an apples-to-apples comparison for each operation could then be made easily.
The authors also show how avoiding shared data helps in ensuring higher throughput on a multiprocessor system. I don't understand why domain caching was disabled for this experiment.
The authors should have evaluated the overhead of deciding whether a call is a cross-domain call or a cross-machine call, to show that optimizing the common case does not impact the uncommon case. Another interesting evaluation would be porting a stub generator to a different machine; it would instill confidence in developers' minds to adopt LRPC.
Confusion
I had a hard time understanding the argument-copying section, where parameter copying can be avoided when the actual value of a parameter is unimportant to the server. What does that mean? If some value is not important, why is it even passed?

1. Summary

This paper presents details about the LRPC communication facility designed and optimised for communication between protection domains on the same machine.

2. Problem

The RPC communication facility has unnecessary overheads when used for communication between protection domains on the same machine. To increase performance, the authors developed the LRPC method based on the protected procedure call mechanism. They minimize the number of times data is copied for communication between the domains and also provide a simple control transfer and stub mechanism.

3. Contribution

The main contribution of this paper is the LRPC model, which achieves optimized communication across protection domains on the same machine. The main execution model of LRPC involves the client trapping into the kernel to call the server procedure; the kernel then validates the caller, creates the call linkage, and dispatches the client's thread into the server domain. An argument stack (A-stack) is shared between them. The LRPC runtime library registers the interface exported by the server. When the client makes an import call into the kernel, the server's waiting clerk replies with a list of procedure descriptors containing the entry addresses and A-stack sizes. When a call is made, the client calls the user stub, which puts the arguments into the A-stack, places the procedure id, A-stack address, and Binding Object into registers, and traps into the kernel. The kernel verifies the registers, puts the return address and stack pointer into the linkage record, switches the stack pointer to an execution stack (E-stack) in the server, and performs an upcall into the server stub. The server stub then calls the server procedure, which executes using the A-stack and E-stack. The stubs are generated automatically. A locking mechanism is used for LRPC on multiprocessors, and server domain contexts are cached on idle processors when a context switch happens; the kernel exchanges the client's processor with the idle one once a call is made, and exchanges them back on return.
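To make that sequence concrete, here is a rough user-level model of the call path; all names and scaffolding are hypothetical, since the real steps (trap, binding validation, linkage record, E-stack switch, upcall) happen inside the Taos kernel.

    #include <stdio.h>
    #include <string.h>

    typedef struct { char data[256]; } AStack;
    typedef struct { void *return_addr; void *saved_sp; } Linkage;

    /* Runs "in the server domain" off an E-stack; reads its argument and
     * writes its result straight on the shared A-stack. */
    static long server_procedure(AStack *as) {
        int x; memcpy(&x, as->data, sizeof x);
        long r = (long)x * 2;
        memcpy(as->data, &r, sizeof r);
        return 0;
    }

    /* Model of the kernel path: validate the binding, record the caller's
     * state in the linkage record, switch stacks, upcall into the server. */
    static long kernel_lrpc(int binding_ok, AStack *as, Linkage *lk) {
        if (!binding_ok) return -1;
        lk->return_addr = (void *)0;
        lk->saved_sp    = (void *)0;
        /* ...switch the stack pointer to a server E-stack, load server VM... */
        return server_procedure(as);
    }

    int main(void) {
        AStack as; Linkage lk;
        int arg = 21;
        memcpy(as.data, &arg, sizeof arg);   /* client stub fills the A-stack */
        kernel_lrpc(1, &as, &lk);            /* trap into the kernel          */
        long result; memcpy(&result, as.data, sizeof result);
        printf("result = %ld\n", result);
        return 0;
    }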

4. Evaluation

LRPC was implemented and integrated into Taos, the operating system for the DEC SRC Firefly multiprocessor workstation. LRPC, RPC, and restricted message passing were compared, with LRPC performing about three times better than the others. To determine the performance of the system, four tests running the null call and procedures with typical parameter sizes were run. LRPC costs on a single node and a detailed breakdown were presented, and the results were properly analysed; TLB misses were identified as the cause of much of the time used by the null call. The performance of LRPC in uncommon cases was also considered.

5. Confusion

1) How is the protected procedure call executed?

1. Summary
This paper introduces Lightweight Remote Procedure Calls (LRPC). This mechanism is optimized for RPC among protection domains on the same machine and is able to offer substantial performance gains over traditional RPC for this case while not compromising safety.
2. Problem
The existing RPC mechanisms were written for the distributed-system use case, relying on message passing. While this approach was logically sound for intra-machine communication, it was far from optimal, imposing unnecessary overheads and slowing down performance. This forced application developers to coalesce weakly related systems into the same protection domain to avoid the RPC cost. Various workarounds, such as passing parameters in registers, existed, but no solution tackled all the overheads of cross-domain RPC.
3. Contributions
The design reworks RPC to remove the extra overheads caused by redundant copying and context switching. When a client binds to a server it receives a Binding Object that identifies the client. The kernel pre-allocates argument stacks and linkage records to save time on the critical path. During a call, the client's own thread is used to execute the server's procedure; the A-stack and E-stack hold the arguments and the server's execution state, respectively. This saves extra context switches and copies by letting the server read its arguments from, and write its return value directly onto, the A-stack. Because the A-stack is not a complex message, simple stubs can be used on the critical path to reduce their impact. Various other optimizations further streamline the implementation, such as caching frequently used domains on idle processors and avoiding parameter copying whenever assurance against parameters being changed mid-call is not required. I believe the central contribution here is the use of the client's thread in the server's domain; this avoids copies of the parameters and extra context switches by using the same thread of execution, emulating a traditional procedure call as closely as possible while still preserving the safety constraints imposed by RPC.
4. Evaluation
The module is analysed piecemeal to determine the overheads introduced by LRPC over an ideal cross-domain procedure call. The authors show that most RPC calls involve small parameters and go on to evaluate the impact of parameter/return value size on LRPC latency.
The module also examines the local locking policy by stress-testing various RPC modules with concurrent calls emanating from a multiprocessor to see how they scale under this load. Throughout this testing they also evaluate the impact of caching frequently used domains, since this is not possible in the Taos RPC package. Even though the authors analyze various facets of the solution in depth, they do not evaluate long-running real-world benchmarks to check the cumulative impact. I am personally intrigued by the time/space cost of allocating A-stacks and linkage records at bind time and how it would affect total system performance when many clients and servers are being created and destroyed. Similarly, I am not sure that the optimization of caching domains on idle processors would have an impact when the number of domains far outnumbers the number of processors in a system. These questions do not need to be answered individually; a real-world benchmark would reveal any bottlenecks in the system.
5. Confusion
I am confused about the need to copy arguments at all, if the client thread executes the server code, isn’t it by default blocked, hence can’t the server use pointers to the client’s memory whenever the client is not multi-threaded?

1. Summary
This paper describes LRPC, which is a form of RPC that is designed to handle the “common-case” of small communications on the same machine. The authors present a description of their approach, some comparisons to RPC, and then evaluations for overhead and overall performance.

2. Problem
In 1990, the vast majority of RPCs were local calls, and generally those calls did not involve large or complex parameter passing. As a result, there is a lot of complexity and context-switching overhead that, for the most part, is not being put to good use. LRPC aims to simplify communication under RPC while still maintaining the programming semantics and protection model.

3. Contributions
The binding process is as follows: first, a server module exports an interface through a clerk. The clerk registers this interface via a procedure descriptor list, which in turn contains procedure descriptors. Each procedure descriptor has an entry address, the number of simultaneous calls allowed, and an "argument stack" (A-stack) onto which the client's arguments are copied. Clients can then use Binding Objects to access interfaces. When a call is made, the kernel verifies the argument stack and finds an execution stack in the server's domain. The server can then directly access the parameters (since the A-stack is mapped into its domain), execute the call, and push the return values back onto the A-stack. Control then returns out of the procedure, out of the client stub, and back to the calling domain.
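A minimal sketch of what these binding-time data structures might look like; the type and field names below are assumptions for exposition, not the paper's own definitions.

/* Illustrative C structures for the binding objects described above. */
#include <stddef.h>

typedef struct {                     /* one per procedure in the interface */
    void  (*entry)(void *a_stack);   /* entry address in the server's domain */
    size_t  a_stack_size;            /* fixed size, known when stubs are built */
    int     max_simultaneous_calls;  /* how many A-stacks to pre-allocate */
} proc_desc_t;

typedef struct {                     /* registered by the server's clerk */
    const char        *interface_name;
    const proc_desc_t *procs;
    int                nprocs;
} proc_desc_list_t;

typedef struct {                     /* returned to the client at import time */
    unsigned long           key;     /* unforgeable capability for this binding */
    const proc_desc_list_t *pdl;
} binding_object_t;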

This procedure simplifies the equivalent procedure in RPC. Only one formal procedure call and two returns are needed for an LRPC execution. Arguments in RPC are copied four times; in LRPC they are copied only once. For what the authors describe as the "common case" (i.e. small local calls), a lot of the unnecessary overhead of RPC is trimmed out. However, LRPC can also handle "uncommon cases". While it still passes cross-machine calls to RPC, it can handle cases such as unexpected domain termination (by capturing threads running in the terminated domain and destroying them in the kernel once they are released).

Other contributions:
1) LRPC is also designed to take advantage of multiprocessors. Context-switch overhead and latency are reduced by caching domain contexts on idle processors.
2) E-stacks are drawn out of a pool and reused by the server as needed. An E-stack and an A-stack stay associated, so subsequent calls can reuse the pairing without going back to the pool (a rough sketch follows after this list).
3) Like RPC, stubs are automatically generated, allowing for low maintenance.
4) Throughput is maximized by using as few shared data structures on the critical domain transfer path as possible.
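A small sketch, with assumed names, of the E-stack reuse from point 2: an A-stack remembers the E-stack it last used, and the "kernel" only touches the free pool (or allocates lazily) when that association is missing.

#include <stdlib.h>

typedef struct e_stack {
    struct e_stack *next_free;
    char            mem[4096];
} e_stack_t;

typedef struct {
    e_stack_t *assoc;                  /* E-stack used on the previous call */
    char       args[256];
} a_stack_t;

static e_stack_t *e_stack_pool;        /* free list of garbage-collected E-stacks */

static e_stack_t *e_stack_for_call(a_stack_t *as) {
    if (as->assoc)                     /* common case: reuse the old association */
        return as->assoc;
    e_stack_t *e = e_stack_pool;       /* otherwise take one from the pool... */
    if (e)
        e_stack_pool = e->next_free;
    else
        e = malloc(sizeof *e);         /* ...or allocate one lazily */
    as->assoc = e;                     /* leave the pairing for future calls */
    return e;
}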

4. Evaluation
The authors measure LRPC both on its own terms and against RPC. 100,000 cross-domain calls were made in a loop, and the average cost was calculated. With both single and multiple processors, they show through a time breakdown that LRPC has relatively low overhead and achieves much greater call throughput. LRPC can achieve 6,300 calls/sec on a single processor and 23,000 calls/sec on four processors, as opposed to 4,000 calls/sec on two processors for RPC.

In their evaluations, though, they test only against RPC and not other forms of IPC. Comparisons against other IPCs would have made a stronger case for adopting LRPC over, say, sockets/pipes/shared memory/queues/message passing/etc.

5. Confusion
Context-switch overhead is reduced by "exchanging the processors of the calling and idling threads". Doesn't this add its own overhead, though? What is the definition of a "domain" that the authors are using here? Is it the same as a Linux process? Also, why use RPC as the primary IPC within the same machine in the first place?

1. Summary
This paper describes LRPC designed to handle communication between protection domains on a single machine.

2. Problem
The concept of private address spaces protects user processes from one another, but in a monolithic kernel very few protection boundaries exist within the OS itself, which makes it hard to modify, debug and validate. The common case for communication is between protection domains on the same machine and is typically simple, with few arguments and little data. While RPC is robust enough to handle both local and remote calls, it has high overhead, so there was a need for a lightweight solution.

3. Contributions
The primary contributions of this paper are:
- Binding: Servers export interfaces through clerks, which register them with a name server and await import requests from clients.
- Simple Stubs: The stub generator produces simple stubs directly in assembly language (from Modula2+ interface descriptions), since much of the generality of RPC stubs is rarely needed.
- Simple Control Transfer: The kernel changes the address space for the caller's thread and lets it continue to run in the server's domain.
- Simple Data Transfer: Pre-allocated shared buffers are used to communicate arguments and results; the caller uses the A-stack to transfer data.
- Concurrency on Multiprocessors: Idle processors cache domain contexts to minimize context-switch overhead, and counters are used to track the domains with the highest LRPC activity.
Overall the authors have improved the design of RPC for the most common case (local communication).

4. Evaluation
The authors measure and compare the latency of LRPC with an optimised RPC on Taos and show a roughly 3x speedup on single-processor machines.
The authors also demonstrate that LRPC isn't bottlenecked by locking of shared data structures and that call throughput scales linearly with the number of processors.
The authors also explain how particular corner cases like domain termination, variable-size A-stacks and cross-machine calls are handled.

5. Confusion
- With advances in networking, how relevant is LRPC in today's systems?
- How language-dependent is the implementation?
- How are A-stack queue locks used?

1. Summary
Cross-domain communication on the same machine is much simpler and can be made more efficient than cross-machine communication, which goes through network protocols via RPC. LRPC uses a shared memory region, determined at binding time, for communication, to reduce the overhead of RPC.

2. Problem
Previous RPC systems have unnecessary overhead for communication on the same machine. Local communication does not need a complex network protocol, because a server and client on the same machine can communicate through a much simpler, lighter mechanism.

3. Contributions
The main contribution of this paper is to reduce the burden of RPC by introducing locally shared memory for communication when the communication happens on the same machine. Much of the rest of the paper exists to support this shared-memory design under a lightweight RPC.
To support LRPC, the paper introduces several ideas. LRPC uses the A-stack, which stores the arguments, as a shared memory space. At binding time, a clerk registers a procedure descriptor list from the server, which includes, for each procedure, a descriptor with its entry address and A-stack size, and the kernel provides a binding to the client. In the multiprocessor case, LRPC swaps the calling thread onto an idle processor that already holds the server's context for better performance.

4. Evaluation
Prior to measuring LRPC efficiency, the paper evaluates some existing systems. The results show that most calls are directed to the local system, not across machines, and most arguments have a small, fixed size and do not require marshaling. Inserting this evaluation before describing the main contribution is a reasonable way to justify the work.
The comparison between LRPC and SRC RPC is conducted with four test cases, which show that LRPC is about three times faster than SRC RPC. I think the paper could have included more experiments, such as performance under a heavy workload with many mixed LRPC and RPC calls.

5. Confusion
→ What are the abstract and concrete threads on p.43? It's hard to understand the part about scheduling and dispatch.
→ What is a parameter copy, and why does type safety motivate explicit argument copying in the stub (p.49)?
→ Why does a client thread switch to the server's processor?

1. Summary
The paper discusses the implementation of LRPC, an alternative mechanism for procedure calls across domains. The authors first show that most calls are on the same machine, and since the existing mechanism can't distinguish a local call from a remote one, it goes through the entire call path of regular RPC and ends up being a costly affair.

2. Problem
There was no existing mechanism that could avoid the regular RPC semantics, which are not required for a call on the same machine. Most of the existing work either provided optimization for a special case (System V's 32-bit messages or Karger's compiler-driven optimization) or compromised security for performance. Message passing was another option, but it too had high overhead and hence was not a feasible solution.

3. Contribution
The authors separated local calls (RPC calls that only cross domains) from RPC calls made to remote machines. This branching was qualitatively described as introducing an insignificant amount of overhead, since it obviated multiple RPC protocol layers. It minimized the number of context switches and traps (two each) as compared to RPC (which involved four context switches). A Binding Object was used to establish and maintain a connection between client and server, set up through the kernel. They introduced clerks, which exported the procedures in the form of a procedure descriptor list (PDL), analogous to the RPC implementation. Parameter passing was done using a new data structure, the A-stack, which provided a pairwise binding between caller and callee, protected by a lock. A linkage record stored the return address; this spared the kernel the extra security verification it would otherwise have had to perform on return. Since the lock was held per caller/callee pair, there was no need for a global lock and hence the system could scale better. If a processor was found to be idle, the authors extrapolated the idea of caching the context of blocked threads by caching domain states on these idle processors. Otherwise, the client thread executed the server code on another data structure, the E-stack. Further optimizations pre-allocated A-stacks and multiplexed them across procedure calls with arguments of the same size; the E-stack optimization involved lazy allocation to reduce the memory footprint, with the A-stack/E-stack binding left in place for future procedure calls. Thus, the authors provided a comprehensive solution that improved performance without resorting to mechanisms that optimize only a special case.
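As a rough illustration of the linkage-record idea described above, the sketch below shows the kernel-side bookkeeping that lets a return skip revalidation: the kernel records where the caller should resume before the upcall, and simply pops that record on the way back. All names here are hypothetical.

typedef struct {
    void *return_addr;                 /* where the client resumes after the call */
    void *client_sp;                   /* client's stack pointer at trap time */
} linkage_t;

typedef struct {
    linkage_t records[16];             /* one per outstanding (possibly nested) call */
    int       depth;
} thread_linkage_stack_t;

static void push_linkage(thread_linkage_stack_t *t, void *ra, void *sp) {
    t->records[t->depth].return_addr = ra;   /* written only by the kernel */
    t->records[t->depth].client_sp   = sp;
    t->depth++;
}

static linkage_t pop_linkage(thread_linkage_stack_t *t) {
    return t->records[--t->depth];     /* trusted on the return path */
}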

4. Evaluation
The authors provide a detailed performance comparison of LRPC, RPC and a restricted message-passing approach, and show that LRPC performs better by a factor of two to three. Experiments on multiprocessors showed that LRPC scaled up to 4.3 times, whereas traditional RPC running on Taos was bottlenecked at about a 2x speedup due to global locks. However, the authors did not discuss the performance overhead of lazy E-stack allocation or the binding time for the Binding Object.

5. Confusion
It is still not clear why the linkage record is needed, since the same information is stored in a thread's state when it switches context.

1. Summary
This paper presents a new communication facility called Lightweight Remote Procedure Call. It covers the main motivation behind LRPC, the design techniques that make it perform better than RPC, its implementation, and how its performance compares with an optimized RPC.

2. Problem
A lot of remote procedure calls happen across domains on the same system, and there is a lot of unnecessary overhead added by RPC when servicing such requests. The paper justifies introducing LRPC by showing that only a small fraction of RPCs are remote, most of the parameters passed are small, non-complex, and don't need marshaling, and many RPC overheads can be avoided for such calls. They also discuss some optimizations done in existing RPC systems, but there is still scope for improvement.

3. Contributions
The paper introduces a design of RPC called Lightweight Remote Procedure Call (LRPC). There are four major design techniques that make it perform better than regular or optimized RPCs. It has simple control transfer, whose execution model is borrowed from protected procedure calls while keeping the semantics of normal RPC; some of the important structures are A-stacks, linkage records and the Binding Object. It has simple data transfer: in RPC, data is copied four times (stub to RPC message, client message to kernel, kernel to server, and server message to server stack), whereas in LRPC data is copied only once, from the client stub's stack to the shared A-stack. It has simpler stubs, where the server stub can be directly invoked by the kernel on transfer; for complicated data types, however, its overhead is comparable to regular RPC. The final feature is its design for concurrency: performance increases significantly on multiprocessors because of the minimized use of shared data (avoiding locking). The two features I liked were caching domains on idle processors to reduce latency (fewer context switches) and pairwise allocation of A-stacks, which provides protection and reduces the number of copy operations; I think these were very important features. Overall the explanations in the paper were good.

4. Evaluation
The paper justifies the need for LRPC instead of RPC by showing that only a small fraction of RPCs are remote, most parameters passed are small, non-complex and don't need marshaling, and many RPC overheads can be avoided for such calls; they proved this by studying three different OSes. They justified LRPC's design for concurrency by showing that LRPC (fewer TLB misses, no global locks) outperformed RPC significantly in terms of scalability (throughput on a multiprocessor). They also showed that for different tests and operations the overhead of LRPC is much lower compared to an optimized RPC. One thing I would have liked to see is how its performance compares to other IPC methods like pipes and sockets: was the use of LRPC really justified?

5. Confusion
Is LRPC still prominent? Is it a viable replacement for pipes and sockets for IPC?
I didn't completely understand how the sharing of the A-stack works.

1. Summary
LRPC reduces the overhead for RPCs that happen on a single machine and pass simple arguments/results by borrowing the control transfer and communication model from capability systems.

2. Problem
Although designed to support communication between different machines and transferring complex data structures, most RPCs are just passing simple arguments and results across protection domains on the same machine. Traditional implementations are not optimized for this use, and the high overhead pushes developers to put unrelated subsystems into a single domain, sacrificing protection.

3. Contributions
LRPC eliminates most of the copying by using a pairwise-shared argument stack (A-stack). The A-stack is allocated by the kernel at binding time and managed by the user stub. The only copy is from the user stub's stack to the A-stack; the callee can read arguments from, and put results onto, the A-stack directly.
LRPC reduces the overhead of E-stack allocation by reusing A-stack/E-stack pairs. Allocated E-stacks are not disassociated immediately after a call returns; as A-stacks are picked in a LIFO manner, the association is likely to be reused. The kernel also garbage-collects E-stacks into an E-stack pool instead of releasing them completely.
In multiprocessor settings, domain contexts are cached on idle processors. When calling or returning to a different domain context, the kernel first checks whether any idle processor already holds the expected context. The thread can then run on that processor directly without incurring context-switch overhead.
LRPC falls back to traditional RPC transparently and efficiently when dealing with cross-machine or complex calls. The stub generator takes care of the complexity at compile time, and remoteness is checked by the first instruction of the stub, using a bit set in the binding object.
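A sketch of that fallback, with assumed names: the first thing the generated stub does is test a "remote" bit in the binding and branch to the full RPC path. The two path implementations are assumed to exist elsewhere.

#include <stdbool.h>

typedef struct {
    bool remote;               /* set at bind time if the server is off-machine */
    /* ... local LRPC binding state elided ... */
} binding_t;

int add_via_rpc(binding_t *b, int x, int y);    /* traditional marshalled path */
int add_via_lrpc(binding_t *b, int x, int y);   /* local A-stack path */

int add_stub(binding_t *b, int x, int y) {
    if (b->remote)             /* the single test paid by local calls */
        return add_via_rpc(b, x, y);
    return add_via_lrpc(b, x, y);
}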

4. Evaluation
The overhead of LRPC is compared with that of the original Taos RPC on four procedures with varied argument/result sizes. The results clearly show the factor-of-three improvement over Taos and a further slight improvement with the context-caching optimization. They also give an overhead breakdown for calling the Null procedure.
They also measured the scalability of their locking design. In contrast to Taos RPC, which levels off at two processors due to an unoptimized global lock, the number of calls made by LRPC scales linearly with the number of processors. Perhaps restricted by hardware availability, no tests on a larger number of processors were done.
The effectiveness of context caching is not evaluated well. As the measurements show, it saves 34 microseconds on Add but only 8 microseconds on BigInOut. Is it possible that 'LRPC/MP' becomes slower than 'LRPC' for larger arguments?
They did not provide any evidence that the optimizations taken by LRPC do not hurt performance for remote or complex calls. No evaluation on a real application is done.

5. Confusion
The optimization of LRPC relies on the distinction between A-stack and E-stack. But for arguments passed by reference, the recreated reference is placed on the private E-stack. Does this violate the calling convention?

Summary

LRPC combines the communication model of capability systems with the large grained protection model of traditional RPC. This leads to a factor of three performance improvement over other approaches. LRPC enables systems to preserve both isolation and safety among domains. The paper highlights the design and implementation of LRPC along with a detailed performance evaluation.

Problem

Due to the high performance cost of using RPC to communicate between protection domains, system designers were pooling weakly related subsystems into the same protection domain. This improves performance but compromises the safety of the system. Further, it has been experimentally shown that only 5.3% of all RPCs are cross-machine RPCs. This inspired the development of LRPC, which is tuned to improve the performance of local RPCs.


Contribution

There is a high degree of interaction between the server, the client and the kernel; this differentiates LRPC from RPC, which at a conceptual level has a similar binding mechanism. The clerk in the server's runtime library waits for an import request from the client and responds to the kernel when one arrives. The kernel in turn notifies the client and passes it a Binding Object, which is used in subsequent calls. This prevents clients from bypassing the binding phase.

A client makes an LRPC by calling into the client stub procedure, which initiates the domain transfer. The stub is responsible for managing the A-stacks allocated at bind time. The arguments are pushed onto the A-stack, which is directly mapped into the server's domain. The server returns by calling into its stub, which in turn traps to the kernel; verification at this stage is implicit. If arguments are passed by reference, the client stub copies the referent onto the A-stack, and the server stub recreates a reference to that copy in the server's domain, so no further explicit copy of the data is needed.
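An illustrative sketch, with invented names, of that by-reference handling: the client stub copies the referent into the shared A-stack, and the server stub hands the server a pointer to that copy.

#include <string.h>

typedef struct { char data[256]; } a_stack_t;
typedef struct { int len; char text[64]; } record_t;

/* Client stub: the caller passed &rec, but the bytes themselves go on the A-stack. */
static void client_stub_put(a_stack_t *as, const record_t *rec) {
    memcpy(as->data, rec, sizeof *rec);
}

/* Server stub: recreate the reference, now pointing into the shared A-stack. */
static void server_stub_call(a_stack_t *as, void (*proc)(record_t *)) {
    record_t *ref = (record_t *)as->data;    /* recreated reference */
    proc(ref);
}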

The LRPC stub generator produces run-time stubs in assembly language, which raises the concern of stub generator portability. On a multiprocessor system LRPC increases throughput by reducing the use of shared data structures on the critical domain-transfer path. Latency is reduced by caching domain contexts on idle processors, which reduces the transfer time that occurs during a context switch.

By using pairwise allocation of A-stacks, LRPC ensures that arguments are copied only as many times as necessary, reducing the number of copy operations from four to one. In order to ensure type safety, explicit argument copying is done in the stubs instead of wholesale message copying in the kernel.

Evaluation

Four types of tests were run on the C-VAX Firefly to evaluate LRPC/MP, which uses the idle-processor optimization, LRPC, which executes the domain switch on a single processor, and Taos RPC. 25% of the time used by the Null RPC is due to TLB misses; this is explained by the fact that a context switch on a C-VAX requires invalidation of the TLB. The performance benefit obtained by not locking shared data and removing contention on shared-memory multiprocessors is clearly demonstrated in comparison with SRC RPC performance levels. A clear breakdown of the time taken for each step of an LRPC call has been provided, along with a detailed qualitative explanation of uncommon cases like cross-machine calls, variable-size A-stacks, and domain termination. They could have evaluated and experimentally shown that the common-case approach taken by LRPC is flexible enough to accommodate these uncommon cases.

Confusion
Did not completely understand how A-stack queue locks are used. Also found that the argument copying explanation was not very clear.

1. Summary
The paper describes a new mechanism, LRPC, to let processes communicate on the same machine across different protection domains.
2. Problem
Before this paper, if processes wanted to communicate across different protection domains, they would use the RPC mechanism that was also used for cross-machine communication. As a result, there was high overhead from packaging and the transport layer just to go from one protection domain to another on the same machine.
3. Contributions
The authors first measure the performance of cross-domain RPC to see where the overheads of the mechanism lie. The overheads come from various places such as message buffer overhead, message transfer, context switching, etc. Using this knowledge, the authors propose a new mechanism, LRPC, that is optimized specifically for cross-domain rather than cross-machine communication. Similar to RPC, there is a binding stage where the server domain exports an interface and clients communicate with the kernel to get a binding to that interface.
At binding time, A-stacks are allocated; these are used to transfer arguments and return values between domains. The kernel also allocates a linkage record for each A-stack, which records the caller's return address. In addition, a Binding Object is created and given to the caller to present on every call, in order to prove that it has permission to access an interface of a server.
One drawback of this system is that the server gets its arguments from the A-stack but needs to use an E-stack to execute. As a result, this is only possible on systems whose calling convention allows a separate argument pointer when functions are called. I could be wrong, but I do not think many modern languages support this, which limits the usefulness of this copy-avoiding optimization.
The authors also add an optimization for multiprocessors where the client and server run on different processors, so a call does not need to flush the TLB and hinder performance. However, the authors claim that they can keep a server domain cached on a processor because processor idling is common; how common is processor idling today on big servers?
4. Evaluation
To evaluate their system, the authors compare an RPC system, the LRPC system, and LRPC with the idle-processor optimization. They show that idle-processor LRPC does best, then normal LRPC, and then RPC, which is roughly three times worse than normal LRPC. The experiments they use, such as sending big arguments, are valid ways of measuring performance. However, the idle-processor optimization is much trickier, because it assumes there will always be an idle processor to run a server on; they do not consider an overloaded system with possibly many servers running.
They also claim that using A-stacks requires only one copy rather than the four of normal RPC, but that is only true when the A-stack optimization is possible. The paper is missing a performance evaluation for when this optimization doesn't apply, for languages such as C where a separate argument pointer is not a feature.
5. Confusion
How exactly does the kernel detect forged Binding Objects?

1. Summary
RPC optimizes inter-process communication over a network, offering a transparent, simple and relatively fast mechanism for remote IPC. Such a model can also be extended to local IPC in small capability-based kernels (like Mach): cross-domain communication on the same machine, as an alternative to traditional pipes and sockets. In this work, the authors figure out ways to optimize RPC for this common case and thus make it faster and more efficient.
2. Problem
Experiments show that most message passing in small kernels involves a small number of parameters and simple data structures. Since this design targets the common case of local IPC, there is no need to incur cross-domain RPC overheads from complex stubs and runtime engines, multi-level dispatch, intermediate data copying, and the associated scheduling.
3. Contributions
Some notable ideas this paper puts forward are-
Optimizations at compile and bind time: since the size and complexity of the arguments are known, space is allocated in advance and the necessary checks are folded into the generated code, allowing simpler stubs.
Simple control transfer, by having the kernel do a direct context switch from the client thread into the server's domain, i.e., it changes the address space for the caller's thread to point to the server domain (and associates the call with an unassociated E-stack).
Simple data transfer, by statically pre-allocating memory in the form of shared argument stacks (A-stacks) and avoiding locking overheads through pairwise allocation.
Idle processors on multiprocessor machines cache domain contexts to reduce context-switch overhead; counters are used to keep the highest-activity LRPC domains cached on idle processors (sketched below).
All this is done in protected-procedure-call fashion, where calls go through the kernel, which validates the caller, creates the linkage, and context switches to the server domain. It also provides a large-grained protection model through interface binding, where the server allows the client to access only the procedures defined by the interface. Thus, the design of LRPC is relevant and sound for the desired context.
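A rough sketch of how such an idle-processor caching policy could look; the structures, counters and scan below are assumptions for illustration, not the paper's actual scheduler code.

#define NDOMAINS 8
#define NCPUS    4

typedef struct { long lrpc_activity; } domain_t;        /* bumped on every call in */
typedef struct { int idle; int cached_domain; } cpu_t;  /* cached_domain = -1 if none */

static domain_t domains[NDOMAINS];
static cpu_t    cpus[NCPUS];

/* Park an idle CPU in the context of the busiest domain, off the critical path. */
static void cache_domain_on_idle_cpu(cpu_t *cpu) {
    int best = 0;
    for (int d = 1; d < NDOMAINS; d++)
        if (domains[d].lrpc_activity > domains[best].lrpc_activity)
            best = d;
    cpu->cached_domain = best;
}

/* On a call into `target`, prefer swapping processors over switching spaces. */
static int find_cached_cpu(int target) {
    for (int c = 0; c < NCPUS; c++)
        if (cpus[c].idle && cpus[c].cached_domain == target)
            return c;              /* caller and idle CPU exchange processors */
    return -1;                     /* fall back to an ordinary context switch */
}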
4. Evaluation
They analyze the performance of LRPC against RPC on the Firefly (running Taos). The paper gives a good breakdown of LRPC's performance and of the overheads due to cross-domain calls, binding and linkage. The copying analysis shows how it avoids intermediate kernel copies, unlike traditional message passing and DASH. They show that it is about 3x faster than an optimized RPC thanks to direct context switching, fewer TLB misses, and no locking overheads. They also evaluate the design on a multiprocessor, which shows a significant improvement over the optimized RPC. They go further to explain how LRPC handles the uncommon cases: switching to RPC on cross-machine calls, the need for larger A-stacks, and domain termination.
However, there are a couple of gaps in the evaluation: they discuss domain caching on multiprocessors, but it was disabled for some of the tests they report. They state that shared-memory IPC would have synchronization overheads, but do not compare the overall impact of those overheads against their own on performance, nor do they compare with other known IPC mechanisms.
5. Comments/Confusion
If A-stacks are shared between procedures in the same interface, how are the multiple linkages mapped? Also, the design of LRPC on a multiprocessor, with a single lock on the shared A-stack queue, seems a bit vague.
