
Lightweight Remote Procedure Call.

Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. Lightweight Remote Procedure Call. ACM Transactions on Computer Systems 8(1), February 1990, pp. 37-55.

Review due Tuesday 3/3.

Comments

1. summary
This paper presents the design of LRPC, a communication facility optimized for the common case: communication between protection domains on a single machine. LRPC achieves performance for cross-domain communication significantly better than conventional RPC systems (roughly 3x faster), while still retaining their qualities of safety and transparency. The techniques that contribute to LRPC's performance are: simple control transfer, simple data transfer, simple stubs, and a concurrent design.
2. Problem
Most communication traffic in an OS is cross-domain rather than cross-machine, and simple rather than complex. Although conventional RPC can serve both kinds of calls, it fails to exploit the fact that cross-domain procedure calls can be much less complex than their cross-machine counterparts; instead, RPC treats a cross-domain call as an instance of a cross-machine one. Because of this overhead, cross-domain calls suffer a loss in performance and a deficiency in structure. This constraint led system designers to place weakly related subsystems in the same domain for performance gains, at the cost of safety. The LRPC facility addresses this issue.
3. Contributions
Binding -
- A clerk registers the interface with a name server and awaits import requests from clients.
- The PDL contains one PD {entry address, number of simultaneous calls, size of A-stack} for each procedure in the interface (see the sketch below).
- For each PD, the kernel pairwise allocates a number of A-stacks equal to the permitted number of simultaneous calls.
- The kernel also allocates a linkage record (which holds the return address) for each A-stack.
- After binding, the kernel returns a Binding Object to the client.
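
A minimal C sketch of the binding-time structures named above (PD, PDL, linkage record, Binding Object). All field names and layouts are illustrative assumptions, not the paper's actual definitions:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        void  *entry;        /* server procedure entry address          */
        int    max_calls;    /* simultaneous calls => # of A-stacks     */
        size_t astack_size;  /* bytes needed for arguments and results  */
    } proc_desc;             /* one PD per procedure in the interface   */

    typedef struct {
        int        nprocs;
        proc_desc *pd;       /* the PDL the clerk hands to the kernel   */
    } proc_desc_list;

    typedef struct {
        void *return_addr;   /* caller's return address (kernel-only)   */
        void *caller_sp;     /* caller's stack pointer                  */
    } linkage_record;        /* one linkage record per A-stack          */

    typedef struct {
        uint64_t key;        /* checked by the kernel on every trap     */
    } binding_object;        /* returned to the client after binding    */
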
Calling -
- The client makes an LRPC by calling into its stub procedure, which is responsible for initiating the domain transfer.
- The stub puts the address of the A-stack, the Binding Object, and a procedure identifier into registers, and traps to the kernel.
- The kernel verifies the A-stack and locates the corresponding linkage record,
- records the caller's return address and current stack pointer in the linkage record,
- updates the thread's stack pointer to the E-stack,
- reloads the processor's virtual memory registers with the server domain, and calls the server stub (see the sketch below).
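
A minimal sketch of the kernel's side of that trap, assuming hypothetical helpers (validate_binding, linkage_for, estack_for, and so on), since the paper does not show kernel code:

    typedef struct binding binding;   /* kernel's view of a Binding Object */
    typedef struct { void *ret_pc, *caller_sp; } linkage;

    extern int      validate_binding(binding *bo, int proc, void *astack);
    extern linkage *linkage_for(void *astack);
    extern void    *estack_for(binding *bo, void *astack);
    extern void    *trap_return_pc(void);
    extern void    *trap_caller_sp(void);
    extern void     load_server_vm(binding *bo);  /* reload VM registers */
    extern void     run_server_stub(binding *bo, int proc, void *estack);

    void lrpc_trap(binding *bo, int proc, void *astack)
    {
        if (!validate_binding(bo, proc, astack))  /* forged object/A-stack? */
            return;                               /* reject the call        */

        linkage *lk   = linkage_for(astack);      /* one per A-stack          */
        lk->ret_pc    = trap_return_pc();         /* where to resume client   */
        lk->caller_sp = trap_caller_sp();

        void *es = estack_for(bo, astack);        /* E-stack in server domain */
        load_server_vm(bo);                       /* same thread, new domain  */
        run_server_stub(bo, proc, es);            /* branch to server stub    */
    }
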
Multiprocessor Optimization -
- LRPC caches domain contexts on idle processors.
- When a call is made, the kernel checks for a processor idling in the context of the server domain.
- If one is found, it moves the calling thread onto that processor, whose loaded context is already that of the server.
Argument Copying -
- LRPC copies arguments once, where message-based RPC copies them up to four times:
- only from the stack of the client stub into the shared A-stack, from which the server procedure can use them directly (see the sketch below).
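
A sketch of that single copy in the client stub, using an assumed two-ints-in, one-int-out procedure; message-based RPC would copy the same data four times (stub to message, message to kernel, kernel to server, server message to stack):

    extern void kernel_trap(void *binding, int proc, int *astack);

    int add_stub(void *binding, int a, int b)
    {
        static int astack[3];  /* stands in for a pairwise-shared A-stack  */
        astack[0] = a;         /* copy 1 of 1: straight into shared memory */
        astack[1] = b;
        kernel_trap(binding, /*proc=*/0, astack);
        return astack[2];      /* result written in place by the server    */
    }
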
4. Evaluation
A good amount of evaluation is done in this paper to validate the claims. The authors start by measuring the frequency of cross-machine calls, the size and complexity of parameters, and the performance of RPC for cross-domain communication. These experiments show that most calls are cross-domain, that parameters are small and simple, and that conventional RPC imposes large overhead on the cross-domain case.

5. Confusion
In UNIX, all local system services are accessed using traps. I can't think of a case where RPC is used within a single machine.

1. Summary
The concept of lightweight remote procedure call (LRPC) is proposed, which addresses RPC between protection domains on the same machine. LRPC combines the control transfer of protected procedure calls with the communication model of RPC, and achieves roughly a three-fold efficiency improvement over conventional RPC on the same machine while preserving isolation and safety among domains.

2. Problem
Due to the high cost of communication between fine-grained protection domains, system designers tend to coalesce subsystems into large-grained protection domains, trading safety for performance. This paper addresses the issue by proposing LRPC, based on the following three observations: 1) over 95% of RPCs are between protection domains on a single machine rather than truly remote calls; 2) parameters are often fixed in size and few in number, so argument lists can be pre-allocated; 3) stub overhead, message buffering, access validation, message transfer, scheduling, and dispatch impose significant overhead that can be eliminated or reduced for cross-domain RPC.

3. Contributions
LRPC borrows its execution model from the protected procedure call (a kernel trap) and its programming semantics and protection model from RPC (binding, stubs).
Server processes export interfaces via a clerk, which registers them with a name server. Client processes import an interface and bind to a server through a kernel call. A procedure descriptor list (PDL) is submitted by the server's clerk during binding; it includes one procedure descriptor (PD) per procedure, with an entry address, the number of simultaneous calls permitted, and the argument stack (A-stack) size. The A-stack is memory shared between client and server for passing arguments and return values. The kernel associates a linkage record with each A-stack, which saves the return address. Upon successful binding the client obtains a Binding Object, used to access the server, along with the A-stack list.
A call initiates from the client calling its stub, which copies arguments to the A-stack before trapping to the kernel. The kernel does verification, then finds the correct linkage record and an execution stack (E-stack) in the server's domain to associate with the A-stack. The server can access the A-stack directly (without extra copying) and puts the return value in it before returning through its own stub.
The stubs blur the three-layer communication protocol into a single layer. They are emitted in assembly language by the stub generator to guarantee high efficiency.
In a multiprocessor environment, the domain-caching technique keeps server domain contexts loaded on idle processors. On a call, the kernel looks for an idle processor on which the server domain is already loaded and exchanges the caller's processor with that idle one.
The argument copy is done only once in LRPC because client and server share the A-stack memory. References are handled by copying the referenced data into the A-stack and recreating the reference on the server side. Type checking is enforced as arguments are copied; a client that later changes the A-stack contents can make the message meaningless, but cannot compromise the server.
Remote calls are recognized at the stub level, and conventional RPC is used for truly remote calls (see the sketch below). When variable-length arguments prevent fixing the A-stack size, it defaults to the size of an Ethernet packet, or an out-of-band memory segment is used. Domain termination is handled by the kernel, which revokes Binding Objects and invalidates active linkage records, and may not complete synchronously.
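
A sketch of that first-instruction check; the is_remote flag and function names are assumptions, the idea being that the local/remote decision is made once, at bind time:

    struct call_args;
    struct bind_obj { int is_remote; };  /* flag set once, at bind time */

    extern int full_rpc(struct bind_obj *bo, struct call_args *c);  /* network path */
    extern int lrpc_call(struct bind_obj *bo, struct call_args *c); /* A-stack path */

    int dispatch(struct bind_obj *bo, struct call_args *c)
    {
        if (bo->is_remote)
            return full_rpc(bo, c);  /* truly remote: marshal and send */
        return lrpc_call(bo, c);     /* common case: cross-domain LRPC */
    }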

4. Evaluation
The paper uses four benchmarks and compares LRPC on a single processor, LRPC on a multiprocessor, and conventional Taos RPC. LRPC is about three times faster than the conventional Taos RPC, and LRPC on multiprocessors performs slightly better than on a single processor. On a single-processor machine, LRPC adds only about 50% overhead over the theoretical lower bound (procedure call plus kernel trap), while TLB misses on the context switch take around 25% of the total time.
It also shows that per-A-stack locking is favored over the global locking of conventional RPC: as the number of processors increases, LRPC's throughput grows nearly linearly, while RPC hits a bottleneck beyond two processors.
I don't think the evaluation is sufficient for this paper. There should be more benchmarks of various kinds, instead of only four with similar argument lists.

5. Confusion
In section 3.2, it says the A-stack/E-stack association is delayed until runtime, which calls for searching for, allocating, and reclaiming E-stacks when an LRPC is made. Wouldn't this affect performance by introducing new overhead into communication?

1. summary
The paper presents an analysis of the popular RPC communication mechanism used in distributed computing systems, which motivates LRPC - lightweight remote procedure call. The LRPC design and implementation are discussed in this paper, and a series of performance evaluations is conducted.

2. Problem
The RPC mechanism has been widely successful in large-grained protection systems. However, only a small fraction of RPCs are truly remote, and the majority of interface procedures move only small amounts of data. The performance of cross-domain RPC is unsatisfying, with large overhead.

3. Contributions
This paper introduced LRPC, and discussed those aspects in detail:


  • Binding - A server exports interfaces through a clerk in its domain, which registers them with a name server. On the other hand, clients bind to an interface by making an import call via the kernel, which is serviced by the server's clerk.

  • Calling - Each procedure is represented by a stub in the client and server domains. To make an LRPC, the client calls its stub, which initiates the domain transfer and manages the A-stacks allocated at bind time.

  • Stub Generation - LRPC stubs are automatically generated at run time and blur the boundaries between communication protocol layers to achieve efficiency.

  • LRPC on a multiprocessor - LRPC is adapted to multiprocessors: throughput is increased by minimizing shared data structures, and caching domain contexts on idle processors reduces LRPC context-switch overhead.

  • Argument copying - Pairwise allocation of A-stacks enables LRPC to copy parameters and return values once, instead of four times as in traditional cross-domain RPC.

4. Confusion
Is LRPC used anywhere today in distributed systems?

1. Summary
This paper presents a lightweight communication facility based on Remote Procedure Calls (RPC) designed to handle communication between protection domains on a single machine.

2. Problem
Some operating systems have large monolithic kernels insulated from user programs by simple hardware boundaries, with few protection boundaries within the OS itself, which makes these systems difficult to modify, debug, and validate. Capability systems consist of fine-grained objects sharing an address space, each existing in its own protection domain. These offer flexibility, modularity, and protection, but are difficult to implement efficiently. The common case for communication is between domains on the same machine, as opposed to across machines. Most communication is simple, involving a few arguments and little data, since complex data is often hidden behind abstractions. While RPC is robust enough to handle both local and remote calls, a lighter-weight solution can overcome its high overheads.

3. Contributions
LRPC provides a simple control transfer mechanism. Conventional RPC involves multiple threads that must signal one another and switch context, whereas in LRPC the kernel changes the address space for the caller's thread and lets it continue to run in the server domain.
LRPC provides simple data transfer via shared buffers that are pre-allocated for the communication of arguments and results. The caller copies the data onto the A-stack, after which no other copies are required.
RPC has general stubs, but many of their features are infrequently needed. LRPC uses a stub generator that produces simple stubs in assembly language from Modula2+ definitions.
LRPC is designed for concurrency. On multiprocessor machines, idle processors cache domain contexts to reduce context-switch overhead, and counters are used to keep the highest-activity LRPC domains active on idle processors (see the sketch below).
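
A sketch of that idle-processor exchange, with helper names and the counter policy assumed from the description above:

    struct domain; struct thread;

    extern int  idle_cpu_in(struct domain *server);      /* -1 if none        */
    extern void exchange_processors(int cpu, struct thread *t);
    extern void load_vm_context(struct domain *server);
    extern void count_miss(struct domain *server);       /* feeds the counter */

    void transfer(struct domain *server, struct thread *caller)
    {
        int cpu = idle_cpu_in(server);         /* server context cached there?   */
        if (cpu >= 0)
            exchange_processors(cpu, caller);  /* no TLB/VM reload on either CPU */
        else {
            load_vm_context(server);           /* ordinary domain switch         */
            count_miss(server);                /* bias future idle-CPU caching   */
        }
    }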

4. Evaluation
Experiments were conducted to measure the performance gains: LRPC provides a 3x speedup on a single processor and additional speedups on multiprocessor machines.

5. Confusions
Is this design very language dependent? With the advent of high-speed networking in today's distributed computing environments, is LRPC still relevant?

1.Summary
The paper discusses the Lightweight Remote Procedure Call (LRPC) facility for small-kernel systems, used for communication between protection domains on the same machine. The principles behind LRPC are motivated by the common case of procedure calls on the same machine with few, small arguments and results. LRPC borrows the communication model and large-grained protection of RPC, but eliminates most of the overheads, such as stub generality, kernel message buffering, scheduling, and access validation costs, achieving a factor-of-three performance improvement.

2.Problem
Communication in small-kernel operating systems was expensive and involved complex mechanisms, making cross-domain procedure calls heavyweight. To support applications built on RPC, these OSes borrowed the programming model and protection mechanism of RPC, which carried huge overhead for cross-domain calls. To enable efficient inter-domain communication, control transfer should be simple for cross-domain calls. Using these design principles, the authors propose the lightweight remote procedure call (LRPC) facility for cross-domain communication, which adopts a common-case approach while still providing RPC-like safety and better performance.

3.Contributions
LRPC combines the control transfer and communication model of capability systems with the programming model and coarse-grained protection of RPC systems. It achieves simple control and data transfer with a slightly modified binding mechanism and procedure linkage. It uses the A-stack and E-stack abstractions for direct execution of the client's call in the server domain. The kernel validates the binding and performs access validation on each call. LRPC uses stub mechanisms similar to RPC. One interesting contribution is the use of LRPC on multiprocessors to improve call throughput, achieved by minimizing shared data structures on the critical domain-transfer path. It also uses a domain-caching mechanism on idle processors, reducing wake-up latency and context-switch overhead. Pairwise allocation of A-/E-stacks reduces argument and result copying. Overall, LRPC makes cross-domain procedure calls in small-kernel systems swift and secure.

4.Evaluation
Throughout the paper, the authors provide detailed analysis supporting their design decisions. Cross-machine calls account for only 0.6 to 3.0 percent of RPCs in the three systems examined, indicating the need for an efficient cross-domain communication mechanism (the common case). This is also supported by the analysis of argument and result sizes, with RPC calls transferring 200 or fewer bytes 50% of the time. The overheads incurred by cross-domain RPC calls are evaluated and compared to the theoretical minimum achievable by eliminating them. The number of operations needed for a procedure call in LRPC is compared to message-passing mechanisms. Finally, LRPC's slight remaining overhead relative to the theoretical minimum is attributed to TLB misses and locking.

5.Confusion
RPC for remote services makes sense, but where does LRPC find its uses in modern applications and operating systems? How does it achieve call serialization when multiple calls are being handled? (Kernel memory management in these scenarios is not explained.) Argument copying and domain termination were very confusing.

Summary:

This paper describes how RPCs can be repurposed for local communication on the same machine. The designers were able to improve the performance of RPC by reducing communication overhead. The authors first conducted experiments showing that most RPCs are local, that the number of parameters is small, and that the overhead of conventional RPCs is large. They then describe their system and evaluate its performance.

Problem:

The main problem that the designers of LRPC tried to solve is optimizing RPCs for what they describe as the common case: local communication with a small number of parameters.

Contributions:

The authors took some of the concepts from the SRC RPC implementation and made it safer while not sacrificing too much performance. Instead of using a global shared message buffer space, they used an A-stack that is shared between a client and server for each procedure call.

The A-stack provided them with a relatively safe and quick means of passing arguments and return values between the client and server. By treating the A-stack like a locally accessed call stack, they were able to reduce the number of times data needed to be copied to and from the A-stack and simplify the return process.

By having their stub generators produce stubs in assembly, they were able to exploit the simplicity of their stubs to improve performance. Further optimizations include reusing idle processors whose loaded context is already that of the server's address space; this optimization was designed for multiprocessor systems.

Evaluation:

The authors evaluated LRPC on a Firefly system and compared its performance with the native implementation, Taos. Their numbers show that they were able to significantly reduce RPC overhead and achieve better performance than Taos. It would have been interesting to see how their performance compares with SRC RPC, since they did borrow a few concepts from that implementation.

They also show that their performance scales with an increasing number of processors because they avoid a global lock on shared data structures.

Confusion:

I didn't completely understand what happens on a server when a domain terminates.

Summary:
The paper describes Lightweight Remote Procedure Call, a communication facility designed and optimized for communication between protection domains on the same machine.

Problem:
The conventional RPC approach has high overhead for cross-domain communication and leads to performance loss. LRPC achieves high performance for cross-domain communication while retaining safety and transparency.

Contribution:
1. The control transfer is achieved by the client's thread executing the requested procedure in the server's domain.
2. The LRPC stub generator generates run-time stubs in assembly language directly because of the simplicity and stylized nature of LRPC stubs.
3. In multiprocessor environment, LRPC minimizes the shared data structure on the critical domain path and reduces context-switch overhead by caching domains on idle processors.
4. A shared argument stack, accessible to both client and server, eliminates three redundant data copies.

Evaluation:
LRPC was implemented and integrated into Taos. The paper first gives experimental data about the frequency of cross-machine activity, parameter size and complexity, and the performance of existing cross-domain RPC. RPC latency is evaluated with the Null procedure call. LRPC shows about 3x better performance under different parameter-size settings, and increases throughput on shared-memory multiprocessors by avoiding locking shared data during the call.

Confusion:
The paper mentions in the "Binding" part that the kernel can detect a forged Binding Object. I wonder how that is implemented.

Summary:
This paper proposes and evaluates the design of LRPC, a communication facility designed for communication between protection domains on the same machine, improving performance over traditional RPC without compromising security. The rationale for this approach is the high frequency of cross-domain over cross-machine RPC calls in small-kernel distributed systems.

Problem:
In small-kernel systems, the RPC overhead for communication between protection domains on the same machine has been dealt with by coalescing weakly related subsystems into single protection domains. This compromises security while not achieving real performance improvement.

Solution:
Since the most common case of communication is cross-domain, the paper proposes a lightweight RPC that leverages the presence of both subsystems on the same machine. It is observed that simple byte copying is sufficient for data transfer across system interfaces, and that most IPC passes small parameters. The solution has simple control transfer, simple data transfer, simple stubs, and a design for concurrency. The key changes to achieve this simplicity are the following:
1) A-stack: For each procedure in the interface there is a procedure descriptor (PD), and for each PD there are argument stacks (A-stacks) mapped read-write and shared by both domains for communication. A-stacks can also be shared among procedures of the same interface, reducing storage requirements as well.
2) Leveraging multiprocessor architectures for RPC: This is done by minimizing the use of shared data structures on the critical domain-transfer path, and by caching domain contexts on idle processors.

Evaluation:
The paper evaluates the impact of all the proposed changes through an implementation in the Taos system. The measurements clearly show improved performance for common calls and on multiprocessor systems. Additionally, LRPC achieves better performance even for the single-processor Null call.

Confusion/Learnings:
The paper is well-rounded. It would be good to compare, side by side, existing subsystem interactions and the A-stack-based LRPC system.

Summary
The paper describes a lightweight implementation of RPC that achieves large performance improvements over RPC for cross-domain procedure calls (i.e., procedure calls made within the same machine). LRPC imposes less overhead on cross-domain communication, improving over RPC by a factor of three.

Problem
The RPC mechanism already works well for communicating across machines. The authors observed that a large majority of remote procedure calls are in fact cross-domain, not cross-machine, and that most calls transfer very few bytes. Communicating on the same machine using RPC carries lots of overhead (messages, stub generation, access validation, etc.). Until this paper, a cross-domain procedure call was treated simply as an instance of RPC, though it could be made considerably less complex. In an attempt to optimize existing RPC, this paper introduces lightweight RPC, better suited to cross-domain communication without compromising security.

Contribution
RPC is robust but comprises several steps (user -> stub -> message passing -> access validation -> ...). The authors improved performance dramatically by redesigning the architecture; the mechanism they came up with is a combination of protected procedure calls and RPC.

1.Binding mechanism - The binding of clients and servers is done such that most of the work required to make a connection is carried out before any request is made. The result is a Binding Object, which is like a capability to a certain exported server interface. Once the client holds the Binding Object, access to the service is quick, with minimal security checks required (see the sketch below).
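
One plausible way to read those "minimal security checks": the Binding Object acts as a key into a kernel-private table, so a forged value simply fails the lookup. This is an illustrative sketch, not the paper's stated mechanism:

    #include <stdint.h>

    struct binding_state { int valid; /* kernel-private per-binding data */ };

    extern struct binding_state *kernel_table_find(uint64_t key);

    struct binding_state *check_binding(uint64_t key)
    {
        struct binding_state *b = kernel_table_find(key);
        return (b && b->valid) ? b : 0;  /* forged or revoked: rejected */
    }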

2.Shared memory in the form of an A-stack is created at bind time. This stack holds the arguments and return values placed during a call. The A-stack is mapped read-write and shared by both domains. Such placement of shared memory reduces the calling overhead, since data is copied only when needed.

3. Multiprocessors - To reduce the overhead of context switches, the authors use a domain-caching trick. In an MP scenario, the calling thread is placed on a processor where the context of the server domain is already loaded. The called server procedure can then execute on that processor without requiring a context switch.

Reducing the number of data transmissions and copies, among other improvements, made the procedure call three times faster. LRPC provides simplification, performance, and safety at the same time.

Evaluation
The paper measures the time required to execute an LRPC with the Null call as the base case. The overheads of LRPC and RPC were measured with different argument sizes. Unsurprisingly, the larger the arguments, the longer the LRPC took, but in all cases, including the base case, LRPC was almost three times faster than RPC. The authors also note that TLB misses during context switches remain a significant fraction of the remaining cost. On measuring call throughput, LRPC shows higher throughput than RPC, with values close to optimal.

Confusion
Are E-stacks always in the server domain and A-stacks always in the client's? If so, how does the mapping of the A-stack occur at the beginning? How and when does an E-stack borrow stacks? An example scenario here would be great.

Summary
This article describes Lightweight Remote Procedure Call (LRPC), a communication framework designed and optimized for communication between protection domains residing on the same machine. LRPC builds on the observation that remote procedure calls most of the time happen intra-machine, and that most data transferred during RPC is small and fixed in size. LRPC employs simple control transfer, simple data transfer, and simple stubs, and is designed for concurrency.


Problem
The conventional approach to RPC has very high overhead. This results in the amalgamation of structures inside the kernel, seriously exposing major unrelated parts to each other: a security issue. It also results in a lack of structure in the kernel. While still retaining the qualities of safety and transparency, LRPC achieves performance levels a notch above conventional RPC systems.


Contributions
A brief study was done to show that most RPC activity is intra-machine and that the data transferred in RPC is small and almost fixed in size.

The performance of cross-domain conventional RPC suffers from stub overhead, message buffer overhead, access validation, message transfer, scheduling, context switching, and dispatch. Most of these steps can be bypassed, resulting in a simpler, higher-performance RPC.

LRPC binding is achieved using clerks, which respond to the kernel's binding requests on behalf of the server by dispatching a procedure descriptor list (PDL). Each PDL contains one procedure descriptor for each procedure in the interface.

Argument passing and the linkage records for the caller's return are managed on A-stacks allocated pairwise per binding. A-stacks can also be shared between procedures with similar argument requirements.

A procedure is represented by a call stub in the client's domain and an entry stub in the server's. The LRPC stub generator produces run-time stubs in assembly language directly from Modula2+ definition files.

To avoid concurrency overheads, LRPC avoids locking except for a lock on each A-stack queue (see the sketch below). LRPC employs extra processor cores to cache domain contexts, avoiding the overheads of TLB invalidation and virtual-memory register updates.
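
A sketch of that per-queue locking; type and function names are assumptions. Calls through different interfaces take different locks, so they proceed in parallel, unlike a single global lock on the transfer path:

    typedef struct astack { struct astack *next; /* argument area follows */ } astack;
    typedef int spinlock_t;
    extern void spin_lock(spinlock_t *l);
    extern void spin_unlock(spinlock_t *l);

    typedef struct {
        spinlock_t lock;       /* guards only this interface's queue */
        astack    *free_list;  /* A-stacks allocated at bind time    */
    } astack_queue;

    astack *get_astack(astack_queue *q)
    {
        spin_lock(&q->lock);
        astack *a = q->free_list;
        if (a)
            q->free_list = a->next;
        spin_unlock(&q->lock);
        return a;              /* NULL: caller waits for a free A-stack */
    }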


Evaluation

Four tests were run on the C-VAX Firefly using LRPC and Taos RPC. For domain-switching tests that make no use of extra cores, LRPC was roughly three times faster than SRC RPC. LRPC/MP, which does domain caching, performs even better than LRPC. For a null procedure call, the Firefly's virtual memory and trap-handling machinery impose a basic requirement of 109 microseconds; LRPC adds only 48 microseconds of overhead to this lower bound.


Confusion
How relevant is inter-domain RPC in modern free systems like GNU/Linux, and what major security advantages does it provide over the simple two-level privileged/user-mode access found in most widely accepted operating systems today?

1. Summary
The authors show that contemporary RPC systems do not optimize for the common case of RPCs between protection domains on the same machine. They then design an RPC system optimized for this common case, using the idea of protected procedure calls from capability systems.

2. Problem
To design an efficient RPC system for the common case of RPCs across protection domains in the same machine (cross-domain) as opposed to RPCs across protection domains in different machines (cross-machine).

3. Contributions
The authors perform a case study of existing operating systems to show that only a small fraction of RPCs (or equivalent message transfers) are cross-machine and that a large fraction of RPCs use simple argument passing. They identify the key sources of overhead in cross-domain RPCs in conventional systems and design a lightweight RPC system (LRPC) that avoids them. LRPC uses the client's thread to execute the requested procedure in the server's protection domain, thereby avoiding scheduling and dispatching costs. LRPC uses an argument stack, shared between the client's and server's domains, to pass arguments and results without excessive data copies, and uses a linkage record, accessible only to the kernel, to store the return address in the client. The binding process in LRPC allocates a set of these argument stacks and linkage records for each binding. A procedure invocation involves the client stub finding an argument stack and copying arguments onto it, trapping to the kernel, and the kernel priming an execution stack with the call frame the server stub expects and branching to the server stub. LRPC's stub generation is optimized so that a call involves only one formal procedure call (into the client stub) and two returns (from the server procedure and from the client stub); see the sketch below. LRPC also caches server domain contexts on idle processors to minimize the cost of context switches and further reduce latency.
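
A sketch of that one-call, two-return shape in C (the real stubs were emitted in assembly; the A-stack layout and helper names here are assumptions):

    typedef struct { int arg0, result; } astack;

    extern void kernel_trap(int proc, astack *a); /* transfers to server_stub */
    extern void return_trap(void);                /* transfers back to caller */
    extern int  server_proc(int x);               /* the procedure itself     */

    int client_stub(int x)            /* <- the one formal procedure call */
    {
        astack a;                     /* stands in for a bind-time A-stack */
        a.arg0 = x;                   /* the single argument copy          */
        kernel_trap(/*proc=*/0, &a);
        return a.result;              /* <- return #2, to the caller       */
    }

    void server_stub(astack *a)
    {
        a->result = server_proc(a->arg0);
        return_trap();                /* <- return #1, back through the kernel */
    }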

4. Evaluation
The authors measure and compare the latency of RPCs for certain procedure signatures on LRPC and a conventional, well-optimized RPC system in Taos, and show that a simple empty procedure call in LRPC is a factor of three faster than in Taos. They also provide a breakdown of the components of an empty LRPC call. They show that LRPC is not bottlenecked by locking of shared data structures, as conventional systems are, by demonstrating that call throughput scales linearly with the number of processors.

5. Confusion
How do clerks that help in binding clients to servers work? The subtle mechanisms used for avoiding argument copies in section 3.5 are unclear.

Summary:
This paper talks about lightweight RPC - a mechanism that allows efficient communication between different protection domains on the same machine.

Motivation:
The authors show that most of the communication in operating systems happens between domains on the same machine (cross-domain rather than cross-machine), with simple (and small) arguments and they state that optimizing this common case is the most important motivation for their work.

Contributions:
The authors argue that by avoiding the conventional RPC overheads by using the following ideas, they have designed an efficient communication mechanism for the common case mentioned above:
-Sharing argument stacks (A-stacks) between the client and server domains avoids copying the arguments multiple times; dynamically associating each A-stack with an execution stack on the server side provides a low-latency control-transfer path across domains.
-Type-safety checks happen when copying data into the A-stacks, which spares the system another scan of the data on the server side and lets the kernel prime the first E-stack frame directly.
-Caching domains on idle cores enables quick migration of threads across processors on LRPC calls (avoiding context switches and TLB flushes).
The uncommon cases (such as larger arguments, cross machine communication etc.) are handled by mechanisms more similar to conventional RPCs (and thus are more expensive than the common case).

Evaluation:
The authors present a detailed breakdown of the execution time of the null LRPC call and show that LRPC adds just a 48-microsecond overhead, which is extremely efficient compared to traditional RPC (which adds a 355-microsecond overhead).

Confusion:
-The authors talk about domain termination - but isn't 'domain' here similar to the protection domain in Opal, i.e., just a passive context for threads to execute in? Shouldn't they be talking about thread termination then? What does it mean for a domain to be terminated?

Summary
The paper describes the motivation, design, implementation, and performance of Lightweight Remote Procedure Call, "a communication facility optimized for communication between protection domains on the same machine". It explains the various techniques LRPC uses to achieve significantly better performance than conventional RPC systems for cross-domain communication.
Problem
Most RPC calls in an operating system are cross-domain rather than cross-machine. Using RPC for cross-domain communication carried a performance overhead, as it was not designed or optimized for this case. So there was a need for a simple rather than complex cross-domain communication mechanism.
Contribution
Binding provides simple control transfer. The client binds to a specific interface via an import call through the kernel, which notifies the waiting clerk in the LRPC runtime of the server module. The clerk enables the binding by returning a PDL. For every PD in the PDL, the kernel allocates an argument stack (A-stack) that is mapped read-write into both domains and on which the arguments and return values are placed during a call. At the end of binding, the kernel returns a Binding Object to the client, which is used as a key for accessing the server interface on every trap to the kernel. The kernel also creates a linkage record, which stores the caller's return address, for each A-stack. Private execution stacks in the server domain are allocated and associated with an A-stack only when required, and the association remains after the call returns so it can be reused by other calls. The authors also propose a multiprocessor optimization that keeps a domain's context loaded on an idle processor, reducing the number of context switches as much as possible.
Evaluation
The authors evaluated the performance of LRPC by running four tests on the C-VAX Firefly using LRPC and Taos RPC. LRPC on a single processor was three times faster than the SRC RPC of Taos. The improved performance from the multiprocessor optimization was clear from the results. Also, LRPC has higher throughput than RPC on multiprocessors because it does not use a global lock.
Confusion
I did not completely understand how parameters are passed by reference by the client.

Summary

The paper describes Lightweight Remote Procedure Call (LRPC) facility which is optimized for communication between protection domains on the same machine. This is achieved by employing techniques for simple control transfer, shared argument stack, highly optimized simple stubs and by leveraging the speedup potential of a multiprocessor.

Problem

The original RPC systems incurred unnecessarily high cost when used between protection domains on the same machine. These costs caused designers to sometimes trade safety for performance. Thus, optimizations were needed for RPC calls between protection domains of the same machine.

Contributions

LRPC combines the control transfer and communication model of capability systems with the programming semantics and large-grained protection model of RPC. The following techniques have been used for performance optimization:

Simple Control transfer: LRPC runs the client's thread in the Server's domain for a requested procedure. The control transfers to the server's execution stack after the kernel validates the call for permissions.

Simple data transfer: A shared argument stack is used which is accessible to both client and server and thus reducing the overhead of data copying.

Simple stubs: The simplified control and data transfer allow LRPC to use highly optimized stubs.

LRPC benefits from multiprocessor systems by minimizing the shared data structures used, and hence the associated locking costs, and it caches domains on idle processors to reduce the overheads of context switching.

Evaluation

The authors use four tests that were run on the C-VAX Firefly using LRPC and Taos RPC. They found that LRPC on a single processor is about three times faster than SRC RPC. The performance of LRPC/MP (multiprocessor) was even better than LRPC.

Confusion

RPC systems are still needed for procedure-calls across machines. How does the application know whether and when to use LRPC or RPC ?

1. summary
The current paper introduces a Lightweight Remote Procedure Call (LRPC) communication mechanism, optimized for common case cross-domain procedure calls in small kernel systems. This method reduces the cost of traditional methods like RPC and message passing while maintaining the protection domains.
2. Problem
The authors point out that for capability based small kernel systems, cross domain protected procedure calls are difficult to implement and program. Distributed system mechanisms like RPC are shown to be useful for cross domain communication, but have overheads of remote calls associated with them. The objective here is to optimize the calling procedure without compromising the protection.
3. Contributions
The paper optimizes the mechanisms of RPC for the common case scenario. The authors note that most of the procedure calls are local and consist of simple arguments, thus providing scope to eliminate the overheads of RPC.
The binding process requires interaction between the client, the server, and the kernel. A shared argument stack eliminates the need to copy a message multiple times. A private execution stack is used to execute the procedure in the context of the client's thread. Execution and argument stacks are lazily bound to reduce the total number of call stacks required. Validation is performed only on the forward call path, removing overhead from the return path.
To increase throughput on multiprocessor systems, shared data structures are minimized on the critical domain-transfer path. Unlike the existing RPC implementation, where a single global lock protects the transfer path, LRPC uses one lock per call/argument stack and permits multiple calls to proceed in parallel. To reduce context-switch overhead, the calling thread is exchanged with a thread idling in the server domain: the idle processor has cached the server domain's context, so the exchange avoids a context switch. Asynchronous changes to input arguments are possible, so arguments are validated for integrity before use.
4. Evaluation
The performance of LRPC on single and multiple processors is compared with the existing RPC implementation. LRPC is roughly three times faster than RPC, with LRPC/MP performing even better. A detailed timing profile of an LRPC call shows that less than half the time is spent in the stubs and the rest in the kernel. TLB misses contribute about 25% of the total latency, a bottleneck in this design. The throughput for LRPC/MP shows a speedup of 3.7 on a four-processor machine.
5. Confusion
How did they manage to remove locking on shared data across multiple procedure calls? Good throughput can be expected once the synchronization is removed.

Summary:

The paper talks about a communication mechanism between two different domains on the same machine that maintains both safety and performance. The authors implement Lightweight Remote Procedure Call (LRPC), an optimized version of RPC that keeps the same semantics. The optimizations target control transfer, where the number of context switches is reduced; data transfer, where a shared argument stack eliminates unnecessary data copying; and certain other aspects.

Problem:

System designers avoided using RPC for communication between domains on the same machine, as the overhead was very high and hurt performance. This forced system designers to coalesce the different related subsystems into the same domain, which compromised security between the subsystems. LRPC overcomes this problem and provides both safety and performance with its optimizations of RPC.

Contributions:

LRPC provides a binding mechanism similar to RPC: the server exports an interface through a clerk in the LRPC runtime library, which registers it with a name server, and the interface is then imported by clients. Each client is provided a Binding Object after validation, which later acts as the client's key for accessing the server's interface. In this way LRPC enforces protection between server and client. The client calls a procedure by passing the Binding Object and an allotted A-stack (argument stack), which is shared between the server and the client. The A-stack contains the arguments passed by the client to the server; it is mapped into the server, and an argument pointer is used to refer to the arguments.

An execution stack is used to run the server procedure; it holds the argument stack pointer, among other things. A linkage record is used to hold the return address. On multiprocessors, LRPC takes advantage of domain caching to avoid unnecessary context switches. LRPC also optimizes argument copying beyond the shared A-stack: for by-reference parameters such as arrays, the data is copied into the A-stack once and the server stub recreates a reference to that copy in its own address space (see the sketch below).
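
A sketch of that by-reference handling; the A-stack layout is an assumption. The point is that the data is copied once and the server gets a pointer into the shared copy, never a client-space address:

    #include <string.h>

    typedef struct { size_t len; int data[64]; } astack;  /* assumed layout */

    void marshal_array(astack *a, const int *arr, size_t n)
    {
        a->len = n;
        memcpy(a->data, arr, n * sizeof *arr);  /* one copy into shared memory */
    }

    int *unmarshal_array(astack *a, size_t *n)  /* runs in the server stub */
    {
        *n = a->len;
        return a->data;  /* fresh server-local reference to the copy */
    }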

Evaluation:

The paper clearly states the problems with using RPC for domain calls within the same machine, providing the corresponding measurements. LRPC provides solutions to the problems one may face with RPC and explains how they are overcome. The authors clearly show the performance improvement of LRPC over RPC. But a more ideal comparison would have been with the mechanism of trapping into the kernel and copying arguments, as is done for system calls.

Confusions:

I am a little confused about the part on argument copying where they say that, for large values, copying concerns become less important. Is this because they plan to just copy the starting address of the large data? If so, why can't the same be done for small values? And if an address is copied, how would it work, since the physical address would need to match and not just the virtual address?
Also, is the server's virtual memory for the A-stack mapped at the same virtual address as the client's?

Summary :
The paper talks about the LRPC mechanism, which is used for communication between protection domains on the same machine in small-kernel systems. It highlights the main advantages of LRPC over normal RPC and evaluates the performance of LRPC against Taos RPC for several procedure calls.

Problem :
Most of the communication traffic in OSes is cross-domain rather than cross-machine, and most of the arguments passed are simple. Traditional RPC mechanisms incur a lot of overhead when applied to the cross-domain case. Therefore, the problem in question is to design techniques for control transfer, data transfer, and stub optimization that take advantage of the fact that these are local operations between protection domains on the same machine.

Contributions :
1. Mechanism is similar to that of a protected procedure call which takes place through a kernel trap.
2. The concepts of RPC that apply to LRPC include binding and calling through stubs. A clerk performs the functionality of the RPC runtime: it registers the server interface and handles import requests from clients. A procedure descriptor list describes the set of procedures exported by the server domain.
3. Arguments are pushed onto the A-stack in the client domain; the A-stack is mapped read-write and shared by both client and server domains, avoiding the need for a copy.
4. A linkage record is allocated by the kernel for each A-stack to store the caller's return address. Binding Objects are used by the kernel to validate the caller's access to procedures in the server domain.
5. Private execution stacks for procedures in the server domain are allocated only when needed and associated with an A-stack. The association remains after the call so that it can be reused by further calls (see the sketch after this list).
6. Argument copying is efficient, since the four copy operations of normal message passing are replaced by at most two.
7. The number and size of A-stacks can be overridden, and a default size of an Ethernet packet is used when variable-size arguments are present. On termination of a domain, Binding Objects are revoked and active linkage records are invalidated.
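
A sketch of the lazy E-stack association from point 5; field and helper names are assumptions:

    struct domain;
    typedef struct { void *estack; /* plus args, linkage, ... */ } astack;

    extern void *alloc_estack(struct domain *server);

    void *estack_for(astack *a, struct domain *server)
    {
        if (a->estack == 0)                  /* first call using this A-stack */
            a->estack = alloc_estack(server);
        return a->estack;                    /* reused by all later calls     */
    }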

Evaluation :
The paper evaluates four different procedure calls for both LRPC and Taos RPC. On average, they conclude that LRPC was three times faster than normal RPC. The simplest cross-domain call took 157 microseconds on a single processor, compared to 464 microseconds for the same call with SRC RPC. The evaluation also indicates that the LRPC overhead, which comprises the time spent in the client and server stubs, is minimal. A throughput speedup of 3.7 is achieved on multiprocessor systems using LRPC.

Confusion :
How is domain termination different between traditional RPC and LRPC? The concept of caching domain contexts on idle processors was not quite clear.

Summary:

This paper talks about Lightweight Remote Procedure Call (LRPC), a modification of traditional RPC that reduces the cost of inter-domain communication on the same machine. The paper discusses the design and implementation of LRPC through the modifications made to binding, calling, and stub generation, and how the control transfer mechanism of capability systems is incorporated to reduce latency.

Problem:

The existing RPC systems performed badly for cross-domain communication due to overheads in stubs, message buffers, access validation on message send and receive, managing the client and server threads, and context switches. SRC RPC improved performance by sharing message buffers globally, guarded by a lock to avoid conflicts, but it also removed access validation on call and return, trading security for performance. The LRPC system tries to provide the same performance without any security trade-off.

Contribution:

LRPC combines the RPC model with the control transfer of the capability system's protected procedure call. The server exports an interface through a clerk, which registers the interface with a name server and handles binding by providing the PDL (procedure descriptor list) for all procedures in the interface. By using an A-stack accessible to both domains, message copying is reduced. The Binding Object returned to the client on import serves as access validation. On a procedure call, the stub prepares the A-stack, packs the information the kernel needs to invoke the server, and traps into the kernel, which in turn verifies the data and prepares an E-stack for the server. The server procedure executes, and the server stub manages returning the result. On a multiprocessor, domain caching reduces LRPC latency: if a processor is idling in the context of the server domain, the client thread is placed on that processor, reducing context-switch overhead. If no idling processor is found, a counter keeps track of how often each domain is needed, so that busy domains tend to stay loaded on idle processors. Caching domains also reduces domain-crossing costs. LRPC also recognizes situations where parameters are not referenced by the server, as in a write call, and avoids copying them. By type checking the arguments during the copy, the server avoids runtime overhead from argument type mismatches.

Evaluation:

The LRPC mechanism is evaluated using four tests with procedures of different parameter sizes. LRPC on a single processor outperforms the SRC RPC of Taos by a factor of three, and the multiprocessor optimization is also shown to be effective. A breakdown of the theoretical minimum and the LRPC overhead is given. The paper also reports the number of calls per second in LRPC, which is greater than in SRC RPC systems.

Confused about:

I did not understand how parameters passed by reference are handled.

Summary:
This paper describes the design and implementation of Lightweight Remote Procedure Call (LRPC), a communication facility designed and optimized for communication between protection domains on a single machine. LRPC simplifies various aspects of the RPC mechanism, such as control transfer, data transfer, linkage, and stubs. LRPC borrows its execution model from the "protected procedure call" of capability systems and its programming semantics and large-grained protection model from RPC. LRPC represents a viable communication alternative for small-kernel operating systems.

Problem:
Most communication traffic in small kernel operating systems is between its various protection domains on the same machine. Employing RPC for communication in such systems results in loss of performance because RPC does not distinguish between cross domain communication and remote communication and treats communication between domains as an instance of remote communication. This paper tries to optimize performance for the common case of cross domain communication.

Contributions:
- Binding: Conceptually, LRPC binding is similar to RPC binding, except at a lower level. The server provides a procedure descriptor list, used by the kernel to allocate A-stacks and create linkage records. On completion, the kernel returns a Binding Object and an A-stack list to the client.
- Calling: The client stub takes an A-stack off the queue and pushes arguments onto it. Then it puts the A-stack address, Binding Object, and procedure identifier into registers and traps to the kernel.
- Simple data transfer: RPC requires data to be copied four times: stub to RPC message, client message to kernel, kernel to server, and server message to stack. LRPC copies data only once, from the client stub's stack to the shared A-stack, from which the server procedure can access it.
- Simple stubs: Stubs blur boundaries between protocol layers to reduce the cost of crossing them. LRPC needs only one formal procedure call (into client stub) and two returns (one out of server procedure and one out of client stub)
- Concurrency: LRPC minimizes the use of shared data structures on critical domain transfer path. Also, on shared memory multiprocessors, LRPC latency is reduced by caching domain contexts on idle processors.

Evaluation:
To measure the performance gains of LRPC over RPC, the authors run four different cross-domain calls, Null, Add, BigIn, and BigInOut, on the C-VAX Firefly using LRPC and Taos RPC. The results show that LRPC is almost a factor of three faster than RPC. It is also reported that 25% of LRPC time is consumed by TLB miss handling. LRPC also seems to scale well with the number of processors.

Confusions:
Is LRPC used in any of today’s systems?

Summary:
The paper describes the implementation of lightweight remote procedure call (LRPC), a common-case optimization for cross-domain communication. LRPC simplifies data transfer, control transfer, linkage, and stubs. It is meant for small-kernel OSes, to avoid the cost incurred by RPC. Evaluation shows LRPC to be three times faster than RPC.

Problem:
The authors observe that the majority of RPC calls are cross-domain rather than cross-machine. The conventional RPC mechanism does not distinguish between local and remote procedure calls; as a result, local calls incur more overhead and perform worse. The authors propose optimizations that implement local procedure calls, i.e., cross-domain communication, with less overhead and better performance.

Contributions:
1. Simple control transfer set up through binding: the server provides a procedure descriptor list, used by the kernel to allocate argument stacks (A-stacks) and create linkage records. On completion, the kernel returns a Binding Object and an A-stack list.
2. Simple data transfer: data is copied only once (through the shared A-stack), compared to four times in RPC. A-stacks are shared between procedures of the same interface with similar A-stack sizes.
3. Stubs are generated at run time and blur the boundaries between protocol layers to reduce the cost of crossing them. LRPC needs only one formal procedure call and two returns.
4. LRPC minimizes the use of shared data structures on the critical execution path.
5. Latency is reduced by caching domain contexts on idle processors.

Evaluation:
Initially, the authors motivate the problem by showing that the frequency of truly remote activity ranges from 0.6 to 3.0%. LRPC is shown to be three times faster than normal RPC. Performance measurements were done with variable parameter sizes (Null, Add, BigIn, BigInOut). The C-VAX system using LRPC takes 157us per call, roughly three times faster than Taos RPC. Throughput measurements showed a speedup of up to 4.3 with five processors.

Confusions:
Call throughput measurements were done with a max of 5 processors. Is there any specific reason for not using more processors?

1. Summary
This paper demonstrates the design and implementation of a lightweight remote procedure call framework for communication among different protection domains on the same host machine. LRPC offers the semantics and protection of RPC with the simplicity and efficiency of a protected procedure call.

2. Problem
The authors discovered that most invoked remote procedure calls are not truly remote: they are served by a thread on the same machine. Since RPC is designed for remote communication, using it for local calls is inefficient.

3. Contributions
The authors implement the lightweight remote procedure call framework. LRPC uses the same programming paradigm as traditional RPC, so there are no compatibility issues; however, LRPC differs from RPC in how it handles the underlying details. Just as with RPC, a client has to bind to a server to initiate LRPC communication. The kernel returns a Binding Object, which the client can use to make future LRPC calls. LRPC also uses stubs, which are responsible for the transfer between protection domains. Since the call is to a procedure on the same machine, the network overhead is eliminated; instead, arguments and return values are transferred through memory shared between the client and server. The paper proposes optimizations that let LRPC use multiprocessor environments efficiently. LRPC also handles failures on both the client and server side by raising exceptions to the caller.

4. Evaluation
To evaluate the LRPC framework, the authors measure the latency of four different LRPC calls. These tests include calls that do no work which measures the time for switching protection domains and procedures that take and return large arguments. They compare the measured latency values against RPC implementations to show the improvement that LRPC offers. The paper also presents a breakdown of the time it takes for an LRPC call to complete.

5. Confusion
What are the advantages of using LRPC over communicating using portals like in Opal - both seem to be enabling inter-domain communication on the same host machine.

1. Summary
The paper describes the implementation of Lightweight Remote Procedure Calls (LRPC), which is used to call procedures across domains on the same machine. The authors take advantage of the fact that procedure calls across domains on the same machine do not have to be implemented the same way as procedure calls across different machines.

2. Problem
a) Most of the procedure calls in the existing operating systems were on the same machine rather than across machines. Even in the worst case, 95% of procedure calls crossed domains on the same machine.
b) Most of the parameters passed are small in size, so marshaling them in the stubs adds time and space when they could simply be passed across domains on the same machine.
c) Even for a procedure that does nothing, RPC has additional overheads from stubs, scheduling, message buffers, access validation, message transfer and context switches.
All these actions are unnecessary when calling procedures on the same machine, hence the authors implement a lightweight version built on the fundamentals of RPC.

3. Contributions
a) Binding is done at run time: the server's clerk registers the interface, exporting a list of procedure descriptors (PDs). Each PD contains the procedure's entry point, the maximum number of simultaneous calls that can be made to it, and the size of its A-stack (argument stack). A linkage record associated with each A-stack holds the caller's return address during a call.
b) At call time, the kernel uses the PD to find the procedure's entry point; the arguments are pushed onto the A-stack and the client's thread is given control inside the server's address space (see the sketch after this list).
c) In the case of call by reference, the referent is copied onto the A-stack and a pointer to it is recreated on the server's private E-stack. A-stack/E-stack associations are created lazily and cached.
d) On a multiprocessor machine, the A-stack queues are locked to provide synchronisation. Server domain contexts are also cached on idle processors to avoid unnecessary context switches.
e) Compared to RPC, LRPC does a fraction of the copying, since arguments are not copied into a kernel message and back out.
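A rough C sketch of the call path in b), under my reading of the paper; every type and helper name below is a hypothetical stand-in (the real path is hand-tuned kernel code):

    /* Sketch of the kernel's LRPC call trap; helpers are invented stand-ins. */
    typedef struct Domain Domain;
    typedef struct { void *return_addr; void *caller_sp; } Linkage;
    typedef struct { unsigned key; Domain *server; } Binding;

    extern int      binding_valid(Binding *b, void *astack);
    extern Linkage *linkage_for(void *astack);   /* one linkage record per A-stack */
    extern void    *caller_return_address(void);
    extern void    *current_sp(void);
    extern void    *free_estack(Domain *d);
    extern void     switch_stack_and_domain(void *estack, Domain *d);
    extern void     upcall(void (*entry)(void *), void *astack);

    void lrpc_call(Binding *b, void (*entry)(void *), void *astack) {
        if (!binding_valid(b, astack))
            return;                                  /* reject forged or bogus calls */
        Linkage *lk = linkage_for(astack);
        lk->return_addr = caller_return_address();   /* saved for the return path */
        lk->caller_sp   = current_sp();
        switch_stack_and_domain(free_estack(b->server), b->server);
        upcall(entry, astack);                       /* server runs on the client's thread */
    }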

4. Evaluation
LRPC is found to be roughly three times faster than a normal Taos RPC (closer to four times on a multiprocessor) - mainly due to fewer context switches and faster argument passing. LRPC's overhead relative to the theoretical minimum is almost 50% (about 48us on a Null call). Also, compared to RPC, LRPC provides higher throughput on multiprocessor machines since it has no global lock.

5. Confusion
The passing by reference was a little confusing. How does the server write on the address space of the client?

Summary: This paper introduces Lightweight Remote Procedure Call (LRPC), optimized for communication between protection domains on the same machine. LRPC achieves both safety and performance: it combines the control transfer and communication model of capability systems with the programming semantics and large-grained protection model of RPC. In the experiments, calls within a single machine were sped up by a factor of three relative to RPC.

Problem: For communication between protection domains, the majority (about 95% to 99% according to the measurements in the paper) stays within the same machine. Moreover, the majority of this communication passes very small, simple arguments. It therefore pays to optimize these lightweight same-machine calls specifically.


Contributions:
1. Simple control transfer. The steps are: the client calls the server stub, which traps to the kernel -> the kernel validates the caller, creates a call linkage, and dispatches the client's thread *directly* into the server domain -> the client provides the argument stack and its own thread. This connects client and server in very few, simple steps.

2. Simple data transfer. The parameter-passing mechanism resembles an ordinary procedure call: a stack shared by both client and server, so redundant data copies are avoided.

3. Simple stubs. Because the control and data transfer models are simple, the stubs can be highly optimized and generated in assembly language; their only instructions are moves and the kernel trap. A minimal stub sketch follows.
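For illustration, a client stub for a procedure add(x, y) might reduce to roughly the following C (the paper generates assembly; all names here are invented):

    /* Hypothetical client stub for int add(int x, int y). */
    extern int *grab_free_astack(void *binding);   /* pop from the A-stack queue */
    extern void kernel_trap_lrpc(void *binding, int proc, int *astack);

    int add_stub(void *binding, int x, int y) {
        int *astack = grab_free_astack(binding);
        astack[0] = x;            /* the only copy: arguments go straight */
        astack[1] = y;            /* onto the shared A-stack              */
        kernel_trap_lrpc(binding, /*proc=*/0, astack);
        return astack[0];         /* the result comes back on the same A-stack */
    }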

Evaluation: The authors implement LRPC on the Taos system and compare the latency of four types of calls: Null, Add, BigIn and BigInOut. On average, LRPC achieves a factor-of-three speedup over Taos without LRPC. A breakdown of the Null call's time is also shown: LRPC's own overhead is about 48us of the 157us total.

Confusion: I do not see a comparison of safety features between LRPC and RPC, is it true that LRPC provides equal or stronger safety with three times better performance?

Summary
This paper explains the need for lightweight remote procedure calls (LRPC) for communication between protection domains in a single machine. It describes how LRPC takes the communication and control transfer mechanisms of capability systems and combines them with the protection model of RPC. A measurement study showed that only a small fraction of total communication actually crosses machine boundaries. The design, implementation and performance evaluation of LRPC are also provided.

Problem
RPC communication between protection domains on the same machine is heavyweight and hence hurts performance. To avoid expensive RPC calls, logically separate entities were packaged together into a single domain, increasing its size and complexity. LRPC was developed from these observations; it aims to provide an efficient inter-domain communication model.

Contribution
Before discussing LRPC, the authors provide a few experimental observations. They monitored three operating systems (the V system, Taos, and UNIX+NFS) over a period of time and noticed that only a very small fraction of communication crosses machine boundaries - partly, they suggest, because slow cross-machine RPC discouraged network communication. They also noticed that the size of the arguments and the complexity of cross-domain procedure calls were small: no marshaling was required, and simple byte copying was enough to transfer the data. Cross-domain RPC also incurred performance overhead from stubs, message buffers, access validation, message transfer, scheduling, context switch and dispatch.
In LRPC's design, a call traps to the kernel, which validates the caller, creates a call linkage, and dispatches the client's thread directly into the server domain; the server is also provided with the client's argument stack. The client needs to bind to the server's interface before the first call, which grants the client authorization to use the server's procedures. LRPC also reduces context-switch overhead by caching domains on idle processors and placing the calling thread on a processor where the server context is already present (sketched below).
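A small C sketch of that idle-processor optimization; everything here (types, helper names) is an invented approximation of the paper's prose:

    /* Sketch of LRPC domain caching on idle processors. */
    typedef struct Domain Domain;
    typedef struct Processor Processor;

    extern Processor *idle_processor_in(Domain *server); /* spinning in server context */
    extern Processor *current_processor(void);
    extern void       exchange_threads(Processor *a, Processor *b);
    extern void       context_switch_into(Domain *server);

    void dispatch(Domain *server) {
        Processor *p = idle_processor_in(server);
        if (p) {
            /* The caller migrates to a processor whose VM registers (and TLB)
               already hold the server's context; the idle thread parks on the
               caller's old processor. No address-space switch is needed. */
            exchange_threads(current_processor(), p);
        } else {
            context_switch_into(server);   /* fall back to a normal switch */
        }
    }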


Evaluations
Tests were conducted on the C-VAX Firefly using LRPC and Taos RPC, with the Null call used to derive the baseline. Elapsed time was measured in multiprocessor and single-processor scenarios. LRPC (MP) was better than single-processor LRPC, which in turn was found to be three times faster than SRC RPC. It was also observed that 25% of the LRPC time goes to handling TLB misses.
Even with domain caching disabled, LRPC calls sped up as the number of processors increased. In the case of SRC RPC, however, throughput stops increasing after two processors because of a global lock on the RPC transfer path.

Confusions
I was not clear on how global locks prevent throughput from improving with an increasing number of processors, and also on the topic of domain caching on idle processors.

Summary :
The paper presents an optimized, lightweight version of conventional RPC, called LRPC, which can be used when clients and servers are on the same machine but in different protection domains.

Problem :
Most of the time, when objects transfer control between domains, the transfer does not cross machine boundaries but only protection domains within the same system. Handling these transfers with a conventional RPC system adds a lot of overhead through multiple parameter copies and control transfers into and out of the kernel. This can be avoided by detecting which calls are actually cross-domain and using a lightweight RPC mechanism for them.

Contributions :
1. Identifying that cross-domain calls are far more frequent than cross-machine calls, and improving RPC performance for them by roughly a factor of three.
2. Using pairwise shared memory for argument stacks between clients and servers, instead of copying from one domain to another (see the sketch after this list).
3. Using handoff scheduling to directly transfer control of the thread from the client domain to the server domain. This avoids the queueing and rescheduling overhead of a full thread switch, although each call still traps into the kernel.
4. Caching domains on idle processors and cheaply transferring control to them. This avoids having to reload virtual memory registers and flush the TLB.
5. Guaranteeing protection by allocating pairwise shared memory that is accessible only to the server and the client.
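To make point 2 concrete, a schematic C fragment (invented names; the message-based path is shown only as a comment for contrast):

    /* Message-based RPC moves an argument four times:
       stub stack -> client message -> kernel buffer -> server message/stack.
       With a pairwise-shared A-stack, one copy suffices: */
    void client_stub_store(int *astack_shared, int arg) {
        astack_shared[0] = arg;   /* copy #1 and only; the server procedure
                                     reads this word directly after the trap */
    }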

Evaluations :
The authors show that LRPC is three times faster than the conventional RPC it is compared against, and slightly faster still in the multiprocessor scenario. They analyze how many TLB misses occur when the Null call is executed: even with optimizations to reduce misses, they observe 43 TLB misses. They also compare call throughput: with four processors they observe a speedup of 3.7 thanks to domain caching, and with five processors the speedup is 4.3.

What I found confusing :
What are the applications of LRPC in the current scenario?

Summary

Remote Procedure Calls (RPC) are a communication facility designed for both inter-domain (same system) and inter-system (different system) calls. In practice, the vast majority of calls take place on the same system, yet RPC treats all calls equally, so same-system calls are burdened by the overhead of machinery meant for remote servers. This paper presents the Lightweight Remote Procedure Call (LRPC) facility, which reduces the complexity of local cross-domain calls while retaining the ability to fall back on standard RPC methods for complex and inter-system requests.

Problem

The message-based approach used by RPC treats inter-domain and inter-system calls equally. This adds considerable overhead to all RPC procedures, yet the vast majority of requests take place on the local system. Most communications are short, simple messages that do not require heavyweight library procedures or copying through the kernel; message copying through the kernel can in fact be avoided for local call/return, yet RPC still passes full messages through it. This is inefficient and unnecessary.

Contributions

The main contribution of this paper is the LRPC facility, which uses several techniques to dramatically improve RPC performance. Several methods make up the guts of LRPC, with the common themes of minimizing work on the call path and exploiting parallelism. Client threads execute the requested procedure in the server's protection domain. Client and server share an argument stack to avoid copying. Highly optimized stubs allow for simple control and data transfer. Caching domains on idle processors reduces context-switch overhead in multiprocessor systems. Finally, LRPC locks only the argument-stack queues in multiprocessor systems and nothing else, using a very small fraction of call time (a minimal sketch of such a queue follows).
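A minimal C sketch of a per-procedure A-stack free list guarded by the only lock on the call path; the structure and names are my assumptions, not the paper's code:

    /* The only lock on the LRPC call path: the A-stack free list. */
    #include <stdatomic.h>

    typedef struct AStack { struct AStack *next; } AStack;

    typedef struct {
        atomic_flag lock;       /* tiny spinlock; held only to pop or push */
        AStack     *free_list;
    } AStackQueue;

    AStack *astack_pop(AStackQueue *q) {
        while (atomic_flag_test_and_set(&q->lock))
            ;                   /* spin: the critical section is a few instructions */
        AStack *s = q->free_list;
        if (s)
            q->free_list = s->next;
        atomic_flag_clear(&q->lock);
        return s;               /* NULL means all A-stacks are in use; caller waits */
    }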

Evaluation

The paper presents two sets of evaluations. Very early on, it evaluates existing RPC usage to determine the proportion of local vs. remote calls in common systems, as well as message sizes. This reveals that >95% of all RPC calls happen within the same system, while >50% of messages are 200 bytes or less. Next, after describing the LRPC approach, the authors run several performance tests to show how much better their approach works than standard RPC. These reveal that LRPC offers a 3-4x performance improvement over more general RPC implementations, with comparable handling of complex and cross-machine calls, which fall back to regular RPC.

Confusions

LRPC seems like a very clever way to enhance RPC for local systems without affecting performance for inter-system calls. Is there a catch? Has this or something similar been widely adopted?

Summary:
This paper describes LRPC, an optimized version of RPC designed for the case of cross-domain communication on the same machine. To reduce the performance overheads of conventional RPC, LRPC uses simpler mechanisms that cut the path indirection, argument copying, scheduling, dispatch and context-switch costs associated with a local RPC.

Problem:
Conventional RPC does not distinguish between local and remote calls and so the local calls incur higher latencies than they should when calling into another domain. This forces developers to put more and more unrelated components into the same domain which violates the structure of the small kernel operating system built for distributed systems. LRPC optimizes for the local calls so that both performance and the protection associated with different domains can be achieved.

Contributions:
LRPC reduces the cost of using stubs for local calls by having the kernel invoke the server entry stub directly. This removes the intermediate steps of message examination and the need for a dispatcher on the server side. Traditional RPC required four copies to get an argument from the client to the server's stack; LRPC brings this down to one by using a shared argument stack that the server procedure can use directly. LRPC also utilizes multiprocessors to speed up its calls by caching domains on idle processors, which reduces the context-switch overhead of reloading the virtual memory registers. It also reduces contention by minimizing the shared data structures and locks on the transfer path. All of this allows domains to be kept separate, which improves the overall security and isolation of small-kernel operating systems.

Evaluation:
The LRPC performance was evaluated by implementing it in the Taos OS on the Firefly multiprocessor workstation and comparing it against the SRC RPC which was the Firefly’s native communication system. Running cross domain procedures on a single node showed that LRPC added an overhead of 48us to the latency lower bound. The throughput of LRPC as compared to the SRC RPC was shown to be 3.7 times higher when run on four processors.

Confusion:
How did LRPC take off given that it required the language optimization of using separate argument and procedure stacks? Also I did not understand the bit on how parameter copying can be avoided in “situations where the actual value of the parameter is unimportant to the server”.

Summary
The paper describes Lightweight RPC (LRPC), a facility designed for communication between protection domains on the same machine, and presents a detailed comparison of LRPC with RPC. RPC incurs high overhead for communication between protection domains on the same machine in small-kernel operating systems. LRPC achieves a higher level of performance by implementing simple control transfer, simple data transfer, simple stubs and a design for concurrency.

Problem
Small-kernel systems use distributed programming models, but the common case of communication is not across machines; it is across domains on the same machine. Conventional RPC has not exploited the fact that a cross-domain procedure call can be less complex than its cross-machine counterpart, so local communication is treated as an instance of remote communication - simple operations are put in the same class as complex ones, incurring more overhead. LRPC, an optimized version of RPC for the cross-domain case, achieves a high level of performance using new techniques.

Contribution
LRPC adopts an optimized common-case approach to communication and is a viable communication alternative for small-kernel OSs. It avoids needless scheduling, excessive runtime indirection, unnecessary access validation and redundant copying through the use of memory shared between the client and server domains (the A-stack). LRPC avoids shared data structure bottlenecks (hence avoiding lock contention) and benefits from multiprocessor speedup for better concurrency. It also reduces context-switch overhead by caching domain contexts on idle processors.

Evaluation
The authors demonstrate the viability of LRPC by implementing and integrating it into Taos, the OS of the DEC SRC Firefly multiprocessor. The simplest cross-domain call using LRPC takes 157 microseconds on a single C-VAX processor vs. 464 microseconds with SRC RPC, i.e. about three times faster. Another experiment shows throughput as a function of the number of processors simultaneously making calls: LRPC performs much better than RPC because it avoids locking shared data during call and return, removing contention on a shared-memory multiprocessor (a speedup of 3.7 on four processors, close to the maximum).

Confusions
Where do we actually use LRPC in today’s world? Also, what happens if resource gets migrated from remote server to local server or vice-versa?

Summary:

The paper describes LRPC, a communication mechanism designed for communication between domains on the same machine, which adopts the semantics and protection model of RPC and the execution model of the protected procedure call. The authors provide measurements to justify the need for a lightweight communication mechanism, followed by the implementation and performance evaluation of LRPC.

Problem:

It was observed that most communication occurred between domains on the same machine (a fact RPC ignored) and involved simple data transfer. Existing RPC systems incurred heavy overheads in this case - stubs, message buffers, and so on. This forced designers to merge subsystems into the same domain, compromising security.

Contributions:

The most attractive component of LRPC is the ability of a client thread to execute the requested procedure in the server's domain. Even though this seems like a by-product of the A-stack/E-stack architecture, it is an elegant way to cross domains without incurring heavy overheads. Binding is enabled using a clerk on the server side and an import call via the kernel. Allocation of shared memory (A-stacks) between client and server is driven by the procedure descriptors (PDs), encapsulated in a procedure descriptor list (PDL) handed to the kernel at bind time. The number of A-stacks allocated bounds the number of simultaneous activations of a server procedure, which I think is meant to limit memory use. A linkage record lets the kernel associate each A-stack with the caller's return address. Stubs in each domain are responsible for initiating an LRPC. The kernel is responsible for protection checks and for transferring control to the server stub, allocating E-stacks dynamically. One optimization is to delay A-stack/E-stack association until needed and, when the number of free E-stacks falls low, reclaim the E-stacks of unused A-stacks (sketched below). The stubs are generated in assembly language, which limits portability. Multiprocessor optimizations include a lock per A-stack queue and caching of domains on idle processors to avoid context-switch overheads and exploit locality. Facilities to inform the stubs of unnecessary parameter copying, message type checks etc. provide efficiency and protection, avoiding overheads and protecting servers from clients.
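A small C sketch of that delayed E-stack association as I understood it; the helper names are invented:

    /* Lazy A-stack/E-stack association with reclamation under pressure. */
    typedef struct EStack EStack;
    typedef struct { EStack *estack; } AStack;

    extern EStack *estack_alloc(void);
    extern int     estacks_running_low(void);
    extern void    reclaim_estacks_of_idle_astacks(void);

    EStack *estack_for_call(AStack *a) {
        if (!a->estack) {                          /* associate only on first use */
            if (estacks_running_low())
                reclaim_estacks_of_idle_astacks(); /* break unused A/E pairs */
            a->estack = estack_alloc();
        }
        return a->estack;    /* later calls through this A-stack reuse the pair */
    }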

Evaluation:

The authors provide latency figures for various workloads and different numbers of processors, including a NULL cross-domain call to expose the overhead of the mechanism itself. They also break the latency down across domains and operations. The figures show that LRPC is about three times faster than RPC and scales better on multiprocessors.

Confusions:
What are abstract and concrete threads? What are some of the examples of systems that employ LRPC in today's world? How is it implemented in systems without explicit A-stacks and E-stacks?

1. Summary
The paper illustrates an optimized communication facility for communicating between protection domains on the same machine including implementation, evaluation and uncommon case analysis.

2. Problems
In contemporary small-kernel operating systems, existing RPC systems incur an unnecessarily high cost when used for the type of communication that predominates: between protection domains on the same machine.

3. Contribution
LRPC combines the control transfer and communication model of capability systems with the programming semantics and large-grained protection model of RPC. The authors implemented and integrated LRPC into the Taos operating system. They examined the relative frequency of cross-machine activity on three different operating systems and showed that cross-domain activity dominates cross-machine activity. LRPC uses procedure descriptors and argument stacks during binding, which reduces storage needs and allows quick allocation of records; it pairs A-stacks with E-stacks to handle argument passing; it invokes the server entry stub directly, with no intermediate message examination or dispatch, reducing the cost of crossing protection boundaries; it minimizes the use of shared data structures, which increases throughput; on a multiprocessor, idle processors cache domain contexts to reduce latency and context-switch overhead; and it reduces argument copying via the A-stack.

4.Evaluation
A dedicated section evaluates the performance of LRPC. Running a large number of cross-domain calls in four different tests shows that most of the time is spent in the client and in the kernel, and that TLB misses account for a large percentage of the time. A global lock limits the throughput of SRC RPC. The authors also evaluate the uncommon cases.

5. Confusion
How could data without interpretation occur?

1. Summary
In the paper "Lightweight Remote Procedure Calls", the authors describe a communication mechanism - LRPC, which is designed and optimized for communication between protection domains on the same machine running small kernel operating systems. It uses large-grain protection model of RPC while retaining the communication model of capability systems.

2. Problem
The authors examine three operating systems to determine the relative frequency of cross-machine calls. From the results, they conclude that procedure calls across domains on a single machine occur far more frequently than calls across machines. They also observe that most calls use fixed-size parameters known at compile time and transfer a small number of bytes. Given all this, how can performance be improved without compromising protection?

3. Contributions
1) Optimizing the common case, i.e. cross-domain communication, making it fast.
2) Binding:
- The server exports an interface by registering with a clerk; the client binds to the interface by making an import call, which traps to the kernel; the clerk replies with a PDL (procedure descriptor list) containing procedure descriptor (PD) entries; the kernel pairwise allocates A-stack memory based on each PD entry; finally, the kernel returns a binding object and the A-stack list to the client.
3) Calling:
- The client makes an LRPC by calling its stub, which pushes the arguments onto an A-stack, puts the A-stack address and binding object into registers, and traps to the kernel, which performs an upcall into the server's stub at the address given in its PD.
4) Domain caching:
- LRPC reduces the context-switch overhead of moving between client and server protection domains by caching domains on idle processors: when a call is made, the kernel attempts to exchange the processors of the calling and idling threads, so that the calling thread lands on a processor that already holds the context of the server domain.
5) Copying arguments:
- Traditional RPC within the same machine involves four copies of procedure arguments (stub stack -> message in client domain -> kernel domain -> server domain -> server stack), while LRPC involves one copy (client stub stack -> shared A-stack).
6) Protection:
- The binding object is the client's key to the server's interface; since it cannot be forged, clients cannot bypass the binding phase. Clients are authorized to access the interface only if the server accepts their bind request.
7) Transparency:
- The binding object has a bit indicating whether the call goes to a remote server, so the stub transparently uses RPC or LRPC as appropriate (see the sketch below).
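A tiny C sketch of that transparency check; the field and function names are invented for illustration:

    /* First-instruction check in the stub: local or remote? */
    typedef struct { int is_remote; /* set at bind time */ } Binding;

    extern long do_lrpc(Binding *b, int proc, void *args);  /* same-machine path */
    extern long do_rpc(Binding *b, int proc, void *args);   /* network path */

    long call(Binding *b, int proc, void *args) {
        /* One branch per call keeps the choice invisible to the caller. */
        return b->is_remote ? do_rpc(b, proc, args)
                            : do_lrpc(b, proc, args);
    }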

4. Evaluation
The authors have evaluated the performance of LRPC by running tests for Null, Add, BigIn and BigInOut calls on the C-VAX Firefly using LRPC and Taos RPC. They measured the elapsed time of a call by performing 100,000 cross-domain calls and averaging. They show that LRPC (without the idle-processor optimization, on a single processor) outperforms SRC RPC (also on a single processor) by a factor of three.

5. Confusions
I am curious to know whether domain caching will actually provide drastic reduction in context switch overhead for local remote procedure calls (more specifically will there be so many idle processors that this optimization is useful).

1. Summary
This paper describes LRPCs (lightweight remote procedure calls) as implemented in the Taos multiprocessor operating system. This LRPC mechanism borrows aspects from general RPCs, yet introduces many new concepts which optimize for local, cross domain communication.

2. Problem
At the time of the paper, distributed environments functioned as large grained protection systems, and communicated between entities via RPCs. These large grained mechanisms had been transferred to small-kernels, where different components were placed in different address spaces (protection domains) for isolation and ease in debugging/validation. However, this led to a performance issue. Existing RPC systems were built with remote machines in mind, and performed no optimizations for local communication. The common case local communications incurred a high cost, so developers began to aggregate everything back into the same protection domain.

3. Contributions
LRPCs follow the basic structure from RPCs (client/server stubs and local procedure calls into these stubs). However, the amount of copying has been reduced dramatically through the use of shared-memory via the A-stack. A client stub simply copies arguments to some A-stack, which can then be directly read by the server. This reduces the number of argument copies from four (in RPCs) to one. To switch from client to server, the client thread is updated to point to a free server E-stack, and virtual memory registers are updated. This lightweight call is optimized for local calls, and is thus much faster.
As a further optimization, context switch times are reduced by running commonly accessed servers on idle processors. Clients wishing to run a procedure from that server simply perform a processor switch, and leave a client thread idling on the original processor. This can reduce TLB misses.

4. Evaluation
The authors find that on a single processor, their LRPC mechanism is about three times faster than the traditional RPC. When executing a null call, they achieve a 48us overhead on a call that takes a total of 157us. However, they find that 25 percent of time used is simply due to TLB misses. Finally, the authors show that because they avoided locking shared data, their system is able to scale with processors much more effectively than previous RPC mechanisms. They find a near-maximum speedup of 3.7 when scaling to four processors. Due to global locks, previous SRC RPC measurements show only a speedup of 2 is achievable.


5. Confusion
In the portion about passing parameters by reference, the authors mention when passing pointers, the referent is copied onto the A-stack, while the pointer is recreated in the private E-stack of the server. What happens when the data structure is larger than the A-stack? Are multiple A-stacks reserved, and do they have to be continuous (so one pointer can refer to the structure)?

Summary
Lightweight Remote Procedure Call (LRPC) is a communication mechanism designed to efficiently facilitate communication between separate domains on the same machine. The authors observed that the majority of RPCs were used to communicate between processes on the same machine. They implemented LRPC with the goal of eliminating the unnecessary overhead incurred when an RPC is used to communicate with a different domain on the same machine. They borrowed many elements of RPC, but optimized for processes on the same machine by reducing the number of times data is copied and by optimizing context switches between client and server.
Problem
LRPC attempts to mitigate the overhead of a remote procedure call that communicates with a process on the same machine. It was observed that the majority of remote procedure calls incurred the overhead of communication machinery meant for other machines while simply calling a process located on the same machine. LRPC optimizes for the RPC case where the procedure is located on the same machine as the calling process.
Contributions
Similar to normal RPC, LRPC has a mechanism for binding to a server that exports an interface. The client binds to an interface and the server responds with a procedure descriptor list, which contains the address of each procedure in the interface, the maximum number of simultaneous calls to the procedure allowed, and the size of the procedure's argument stack. The argument stack (A-stack) is used for passing arguments from the client to the server; it is memory pairwise shared between the client and server domains. Also similar to RPC, the client invokes an LRPC by calling into a stub. The stub puts the arguments on the A-stack and then traps into the kernel. The kernel verifies that the client has a valid binding and transfers control of the thread to the server: it saves the client's return address, switches the thread's stack to an execution stack in the server, and reloads the virtual memory registers with the server's. The server then calls into the appropriate procedure, running on the client's thread. Because the A-stack is shared between server and client, the data only needs to be copied once for the server to access it. Additionally, on multiprocessor systems, the context-switch overhead can be avoided if one of the processors is already idling in the domain of the server: the client's thread is simply moved to that processor. (A client's-eye sketch of this flow follows.)
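To summarize the flow from the client's point of view, a hypothetical C usage example (the import/call API names are invented; real stubs are generated):

    /* Hypothetical client's-eye view of an LRPC interface. */
    extern void *lrpc_import(const char *interface); /* bind: traps to the kernel,
                                                        returns a binding object */
    extern int   add_stub(void *binding, int x, int y); /* generated stub */

    int main(void) {
        void *b = lrpc_import("arithmetic"); /* one-time binding to the server */
        int sum = add_stub(b, 1, 2);         /* per call: copy args to the A-stack,
                                                trap, run in the server domain, return */
        return sum == 3 ? 0 : 1;
    }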
Evaluation
The authors measured the performance of their system by using LRPC, LRPC/MP (multiprocessor), and Taos RPC to call four different procedures: Null, Add, BigIn, BigInOut. They found that LRPC is about three times faster than normal RPC, and LRPC/MP faster still. Additionally, they found that LRPC contributes only 48 microseconds of overhead, and that 25% of the call time is due to TLB misses, because every context switch invalidates the TLB.
Confusions
What are some of the advantages of LRPC vs other inter-process communication mechanisms? Is it only that LRPC provides levels of protection that other IPC mechanisms cannot? Also, do the concepts of LRPC apply at all to modern systems?

Summary:
The paper presents the design of a lightweight remote procedure call for communication between procedures in different domains of the same machine. The work stems from the observation that using existing RPC for cross-domain communication on the same system is inefficient. The proposed design simplifies many aspects of RPC, including control and data transfer, linkage and stubs, by using OS sharing mechanisms. The idea is implemented on the Taos system, and the results, compared to an existing optimized RPC, show a performance improvement of about three times.

Problem:
Traditional RPC systems, like the Xerox one we discussed previously, were not tailored to communication on the same system, leading to high overhead. To motivate their work, the authors show that most procedure calls are simple and stay within the same system, and that a few kinds of calls make up most of the transfers. This stems from the fact that small kernels built on distributed programming models commonly communicate within the same system. Since using existing RPC for cross-domain calls (even with optimization) is inefficient, the authors propose that a lightweight cross-domain RPC would significantly improve performance.

Contribution:
A lot of the LRPC design is borrowed from the existing RPC mechanism.

Control and Data transfer: Control transfer in LRPC is achieved by running the client thread in the server domain along with the arguments, returning to the calling point on completion. The transfer is mediated by the kernel, which validates the call's permissions and sets up linkage before handing control to the server's execution stack. The semantics of importing and exporting interfaces are similar to RPC.

Dynamic association: Execution stacks are dynamically associated with argument stacks to prevent E-stacks from being exhausted. When the client makes a call, a free execution stack is found in the server and the client's argument stack is associated with it. Another contribution is that LRPC is designed for multiprocessors, maintaining locks on the argument-stack queues.

Argument copying: Argument-copying latency is reduced tremendously by having the stub copy arguments directly onto the argument stack, which the server accesses in place. The return is handled similarly, by copying values onto the argument stack where the client can access them.

Evaluation:
The authors use four cross-domain procedure calls to compare with RPC, and the results for LRPC, optimized LRPC and RPC show LRPC to be about three times faster than RPC. The throughput figure shows that, since no global locking is performed over shared data, LRPC scales up to four processors whereas SRC RPC does not scale beyond two. This is because LRPC locks only the argument stacks to achieve multiprocessor operation.

Confusion:
Since A-stacks are statically sized, what happens in the case when size of arguments exceed the A-stack size?

1. Summary
LRPC is a facility for communication between protection domains on the same machine. It is optimized for the common case, where most communication is between domains on the same machine and uses small, simple arguments. It provides the coarse-grained protection and programming convenience of RPC without incurring unnecessary generic RPC overheads. Evaluation suggests that LRPC is up to 3x faster than RPC and gracefully handles uncommon cases without much performance degradation.

2. Problem
Small-kernel OSs have borrowed RPC's protection and programming model for communication between isolated subsystems. However, RPC is designed for the general case and has high overheads for small, frequent communication. Some OSs have had to coalesce weakly related subsystems into a single domain to avoid RPC overhead, sacrificing safety for performance. This paper designs a communication facility that offers the same security as RPC without its performance overheads for such common-case communication.

3. Contributions
LRPC borrows from the protected procedure call: the CALL traps to the kernel, which validates the identity of the caller and dispatches the client's thread into the server's domain. Upon completion of the server routine, control is returned to the client via the kernel. The server exports a Procedure Descriptor List containing descriptors for each procedure. During binding, the kernel retrieves these descriptors and allocates pairwise-shared memory for passing arguments and results between client and server; since this memory is natively accessible to both, performance is high. Authentication of the client is done through a binding object, which acts like a capability to the server's interface. During a call, the client stub retrieves a free A-stack (argument stack), pushes the arguments onto it and traps to the kernel. The kernel fetches a free E-stack (execution stack) for the server, updates the thread's registers to execute from the new E-stack and jumps to the entry address of the procedure. During return, the kernel looks up the client's return address in a linkage record (the return path is sketched below). To minimize context-switch overhead, a multiprocessor optimization has idle processors spin in the contexts of frequently used server domains; calling threads can then be migrated to those processors. In essence, LRPC provides the performance of shared-memory communication with the protection of RPC.
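A small C sketch of that return path through the linkage record; all types and helpers are hypothetical stand-ins:

    /* Sketch of the LRPC return trap. */
    typedef struct Domain Domain;
    typedef struct { void *return_addr; void *caller_sp; } Linkage;

    extern void load_vm_registers(Domain *client); /* switch back to client space */
    extern void set_sp(void *sp);
    extern void jump_to(void *addr);

    void lrpc_return(Linkage *lk, Domain *client) {
        /* Results already sit on the shared A-stack, so the kernel only
           restores the client's context and resumes at the call site. */
        load_vm_registers(client);
        set_sp(lk->caller_sp);
        jump_to(lk->return_addr);
    }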

4. Evaluation
The overheads of RPC/LRPC are evaluated using a NULL service that does nothing. The results suggest that LRPC achieves performance close to the hardware-imposed lower bound. Performance for a few common services is also reported and is impressive. Since LRPC is designed to avoid shared data structures, it scales much better on multiprocessor systems.

5. Confusions
In a system which does not have a separate A-stack for passing arguments/results, how much benefits will LRPC provide? Is this behavior determined by HW architecture?
The way LRPC handles domain termination is not very clear. How do we have domain threads executing even after the domain has been terminated?

Summary:
The paper describes the motivation, design, implementation and performance of lightweight remote procedure call. RPC has a huge overhead for communication between protection domains on the same machine in small-kernel operating systems; LRPC improves the performance of cross-domain procedure calls by implementing simple control transfer, simple data transfer and simple stubs.

Problem:
In a small-kernel operating system, the overhead of communication between protection domains on the same machine is high. To avoid this performance penalty, developers combined logically separate entities into a single domain, which in turn increased size and complexity. LRPC delivers a performance increase without sacrificing safety for same-machine communication.

Contribution:
1. In a language environment that supports a shared argument stack between client and server for parameter passing, LRPC removes overhead by eliminating three of the four argument copies.
2. In a multiprocessor system, LRPC reduces context-switch overhead by caching domains on idle processors, thereby avoiding TLB invalidation. This caching technique is borrowed from the Amoeba and Taos systems.
3. LRPC modified the protocol layers used in RPC: server entry stubs are now invoked directly by the kernel on a transfer, and no intermediate message examination or dispatch is required.

Evaluation:
LRPC was implemented and integrated into the Taos operating system and its performance compared to SRC RPC, the Firefly's native communication system. The simplest cross-domain call on LRPC is about three times faster than with SRC RPC. The highly optimized assembly stubs in LRPC are about four times more efficient than the stubs in SRC RPC. LRPC optimized for multiprocessors shows around 3 to 4 times the performance under workloads with different argument sizes. LRPC also improves throughput by avoiding locks on shared data on shared-memory multiprocessors: a speedup of 3.7 is achieved on a four-processor system.

Confusion:
1. I didn't clearly understand the paragraph below:
“Privately mapped E-stacks enable a thread to cross safely between domains. Conventional RPC systems provide this safety by implication, deriving separate stacks from separate threads. LRPC excises this level of indirection, dealing directly with less weighty stacks”.
2. What is the difference between message passing and restricted message passing?

Summary
The paper describes the Lightweight Remote Procedure Call (LRPC) facility, which optimizes RPC for the common case of cross-domain communication on the same machine. The authors aim to minimize the overheads associated with RPC such as stub generation, access validation, scheduling delays, context switching latency and dispatch overheads.

Problem
Small-kernel operating systems were based on distributed computing models. The various modules of the OS were placed in separate domains, and relied on RPC for secure communication. The traditional RPC implementation does not distinguish between cross-domain and cross-machine calls, which results in needlessly long latencies for local procedure calls.

Contributions
LRPC simplifies control and data transfer and relies on simple stubs to streamline local cross-domain procedure calls. The number of copy operations required for argument passing is reduced by using an argument stack shared between the client and server. LRPC exploits the concurrency of multiprocessor systems by minimizing shared data structures to ease lock contention and by caching domain contexts on idle processors, thus reducing context-switching overheads. Most importantly, it preserves the modular nature of small-kernel operating systems by providing a reliable, secure and fast communication facility.

Evaluation
The authors justify the need for optimizing the conventional RPC by showing that only 0.6 to 3 percent of RPC calls were across machines in real-world usage. The overheads introduced by LRPC are analysed for a single-processor null LRPC and add about 48us to the overall latency. The improved performance of LRPC as compared to conventional RPC is demonstrated by measuring the call throughput on a Firefly machine with four C-VAX processors. The LRPC implementation supports 3.7 times the call throughput of the conventional RPC approach.

Confusions
Could you explain the difference between explicit argument copying in the stub and message copying in the kernel? Do the observations on behavior and complexity of RPC calls hold up for modern workloads? Is LRPC still used anywhere? Won’t a lot of language environments (like C) be precluded from using LRPC due to the need for separate A and E stacks?

1. Summary
LRPC is a reduced-overhead version of RPC designed to greatly speed up local RPC calls. It removes the overhead associated with treating all RPC calls as potentially remote and allows much faster context switching.
2. Problem
Conventional RPC can serve for both local and remote communication, but has a high overhead when used locally because it treats local cross-domain communication as an instance of remote communication. In the view of the authors, this causes programmers to structure their applications in non-optimal ways to avoid the deficiency in performance. LRPC looks to solve this problem and increase the performance of local cross-domain communication.
3. Contributions
LRPC has a simpler execution model compared to full RPC. A client makes a call to a server using a kernel trap, and the kernel then dispatches the client thread straight into the server's domain after flushing the TLB and setting up the address space. After the call completes, the kernel returns execution and results back to the client's point of call.
To reduce context switching overhead, server domains are idled on unused processors which enables fast context switching into them from client threads, because the TLB does not need to be invalidated and the number of extra memory fetches that need to happen due to TLB misses is reduced.
Argument passing is simplified in LRPC by reducing the number of copy operations needed to make a call from client to server. A shared argument stack is used between client and server that removes the necessity to copy arguments into and out of the kernel domain. This can reduce the number of copy operations on a call from four down to one or two (depending on whether the arguments are mutable or not).
4. Evaluation
LRPC was evaluated on the C-VAX Firefly system and compared to Taos's native RPC implementation, SRC RPC. Overhead measurements are provided for LRPC vs. the native system, and the overhead of an LRPC call is further broken down into its components, which is good to see. The authors also measured the scaling delivered by not locking shared data structures during calls on one-, two-, three- and four-processor systems.
5. Confusion
In the multiprocessor section, the authors state that TLB misses are the cause of a lot of overhead, so they solve this by migrating the calling thread to a new processor that already has the server execution environment set up. Isn't this just moving the overhead of TLB misses out of the RPC call back to the calling thread after the return from the server, due to the migration to a different processor? I also don't understand the sentence "The idling thread continues to idle, but on the client’s original processor in the context of the client domain." Do they switch the server thread's context to that of the client thread as a "parking" place for it?

Summary:
The paper describes an optimization of RPC with an implementation called Lightweight Remote Procedure Call (LRPC). The authors note performance overheads in existing RPC systems and propose a new approach in LRPC that increases performance without sacrificing security. The general architecture of LRPC is similar to existing systems, with some small but important differences.

Problem:
RPC systems at the time were not as performant as they could be. The authors measure the overhead induced by different RPC systems and argue that because it is so high, programmers structure their applications in ways that don't make sense, combining logically separate entities into a single domain and trading safety for performance.

Contributions:
LRPC improves on other RPC systems in several ways. The client's thread executes the called procedure in the server domain in its own context, and caller and callee share an argument stack. LRPC also uses optimized stubs written in assembly, made possible by the simple job the stubs must perform given the simpler context switching and shared argument stack. There is also an optimization for the multiprocessor case, where server contexts are loaded onto idle processors, reducing context-switch overhead: the client thread can be migrated to a processor parked in the server context.

Evaluation:
The authors provide measurements of the reduced overhead that LRPC has compared to other RPC systems. To do this, they run a NULL RPC call, which sends no arguments, returns no values, and does nothing. Measuring the time this takes gives a good estimate of the overhead induced by each RPC system, and they observe much lower overheads in the LRPC case.

Confusion:
The design relies on the lack of cross-machine calls, and programmer defined interfaces. This seems so similar to dynamic linking, I’m wondering what benefits LRPC has over dynamic linking.
