Lightweight Remote Procedure Call
Lightweight Remote Procedure Call. Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. ACM Trans. on Computer Systems 8(1), February 1990, pp. 37-55.
Only one review due this week. Reviews for this paper are due Thursday 2/23 at 8:00 am.
Comments
1. Summary
This paper outlines the development of lightweight remote procedure call (LRPC), a communication mechanism between protection domains on the same machine. The authors identify opportunities for optimizing the existing mechanism (RPC) for such scenarios, and justify this with an extended evaluation of current systems showing that only a small fraction of RPC calls (roughly 0.6-5.3%) actually cross machine boundaries. They also outline how language features and system design properties are integrated to achieve an optimized solution, and they present an evaluation of their work on the Taos operating system.
2. Problem
The existing remote procedure call mechanism spent most of its time communicating between different protection domains on the same machine. In this scenario RPC incurs heavy penalties and creates a throughput bottleneck. The authors reduce these penalties by developing a mechanism (LRPC) that keeps the same API but provides faster communication between processes on the same machine (fast IPC).
3. Contributions
The authors used a systematic approach to find opportunities for optimization, both in hardware and in software, in an existing system. They provide an implementation and evaluation of their optimization (LRPC) for a realistic modern multiprocessor system. They advocate for process identifiers in TLB tags to reduce TLB-flush penalties during context switches. Their methods, which include collapsing the stack, using high-level language features to provide both security and speed, avoiding extra copies of the same data, passing arguments through shared address-space buffers, and handling pointer arguments gracefully, have stood the test of time and are, in my opinion, this paper's biggest contributions. They have also built LRPC to serve as a fast and secure same-machine IPC mechanism.
4. Evaluation
The paper evaluates the existing implementations of RPC in a broad range of systems, but evaluates its own solution only for Taos on a Firefly machine (to limit engineering effort). The improvements shown are drastic. But, as usual, since they limit their implementation to one type of system (using exotic language, hardware, and OS features), it is hard to judge the relevance of the methods or their readiness for use in other kinds of systems.
5. Confusion
How dependent are they on the OS/language stack they are using?
The authors mention that they wanted to develop a faster IPC mechanism in LRPC, but why is this not stated more explicitly in the paper?
Are RPC/LRPC the forefathers of the concepts of both lambdas and micro-services? (The more I read, the more I am inclined to believe so.)
Why did the Taos OS die? (It seems like something way ahead of its time, yet very relevant today.)
Posted by: Akhil Guliani | February 23, 2017 09:14 AM
Summary
The paper discusses the design, implementation, and evaluation of Lightweight Remote Procedure Call (LRPC) for small-kernel operating systems. It also points out the inefficiency of cross-domain RPC in several systems and shows how it can be optimized with the proposed LRPC mechanism. The main features of LRPC are simple control transfer, simple data transfer (using argument stacks), optimized stubs, and support for multiprocessor execution, which together allow many improvements over the traditional RPC mechanism while retaining its security and transparency.
Problem
The conventional techniques before LRPC had not fully exploited the fact that a cross-domain procedure call can be considerably less complex than its cross-machine counterpart, leaving many opportunities for optimization and performance improvement. LRPC addresses this problem.
Contributions
The execution model of LRPC is borrowed from protected procedure calls, and its programming semantics and large-grained protection model are borrowed from RPC.
The LRPC binding mechanism uses a clerk for exporting interfaces and for accepting incoming bind requests from clients. The clerk enables a binding by replying to the kernel with the Procedure Descriptor List (PDL) that the exporter maintains for every interface. The PDL and its Procedure Descriptors store the server domain entry points, the maximum number of simultaneous calls allowed from a client, and the size of each argument stack (A-stack). The A-stack is memory shared between client and server: arguments and return values pass through it directly, without intermediate kernel copies. A linkage record, maintained by the kernel, identifies the caller's return address, while the Binding Object is the client's key to accessing the server's interface. The server provides an E-stack on which the dispatched thread executes. LRPC also uses a domain-caching optimization on multiprocessors, keeping frequently used server domains cached on idle processors to reduce context-switch overhead.
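For concreteness, here is a minimal C sketch of how those binding-time structures might fit together. The paper's implementation was in Modula2+ and assembly, so every name and field below is illustrative, not taken from the actual code.

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per procedure in an exported interface. */
    struct procedure_descriptor {
        void  *entry;          /* entry-stub address in the server domain     */
        size_t astack_size;    /* bytes of argument/return data per call      */
        int    simul_calls;    /* A-stacks to allocate for simultaneous calls */
    };

    /* Registered with the kernel by the server's clerk at export time. */
    struct pd_list {
        int count;
        struct procedure_descriptor *pds;
    };

    /* Kernel-only record of how to find the caller again on return. */
    struct linkage_record {
        void *return_addr;     /* caller's return address */
        void *caller_sp;       /* caller's stack pointer  */
    };

    /* Handed to the client on a successful bind; presented on every call
       so the kernel can validate access to the server's interface. */
    struct binding_object {
        uint64_t key;          /* unforgeable capability-like token */
        struct pd_list *interface;
    };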
Evaluations
The authors first point out the inefficiencies of the traditional RPC model by studying the V System, Taos (the Firefly operating system), and UNIX+NFS. They demonstrate that most RPC-based communication is local and simple; these results form the basis for the LRPC design.
LRPC performance was evaluated on the Firefly machine: even without the domain-caching optimization, LRPC call execution was found to outperform Taos by a factor of three. A breakdown of call execution overhead revealed lower overhead for LRPC. They observe the remaining LRPC overhead by running a single-processor Null test, where the overhead is mainly due to binding and linkage management.
They also show the efficiency of domain caching on the multiprocessor C-VAX, and how LRPC handles cases such as distinguishing cross-domain from cross-machine calls, A-stack management, and failure handling.
Confusions
I did not completely understand the case where parameters are passed by reference.
Also, how is locking done in a multiprocessor system? Is guarding the A-stack queue sufficient?
Posted by: Om Jadhav | February 23, 2017 08:06 AM
1. Summary
This paper introduces Lightweight RPC as an alternative to RPC, which incurs high cost when used to communicate between protection domains on the same machine. Lightweight RPC optimizes RPC performance for local calls by combining the communication model of capability systems with the protection model of RPC.
2. Problem
Conventional small-kernel operating systems at the time of this paper borrowed the large-grained protection model of RPC and incurred its high overheads, because they treated local cross-domain communication as an instance of remote communication. This also led to poor structure. The authors proposed LRPC after recognizing that most RPCs are between protection domains on the same machine, so the overheads associated with RPC, such as stub execution, message buffers, access validation, and context switches, can largely be avoided.
3. Contributions
LRPC avoids the overheads of RPC while preserving safety and performance by employing four techniques: simple control transfer, simple data transfer, simple stubs, and a design for concurrency.
a. In LRPC, the client's thread executes the procedure in the server's protection domain. Handoff scheduling enables direct control transfer without a full pass through the scheduler (see the sketch after this list).
b. An argument stack (A-stack) is shared between the client and the server, which avoids redundant copying of data.
c. The stub generator produces run-time stubs directly in assembly language, so they are highly optimized.
d. The LRPC design is concurrency friendly and avoids unnecessary lock contention by minimizing shared data structures.
e. Control transfer is kept cheap by lazily allocated execution stacks (E-stacks).
f. Idle processors in a multiprocessor system are used to cache domains, reducing context-switch latency.
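As a rough illustration of points (a) and (e), the kernel's call path might look like the following hedged C sketch. The helper routines and field names are invented for illustration; the paper gives the design, not this code.

    struct thread { void *user_pc, *user_sp; };

    /* Assumed primitives standing in for real kernel machinery. */
    extern int   binding_valid(void *binding);
    extern void *entry_stub(void *binding, int proc);
    extern void *lazily_alloc_estack(void *binding, int proc);
    extern void  record_linkage(struct thread *t, void *astack);
    extern void  load_server_address_space(void *binding);
    extern void  upcall(void *pc);              /* resume in user mode */

    /* Trap handler for an LRPC call: the caller's own thread crosses
       into the server domain; no scheduler pass, and no server thread
       is woken (handoff scheduling). */
    void lrpc_call(void *binding, int proc, void *astack, struct thread *t)
    {
        if (!binding_valid(binding))            /* access validation  */
            return;                             /* raise call-failed  */

        record_linkage(t, astack);              /* kernel-only return info */
        t->user_sp = lazily_alloc_estack(binding, proc); /* run on E-stack */
        load_server_address_space(binding);     /* switch protection domain */
        upcall(entry_stub(binding, proc));      /* enter the server's stub  */
    }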
4. Evaluation
The evaluation was carried out on the Taos OS with 100,000 cross-domain calls in a loop, and it showed that LRPC achieves roughly a threefold performance improvement over conventional RPC systems. Reduced lock contention and low overhead were demonstrated by analyzing the Null LRPC. The performance improvement of LRPC on multiprocessor systems is also demonstrated.
5. Confusion
a. How abrupt domain termination is handled in LRPC was not clear.
b. How exactly does caching of domains on idle processors work?
Posted by: Lokananda Dhage Munisamappa | February 23, 2017 07:59 AM
1. summary
The paper proposed LRPC, which optimizes remote procedure call for the case where both ends of the communication are on the same machine. Their method improves the performance of cross-domain RPC while retaining its safety and transparency.
2. Problem
For fine-grained protection, small-kernel systems tried to use RPC, but this resulted in performance loss. Moreover, most communication traffic in such systems is cross-domain and simple.
3. Contributions
a). LRPC achieves a significant performance improvement while retaining the safety and transparency of RPC.
b). They revealed that most communication in operating systems is cross-domain and simple.
c). It combines the execution model of protected procedure calls with the large-grained protection model of RPC.
d). Simple control transfer (the client's thread executes directly in the server), simple data transfer, and simple stubs (optimized, generated in assembly).
e). Multiprocessor support.
4. Evaluation
They evaluated the performance of LRPC on a single machine and the call throughput on a multiprocessor, achieving a 3x speedup on a single processor. A baseline Null call result is provided, and they break the operation down to show where the overhead comes from.
5. Confusion
Does the assumption about RPC usage still hold today? Is LRPC still necessary given the appearance of high-speed networks in this era?
Posted by: Jing Liu | February 23, 2017 05:17 AM
1. Summary
This paper talks about LRPC, designed for efficient communication between protection domains on the same machine. LRPC is a modified version of RPC, and it provides various performance improvements for cross-domain communication by reducing the overheads involved in traditional RPC.
2. Problem
RPC was designed for providing communication across a network. RPCs are good when a large amount of data must be transferred, since the overhead cost becomes much smaller than the actual data transfer time. Conventional RPC involves many overheads, such as stub execution, message buffers, and access validation. Data transfer across protection domains on the same machine is much smaller and simpler, so RPC can be adapted to this case by cutting those overheads to improve performance.
3. Contribution
The authors show that only a small portion of RPCs are actually remote; most calls are for operations between protection domains on the same machine. Most parameters used in cross-domain RPCs are small and simple and do not require all the processing needed for a cross-machine RPC. The following contributions were made to make the system efficient:
a. Simplified control transfer, by having the client's thread execute the requested procedure in the server's domain.
b. Data transfer was made simple by employing a mechanism similar to that used by procedure call. Redundant data copying was also avoided by using a shared argument stack between client and server.
c. Highly optimized stubs could be designed based on the simplified control and data transfer mechanisms.
d. Shared data structures were avoided which reduced the use of synchronization primitives and resulted in better performance.
4. Evaluation
The authors compared the performance of LRPC and RPC. Four different tests were run on the C-VAX Firefly using LRPC and Taos RPC; in all of them LRPC performed almost three times better than RPC. Call throughput for single-processor and multiprocessor configurations was also measured, and here too LRPC outperformed RPC by roughly a factor of three.
5. Confusion
a. The paper mentions that a stack of linkage records is necessary for performing multiple procedure calls. How can a process perform a second call without returning from the first?
b. Can you please explain how domain caching on idle processors works?
Posted by: Gaurav Mishra | February 23, 2017 03:07 AM
Summary: Remote procedure calls (RPCs) are useful for communication between processes on the same or different machines. However, RPCs sacrifice performance in providing this generality. Lightweight RPC addresses this issue and provides more optimized communication between protection domains on the same machine. This paper discusses the design and implementation of LRPC and then evaluates its performance. LRPC improves performance by avoiding excessive overheads such as scheduling, access validation, copying, and synchronization wherever they are not required.
Problem: Most communication in an OS is between protection domains on the same machine; cross-machine communication is comparatively rare. Even in a distributed environment, systems try to localize computation to exploit data locality, caching, and other features, so communication is mainly across domains on the same machine. Conventional RPC facilities involve complex data structures and are costly for cross-domain communication. These problems motivated the authors to design a communication facility that provides simple control and data transfer and is optimized for cross-domain communication.
Contributions:
1. The client executes the procedure in the server's protection domain. Handoff scheduling enables direct control transfer without a full pass through the scheduler.
2. Argument stacks shared between client and server obviate unnecessary data copying.
3. The stubs generated are in assembly language and are highly optimized.
4. LRPC design is concurrency friendly and avoids unnecessary locking contention overheads by reducing shared data-structures usage.
5. Lazy binding of E-stacks to A-stacks minimizes the total number of call stacks required.
6. Improves performance by reducing context switches by caching domains on idle processors in case of multiprocessors.
7. Provides flexibility and a low-cost implementation by assigning a bit in the binding object to indicate the type of call, i.e., RPC or LRPC (sketched below).
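Point 7 can be pictured as a one-branch prologue in the client stub. This is a hypothetical C rendering (the real stubs were generated in assembly), with all names invented:

    enum { ADD };                                /* illustrative proc index */

    struct binding { unsigned remote : 1; };     /* bit set at bind time    */

    extern int  rpc_remote_call(struct binding *b, int proc, int x, int y);
    extern int  lrpc_kernel_trap(struct binding *b, int proc, int *astack);
    extern int *astack_acquire(struct binding *b, int proc);

    int add_stub(struct binding *b, int x, int y)
    {
        if (b->remote)                            /* the single-bit test     */
            return rpc_remote_call(b, ADD, x, y); /* full RPC machinery      */

        int *as = astack_acquire(b, ADD);         /* pop a free A-stack      */
        as[0] = x;                                /* one argument copy:      */
        as[1] = y;                                /* caller stack -> A-stack */
        return lrpc_kernel_trap(b, ADD, as);      /* kernel domain transfer  */
    }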
Evaluation: The LRPC design was evaluated with four tests (a null cross-domain call, a procedure taking two 4-byte arguments and returning one 4-byte result, a procedure taking one 200-byte argument, and a procedure taking and returning one 200-byte argument) run on the C-VAX Firefly using LRPC and Taos RPC. LRPC shows a performance improvement of up to three times over traditional RPC. Reduced lock contention makes LRPC more scalable than RPC. The paper does not quantify the contribution of each individual factor to the improvement, and it also does not say much about safety.
Confusion:
1. Domain termination part for RPC and LRPC is confusing.
2. How do they manage synchronization between different processors while caching domains?
Posted by: Rahul Singh | February 23, 2017 02:49 AM
Summary:
This paper presents the design of Lightweight Remote Procedure Call (LRPC), a communication facility specially optimized for cross-domain communication on the same machine. It takes advantage of the fact that such communication is simple and mostly handles small value parameters, and thus achieves higher performance by eliminating the overhead of using traditional RPC for this type of communication.
Problem:
Using traditional RPC for cross-domain communication is unnecessarily expensive, which has forced many system designers to combine weakly related subsystems into a single protection domain, compromising safety. The authors also observe that cross-domain communication is much more frequent than cross-machine communication and that complex parameters are rarely used. This leads to the design of LRPC, which targets this common case.
Contributions:
> LRPC achieves higher performance without compromising safety.
> The client's thread executes the desired procedure in the server's domain via a simple control transfer.
> Pairwise allocation of A-stacks for argument sharing between client and server eliminates redundant data copying while maintaining safety.
> The caller's return address is stored in a linkage record accessible only to the kernel.
> Extra message passing and verification are eliminated.
> This simplicity enables the stub generator to produce run-time stubs in assembly language, which contributes to the performance improvement.
> Throughput is increased by minimal use of shared data structures on the domain-transfer path.
> Context-switch overhead and latency are reduced because domain contexts are cached on idle processors.
Evaluation:
The authors conduct experiments on three operating systems to better understand the relative frequency of communication. The percentage of operations that cross machine boundaries is small (0.6-5.3%). Their analysis of the size and complexity of cross-domain calls reveals that the majority of transfers involve fewer than 200 bytes and that simple byte copying is sufficient.
LRPC is evaluated on the C-VAX Firefly and is found to perform three times better than Taos RPC. With the idle-processor optimization, performance is even higher.
Confusion:
Need more clarity on how context-switch overhead is reduced by using idle processors.
Posted by: Pallavi Maheshwara Kakunje | February 23, 2017 02:06 AM
1.Summary
The paper describes the design and implementation of Lightweight Remote Procedure Call, a communication mechanism by which two protection domains on the same machine can communicate with less overhead than conventional RPC.
2. Problem
In small-kernel architectures, using RPC (which was designed for communication between processes on two different machines) as the medium of communication between protection domains on the same machine leads to high overhead (stub execution, message buffers, access validation, message transfer, scheduling, context switch, dispatch). To avoid this communication overhead, weakly related domains are sometimes merged into a single domain, compromising security in order to achieve better performance.
3. Contributions
LRPC uses four main techniques to reduce the overhead incurred by communication over conventional RPC.
→ Simple control transfer - As in RPC, the server exports interfaces and clients bind to an interface before using it. But here control is transferred to the server by pointing the thread's user stack pointer at an execution stack (E-stack) in the server's domain.
→ Simple data transfer - An argument stack (specific to a procedure) is shared between the client and the server, through which arguments and return values are exchanged. Each A-stack is associated with a linkage record that stores the client's return address, so control can be transferred back to the client directly from the server.
→ Simple stubs - The stub generator emits highly optimized stubs.
→ Design for concurrency - Shared data structures are avoided.
LRPC also takes advantage of multiprocessor architectures to avoid context-switch cost: a calling thread can be moved to an idle processor already running in the server's context, and moved back once the server is done. This also decreases TLB misses and thus increases performance (see the sketch below).
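A hedged C sketch of that processor exchange follows; the helper names are invented, since the paper describes the idea only at the design level.

    struct cpu; struct thread; struct domain;

    extern struct cpu *idle_cpu_in_domain(struct domain *d);   /* assumed */
    extern void        exchange_processors(struct thread *t, struct cpu *c);

    /* On an LRPC call, prefer an idle CPU whose MMU context is already
       the server's: the caller migrates there (and back on return), so
       neither CPU reloads virtual-memory state or suffers TLB misses. */
    int call_via_cached_domain(struct thread *caller, struct domain *server)
    {
        struct cpu *idle = idle_cpu_in_domain(server);
        if (idle == NULL)
            return 0;               /* fall back to a normal domain switch */
        exchange_processors(caller, idle);
        return 1;
    }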
4. Evaluation
The LRPC implementation on the C-VAX Firefly performs three times better than the traditional RPC implementation. Four test cases (Null, Add, BigIn, and BigInOut) were used to evaluate performance. It is also shown that LRPC call throughput increases with the number of processors, whereas RPC throughput stays flat beyond two processors; this limit is explained by a global lock held on the RPC transfer path.
5. Confusion
Please elaborate on what exactly happens to threads when the domain they belong to terminates abruptly.
Posted by: Sowrabha Horatti Gopal | February 23, 2017 12:39 AM
1. Summary
The paper describes a mechanism similar to RPC to make procedure calls between different protection domains on the same machine while ensuring both safety and performance.
2. Problem
Small-kernel operating systems have an advantage over monolithic kernels in that they offer fine-grained protection through separate protection domains. But communicating between these domains via procedure calls was difficult to implement efficiently. Remote procedure calls, which were primarily designed for distributed computing environments, can also be used for communication between protection domains, but they involve unnecessary overheads when passing data between clients and servers on the same machine.
3. Contributions
The paper motivates the problem by analyzing the frequency of cross-machine RPC activity and finding it to be generally low. It then identifies several overheads in conventional RPC systems that can be avoided for cross-domain calls on the same machine. Performance is improved through several optimizations:
- Control transfer is simplified by making the client thread execute the procedure in the server's domain
- An argument stack (A-stack) is shared between the client and the server to reduce redundant copying of data
- Stubs are highly optimized for same-machine calls
- Idle processors in a multiprocessor system are used to reduce latency due to context switch.
This design gains performance from simple control transfer using execution stacks (E-stacks) and reduces data copying by sharing data structures between domains.
4. Evaluation
The design is evaluated using microbenchmarks, and the overhead of LRPC is broken down. It is shown that LRPC can achieve up to 3 times performance benefits compared to conventional RPC systems. The paper does not evaluate the effects of these optimizations on the uncommon case of cross-machine communications though.
5. Confusion
What is the limit on number of simultaneous calls initially permitted to procedures by clients? Doesn't a client block until a procedure call completes? How can it perform these multiple calls simultaneously?
Posted by: Suhas Pai | February 23, 2017 12:14 AM
1. Summary
The paper talks about Lightweight RPC, designed to optimize communication between protection domains on the same machine. This is done by combining control transfer and communication model of capability systems with semantics and protection model provided by RPC.
2. Problem
The authors recognize that most RPCs in operating systems are between domains on the same machine (cross-domain) rather than between domains on separate machines (cross-machine). This dominance of cross-domain communication arises primarily because operating systems localize processing and resources to achieve acceptable performance. Additionally, most communication is simple rather than complex, since local operations rarely involve complex parameters; marshaling complex arguments as RPC does is therefore undesirable.
3. Contributions
Compared to RPC, LRPC significantly improves the performance of cross-domain communication while still guaranteeing safety and transparency. This is primarily due to the following four techniques:
i) simple control transfer, where the client's thread executes the procedure call in the server's domain;
ii) simple data transfer, where a shared argument stack (A-stack) is used for sharing between client and server, eliminating the redundant data copying seen in RPC;
iii) simple stubs: the use of simple control and data transfer in LRPC facilitates the generation of highly optimized stubs, and on a control transfer the kernel invokes the server stub directly, reducing crossover cost;
iv) design for concurrency: LRPC benefits from the speedup potential of multiprocessors and avoids shared data structure bottlenecks.
4. Evaluation
LRPC was implemented on the Taos OS. LRPC performed three times better than both RPC and restricted message passing. With 100,000 cross-domain calls made in a loop, the average cost was computed. On both single-processor and multiprocessor systems, LRPC incurs relatively low overhead and has greater call throughput.
5. Confusion
I do not completely understand what happens on domain termination. How does it differ between RPC and LRPC?
Posted by: Dastagiri Reddy Malikireddy | February 22, 2017 11:18 PM
1. Summary
This paper introduces lightweight RPC that is designed for same-machine, different protection domain communication with small and simple arguments. This LRPC reduces the overhead in traditional RPC systems and can be used in small kernel operating systems.
2. Problems
Small-kernel operating systems want to separate components into disjoint domains and use RPC to communicate between those domains. This makes the OS easier to design, debug, and validate, but the large overhead of traditional RPC systems keeps OS designers from doing it. So this paper designs a lightweight RPC mechanism for same-machine communication with little overhead that still maintains the semantics of previous RPC systems.
3. Contributions
This paper finds that in most workloads very few RPC calls cross machine boundaries, and most RPC calls have simple, small arguments, so it is unnecessary for every RPC call to go through full data marshaling and network transfer. LRPC is a combination of remote and protected procedure calls. It follows the semantics of RPC: servers export interfaces, clients bind to them, and communication crosses protection domains. Its execution resembles a protected procedure call via a kernel trap. During binding, the server exports an interface through a clerk, and the client connects to the clerk through the OS kernel; the result of the bind includes procedure descriptors containing entry addresses in the server domain, argument stacks, and so on (see the sketch below). During a call, the client puts arguments on a shared argument stack and traps into the kernel. The kernel verifies identities and switches to the server's address space to execute the procedure. The paper also covers stub generation, multiprocessor locking, and argument-copying techniques that maintain the semantics and improve the performance of LRPC.
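A small C sketch of that export/bind handshake, as I read it; the function names and the interface string are made up for illustration.

    struct binding;                        /* opaque token from the kernel */

    extern void            clerk_export(const char *name, void *pdl);
    extern struct binding *kernel_bind(const char *name);

    /* Server side: the clerk hands the kernel a procedure descriptor
       list (entry stubs, A-stack sizes) when exporting an interface. */
    void server_startup(void *pdl)
    {
        clerk_export("example_service", pdl);
    }

    /* Client side: the kernel notifies the clerk, which replies with the
       PDL; the kernel then allocates A-stacks, maps them into both
       domains, and returns an unforgeable binding object to the client. */
    struct binding *client_startup(void)
    {
        return kernel_bind("example_service");
    }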
4. Evaluation
This paper compares LRPC with Taos RPC on four workloads. In all experiments LRPC takes about one third of the time of Taos RPC, and with the idle-processor optimization LRPC can be even faster on multiprocessors. LRPC is also more scalable than RPC on multiprocessors: it achieves an almost linear increase in call throughput, whereas RPC throughput does not increase much beyond two processors. The paper also includes a breakdown analysis of LRPC overhead.
5. Confusion
(1) How is the linkage record (the caller's return address) used? How does an LRPC return? I think section 3.2 makes the call path of LRPC very clear, but the return path is less clear, e.g., when the linkage record is consulted to make sure the LRPC returns correctly.
(2) In section 3.2, LRPC updates the user stack pointer to the E-stack and reloads the processor's virtual memory registers. What is the difference between this and a context switch? It seems very similar to a conventional context switch.
Posted by: Tianrun Li | February 22, 2017 11:05 PM
1. Summary
The main drive behind this paper is describing the lightweight remote procedure call (LRPC), a modified version of a typical RPC geared towards cross-domain calls. The paper opens with an introduction to the problem before going into the implementation of LRPC and finishing with performance data.
2. Problem
The problem presented by this paper is that RPCs are not actually well tuned to typical communication traffic in operating systems. RPCs focus on cross-machine communication and passing complex parameters, which is contrary to typical usage characteristics according to the paper. By focusing on those aspects, RPCs hurt performance to the point where system designers combine subsystems to help performance at the cost of safety.
3. Contributions
This paper proposes LRPC to address the common case of simple parameter passing in cross-domain communication. To remove overhead, LRPC lets the client's own thread be dispatched by the kernel into the server domain to run. Because simple arguments are the common case, LRPC passes arguments through an A-stack shared with the server domain, reducing the number of argument copies from four down to one. The LRPC stubs blur the boundary between protocol layers: the server stubs are invoked directly by the kernel, cutting the cost of crossing between layers. To accommodate cross-machine calls, the paper notes that conventional RPC can still be used, with the LRPC stub branching to an RPC stub for cross-machine calls at negligible overhead. LRPC also takes advantage of multiprocessors, using idle processors to cache domain contexts and thereby reducing the need for context switches.
4. Evaluation
The evaluation section consists of a few tests comparing LRPC to Taos RPC, and it shows significant performance gains for LRPC. It probably would have been better to present a more robust testing regime to prove the point, though the inclusion of a Null call test to expose LRPC's overhead was really informative. For multiprocessor environments, the evaluation shows that LRPC continues to gain linearly with added processors whereas RPC flattens after two.
5. Confusion
I did not quite understand how domain context caching works or how it would be implemented without causing significant overhead.
Posted by: Brian Guttag | February 22, 2017 10:54 PM
1) Summary
In LRPC, the authors propose a lightweight version of RPC optimized for IPC on the same machine. The authors show that most RPC communication is local and with small arguments. Thus, they propose a design that optimizes these cases.
2) Problem
RPC is a useful way of representing communication between different processes. It allows the programmer to focus more on algorithmic design than networking boilerplate.
Since RPC can be on the critical path of protocol performance, RPC implementations have been heavily optimized. However, the authors show that the common case behavior of RPC is actually local communication in systems that use RPC as an IPC mechanism. Thus, a significant performance improvement can be gained by improving this case's performance.
3) Contributions
The authors' first contribution is an analysis of RPC use cases. They find that the overwhelming majority of RPC in distributed applications is to other processes on the same machine. They also find that small messages dominate.
The authors suggest that instead of using traditional RPC for local communication, a great deal of overhead can be removed by simply using shared memory to implement local RPC calls. This simple idea produces roughly a 3x performance improvement.
In particular, the authors propose using a shared-memory argument stack to pass arguments from client to server and to return values back to the client. Server execution stacks and argument stacks can be reused to reduce memory overhead (see the sketch below).
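One way to picture that reuse is a per-procedure LIFO free list of A-stacks managed by the client stub. The following C sketch is illustrative only and elides the per-list lock the real system keeps:

    #include <stddef.h>

    struct astack {
        struct astack *next;           /* free-list link                 */
        char data[256];                /* argument/return area; the size
                                          is fixed per procedure at bind */
    };

    static struct astack *free_astacks;  /* one list per procedure,
                                            guarded by its own lock      */

    struct astack *astack_acquire(void)
    {
        struct astack *a = free_astacks;  /* LIFO: reuse the most recently */
        if (a != NULL)                    /* freed stack, which is likely  */
            free_astacks = a->next;       /* still warm in the cache       */
        return a;                         /* NULL => caller must wait      */
    }

    void astack_release(struct astack *a) /* after the call returns */
    {
        a->next = free_astacks;
        free_astacks = a;
    }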
Furthermore, the authors build into their design a number of optimizations and tricks. They avoid context switches by "caching" server processes. They also attempt to reduce the memory overhead of allocating RPC stacks.
4) Evaluation
The authors do a good job of motivating their optimization. They spend significant time demonstrating and analyzing common case behavior. They break down where overhead exists. This analysis sets the "big picture" that guides all of the design decisions and optimizations in the rest of the paper. They also explain how their optimizations accomplish this goal.
However, their evaluation section seems deficient. The microbenchmark results are valuable because they show where latency is reduced in a single RPC call, but the authors never evaluate performance on a large system with complex behavior. For example, they could have evaluated their implementation using the same workloads they used for their initial analysis of common-case behavior.
5) Confusion
I did not understand their context switch optimization exactly. I understood that the technique is used to avoid context switching between client and server, but how does this happen exactly?
Posted by: Mark Mansi | February 22, 2017 07:52 PM
1. Summary
RPC systems result in high cost when used to communicate between protection domains on the same machine. Lightweight Remote Procedure Call helps overcome this high overhead of same-machine communication by combining the communication model of the capability systems and protection model of RPC, hence promoting both safety and performance.
2. Problem
Remote procedure calls provide a large-grained protection mechanism, i.e., protection boundaries are defined by machine boundaries. Since supporting fine-grained protection, as capability systems do, was difficult and tedious, small-kernel operating systems adopted the large-grained protection mechanism. This seemed suitable even for managing subsystems (both control transfer and communication) that are not remote, i.e., that are on the same machine but in different domains. The approach had a major flaw: it treated local communication the same as remote communication, complicating simple operations.
3. Contributions
Small-kernel operating systems that borrowed the large-grained protection model of RPC violated one of the basic principles of system design: optimizing the common case. Local communication was treated as generic and handled the same way as a remote call. LRPC takes this into account and streamlines cross-domain communication without compromising safety or transparency, implementing simplified mechanisms for control transfer, data transfer, stub generation, and concurrency.
LRPC simplifies the binding operation by maintaining a Procedure Descriptor List (PDL), argument stacks, and linkage records, and it uses a Binding Object as the token for accessing the server's interface. It reduces call and control-transfer overhead through lazily allocated A-stacks/E-stacks. LRPC optimizes stubs by generating them directly in assembly language, which also blurs the boundaries between protocol layers to reduce the cost of crossing them. LRPC also accounts for shared-memory multiprocessors and increases throughput there by caching domains on idle processors.
4. Evaluation
The paper gives a systematic analysis of both RPC and LRPC. The drawbacks of RPC systems are studied methodically at the beginning of the paper to motivate the implementation of Lightweight RPC: the pattern of RPC calls is established by studying the cumulative distribution of calls against the amount of data transferred, and performance overhead is measured across different systems. After discussing the implementation of LRPC, the authors present its performance on four different tests. They also show the minimal overhead imposed by LRPC by analyzing the Null LRPC. The performance improvement of LRPC on multiprocessor systems is demonstrated as well. Finally, the authors discuss how LRPC fares in some uncommon cases.
5. Confusion
1. The paper mentions that procedures in the same interface whose A-stacks are of similar size can share A-stacks. How is this made possible?
2. Can you please explain how LRPC handles the uncommon case of domain termination?
Posted by: Sharath Hiremath | February 22, 2017 06:31 PM
1. Summary
This paper introduces lightweight remote procedure call (LRPC), an optimization of RPC for processes on the same machine. The key features are optimized binding and calling built around A-stacks and E-stacks.
2. Problem
The problem the authors identify is that most RPCs actually happen between domains on the same machine rather than on different machines (roughly 95-99% local in Table 1). In addition, RPC arguments are usually of simple types (e.g., booleans, integers) and small size (Figure 1). RPC at the time had not been optimized for same-machine communication: in Table 2 the authors show that for a Null RPC (an RPC that does nothing), conventional overheads (stub execution, message copies, scheduling, dispatch, etc.) dominate execution time (>=70%).
3. Contributions
The first contribution is optimized binding. At bind time, the kernel allocates argument stacks (A-stacks) and maps them into both client and server virtual memory. The kernel also arranges for execution stacks (E-stacks) in the server, associated with A-stacks lazily, on which calls later execute, and it records the client's return address. Because A-stacks and E-stacks are set up through binding rather than per call, the client and server avoid stack-allocation overhead at call time. For each call, argument copying is reduced to a single copy, from the client's stack to the A-stack, when the client invokes the call. Since the A-stack is shared between client and server, the server does not need to copy the data again (though the authors note this extra copy cannot be eliminated when arguments must reside on the E-stack during execution). E-stacks are private to the server, preserving protection between client and server.
The second contribution is optimized calling. As the authors state on page 43, conventional RPC implements a call by first blocking the client thread, then selecting a server thread to execute, and finally unblocking the client thread when execution finishes. This synchronization is unnecessary when the client and server threads are on the same machine. When the client issues an LRPC, it traps into the kernel (like a system call); the kernel verifies the client's arguments and return address, then coordinates the domain switch and dispatches the server's handler (stub) directly. This eliminates scheduling overhead in the kernel and dispatch overhead in the server.
4. Evaluation
The authors implemented their design on the C-VAX Firefly. To test its efficiency, they ran four calls (Null, Add, BigIn, BigInOut) 100,000 times each on a single node, for their system (LRPC/MP and LRPC) and a traditional one (Taos RPC). The results (Table 4) show their system is generally at least three times faster than traditional RPC. They also measured call throughput as the number of processors increases; LRPC achieves considerably more throughput than Taos RPC because it avoids locking shared data during call and return.
5. Confusion
I don't understand the upcall at the top of page 46 ("perform an upcall into the server's stub ..."). What is an upcall beyond scheduling the server's stub to run?
Posted by: Cheng Su | February 22, 2017 03:58 PM
Summary:
This paper presents the lightweight remote procedure call, a communication facility designed and optimized for communication across protection domains on the same machine. It combines the control transfer and communication model of capability systems with the semantics of RPC, and improves performance by a factor of three for the common case of passing small, simple arguments.
Problem:
Most communication in an operating system is between domains on the same machine and has simple, small arguments. However, traditional RPC systems treat local cross-domain communication as an instance of remote communication, and thus suffer performance loss and structural deficiencies.
Contributions:
1. The authors demonstrate with UNIX+NFS (among other systems) that most calls stay on the same machine and most arguments are simple and small, with plain byte copying sufficient for data transfer between domains. They also compare the overhead of local communication in conventional RPC systems against the theoretical minimum.
2. LRPC uses a procedure descriptor list (PDL), maintained by the exporter of every interface, for binding. For each procedure descriptor, the kernel allocates A-stacks shared between the client and server domains for argument passing and return values. Each procedure is represented by a call stub in the client's domain and an entry stub in the server's. The kernel keeps a linkage record per A-stack holding the caller's return address. A client makes an LRPC by calling into its stub, which manages the A-stacks allocated for the called procedure as a LIFO queue; the stub pushes the arguments onto an A-stack and traps to the kernel. The kernel verifies the caller, records the execution state of the calling thread, updates the thread's stack pointer to the corresponding execution stack in the server's domain, and transfers into the server domain. The server processes the call and initiates the return transfer by trapping to the kernel (see the sketch after this list). In this way LRPC reduces argument copying (the server reads directly from the shared A-stack) and avoids creating a new thread in the server domain (the calling thread continues executing there).
3. LRPC increases throughput by reducing the use of shared data structures; each A-stack queue is guarded by its own lock. LRPC also reduces context switching by caching domains on idle processors: on a call, the kernel checks whether an idle processor is in the context of the server domain, and if so, places the calling thread on that processor.
4. Whether a call is cross-domain or cross-machine is decided by the first instruction of the stub.
5. On server domain termination, the kernel scans the domain's threads looking for threads running on behalf of an LRPC and restarts them in their clients with a call-failed exception; it also invalidates the relevant binding objects and any active linkage records.
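Sketching the return half of item 2 in hedged C (invented names): the server's stub traps back into the kernel, which can trust the linkage record because only the kernel could have written it.

    struct thread  { void *user_pc, *user_sp; };
    struct linkage { void *ret_addr, *caller_sp; void *caller_domain; };

    extern struct linkage *linkage_for(void *astack);          /* assumed */
    extern void load_address_space(void *domain);
    extern void resume_user(struct thread *t, void *pc);

    /* Return transfer: no access re-validation is needed, since the
       linkage record was written by the kernel at call time and is not
       visible to either domain. Results already sit in the A-stack.  */
    void lrpc_return(struct thread *t, void *astack)
    {
        struct linkage *lr = linkage_for(astack);
        t->user_sp = lr->caller_sp;            /* back to caller's stack  */
        load_address_space(lr->caller_domain); /* back to caller's domain */
        resume_user(t, lr->ret_addr);          /* continue after the call */
    }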
Evaluation:
The authors compare LRPC with Taos RPC and show that LRPC is roughly three times faster. In a multiprocessor environment, LRPC's throughput scales almost linearly with the number of processors, while RPC's throughput stops scaling beyond two processors.
Confusion:
How much benefit can we get from domain caching, given that each call eventually returns from the server domain? In other words, are there enough idle processors sitting in the context of the server domain? Or is the processor only temporarily idle (for example, during a memory access), with the new thread preempting the original one rather than gaining parallelism by running on another processor?
Posted by: Yanqi Zhang | February 22, 2017 08:52 AM
Summary:
The paper presents a communication facility, Lightweight RPC, designed to improve communication performance between protection domains on the same machine. LRPC combines the execution model of the protected procedure call from capability systems with the programming semantics and large-grained protection model of RPC.
Problem:
Most communication traffic in an OS is cross-domain and involves small amounts of data. The problem is that conventional RPC systems treat local communication as remote communication and carry high overheads, leading to performance loss.
Contributions:
The client's thread executes the requested procedure in the server's domain.
Parameters are passed between client and server through a shared argument stack, eliminating redundant data copies; pairwise allocation of argument stacks preserves safety.
The use of these simple data- and control-transfer techniques facilitates the generation of highly optimized stubs.
LRPC has a three-layer communication protocol: end-to-end, stub-to-stub, and domain-to-domain. The stubs blur the boundaries between the protocol layers to reduce the cost of crossing between them.
LRPC minimizes the use of shared data structures to avoid lock contention so as to increase throughput.
Domain contexts are cached on idle processors to reduce context-switch overhead so as to improve LRPC latency.
Evaluation:
The paper presents experimental results to validate these arguments. Only 0.6-5.3% of RPCs are cross-machine in the three operating systems studied, and the majority of calls transfer less than 200 bytes of data. The paper also presents the Null-call overheads of six RPC systems.
LRPC is evaluated on the C-VAX Firefly and compared with Taos RPC. LRPC performs three times better than Taos even without the idle-processor optimization; with it, the results are better still. For the Null LRPC, the slight overhead above the theoretical minimum is attributed to TLB misses.
Confusion:
I did not understand the section on Domain Termination.
Posted by: Neha Mittal | February 20, 2017 10:36 PM
1. Summary
LRPC allows for RPC-style calls between protection domains on the same machine with drastically reduced overheads.
2. Problem
Small-kernel OSs often implement calls between services running locally on the same machine using standard RPC systems, which were originally designed for communication between machines. These systems perform many copies and do not leverage the kernel for local communication. One workaround is to lump services together into a single protection domain to avoid the cost of RPC, but this sacrifices robustness and sidesteps the actual problem: the overhead of the RPCs themselves.
3. Contributions
The authors identify an important area of optimization for small kernels: interprocess communication between services.
The authors identify the specific sources of IPC overhead incurred by contemporary RPC systems, focusing on the issues of multiple copies, scheduling overhead, and verification.
They contribute a concrete mechanism (the LRPC) to provide optimization without greatly disturbing the infrastructure already in place. This mechanism uses the kernel to provide server and client with shared memory buffers to cut down on the number of copies required. It uses the kernel to provide a lightweight call-and-return system. It uses capability-style permissions to verify communications when the communication link is initially set up between two processes.
4. Evaluation
The authors first present the motivation by examining the overheads and patterns observed in RPCs on contemporary systems. They conclude that a large majority of the RPCs executed by the V System, Taos (the Firefly OS), and Sun UNIX+NFS are local to a machine, and then show that these local RPCs usually have small, simple arguments. They proceed to illustrate the heavy overheads of each OS's RPC system by measuring the latency of a Null RPC, which does no work and returns immediately. These tests show that the runtime overheads are very significant compared to the theoretical minimum number of operations.
They then break down the gains their system makes over conventional RPC: fewer copies, and very little time spent executing stub and kernel code. The fine-grained locks in LRPC let the system scale with the number of processors.
They admit that their system performs poorly in special cases such as large message sizes and many channels of communication (as measured by the number of A-stacks), but decline to provide performance numbers for these overheads.
5. Confusion
They talk about how lightweight their stub code is, compiled down to machine code and not leveraging complicated runtime checks. However, they also discuss how the Modula2+ system enforces type safety at runtime, which seems to be a contradictory statement. This is probably because I don’t know about Modula2+.
Posted by: Mitchell M | February 20, 2017 09:43 AM