
U-Net: A User-Level Network Interface for Parallel and Distributed Computing

Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain Resort, Colorado, December 1995, pages 40-53.

Comments

1. Summary
The paper points out that the current practice of placing network communication modules in the operating system kernel is problematic: the processing overhead in the kernel limits end-to-end performance, causes inflexibility, and makes new protocol support impractical. The paper presents U-Net, a new communication architecture that allows protocols to be constructed at user level.

2. Problem
Placing network communication functions in the operating system kernel causes three major issues:


  1. Performance - In traditional networking architectures, messages from user applications undergo several copies and cross several levels of abstraction to reach the network interface device driver, incurring substantial processing overhead. The software path traversed by messages at end hosts is becoming the new bottleneck for local-area communication.

  2. Flexibility - Because the kernel lacks application-specific information about the communication, that information cannot be used during protocol processing to achieve higher efficiency and flexibility.

  3. New protocols - It is hard to support new protocols or new message send/receive interfaces, such as Active Messages, when all protocol processing is placed in the kernel.

3. Contributions
This paper proposes U-Net, a user-level communication architecture independent of the network interface hardware. The architecture consists of three major parts:


  1. Endpoints - Each application has an endpoint serving as its handle into the network.

  2. Communication segments - A region of memory in an endpoint that holds message data.

  3. Message queues - A receive queue, a free queue, and a send queue that hold descriptors for messages received or to be sent.


With this architecture, the paper describes how to send and receive messages, how to multiplex and demultiplex messages, and how to avoid an extra message copy into the kernel. It also introduces kernel emulation of U-Net, which deals with the issue of scarce U-Net endpoints, and the direct-access U-Net architecture, which allows senders to transfer message data into application data structures without any intermediate copy into a buffer.
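As a rough illustration of how these three parts fit together, here is a minimal C sketch. All names, sizes, and queue operations are hypothetical, not the paper's actual interface:

```c
#include <stddef.h>
#include <stdint.h>

#define SEG_SIZE  4096   /* communication segment: pinned memory holding message data */
#define QUEUE_LEN 32     /* capacity of each message queue */

/* A descriptor names a region of the communication segment. */
typedef struct {
    uint32_t offset;     /* where the message data starts in the segment */
    uint32_t length;     /* message length in bytes */
} descriptor_t;

/* A ring of descriptors; an endpoint has one each for send, receive and free. */
typedef struct {
    descriptor_t entries[QUEUE_LEN];
    size_t head, tail, count;
} queue_t;

/* An endpoint: the application's handle into the network. */
typedef struct {
    uint8_t segment[SEG_SIZE];
    queue_t send_q, recv_q, free_q;
} endpoint_t;

int queue_push(queue_t *q, descriptor_t d) {
    if (q->count == QUEUE_LEN) return -1;  /* queue full */
    q->entries[q->tail] = d;
    q->tail = (q->tail + 1) % QUEUE_LEN;
    q->count++;
    return 0;
}

int queue_pop(queue_t *q, descriptor_t *out) {
    if (q->count == 0) return -1;          /* queue empty */
    *out = q->entries[q->head];
    q->head = (q->head + 1) % QUEUE_LEN;
    q->count--;
    return 0;
}
```

Sending then amounts to writing data into `segment`, pushing a descriptor onto `send_q`, and letting the NI consume it; receiving is the mirror image through `free_q` and `recv_q`.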

4. Evaluation
The paper presents two U-Net implementations, one on the SBA-100 and another on the SBA-200, and benchmarks them. Section 7 shows that bandwidth utilization is greatly improved with U-Net UDP and U-Net TCP compared with Fore UDP and Fore TCP, and that round-trip latencies are greatly reduced with U-Net as well.

5. Confusion



  1. For kernel emulation of U-Net, I'm wondering who decides whether a message to be sent goes to a real U-Net endpoint or an emulated endpoint? It seems every application would want to send on a real endpoint since it provides better performance, even if the application doesn't actually benefit from the improvement.


  2. Is placing protocol processing in the OS kernel rather than at user level a sort of violation of the end-to-end principle in networks?


1. Summary
The paper presents the design and implementation of U-Net, which provides user processes a virtual view of the network interface, giving them access to high-speed communication devices and removing the kernel from the communication path while still providing protection.

2. Problem
To minimize the processing overhead of sending and receiving packets at the endpoints, with the primary goal of efficient low-latency communication and a secondary goal of delivering maximum bandwidth even when processing small packets.

3. Contributions
The U-Net architecture virtualizes the network interface by allowing each process to create one or more endpoints, which contain a communication segment for storing data sent and received, and message queues holding descriptors for messages sent, messages received, and communication segments that are free. To send a message, the user process uses a free region in the communication segment to hold the data to be transmitted, composes a descriptor containing metadata (the offset into the communication segment and the length), and places the descriptor in the send queue. U-Net multiplexes the send queues of all endpoints onto the network fabric. Incoming messages are demultiplexed by U-Net, based on the destination tag, into the corresponding endpoint's receive queue. U-Net supports polling, where the user process polls the receive queue for incoming messages, and notifications, where the user process registers an upcall with U-Net. U-Net supports two forms of data copy: base-level, where sending and receiving involve one copy into/from the communication segment, and direct-access, where the communication segment spans the entire address space, letting received messages be deposited anywhere in it.
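The send path described here can be sketched in a few lines of C. This is a toy model with invented names; the real interface is the NI's queue protocol, not a function call:

```c
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 4096
#define QLEN 16

typedef struct { uint32_t offset, length; } desc_t;

static uint8_t segment[SEG_SIZE];    /* communication segment (pinned for NI DMA) */
static desc_t  send_q[QLEN];
static int     sq_head, sq_tail;

/* Compose a message: one copy into a free region of the segment, then post a
 * descriptor with (offset, length); the NI later DMAs that region to the wire. */
int unet_send(uint32_t offset, const void *data, uint32_t len) {
    int next = (sq_tail + 1) % QLEN;
    if (next == sq_head || offset + len > SEG_SIZE)
        return -1;                         /* send queue full, or bad region */
    memcpy(&segment[offset], data, len);   /* the single copy of base-level U-Net */
    send_q[sq_tail] = (desc_t){ offset, len };
    sq_tail = next;
    return 0;
}
```

Direct-access U-Net removes even this `memcpy` by letting the descriptor point anywhere in the address space.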

4. Evaluation
The authors implement U-Net on two commodity ATM network interfaces, and, also, implement higher layer protocols like Active Messages and TCP/UDP on IP on top of U-Net interface. The authors measure the latency and throughput achieved at various messages sizes on raw U-Net, Active Messages, UDP and TCP, and show that they are substantially better than their conventional implementations in the kernel. They also run Split-C benchmarks (programs to matrix multiply, sort, etc.) to show that they perform well on realistic workloads.

5. Confusion
The authors brush aside the issue of how the multiplexing tags are set up. How are U-Net-like interfaces used in real-world deployments, if any, in conjunction with the now-default TCP/IP protocol (which they regard as legacy)?

Summary:

This paper attempts to create a communication architecture whose performance is close to the hardware limit. The key idea is to remove protocol processing from the kernel and expose it to applications. The paper acts as a proof of concept, with the authors implementing existing protocols like UDP and TCP to demonstrate performance.

Problem:

As pointed out, the main issue this paper tries to solve is communication that isn't handicapped by the kernel. In addition, the authors target the common case of communication consisting mostly of small messages rather than streams of data.

Contributions:

The biggest contribution of this paper is that the authors were able to implement a communication architecture that needs very little kernel support. Their implementation presents applications with an abstract view of the network through a user-level interface, and they achieved this using off-the-shelf components.

They use descriptor queues to handle messages, which gives them a few benefits. The descriptor queues provide flow control, and signals indicating when the receive queue is almost full help them detect overflows. A further optimization allows small messages to be carried in the descriptors themselves, without needing space in the communication segment.
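The free-queue side of this flow control can be sketched as follows. This is a toy model; the names and the watermark value are invented:

```c
#include <stdint.h>

/* Flow control via the free queue: the receiver pre-posts buffers; when the
 * free queue runs dry the NI must drop the message, which is how U-Net bounds
 * receive-side memory use without involving the kernel per message. */
#define QLEN 8
static uint32_t free_q[QLEN];   /* offsets of free receive buffers in the segment */
static int free_count;

/* Application pre-posts a free receive buffer. */
void post_free(uint32_t offset) {
    if (free_count < QLEN) free_q[free_count++] = offset;
}

/* NI takes a free buffer for an arriving message; -1 means overflow/drop. */
int take_free(uint32_t *offset) {
    if (free_count == 0) return -1;
    *offset = free_q[--free_count];
    return 0;
}

/* A low-water check like this could drive an "almost full" signal. */
int free_low(void) { return free_count <= 1; }
```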

They implemented a tag system which, together with the addressing provided by endpoints, protects messages when multiple processes access the network simultaneously. By designing their architecture to work with kernel-emulated endpoints, they were able to provide virtual endpoints to user applications and work with limited resources.

Being able to implement TCP and UDP is another big win for the authors: it allowed them to compare against baseline performance and show that their communication architecture is actually useful.

Evaluation:

As the authors point out in the introduction, they wanted to focus on smaller messages, and their test methods reflect this. They compare their UDP and TCP implementations with the native ones and show that they were able to use more of the potential bandwidth provided by the hardware.

In addition, they execute a few parallel benchmarks and compare the performance of their off-the-shelf components with the CM-5 and CS-2. Their numbers are impressive compared with these systems, but they test only a few benchmarks.

Confusion:

The authors make a big deal of true zero-copy, where no intermediate buffer is used. How does a zero-copy architecture deal with page faults if the communication latency is lower than the time needed to service a page fault?


Summary:
This paper proposes the design of the U-Net communication architecture, which gives user-level processes access to high-speed communication devices. This allows the construction of user-level protocols whose performance is on par with the capabilities of the underlying network. The paper further evaluates this design in different scenarios.

Problem:
The bottleneck in local-area communication is the software path traversed by messages through the kernel at the sending and receiving ends. This path involves several copies and the crossing of multiple layers of abstraction between device drivers and user applications. Thus, the processing overhead of communication exceeds the network latency, and user applications are unable to leverage the available high-speed networks. The problem is most visible for small messages, which constitute a large percentage of the messages exchanged over the network.

Solution:
The idea for solving the small-message end-to-end latency problem caused by processing overhead is to remove the kernel from the critical path of sending and receiving messages. To do this, U-Net multiplexes the physical NI among the user processes accessing the network while enforcing protection boundaries and limiting resource consumption. The three key components of the U-Net architecture are:
- Message queues: the send, receive, and free queues at each endpoint. Packets in the queues are flagged for processing via descriptor flags; this indication can be polling-based or event-driven.
- Endpoints: each application's handle into the network.
- Communication segments: regions of memory that hold message data.
Another feature U-Net offers is true zero copy, allowing packet data to be copied directly into application data structures, which is useful for systems like Split-C.
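A minimal sketch of the polling-based indication on the receive queue (hypothetical names; the real mechanism is ownership flags inside the descriptors themselves):

```c
#include <stdint.h>

#define QLEN 16
typedef struct { uint32_t offset, length; } desc_t;

/* Receive queue shared between the NI (producer) and the process (consumer). */
static desc_t recv_q[QLEN];
static volatile int recv_head, recv_tail;

/* Called by the NI (or kernel emulation) after depositing a message
 * into the communication segment at the given offset. */
void ni_deliver(uint32_t offset, uint32_t length) {
    recv_q[recv_tail] = (desc_t){ offset, length };
    recv_tail = (recv_tail + 1) % QLEN;
}

/* Polling receive: returns 1 and fills *d if a message is pending, else 0.
 * An event-driven variant would instead register an upcall with U-Net. */
int unet_poll(desc_t *d) {
    if (recv_head == recv_tail) return 0;
    *d = recv_q[recv_head];
    recv_head = (recv_head + 1) % QLEN;
    return 1;
}
```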

Evaluation:
Evaluation of U-Net on the SBA-100 and SBA-200 shows that the round-trip latency for packets smaller than 40 bytes is considerably reduced. The TCP, UDP, and Active Messages implementations over U-Net show latencies close to the absolute minimum.

Learnings/Confusions:
Instead of giving user processes access to the NI, could we solve this problem by coalescing small packets at the network interface level? How viable is that for the given problem? It would not solve the flexibility problem, though.

Summary:
This paper introduces the U-Net communication architecture. It provides processes with a virtual view of a network interface to enable user-level access to high-speed communication devices, removing the kernel from the communication path while still providing full protection.

Problem:
The bottleneck in local-area communication has shifted to the software path traversed by messages at the sending and receiving ends. In UNIX, the message path through the kernel involves several copies and crosses multiple levels of abstraction between the device driver and the user application, resulting in high processing overhead.

Contribution:
(1) U-Net is composed of three main parts: endpoints, communication segments, and message queues. Outgoing messages are composed in the communication segment and a descriptor is pushed onto the send queue. Incoming messages are demultiplexed by U-Net based on their destination, and receiving can be done either by polling or event-driven notification. Batching is used to optimize performance.
(2) A tag in each incoming message determines its destination endpoint. An operating system service performs the necessary authentication and authorization checks. Limited access to endpoints, communication channels, and message queues, together with message tagging, enforces protection boundaries among processes.
(3) Direct-access U-Net supports true "zero copy" by allowing communication segments to span the entire process address space and by letting the sender specify the data offset.
(4) Base-level U-Net buffers message data and pins the communication segments in physical memory. As an optimization, the send/receive queues may hold entire small messages instead of pointers to the data.
(5) Kernel-emulated U-Net endpoints are supported to reduce resource consumption.
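Point (2) can be illustrated with a small sketch of the NI's tag table (names invented; on ATM the tag is the VCI):

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHANNELS 8

/* The NI keeps a table mapping each registered tag to the endpoint that owns
 * it; the OS fills this in at channel-setup time, after performing the
 * authentication and authorization checks. */
typedef struct { uint32_t tag; int endpoint_id; } channel_t;

static channel_t channels[MAX_CHANNELS];
static size_t nchannels;

int register_channel(uint32_t tag, int endpoint_id) {
    if (nchannels == MAX_CHANNELS) return -1;
    channels[nchannels++] = (channel_t){ tag, endpoint_id };
    return 0;
}

/* Demultiplex an incoming message: the tag alone decides which endpoint's
 * receive queue the message lands in, so no process sees another's traffic. */
int demux(uint32_t tag) {
    for (size_t i = 0; i < nchannels; i++)
        if (channels[i].tag == tag) return channels[i].endpoint_id;
    return -1;  /* unknown tag: message is discarded */
}
```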

Evaluation:
The paper gives two implementations of U-Net on different hardware, with a cost breakdown for a single-cell round trip. The relationship between round-trip time/bandwidth and message size is shown. The Split-C benchmarks show the computation and communication performance characteristics of the CM-5, CS-2, and a U-Net ATM cluster. The TCP and UDP modules implemented for U-Net using base-level U-Net functionality have lower latency and higher bandwidth.

Confusion:
The paper introduces two implementations based on different hardware. Does that mean U-Net needs a different version for each device? Given the large number of hardware variants, the implementation work could be huge.

1. Summary

U-Net is a system to implement networking without the usual overhead of most network protocols and implementations. It has a simple interface, suitable for tunnelling other protocols, or the creation of new protocols, and offers high throughput and low latency for small messages as compared with more traditional protocols.

2. Problem

In RPC-style use cases, the large majority of messages are small, and thus the overhead of the network stack has an important impact. Most modern protocols are designed with bandwidth for large messages in mind and pay little attention to per-message overhead, since it is only a marginal cost in those use cases. Thus they perform badly with smaller messages.

3. Contributions

From the perspective of the application, the data to be sent is set up in a communication segment, and a descriptor for it is added to a send queue. On the receive side, incoming packets are placed by the U-Net implementation into the communication segment and added to the receive queue, at which point the process can either poll for the message or receive a notification of a new message.

The fact that the buffers used for sending and receiving are in large part controlled by the application allows for true zero-copy operation in some cases, since the application may construct and use the data in place, removing any need to copy into an internal data structure. In the base implementation, due to size and location restrictions on the communication segment, the use of this is limited. With Direct Access U-Net, which is a superset of the base implementation, the communication segment is allowed to be the entire address space, thus allowing U-Net to copy a received message directly into the appropriate place inside internal data structures, further reducing the cases where copying via an intermediate buffer is required. This, though, requires significant hardware support, in particular some sort of IOMMU to translate addresses.

The flexibility of the message format allows the use of more familiar protocols like TCP and UDP tunneled over U-Net, while still achieving very low latency. In addition, for protocol development it offers a clean interface, without the hassle of writing kernel code.

One way they implemented U-Net was with a network card running custom firmware, which understood the U-Net protocol and automatically routed messages to the correct buffers in memory. They also demonstrate an implementation that is mostly in the kernel: once for a network card that lacks the hardware to support the firmware implementation, and again to conserve hardware resources while still letting applications that do not require the performance of direct access run over U-Net.

4. Evaluation

They quite successfully demonstrate the flexibility of the base protocol by implementing both a new protocol, U-Net Active Messages and two old protocols, TCP and UDP on top of U-Net without large losses of efficiency. In particular, in the case of UAM, they demonstrate only slightly higher round trip times than raw U-Net.

They then further use these implementations to successfully demonstrate that TCP and UDP over U-Net can have much lower round trip latency and especially in the case of small message sizes, much higher throughput, than their conventionally implemented counterparts.

They do mention that, because this requires DMA-able pages for many of the structures, there is a limit on the number of concurrent users of U-Net, and they suggest some sort of dynamic allocation of DMA space. That would likely hurt performance, perhaps significantly, and seems to be the largest fault in the U-Net concept. Alternatively, it may be feasible to show that under a reasonable server workload the amount of DMA space used is small enough.

5. Confusion

I don't think I understand exactly what their Split-C benchmarks charts are attempting to say. In particular, since they are comparing different systems, it is not clear to me how the data is renormalized, and how best to interpret it.

Summary
The paper presents U-Net, which relegates the act of sending and receiving network communication from the kernel to user space. For small messages in particular, the kernel-stack operations of buffer management, message copies, checksumming, flow-control handling, interrupt handling, and network interface control can add significant overhead for application domains that require low latency. If applications are given complete access to the network interface, then better buffer management, protection, and much tighter application-protocol integration can be achieved. Various implementations on top of U-Net are described, and detailed benchmarks are provided. Higher-level protocols like TCP and UDP are also implemented over U-Net on ATM channels, and it is shown that the link bandwidth can be achieved even at small message sizes.

Problem
The overhead that the kernel adds to small-message network communication is starkly visible in applications that require low latency. There is demand for low-latency communication, especially in local-area settings, for exploiting the full link bandwidth even with small messages, and for much more efficient protocol-application integration than a conventional kernel-based networking stack allows. This is addressed by moving the network interfacing mostly from the kernel to userspace, which has been shown to lower latencies considerably.

Contributions
The main focus of U-Net revolves around two needs: low latency and high bandwidth with small messages, and flexibility in protocol design and integration on off-the-shelf hardware and software. Low latencies have been achieved on parallel-processing machines, but only through extensive changes to the hardware and operating system interfaces. U-Net achieves this on regular hardware and software by cutting down the myriad layers of abstraction in the kernel and using simpler userspace mechanisms. Endpoints serve as an application's handle into the network; they contain communication segments, which are regions of memory holding message data, and message queues, which hold descriptors for messages that are to be sent or that have been received. Each process that wishes to access the network first creates one or more endpoints, then associates a communication segment and a set of send, receive, and free message queues with each endpoint. Multiplexing and demultiplexing of data flows from the various applications is handled with channel identifiers. The kernel's role is only to establish a route as the underlying substrate demands (ATM does this) and to make sure all security constraints are satisfied before returning a channel identifier (the VCI in the case of ATM) to the requesting application. U-Net is defined at two levels. One is called base-level, in which "zero copy" is achieved; this is really one copy, because data is copied at least once between the application's data structures and a buffer in the communication segment. Base-level U-Net can be implemented even when the hardware is not very customizable. The other level is the direct-access U-Net architecture, which allows message data to be copied directly into the application's data structures without any intervening copy into a buffer.
This in essence requires the network interface to have some kind of memory management and a wide enough bus to address an application's entire address space, because the application's data structures could lie anywhere inside it. Applications that do not need the performance of a real communication segment of their own can use kernel emulation of U-Net, in which the kernel owns a single real endpoint and multiplexes all flows onto it.
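The kernel emulation described at the end can be sketched roughly like this (a hypothetical structure; the real kernel path also copies data between the emulated endpoints and the single real one):

```c
/* Kernel emulation: many emulated endpoints share one real endpoint. The
 * kernel tracks which emulated endpoint each flow belongs to so it can route
 * deliveries back, trading the performance of a real endpoint for capacity. */
#define MAX_EMU 16

typedef struct {
    int in_use;    /* slot allocated to some process */
    int pending;   /* messages delivered but not yet consumed */
} emu_endpoint_t;

static emu_endpoint_t emu[MAX_EMU];

/* Allocate an emulated endpoint; returns its id, or -1 if none are free. */
int emu_open(void) {
    for (int i = 0; i < MAX_EMU; i++)
        if (!emu[i].in_use) { emu[i].in_use = 1; emu[i].pending = 0; return i; }
    return -1;
}

/* Kernel's demultiplexing step for the shared real endpoint: route one
 * incoming message to the emulated endpoint it belongs to. */
int emu_deliver(int id) {
    if (id < 0 || id >= MAX_EMU || !emu[id].in_use) return -1;
    emu[id].pending++;
    return 0;
}
```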

Evaluation
According to the article, U-Net has been implemented on two hardware network interfaces, the SBA-100 and the SBA-200. The first relies on kernel emulation, while the second implements base-level U-Net in custom firmware on the NI. SBA-100 latencies are about 66 microseconds, while SBA-200 latencies are about 65 microseconds. Various frameworks and protocols have been benchmarked on this hardware: round-trip latency is about 71 us for U-Net Active Messages, 72 us for a Split-C store, and 138 and 157 us for UDP and TCP respectively.

Confusion
Can U-Net be used as a backbone mechanism for remote memory access using very fast substrates like Infiniband to achieve scalable NUMA?

1. Summary
    This paper presents U-Net, which gives user-space applications access to the network interface. This not only gives applications the flexibility to tailor protocols to their needs but also improves efficiency by reducing the processing overhead of sending and receiving messages, especially small ones. Evaluations measure the latencies and bandwidth of the U-Net architecture and compare standard protocols like TCP and UDP implemented over U-Net with their kernel counterparts.

2. Problem
    The central idea of U-Net is to remove the kernel from the path of sending and receiving messages and to allow the communication layers used by each process to be tailored to its own demands. This reduces the processing overhead of sending and receiving messages and provides flexible access to the network interface. The paper highlights three goals:
1) reduce the latency of communication in local-area settings
2) deliver the full bandwidth of the network even with small messages, by overcoming per-message overheads
3) promote the development of novel communication protocols by giving applications greater control

3. Contributions
    While much previous work had already proposed user-level network interfaces, this was the first architecture that required no custom hardware or OS modification and that supports traditional protocols as well as parallel-language implementations. Emphasis is laid on protocol design and integration flexibility with standard protocols and off-the-shelf hardware.
    Another important aspect that sets U-Net apart from related work is its focus on low latency and high bandwidth with small messages. This is of increasing importance, as small messages appear in many use cases: transferring objects across the network, maintaining cache consistency in distributed services, administrative traffic in distributed systems (authentication, protection, etc.), software fault-tolerance algorithms, and group communication tools.
    By removing the kernel from the communication path and placing control of the network in software's hands, the U-Net architecture has arguably been a motivation for today's Software-Defined Networks (SDN).
    The U-Net architecture virtualizes the network interface so that each process has an illusion that it owns the interface to the network. Protection is assured through kernel control only during the set-up and tear-down of the communication channel.
    The approach proposed in this paper is to move message multiplexing and demultiplexing into the network interface, using message tags to ensure protection, and to move all buffer management and protocol processing out of the kernel to user level. The architecture comprises three components: endpoints, which serve as a program's interface to the network; communication segments, which are regions of memory holding message data; and the message queues (send, receive, and free), which hold descriptors for outgoing and incoming messages.

4. Evaluation
    The two main objectives of providing efficient low-latency communication and a high degree of flexibility are accomplished. The processing overhead on messages is minimized so that the latency experienced by the application is dominated by the actual message transmission time.
    Experiments measure the round-trip latency and bandwidth of the U-Net implementations. For messages smaller than 40 bytes, the round-trip latency is about 65 µs, which compares favorably with parallel machines.
    The U-Net architecture was able to exploit the full 140 Mbps bandwidth of the ATM network. Using Split-C, a parallel language, the performance of seven benchmark programs on the U-Net ATM cluster rivals that of parallel machines like the CM-5 and Meiko CS-2.
    Standard protocols like TCP and UDP implemented over U-Net achieve latencies and throughput close to the raw maximum, and Active Messages round-trip times are only a few microseconds over the absolute minimum.

5. Confusions
    Does U-Net suit only parallel and distributed applications, as the title of the paper suggests? Is it because small messages play a very important role in such systems?
    Is the lack of support for integrating kernel and application buffer management (or its high overhead), which especially affects reliable protocols that rely on message retransmission, a major setback for U-Net?

1. Summary
This paper proposes a new network architecture, U-Net, that gives user-space applications direct access to network interfaces without kernel involvement, minimizing processing overhead and achieving the full bandwidth of the network fabric. It guarantees protection and embraces flexibility, supporting traditional as well as novel protocols without custom hardware or OS modification.

2. Problem
The latency of message communication includes the processing overhead of wrapping and unwrapping packets through software layers, plus the network latency of transmission through the fabric. As network fabrics improve, packet processing on send and receive becomes the bottleneck to lowering latency further, especially for small packets. Packet processing in the kernel, as in the conventional architecture, hurts performance at large scale in parallel and distributed systems. We need a new architecture that provides safe, flexible, low-latency, high-bandwidth communication.

3. Contributions
Each process has an endpoint as its entry point into the U-Net architecture. The endpoint includes a communication segment that stores messages, and receive/send/free queues holding descriptors that point to messages received, to be sent, or free for reuse. Each message has a tag specifying routing information, including the destination endpoint, and is sent to its destination once a communication channel has been registered for it. U-Net multiplexes the real network interface among endpoints and demultiplexes incoming messages to their destinations.
U-Net supports two modes. Base-level ("zero copy") copies each message into a networking buffer before sending, while direct-access (true zero copy) accesses messages in place via descriptors. Small messages are optimized by being stored directly in descriptors, without using the communication segment. Emulated endpoints are mapped onto one real endpoint by kernel multiplexing.
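The small-message optimization mentioned here might look like this in outline (the cutoff and layout are invented, not the paper's):

```c
#include <stdint.h>
#include <string.h>

/* Descriptor variant: small payloads ride in the descriptor itself, so the
 * communication segment (and one level of indirection) is skipped entirely. */
#define INLINE_MAX 8   /* hypothetical cutoff; the real value is NI-specific */

typedef struct {
    uint8_t  is_inline;
    uint32_t length;
    union {
        uint32_t offset;            /* normal case: data lives in the segment */
        uint8_t  data[INLINE_MAX];  /* small case: data lives right here */
    } u;
} desc_t;

desc_t make_desc(const void *msg, uint32_t len, uint32_t seg_offset) {
    desc_t d = { .is_inline = (len <= INLINE_MAX), .length = len };
    if (d.is_inline)
        memcpy(d.u.data, msg, len);  /* tiny message: embed in the descriptor */
    else
        d.u.offset = seg_offset;     /* large message: point into the segment */
    return d;
}
```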

4. Evaluation
The paper shows two implementations of U-Net, using the SBA-100 and the SBA-200. For U-Net on the SBA-100, only round-trip latency is measured. For U-Net on the SBA-200, various message sizes are compared on UAM, UAMxfer, and raw U-Net. It also shows the fiber saturating while the message size is still small. Personally, I don't understand why the evaluation of U-Net on the SBA-100 makes sense here: it is not tested at various message sizes, and a throughput comparison with the SBA-200 alone does not make the point.
UAM is fully evaluated for single cell round-trip latency, bulk transfer round-trip latency, bulk store bandwidth, and bulk get bandwidth, which shows UAM is comparable to Raw U-Net.
The authors also evaluate U-Net on three machines using 7 Split-C benchmarks. However, they do not analyze the results, e.g., why is the CM-5 slower yet has lower overheads than the other two?
For the legacy protocols TCP/IP and UDP/IP, the evaluation shows U-Net reaching the bandwidth limit much earlier as message size increases, while Fore ATM/Ethernet fails to saturate the fiber even with large messages.

5. Confusion
The paper implies that the kernel determines the route, sets up the channel, and performs authentication and authorization. The process still needs to trap into kernel mode for this, which could also be a bottleneck. Why is this architecture so much faster than the traditional one? More specifically, why does moving the protocol stack into user space lower latency so much? Merely avoiding the trap into the kernel should not make such a big difference.

Summary:

The paper presents U-Net (User-Level Network Interface), a model for reducing the latency involved in processing messages sent across networks. It does so by moving the kernel out of the common-case path and providing direct communication between the user application and the network interface, thereby avoiding unnecessary traps into the kernel.

Problem:

Today, most messages passed across networks are small, so most of the time is spent processing messages at the sending and receiving ends rather than actually transferring them across the network. This paper tries to solve this problem by reducing the overhead of message processing.

Contributions:

The model proposes moving networking protocols out of the kernel, giving applications the flexibility to develop and use protocols that suit their needs. The problems arising in this approach were identified as:
- multiplexing the network
- providing protection across processes
- providing fair use of the network among processes, since it is a limited resource

U-Net provides direct communication between applications and the NI (network interface) for sending and receiving messages. It does so by providing send, receive, and free queues along with a communication segment (together called an endpoint) that lie in the user process's address space. Before a process can access the network, it must create an endpoint consisting of these queues and a communication segment. To send a message, the process places a descriptor in the send queue pointing to a location in the communication segment. To receive a message, the application either polls the receive queue, which contains descriptors pointing into the communication segment, or registers for events with the NI. Hence the unnecessary copy into kernel buffers is avoided, and the NI reads and writes directly from and to these communication segments.

U-Net uses tags to multiplex and demultiplex messages. These tags are registered with U-Net using channel identifiers. When sending a message, the NI inserts the appropriate tag based on the channel identifier; when receiving, the NI identifies the destination channel based on the tag. An operating system service provides route discovery as well as authentication and authorization checks.
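The tag-based mux/demux described above amounts to two lookup tables maintained by the NI, populated at (kernel-mediated) channel setup time. A minimal sketch, with invented names and tag values:

```python
class ToyMux:
    """Toy demultiplexer: the NI stamps outgoing messages with the
    channel's tag and uses the tag on incoming messages to pick the
    destination endpoint (channel ids and tags are illustrative)."""
    def __init__(self):
        self.by_channel = {}  # channel id -> tag
        self.by_tag = {}      # tag -> destination receive queue

    def create_channel(self, chan_id, tag, recv_queue):
        # kernel-mediated setup: registers the tag with the NI
        self.by_channel[chan_id] = tag
        self.by_tag[tag] = recv_queue

    def send(self, chan_id, payload):
        # on send, the NI inserts the tag for this channel
        return (self.by_channel[chan_id], payload)

    def receive(self, message):
        # on receive, the tag selects the destination queue
        tag, payload = message
        self.by_tag[tag].append(payload)
```

Because only the kernel can install entries in these tables, a process cannot forge a tag to read another process's traffic, which is how protection survives the move out of the kernel.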

Evaluation:

The paper identifies the reasons for the latency in passing messages across networks and provides statistics to back them up. The authors take an approach similar to the exokernel, providing a clean interface for moving the network protocol out of the kernel. They clearly identify the problems one might face and how to solve them, and they implement the design and provide measurements justifying their approach.

Confusions:

In section 3.5, U-Net provides a kernel emulation for certain applications. They say certain applications don't benefit from U-Net (I guess applications passing large messages, where network bandwidth is the main constraint). How do they identify these applications, and how does the kernel emulate U-Net for them?

What are VCIs (Virtual Channel Identifiers) similar to in a Unix OS? I am unable to grasp the concept of a channel identifier and how they come about. Is every process given such an identifier?

Summary
The paper presents the U-Net architecture, which gives user processes direct access to the network interface so as to provide efficient low-latency communication and a high degree of flexibility by moving the entire protocol stack to user space.
Problem
In the conventional networking architecture, message communication through the kernel involved several message copies and processing overheads that prevented applications from fully utilizing the underlying high-speed networks. As a result, applications failed to see performance improvements even on fast networks. Secondly, placing protocol processing in the kernel makes it quite difficult to support new protocols and message send/receive interfaces. The authors felt that applications would benefit more from a flexible network interface.
Contributions
In order to reduce the processing overhead of a message transfer, the kernel is removed from the critical path, which eliminates the system-call overhead and allows buffers to be managed efficiently. The network interface incorporates multiplex/demultiplex capabilities to enforce protection boundaries. This virtualizes the NI and provides each process the illusion of owning the network interface. Processes interacting with the network create endpoints that have communication segments set up to hold messages for sending. These messages are then accessed directly by the network interface and sent over the network. Queues are used for messaging and flow control. U-Net uses a tag on each incoming message to determine its destination endpoint, and thus the appropriate communication segment for the data and message queue for the descriptor. U-Net supports a true zero-copy architecture in which data can be sent directly out of the application data structures without intermediate buffering, but the base-level architecture is what is implemented, because the available hardware does not support the direct-access architecture.
Evaluation
The U-Net architecture has been implemented using SBA-100 and SBA-200 interfaces on SPARCstations running SunOS. The UAM prototype is used to measure round-trip times, showing that the overhead of UAM is very low. A comparison with the standard kernel implementation on latency and bandwidth is provided: the benchmarks show that reduced latency and increased bandwidth are achieved for small message sizes with UAM.
Confusion
The paper mentions that the number of endpoints with direct access to the NI is limited, and that emulated endpoints are used once all of them are taken. So what policy is used to ensure that these endpoints are used efficiently, and who implements it?

1.Summary
The paper describes a user-level network interface and architecture called U-Net that enables user processes to directly access the network. It aims to achieve low-latency communication by eliminating the processing overhead involved in sending and receiving messages and by removing the kernel from the critical path. A detailed analysis of how U-Net performs for different protocols and message sizes is given.

2.Problem
Most distributed computing applications had to incur huge processing overheads due to the communication protocols implemented in the OS kernel. Even though there were improvements in network hardware (fibers) and small messages were most frequent, low-latency communication was limited by OS processing overheads from buffer management, message copying, and other validation tasks. Also, existing kernel modules are not flexible enough to let applications implement new protocols. In this paper, the authors have tried a novel approach: moving the protocol implementation to user space and achieving low-latency communication by removing the kernel from the critical path.

3.Contributions
A new abstraction is provided at user level to achieve high-performance communication by giving each user process a virtual Network Interface (NI). The U-Net architecture has communication endpoints which act as the application's handle into the network. Endpoints have communication segments, which are memory regions that hold message data, and descriptors for send/receive/free queues. Multiplexing and demultiplexing are provided at the NI to handle different processes' requests to use the network. The kernel is only responsible for set-up and tear-down of communication channels using channel ids and tags. One important contribution is the two types of architectures U-Net defines: base-level, or zero-copy, uses intermediate buffers in communication segments, whereas direct-access, or true zero-copy, can copy host data structures directly to the destination memory region. U-Net makes strong use of memory-mapped networking support, allowing users to communicate directly with other nodes within the provided protection boundaries. A separate emulated-endpoints case is supported for the many applications competing for endpoints and communication segments, which are scarce resources.

4.Evaluation
Two U-Net implementations, with separate ATM NI interface cards and firmware, are evaluated on SPARC workstations running SunOS. The first, with the SBA-100, achieves a single-cell message round-trip latency of 66us and a bandwidth of 6.8MB/s; most of the overhead is due to the CRC computation at either side. The second, with the SBA-200 and its i960 processor, achieves 160us and a bandwidth of 13MB/s. Raw U-Net is compared with U-Net Active Messages transfers for varying message sizes; the bandwidth saturates at a 512-byte message size to the fiber limit of 15MB/s for all message types. The U-Net ATM cluster's round-trip latency and bandwidth are compared with the CM-5 and Meiko machines, and the results are close and comparable. Split-C network benchmark performance is compared across all three machines. Finally, a comparison of Fore ATM TCP and UDP protocol performance with the regular protocols is given.

5.Confusions
How do modern machines and OSes do memory-mapped network communication and DMA access for networks? How does U-Net compare with RPC calls for distributed computing? Parallel machines like the CM-5 have switched to MPI, and U-Net seems to be adding protocol complexity again, as we discussed in last class.

Summary

U-Net is an abstraction that gives each user-mode process the illusion that it owns the network interface. It improves processing latency by removing the kernel from the communication path while still providing full protection. The architecture is flexible because user-level processes can implement their own optimized protocols.

Problem

The processing overheads of a message in the kernel limit the peak bandwidth and cause high latencies. These overheads are caused by buffer management, message copies, checksumming, flow-control handling, and interrupt overheads in the kernel. With the advent of distributed systems, RPC, etc., small messages have become predominant in the network, so there is a need to reduce the processing overhead of small messages. Also, if the network stack is implemented in the kernel, the set of protocols becomes ossified for the user process, because it doesn't have any flexibility to change them.

Contributions

The architecture virtualizes the interface in such a way that the mechanisms can give every process the illusion of owning the interface to the network. This is done by removing the kernel from the critical path of sending and receiving messages. The kernel is only involved when an application requests the setup or teardown of a channel and its endpoint (the application's handle into the network interface). The interface provides protection and multiplexing/demultiplexing by using tags (which identify the communication channels) on the messages. It gets rid of copying messages multiple times, achieving zero copy and true zero copy in the base-level mode and direct-access mode, respectively.

Evaluation

The authors ran their experiments after implementing U-Net for both the SBA-100 and SBA-200, two Fore Systems interfaces, on SPARCstations running SunOS. U-Net's performance was found to be better than the conventional kernel-implemented network stack, and it approaches the raw throughput of the network.

Confusion

Are there any modern systems where this kind of abstraction of a per-process network interface is used?

Summary:
This paper presents U-Net, a communication architecture for user-level communication on an off-the-shelf hardware platform running a standard operating system. The central idea in U-Net is to remove the kernel from the critical path of sending and receiving messages, which helps immensely in reducing the latency of small messages. The U-Net communication architecture essentially virtualizes the network interface so that each process has the illusion of owning it.

Problem:
In traditional operating systems, all protocol processing happens in the kernel. The path taken by messages through the kernel results in large processing overheads due to several message copies and the crossing of multiple levels of abstraction between the device driver and the user application. This overhead is more pronounced for small messages, which ideally rely on quick round-trip requests and replies. U-Net tries to address this problem by moving protocol processing from kernel space to user space.

Contributions:
The U-Net architecture virtualizes the network interface in such a way that OS and hardware mechanisms can provide the illusion of a private network to each process. U-Net provides abstractions in the form of endpoints to virtualize the network interface. Endpoints contain communication segments, which hold message data, and message queues for staging messages yet to be sent or received. U-Net uses descriptors to keep track of messages in the queues and a flagging mechanism to reclaim buffers allocated from communication segments. Endpoints allow U-Net to enforce protection boundaries among processes accessing the network and also help provide isolation, since endpoints and the encompassing communication segments and queues are only accessible by the owning process. The message queues are optimized to hold entire small messages, thereby avoiding buffer-management overheads. The base-level U-Net architecture minimizes redundant copies and tries to approximate "true zero copy"; in reality, it ends up making one copy from user space to the communication segment. True zero copy can be achieved in direct-access U-Net, which allows communication segments to span the entire process address space and lets the sender specify the offset in memory at which the data is to be placed. Another elegant interface provided by U-Net is that of kernel-emulated endpoints, which look just like regular endpoints to processes, except that they trade some performance for scarce resources, since the kernel multiplexes multiple emulated endpoints onto a single real endpoint.
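The small-message optimization mentioned above (queues holding entire small messages) can be sketched in a few lines of Python. The cutoff value and descriptor shapes here are illustrative, not the paper's actual figures:

```python
from collections import deque

INLINE_MAX = 56  # illustrative cutoff, not the paper's actual figure

def make_descriptor(segment: bytearray, free_q: deque, data: bytes):
    """Build a toy send descriptor. Small messages ride inside the
    descriptor itself, so no communication-segment buffer is touched;
    larger messages are staged in the segment and the descriptor
    merely points at them."""
    if len(data) <= INLINE_MAX:
        return ("inline", bytes(data))
    off = free_q.popleft()               # grab a free segment buffer
    segment[off:off + len(data)] = data  # stage the payload
    return ("segment", (off, len(data)))
```

The payoff is that the most common case (small messages) skips buffer allocation and reclamation entirely.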

Evaluation:
The authors use SPARCstations running SunOS and two different Fore Systems interfaces for their evaluation. They report round-trip times for systems using both interfaces, i.e. the SBA-100 and SBA-200. There is also a sensitivity study reporting round-trip times and bandwidth as a function of message size. The authors also evaluate the UAM prototype provided by U-Net with the aid of some micro-benchmarks. There is further evaluation using Split-C benchmarks to compare the U-Net version of Split-C to the versions implemented on supercomputers such as the CM-5 and Meiko CS-2.

Confusion:
Is the U-NET architecture currently implemented in any commercial systems? If not, why?

1. Summary
The paper explains the implementation and working of U-Net, an exokernel-style user-level interface to network cards that eliminates the work the traditional OS did in its kernel to virtualize the network interface card. The authors argue that the entire protocol stack should be placed in the user-level application, which has direct access to the network.

2. Problem
a) In communication networks, the processing overhead had become comparable to the message latency. This is true in the case of RPCs and distributed networks, and it inhibits the user from obtaining the maximum bandwidth and efficiency offered by the developing hardware.
b) Messages were getting increasingly smaller and thus had to be communicated faster compared to bulk messages; owing to the processing overheads, however, they were quite slow.
c) All the network protocols were placed in the kernel section of the OS, so it became almost impossible to change a protocol to suit the application and the needs of the user.

3. Contributions
a) Messages are sent and received by storing them in a buffer area called the communication segment, in the address space of the process. The send, receive, and free queues are used to point at the message when sending or receiving; a separate buffer in the network interface is not needed.
b) Every process is identified by a tag in the network interface, and the process communicates through a channel identifier that directs messages to the correct destination. Using this mechanism, multiplexing is provided, thus virtualizing the network interface for the process.
c) If the address at the receiver is already known and provided by the sending process, messages are sent without buffering them (like DMA); this is known as true zero-copy. It saves one copy compared to buffering in the communication segment.
d) If it is not viable to allocate one communication segment and three queues to every process, these regions are emulated by the kernel with minimal kernel involvement.
e) In the case of true zero-copy, the network interface has to act as an MMU and also handle page faults. Also, as an optimization, small messages are stored directly in the queues instead of pointing back into the communication segment.
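The difference between the base-level path (one staging copy, point a) and the true zero-copy path (point c) comes down to how many host-side copies the send makes. A toy sketch, with invented function names, that just counts those copies:

```python
def base_level_send(app_buf: bytes, comm_segment: list) -> int:
    """Base-level ('zero copy'): the host CPU makes exactly one copy,
    application buffer -> communication segment; the NI then moves the
    segment contents out (DMA) with no further host copies."""
    comm_segment.append(bytes(app_buf))  # the single host-side copy
    return 1                             # host copies on the send path

def direct_access_send(app_buf: bytes, dma_queue: list) -> int:
    """Direct-access ('true zero copy'): the NI transfers straight out
    of the application data structure; no host-side copy at all."""
    dma_queue.append(app_buf)            # hand the NI a reference only
    return 0
```

The catch, as noted in point e), is that the direct-access path requires the NI to translate arbitrary application addresses, which the hardware of the time could not do.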

4. Evaluation
The U-Net Active Messages (UAM) implementation shows that the UAM overhead is about 6us for a single-cell round trip. For a block-transfer round trip, the time is linear with a factor of about 0.2us per byte. UAM also reaches the bandwidth limits of the network interface. A plot of the time taken by the network and the CPU shows that most of the time is spent in the kernel. With the U-Net implementation in the U-Net ATM cluster, communication is faster compared to the other implementations that ran the Split-C benchmarks.

5. Confusion
At the end of section 3.1, the authors say - “A process must be able to disable upcalls cheaply in order to form critical sections of code that are atomic relative to message reception”. Does this mean that upcalls are not performed during critical sections or the upcalls are the critical sections?

Summary:
The paper describes the communication architecture, U-Net, the driving force behind the idea and the details of two implementations of the architecture. The authors present evaluations of the design for different benchmarks and show that the design does in fact fetch better performance than traditional networking architectures.

Problem:
The existing architectures involved passing messages through the kernel, which incurred overheads due to copying the data multiple times and crossing domains between user and kernel space. Also, the existing systems were rigid and did not easily support the addition of new protocols or interfaces. These overheads were significant in the case of small messages, where they were substantial compared to the total latency.

Contributions:
The authors decided to move protocol processing into user space while at the same time providing the protection and security of kernel space. The most important contribution, in my view, was the idea of virtualizing network devices; this enabled having multiple devices connected to the system without sacrificing the performance of any. Security was provided by using the kernel to set up and terminate communication channels and other resources, such as queue buffers. The data was multiplexed at the sender end and demultiplexed at the receiver end using channel identifiers and tags. The operating system is used for authentication and authorization, along with other operations like route discovery and switch-path setup, which I think was the responsibility of some kind of network driver included in the OS; the exact mechanism is not mentioned in the paper. Another key optimization was the zero-copy mechanism, where the communication segment spanned the entire process address space and the sender specified an offset in the destination communication segment where the message was to be placed. The endpoints provided to processes were virtual endpoints; the real endpoint, which was large, was multiplexed among processes by the kernel.

Evaluation:
The authors provide latencies and breakdowns highlighting the overheads of the proposed architecture. The graphs show that the latency for small message sizes is significantly lower compared to other designs and is comparable for larger sizes. The proposed model falls short of the CM-5's performance in cases such as the radix-sort benchmark with small messages, but performs better in other cases. The graphs seemed quite unintuitive to me; I found it hard to draw inferences from them.

Confusions:
What does the 'tag' mentioned in section 3.2 represent? Also, doesn't keeping controls such as route discovery and signaling in the OS limit the flexibility?

Summary:

In this paper, the authors discuss U-Net, a user-level network interface architecture which gives processes the abstraction of an individual network interface. The theme of this approach is to move the networking architecture out of the kernel and into user space, to avoid buffering overheads and give user programs greater flexibility in managing the network interface. The authors discuss the performance of U-Net on the Split-C benchmarks, and the paper also talks about how traditional protocols like TCP and UDP can be supported by this approach.

Problem:

When the network interface is moved out of the kernel, the issue of multiplexing arises. Protection across processes also becomes an issue, since two processes using the network interface should not interfere with each other. The U-Net architecture design is motivated by the goals of low latency across local networks, full bandwidth utilization even with small messages, and the flexibility to use legacy protocols while at the same time scaling to new ones.

Contribution:

U-Net tries to achieve maximum bandwidth even with small messages. By using endpoints, the U-Net architecture is able to abstract the network interface: sending and receiving messages becomes simple using the send, receive, and free queues. The upcall feature can save significant time for processes compared with polling regularly. By using tags for processes and associating them with communication channels, U-Net is able to efficiently multiplex and demultiplex messages. The tags also help with protection, as only messages with the correct tag are delivered to the receiving process. The idea of supporting true zero copy with direct-access DMA can provide a significant performance improvement, as it removes one level of buffering. By holding small messages directly in the send and receive queues, the overhead for sending small messages is reduced. Emulated endpoints provide an alternative to regular endpoints for applications that do not need them. In the implementation of Active Messages on U-Net, if the send buffer is full, the sender polls for incoming messages, thereby reducing idle time. By moving buffers from the kernel to user space, U-Net is able to tackle the problem of message loss in high-bandwidth data streams due to buffer overflow.

Evaluation:

The U-Net implementation was tested on latency and throughput. The performance of U-Net could not match the CM-5, as the latter used custom network interfaces placed on the memory bus. The graphs show that the maximum bandwidth was achievable. The performance of U-Net Active Messages was compared against raw U-Net, showing that the two are comparable. U-Net ATM was also tested on the Split-C benchmarks and compared with the CM-5 and Meiko; its performance matched the Meiko but was lower than the CM-5 for small messages. Also, the U-Net implementations of TCP and UDP performed better than their kernel counterparts.

Confused about:

I had a question about section 3.2, where it says incoming messages are mapped to the correct process using the tag. If a process on one machine is communicating with a process on another machine for the first time, how does the message get mapped to the correct process, since the tag value will not be known by the sender? Is this idea used now? Doesn't this place an onus on the programmer to implement a lot of networking features in the program?


1. Summary
This paper describes U-Net, a user-level implementation of a virtual networking interface made to remove the barrier a kernel networking stack presents when implementing new networking protocols.
2. Problem
At the time U-Net was created, recent hardware advancements meant local-area networking was getting much faster, but processing overhead (time spent moving packets through the kernel) was not shrinking proportionally. When sending small messages, the processing overhead dominated the network latency, a problem the creators of U-Net aimed to fix.
Additionally, kernel-level implementations of the network stack created a high barrier to entry for people who wanted to implement a new network protocol, for example to transport streaming video more efficiently.
3. Contributions
U-Net moves the network stack into user level using endpoints, which are application handles to the network. U-Net itself demultiplexes messages as they come in on the NI, based on their destination.
A "true zero copy" architecture is attempted, where data is sent directly out of application-level buffers without intermediate buffering steps, and vice versa for incoming data. Due to lacking hardware support for the direct I/O needed, the U-Net implementations tested in the paper still required data to be placed in specified buffers, which is more of a one-copy architecture.
U-Net also supports kernel-emulated endpoints, a way of reducing demand for full U-Net endpoints by multiplexing several of them into one using kernel emulation of the endpoint functionality. This is useful for applications with lower performance demands that shouldn't be wasting scarce resources.
4. Evaluation
The authors evaluated U-Net on systems running SunOS 4.1.3, using Fore SBA-100 and SBA-200 interfaces. They presented a large number of latency and bandwidth measurements on both interfaces, with U-Net performing generally faster. Split-C benchmarks were also used to highlight U-Net's performance under more realistic workloads.
5. Confusion
I'm surprised by the speedups the authors measure for TCP and UDP. It seems surprising that kernel overhead can cause such a large difference, and that the move to user-level code makes the problem vanish to such an extent.

Summary:
This paper describes the U-Net communication architecture, which provides processes with a virtual view of a network interface to enable user-level access to high-speed communication devices. U-Net focuses on low-latency communication by reducing the processing overhead for small messages and offering flexibility.

Problems:
In the traditional UNIX networking architecture, networking protocols were implemented in the kernel. With increases in network bandwidth, processing of data became the bottleneck: it involved copying data from user to kernel, and from kernel to the network interface hardware. Also, the context-switch time from user mode to kernel mode became dominant in applications using small messages. Direct user-level access to the network was possible with some early implementations, but they required custom hardware. U-Net aims to solve the low-latency problem without requiring any specific hardware, while at the same time preserving a microkernel-style architecture for ease of protocol implementation.

Contribution:
1. Though user-level access to the network interface was possible in some systems, they required customized hardware. U-Net's major contribution is providing user-level access to the network without any modification of hardware or the operating system.
2. Buffer management and protocol processing are moved to user level, and mux/demux is performed directly inside the network interface, which enables NI virtualization and provides each process the illusion of owning the interface to the network.
3. The U-Net architecture is composed of endpoints, which contain communication segments and message queues. The communication segment contains message data and the queues contain message descriptors. Sending data involves writing the data into the segment and pushing a descriptor onto the send queue. Incoming messages are demultiplexed by U-Net based on their destination. An event-driven model or polling can be used at the receiving end.
4. With the help of endpoints and communication channels, one application cannot interfere with the communication channels of another application, providing protection.
5. The base-level U-Net architecture needs one intermediate copy of messages, whereas in the direct-access U-Net architecture no intermediate copy is necessary.

Evaluation:
Round-trip latency and throughput are measured for the U-Net TCP/UDP implementation and compared to the Fore TCP/UDP implementation. The round-trip latency of U-Net is roughly 1/5th of the Fore implementation's. The Fore TCP implementation maxes out at 10 MB/s while U-Net TCP achieves 15 MB/s. Active Messages are implemented on U-Net and compared with parallel computers running the same Split-C programs; the performance of U-Net clusters is competitive with supercomputers across a variety of Split-C benchmarks.

Confusions:
Though the paper initially claims U-Net requires no specific hardware, true zero copy does need special hardware support. Is this implementation used today?

Summary:
This paper talks about a new communication system (U-Net) built on top of off-the-shelf (ATM) communication hardware, with the network protocol stack pushed into application space.

Motivation:
The authors argue that getting the kernel out of the path for sending and receiving messages allows application-specific optimizations, removes unnecessary overheads that occur inside the kernel (such as repeated copying), and increases the flexibility to quickly implement new protocol features. They feel that as networking hardware becomes faster (and with the predominant usage of small messages, where latency matters more than throughput, for the transfer of synchronization and control information across nodes), reducing the software overheads will gain increasing importance.

Contributions:
The main ideas that are explored in this paper regarding the design of such a communication system are:
-The U-Net system exposes a network interface that provides multiplexing (and demultiplexing) and protection on top of the hardware, but does little otherwise. All other functionality is pushed into application space.
-Each application has its send and receive queues and message contents (together called an endpoint) inside its own address space (mostly pinned to physical memory) and polls for new messages entering its receive queue (or registers upcalls, or sleeps on a status change of the queue).
-On systems with the required hardware support (i.e. the ability of the I/O bus to access all addresses), U-Net supports direct data transfer between endpoints, bypassing any copy into a network buffer.
The authors discuss two versions of their U-Net implementation (with and without direct access) on different generations of Fore hardware. They also discuss the implementation of existing protocols (TCP/IP, UDP, Active Messages) on top of U-Net.
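The choice between polling the receive queue and registering an upcall, mentioned above, can be sketched as a toy class (names are invented; this is not the actual U-Net interface):

```python
class UpcallEndpoint:
    """Toy sketch of the two receive styles: polling a queue vs.
    registering an upcall handler with the NI."""
    def __init__(self):
        self.recv_q = []
        self.handler = None

    def register_upcall(self, fn):
        # event-driven mode: the NI will invoke fn on arrival
        self.handler = fn

    def on_message(self, msg):
        # called by the 'NI' when a message arrives
        if self.handler is not None:
            self.handler(msg)        # upcall fires immediately
        else:
            self.recv_q.append(msg)  # otherwise the app must poll recv_q
```

Polling gives the lowest latency when the application is spinning anyway; upcalls avoid burning CPU when messages are infrequent, which is why U-Net offers both.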

Evaluation:
The authors show that communication over U-Net saturates the bandwidth for packet sizes of around 800 bytes. On their implementation of Active Messages on top of U-Net, they reach 80% of the theoretical bandwidth limit. They concede that parallel systems like the CM-5, which place network interfaces on memory buses, are faster (but, as they argued, those systems require custom hardware, so their communication system is more flexible). They also show that their TCP implementation on top of U-Net has reduced latency and increased throughput compared to the kernelized TCP implementation on ATM.

Confusions:
The paper talks about an operating system service figuring out the routes between machines. Is that not the function of the networking protocol implemented on top of U-Net?

Summary :
The authors describe U-Net, a virtual network interface that can be accessed directly by applications at the user level, thereby removing the kernel from the communication path. The paper also discusses how protocols like TCP, UDP, and IP have been implemented using U-Net, as well as communication mechanisms like Active Messages. They demonstrate high performance by measuring the improvement in latency and bandwidth achieved for small messages using U-Net.

Problem :
The problem in question is about developing an architecture that would allow user level processes to directly access the network interface providing flexibility of protocol processing at user level. Many applications communicate using small messages and demand low round trip time. The goal is to achieve low latency, high bandwidth for small messages and support common communication protocols. In addition, since the kernel is bypassed, a mechanism for multiplexing the network needs to be devised and protection between processes accessing the network should be ensured.

Contributions :
1. Incorporates the multiplexer/demultiplexer in the network interface and moves protocol processing and buffer management to the user level, thereby providing flexibility.
2. Architecture mainly composed of endpoints, which are the application’s handle into the network, communication segments to hold the message data and message queues for the descriptors for messages to be sent/received.
3. Tags used in messages to determine the destination endpoints. Protection boundaries enforced by the use of endpoints, communication segments and message queues as these are accessible only by the owning process.
4. Two levels of architecture - base level and direct access supported. Base level supports zero copy which requires one intermediate copy in the network buffer. Direct access supports true zero copy in which data is directly sent out of the application data structures without any buffering.
5. Also implement a user level library that exports the generic active messages interface and uses the U-Net interface. This includes functions for flow control and retransmissions necessary for active messages.
6. Protocol implementation can be changed to modify the buffering and timer mechanisms so that they could be optimized for the application and the network technology used (ATM).
7. Tuning of TCP with respect to factors like segment size and window size to achieve higher bandwidth.
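
The tag-based multiplexing described in points 1-3 above can be sketched in a few lines. This is a minimal illustration with invented names (`Endpoint`, `Mux`, `deliver`); the paper does not give code, and the real demultiplexing happens in NI firmware, not in Python.

```python
class Endpoint:
    """An application's handle into the network: a receive queue plus
    state that only the owning process can access."""
    def __init__(self, tag):
        self.tag = tag            # channel identifier registered with the OS
        self.recv_queue = []      # descriptors/payloads for received messages

class Mux:
    """The network interface routes each incoming message to the endpoint
    whose tag matches; messages with unknown tags are dropped."""
    def __init__(self):
        self.endpoints = {}
    def register(self, ep):
        self.endpoints[ep.tag] = ep
    def deliver(self, tag, payload):
        ep = self.endpoints.get(tag)
        if ep is not None:
            ep.recv_queue.append(payload)
        return ep is not None

mux = Mux()
a, b = Endpoint(tag=1), Endpoint(tag=2)
mux.register(a)
mux.register(b)
mux.deliver(2, b"hello")          # lands only in b's queue
assert a.recv_queue == [] and b.recv_queue == [b"hello"]
```

Because a process can only read its own endpoint's queue, a correctly routed tag is what enforces the protection boundary mentioned in point 3.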

Evaluation:
A comprehensive evaluation of U-Net has been done for latency and bandwidth. The round-trip latency for small messages of size less than 40 bytes is 65 microseconds, which the authors claim is comparable to other research results. They also evaluate the bandwidth for various message sizes and conclude that at message sizes of about 800 bytes, the network is fully utilized. Low latencies are observed for the TCP and UDP implementations built on the base-level U-Net functionality.

Confusion :
What conclusions are derived from the comparative evaluation of the Split-C benchmarks over U-Net Active Messages against those on the CM-5 and Meiko CS-2 machines?

Summary
The paper presents U-Net, a communication architecture that provides a virtual view of a network interface to enable user-level access to high-speed communication networks. The main idea is to lower the latency caused by kernel processing of packets on high-speed networks. This approach avoids the kernel and offers a more flexible development environment for parallel programs.

Problem
In this paper, the authors identify a significant per-message processing overhead in networking, in particular on high-speed LANs. The network is underutilized, especially when an application sends a large number of small messages (say, RPCs). The situation worsens because the kernel is made to do the network processing, which also decreases the flexibility of the system. By removing the networking stack from the kernel and making it available to user-level processes, creating specialized communication protocols becomes easy. All of this, of course, should not compromise efficiency or security.

Contribution
The authors demonstrate the benefits of moving the send and receive paths for network messages out of the kernel. This is achieved using the idea of endpoints. Endpoints are like handles that allow direct communication with a process. This allows user-level processes to behave as though they own the network interface. For applications unaware of endpoints, endpoints are emulated with kernel support. For example, TCP/UDP implementations can be tweaked to achieve better bandwidth utilization than what the kernel provides. The kernel is responsible only for multiplexing processes onto the network resources. Hence buffers and network abstractions can be managed at the user level, allowing each application to choose the facilities that are optimal for its own needs. Although direct-access U-Net was not implemented at the time of the paper, the authors claim that it would eliminate buffering entirely, with data sent directly to and from application data structures. Another important contribution is the implementation of U-Net Active Messages on top of U-Net, a communication protocol for multiprocessor systems that ensures reliable delivery as long as no catastrophic failures occur.

Evaluation
The paper reports thoroughly on the performance of the U-Net architecture. Results were obtained by running U-Net on SPARC systems with SunOS. Two Fore Systems ATM interfaces were used to record the round-trip time for single messages of various sizes. For both the SBA-100 and SBA-200, the kernel was seen to cause a significant amount of overhead. The authors also evaluated the TCP and UDP implementations on U-Net, and for both, U-Net achieved higher bandwidth compared to the kernel versions. Split-C application benchmarks were used to compare U-Net's performance with the optimized network stacks of supercomputers.

Confusion
U-Net certainly takes the network stack out of the kernel, which makes it fast. The U-Net architecture also compensates by adding its own protection mechanism. My question is: how can we compare these two security mechanisms? Is one always better than the other? Are we compromising on security when removing this control from the kernel?

Summary:
The paper presents U-Net, a network interface architecture that eliminates the kernel from the critical path of send and receive communication, lowering processing overhead to achieve low latency. The design is highly flexible and is shown to support implementations of the TCP and UDP protocols and Active Messages. The evaluation on a SPARC architecture supports the authors' claims (high performance, low latency and flexibility).

Problem:
LAN speed and availability have improved over time. The major contributor to latency and lower bandwidth is now the processing time spent at the sending and receiving ends, which is largely time spent in kernel processing. The authors show that latency is significantly high for small messages and that improving transmission time alone does not solve the problem. The paper proposes to remove the kernel from the critical path and move all the processing to the user level to improve latency and performance.

Contributions:
1. Moving the network stack implementation to user level removes the kernel from the critical path of end-to-end communication. This reduces the processing overhead and lowers latency.
2. Flexibility in design: The architecture supports both traditional protocols such as TCP, UDP and others such as Active Messages.
3. Endpoints, communication channels and tagging enforce protection boundaries among multiple processes accessing the network.

Evaluation:
The paper presents an extensive evaluation of the proposed architecture on systems running the SBA-100 and SBA-200. The RTTs are found to be 66 us and 160 us respectively. The higher RTT on the SBA-200 is attributed to its complex message data structures. UAM, the authors' own implementation of Active Messages, has a round-trip time of 70 us, which is quite close to the minimum. Similarly, the U-Net-based UDP and TCP protocols have lower latencies and higher bandwidths across variable message sizes.

Confusion:
Is the U-Net architecture prevalent in any domain? If not, what are the factors behind that?
Is moving logic out of the kernel always a good idea? I feel maintenance is an issue when things are implemented at user level.

Summary
U-Net provides a new architecture that allows for the development of communication protocols in user space. It aims to reduce per-message processing overhead and exploit the full network bandwidth by providing a flexible interface to the lowest layer of the network and moving protocol processing into user space.

Problem
The availability of high-speed local area networks has shifted the communication bottleneck to the per-message processing at the source and destination. The existing model also does not support the implementation of new network protocols or interfaces.

Contribution
The U-Net architecture attempts to provide low latency and high bandwidth for small messages. Its user-level interface virtualizes the network interface in such a way that every process has the illusion of owning its own interface to the network. The U-Net architecture consists of three components - endpoints (handles to the network), communication segments (regions that hold the data) and message queues (which hold message descriptors). U-Net uses a tag in incoming messages to identify the target endpoint, thus enabling multiplexing and demultiplexing of messages. It also aims for zero copy, where data is delivered without intermediate buffering, going straight to user-level data structures when it arrives at the NIC. Due to limitations of the I/O bus, it implements a base-level version, which requires an intermediate copy into the network buffers ("zero copy"), and a direct-access version, which does not do any intermediate buffering (true "zero copy").
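
The difference between the two versions can be made concrete with a small Python sketch. This is an illustration only (variable names are invented, and `memoryview` merely stands in for the NI reading application memory in place): base level pays one copy into the communication segment, while direct access reads the application's data structure directly.

```python
app_data = bytearray(b"request payload")

# Base level (the paper's "zero copy"): the sender first copies the
# message into a buffer inside the communication segment, and the NI
# transmits from there -- one intermediate copy.
comm_segment = bytearray(64)
n = len(app_data)
comm_segment[:n] = app_data           # the single intermediate copy
base_level_view = bytes(comm_segment[:n])

# Direct access ("true zero copy"): the communication segment spans the
# application's address space, so the NI reads the application data
# structure in place; no bytes are copied here.
direct_view = memoryview(app_data)

assert base_level_view == bytes(direct_view)
app_data[0:7] = b"REQUEST"
# The direct-access view observes the change; the copied buffer does not.
assert bytes(direct_view[:7]) == b"REQUEST"
assert base_level_view[:7] == b"request"
```

The last two assertions show why direct access needs care: the NI sees whatever the application's memory holds at transmit time, with no snapshot taken.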

Evaluation
U-Net was implemented on the SBA-100 and SBA-200 (Fore Systems ATM interfaces) and its performance was evaluated. The authors also provide numbers showing that U-Net TCP and UDP offer the low latency and high bandwidth claimed in the paper.

Confusions
I am curious to know if this model of network implementation gained popularity? Is it being used in any current systems?

Summary:
The U-Net paper discusses the implementation of a network interface largely independent of the kernel, which greatly reduces latency. A key idea from the paper is that high network bandwidth can be achieved with commodity hardware without much modification to the operating system. The paper describes how the network interface is implemented at the user level and how applications can leverage this abstraction to make their own communication-protocol decisions.

Problem:
The time taken for end-to-end message transfer using standard network interfaces is long due to the many layers of abstraction, data copying and involvement of the kernel. This increases per-message latency, which is critical when communication involves many small messages. Newer applications demand lower per-message latency, and the authors want to evaluate whether the communication bottleneck can be removed by moving essential components of the network stack to the user level while maintaining the same protection with low overhead.

Contributions:
The key idea described in the paper is to redesign the networking stack by taking it out of the kernel, giving user-level applications the ability to make protocol decisions. This concept is very similar to the exokernel approach. It leads to virtualizing the network layer, providing end-to-end communication capabilities to each application. Protection is provided with OS support by tagging messages with their destination endpoints. A lot of data copying through the kernel is eliminated, since the kernel is primarily responsible only for setting up the communication channel and tearing it down once the messages have been sent.

Evaluation:
Extensive latency and bandwidth evaluation results are presented in the paper. Specifically interesting to me was the Split-C benchmark comparison across the CM-5, U-Net and Meiko CS-2. This graph shows that U-Net was able to perform better on a range of algorithms compared to the other, highly optimized systems. However, I fail to understand why the results for some applications (radix sort with small messages) were worse (higher latency) than the other reported results.

Confusions:
It is unclear to me how the OS recognizes the correct tag for a message.
Also, is it always beneficial to move code from the kernel to the user level? Over a period of time, how is user-level code maintained and made compatible with evolving hardware?

1. Summary
The paper introduces U-Net, a low latency, high bandwidth communication architecture for distributed systems. It virtualizes the network interface to the user applications, which can send and receive data without kernel intervention.
2. Problem
Network communication in traditional OSes depends on the abstractions provided by the kernel device drivers. This includes costly system calls and multiple buffer copies, which results in the processing overhead dominating the network latency for small messages. Thus the raw network bandwidth cannot be completely utilized. Moreover, applications are limited by the protocols supported by the kernel and cannot take advantage of new protocols, interfaces and optimizations.
3. Contributions
The primary contribution of the paper is a communication architecture that removes the kernel from the critical path and entrusts the applications with optimizing the protocol processing. User applications are given direct access to the network interface, with the OS performing some network services, authentication and authorization.
U-Net multiplexes the NI among all processes, enforcing protection boundaries and limiting resource consumption. In order to communicate, a process creates an endpoint and associates send/receive queues and a communication segment with it. The queues are used for messaging and flow control. Each message contains a tag that identifies the source and destination endpoints. U-Net implements a base-level, or one-copy, buffer management mechanism: the message is copied only once, into a buffer in the communication segment.
4. Evaluation
Performance analysis is done on SBA-200 NI cards with 5 SPARCstation-20 and 3 SPARCstation-10 systems connected through a Fore ATM switch on a 140 Mbps fiber link. The round-trip latency for longer messages was lower (120 us) than the Fore firmware latency (160 us). The saturation bandwidth achieved is around 120 Mbps. The U-Net implementations of TCP and UDP perform better than the Fore implementation, with around 50% more bandwidth at 1/6th of the latency. This really shows the strength of the U-Net architecture.
5. Confusion
I did not clearly understand the concept of tags vs. endpoints. For the Split-C benchmarks in Fig. 5, for small message sizes, U-Net seems to have high network latency, which is not what I would expect from its design.

1. Summary
This paper introduces U-Net, a mechanism designed to provide efficient low-latency communication and a high degree of flexibility. U-Net decreases round-trip latency especially for small messages, which increasingly dominate traffic at the network interface. The paper also shows that removing the kernel from the communication path can offer new flexibility in addition to high performance.

2. Problems
The path crossing multiple levels of abstraction has overheads that limit the peak communication bandwidth and cause high end-to-end message latencies. Techniques are explored to improve both performance and flexibility: removing the kernel completely from the critical path and allowing the communication layers used by each process to be tailored to its demands. The issues include multiplexing the network among processes; providing protection such that processes using the network cannot interfere with each other; managing limited communication resources without the aid of a kernel path; and designing an efficient yet versatile programming interface to the network.

3. Contribution
The major contribution is a simple user-level communication architecture that is independent of the network interface hardware. U-Net allows all messages pending in the receive queue to be consumed in a single upcall; a general tag scheme allows communication between arbitrary processes; protection boundaries are enforced by endpoints and communication channels; true "zero copy" is supported; and direct-access U-Net allows communication segments to span the entire address space of a process, letting the sender choose the destination by an offset.

There are two U-Net implementations on different OSes, plus a U-Net Active Messages layer with a set of primitives to initialize the GAM interface and send request and reply messages.

4. Evaluation
A very detailed evaluation is provided in the paper. The comparison shows that the round-trip time of U-Net is significantly lower than others for small messages, and the bandwidth achieved is higher than UAM's (though quite close). On Split-C, U-Net actually has a much larger round-trip latency compared to the CM-5 and Meiko. Evaluations of TCP and UDP verify that the architecture is able to support the implementation of traditional protocols; most modules are supported except ARP and ICMP. U-Net UDP has no expensive losses compared with kernel UDP, and similarly, TCP achieves higher bandwidth and lower latency with a smaller window.
5. Confusions
What's the difference between a user-level TCP/IP stack and a regular TCP/IP stack?
Are there other drawbacks of U-Net? Is it broadly used today?

Summary: This paper introduces the design of U-Net, a network interface running in user mode. U-Net is meant to replace the network interface provided by the OS; it has the advantage of being both flexible and efficient.

Problem: A kernelized network interface has two drawbacks. First, it has very high overhead because of unnecessary memory copies and OS traps, making it very inefficient especially when sending small packets. Second, it is not flexible. Certain applications would benefit a lot from their own network protocols. If network protocols are all implemented inside the OS, it is very inconvenient to install new protocols or to support new hardware.

Contributions:
1. A user-level network communication architecture that resides completely outside the kernel. U-Net provides virtualized network interfaces to every process that needs to do network communication. A process communicates through an endpoint provided by U-Net. There are two types of endpoints: U-Net endpoints and kernel-emulated endpoints. Communication through U-Net endpoints does not go through the OS kernel and is fast; kernel-emulated endpoints instead invoke syscalls to do network I/O.

2. Multiplexing and demultiplexing messages. U-Net must manage all incoming and outgoing network packets through the network card and perform process-to-process delivery. For this purpose, U-Net associates a tag with each message. Tags identify the process to which the message is to be delivered.

3. Zero-copy techniques. U-Net supports zero copy and true zero copy. Zero copy copies the data once, from user-process memory to the network device memory. True zero copy lets the network device directly access user memory, so no copy is needed.

Evaluation: The authors implemented U-Net on SunOS and measured its performance against kernelized network interfaces. The results show that U-Net achieved the performance limit imposed by the device.

Confusions: What is the fundamental thing that the kernel cannot do but U-Net can, which makes U-Net more efficient? It's unclear to me whether the performance improvement is really because of the user-mode architecture or because of SunOS's inefficient implementation of its networking interfaces.

Summary:
This paper describes U-Net, a novel approach that moves the network architecture to user space to provide applications a virtual interface to the network. U-Net uses new abstractions such as endpoints and communication segments to send and receive messages in an isolated manner, and takes over from the kernel the role of multiplexing and demultiplexing messages to provide this virtual view of the network in user space.

Problem:
Increasingly, the bulk of the messages communicated by applications in both distributed and general systems are small. Low communication latency becomes important for such messages. The higher bandwidth provided by improved networks also needed to be exploited by achieving higher small-message bandwidth. The limiting factor in achieving this throughput is the processing overhead caused by the indirect path through the kernel and its expensive buffer management. It is also difficult to change kernel-level code to experiment with new protocols or to tune the network architecture for application-specific processing. The authors' proposal of U-Net tackles these problems by moving the communication layers outside the kernel.

Contributions:
Endpoints provide a clean and simple abstraction to virtualize the network interface for different processes. The expense of this abstraction can be controlled by having applications that benefit the least from this scheme use kernel-emulated endpoints. Using channel identifiers to tag outgoing messages allows U-Net to mux/demux messages to their intended processes. Protection and isolation are enabled through this and by making the endpoints accessible only to their owners. Extending communication segments across the entire address space makes it possible for U-Net to operate in direct-access mode, which allows true zero copy. The message queues allow a further optimization: small messages can be held directly in descriptors and sent without going through the communication segments.

Evaluation:
The authors compare the latency and network bandwidth of U-Net against the conventional communication architecture in kernel mode. The experiments are conducted on two Fore Systems interfaces (SBA-100 and SBA-200) on SPARCstations running SunOS. U-Net performs better than its kernel counterpart, with reduced latencies and throughput approaching the raw throughput rate of the network. The authors also measure the performance of the TCP and UDP protocols on U-Net.

Confusions:
If the performance results are anything to go by, how is this not being used in mainstream OSs?

Summary: This paper introduces U-Net, a new network interface that gives user-level processes control of the network device. This interface reduces the overhead of communication and increases the flexibility of integrating new subsystems. Experiments on an 8-node ATM cluster show very low latency and high bandwidth.


Problem: As network hardware evolves, it is imperative to evolve the system to make communication fast (reduce overhead), especially for small messages. Flexibility in using new communication protocols is also important. To achieve this, we need to move the key parts of the network architecture (e.g. buffer management and protocol processing) from kernel level to user level, removing the boundary between the communication subsystem and application-specific protocols.

Contributions:
1. Network communication bypasses the kernel by removing the kernel from the critical path of sending/receiving messages (except during the set-up and tear-down phases).

2. Data is sent directly from the buffer space owned by the user process, which avoids the overhead of copying data to/from the kernel (true zero copy).

3. It separates the network for each user process using tags and communication channels. Different processes cannot interfere with each other. Protection is provided by the network interface itself, without involving the kernel.

Evaluation:
They build a prototype, U-Net Active Messages (UAM), on a multiprocessor system.
They show that the overhead of UAM is very small (most of it is reliability checks). A comparison with the standard kernel implementation on latency and bandwidth is provided.
The results show that for small messages, UAM does reduce latency and increase bandwidth.

Confusion: What modern systems were inspired by U-Net? Are mobile phone apps allowed to send/receive messages directly (not through the kernel)?

Summary
This paper presents U-Net, a user-level network interface that improves the latency and bandwidth of communication by moving much of the processing from the kernel to user space and giving user applications direct access to a virtual network interface, enabling custom network-protocol implementations.

Problem
In traditional networking architectures, the software path traversed by messages through the kernel involves many copies and crosses several levels of abstraction between the device driver and the user application; the resulting processing overheads limit the peak communication bandwidth and cause high end-to-end latencies. It was also observed that for small messages, processing overhead dominates network latency. U-Net reduces processing overhead by moving protocol processing from the kernel to user space and allows the communication layers used by each process to be tailored to its demands.

Contribution
The major idea was to remove the kernel from the critical path of sending and receiving messages. The U-Net architecture virtualizes the network interface and provides each process with the illusion of owning its own interface to the network. The U-Net components - endpoints (an application's handle into the network) and communication channels - together allow U-Net to enforce protection boundaries (with the help of tags) among multiple processes accessing the network. In addition, leaving buffer management to the application eliminates unnecessary copies and helps achieve zero copy.

Evaluation
The U-Net architecture was implemented on two interfaces, the SBA-100 and SBA-200. The evaluation results show improvements in latency and bandwidth over the traditional architecture across multiple message sizes. TCP and UDP implemented over U-Net achieve latencies and throughput close to the raw maximum, and Active Messages round-trip times are only a few microseconds over the absolute minimum.

Confusions
Is this model implemented anywhere? Also, won't virtualizing the network interface without any fair queuing affect the system in a distributed environment?

1. Summary
This paper introduces the U-Net communication architecture which aims to provide applications with a more flexible and faster performing interface to the network.

2. Problem
- Processing overhead (the time spent by the processor at the sending/receiving ends handling messages) dominates the latency of small messages in local-area communication, so the full throughput capability of the network is not realized. (Note that for large messages, transmission time is still the dominating factor.) It is also difficult to achieve high bandwidth when sending small messages. With small messages becoming more common due to techniques such as RPC, this was an important area of concern.

- Applications do not have direct access to the network interface and instead have to go through the kernel. This means that application-specific protocols are costly: it is difficult to add new protocols to the kernel, and protocols implemented in user space are inefficient. In particular, reliable protocols incur overhead as there is no shared buffer management (e.g. reference counting).

3. Contributions
U-Net proposes removing the kernel from the critical path of the message transmission to avoid the system call overhead and enable user-level buffer management. The following ideas are introduced in the architecture.

Endpoints
---------
An endpoint is an application's handle into the network interface. An endpoint (which may be emulated by the kernel) consists of a communication segment (which holds message data) and queues of message descriptors. To send a message, an application writes the data into the communication segment and pushes a descriptor onto the send queue. The NI picks up the message and injects it into the network. When a message is received, it is demultiplexed, and the data and descriptor are transferred to the appropriate endpoint.
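The descriptor-queue mechanics described above can be caricatured in a few lines. This is only an illustrative sketch of the idea; all names and sizes are invented here, not taken from the paper's implementation:

```python
from collections import deque

class Endpoint:
    """Toy model of a U-Net endpoint: a communication segment plus
    send/receive/free descriptor queues (names are illustrative)."""
    def __init__(self, segment_size=4096, buf_size=1024):
        self.segment = bytearray(segment_size)   # communication segment holding message data
        self.send_queue = deque()                # descriptors for outgoing messages
        self.recv_queue = deque()                # descriptors for delivered messages
        self.free_queue = deque(range(0, segment_size, buf_size))  # free buffer offsets

    def send(self, data):
        off = self.free_queue.popleft()            # grab a free buffer in the segment
        self.segment[off:off + len(data)] = data   # compose the message in place
        self.send_queue.append((off, len(data)))   # hand a descriptor to the NI

def ni_poll(ep):
    """The (simulated) network interface drains the send queue and 'transmits'."""
    sent = []
    while ep.send_queue:
        off, n = ep.send_queue.popleft()
        sent.append(bytes(ep.segment[off:off + n]))
        ep.free_queue.append(off)                  # buffer can be reused
    return sent
```

Note that the kernel appears nowhere in the send path: once the endpoint is mapped, the application and the NI communicate purely through shared memory.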

Message Tag
----------
Each message is tagged with an identifier of its endpoint. This places incoming messages into the correct channel and thus demultiplexes messages to their intended endpoints. An OS service aids the application in registering a correct tag with U-Net. Note that this helps enforce protection boundaries among multiple processes, as applications are not able to interfere with the communication channels of another process.
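The tag-based demultiplexing might be sketched as follows. In the paper's ATM implementation the tag is the virtual channel identifier; the function names below are hypothetical:

```python
from collections import deque

channels = {}  # tag (e.g. an ATM virtual channel id) -> receive queue of the owning endpoint

def register_channel(tag, recv_queue):
    # In U-Net this registration is mediated by an OS service at channel
    # setup time, so a process cannot claim another process's tag.
    channels[tag] = recv_queue

def demux(tag, payload):
    q = channels.get(tag)
    if q is None:
        return False      # no registered channel: the message is dropped
    q.append(payload)     # delivered only into the destination endpoint
    return True
```

Because the only way into a receive queue is through a tag the OS registered, a misbehaving process cannot direct traffic into another process's endpoint.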

U-Net supports two modes of data transfer.
Base level
----------
In base-level ("zero-copy") U-Net, an intermediate buffer inside the communication segment is maintained between the application's data structures and the network. The communication segment may not be usable as arbitrary contiguous memory, so the sender has to first construct the message in such a buffer; one copy remains. However, the sender has control over the size and number of buffers. As an optimization, small messages can also be held directly in descriptors to reduce buffer-management overhead.

Direct-Access
-----------
In direct-access (true zero-copy) U-Net, the sender is able to specify an offset within the destination communication segment at which the data is directly deposited. This means communication segments may need to span the entire address space of the process, which raises two problems. First, the NI has to contain memory-mapping hardware that is kept consistent with the main processor's. Second, the NI has to be able to access all of physical memory, as there may be requests touching unmapped virtual-memory pages.
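The difference between the two modes can be reduced to where the copy happens. The sketch below simulates both with plain byte buffers (hypothetical helper names; in reality the NI's DMA engine does the direct-access write):

```python
def send_base_level(segment, buf_off, app_data):
    """Base-level 'zero copy': one copy from the application's data structure
    into a buffer inside the sender's communication segment; the NI then
    transmits from that buffer."""
    segment[buf_off:buf_off + len(app_data)] = app_data   # the single remaining copy
    return (buf_off, len(app_data))                       # descriptor handed to the NI

def deposit_direct_access(dest_segment, dest_off, app_data):
    """Direct-access 'true zero copy': the sender names an offset in the
    *receiver's* communication segment, and the NI deposits the data there
    with no staging buffer on either end."""
    dest_segment[dest_off:dest_off + len(app_data)] = app_data
```

Direct access eliminates even the staging copy, but only if the NI can translate and reach arbitrary application memory, which is exactly the hardware requirement noted above.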

4. Evaluation
- The performance of U-Net on the SBA-200 is almost 3 times worse than on the SBA-100, due to the added complexity of the kernel-firmware interface on the former.

- On the Split-C benchmarks, it is observed that U-Net and the Meiko perform equivalently for applications using bulk messages, but both lag behind the CM-5 for small-message transfers.

- The U-Net versions of TCP and UDP are shown to achieve latencies and bandwidth close to the raw limits of the network.

5. Confusion
How are the non-emulated endpoints implemented? How are the ideas of this paper used today?

Summary
This paper proposes a user-level network interface aimed to provide a flexible, low-latency networking performance by allowing direct user-level access to the network hardware.

Problem
Improvements in network latency resulted in processing overheads in the kernel becoming the limiting factor for communication latency. These overheads, such as buffer management, multiple copies, and interrupt handling, are caused by the multiple layers in the software network stack. Real-world network communication is dominated by small messages, where the lack of opportunity for amortizing the overheads results in low bandwidth. Also, the networking abstractions provided by operating systems only allow standard communication protocols instead of novel protocols optimized for particular applications.

Contributions
The main contribution was to redesign the software networking stack in a standard operating system and on top of commodity hardware. The authors aim at presenting a virtualized, singly-owned network interface to applications, and moving the multiplexing/demultiplexing required into the interface itself. Their approach involved the use of endpoints to allow applications to directly interface with the network, and thus eliminated the kernel from the critical path. It guarantees protection through OS involvement at start-up and tear-down of the communication path, and by limiting access to endpoints and message queues to the owning process. The number of copies of message data being maintained are reduced to one in their base-level implementation and zero in their direct-access model.

Evaluation
The performance of the proposed architecture is demonstrated using SPARCstations running SunOS with Fore Systems ATM interfaces, in both the kernel-emulated and direct-access modes. The U-Net bandwidth and round-trip times are reported for various message sizes, with bandwidth reaching its peak at 512-byte messages. Split-C application benchmarks are used to show that U-Net's network performance is comparable with the optimized network stacks of supercomputers such as the IBM SP-2 and Thinking Machines CM-5. The authors also report the performance of the TCP and UDP protocols as implemented on their U-Net model.

Confusions
Has this been implemented commercially? If not, what is wrong with their approach, which seems to follow some of the philosophies of the exokernel model?

1. Summary
This paper describes the U-Net interface. It is a virtualization interface for network devices that provides every process with the illusion that it owns the network device.

2. Problem
As the physical network links progressively become faster, the total transmission time is dominated by the processing time the message takes in the network stack on both the sender and the receiver. To enable faster communication it is essential to reduce this processing overhead.

3. Contributions
This paper introduces the U-Net framework. It abstracts the network devices such that every process believes it is the owner of the network devices on the system. It also bypasses the kernel when sending and receiving messages, which reduces processing overhead and gives the application flexibility. In U-Net, each process has a communication endpoint, which is just a region in its address space. The endpoint contains a communication segment that is used to buffer incoming and outgoing messages. It also has send, receive, and free queues, which hold descriptors pointing into the communication segment (the free queue lists buffers available for incoming messages). Since U-Net avoids trapping into the kernel for network I/O, U-Net is responsible for multiplexing and demultiplexing messages between the processes in the system. This entails implementing isolation between the various processes participating in the communication, which U-Net achieves with the help of tags that identify each incoming message's destination endpoint. U-Net is also compatible with existing protocols like TCP and UDP.

4. Evaluation
The authors perform a series of experiments to measure the latency and throughput of their U-Net system. They also provide comparisons with standard implementations that go through the operating system kernel, and observed considerable gains in both metrics when using U-Net.

5. Confusion
The paper does not go into the details of what the tag looks like and how it is mapped to messages. The paper mentions that a collection of operating systems computes the routes between the nodes, but it does not say anything about how this happens or why it is the OS that has the job of computing the routes.

Summary

High-speed networks started hitting bottlenecks as workloads leaned increasingly towards small messages, for which processing overhead was more significant than transmission overhead. This paper presents U-Net, an architecture that speeds up processing of small network messages by eliminating the kernel from the communication path, while increasing flexibility and providing protection. The U-Net prototype evaluated here succeeds in providing low-latency, flexible communication with bandwidth close to the network's raw capacity.

Problem

Network calls in a traditional UNIX environment are handled entirely in the kernel, which requires trapping, copying data from user-space buffers to kernel buffers, and then performing additional operations. This is acceptable on non-local networks, or with large data packets, because transmission time outweighs processing time. However, networks are increasingly carrying large numbers of small packets on local links. Since UNIX architectures typically implement the networking protocol stacks as part of the kernel, these stacks become slow, inflexible and unable to meet the demands of specific applications.

Contributions

The main contribution of this paper is the U-Net architecture, which provides a simple user-level communication interface independent of the actual hardware. U-Net moves most networking functionality to user-space and multiplexes the hardware network interface for applications. Moreover, it does all of this using existing hardware, without requiring OS modifications and interfaces efficiently with existing protocols.

Evaluation

The report provides extensive evaluations of the U-Net architecture. The authors measure two separate implementations on Fore Systems ATM interfaces, in the second case reprogramming the firmware to implement U-Net directly. The ATM tests perform very well; with large packets it runs at full bandwidth, although with smaller packets it suffers a bit from the overhead of implementing reliability and adhering to the communication segment size. They also provide evaluations of commonly-used protocols such as TCP, UDP and IP over U-Net, which again provide low-latency communications that maximize the available bandwidth.

Confusions

My biggest confusion is about the role the kernel plays in U-Net. The authors claim they are trying to remove the kernel from the communication path -- but it cannot be removed completely, at some low level they still need it to talk to the hardware. I'm confused about where exactly this separation takes place.

1. Summary
This paper summarizes the U-Net communication architecture, which separates network protocols from the kernel for increased bandwidth and decreased latency as seen by an application. The mechanisms of data transfer in U-Net are detailed, and the flexibility of the system is demonstrated by several example protocols.

2. Problem
In the past, UNIX systems implemented networking protocols directly into the kernel. This meant that messages passing from a driver to a final application involved several unnecessary copies. Previously, this overhead had not been an issue as network speeds were the bottleneck. However, new high speed local networks meant this extra processing became a problem. As discussed in previous papers, most messages in this scenario were small in size and required low latency and high bandwidth. Current kernel networking implementations did not provide this. Additionally, the baked-in nature of networking protocols meant that new protocols had a barrier to entry; these protocols could be implemented in user-space, but would incur a large overhead due to the necessary kernel communication.


3. Contributions
The major contribution of the paper is the bypassing of the kernel for network operations. With proper hardware, a process and device can communicate directly, eliminating the need for unnecessary copying. The kernel only needs to handle channel set-up and tear-down to ensure proper protection. After the channel is set up, no kernel operations are necessary.
This is accomplished through the use of U-Net endpoints. An endpoint contains a memory region called a communication segment, and a number of message queues which hold descriptors for messages. Messages are sent by simply writing to the communication segment, and pushing a descriptor onto the send queue. The NI will then handle sending the packet from that location. Messages received by the device are demultiplexed (using some specific message tag), and written to the correct communication segment. The descriptor is then pushed onto the receive queue.

4. Evaluation
The paper provides a great deal of evaluation to show the success of their implementation. They first build a prototype called U-Net Active Messages (UAM) which provides asynchronous communication on a multiprocessor system. They show that the overhead of UAM is very small, and mostly due to reliability checks. The authors also present “ports” of UDP and TCP, which compare favorably in latency and bandwidth with their kernel implementations. In all cases, they show that they approach raw latencies and bandwidth rates, yet also allow new protocols like UAM to be implemented simply and efficiently.


5. Confusion
I’m confused on how TCP and UDP are actually implemented. If I write an application that needs to use TCP, how do I do this? Am I including some TCP library in my application, which will then create an end-point in my application?
Or am I using some IPC to send messages to a single TCP “handler”, which owns a single endpoint and handles all calls to the network?

1. Summary
In the paper "U-Net: A User-Level Network Interface for Parallel and Distributed Computing", the authors present a communication architecture that provides low-latency local-area communication, facilitates the use of novel application-specific protocols by removing the kernel from the communication path, and offers a high degree of flexibility by moving the entire protocol stack to user space.

2. Problem
With the increasing availability of high-speed local-area networks, the bottleneck in local-area communication has shifted from the limited bandwidth of the network to the processing capability of the end hosts. In particular, for small messages the processing overheads dominate, so further improving transmission time yields little benefit. Additionally, when the kernel does network processing, it decreases the flexibility of the system, as application-specific knowledge cannot be accommodated.

3. Contributions
- optimizing the processing overhead for small messages (citing examples of systems using small messages like RPC)
- virtualize the network interface and provide each process the illusion of owning the interface to the network (by providing a handle into the network - endpoint, consisting of communication segments and message queues)
- remove the kernel from the critical path of sending and receiving messages except for the initial setting up
- avoid the delay of copying data between user space and the kernel by using a shared buffer area allocated by the kernel and used directly by the user process
- Base-level U-Net supports "zero copy", i.e., a single intermediate copy into a networking buffer; direct-access U-Net supports true zero copy, i.e., no intermediate copy between the application's data structures and the network
- provides protection among multiple processes using tags and communication channels, so that an application cannot interfere with the communication channels of other applications on the same host

4. Evaluation
The authors have implemented U-Net on SPARC workstations running SunOS, using the SBA-100 and SBA-200 interfaces. They measure the end-to-end round-trip time of a single-cell message to be 66 microseconds. From the cost breakdown of the send, it can be observed that the send and receive overhead is due to the inability of the interface to do DMA, segmentation and reassembly of multi-cell packets, CRC calculation, etc., requiring implementation in the kernel. Using the SBA-200 interface, the measured round-trip time was 160 microseconds, which they attribute to the complexity of the kernel-firmware interface. On comparing U-Net UDP against kernel UDP, it can be seen that kernel UDP suffers from packet-loss problems (due to kernel buffering issues). Similarly, for TCP, U-Net TCP achieves higher bandwidth and fast round trips even for small window sizes.

5. Confusions
It would be helpful if we have a discussion of kernel emulated endpoints with an example.
Will U-Net actually outperform RPC message communication when compared with other communication protocols designed for RPC?

Summary
This paper describes the motivation and implementation of U-Net, a communication architecture designed to reduce the overhead of sending small messages. The basic idea is to move the network interface out of the kernel into user space and provide each process with a virtual network interface, giving it the illusion that it is the only one communicating on the network. U-Net is a flexible architecture, meaning higher-level protocols can be implemented on top of it that better exploit application-level knowledge, as can older protocols such as TCP and UDP.
Problem
U-Net is an architecture that aims at solving the problem of inflexible protocols and high overhead for small messages. In traditional network interfaces, because the protocol processing was in the kernel, the system cannot easily support new protocols that take advantage of application knowledge. Additionally, when sending a message the data may be copied multiple times into intermediate buffers, which increases the processing overhead and increases the latency.
Contributions
U-Net provides a virtual network interface to each process. This virtualization is achieved through a handle into the network called an endpoint, which is made up of a communication segment and message queues. The communication segment holds the message data to be sent, and the queues point into the communication segment when sending and receiving data. The process simply pushes a descriptor onto the send queue when it wants to send data, and receives data via an upcall from U-Net, which detects new data in the receive queue through polling. U-Net provides multiplexing and demultiplexing of messages based on the destination of the message, which is just a “tag” on the message. When sending data, U-Net attempts to use “true zero copy”, in which the data is sent directly from the application's data structures. If the data cannot be addressed by the I/O device, U-Net must resort to “zero copy”, which copies the data into an intermediate network buffer before sending it. Additionally, an Active Messages implementation is described, a higher-level protocol that provides flow control and reliability.
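The receive-side upcall mentioned here resembles Active Message dispatch, which might look roughly like this (an illustrative sketch, not the paper's code; the handler table standing in for the handler addresses that real Active Message headers carry):

```python
from collections import deque

def dispatch(recv_queue, handlers):
    """Drain the receive queue, upcalling the handler each message names.
    In Active Messages, every message header carries the identity of a
    handler to run on arrival; here a dict maps handler ids to callables."""
    while recv_queue:
        handler_id, payload = recv_queue.popleft()
        handlers[handler_id](payload)   # upcall into the application
```

Running handlers directly on arrival, rather than buffering messages for a later receive call, is what keeps Active Message latency so close to the hardware minimum.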
Evaluation
The paper compares the efficiency of the U-Net communication architecture to the kernel implementation and finds that both the TCP and UDP implementations were significantly quicker. U-Net TCP had a round-trip latency of about 400 microseconds, whereas Fore TCP had a round-trip latency closer to 1400 microseconds. Additionally, U-Net TCP provides more bandwidth than Fore TCP, about 16 Mbytes/s as opposed to about 6 Mbytes/s. The paper also took many measurements of U-Net with different protocols, with UDP and TCP having the largest latencies of the protocols measured.
Confusions
The Two U-Net implementations portion of the paper didn’t make much sense to me. What are the differences between the two implementations and what is the purpose of implementing the protocol on different hardware? Also, how does the operating system assist in “route discovery”? Is this separate from the U-Net protocol?

Summary :
This paper presents U-Net, which is a user-level network interface to improve the performance and latency of communication in a parallel or distributed networking setting.

Problem :
The speed of local-area networks has improved greatly, making the software path that messages traverse the bottleneck. This is because it involves making several copies of each message in the UNIX networking architecture and traversing many levels of abstraction. Most of this overhead is unnecessary, especially when small messages are being sent across the network. The authors propose to remove this bottleneck by decoupling the protocols from kernel space, moving them into user space, and giving user level direct access to the network interfaces.

Contributions :
1. Decoupling the network protocol implementation from the kernel to make it more flexible from the application’s point of view.
2. Moving the buffer management of outgoing messages into user space. This gives the application much more freedom in how it buffers messages, and in deciding whether copies are needed (e.g., for retransmission).
3. Giving the application direct access to the network interface, and virtualizing the networking layer. This way, messages sent and received do not have to go through the kernel, which eliminates a lot of message copying.
4. Using tags and channel identifiers, which provide a pairwise protection mechanism between the application and the network interface. This is important because the kernel no longer handles the messages, so the NI has to maintain protection boundaries between applications.

Evaluation :
The authors have evaluated U-Net on two interfaces, the SBA-100 and SBA-200. The round-trip time for a single message is 66 us on the SBA-100; on the SBA-200 it was about 160 us, which the authors attribute to the complexity of the latter's message data structures. The authors have also implemented their version of Active Messages, UAM, for which the measured RTT was ~71 us. Block store bandwidth is also measured, and reaches about 80% of the limit with blocks of 2 Kbytes.

What I found confusing :
Is U-Net used anywhere now? What kinds of applications are similar to this design?

Summary:
The paper argues that by leaving networking up to the kernel, the user loses important control over how network traffic is handled. In some cases, a user space application may know better how to deal with certain kinds of traffic allowing it to reduce per-message overheads. The authors note the issues encountered when moving the network stack into user space and discuss their implementation of U-Net.

Problem:
The paper argues that low latency network communication is becoming more important and cannot be offered by current kernel implementations of the network stack. The authors suggest that removing the kernel from the critical path will allow network applications to achieve much lower latency. Problems then arise as to how user level programs can share the network resource.

Contributions:
U-Net is an implementation of user-level networking, allowing applications as close to direct IO access as possible. Applications request U-Net endpoints, which serve as their interaction with the network device. Programs modify and read this endpoint in order to send and receive packets. The network device firmware is modified then to support reading from each of these U-Net endpoints.

Evaluation:
The authors provide some basic numbers on the overheads of their system, and give a comparison of the user level approach vs. the standard kernel implementation showing that U-Net does indeed reduce latency and increase bandwidth for smaller message sizes.

Confusion:
I am a bit surprised that their system outperforms kernel implementations of TCP to the extent that their graphs show. The message sizes they use seem not to be that small; around 1500 bytes is a rather standard MTU, and even there they show speedups of around 2-3x for TCP and UDP. It seems totally bizarre to me that reducing per-packet overheads can account for this sort of speedup. For a 3x speedup, more than 2/3rds of the time spent sending a packet would have to be due to kernel overhead. That just seems really high to me, but it’s possible that it is in fact the case.

1. Summary
This paper presents U-Net, a communication architecture that moves most network protocol processing out of the kernel. This is mainly to reduce the latency of small messages and achieve network saturation even with small messages. User space implementation gives the flexibility to modify protocols through application-specific knowledge. U-Net is implemented using commercial HW and is compatible with existing protocols such as TCP/IP.


2. Problem
Existing OSs implement all protocol processing within the kernel. This increases per-message overhead due to multiple memory copies, system calls, etc. This is particularly bad for small messages, which are very frequent. U-Net moves all protocol processing to user space and removes the kernel from the critical path of sending and receiving messages.

3. Contributions
The kernel and hardware combine to provide each process a virtual network interface; the amount of kernel involvement depends on how sophisticated the hardware is. Each application has an endpoint, which is its handle into the network. An endpoint consists of a communication segment that stores messages and a set of queues for send/receive/free message descriptors. Since endpoints cost memory, the kernel can allocate a single endpoint and emulate its functionality across multiple processes. The network interface multiplexes and demultiplexes messages to the appropriate process using a message tag. Before communication begins, an OS routine performs route discovery, sender authentication, and tag registration, and returns a channel ID to the user for future communication. Dedicated endpoints and channel IDs also prove useful for providing isolation. With sophisticated hardware where the network interface has address translation, U-Net can achieve true zero-copy communication, in which the application’s data is sent directly without any intermediate buffering. Two implementations of U-Net are presented; in the second, the network card has a small processor and local memory, and the endpoint is physically partitioned across host and network-interface memory to minimize I/O. Mapping the endpoints into memory could become a scalability bottleneck as the number of systems increases; this is not explained satisfactorily.

4. Evaluation
The latency and network bandwidth for small messages are compared for the kernel and U-Net implementations. U-Net is significantly better and achieves latency comparable to the bare transmission time, increasing only slightly when higher-level protocols are implemented on top of U-Net. It is also compared against customized network hardware+software implementations using Split-C benchmarks. The results are a little ambiguous, but seem to suggest that U-Net gets into the same ballpark using commodity hardware.

5. Confusions
How does server naming and identification occur? Or, is it application-specific?
How much of U-Net’s design is specific to workstation or distributed shared memory systems (on which this paper is based)?

Summary:
The paper describes Lightweight Remote Procedure Call, a communication facility designed and optimized for communication between protection domains on the same machine.

Problem:
The conventional RPC approach has high overhead for cross-domain communication and leads to performance loss. LRPC achieves high performance for cross-domain communication while retaining safety and transparency.

Contribution:
1. The control transfer is achieved by the client's thread executing the requested procedure in the server's domain.
2. The LRPC stub generator generates run-time stubs in assembly language directly because of the simplicity and stylized nature of LRPC stubs.
3. In multiprocessor environment, LRPC minimizes the shared data structure on the critical domain path and reduces context-switch overhead by caching domains on idle processors.
4. A shared argument stack, accessible to both client and server, eliminates redundant data copies.

Evaluation:
LRPC was implemented and integrated into Taos. The paper first gives experimental data on the frequency of cross-machine activity, parameter size and complexity, and the performance of current cross-domain RPC. RPC latency is evaluated with the Null procedure call. LRPC shows about 3X better performance across different parameter-size settings, and increases throughput on shared-memory multiprocessors by avoiding locking shared data during the call.

Confusion:
The paper mentions in the "Binding" part that the kernel can detect a forged Binding Object. I wonder how that is implemented.
