U-Net: A User-Level Network Interface for Parallel and Distributed Computing
Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, Copper Mountain Resort, Colorado, December 1995, pp. 40-53.
Reviews due Tuesday, 2/28
Comments
1. Summary
The paper makes the case for moving toward a user-level networking architecture in modern systems; the system developed is called U-Net. The authors explore the various aspects of user-level networking, outlining its major challenges and benefits, and provide a detailed analysis of the developed system, looking into performance while also discussing limitations of the current implementation. All in all, the paper does a good job of convincing the reader that removing the kernel from the critical path of communication is a good idea.
2. Problem
The paper is motivated by the availability of higher-bandwidth, lower-latency networks, which shifts the bottleneck of communication from the wire to the various overheads of network processing happening inside the kernel. The authors also argue that making network protocols more application-aware can lead to better coordination between computation and communication. The authors attempt to solve three main problems:
a. Provide low-latency communication in a local area setting
b. Exploit the full network bandwidth even with small messages
c. Facilitate the use of novel communication protocols
3. Contributions
This work has been hugely influential on network hardware design for virtualized environments (doing application-level demultiplexing using message queues inside the network card). In my opinion, some of the concepts in the paper form the basis of many modern network features such as RDMA (foreshadowed by the direct-access U-Net architecture). It also seems to have formed the basis of the idea behind the development of Arrakis.
4. Evaluation
The authors do a thorough evaluation of their system. Since the evaluation is spread throughout the second half of the paper and interleaved with the implementation details, it is easy to connect the authors' intentions with the actual outcomes, making the paper a pleasure to read. The authors developed multiple versions of U-Net on ATM hardware, since they were mostly targeting the HPC scenarios of their time. Using benchmarks, they show that U-Net works well and achieves its design objectives.
5. Confusion
The paper seems to imply that a microkernel approach to networking is better. We have read many papers that argue microkernels are better, so why aren't any of today's popular OSes microkernels?
Posted by: Akhil Guliani | February 28, 2017 08:30 AM
1. Summary
This paper describes the implementation and performance of U-Net, a system that exposes network communication primitives at the user level rather than the kernel level. U-Net aims to improve latency and throughput by minimizing the processing overhead at the ends for each packet.
2. Problem
Low latency is especially important in local-area networks, such as in fault-tolerance algorithms and RPC in distributed systems. High throughput is especially important for small messages, which often fail to exploit the full network bandwidth. Even as network fabrics allow for higher throughput, packet processing time at both ends limits both latency and throughput. Furthermore, it is difficult to develop new protocols when the network interface is only available inside the kernel. The authors propose that the solution to all of these problems is to expose the network interface at the user level, while performing as few copies of the data as possible.
3. Contributions
The authors implement U-Net on SPARCStation machines with Fore Systems ATM interfaces running a SunOS 4.1.3 operating system. Network messages are sent and received in user space, but must still be multiplexed to different applications. Therefore, each application has a single endpoint, containing a send queue, receive queue, and free queue. The queues themselves contain descriptors into a communication segment, a region of memory that contains the message data. Hence, network communication can be true zero-copy IO, with no buffering between the application data structures and the network interface. For applications that do not want to implement their own network protocols, the kernel can emulate U-Net endpoints.
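The endpoint structure described above (communication segment plus send, receive, and free queues holding descriptors) can be sketched in a few lines of Python. This is a toy model of the design, not U-Net's actual API; all names and sizes are illustrative.

```python
from collections import deque

class Endpoint:
    """Per-application handle into the network (sketch of the U-Net design).

    The communication segment holds the message bytes; the queues hold small
    (offset, length) descriptors pointing into that segment, so no kernel
    copy sits between the application and the network interface."""
    def __init__(self, segment_size=4096, buf_size=256):
        self.segment = bytearray(segment_size)   # communication segment
        self.send_queue = deque()                # descriptors of outgoing messages
        self.recv_queue = deque()                # descriptors of arrived messages
        # free queue: buffer offsets the NI may fill with incoming data
        self.free_queue = deque(range(0, segment_size, buf_size))

    def post_send(self, data: bytes, offset: int):
        """Compose data in the segment, then enqueue a descriptor for the NI."""
        self.segment[offset:offset + len(data)] = data
        self.send_queue.append((offset, len(data)))

ep = Endpoint()
ep.post_send(b"hello", offset=0)
assert ep.send_queue[0] == (0, 5)
```

The key point the sketch shows is that the application and the network interface share only the segment and the descriptor queues; the kernel appears nowhere on the data path.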
The authors also implement U-Net Active Messages, a network protocol that overlaps communication with computation in a multiprocessor.
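The core Active Messages idea is that each message names a handler that the receiver invokes immediately on arrival, which is what lets communication overlap with computation. A minimal sketch (handler registry and dispatch names are my own, not U-Net Active Messages' API):

```python
# Sketch of the Active Messages idea: a message carries the index of a
# handler, and delivery means invoking that handler on the payload at once,
# with no separate receive/dispatch step.
handlers = []

def register_handler(fn):
    """Register a handler and return its index, used as the message tag."""
    handlers.append(fn)
    return len(handlers) - 1

def deliver(msg):
    """What the receiving side does when a message arrives."""
    handler_id, payload = msg
    handlers[handler_id](payload)   # run the named handler directly

results = []
ACC = register_handler(lambda p: results.append(sum(p)))
deliver((ACC, [1, 2, 3]))           # the "network" delivers one message
assert results == [6]
```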
4. Evaluation
The paper includes several macrobenchmarks and microbenchmarks. U-Net provides higher throughput for UDP than the Fore implementation, which only reaches maximum throughput for messages with 8000 bytes or more. U-Net also achieves nearly double the throughput for TCP than the Fore implementation. U-Net TCP and UDP also have much lower latency than Fore TCP and UDP, respectively. Microbenchmarks and the Split-C macrobenchmark show that U-Net Active Messages perform even better than TCP and UDP.
5. Confusion
(1) The terminology in the paper was very confusing. What are ATM interfaces and cells?
(2) It is not very clear why the system is useful for parallel and distributed computing.
Posted by: Varun Naik | February 28, 2017 08:02 AM
1. Summary
This paper introduces U-Net, a communication architecture that allows network protocols to be implemented at the user level without kernel intervention. Such an architecture is highly flexible and can improve performance by avoiding the copies of each message that would otherwise occur across the multiple abstractions between the device driver and user applications.
2. Problem
Even with improvements in network bandwidth through advances in networking hardware, network latency was limited by the high processing overhead incurred by buffer management, multiple copies, interrupt handling, etc., at the kernel level on both the sending and receiving devices. This overhead is dominant when transmitting small messages, which were becoming a major part of network communication.
Using application-specific knowledge in the design of protocols was not easy because the entire network stack was implemented in the operating system kernel.
3. Contributions
The main contributions are
a. U-Net, a simple user-level communication architecture independent of the network interface hardware
b. Handling of multiplexing and demultiplexing in the network interface, and moving all buffer management and protocol processing to user level
c. Virtualizing the network interface, i.e., each process is given the illusion of owning the network interface, with emulated endpoints provided by the kernel
d. Two levels of support, namely: 1. base-level U-Net, which involves an intermediate copy into a network buffer owing to limitations of I/O bus addressing, and 2. direct-access U-Net, true zero-copy with no intermediate copying into buffers
e. Efficient implementation of traditional TCP and UDP protocols
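The difference between the two support levels in item d comes down to one staging copy. A toy contrast in Python, with all names purely illustrative:

```python
# Base-level: the application must copy its data into a pinned network
# buffer (the communication segment) before the NI can send it -- one copy.
# Direct-access: the NI reads application data structures in place -- zero
# copies. Both paths put the same bytes on the wire.

app_data = bytearray(b"payload")

# base-level path: one intermediate copy into the communication segment
comm_segment = bytearray(len(app_data))
comm_segment[:] = app_data          # the copy base-level U-Net cannot avoid
sent_base = bytes(comm_segment)

# direct-access path: the NI would DMA straight from app_data, no staging
sent_direct = bytes(app_data)

assert sent_base == sent_direct == b"payload"
```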
4. Evaluations
The U-Net architecture is evaluated with two implementations on ATM NICs, the SBA-100 and SBA-200. Round-trip times were compared against a traditional network implementation and found to be better. The authors identify the evaluation of memory requirements as important but leave it for future work. They used Split-C application benchmarks to compare U-Net with other architectures across different message sizes. They also demonstrate higher performance for TCP and UDP implemented on the base-level U-Net.
5. Confusion
Can you shed some more light on the Active Messages protocol?
Posted by: Lokananda Dhage Munisamappa | February 28, 2017 07:58 AM
Summary
U-Net provides low latency network communication and a high degree of flexibility by moving almost all components of network stack out of the kernel into user space while providing full protection.
Problem
The increased availability of high-speed local area networks has shifted the bottleneck in local-area communication from the limited bandwidth of network fabrics to the software path traversed by messages at the sending and receiving ends. This is particularly a problem for small-message communication. Many new application domains could benefit not only from higher network performance but also from a more flexible interface to the network, and processing protocols in the kernel makes it difficult to experiment with new ones.
Contributions
The U-Net architecture is designed with three main goals in mind:
1. Low latency communication
2. Exploit full bandwidth even with small messages
3. Facilitate the use of novel communication protocols
The U-Net architecture gives each process the illusion of owning the network interface. It consists of building blocks: endpoints, communication segments, and message queues. Endpoints act as an application's handle into the network. Communication segments are regions of memory that hold message data, and message queues hold descriptors for messages that are to be sent or that have been received. Endpoints can also be emulated and serviced by the kernel, with some performance loss. U-Net saves one extra message copy because the communication segment acts as the network interface card's buffer. The network interface card finds the destination process for a received message according to the message header, grabs a descriptor from the destination process's free queue, puts the message into the process's communication segment, and creates a descriptor in the process's receive queue. The process is notified either by polling or by an interrupt.
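The receive path just described, where the NI demultiplexes by tag, takes a buffer from the free queue, fills it, and posts a receive descriptor, can be simulated in a few lines. This is a sketch of the design, with the routing-table and function names being my own assumptions:

```python
from collections import deque

BUF = 64  # fixed receive-buffer size in the communication segment (illustrative)

class Endpoint:
    def __init__(self):
        self.segment = bytearray(4 * BUF)            # communication segment
        self.free_queue = deque([0, BUF, 2 * BUF, 3 * BUF])
        self.recv_queue = deque()

# NI routing table: channel tag -> destination endpoint
route = {}

def ni_receive(tag, data: bytes):
    """What the network interface does on packet arrival, per the paper's design."""
    ep = route[tag]                       # demultiplex by the message's tag
    off = ep.free_queue.popleft()         # grab a buffer from the free queue
    ep.segment[off:off + len(data)] = data
    ep.recv_queue.append((off, len(data)))  # notify via the receive queue

ep = Endpoint()
route[0x17] = ep                          # channel set up (by the kernel, in U-Net)
ni_receive(0x17, b"ping")
off, n = ep.recv_queue.popleft()          # the process polls its receive queue
assert bytes(ep.segment[off:off + n]) == b"ping"
```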
Evaluations
U-Net is implemented on SPARCstations running SunOS 4.1.3, using two generations of Fore Systems ATM interfaces. Using U-Net, the round-trip latency for messages smaller than 40 bytes is about 65 µs. The authors use Split-C benchmarks to show U-Net's efficiency by comparing a U-Net ATM cluster against the CM-5 and Meiko CS-2. They also build TCP/IP and UDP/IP on top of U-Net; U-Net achieves latencies and throughput close to the raw maximum, and Active Messages round-trip times are only a few microseconds over the absolute minimum.
Confusions
As most of the network components are moved to user space, what are the parameters to consider for providing protection/isolation in general? The paper is also not very clear on how U-Net takes care of this.
Posted by: Om Jadhav | February 28, 2017 07:21 AM
1. Summary
This paper introduces U-Net, a user-level network interface that removes the OS kernel from the data transmission path, thus improving throughput and reducing latency. This network interface also provides the flexibility to build new network protocols, or to change existing ones, for better performance.
2. Problems
Network bandwidth has improved due to new network fabric technology, but the processing overhead of the OS kernel has not, so applications fail to speed up on these faster networks. This problem is most obvious for small-message latency, where OS overhead may dominate the transmission time. Since small messages are widely used in systems like RPC and NFS, it becomes important to reduce the overhead of network data transmission.
3. Contributions
The key idea of U-Net is removing the kernel from the critical path of sending and receiving messages while still providing full protection. At connection-setup time, the OS kernel creates a U-Net endpoint, registers a tag for future message multiplexing, and sets up a communication channel between the network interface (NI) and the process. Processes then send and receive messages using communication segments and queues, communicating with the NI directly without the kernel. The network interface, with a multiplexing agent, can handle sends and receives for multiple processes; this mechanism is very similar to Arrakis. U-Net also provides optimizations like zero copy to further improve performance. U-Net makes it possible to implement new network protocols at user level; this paper shows a U-Net Active Messages protocol. Existing protocols can also be reimplemented on U-Net; this paper tries TCP and UDP. All of these protocols perform well with the help of U-Net.
4. Evaluation
The evaluation of U-Net is scattered across different sections of the paper. It first shows the latency and bandwidth for small messages using U-Net, with a latency of 65 µs for a 40-byte message; the network is saturated with messages as small as 800 bytes. Using different workloads and settings, the paper shows the overhead is relatively small compared to previous work. The paper also reexamines TCP and UDP using U-Net: U-Net UDP reaches its highest bandwidth at a much smaller message size, and U-Net TCP roughly doubles the bandwidth. Latency for U-Net TCP/UDP is also much lower than for the conventional implementations. Active Messages on U-Net achieve a latency within just a few microseconds of the absolute minimum.
5. Confusion
(1) This paper is very similar to the Arrakis paper; are there any major differences between the two works? To me, Arrakis seems like a new version of U-Net using more advanced hardware, but the ideas behind the two projects are very similar.
(2) I am not clear on how the Split-C benchmark relates to this U-Net work. The paper says Split-C is a programming language for a global address space abstraction, so what is the connection between Split-C and U-Net?
Posted by: Tianrun Li | February 28, 2017 07:17 AM
1. Summary
The paper talks about U-Net, which aims to provide efficient low-latency communication as well as a more flexible interface to the network. The basic idea is to remove the kernel from the network path so that the number of copies of the data being transmitted/received and the number of expensive system calls can be reduced. This allows applications to utilize peak network bandwidth and results in high performance gains.
2. Problem
Earlier, low network bandwidth used to be the bottleneck for communication over the network. But with the increased availability of high-speed networks, the processing of data in the kernel has become the dominant overhead. Thus, upgrading from a low-speed to a high-speed network does not give users a proportional performance gain.
3. Contributions
This paper proposes a simple user-level architecture that virtualizes the network device and allows each process to use it directly. Each process has a separate endpoint, which acts as its interface to the network. An endpoint consists of a communication segment, which stores the data to be transmitted and the data received by the process, plus send, receive, and free queues, which store descriptors for the messages in the communication segment. For multiplexing/demultiplexing of messages, processes register tags with U-Net by creating communication channels. Endpoints and channels ensure that no process can interfere with the messages of another process and thus provide protection. U-Net can be operated at two levels: the base level is similar to existing networks and copies data into buffers before moving it to the process's address space, while direct access allows true zero copy, in which messages are transferred directly to and from the process's address space.
4. Evaluation
The U-Net architecture has been tested using the Fore SBA-100 and Fore SBA-200 interfaces. Poor performance of Fore's original firmware led to a complete redesign of the SBA-200 firmware for the U-Net implementation. Split-C application benchmarks were also used to compare the performance of the CM-5, the U-Net ATM cluster, and the Meiko CS-2.
5. Confusion
>> How can the sender specify the offset in the destination communication segment?
>> Why is this architecture not in use today?
Posted by: Gaurav Mishra | February 28, 2017 03:27 AM
Summary:
U-Net is a communication architecture focused on minimizing the processing overhead involved in communicating small messages over high-speed local area networks. The design removes the kernel from the critical path of communication and moves buffer management and protocol processing to user level. This allows application-level optimizations and easy construction of new protocols. Moreover, the design is implemented on standard workstations using off-the-shelf hardware.
Problem:
The majority of communication over the network involves small messages, for which processing overhead dominates transmission time. The availability of high-speed LANs has increased the need to reduce processing overhead to achieve efficiency. Existing systems make it hard to support new protocols by placing all protocol processing in the kernel. The kernel spends a lot of time on buffer management, message copies, flow-control handling, etc. This can be minimized if most of these tasks are moved to user level, thus reducing kernel intervention.
Contributions:
> A communication architecture that achieves low latency and high bandwidth for sending/receiving small messages.
> Kernel is removed from the critical path of messages to reduce overhead.
> Buffer management is moved to user level which allows application specific optimizations.
> Protocol processing in user-level enables new protocols to be added efficiently.
> NI is virtualized to provide each process the illusion of owning the interface. The NI does the necessary multiplexing/demultiplexing of messages to the correct processes thus ensuring protection.
> Every process that wants to access the network has an associated endpoint, which is its handle into the network. It contains a communication segment to buffer sent/received messages, and send, receive, and free queues, which contain descriptors for these buffered messages.
> U-Net also allows true zero-copy: transferring data directly out of the sender's data structures, without intermediate buffering, and placing it in the destination process.
> The traditional TCP and UDP protocols can be implemented efficiently.
Evaluation:
The paper describes two implementations of U-Net, on the SBA-100 and SBA-200. U-Net Active Messages are implemented and evaluated against raw U-Net. The Split-C application benchmarks are used to compare the performance of the CM-5 and Meiko CS-2 with the U-Net version of Split-C. The paper also shows the higher performance of TCP and UDP in their U-Net implementations.
Confusion:
I need more clarity on how tags are used for demultiplexing in the NI.
Posted by: Pallavi Maheshwara Kakunje | February 28, 2017 02:54 AM
Summary: The paper argues that the two most crucial factors contributing to the performance of network communication are transmission time (sending and receiving packets) and processing time (such as in multiplexing/demultiplexing). Given that network bandwidth was improving rapidly at the time, the paper focuses on optimizing processing time. It presents a new design, U-Net, which allows building protocols at application level instead of system level, thus reducing kernel overheads, improving network communication time, and making it easier to implement new protocols.
Problem: Operating systems' processing of network communication was not able to advance at the same pace as network bandwidth. The major problem lay in packets traversing the kernel path, causing multiple copies of the data at several points and traversal of many layers of OS abstraction between device drivers and application programs. These overheads were most visible for small messages, which constitute a large part of network communication. Systems in such cases could not fully utilize the high-speed network because of the lag in processing at the system level.
Contributions: 1. Introduced an application-level network interface architecture, providing more flexibility to user programs.
2. Reduces kernel overheads and system calls; provides greater control to the application level and allows applications to use their own optimization techniques.
3. Provides each process the illusion that it has its own network interface.
4. Each process has an endpoint which acts as a handle to network. Messages are composed in communication segments and managed through queues using descriptors. Destination endpoint is identified by incoming message tag.
5. Has two modes: 1) Direct-access, which transfers messages directly to and from application data structures using descriptors; 2) Base-level, which requires copying data into a buffer before transmission.
6. Places an emphasis on optimizing small messages; very small messages can be stored directly in the descriptors, without using the communication segment.
Evaluation: The design is evaluated with two U-Net implementations on separate ATM NICs, the SBA-100 and SBA-200. Round-trip times were measured and found to be better than with a traditional network implementation. The paper also evaluates the design for varying message sizes; U-Net performs better in most scenarios, except in cases such as radix sort with small message sizes. The paper then tests TCP and UDP with the U-Net design. It also does a reasonably good job of describing the overheads associated with the implemented design.
Confusion: How is resource fairness achieved when there are a large number of endpoints?
Posted by: Rahul Singh | February 28, 2017 02:45 AM
Summary
The paper proposes U-Net, a communication architecture that provides processes a virtual view of a network interface to enable user-level access to high-speed communication devices. U-Net aims to reduce the processing overhead required to send and receive messages and to provide flexible access to the lowest layer of the network.
Problem
There are two main problems U-Net is trying to address.
The first problem is communication performance, with a focus on the processing overhead for small messages. With more and more high-bandwidth local-area networks available, the software path traversed by messages at the sending and receiving ends became the performance bottleneck. The development of distributed computing put more emphasis on small packets, where processing overheads dominated performance and the improvement in transmission time was comparatively less significant.
The second problem is interface flexibility. Traditional networking architectures place all protocol processing in the kernel and thus cannot easily support new protocols or new message send/receive interfaces. It is also hard to incorporate application-specific information.
Contribution
U-Net has three major contributions.
1. Providing protection/isolation, abstraction, and multiplexing without the kernel. Getting the kernel out of the critical path removes much of the overhead for small messages, enabling low-latency, high-bandwidth communication even with small messages.
2. Providing flexibility for protocol design and integration, enabling the system to take advantage of application-specific information.
3. U-Net is a full system that works on widely available standard workstations using off-the-shelf communication hardware.
According to the authors, the major contribution of the paper is the proposal of U-Net itself, a simple user-level communication architecture independent of the network interface hardware. The paper also describes two high-performance implementations on standard workstations and evaluates their performance characteristics for communication in parallel programs as well as for traditional protocols (von Eicken et al., 1995).
The “true zero copy” architecture is also beneficial: data can be sent directly out of the application data structures without intermediate buffering, and the NI can transfer arriving data directly into user-level data structures as well.
Evaluation
U-Net was implemented on SPARCstations running SunOS 4.1.3 and using two generations of Fore Systems ATM interfaces, SBA-100 and SBA-200. The authors modified the SBA-200 firmware to add a new U-Net compatible interface. Then the paper also compared U-Net ATM with CM-5 and Meiko running the Split-C application benchmarks. Finally, the paper evaluated TCP/IP and UDP/IP implementations on U-Net and benchmarked U-Net with different message sizes where the U-Net excelled.
Confusion
1. How does U-Net provide protection/isolation (enforcing protection boundaries) during multiplexing and demultiplexing?
2. As a question discussed in our reading group, how does U-Net deal with fair multiplexing among applications?
Posted by: Yunhe Liu | February 28, 2017 01:58 AM
1. summary
The paper argues that the network stack should be removed from the kernel and that a user-level access interface should be provided for efficiency and flexibility. The advantage is that by removing the overhead of copying between kernel and user space, network latency decreases, and users are able to customize their own network protocol implementations, which seems related to SDN and NFV nowadays.
2. Problem
With the high-speed LANs of the time, passing through the network stack in the kernel and copying data became the bottleneck of cross-machine communication. To some extent, this looks like an exokernel. What is hard is that they need to clearly divide responsibility between user level and the network interface, that is, providing suitable abstractions, supporting legacy protocols, and providing protection and isolation without the kernel's assistance.
3. Contributions
a). Putting network mux/demux into the network interface (NI), and placing all other processing (which requires more information) and buffer management into user space.
b). Virtualizing the network interface so that every process has the illusion of holding an entire NI.
c). The architectural design of the U-Net endpoint (receive queue, send queue, free queue) and the way the kernel emulates endpoints for initialization or legacy use.
d). The architecture supports both traditional protocols such as TCP and UDP, and others such as Active Messages.
4. Evaluation
They use four micro-benchmarks to compare the performance of U-Net Active Messages (UAM) to raw U-Net. They also use Split-C application benchmarks to evaluate the application-level performance gain with U-Net underneath. The performance of U-Net matched the Meiko but was lower than the CM-5 for small messages. It also has far better bandwidth than traditional UDP and TCP. But what about the baseline: does the design reach the maximum attainable once kernel overhead is removed?
5. Confusion
a). Is it always a good idea to move complicated work to user level? Yes, microkernels are good, but they are not mainstream. U-Net may be used in datacenters today with high-speed networks. So what makes it prevalent, and why did some other similar ideas not thrive?
Posted by: Jing Liu | February 28, 2017 01:50 AM
1. summary
This paper is about the U-Net system, which is designed to remove many of the kernel overheads in network traffic.
2. Problem
Network interfaces can be rather inflexible because all traffic flows through the kernel. Adding new protocols to this model requires modifications to the OS itself, which is not always feasible. The kernel middleman not only restricts protocol flexibility but also adds overheads that increase latency across the network. With the majority of network traffic shown to be small packets, this overhead is a significant component of normal network latency. The overhead can also prevent full utilization of the available network bandwidth.
3. Contributions
The U-Net system introduced removes many of the kernel overheads by allowing the user level and network interface hardware to directly interact. An application allocates an endpoint which contains an area of memory called the communication segment as well as several queues. Sent and received data is placed in the communication segment and pointed to from the send or receive queues respectively. When the network interface sees that a send queue is not empty it can access the segment directly and send the data. Similarly, incoming data is tagged with an identifier that allows the network interface to place it into the proper program’s endpoint. This system allows new protocols to be easily implemented as they need only place data into the queues with the proper formatting for the new protocol desired.
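The send side described above, where the NI notices a non-empty send queue and transmits straight from the communication segment, can be sketched as follows. The names and the list standing in for the physical link are my own illustrative choices, not U-Net's actual interface:

```python
from collections import deque

class Endpoint:
    def __init__(self):
        self.segment = bytearray(1024)   # communication segment
        self.send_queue = deque()        # (offset, length) descriptors

def ni_poll_send(ep, wire):
    """Sketch of the NI's send loop: while the send queue is non-empty, read
    the message straight out of the communication segment (no kernel copy)
    and put it on the wire."""
    while ep.send_queue:
        off, length = ep.send_queue.popleft()
        wire.append(bytes(ep.segment[off:off + length]))

ep = Endpoint()
ep.segment[0:4] = b"data"            # application composes the message
ep.send_queue.append((0, 4))         # then posts a descriptor
wire = []
ni_poll_send(ep, wire)
assert wire == [b"data"]
```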
4. Evaluation
They evaluated the U-Net system across a variety of systems and conditions. Most importantly, they were able to show that U-Net consistently performed better for small packet sizes. When standard protocols, such as TCP and UDP, were implemented on top of it, they could show that small packets, which they were trying to optimize, had better latencies than previously. They were also able to see TCP traffic fully utilize the full bandwidth that was available.
5. Confusion
How are the interfaces fairly multiplexed?
Posted by: Taylor Johnston | February 28, 2017 01:14 AM
1. Summary
Most operating systems implement the network stack in the kernel. As a result, customizations and new implementations of network protocols require kernel modifications. This paper proposes a user-mode network interface that offers flexibility and performance benefits over the traditional network stack.
2. Problem
Network stacks implemented in the kernel have several disadvantages: lack of flexibility, the overhead of processing headers in the kernel, system call overhead, the complexity of implementing new protocols, and more. The paper claims to solve these problems via a user-mode network stack.
3. Contributions
The user-mode network interface (U-Net) proposed in the paper provides multiplexing and demultiplexing of the network among processes, protection from interference by other processes, a flexible programming interface to the network, etc. The main focus of the U-Net architecture is performance gains for low-latency messaging. The existing kernel-level network interface suffers from significant overhead in header processing and in copying buffers into and out of kernel space. Kernel-level isolation also prevents programmers from exploiting some minor optimizations, such as retransmitting an already-sent message without repeating the entire stack (managed via a reference count), or transmitting data buffers with a true zero-copy mechanism. Moreover, the complexity of implementing a novel protocol in the kernel further discourages many programmers from implementing novel protocols from the ground up. U-Net provides a virtualized view of the network interface to each process, simulating isolated access to the network interface for every process, and multiplexes/demultiplexes network packets between the NIC and each process. U-Net provides a network interface that can easily be implemented in user space, and offers plenty of customization of network protocols to enable application-specific optimizations. The paper describes three variants of U-Net: base-level U-Net, kernel emulation of U-Net, and the direct-access U-Net architecture.
4. Evaluation
The authors describe the performance gains of two U-Net implementations, one based on the SBA-100 and one on the SBA-200, and perform an extensive evaluation of the U-Net network interface. The paper details U-Net Active Messages, which offer reliability and flow control, and compares the performance of U-Net Active Messages with the raw U-Net interface. The Split-C application benchmarks compare the performance of the CM-5, Meiko CS-2, and U-Net ATM on 7 programs. The paper also implements commonly used protocols like TCP and UDP on the U-Net interface and reports their performance gains.
5. Confusion
Why is the overhead of a system call more than that of invoking the user-level network interface?
Posted by: Rohit Damkondwar | February 28, 2017 01:11 AM
1. Summary
Along with higher network performance, many application domains can benefit from a higher degree of flexibility in the interface to the network. The U-Net communication architecture removes the kernel from the critical path of communication without compromising safety. It provides the user with a virtual view of the interface and allows the construction of protocols according to the custom needs of the process.
2. Problem
Though network bandwidth has increased, the bottleneck of per-message overhead persists, and it had largely been neglected. It arises from the software path traversed by messages, which passes through the kernel on both the sender and the receiver end. One approach is to remove the kernel from the critical path and place the networking interface in user space. This raises other issues, such as multiplexing the network among different processes while still providing protection. With the kernel out of the path, one also has to think about managing the communication resources while keeping the design efficient.
3. Contributions
The paper presents a simple user-level communication architecture, agnostic to the network interface hardware, and explains high-performance implementations of this architecture. The U-Net architecture is built from three building blocks – endpoints, communication segments and message queues.
Endpoints are the gateway for processes to access network communication. Communication segment acts like a scratchpad where the process composes data and then pushes the corresponding message descriptor into the send queue. Three message queues – send, receive and free are maintained with each endpoint. A tag is used to determine the destination endpoint of the message. Communication channels together with endpoints provide an effective protection mechanism while providing efficient communication.
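The endpoint structure described above can be sketched roughly as follows. This is an illustrative Python model, not the paper's actual C interface; the sizes, names, and descriptor format are invented for clarity:

```python
# Toy model of a U-Net endpoint: a pinned communication segment plus
# send, receive, and free queues holding small descriptors that point
# into the segment (rather than carrying the data itself).
from collections import deque

class Endpoint:
    def __init__(self, segment_size=4096, buf_size=256):
        self.segment = bytearray(segment_size)          # scratchpad buffer area
        self.send_q = deque()                           # descriptors to transmit
        self.recv_q = deque()                           # descriptors of arrived messages
        # free queue: offsets of buffers the NIC may fill with incoming data
        self.free_q = deque(range(0, segment_size, buf_size))

    def send(self, data: bytes, dest_tag: int):
        off = self.free_q.popleft()                     # grab a free buffer
        self.segment[off:off + len(data)] = data        # compose message in segment
        self.send_q.append((dest_tag, off, len(data)))  # push descriptor, not data

    def poll_recv(self):
        return self.recv_q.popleft() if self.recv_q else None
```

The key point the model captures is that the process and the NIC communicate only through the queues and the segment, so no kernel call is needed per message.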
The base-level U-Net architecture implements a queue-based interface to the network, where intermediate data is staged in the communication segment. This base-level architecture is achieved by pinning the communication segments to physical memory. The paper also describes a true zero-copy mechanism in the form of Direct-Access U-Net, which avoids intermediate buffering. To handle the scarcity of memory resources such as communication segments and queues while still providing a common interface to all applications, kernel-emulated endpoints are provided.
4. Evaluation
The paper presents and evaluates two implementations of U-Net – one on the SBA-100 and the other on the SBA-200. U-Net round-trip time is evaluated starting with a simple ping-pong benchmark. A U-Net Active Messages mechanism is implemented and its performance is measured using four different micro-benchmarks. The authors then establish the superiority of U-Net Active Messages by running Split-C application benchmarks and comparing against the CM-5 and Meiko CS-2. The paper also demonstrates the improved performance of TCP and UDP implemented over U-Net.
5. Confusion
1. Can you please explain how tags help in multiplexing and de-multiplexing messages? The explanation in the paper seems incomprehensible.
2. There are many similarities between U-Net and Arrakis. How do they fare against each other?
Posted by: Sharath Hiremath | February 28, 2017 01:08 AM
Summary
This paper presents U-Net, a communication architecture, that enables efficient low-latency communication and also provides virtual view of network interface to user applications/processes. The focus lies in avoiding kernel intervention in the communication path along with no compromise on security.
Problem
The bottleneck in communication has shifted from the hardware to the software layers traversed during message communication (kernel intervention). This is a result of advancements in communication hardware, and it results in high end-to-end latencies. Though throughput can be saturated for large message sizes, this is not possible for small message sizes, which are becoming the common case (e.g., RPC). Another aspect is enabling user-defined protocols to suit the needs of applications.
Contributions
The main contributions of the paper are to provide low-latency communication in LAN; simple user-level communication architecture independent of network interface hardware; Utilize full network bandwidth with small messages and facilitate user-level communication protocols.
Each application is given the illusion of having a private network interface through virtualization of the network interface hardware with the support of the operating system. This is achieved by providing U-Net endpoints for communication that obviate the need for the kernel during message transfer and hence yield lower latency. But this requires hardware support (an MMU on the interface), so U-Net also provides emulated endpoints which are serviced by the kernel. As expected, their performance cannot match regular endpoints.
For multiplexing and demultiplexing packets (which was earlier done in kernel), a tag is used on each incoming and outgoing messages. In ATM networks, ATM virtual channel identifiers are used. Using this tag, incoming packets are directly routed to the process.
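As a rough model of this tag-based routing (hypothetical names; in reality the demultiplexing happens in NIC hardware/firmware, and the routing table is populated only via kernel-mediated channel setup):

```python
# Toy demultiplexer: a table, set up at channel-creation time, maps each
# tag (e.g. an ATM virtual channel identifier) to the owning endpoint's
# receive queue. The per-packet data path never enters the kernel.
route_table = {}   # tag -> receive queue, filled at (kernel-checked) setup

def register_channel(tag, recv_queue):
    # In U-Net this step goes through the OS, which authenticates the
    # process and assigns the tag; here we just record the mapping.
    route_table[tag] = recv_queue

def demux(tag, payload):
    q = route_table.get(tag)
    if q is None:
        return False          # unknown tag: drop the packet
    q.append(payload)         # delivered straight to the process's queue
    return True
```

Because the tag is validated once at setup, the fast path is a single table lookup, which is what makes safe user-level delivery cheap.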
U-Net provides two types of support: base-level and Direct-Access. Base-level is called zero-copy but requires one intermediate copy into the networking buffer (not zero after all). Direct-Access supports true zero copy, in which messages are transmitted and received directly from and to application data structures. The U-Net support in the paper is base-level, as direct access requires memory-mapping support that the hardware has to provide.
Evaluation
Two implementations were done, on the SBA-100 and the SBA-200. The authors implemented UAM (U-Net Active Messages), which could almost reach raw U-Net performance despite adding reliable communication and removing the restriction on segment size. The authors also compare U-Net against supercomputers: Split-C programs are run on U-Net Active Messages, and the authors demonstrate that U-Net, on a lower-end configuration than the supercomputers, is competitive. An important point is the implementation of TCP and UDP on top of U-Net; the end-to-end latency of TCP and UDP over U-Net is almost an order of magnitude better than Fore TCP. This shows that U-Net's optimizations provide backward compatibility along with the improvements.
Confusion/Question
The U-Net-style approach seems a viable way to avoid software overheads. Why are such architectures not in wider use now? Has the bottleneck shifted back to hardware?
Posted by: Pradeep Kashyap Ramaswamy | February 28, 2017 12:54 AM
Summary
The paper describes the U-Net architecture which allows user level processes to directly access the network. This improves the overall latency of the network by taking the kernel out of the picture thus reducing the processing overhead.
Problem
There has been a lot of improvement in the network fabric technology. This should logically lead to better latencies, but due to lack of any such improvements in processing overhead this was not observed especially for small messages in which case processing overhead is the dominant factor compared to the transmission time. This overhead is due to checksumming, flow-control handling, interrupt overhead, buffer management, message copies etc.
Another problem was that, since all the networking protocol code is part of the kernel, it is not feasible for users to have custom protocols suited to the application.
Contribution
1. Messages are sent and received by storing them in a buffer area called communication segment in the address space of the process. The send queue, receive queue and free queue are used to point to the message in case of sending or receiving. A separate buffer in the network interface is not needed.
2. Multiplexing and demultiplexing messages. U-Net must manage all incoming and outgoing network packets through the network card while delivering them process-to-process. For this purpose, U-Net associates a tag with each message; the tag identifies the process to which the message is to be delivered.
3. A "true zero copy" architecture is attempted, where data is sent directly out of application-level buffers without intermediate buffering steps, and vice versa for incoming data. Due to a lack of hardware support for the direct I/O needed, the U-Net implementations tested in the paper still required data to be placed in designated buffers, which makes them more of a one-copy architecture.
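The copy-count difference between the two architectures can be illustrated with a toy sketch. The function names are invented, and a real NIC would DMA rather than copy in software; the point is only where the NIC reads from and how many staging copies the CPU makes:

```python
# Base-level U-Net: the application stages its data in the pinned
# communication segment (one copy), and the NIC transmits from there.
def base_level_send(app_buf: bytes, segment: bytearray):
    segment[:len(app_buf)] = app_buf   # the single staging copy
    return "segment", 1                # NIC reads the segment; 1 CPU copy

# Direct-Access U-Net: the NIC reads the application buffer in place,
# so no staging copy is needed (requires hardware address translation).
def direct_access_send(app_buf: bytes):
    return "app_buf", 0                # NIC reads app memory; 0 CPU copies
```

The hardware-support requirement in the second case is exactly why the paper's implementations stop at base-level.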
Evaluation
The authors evaluated U-Net on systems running SunOS 4.1.3 using Fore SBA-100 and SBA-200 interfaces. They presented a large number of latency and bandwidth measurements on both interfaces, with U-Net generally performing faster. Split-C benchmarks were also used to highlight U-Net's performance under more realistic workloads.
Confusion
Referring to section 3.6 in the paper: what is the logic behind the sender specifying the offset in the receiver’s address space? Couldn’t this be decided at the receiver?
Is U-Net being used in any current commercial system?
Posted by: Mayur Cherukuri | February 28, 2017 12:38 AM
1. Summary
This paper proposes U-Net, a network interface that provides processes with a virtual view of its own network interface to remove kernel from the critical path of network communication to improve performance and increase flexibility.
2. Problem
The availability of high speed local networks has moved the bottleneck from limited network bandwidth to software, especially considering that traditional Unix networking architecture involves several copies and crosses multiple abstraction layers. Also, small messages are becoming increasingly important for many applications and the per message processing overhead caused by kernel interaction significantly degrades performance.
3. Contributions
This paper proposes U-Net, which virtualizes the network interface for each process and moves protocol implementation to user space. U-Net removes kernel interaction from the path of sending and receiving messages, thus reducing the per-message overhead that is critical for small messages. Implementing protocols in user space also increases flexibility, allowing application-specific protocol optimizations to improve performance. Protection is assured through kernel control of channel setup and teardown. Specifically, in U-Net, each process is assigned an endpoint to buffer incoming and outgoing messages and related data, and a tag is attached to each incoming message to determine its destination endpoint. Endpoints can be multiplexed among different processes to save resources.
The authors implement a U-Net prototype on the SBA-200, with modified firmware to make it U-Net compatible.
4. Evaluation
The first part of evaluation compares U-Net’s performance on workstation cluster with supercomputers including CM-5 and Meiko, using split-C benchmarks, and demonstrate that U-Net based workstation cluster can rival supercomputers in terms of performance.
The second part compares U-Net based UDP and TCP protocols on a proof of concept implementation with conventional kernelized TCP and UDP. Results show that U-Net based UDP and TCP outperform kernelized implementation in terms of throughput, especially for small message sizes.
5. Confusion
What are UAM (U-Net Active Messages) and GAM (Generic Active Messages)? What are they used for?
Posted by: Yanqi Zhang | February 28, 2017 12:21 AM
1. Summary
The paper talks about U-Net, a network interface to facilitate user level access to high-speed communication devices thus removing kernel from the communication path and still ensuring full protection.
2. Problem
With the advent of high speed LANs, the communication path traversed by messages became the bottleneck. In fact, the path involved multiple levels of abstraction between device driver and user application. What this also means is that any upgrade to faster network did not necessarily come with application speed-up. There was thus a need to address this limitation.
3. Contributions
The important contribution is an architecture design that reduces the processing overhead required to send and receive messages while also providing flexible access. The design focuses on low latency and high bandwidth using small messages. The user-level network interface gives every process the illusion of owning the interface to the network. To do this, U-Net primarily considers multiplexing access to the actual network interface while simultaneously enforcing protection boundaries. There are three main building blocks - endpoints, which serve as an application’s handle into the network, communication segments, which are memory regions that hold message data, and message queues, which hold descriptors for messages to be sent/received. The endpoints and communication channels together allow protection boundaries to be enforced among processes accessing the network.
4. Evaluation
U-Net is implemented on SPARCstations using ATM interfaces from two generations. With an 8-node ATM cluster, U-Net provides 65-microsecond round-trip latency and 15 MB/s bandwidth. It also achieves TCP performance at maximum network bandwidth, and on Split-C benchmarks it performs roughly the same as the Meiko and TMC supercomputers. This demonstrates that carefully designing the communication interface lets a workstation cluster rival specially designed supercomputers.
5. Confusion
In today’s context how commonly do we see network interfaces that avoid going through the kernel ? Additionally, can’t there be applications wherein abstractions provided by kernel become important ?
Posted by: Dastagiri Reddy Malikireddy | February 27, 2017 11:39 PM
1. Summary
The paper describes a network interface through which high speed network devices can be used at full capacity by removing the kernel from the communication path. Low latency and high network bandwidth are achieved even for small messages. Protocol processing is moved from the kernel to the application allowing use of new protocols and application-specific knowledge.
2. Problem
As users upgraded their hardware to make use of faster networks, they failed to observe performance gains in proportion to the improvement offered by hardware. This was due to the high overheads involved with messages passing through the kernel. This overhead increased latency for small messages which is crucial for techniques such as remote procedure calls. Also, due to the kernel handling protocol processing, it was difficult to support new protocols and make use of application specific knowledge.
3. Contributions
The paper motivates the problem well by describing in detail the inefficiencies of network interfaces at the time. The proposed network interface architecture virtualizes the network interface and provides direct access to applications while still providing full protection. Message data is held in memory regions called communication segments. Send, receive and free message queues hold descriptors that U-Net uses to transfer data between the application and the network. Communication segments and queues together form an endpoint. U-Net uses a tag in each message to determine its endpoint. U-Net can support data transfer to user-level data structures without intermediate buffering. TCP and UDP are implemented using U-Net, achieving latencies and throughput close to the maximum. Active Messages are used to show that U-Net is suitable for communication in parallel languages and runtimes as well as for bulk data transfers.
4. Evaluation
This paper goes into excruciating details of the hardware and this makes it a difficult read. It is not clear what ideas can be used widely, and what techniques were forced because of the quirks of the hardware used for evaluation. Apart from these issues, the evaluation seems fair as the authors measure round-trip latency and bandwidth for a range of message sizes. Benchmarks in a parallel language named Split-C show the advantages of U-Net with active messages for bulk transfers.
5. Confusion
Scalability of memory usage by applications using U-Net is stated to be an open problem. Was there any subsequent work that addressed this successfully?
Posted by: Suhas Pai | February 27, 2017 11:37 PM
1) Summary
With increasing network speeds, the OS is increasingly becoming the bottleneck in modern networking. Multiplexing/Demultiplexing messages and message delivery comprise a large part of the overhead. Additionally, current networking stacks are not flexible for use with custom protocols. The authors propose a user-mode network design which moves the kernel off the critical path of networking and reduces overheads.
2) Problem
Network traffic typically goes through the kernel. The kernel must demultiplex traffic, deliver an interrupt, and copy messages from network buffers. Altogether the overhead of this networking stack is beginning to dominate the latency of the network itself, and it has become a bottleneck in networked computing and online services.
At the same time, the traditional network stack does not offer flexibility to developers who wish to develop and use their own protocols, rather than built-in TCP/IP stacks. This leads to tedious implementation of fast custom protocols.
3) Contributions
The authors propose removing the kernel from the critical path of the networking stack. Instead, the NIC places packets directly into a process's address space, reducing multiplexing and copy times. Their proposal includes both an ideal case (direct-access u-net) and an implementable case using available hardware (base-level u-net).
They prototype their design to show feasibility and performance potential using two commercially available NICs. Moreover, they analyze the reduction in latency and propose a rewritten firmware for the NIC which improves performance.
4) Evaluation
The authors' idea seems ahead of its time. A similar idea was later proposed by Arrakis (user-mode device/network control). The authors do a good job of motivating user-mode networking stacks. They propose a design that looks very much like how a modern VNIC might be used in Arrakis. Their extensive analysis and prototype implementation illustrate the reductions in latency from moving the kernel out of the critical path of the networking stack.
However, the paper itself is a bit difficult to read. A number of obsolete acronyms and jargon are used without definition which make the paper a bit difficult to follow in places. Also, design choices are mixed with details of the hardware, making some sections needlessly long-winded and obscuring important choices with choices forced by available hardware.
5) Confusion
What interface exactly does the hardware provide? It would have been nice if they defined their assumptions about the hardware, as Arrakis did.
Posted by: Mark Mansi | February 27, 2017 11:15 PM
1.Summary
The U-Net architecture moves the network stack out of the kernel into user space, eliminating considerable overhead in processing sent and received packets. By placing the network protocols in user space, new and application-specific network protocols can be implemented easily and efficiently. U-Net is optimized both for providing full network bandwidth with small messages and for low-latency communication in local area networks.
2.Problem
With the advent of high-speed local area networks, network communication is limited by the time taken to process send and receive packets, which includes buffer management, message copies, checksumming, flow-control handling, interrupt overhead, and controlling the network interface.
In case of communicating using small messages, the processing overhead dominates the improvement in the transmission time.
3.Contributions
The approach used in the U-Net architecture is to multiplex the network interface among all processes accessing the network, so that each application has direct access (without kernel intervention) to its own virtual network interface.
Endpoints act as a process’s handle for network communication. An endpoint consists of a communication segment along with send, receive and free queues, which store descriptors for the messages to be sent and received. A message to be sent is composed in the communication segment and its descriptor is pushed onto the send queue, from which the network interface picks it up for transmission.
The OS service sets up a communication channel between the endpoints after performing necessary authentication and authorization checks. Each end point is given a tag which is registered with the U-Net and each communication channel is given an identifier which is returned to the application requesting it.
U-Net supports traditional network protocols as well as novel communication abstractions like Active Messages.
4.Evaluation
U-Net was implemented on the SBA-100 interface card initially, followed by an implementation on the SBA-200. The initially poor performance on the SBA-200 was pinned down to the complexity of the kernel-firmware interface. TCP/IP and UDP/IP were also implemented on U-Net, and the U-Net versions perform much better than the traditional implementations of TCP and UDP.
5. Confusion
In the section 3.5, they talk about kernel emulation of U-Net. Wouldn’t it add to the processing overhead?
Posted by: Sowrabha Horatti Gopal | February 27, 2017 11:09 PM
1. Summary
This paper put forth a new communication model called U-Net that allows for user-level access to the communication network. This allows for significant improvement in bandwidth utilization over kernel based communication and also allows for more flexibility in programming.
2. Problem
With increasing speed of local area networks the main source of slowdown now is on the software side. Communication through the kernel introduces severe overheads and speedups in network speed are actually brought down by this overhead.
3. Contributions
The purpose of U-Net is to remove much of the kernel interaction and provide user-level access to the network. U-Net also gives the application much more flexibility in how it utilizes the network. The main abstraction in U-Net is the endpoint, which is the handle into the network. Each endpoint contains memory regions, called communication segments, that hold the message data, along with message queues. These endpoints can connect directly to the network interface or, to save interface resources, can be emulated with the kernel serving as the actual endpoint.
4. Evaluation
The paper uses two different implementations on ATM interfaces but focuses more on the SBA-200, which provides more hardware support for U-Net. The paper modifies the firmware of the SBA-200 to add the U-Net interface. U-Net is able to hit a peak bandwidth very close to the theoretical maximum, proving the design's usefulness in removing the software overhead. To demonstrate compatibility, the paper also builds the conventional TCP and UDP protocols on top of U-Net, showing a large gain in bandwidth over kernel-based protocols. However, the paper never examines the performance of emulated U-Net, leaving open the possibility that the emulated version might perform worse than typical kernel-based protocols.
5. Confusion
I was confused by the Active Messages section and how U-Net applies it to their system.
Posted by: Brian Guttag | February 27, 2017 10:14 PM
1. Summary
This paper presents U-net, a system that allows user-level access to NICs. By giving applications direct hardware access, the software overhead in the kernel is avoided, and applications with little data to send can achieve much better throughput. Communication protocols can also be implemented much more easily in userspace.
2. Problem
The availability of very high-speed network fabric ensures that the network is no longer the bottleneck for local-area communication. However, since the Unix network stack is part of the kernel, every packet traverses kernel space before being sent onto the network. This additional overhead is fine for throughput-hungry, latency-insensitive applications with large amounts of data to send. However, for applications that send only one or a few packets, e.g., RPCs, the kernel network stack becomes the performance bottleneck, so the average bandwidth available to applications is lower than what the network provides. The stack needs to be floated up to user level to avoid double copying of data and expensive kernel processing.
Secondly, by placing all the protocol processing inside the kernel the innovation becomes difficult because of the difficulty in changing the kernel.
3. Contributions
U-net proposes that the entire network stack should be in the user space and the applications should have direct user level access to the networking hardware. This allows the applications to get the full network bandwidth even with small data transfers. U-net architecture also allows a multitude of communication protocols to be built easily since now they are deployed at user level. The basic building blocks of U-net include application end points and a communications segment. The application end point has send and receive queues associated with it. When sending the data, the user process places the data in the communication segment and puts the reference to that data in the send queue. The NIC then picks that reference from the send queue and sends the data onto the network. Similarly, at receive time, the incoming data is placed in the communication segment and the reference to that data is pushed to the receive queue and the application can then read the data by accessing the queue to get the message address. U-net also supports true zero copy in which the data from the sending application isn't even copied into the networking buffer. It is just copied from the sending process's address space to the receiver's address space.
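The descriptor flow described above can be mimicked in a toy model. This is an assumed simplification in which the "wire" is in-process and the NIC is a single function; all names and data layouts are illustrative, not U-net's real interface:

```python
# One NIC step: drain a descriptor from a sender's send queue,
# demultiplex by tag, land the payload in a free buffer of the
# destination endpoint's communication segment, and post a
# descriptor to its receive queue for the application to poll.
def nic_step(send_q, segments, endpoints):
    tag, data = send_q.pop(0)                  # pick up next send descriptor
    dst = endpoints[tag]                       # demultiplex by tag
    off = dst["free_q"].pop(0)                 # take a free receive buffer
    segments[tag][off:off + len(data)] = data  # land payload in user memory
    dst["recv_q"].append((off, len(data)))     # application polls this queue
```

Note that the application only ever touches queues and the segment; the NIC does all the moving, which is the sense in which the kernel is off the data path.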
4. Evaluation
U-net was implemented on the Fore SBA-100 and SBA-200. The authors evaluated U-net using different communication protocols, including Active Messages, Split-C and TCP/IP. The ping-pong benchmark using Active Messages shows that the bandwidth achievable with U-net is not that impressive for message sizes below 2 kilobytes; for larger message sizes, it is close to the AAL-5 theoretical limit. The TCP/IP evaluation shows that the bandwidth achieved through U-net is almost double that of Fore TCP for all message sizes, and the same holds for UDP.
5. Confusion
How does U-net ensure fairness among different sending processes? Is that done by the NIC while dequeuing messages from the send queues? Since a queue entry only holds a reference, if the NIC does round-robin dequeuing and one queue has much more data to send, will it be fair?
Posted by: Hasnain Ali Pirzada | February 27, 2017 09:29 PM
1. Summary
U-Net highlights the increased availability of high-speed networks and the heavy reliance on short round-trip times as the common case. Specifically, per-message processing overhead is becoming the dominant component of message latency. U-Net offers a new model that avoids the kernel and provides a virtualized network device to each process, demonstrating significant performance improvements.
2. Problem
As high speed networks become the norm, the kernel is becoming a major bottleneck in processing overhead. How do we alleviate the processing overhead issue in high speed networks?
3. Contribution
U-Net is a well-thought-out architecture for removing the kernel from the critical path of communication, and the paper discusses many of the caveats of doing this. The authors outline how to virtualize the NIC much as CPUs are virtualized, and they created a prototype with very promising numbers. After the kernel sets up the communication channel, the endpoints require no kernel intervention unless they are being emulated by the kernel due to the limited number of true physical endpoints. Communication is done through a communication segment and send/receive queues.
Unfortunately, the idea was ahead of its time, as the SBA-200 experience demonstrated. Suffice it to say that U-Net is an idea that stood the test of time, being furthered by IX from Stanford and Arrakis from the University of Washington.
Ultimately, this work demonstrates the tension between the kernel and performance. I think the discussion regarding zero-copy was the most interesting and shows how important it is to understand the critical path. Lastly, true zero-copy is difficult because accessing the entire address space requires handling page faults and other issues.
4. Evaluation
They evaluated this using the SBA-100 and SBA-200 and compared against supercomputer offerings of the time. The SBA-100 suffered due to its simplicity, requiring emulation of U-Net endpoints in the kernel. The SBA-200 suffered due to its complex kernel-firmware interface; the authors also observed off-loading backfiring here. A complete redesign of the firmware provided a significant improvement in the results.
5. Discussion
It’s interesting to see how improvements to modern hardware evolved the ideas here into Arrakis. How did the idea of virtual functions come about; moreover, how do companies decide that virtual functions are a worthwhile investment that will catch on? At what level should hardware and software be co-designed versus abstracted?
Posted by: Dennis Zhou | February 27, 2017 09:24 PM
1. Summary
The paper describes an architecture to enable user processes to directly access the network and remove the kernel from the critical path. It does this by presenting a virtualized network interface (NI) to the user-level processes, involving the kernel only for channel setup and tear down, and uses channel tags for multiplexing and demultiplexing traffic.
2. Problem
Most applications are unable to reap the benefits of faster networking hardware because of the overheads of the software path in the OS kernel. These overheads are even more significant for communication needing low latencies, whose demand is only going to rise with the increased use of techniques such as RPC and distributed shared memory. Another issue with having the kernel handle all the packet processing code is that new, more efficient protocols such as active messages are hard to integrate. The authors argue that the entire protocol stack should be placed at user level, and kernel be removed from the critical path. Existing systems such as CM-5, Meiko CS-2, IBM SP-2 also allow user-level access to network, but they have a custom network interface, and they limit the degree of multiprogramming.
3. Contributions
In order to provide direct network access to user space processes, UNet makes systematic changes to support functionality traditionally supported by the kernel:
1. Sending & receiving messages: Done through the virtual NI abstraction called endpoint, which allows buffer management at user-level. Process writes data to segments and updates descriptor in endpoint. Physical NI will pick up data or leave it to apply backpressure. UNet supports upcalls or polling for receiving messages.
2. Multiplexing / demultiplexing: Process receives tag during channel creation, which is done through the kernel. NI uses tag for demultiplexing. OS also assists in route discovery, path setup, and authentication checks are also performed.
3. Managing limited communication resources: Through kernel emulation of the virtual NI. Segments have to point to pinned pages to avoid an MMU in the NI, and thus are scarce resources. Emulation provides consistency and helps with resource preservation.
UNet also provides the following features
1. Zero copy network transfer: This is essential for reducing the latency.
2. Optimizations: Hold message data in descriptors instead of pointers for small messages. This avoids buffer management overheads and can reduce RTT latency substantially.
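Optimization 2 above — carrying small payloads inside the descriptor rather than behind a pointer — might look like this in sketch form. `INLINE_MAX` and all names are invented for illustration; the actual threshold in U-Net is tied to the queue-entry size:

```python
# Small messages ride inline in the descriptor, skipping the
# communication-segment buffer (and its management) entirely;
# larger messages fall back to the usual offset/length descriptor.
INLINE_MAX = 40   # e.g. what fits in a queue entry next to the header

def make_descriptor(tag, data: bytes, segment, free_q):
    if len(data) <= INLINE_MAX:
        return {"tag": tag, "inline": data}       # no buffer management at all
    off = free_q.pop(0)                           # stage larger data in segment
    segment[off:off + len(data)] = data
    return {"tag": tag, "offset": off, "length": len(data)}
```

Since single-cell round trips dominate RTT-sensitive workloads, avoiding the buffer round-trip for exactly those messages is where the latency win comes from.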
4. Evaluation
UNet was implemented on SPARC workstations with two generations of Fore Systems ATM interfaces. The Fore SBA-100 forces the use of emulated UNet because of hardware limitations; it achieves a single-cell message RTT of 66 us and a bandwidth of 6.8 MB/s, which is less than the theoretical limit of 17.5 MB/s because of the overhead of CRC computation in software. The Fore SBA-200 achieves an RTT of 65 us and saturates the bandwidth with 800-byte packets, after a rewrite of the firmware. Split-C benchmarks compare UNet with existing systems that have kernel-bypass support: UNet ATM is roughly equal to the Meiko CS-2, worse than the CM-5 for small messages, and better for bulk transfers. Finally, the network performance of TCP and UDP over UNet is compared against the regular Fore ATM protocols.
5. Confusion
1. The idea of kernel bypass for fast network I/O has again become popular in recent years. Why did the idea lie dormant for 10-15 years? Was there a systemic drive to reduce OS overheads that led to this idea being dropped?
2. U-Net seems to be worse than existing systems such as the Meiko CS-2 or CM-5 for certain benchmarks. Do those systems have any interesting design aspects that enable them to perform better, or is it just different hardware?
Posted by: Karan Bavishi | February 27, 2017 09:22 PM
1. summary
This paper presents U-Net, a user-level network interface. The main point is to remove the kernel from the message path and let user applications interact with the network interface directly.
2. Problem
There are two main problems to solve. The first is overhead. With high-speed local area networks, limited bandwidth is no longer the bottleneck; instead, the software path traversed by messages matters. In traditional Unix systems, messages go through several copies and cross multiple layers of abstraction between the driver and the user application. This overhead limits the peak communication bandwidth, and most applications rely heavily on quick round-trip requests and replies. The other problem is flexibility. Placing all protocol processing in the kernel makes it hard to support new protocols and interfaces, which limits the benefits of integrating application-specific information into protocol processing.
3. Contributions
This paper is the first work to present a full system that runs directly on off-the-shelf hardware and supports traditional network protocols and parallel language implementations. The authors point out that the entire protocol stack should be placed at user level and that the operating system and hardware should allow protected user-level access directly to the network. The authors solve four technical challenges: multiplexing the network among processes, providing protection, managing limited communication resources without a kernel path, and designing an efficient yet versatile programming interface to the network. Besides the U-Net communication architecture, the paper gives two high-performance implementations on standard workstations and evaluates their performance characteristics for communication in parallel programs.
4. Evaluation
The authors evaluate the U-Net architecture using Split-C application benchmarks; Split-C is a parallel extension of C. A comparison is made among the U-Net ATM cluster, the CM-5, and the Meiko CS-2. The benchmark suite consists of seven test cases. The results show that, in some settings, U-Net has performance equivalent to the CS-2/CM-5 or even outperforms them.
5. Confusion
The authors say U-Net is OS-kernel free, so why does it use OS services for multiplexing and demultiplexing messages?
Posted by: Huayu Zhang | February 27, 2017 08:38 PM
Summary:
The main goal of U-Net is to provide low-latency network communication and a high degree of flexibility. The U-Net communication architecture removes the kernel from the critical path while providing full protection, and gives processes user-level access to high-speed communication devices by virtualizing the network interface. The U-Net implementation does not require specialized hardware or a new OS.
Problem:
Due to the increased availability of high-speed local area networks, the bottleneck in local-area communication has shifted from the limited bandwidth of network fabrics to message-processing overheads in the software path. This is particularly a problem for small-message communication. Besides, implementing protocol stacks in the kernel makes it difficult to experiment with new protocols.
Contributions:
The U-Net architecture provides each process the illusion of owning the network interface. Communication segments are regions of memory used to hold message data, and the message queues (send, receive, and free queues) hold descriptors for messages that are to be sent or that have been received. Endpoints act as the application’s handle into the network. U-Net endpoints can be emulated in the kernel to multiplex communication segments and message queues, which are essentially scarce resources, although emulated endpoints do not provide the same performance benefits. The role of U-Net is limited to multiplexing the actual NI, enforcing protection boundaries, and limiting resource consumption. U-Net defines two architectures: “base-level U-Net,” which requires an intermediate copy into a networking buffer, and “direct-access U-Net,” which supports true zero copy without any intermediate buffering.
Evaluation:
U-Net was implemented on SPARCstations running SunOS 4.1.3, using two generations of Fore Systems ATM interfaces. Using U-Net, round-trip latency for small messages is about 65 µs. Raw U-Net is shown to achieve the theoretical peak bandwidth of the fiber with packet sizes as low as 800 bytes. The authors also implemented the TCP/IP and UDP/IP protocols on U-Net, which achieve latencies and throughput close to the raw maximum. Using Split-C benchmarks, U-Net is compared with the Meiko CS-2 and CM-5 supercomputers. The authors provide all sorts of evaluation details, but I found it hard to draw firm conclusions from this section.
Confusion:
Due to the unavailability of suitable hardware, they could not implement direct-access U-Net. Do you think the direct-access implementation can provide the claimed performance benefits? Are there any hidden costs to the strategy that the paper might be missing?
Posted by: Neha Mittal | February 27, 2017 07:26 PM
1. Summary
This paper introduces U-Net, a user-level network architecture. The key feature of U-Net is to avoid OS overhead when processes send and receive network messages.
2. Problem
The problem is that traditional OS design incurs overhead for network packet transmission. When a process issues a write() system call to send a network message, extra buffer copies and context switches happen: (1) the process may copy data into the buffer passed to write(..., buffer, ...) from elsewhere in its memory space; (2) the process traps into the kernel, which incurs one context switch; (3) the kernel copies the data from the user's buffer into the network interface card's buffer; (4) the kernel returns to the process, which incurs another context switch. If the process could copy data directly into the network interface card's buffer, all of the overhead in (2)-(4) could be avoided. In addition, low latency for small network packets was becoming important for distributed system implementations at the time: distributed file systems generally exchange a significant number of small messages (e.g., a client periodically fetches file attributes to validate its cache in NFS, and the server breaks a client's callback in AFS).
3. Contributions
The contribution of U-Net is to show how to virtualize the network interface card among processes, instead of letting the kernel control the network card. The basic design is a per-process endpoint (a communication segment plus send/recv/free queues). First, a process needs to get an endpoint from the OS, which is responsible for assigning it: setting up the communication segment, the send/recv/free queues, and channel identifiers (ATM VCIs in the paper, to realize routing). When the process wants to send a message, it puts the message into the communication segment and creates a descriptor (including a pointer to the message, its length, the destination, etc.) in the send queue. The network interface card notices the new descriptor in the process's send queue and sends out the message from the process's communication segment. The communication segment is effectively the network interface card's buffer, so U-Net eliminates the extra message copy from user process to kernel and the two context switches between process and kernel. When the network interface card receives a message, it first finds the destination process according to the header (the tag in the paper, containing the destination's ATM VCI). The card then grabs a descriptor from the destination process's free queue, puts the message into the process's communication segment, and creates a descriptor in the process's recv queue. The process is notified of the newly arrived message either by polling or by interrupt.
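The send/receive flow described above can be sketched as a small simulation. This is my own illustration, not the paper's code; the structure names, queue depths, and the fixed segment offset used by the sender are all assumptions.

```c
#include <stdint.h>
#include <string.h>

#define NSLOTS 8
#define SEGSZ  1024

struct desc { uint32_t off, len, tag; };

struct endpoint {
    uint8_t seg[SEGSZ];                       /* pinned communication segment */
    struct desc send_q[NSLOTS]; int send_n;   /* descriptors for outgoing msgs */
    struct desc recv_q[NSLOTS]; int recv_n;   /* descriptors for arrived msgs */
    struct desc free_q[NSLOTS]; int free_n;   /* empty buffers for the NI */
    uint32_t tag;                             /* channel tag assigned at creation */
};

/* Process side: copy the payload into the segment and post a send descriptor.
 * For simplicity this sketch always uses offset 0 in the segment. */
static void user_send(struct endpoint *ep, const void *msg, uint32_t len,
                      uint32_t dst_tag) {
    memcpy(ep->seg, msg, len); /* data lands in NI-visible memory, no syscall */
    ep->send_q[ep->send_n++] =
        (struct desc){ .off = 0, .len = len, .tag = dst_tag };
}

/* NI side: drain one send descriptor and deliver it, demultiplexing by tag. */
static void ni_deliver(struct endpoint *src, struct endpoint *eps[], int n) {
    struct desc d = src->send_q[--src->send_n];
    for (int i = 0; i < n; i++) {
        if (eps[i]->tag != d.tag) continue;            /* demux on channel tag */
        struct endpoint *dst = eps[i];
        struct desc slot = dst->free_q[--dst->free_n]; /* take a free buffer */
        memcpy(dst->seg + slot.off, src->seg + d.off, d.len);
        slot.len = d.len;
        dst->recv_q[dst->recv_n++] = slot;             /* post to recv queue */
        return;
    }
}
```

Note how the kernel appears nowhere in this path: the only kernel involvement in U-Net is at channel setup, when the tag is assigned.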
4. Evaluation
First, the authors implemented kernel-emulated U-Net on the SBA-100 interface card (mainly due to its lack of DMA support). Then they implemented U-Net without kernel emulation on the SBA-200, which has DMA support, along with Active Messages on U-Net (UAM). Using Split-C benchmarks, they show U-Net's efficiency by comparing a U-Net ATM cluster against the CM-5 and Meiko CS-2; in several benchmarks, U-Net performs better than the other two in terms of execution time. The authors also implemented TCP and UDP on top of U-Net. The results show that U-Net TCP and UDP have better latency and bandwidth than the traditional ones.
5. Confusion
How current Linux works with network transmission? What strategy they use to mitigate kernel overhead during network message transmission?
Posted by: Cheng Su | February 27, 2017 03:54 PM
Summary
The kernel slows network I/O too much with its whole business of system calls and context switches. They try to get rid of the networking bit in the kernel by pretty much rewriting the kernel's networking code as a separate entity known as the U-Net architecture. They argue this is a great idea because having network code in the kernel prevents us from optimizing for certain kinds of workloads and protocols, optimizations that would give us huge gains in latency.
Problem
With faster Ethernet available, the bottleneck of network communication is no longer bandwidth. The paper claims that it's the software implementation of networking protocols that slows the process down. In particular, the OS/kernel plays too much of a role in handling network code, which causes delays due to system calls. So the authors would like to implement the networking stack at the user level, removing the kernel from the critical path, which would also allow users to streamline buffer management. The kernel retains only minimal functionality.
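As a hedged sketch of what "streamlined buffer management" might look like once the kernel is out of the path: an application-managed pool of fixed-size buffers carved out of the communication segment, recycled in O(1) without any system call. All names and sizes here are my own illustrative assumptions, not anything from the paper.

```c
#include <stddef.h>

#define NBUF  4
#define BUFSZ 256

/* Stands in for the pinned communication segment. */
static unsigned char segment[NBUF * BUFSZ];
static int free_list[NBUF];
static int free_top;

static void pool_init(void) {
    for (free_top = 0; free_top < NBUF; free_top++)
        free_list[free_top] = free_top; /* all buffers start free */
}

/* Returns a segment offset for a fresh buffer; -1 means the pool is
 * exhausted and the application should apply backpressure. */
static int buf_alloc(void) {
    return free_top ? free_list[--free_top] * BUFSZ : -1;
}

/* Recycle a buffer: a single array store, no kernel involved. */
static void buf_free(int off) {
    free_list[free_top++] = off / BUFSZ;
}
```

Because the application chooses the pool policy itself (sizes, counts, backpressure behavior), it can tune buffering to its workload, which is exactly the flexibility argument the review is making.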
Contributions
3 major goals:
* provide low-latency communication, especially for small messages (they go on and on about how important small messages are)
* exploit the full network bandwidth even with small messages
* facilitate novel communication protocols, which are hard to implement currently
Keeping these 3 goals in mind, the authors came up with the U-Net architecture. The general idea is that each process/application thinks it has access to its own network interface.
At the heart of it all are 3 building blocks:
* endpoints: the process's handle into the network
* communication segments: memory regions for storing message data
* message queues: hold descriptors for messages that are about to be sent or have been received
To send a message, the data is composed in the communication segment, and a descriptor for the message is created and pushed onto the send queue. The U-Net interface has some policy for picking things out of the queue, but eventually it dequeues the descriptor and sends the data across.
To receive, the network interface demultiplexes the packet and delivers it to the right process; the message carries a tag which U-Net uses to find the right endpoint.
Receiving in the process may be polling- or interrupt-based (I think they call the latter an upcall).
A neat little feature is that it supports two types of message buffering:
* zero copy: there is no intermediate buffering; data goes directly from one process's segment to the other's.
* base-level copy: there is an intermediate copy into a networking buffer. The management of these buffers is configurable based on user needs.
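The two buffering modes can be contrasted in a toy sketch. This is my own illustration, not the paper's code; since direct-access U-Net was never built for lack of hardware support, the zero-copy path below is purely notional.

```c
#include <stddef.h>
#include <string.h>

#define SEGSZ 512
/* Stands in for the pinned communication segment. */
static unsigned char comm_segment[SEGSZ];

/* Base-level U-Net: one intermediate copy into the communication segment;
 * the NI then transmits from the segment. */
static const void *base_level_send(const void *app_buf, size_t len) {
    memcpy(comm_segment, app_buf, len); /* the extra copy */
    return comm_segment;                /* NI reads from here */
}

/* Direct-access U-Net: true zero copy; the NI would read application
 * memory directly, which requires NI-side address translation that the
 * authors' hardware could not provide. */
static const void *direct_access_send(const void *app_buf, size_t len) {
    (void)len;
    return app_buf;                     /* no copy at all */
}
```

The trade-off the review questions is visible here: zero copy removes one memcpy per message, which matters most when messages are large or the memory bus is the bottleneck.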
Confusion/Evaluation
1) Does buffer management really slow down network traffic that much even today? The true zero copy seems like a ton of work for not that much gain to me.
2) The whole business of the emulated endpoint: that's the same thing as going through the whole kernel business all over again, right? So this whole U-Net business seems like a whole lot of work for some micro-optimizations. They argue for one whole page that small messages make the world go round, and I can see that their architecture would shine there, but given that it's an academic paper I don't know what to believe.
Posted by: Ari | February 27, 2017 03:20 PM
1. Summary
U-Net allows more flexibility and performance from networked communications by moving almost all components of the network stack out of the kernel and into user space.
2. Problem
As high-speed local area networks become widely available, the bottleneck in the network shifts from the physical network speed to the messaging overhead. Previously, physical networks were slow and low-bandwidth enough that optimizing the protocol stack would not have yielded much performance gain. But the rise of fast LANs (they specifically mention ATM) means that these unoptimized stacks are now a serious bottleneck.
3. Contributions
The authors identify a very important workload for testing communications protocols: a large number of small messages. As they point out, companies frequently use bulk-data transfer workloads to test and advertise the performance of protocols and networks. The latency and bandwidth afforded to a large volume of small messages is very important for modern distributed workloads.
Having identified this problem, the authors suggest pushing a large amount of the network stack, usually offered by the kernel, into user space. They attempt to do so without requiring modifications to the network hardware itself, but comment that much greater performance could be achieved with customized hardware.
4. Evaluation
U-Net is implemented on two generations of a particular ATM network card. The older card has pretty inflexible hardware and U-Net had to be implemented in the kernel. They hacked the firmware of the newer card to support the U-Net interface.
They ran microbenchmarks to measure the latency and bandwidth of the raw U-Net interface, as well as a higher level Active Messaging protocol built on top of it. The U-net based protocol gets impressively close to the raw performance of the physical network.
They also test U-Net by running several distributed computation benchmarks on a cluster of average machines and compare their results against two contemporary parallel supercomputers. I am unclear what conclusions I was supposed to draw from this comparison.
Finally, they build TCP/IP and UDP/IP on top of U-Net and compare against the performance of the vanilla kernel networking stack. U-Net does very well in this comparison.
5. Confusion
What was the purpose of the two implementations on the old and new network cards? Was the old card supposed to serve as a baseline, since the whole system had to be relegated to the kernel anyway?
Posted by: Mitchell Manar | February 26, 2017 11:12 PM