
Active Messages: a Mechanism for Integrated Communication and Computation

Thorsten von Eicken, David E. Culler, Seth C. Goldstein, and Klaus E. Schauser. Active Messages: a Mechanism for Integrated Communication and Computation. Proc. of the 19th Int. Symp. on Computer Architecture, May 1992, pp. 256–266.

Reviews due Thursday, 11/20

Comments

Summary:
- The paper presents active messages, a mechanism for reducing communication overhead by combining communication and computation. Active messages use non-blocking, unbuffered communication to extract data from the network and integrate it directly into the ongoing computation.

Problems:
- The traditional programming model is blocking and requires a three-phase protocol for synchronous sends and receives.
- Poor communication and computation overlap. Processes had to stop computing while waiting for messages to/from other processes.

Contributions:
- No buffering (beyond what network transport requires), so data is extracted immediately and processes can integrate it into their ongoing computation.
- Split-C demonstrates the utility of Active Messages with PUT and GET.
- TAM helps compilers manage storage allocation and scheduling for active messages using an activation tree, making active messages a practical compilation target.
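The dispatch idea in the bullets above can be sketched in plain C: a message carries a handler address at its head, and "delivery" is simply a call to that handler with the message body as argument. This is a minimal simulation, not the paper's code; the names (`am_msg`, `am_deliver`, `add_handler`) are mine.

```c
#include <assert.h>
#include <string.h>

/* An active message: the head names the handler, the body carries data. */
typedef void (*am_handler_t)(void *body, int len);

typedef struct {
    am_handler_t handler;   /* address of user-level handler (message head) */
    char body[64];          /* message body, passed as the handler argument */
    int len;
} am_msg;

/* Delivery: no buffering, no scheduling -- just invoke the handler,
 * which integrates the data into the ongoing computation. */
static void am_deliver(am_msg *m) {
    m->handler(m->body, m->len);
}

static int accumulator = 0;

/* Example handler: fold the received integer into ongoing computation. */
static void add_handler(void *body, int len) {
    int v;
    (void)len;
    memcpy(&v, body, sizeof v);
    accumulator += v;
}

int send_and_deliver(int value) {
    am_msg m;
    m.handler = add_handler;
    memcpy(m.body, &value, sizeof value);
    m.len = (int)sizeof value;
    am_deliver(&m);          /* in hardware this happens on message arrival */
    return accumulator;
}
```

In a real implementation `am_deliver` would be driven by message arrival at the network interface; the point is that no intermediate buffering or receiver-side matching happens between arrival and the handler call.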

Confusing:
- Figure 5 shows the joint utilization of the processor and the network. However, it is not clear how the joint utilization is measured, or whether a value somewhere in the middle is good or bad. Is Split-C paying processor utilization for better network utilization?

Learned:
- Combining computation and communication can greatly improve parallel computing.

Summary
The paper discusses a message passing protocol called Active Messages with the key objective of minimizing communication overhead and overlapping communication and computation to improve performance of parallel programs.

Problems
1. At the time, no existing message passing mechanism efficiently considered network and CPU utilization simultaneously.

2. The overheads of sending and receiving a message were very high.

Contributions
1. Asynchronous messaging to overlap computation and communication

2. No buffering required, since space is preallocated

3. Developed Split-C, which incorporates the Active Messages paradigm in its PUT and GET primitives

4. Deadlock avoidance using separate channels for requests and replies

5. Each Active Message's header contains the address of its handler, so computation on the message can begin immediately.
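The request/reply restriction behind point 4 can be sketched in C. Assuming, as the paper's model does, that a request handler may only issue a reply and a reply handler sends nothing, every exchange is a chain of length two, so no cyclic dependency can form. All names here are illustrative:

```c
#include <assert.h>

/* Logically separate channels for requests and replies. */
enum { REQ_CHANNEL = 0, REP_CHANNEL = 1 };

static int sent_on[2];          /* messages sent per channel */

static void send_on(int channel) { sent_on[channel]++; }

/* A reply handler terminates the chain: it absorbs data, sends nothing. */
static void reply_handler(void) { /* integrate data; no further sends */ }

/* A request handler may respond only on the reply channel.  Sending
 * another request from here is what the model forbids, since
 * request->request chains could form a cycle and deadlock. */
static void request_handler(void) {
    send_on(REP_CHANNEL);
}

int run_request(void) {
    send_on(REQ_CHANNEL);       /* compute side issues a request */
    request_handler();          /* arrival invokes the request handler */
    reply_handler();            /* reply arrival ends the exchange */
    return sent_on[REQ_CHANNEL] + sent_on[REP_CHANNEL];
}
```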

Thoughts
The idea of directly executing the handler named in a message, where messages are sent and received asynchronously, is a simple yet powerful way of overlapping computation with communication.

Without dynamically allocated buffers, how do active messages deal with large arguments in messages?

SUMMARY: The paper presents a new message passing mechanism design called "Active Messages" and an implementation called "Split-C".

PROBLEM: Different large-scale multiprocessors implement different message passing paradigms, and the methods for programming them are not one-size-fits-all. Too often, programs operate in distinct "computation" and "communication" phases when in fact these two things can happen simultaneously. Typical models of communication have an active sender and receiver working in concert to move data, and while doing so they block computation from occurring.

CONTRIBUTIONS: Active messages provide an asynchronous communication model, in which the sender does not block and computation continues while the network is delivering the message. Likewise, on the receiving side, computation is only temporarily interrupted while a message handler is invoked. The receiving process does not have to actively read the data, figure out what to do with it, or take immediate action on it. Rather, the handler has already been specified by the sender, and the handler should not do any computation of its own, but merely integrate the received data into the computation as quickly as possible and return control to the main computation process. By using this mechanism, the authors are able to generalize other paradigms such as send/receive or shared-memory, and, when programmed correctly, to achieve a significant performance increase through better parallelism of communication and computation. They provide an implementation of this design called "Split-C" and demonstrate performance metrics. Finally, they describe an execution model called TAM which is meant to aid in converting typical multiprocess code into a form that integrates better with Active Messages to better overlap I/O and computation.

CONFUSION: I was somewhat confused by what it means to "integrate the data with the computation as quickly as possible". It seems that storing data into an existing (and ongoing, even if temporarily suspended during the handler) computation would be very tricky to do correctly due to data races. For example, if I get the width of a matrix and then the height of the same matrix as two separate operations, there's no guarantee that a handler was not called between those two operations that updated the size of the matrix.

LEARNED: What I learned is that even "fancy" multiprocessors like the CM-5 were designed to showcase processor performance and didn't necessarily provide easy programmer tools to focus on overall system throughput. Parallel programming is hard, made more so by the fact that different systems require vastly different communication models, and it seems that even making a generalized version can still be tricky with regards to deadlock and data races.

Summary:

In the paper the authors introduce a simple asynchronous communication mechanism called Active Messages that aims to avoid unnecessarily high communication costs while maintaining good processor cost/performance, offering flexibility, and making cost-effective use of the hardware. In an Active Message, the control information at the head of the message is the address of a user-level instruction sequence that will extract the message from the network and integrate it into the ongoing computation. The authors describe and evaluate implementations on the nCUBE/2 and CM-5 using a split-phase shared memory extension to C, Split-C.

Problem:

Existing message passing multiprocessors had unnecessarily high communication costs; on the other hand, research prototypes of message driven machines demonstrated low communication overhead but suffered from poor processor cost/performance. The design challenges for large-scale multiprocessors are to minimize communication overhead, to allow communication to overlap computation, and to coordinate the two without sacrificing processor cost/performance. Existing systems did not achieve these goals, and the authors address that gap in this paper.

Contributions:

  • The use of Active Messages (which is not a new parallel programming paradigm) to implement other paradigms simply and efficiently while exposing the full hardware flexibility and performance.
  • The Split-C programming model: overlapping communication with computation instead of alternating between computation and communication phases.
  • An implementation and evaluation on the nCUBE/2 and CM-5 using a split-phase shared-memory extension to C, Split-C.
  • Elimination of message buffering.
  • The authors describe a range of enhancements to mainstream processors for hardware support for active messages.

Learned:

It is interesting that the authors used a comparatively primitive communication mechanism, active messages, to achieve cost-effective use of the hardware while keeping good processor cost/performance and low communication costs.

Confusion:

I am wondering about the applicability and scalability impact in today's environment.

Summary:

A node in these systems contains a processor chip, DRAM, and a network interface with channels, with nodes connected by a hypercube or similar interconnect; such machines run long-running SPMD programs in parallel. In a traditional message passing system, a program's run time depends on Tcompute and Tcommunicate, where Tcommunicate includes both the setup cost and the time to send the message. Traditional systems implement blocking sends and receives with a three-phase handshake, which carries a lot of communication overhead and does not overlap communication with computation. To improve efficiency, systems added asynchronous, buffered sends and receives, but these still fail to overlap communication with computation. Active messages, as proposed in this paper and implemented in the Split-C user library, provide asynchronous GET and PUT protocols in which each message carries a handler address and a message body. When a packet is received, its handler function executes and copies the message body into the ongoing computation with minimal overhead. Because the mechanism has so little overhead, small messages can be transferred efficiently, and the asynchronous protocols overlap computation and communication. For large messages, where packet ordering is needed, a two-phase protocol is used for GET and a three-phase protocol for PUT. With active messages the authors show good performance compared to the traditional programming model, and they argue that performance could be increased further by implementing the functionality in hardware.
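The cost model sketched above can be made concrete: without overlap, a step costs Tcompute + Tcommunicate, while a fully overlapped step costs only the larger of the two. A minimal illustration in C (times are arbitrary units; the function names are mine):

```c
#include <assert.h>

/* Without overlap, compute and communication phases alternate. */
int no_overlap(int t_compute, int t_comm) {
    return t_compute + t_comm;
}

/* With full overlap, the slower of the two activities dominates. */
int overlapped(int t_compute, int t_comm) {
    return t_compute > t_comm ? t_compute : t_comm;
}
```

For a step with 90 units of computation and 10 of communication, overlap recovers the entire communication time, which is why the paper emphasizes overlap rather than raw latency reduction.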

Contributions:

1. Performance/cost: they observe that under the standard programming model, even with high-speed networks, driving the processor at 90% of peak performance leaves the network roughly 90% idle. Hence other routes to high performance are needed.
2. Communication latency: reducing latency is not the key issue. Overlapping communication with computation yields peak performance and is much easier than reducing latency.
3. Active Messages: each message names a handler function that is scheduled immediately at the receiving end and copies the message body into the ongoing computation.
4. Asynchronous GET and PUT: GET fetches data from a remote node; PUT stores data to a remote node.
5. GET is implemented by sending a request to the remote node, which replies with a PUT; synchronization is done using a flag in shared memory.
6. TAM, another framework implementing active messages using activation frames, which hold threads (computation) and inlets (message handlers).
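The split-phase GET of points 4 and 5 can be sketched in C: the requester starts the transfer, may keep computing, and later spins on a completion counter that the reply handler decrements. The network is simulated here by immediate delivery, and every name is illustrative rather than the real Split-C API:

```c
#include <assert.h>
#include <string.h>

static int remote_mem[4] = {10, 20, 30, 40};  /* stand-in for a remote node */
static int local_buf[4];
static volatile int pending = 0;              /* completion counter (flag) */

/* Reply handler: copy the data into place and signal completion. */
static void get_reply_handler(const int *data, int n) {
    memcpy(local_buf, data, (size_t)n * sizeof(int));
    pending--;
}

/* Start a GET: note one outstanding transfer, "send" the request.
 * Here the reply arrives instantly instead of over a real network. */
static void get_start(int n) {
    pending++;
    get_reply_handler(remote_mem, n);
}

int get_sync_and_sum(void) {
    get_start(4);
    /* ...overlapped computation would go here... */
    while (pending) { /* spin until the handler signals completion */ }
    return local_buf[0] + local_buf[1] + local_buf[2] + local_buf[3];
}
```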

Confusing:

I am confused about why they say eliminating buffers improves performance in active messages.

Learned:

Since the paper focuses on optimizing performance at a very fine granularity, the details of how those optimizations are done are very interesting: using DMA for large messages, keeping arguments in registers, compiler optimizations for fast polling, and so on.

Summary:
The paper introduces the Active Messages communication mechanism and shows that it allows cost-effective use of the hardware and offers tremendous flexibility.

Problem:
Existing message passing multiprocessors have unnecessarily high communication costs. The real architectural issue is to provide the ability to overlap communication and computation, which requires low-overhead asynchronous communication.

Contributions:
(1) Introduce active messages. Each message contains at its head the address of a user-level handler that is executed on message arrival with the message body as argument. This generalizes the hardware functionality by allowing the sender to specify the handler to be invoked on arrival. The handler extracts the data from the network and integrates it into the ongoing computation with a small amount of work. To minimize communication overhead, active messages are not buffered except as required for network transport, which keeps handler execution fast.
(2) Introduce active message implementation on nCUBE/2 and CM-5.
(3) Develop the Split-C programming model that provide GET and PUT remote memory operations in C.
(4) In the message driven model, the handler may perform arbitrary computation; in particular, it may synchronize and suspend. Because a handler may suspend waiting for an event, the lifetime of the storage allocated in the scheduling queue for messages varies considerably. Active messages, with their simple run-to-completion handlers, are nonetheless sufficient to implement the dynamically scheduled languages for which message driven machines were designed.
(5) Introduce the TAM, a fine-grain parallel execution model based on Active Messages. It can improve the locality of computation by batching the active messages.

Confusion:
Active messages are not buffered, to minimize communication overhead. But isn't buffering itself a standard method to improve locality and reduce communication overhead?

Learned:
It's interesting that this paper connects message passing in distributed systems to hardware. Most research focused on how to use message passing or how to reduce the number of messages; this paper, however, also focuses on hardware support for active messages.

Summary:
The paper introduces Active Messages, a message passing mechanism that overlaps computation and communication in distributed computing. The overlap raises the utilization of each processor and of the network bandwidth, leading to better performance. Accompanied by the Split-C GET/PUT programming model, it achieves an order of magnitude improvement over previous mechanism + programming model pairs.

Problem:
One way to utilize each processor in distributed computing is to minimize the communication overhead (in synchronous message passing). However, traditional approaches typically use three-phase blocking message passing, which has huge communication overhead. Another approach is to overlap computation with communication (in asynchronous message passing); however, traditional asynchronous methods (buffered, non-blocking sends and waits) fall short because of the start-up cost of buffer management.


Contributions:
1. Computation and communication overlap seamlessly via handlers carried in messages. A handler is the user-level code that extracts the information in a message, so the computation thread can resume almost immediately after a message is received.

2. Because handlers extract data directly, Active Messages do not use any buffers beyond network transport; reception is handled by the user program.

3. The accompanying Split-C programming model supports PUT and GET operations. Each thread interleaves PUT/GET communication with its computation, which optimizes efficiency.

Confusing:
The paper claims that a message is processed by its handler and the result is stored to a preallocated memory address. But what about consistency: what if two threads write to the same memory address at the same time?

Learned:
A good message passing mechanism needs a corresponding programming model to achieve a performance advancement.

Summary
The authors discuss a new message passing paradigm that leverages the benefits of asynchronous communication and parallelizes computation and communication. They show that their communication architecture, named Active Messages, is orders of magnitude faster than existing systems.
Problem
In a distributed, multiprocessor, or multiprocess system, communication overhead often becomes the bottleneck. To increase processor utilization, a high-performance network is needed to minimize communication time, yet that network would then be idle 90% of the time. So the authors set out to coordinate communication with the ongoing computation at low start-up overhead.

Contribution

  • Latency vs. overlap: latency is not the main architectural issue; overlapping communication and computation is.
  • Carrying an instruction pointer to the user-level extraction code with the message lets the process continue computing while a handler works on the message.
  • They removed user-level buffering; space is preallocated, and the handler operates on the message directly.
  • Split-C: based on Active Messages, they developed this programming model to support basic operations like GET and PUT in C.
  • TAM: given the intricacy of active messaging, they developed an execution model and compiler support that map parallel programs onto Active Messages.

Learned
The communication latency bottleneck can largely be overcome by overlapping computation and communication.

Confusing
The paper is extremely thorough and well explained. I don't think I have any genuine confusion about the paper.

Summary
This paper describes Active Messages, a low-level messaging protocol aimed at optimizing network communication and reducing latency.

Motivation
Active Messages try to improve parallel program performance by
1) reducing communication overhead
2) overlapping communication and computation
3) taking into account communication, computations and their interactions.

At the time there were two prominent models: message passing and message driven. Message driven has low communication overhead but requires special hardware for good performance. Message passing has higher communication overhead and requires extensive buffering to achieve overlap, further increasing overhead, but needs no special hardware. Active Messages tries to get the best of both: low communication overhead and good performance even without special hardware.

Contributions

1) They decrease overhead by only buffering when it is absolutely required.
2) They restrict message handlers to have simple functionality and run atomically to implement handlers with no buffering and no special hardware.
3) They prevent deadlock by making communication request reply, so as to have acyclic dependencies.
4) They provide proof of concept in the form of Split-C
5) They also propose hardware support for Active Messages.
6) They try to increase cache locality with TAM by batch processing.

Confusion
How is handler atomicity with respect to the main computation achieved?

Learned
Pipelining phases increases performance and utilization of resources. Resource utilization and performance in case of parallel computation is increased by overlapping communication and computation.

Summary
This paper describes Active Messages, a low-level messaging protocol aimed at optimizing network communication and reducing latency.

Motivation
Active Messages try to improve parallel program performance by
1) reducing communication overhead
2) overlapping communication and computation
3) taking into account communication, computations and their interactions.

At the time there were two prominent models: message passing and message driven. Message driven has low communication overhead but requires special hardware for good performance. Message passing has higher communication overhead and requires extensive buffering to achieve overlap, further increasing overhead, but needs no special hardware. Active Messages tries to get the best of both: low communication overhead and good performance even without special hardware.

Contributions

1) They decrease overhead by only buffering when it is absolutely required.
2) They restrict message handlers to have simple functionality and run atomically to implement handlers with no buffering and no special hardware.
3) They prevent deadlock by making communication request reply, so as to have acyclic dependencies.
4) They provide proof of concept in the form of Split-C
5) They also propose hardware support for Active Messages.
6) They try to increase cache locality with TAM by batch processing.

Confusion
They mention that handlers execute atomically with respect to one another, but not how they execute with respect to the main computation. How are they integrated safely with the main computation?

Learned
Pipelining phases increases performance and utilization of resources. Resource utilization and performance in case of parallel computation is increased by overlapping communication and computation.

Summary:
The authors discuss active messages as a means to achieve cost-effective use of hardware and to integrate communication with computation. Through active messages they are able to overlap communication with computation and reduce the overhead of communication. They also implement a split-phase shared memory extension to C (Split-C) to show the effectiveness of their new paradigm, and they outline a range of hardware enhancements to processors that would enable even better performance from active messages.
Problem:
Commercial multiprocessors tend to focus on processor performance and neglect the interaction with network while research efforts tend to focus on and solve certain aspects of it at the cost of processor performance/cost. The resulting poor overlap between computation and communication results in poor performance overall. Finding an effective communication mechanism so that both computation and communication can be overlapped effectively is the main problem being solved here.
Contributions:
- They provide an asynchronous message passing model which allows for communication and computation to overlap, increasing the performance of the system by an order of magnitude.
- They provide a mechanism to avoid buffering beyond what is necessary at the networking interface buffer.
- Including the destination handler's address in the message for quick handling of the message.
- The capability of the handler to extract the message and then seamlessly integrate it with on-going computation.
What was not clear:
Does it have a mechanism for detecting message drops, or is TCP assumed to be the protocol used for transfer? If it does not deal with drops well, the system might enter a deadlock state.
My Key takeaway:
When building systems, especially large ones, it pays to overlap the different components.

Summary:
This paper introduces a communication protocol of sorts and computation methodology called Active Messages. These messages are similar to RPC calls or JSONP queries with subtle, but substantial differences, which can make them an order of magnitude faster than these naïve approaches.

Problems:
Active Messages attempt to integrate communication and computation so there is minimal overhead in message passing and efficient utilization of compute resources and network bandwidth.

Contributions:
The major overarching contribution that this paper describes with Active Messages is non-blocking computation on network messages. In modern terms this may be described as reactive programming, such that data transmitted isn’t just a remote procedure call, but acts more as an online stream which is integrated into existing computation. Other notable contributions include:
• Zero buffering in the application code (messages should be non-blocking and utilized immediately); buffering is only done in the network stack when necessary.
• A programming model called Split-C, which provides remote memory operations (such as PUT and GET) using Active Messages.
• A Threaded Abstract Machine (TAM) that improves cache locality when processing Active Messages by running batches of related messages together.
• A "zero-copy" methodology: when new messages arrive, memory allocation is only done for new "activation frames", which contain the thread context for the message and a pointer to it.

Unclear:
It was a little unclear with the author’s algorithmic communication model (Section 1.1) why they didn’t include computation start up costs. This is clearly something to keep in mind, as cache warming may be a bigger issue with data heavy applications. They accounted for locality with the Threaded Abstract Machine (TAM) later on, but for some reason forgot to include it in their model to begin with.
As an aside, what the heck is a “binary hypercube network”, which apparently supports “cut-through routing across a 13-dimensional hypercube” (Section 2)? This was really unclear.

Learned:
I learned how designing for both computation resources and network infrastructure in a combined, integrated approach could realize higher utilization in a distributed system. Active Messages provide an interface where data computation does not stop and likewise network communication doesn’t stop when there are plenty of resources to continue work.

Summary:
The main idea of the paper is to provide a mechanism to parallelize computation and communication using asynchronous messaging. The authors name their system Active Messages; it provides a message passing communication framework between different processors in a multiprocessor architecture.

Problem:
In the synchronous model of communication, where processors send and receive messages in a blocking fashion, so much time goes to communication that it can exceed the time taken for computation. If asynchronous communication is used instead, buffer management and the initial start-up costs pose a great problem. Active Messages tries to solve this by making the computation and communication stages overlap with one another so that maximum performance is achieved.

Contributions:
1. The concept of having a handler for messages, which separates message handling from the computation. Whenever a message is received, the handler is invoked via an interrupt-like mechanism; the handler takes responsibility for processing the message and supplying it to the computation while the computation continues in parallel. This idea forms the basis for overlapping communication and computation.

2. The method used to handle deadlocks by separating the channels used for sending and receiving messages was also an important contribution.

3. The idea of removing the buffers and complex buffer management methods that were used in asynchronous communication helped in reducing the cycles used for buffer management.

4. The development of Split-C programming model helped programmers to use Active Messages with get and put primitives.

Thing I learnt:
I learnt that it is possible to communicate asynchronously without the use of buffers. I also learnt that computation and communication are not necessarily serial operations and can be performed in parallel.

Thing I found confusing:
Although this seems to be a great idea, I wonder whether active messages are used anywhere in the real world. I'm not sure whether present-day multiprocessors use this technique to overlap communication and computation. Asynchronous communication using buffers was simpler, even though it had some overheads.

Summary:
This paper presents the concept of active messages. Active messages provide a mechanism to closely couple computation and communication. Therefore, active messages are a solution to the inefficiencies of common computation and communication architectures.

Problem:
The problem is that most communication mechanisms built into high speed multiprocessor units are extremely inefficient and therefore introduce high overhead. The issue is that the communication and computation functionality are built as distinct units and do not integrate well. This leads to high overheads when communication and computation are interleaved as is usual in normal operation.

Contributions:
- The concept behind active messages of integrating communication and computation to a small degree by providing the address for the message handler directly at the beginning of passed messages.

- Eliminating the need for buffering messages by pre-allocating the memory needed for data structures of incoming messages.

- The simple implementation of active messages in the Split-C programming language, doing all sends, some computation, and then all receives.

- The evaluation of their active message architecture demonstrating good results and the discussion of achieving even better results with direct hardware support for their concept
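The "all sends, some computation, then all receives" pattern mentioned above can be sketched in C. Delivery is simulated as instantaneous, and all names are illustrative rather than real Split-C primitives:

```c
#include <assert.h>

static volatile int outstanding = 0;  /* transfers still in flight */
static int inbox[8];                  /* destination of the puts */

/* Handler run on "arrival": store the value and signal completion. */
static void store_handler(int slot, int value) {
    inbox[slot] = value;
    outstanding--;
}

/* Start a put; here the network delivers immediately. */
static void put_start(int slot, int value) {
    outstanding++;
    store_handler(slot, value);
}

int phase(void) {
    int i, local = 0;
    for (i = 0; i < 8; i++) put_start(i, i + 1);  /* all sends first */
    for (i = 0; i < 8; i++) local += i;           /* overlapped computation */
    while (outstanding) { }                       /* then wait for receives */
    for (i = 0; i < 8; i++) local += inbox[i];
    return local;
}
```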

Learned:
I was intrigued by the simple yet extremely effective concept of active messages and the performance benefits of having a relatively small amount of integration between communication and computation in a multiprocessor environment. It felt very intuitive from a practical standpoint.

Confused:
I would like a bit more background on the Threaded Abstract Machine model. I think that would help illuminate the explanations in that section, which were lost on me having no knowledge of the model.

Summary:
Active Messages is an asynchronous communication mechanism for efficiently sending messages in large-scale multiprocessor environments. It allows more cost-efficient use of hardware by letting communication and computation overlap. The goal is to keep the pipeline full, so multiple communication operations can be outstanding from a single node, and messages are not buffered except for network transport. A handler executes immediately to get the message out of the network quickly, and the head of the message specifies the instructions that integrate the communicated data into the computation.

Problem:
Existing message passing multiprocessors have generally neglected network performance and they tend to have unnecessarily high communication costs. This is partly due to inconsistent programming methodologies and system goals applied to large parallel machines. This is also caused by hardware designers who focused on raw processor performance (and neglected network performance). The goals of Active Messages are to: 1-minimize communication overhead, 2-allow communication to overlap computation, and 3-coordinate the two without sacrificing processor cost / performance.

Contributions:
Active Messages Protocol:
The traditional 3-phase protocol can't overlap communication, and standard non-blocking solutions involve buffers on both ends, which creates too much start-up cost in terms of processor cycles. The key optimization in Active Messages is the elimination of buffering, which makes small messages more efficient; the network's own transport buffering then becomes sufficient. Deadlock is prevented by using non-blocking handlers that run to completion. Messages are also kept simple so handlers can reply quickly.

Split-C:
This is an experimental shared memory extension of C for SPMD programs (which uses Active Messages). Operations include put & get which perform split-phase copies of memory blocks to & from nodes. Split-C allows shared memory read/write (but does not address naming issues).

What I Found Confusing:
Threaded Abstract Machine: TAM - I thought involving the compiler to help manage memory allocation and scheduling was a novel idea, but it seems like a lot of assumptions were made about this model because I found the explanation of how continuation vectors & activation frames work in conjunction with the compiler a bit confusing.

What I Learned:
Not using buffers at all, yet still managing to create a functional non-blocking system is a great achievement. The way they achieved non-blocking and made it efficient was unexpected. I also learned about some system architectures I had not encountered before.

Summary:
The authors present Active Messages, an asynchronous messaging primitive that brings an order-of-magnitude improvement in communication overhead on non-specialized hardware by, for example, removing buffering at send/receive endpoints. In addition, evolution in hardware support is suggested to further the benefits.

Problem:
Computers need to both compute locally and manage communication. Processors which are designed for computation see high communication overheads due to buffering data or waiting for multi-phase protocols. Hardware specialized for messaging may better overlap communication and computation, reducing the impact of communication costs, but general efficiency in computation suffers.

Contributions:
- Idea to include control information in the header of a message to direct the recipient to the correct handler for the transfer. The sender does not buffer waiting for the receiver, and the receiver is interrupted to execute the handler, thus also avoiding buffering.
- In order to achieve overlap in computation and communication, programming patterns must take into account the latencies, so the authors provide Split-C programming model with a matrix-multiply example.
- Hardware support possibilities outlined, including DMA (with a small Active header sent before data to instruct the setup of DMA), Message Registers for efficient transfer of processor state without using memory buffers, etc.

Learned:
Even with asynchronous I/O, processes can be made more efficient by consciously overlapping local work with expected network latencies of communication while writing code.

Confusing:
Are addresses necessary? Requiring SPMD for correctness restricts flexibility in software upgrades. Something else that wasn’t answered is: If programmers do not correctly account for latency, do Active Messages still help at all/does anything get worse?

Summary:
This paper introduces a simple communication mechanism, Active Messages, which allows cost-effective use of the hardware and offers tremendous flexibility. To demonstrate the utility of Active Messages, the authors introduce a new programming model, Split-C, which can overlap communication and computation. In the end, some hardware suggestions are presented to better support Active Messages.

Problems:
The design challenge for large-scale multiprocessors is to minimize communication overhead, allow communication to overlap computation, and coordinate the two without sacrificing processor cost/performance. In message-passing architectures, the synchronous model cannot overlap communication and computation due to its blocking mechanism, while the asynchronous model has the problem that start-up costs prevent overlap of communication and computation unless the message is large enough.

Contributions:
1. Detailed analysis of traditional message-passing models. For the two models, synchronous and asynchronous, several aspects are inspected, such as the message-delivery protocol, set-up costs, and the utilization of network bandwidth.
2. Separation of computation logic from communication logic. The Active Message handler does not perform computation operations but just extracts data from the network and integrates it into the ongoing computation.
3. A novel message structure. Each message contains at its head the address of a user-level handler, which is executed on message arrival with the message body as argument. Receive-side buffering is eliminated because storage for arriving data is pre-allocated in the user program, or the request is simple enough that the handler can reply immediately. A send buffer is only required for large messages.
4. The programming model Split-C that uses Active Messages. Also a matrix multiplication is used to show the performance and simplicity of Split-C.

Confusing:
1. What confused me is why Active Messages does better than the asynchronous message-passing model. It seems that the difference between the two comes only from buffer management, with Active Message handlers concerned only with data integration and extraction.

Things I Learned:
I learned that decoupling the parts of a complex protocol can improve efficiency. Here, decoupling computation from communication improves the overlap of the two in a message-passing protocol.

Summary :
The authors introduce active messages, a mechanism to overlap communication and computation. The goal of this would be to make better use of the underlying hardware which many systems tend to neglect.

Problem :
Systems that involve message passing focus more on minimizing time spent on computation, as a result of which the utilization of network hardware becomes very low. Also, since there are no standards for programming, it is very difficult to create optimizations for balancing communication and computation.

Contributions :
1. Parallelize communication with computation by using a handler: the message carries the user-level instruction pointer of the handler that extracts it, so the arriving data is available to the computation thread as soon as it is needed.
2. The authors state that they eliminate the need for using buffers because now, the user program either deals with it or the handler replies directly if it is a simple request. This was done by pre-allocating storage.
3. The system has two simple channels, one for requests and another for replies, which, in a clean and simple way, avoids deadlock.
4. Programming Split-C using GET and PUT. This method of using these as separate operations results in a high performance gain by using messages for communication.

What I learned :
I learned how communication and computation can be effectively overlapped to increase net performance of the system.

What I found confusing :
How does polling work?
I also found the TAM model highly confusing.

Summary:

This paper discusses the idea of active messages that is proposed to solve the problems of inefficient communication and computation as exhibited by then existing message passing and message driven multiprocessors. The paper shows how active messages are better than existing techniques and also how active message performance can still be improved if there is hardware support for active messages.

Problem:

Message passing multiprocessor machines have high communication overhead (synchronous) or inefficient buffer management (in async model). Message driven architectures integrate network deep into the processor so communication overhead seems less but the processor utilization is less. There is a disconnect between the programming model and the hardware functionality in existing methods. The authors seek a way to overlap communication and computation to better utilize the processor and the network.

Contributions:

1. The idea of AM itself - Novel way of including the address of the message handler at the head of a message. The message now contains the pointer to the code that does some processing i.e., it is an active entity - This idea seems to have inspired lot of other works like active storage etc. The handler just extracts the message from the network and integrates the extracted data into computation.

2. Elimination of buffering of messages - There is no need to buffer messages on the receiver because the space is preallocated by the user program and the handlers are very short and they don't allow piling up of messages at the receiver.

3. Split-C programming model to show the effectiveness of Active Messages. The example shows one can have a highly efficient parallel matrix multiplication by getting the column for the next iteration while computing on the current iteration. The program has to take care of balancing the latency of the get operation against the compute time of the current iteration. If the program is arranged so that the wanted data arrives just when it is needed, we have a very good overlap between communication and computation.

4. The authors show good performance with active messages without proper hardware support. They also propose a set of hardware changes that are required to support active messages in hardware.

One thing I found confusing:

I didn't understand the details of TAM where scheduling and memory allocations are managed by the compiler.

One thing I learned from paper:

I learned how communication overhead can be hidden by overlapping computation and communication.

Summary:
This paper describes Active Messages, a new message architecture which allowed for extremely fast and lightweight messaging in a parallel system.

Problem:
Messaging architectures of the day were inefficient. They either spent too long waiting for messages, under-utilized the network, or both. Part of the problem is that the interplay between the multiprocessors and the network was ignored; the network was treated as less important than processing speed.

Contributions:
The main contribution is the idea of an Active Message, which is simply a message whose head contains a pointer to the computation that should be done upon reception of the message. This simple idea proved to be an order of magnitude faster than previous messaging architectures.

An important takeaway is that messaging architectures that are application-dependent can be important. While this was true before, the fact that Active Messages, which were significantly faster, only worked with SPMD programming models makes the point very apparent. This, of course, raises the question of how to find similar gains in other paradigms.

Another important contribution is the extreme focus on overlapping communication and computation. It is well demonstrated that this fundamental idea is key to the utility of Active Messages, and indeed any messaging architecture which serves to reduce inefficiencies.

Did Not Understand:
I had a hard time understanding the different message driven architectures, and their related hardware descriptions (Monsoon and J-Machine).

Learned:
I knew very little about the basic messaging in computer architecture. Even the simple architectures I had never seen before.

Summary:
In this paper the authors present Active Messages, a new asynchronous communication mechanism for multiprocessors to improve their performance. This mechanism provides the ability to overlap communication with computation and reduce communication overhead. It is mainly achieved by sending messages asynchronously; each message invokes a handler at the receiver that extracts the message and integrates it into the ongoing computation. The handler is specified in the control information at the head of the message. To achieve further improvement in performance, the authors also recommend additional hardware support for Active Messages, mainly in the network interface and in processor support for message handlers.

Problem:
The existing message-passing multiprocessors have reduced performance due to high communication overhead and lack of support for overlapping communication and computation. Also, multiprocessor designers have invariably focused on either processor performance or network performance, but ignored the interplay of processor and network. Models using blocking communication with the 3-phase protocol had high communication latency, and message-passing implementations with non-blocking buffered messages suffered from high start-up cost.

Contributions:
-Bringing out the problem that poor or no overlap of computation and communication results in reduced performance.
-The idea to send messages asynchronously and using the control information at the head of message to invoke the handler at receiver to extract message and integrate to on-going computation.
-Allowing the sender to specify the address of the handler thus generalizing the hardware functionality.
-Elimination of buffering in Active Messages, since the handlers interrupt the computation immediately on message arrival and execute to completion.
-Split-C implementation using Active Messages with 2 base functions – PUT and GET, where the data computations are sandwiched between PUTs and GETs and synchronization is achieved through checking a flag or busy-waiting.
-Several suggestions for hardware improvement in network interfaces and support for message handlers to further improve performance.

Confusing:
-Did not completely understand the message driven model TAM.
-Unclear how Active Messages prevents deadlock when used on the nCUBE/2.

Learned:
-Overlapping computation with communication can improve processor performance.
-Understood the difference and advantages of blocking/non-blocking, synchronous/asynchronous message passing schemes.

In multiprocessor systems, message-passing architectures involve a lot of communication overhead. Synchronous message passing follows a blocking send/receive, so there is no scope for overlapping computation with communication. Asynchronous message passing allows for overlap but at the cost of heavy buffering at both the sender and the receiver end. This paper comes up with a mechanism called Active Messages for communication in multiprocessors which is asynchronous and tries to achieve a high overlap of computation with communication along with better processor performance.

Contributions :

1. Active Messages mechanism includes a handler in the message which executes when the message is received taking the message body as an argument.
2. The active messages use an asynchronous send and on receiving the message, the handler executes immediately in parallel with the existing computation and provides the message extracted to the computation. Active messages get rid of buffering by pre-allocating data structures into which the data is to be received.
3. Incorporated Active Messages into the Split-C programming language and achieved a measured network utilization far lower than the predicted utilization for 128 processors.
4. The active message mechanism differs from message driven model in that the handlers do not suspend and also don’t perform any computation on the message retrieved. They only extract messages from the network to integrate them in the existing computation.
5. The results of using the active messages mechanism shows a significant reduction in communication overhead as compared to the message passing architectures used in the CM-5 machines.
6. Also discuss the modifications and extensions to the network interface and the processor that are required for active messages to work.

Learning :

The key thing that I learnt from this paper is that high performance can be achieved through high overlap of computation with communication, not only by reducing communication latency. Also, messages can be treated as active objects instead of just passive entities, with the processing of a message embedded within the message itself.

Confusion :

The working of the TAM compilation model based on Active Messages wasn't quite clear to me. Also, if the user-level handler address specified by the sender in the message is corrupted during communication, shouldn't there be a mechanism to detect it?

Summary:
This paper discusses the efficiency problems of existing message-passing multiprocessor systems and proposes a new communication mechanism in which the message header is the address of a user-level instruction sequence that will extract the message from the network and integrate it into the ongoing computation.

Problem:
In a parallel computing system built on multiple processors, the processors need to handle both computation and communication. In existing message-passing systems, computation and communication can't be overlapped, and the overhead of communication is high.

Contribution:
- By allowing the sender to specify the address of the handler to be invoked on message arrival, Active Messages enables concurrent communication and computation.
- Active Messages eliminate buffering in application layer.
- The handlers of Active Messages are very simple and deterministic: they execute immediately upon message arrival, can't suspend, and have the responsibility to terminate quickly enough not to back up the network.
- Active Messages is independent of any programming model. The authors discuss the possible hardware support for active messages in two categories: improvement to network interface and modification to the processor to facilitate execution of message handlers. Improving active message with hardware support is helpful for all possible programming models.

Learned:
In parallel computing system, it’s nice to make computation and communication in parallel. This can be achieved by a small trick - including message handler in the message itself.

Discussion:
I am confused about the background story of Active Messages' development; their hardware/software environment seems strange to me.
Active Messages can only support homogeneous systems, where the hardware/software on every node is exactly the same. This is a big limitation in my understanding. Is it possible to add a layer of abstraction over the handler in the message?
I feel that including the handler in the message is not always a good idea. For a processor with a deep pipeline and a large speed gap between cache and memory, fetching the new instructions (the handler) gets no help from prediction, which could degrade the processor's performance significantly.
The idea of Active Messages is a little similar to the active network architecture - including some logic in addition to the pure data in the message (packet). The resulting system is highly flexible, but I doubt the necessity of that flexibility for the success of the system.

Summary:
This paper introduces Active messages - a communication mechanism which minimizes communication overhead in the system. This mechanism tries to overlap computation and communication.

Problem:
In traditional message-passing systems, there is a huge overhead in message passing. In synchronous message passing, the sender is blocked until the receiver is able to receive the message. The asynchronous method also has a huge start-up cost: the sender returns immediately after the send operation, but buffering is needed on the receiver side so that the application can consume the message when it is ready.

Contribution:
- In the Active Message scheme, the address of the handler is passed in the head of the message. On receiving the message, the receiver executes the handler, which takes the data in the message as its argument. The handler is a short function whose role is to integrate the data into the ongoing computation or send a reply to the message, and then return immediately.
- There is no buffering needed on the receiver side because the handler is executed immediately on receive. If required, buffering can be done by the user program.
- The authors present the Split-C programming language, which uses two simple operations, PUT and GET, built on the Active Messages protocol. The application developer uses GET to request data in advance and then resumes computation; when the computation reaches the point where it requires the data, it polls a flag. The developer can fine-tune when to request data so that maximum time is spent in computation (the requested data is available when required).
- Paper also suggested changes in hardware (the network interface and the processor) to improve the performance of active messages.

Confused: I didn't quite get the TAM execution model, especially the scheduling of activation frames using threads.

Learned: I learned how communication overhead can be reduced and how different programming styles can be used to do it.

Summary: Paper presents Active messages, which is a processor to network communication paradigm that aims to overlap communication with computation by having messages point to a handler function that grabs the message and gets it into the computation. They show that it has an order of magnitude better performance in terms of cpu utilization than traditional methods of message passing.

Problem: When you have a supercomputer with hundreds of processors, it is important that those processors are actually utilized. However, for parallel programs that operate on a message-passing interface, network communication quickly becomes the main limiting factor in computation time, greatly lowering the amount of constructive computational work. Drastic steps have been taken to address this by building complicated hardware mechanisms, but perhaps an easier and more evolvable paradigm would be better suited.

Contribution: The main contribution of this work is the presentation of the Active Message design and its implementation on various hardware, and the evaluation showing that Active Messages perform extremely well compared to current practices.

The actual idea behind Active Messages is very simple. Basically, the head of each message contains the address of a handler function that should deal with that message. The purpose of the handler function is to take the message off the network and interleave it into the ongoing computation (e.g., putting it into data structures). Traditional ways of handling messages were either a three-phase protocol or buffers for asynchronous send and receive. With Active Messages, there are no buffers, and thus no complex buffer management that takes up cycles, because the handler functions are called immediately and are non-blocking (otherwise deadlock could occur). Handler functions are predetermined, so Active Messages work best in an SPMD context.

They developed a programming model called Split-C that allows programmers to take advantage of the Active Message paradigm via a put and get model, where put places some memory at a remote location and get retrieves it.

One of the cool things about Active Messages is that their implementation is in software and they still get great efficiency. Hardware support would make it even faster.

Confusion: I was not really understanding what the other message driven architectures were doing (section 3).

Learned: About some of the drawbacks when it comes to different methods of sending and receiving messages. In a time-shared system, blocking on messages doesn’t seem like that big a deal, since the processor can switch to another task on the queue, thus overlapping computation and communication. I wonder if this is still a problem with modern distributed systems such as Condor, which we just talked about.

Summary
The paper introduces a communication mechanism, Active Messages, which, with the help of hardware support, reduces network latency without sacrificing much processor performance or cost. An implementation of Active Messages on existing hardware is also described.

Problems to solve
Commercial multiprocessors have focused on improving CPU performance only and have neglected optimizing network costs. Active Messages seeks to improve performance and network latency, with help from hardware, by overlapping the computation of instructions with communication between machines. The authors also design a programming model that helps overlap communication and computation.

Contributions
• Recognized issues in the send/receive non-blocking model. These include large start up times which reduce chances for overlap, inability to utilize full network bandwidth and also the mismatch between the programming model and hardware functionality.
• Active messages generalizes the functionality of hardware by passing the address of the handler to be invoked by the receiver within the messages. This is achieved with the help of the SPMD programming model which runs on each node.
• Elimination of buffering on the receiving end of the messages as memory is pre-allocated or the request is so simple that the handler can reply immediately.
• Described implementation of Active messages on the nCUBE/2 and CM-5.
• Description of a new programming model, Split-C, which utilizes Active Messages through the two primitives GET and PUT.
• The lack of locality in message-driven architectures is addressed with the help of the TAM scheduling hierarchy, which removes the need for memory allocation on message arrival.

Found confusing
Did not completely understand as to how TAM works.

Learnings
Learnt about the importance of overlapping communication with computation to get more cpu performance and lower network latencies.

Summary: active messages provides a way to parallelize communication and computation by using asynchronous messaging.

Problem: with synchronous messaging, a node calls send and waits for the other end to receive before resuming computation. By doing so, communication and computation are serialized, suffering the consequences of high latency or poor throughput on the network. However, this approach is simple, because it requires little coordination and is easy to reason about.
Contributions:


  • use asynchronous messages instead of synchronous ones

  • apply the new message passing system to a shared memory programming environment

  • take advantage of DMA in hardware

Confusing: why weren't people using asynchronous messaging before?
Learned: the people who work on hardware don't always know what people in software need. They may be trying to optimize something that doesn't have a significant effect on speed-ups in the computation. In this paper there was the example of hardware people making CPUs faster, and making networking faster, but they aren't tightly coupled enough for parallel computing.

Summary:
This paper is about Active Messages - an asynchronous, primitive communication mechanism that is used for building message-passing models in a multiprocessor architecture where processors communicate with each other by sending and receiving messages. The basic idea behind Active Messages is that it reduces communication overhead by eliminating buffering and overlapping communication with computation. The Active Messages model is evaluated via the Split-C programming model, which is developed by the authors.

Problem:
Achieving low communication overhead involves compromising on individual processor performance, while achieving good processor performance results in high communication overhead. We have to combine the best of both worlds.

Contributions:
1. Removed the communication overhead in traditional synchronous message passing models that involves blocking send / receive by eliminating the initial handshake and making the operations asynchronous.

2. Passing the address of user level handler to be executed at the receiver’s end which executes quickly and returns to the user.

3. Buffering is eliminated at the receiver's end, thereby reducing buffer-management overhead. This was made possible by ensuring that storage was pre-allocated.

4. Simplicity in terms of scheduling the messages at the receiver’s end. The handlers just interrupt the ongoing computation.

5. Deadlock avoidance is implemented by a simple solution of having two channels, one for requests and another for replies.

6. The handlers that execute at the receiver's end are usually fast since they don't do any computation. The handlers just extract the data and inject it into the computation.

7. Development of programming model split-C with GET and PUT functions.

One thing I learnt:
Overlapping computation with communication can hide the underlying latency to a good extent. This can be achieved by exploiting the parallelism in the code.

One thing I found confusing:
I did not quite understand how polling works with active messages.

Summary:
The paper talks about Active Messages, a communication mechanism for large-scale multiprocessor settings which overlaps computation and communication with effective use of hardware. Later on, hardware enhancements are proposed to support the mechanism with effective message composition and handling.

Problem:
Traditional message-passing architectures have large communication overhead. In synchronous mode, send and receive are blocking and do not utilize network bandwidth effectively. In non-blocking mode, additional buffers are used, but start-up costs are high due to buffer management. Machines are not being used effectively due to poor or no overlap between communication and computation.
Active messages provide a handler to extract data and integrate it to ongoing computation.

Contributions:
- Overlapping computation and communication with a handler: the message header stores the address of a user-level handler that extracts the message, so the data is available to the computation thread whenever it requires it, parallelizing computation with message communication. In traditional systems, the compute thread would have to wait until the message is received (low network bandwidth usage) or get it from a buffer (overhead in buffer management and start-up cost).

- Optimization through buffer elimination: no buffers on the receiver side, as arrival is handled either by the user program (pre-allocating memory) or it's a simple request to which the handler can reply immediately.

- Split-C programming model: two split-phase operations, PUT and GET, are supported, demonstrating how Active Messages can be used as a communication mechanism to achieve high performance.

- Improvements in the network interface to reduce message-composition overhead, and modifications in the processor to facilitate execution of message handlers.

Unclear concept:
- Could not fully understand TAM execution model.
- The paper mentions that with Active Messages, latency tolerance becomes a programming/compiling concern. What are its repercussions, and is it a good idea?

Learning:
- How computation and communication can be overlapped for performance enhancements in a multiprocessor setting.
- Difference between Active Messages, RPC and other message driven models was an interesting read, helped in better understanding and supported the introduced concepts really well.

Summary:
This paper discusses the ideas of Active Messages – an asynchronous communication mechanism that uses the hardware's capability to efficiently overlap communication and processing. The main goal here is to spend most of the time (> 90%) in computation and not in communication. It is achieved primarily by sending messages asynchronously, which then trigger the handler at the receiver to extract the data and continue the computation.

Problem:
Previous models, such as the message-passing model, suffered from communication overhead, especially from being unable to overlap computation and communication. Blocking messages usually used the expensive 3-phase protocol; non-blocking messages offered better performance but had a huge start-up overhead. Also, most previous multiprocessors concentrated only on processor performance and ignored the interplay of network and processing.

Contributions:
• Messages are sent across the network asynchronously (non-blocking), with each message carrying the address of its handler at the beginning.
• Handler – implemented as a hardware interrupt handler on the receiver side, it extracts the message from the network as soon as it arrives. It does not perform the computation itself, so it stays responsive.
• Active Messages eliminate buffering (unlike conventional non-blocking schemes), since messages are stored only transiently in the network and are immediately brought into user memory by the handler.
• Split-C provides an implementation on top of Active Messages with two base functions – PUT and GET. Data computations are sandwiched between PUTs and GETs by estimating the time the computations require.
• Hardware customizations can boost the performance of Active Messages – for example, using DMA for faster message copying, message registers for quick transfer of processor state, and reuse of message data in the network interface.
• The message handler is implemented using interrupts, which flush the processor pipeline. An alternative is fast, software-based polling when handlers run frequently. A dual-processor design can also offload the handler and the computation to separate processors; this allows simultaneous execution but requires coordination between them.

Confusion:
Why is it necessary for the active message model to have a uniform code image on all nodes? The paper mentions that this is required for handler addresses to be coherent, but couldn't they instead pass a handler name or ID, so that no uniformity in the code would be required?

Learning:
I learned how the processor can be utilized inefficiently in spite of a non-blocking mechanism. I also understood how separating message extraction from data computation provides inherent parallelism to the execution.

Summary
The authors introduce Active Messages, a communication mechanism that allows computation to overlap communication while keeping overheads minimal, demonstrating a performance improvement over existing message passing schemes. They also implement a split-phase shared memory extension that demonstrates the effectiveness of this communication model. Finally, the authors suggest that with additional hardware support tailored to active messages, they could achieve even better performance than on current hardware.

Problem
Existing communication models in message passing architectures suffer in performance due to poor overlap of computation and communication and due to the startup costs of messaging. By reducing buffering to that required for data transport through the network, and letting computation proceed while data travels through the network, it is possible both to reduce the startup cost and to overlap communication with computation, thereby masking the latency of the communication.

Contributions

  • The authors observe that masking communication latency is better handled above the architecture: providing a mechanism to overlap communication with compute lets the program decide when communication should be initiated to best overlap with the compute. An ideal initiation delivers the data in a 'just in time' manner for the compute to use it, completely masking the communication overhead of fetching it.

  • Keeping buffering minimal lowers communication overhead - senders block until messages can be injected, and receivers act on a message as soon as it arrives. Handlers are kept short - their role is just to extract the data and integrate it into the computation. This allows handlers to run to completion and reduces the need for buffering on the receiver.

  • The Split-C implementation with its PUT and GET primitives effectively demonstrates how a parallel programming paradigm can be built on top of active messages as a communication primitive. In the example provided, the compute that follows the GET needs to be just long enough that the 'check' on the completion flag does not spin excessively. This shows how a programmer can organize data communication effectively by knowing when the data will be needed for computation.

  • Suggestions on how changes to both the processor and the interconnect could enable active messages to perform better than their implementations on general-purpose hardware.

What’s unclear
By delegating the responsibility of identifying the overlap to the programmer, doesn't the system intrinsically kill portability? Across systems there can be large variations in T-compute and T-communicate. By tasking the programmer or compiler with finding the sweet spot, aren't we just shifting the problem somewhere else?

Key takeaway
Communication is the cost of distribution. Try and hide it as much as possible by overlapping it with useful work to improve performance.

Summary:

The paper describes Active Messages, a mechanism for integrating communication and computation in multiprocessor systems without trading one off against the other. It also describes the hardware changes to network interfaces and processors that could support it.

Problem:

Existing message passing multiprocessors have high communication overhead. Message driven machines have low communication overhead but poor processor cost/performance. Traditional send/receive models overlap communication and computation poorly. There is also a mismatch between the programming model and the hardware functionality.

Contributions:

- Recognized the problem of poor overlap between computation and communication, and figured out how to overlap these tasks to achieve the best performance.
- Novel mechanism of passing the address of the destination's user-level handler so that it is executed quickly.
- Elimination of the receiver buffer to reduce communication overhead.
- The idea that the handler does not perform computation on the data, but only extracts it and integrates it with the ongoing computation, makes message processing quick. This is a drawback in event loop systems, where a handler can block for too long.
- The idea of allowing a user-level handler to access the network interface, with immediate replies, was present in another paper I read in Advanced OS, but I can't remember which.
- The message register mechanism to eliminate overhead can be seen in the L4 microkernel (although the paper speaks of this as a hardware optimization, while L4 uses existing registers for much simplified communication).
- A dual processor with different processing power can be seen in today's big.LITTLE technology.
- Simple mechanism for deadlock resolution by using two different channels, one for requests and one for replies.

One thing I found confusing:

I didn't understand how the polling mechanism used for receiving an Active Message could be more efficient than an interrupt-based mechanism. Traditionally, interrupt-based mechanisms are preferred to avoid wasting time polling.

One thing I learned from paper:

I learned how important it is to carefully figure out which parts of a program need to be interleaved in order to achieve the best performance.

Summary:

In this paper, the authors present a message passing mechanism that requires no special hardware support yet still achieves good performance, meaning it overlaps communication and computation well. A programming model built on this message passing mechanism is also discussed.

Problem:

The key problem in implementing a multiprocessor system via message passing is how to overlap computation and communication. With synchronous communication, overlap is hardly possible; with asynchronous communication, overlap is good as long as the cost of setting up the communication is small. However, most software-driven implementations require a nontrivial amount of computation time to set up messages (e.g., because of buffering).

Another way of implementing message passing is with hardware support. However, as the authors describe, most of these approaches have relatively low computation capability, which also limits flexibility.

The central problem this work tackles is how to implement message passing efficiently on hardware without special support.

Contributions:

The authors present Active Messages. The key observation is that the gap between commodity hardware and the functionality message passing requires is not that large. For example, the DMA support implemented by most hardware can be used to implement message passing efficiently. Each active message carries the address of its handler instructions, so no buffering is needed.

On top of active messages, the authors present the Split-C programming model, which has two primitives,
PUT and GET.

What I Found Confusing:

I am confused by the machine and system names nCUBE/2, TAM, etc. I feel more background is necessary to fully understand their differences.

Summary:

The paper presents Active Messages, a communication mechanism for multiprocessor environments. This mechanism enables efficient concurrency of computation and network communication, leading to effective network bandwidth and processor utilization. The concept of active messages is discussed using two message passing machines, the nCUBE/2 and the CM-5. The work also suggests hardware modifications needed for active message support.

Problem:

  • Although parallel programming has gained a lot of popularity, there are no specific programming standards, so hardware designers cannot perform optimizations that balance computation and communication.
  • Hardware optimization usually focuses only on processor performance and does not account for the interaction between the network and the processor.
  • With respect to network efficiency, latency is the usual optimization target, which again does not account for effective overlap.

Contributions:

  • Eliminating buffering requirements (mandatory in asynchronous send/receive) except for minimal needs such as deadlock avoidance and network transport of large messages. This aids efficient overlap of computation and communication, as it reduces the overhead of buffer space allocation.
  • RETRY of messages is handled at user level instead of in the message layer, thereby sparing the message layer from buffering overhead.
  • The Split-C matrix multiplication performance numbers corroborate the overlap of communication and computation through constant and high joint processor-network utilization.
  • Scheduling and memory allocation are performed by the software message handler, overcoming limitations of message driven architectures. Handlers only deliver the message into the computation and do not perform the computation itself, thereby keeping the handler simple.
  • Several hardware improvements - such as reusing message data during composition, composing messages in registers, hardware polling, and executing handlers and computation on separate processors - would help effectively leverage hardware support for active messages.

Unclear concept:

In the CM-5 message passing system, I was confused about whether there is provision for user-level network interface access, since points 1 and 5 in the reasons section seemed to contradict each other on this.

Learning:

I learned that polling using system calls can be used as an alternative to interrupts.


Summary: This paper proposes a new programming model for large-scale multiprocessors called Active Messages, which overlaps and balances communication and computation to achieve better utilization of hardware.

Problem: Traditional message passing models have the following weaknesses.
1. Computation and communication cannot overlap, or cannot overlap well.
2. Main effort goes into maximizing time spent on computation, causing poor utilization of the network hardware.
3. Messages are handled layer by layer, which results in very high communication overhead.

Contributions:
1. Overlap communication and computation by associating each message with a handler. When a message arrives, the hardware triggers an interrupt and the handler is called to process the message. In this way, the program keeps a dedicated thread for computation, and the handler is responsible for retrieving and copying the message so that the computation thread can use it.

2. Eliminated the receive buffer. Since the message handler copies the message anyway if needed, there is no need to put the received message into a buffer. Eliminating the receive buffer also greatly reduces communication overhead, e.g., the startup time of each communication, as the major work of message initialization is buffer management.

3. Proposed specialized hardware for Active Messages. The authors observed that current hardware cannot support the Active Message model at its best, and they designed hardware that can support Active Messages directly.

Things that confused me: It appears to me that the only difference between the Active Message protocol and traditional asynchronous I/O is that Active Messages get rid of the receive buffer. How can that be a big saving on modern hardware, especially when DMA can handle this without using CPU power?

Things I learned: I learned how to achieve computation-communication overlap by adding a handler field to the header of each message.
