
TreadMarks: Shared Memory Computing on Networks of Workstations

C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel, "TreadMarks: Shared Memory Computing on Networks of Workstations," IEEE Computer, Vol. 29, No. 2, pp. 18-28, February 1996.

Reviews due Thursday, 11/13.

Comments

Summary
The paper presents TreadMarks, a distributed shared memory system that gives the programmer a notion of global virtual memory while executing different processes in parallel on a network of workstations.

Problem
1. Partitioned Address Space - Message passing has been around for a long time, but developing code with message passing libraries can be hard since it is a different paradigm than shared memory programming, and improper usage can result in poor performance.

2. Poor Performance - Existing DSM implementations employ a sequential consistency model, which results in poor performance due to excessive communication among nodes.

Contributions
1. API - An API similar to pthreads is presented to developers, which includes synchronization primitives such as locks and barriers, allowing the developer to write code using a shared memory model (see the sketch after this list).

2. Portable - TreadMarks is built in user space, so it can be ported to any Unix-based system.

3. Release Consistency - Shared information is sent to a process only when that process acquires a lock, and this information is usually piggybacked on the lock grant message, eliminating a lot of unnecessary communication.

4. Multiple Writer Protocol - It allows multiple writers by calculating diffs and merging them among processes modifying the same page, thereby avoiding false sharing.
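To make the API item above concrete, here is a minimal sketch of what a program written against a TreadMarks-style interface might look like. It uses the Tmk_ calls named in the paper and in the reviews below (Tmk_malloc, Tmk_barrier, Tmk_lock_acquire, Tmk_proc_id); the header name and the remaining call names follow the same convention but are assumptions, and, like the paper's own examples, the sketch glosses over how the pointer returned by Tmk_malloc reaches the other processes.

    #include "Tmk.h"                  /* hypothetical header name */

    int *counter;                     /* points into the shared address space */

    int main(int argc, char **argv)
    {
        Tmk_startup(argc, argv);      /* start the worker processes */

        if (Tmk_proc_id == 0)
            counter = (int *) Tmk_malloc(sizeof *counter);   /* shared allocation */
        Tmk_barrier(0);               /* wait until the allocation has happened */

        Tmk_lock_acquire(0);          /* consistency information arrives with the grant */
        (*counter)++;                 /* critical section on shared data */
        Tmk_lock_release(0);

        Tmk_barrier(1);               /* all increments finish before anyone reads */
        Tmk_exit(0);
        return 0;
    }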

Comments
TreadMarks does a good job of presenting the abstraction of a global address space to the developer while taking care of the message passing among nodes.

The paper lacks many implementation details, such as how the system handles network partitions and node failures.

summary:
- they described their distributed shared memory system, TreadMarks.

problem:
- we can use parallel programming to increase performance (making programs faster and taking advantage of idle resources)
- providing a virtual shared memory (instead of sending messages between nodes, for example using MPI) can make parallelizing the program easier.

contribution:
- they provide functions to allocate shared memory, and functions for synchronization (barriers and locks)
- they use release consistency instead of sequential consistency:
- in sequential consistency, all the other processes are informed of every update (as if everyone were using the same memory).
- in release consistency, only the next process that acquires the lock is informed, and this way they can minimize the communication overhead.
- they use a multiple-writer protocol that allows multiple processes to write to different parts of the same page of memory.
- they update the pages using diffs (find the changed parts of the page and merge them)
- if I understand it correctly, they assume that the processes will not write to the same location.
- they showed how they used their system for solving mixed integer programming and genetic linkage analysis.

Summary:
This paper introduces the TreadMarks system, which provides a shared memory abstraction to support parallel computing on networks of workstations.

Problem:
DSM should provide an abstraction of a globally shared memory. DSM systems usually replicate data and must ensure consistency. Sequential consistency sacrifices a lot of performance.

Contributions:
(1) The TreadMarks system allows processes to assume a globally shared virtual memory. Barriers (wait until all processes arrive at the same barrier) and exclusive locks (only one process can hold the lock) are used as synchronization mechanisms.
(2) Sequential consistency can cause great performance degradation and false sharing, even a "ping-pong effect". The system uses lazy release consistency, which enforces consistency at the time of an acquire. TreadMarks uses an invalidate protocol: at the time of an acquire, the modified pages are invalidated, and a later access to such a page causes an access miss, which in turn causes an up-to-date copy of the page to be installed.
(3) Uses multiple-writer protocols. Every process has a writable copy of the page, and diffs are applied on an access fault.

Confused:
In the multiple-writer protocol, TreadMarks cannot ensure correctness when there are overlapping writes, and handling this within the protocol is difficult. Each processor would have to know exactly which addresses the others have written, or we would have to add a timestamp to every write, which would require an accurate global clock and is nearly impossible.

Learned:
The paper gives a good illustration of how a shared memory system works. There is a tradeoff between performance and consistency design. Lazy consistency can reduce the traffic. But I wonder whether there is any way to have "asynchronous synchronization": after one process releases the lock and before another acquires it, there may be time when the network is idle. The synchronization could be done at that time, so the process that later acquires the lock would not need to synchronize and could be faster. In this way the total traffic is the same as eager release consistency, but the program may be faster than lazy release consistency.

Summary:

The authors propose the TreadMarks system to provide an efficient shared memory programming interface for a network of workstations. Each process can access all memory as if it were accessing its local memory, even if the memory is physically on another machine (transparency).

Problem:

How do we make memory access in distributed systems with many machines easy and efficient? One simple way is to provide a shared-memory layer across all processors. It should be:
1. Transparent to programmers: programmers do not need to specify when/what/how to access a specific address in memory.
2. Efficient: communication over the network should not become too slow because of different access patterns.


Contributions:

1. Two synchronization primitives: barriers and locks.
A barrier is a checkpoint at which all processes must communicate and agree on the shared resources before moving to the next state.
Locks are used to acquire exclusive access to critical sections over shared resources.

2. Lazy Release consistency: Notification of updates to a shared resource is sent only to processes that acquire locks on it; other (non-conflicting) processes will not receive the messages. This approach reduces the communication among processes.

3. Multi writer protocol: allows multiple writers in parallel. For accesses and updates to a shared page, TreadMarks uses the VM hardware to detect the access, creates a local copy, and applies the update to the local copy. When the processes synchronize (at a barrier), they calculate the diff (the difference between the local updates and the shared page) and communicate to agree on the update to the shared page.

Confusing:

Fault tolerance: what would happen when a node fails? What about consistency and synchronization? How does the shared-memory model deal with a single point of failure?

Learned:

The assumption of correct parallel programming: no data races, and all changes to shared data happen under synchronization. Based on this assumption, the authors propose efficient lazy consistency.

Summary
The authors describe a network-based shared memory model, its implementation challenges, and its performance.

Problem
The classical, sequential instruction model is very difficult to share among multiple CPUs, since doing so naively requires a lot of communication, which makes network-based shared memory very inefficient. Additionally, TreadMarks tries to abstract away all the consistency details and give the programmer a view of a large global memory.
Contributions

  • Checkpoints: Using locks and barriers, parallel processes periodically reconcile their computing state before proceeding.
  • Multiple-writer protocol: It drastically decreases the communication among sub-processes running on different machines.
  • Lazy release consistency: Pages are updated only when a process actually tries to access a location in the page.

Confusing
I am not very clear on how this system adapts to different modes of failure, including network failures. Also, can arbitrary programs and code run with their library?

Learned
Distributed shared memory can achieve a significant performance boost given these modifications, even in the presence of limited network bandwidth. I think that process synchronization will pose a severe drawback in network shared memory.

SUMMARY: The authors describe TreadMarks, a distributed shared memory (DSM) library that provides a single address space across multiple machines.

PROBLEM: In typical parallel programming tasks, programmers use message passing to explicitly communicate specific data at specific times between specific processes, which can be complex and therefore more prone to algorithmic bugs. Shared memory systems let the programmer focus on the algorithm instead of the data and communication. However, existing DSM systems such as IVY are too aggressive in their consistency semantics and suffer performance-wise in some common situations such as false sharing. TreadMarks aims to provide a well-defined consistency model that provides better network performance for many of these common problems.

CONTRIBUTIONS: The authors provide a simple API which includes a shared memory malloc()/free() and two synchronization primitives: barriers and exclusive locks. When a program uses these calls, TreadMarks is able to work behind the scenes, entirely in user-space, to provide the program with the view of a single shared address space among all processes supported by TreadMarks.

Behind the scenes, TreadMarks contributes a relaxed consistency model called "lazy release" which has advantages over both "sequential" and "eager release" models in terms of minimizing the amount of bandwidth needed by only pulling pages when and if they are actually needed.

In addition, TreadMarks recognizes that just because two processes are writing to the same memory page, they may not be writing the same actual address (and they assert that doing so is a data race) and the system should avoid sending entire pages back-and-forth between two processes updating the same virtual memory page. They accomplish this by recording a "diff" of the local changes made to a page, and replaying all diffs for a given page when a barrier is reached. This allows local access to the page and significantly reduces the bandwidth needed both because false-sharing is avoided and because the RLE diff is typically much smaller than the entire page.
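As a rough sketch of the twin/diff idea described above (not TreadMarks' actual code, and ignoring its run-length encoding): keep an unmodified copy of the page at the first write, and at a synchronization point record only the words that changed.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_WORDS 1024                 /* e.g. a 4 KB page of 32-bit words */

    struct diff_entry { uint32_t offset, value; };

    /* Called on the first write fault to a shared page. */
    uint32_t *make_twin(const uint32_t *page) {
        uint32_t *twin = malloc(PAGE_WORDS * sizeof *twin);
        memcpy(twin, page, PAGE_WORDS * sizeof *twin);
        return twin;
    }

    /* Called at a release/barrier: encode only the words this process changed. */
    size_t make_diff(const uint32_t *page, const uint32_t *twin,
                     struct diff_entry *out) {
        size_t n = 0;
        for (uint32_t i = 0; i < PAGE_WORDS; i++)
            if (page[i] != twin[i])
                out[n++] = (struct diff_entry){ i, page[i] };
        return n;
    }

    /* Applied by another process to bring its copy of the page up to date. */
    void apply_diff(uint32_t *page, const struct diff_entry *d, size_t n) {
        for (size_t i = 0; i < n; i++)
            page[d[i].offset] = d[i].value;
    }

Because correctly synchronized processes write non-overlapping words, applying everyone's diffs in any order yields the same merged page.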

LEARNING: I was confused about how, in the example pseudo-code, the processes other than (Tmk_proc_id == 0) obtain the pointer to the memory allocated by Tmk_malloc(). In Figure 3, for example, it seems the pointer "grid" would be uninitialized for all processes where (Tmk_proc_id > 0). One thing I learned is that memory allocation systems can trap and respond to page faults entirely in user space, as I had previously assumed that virtual memory management was entirely in the domain of the kernel.

Summary

The paper discusses TreadMarks, a distributed shared memory system.

Problem

In an organization, a machine can have idle cycles, and idle cycles are a waste of resources. To use computational resources effectively, it is necessary to reduce the number and length of idle cycles. One way to make use of idle computation resources is to build a distributed shared memory system and run parallel tasks during idle cycles. Before this paper, other DSMs existed, but they had disadvantages such as communication overhead, false sharing, etc.


Contribution
1) Critical sections require synchronization. It is achieved through barriers and locks.

2) Uses a relaxed consistency model (lazy release consistency) to reduce communication overhead by making sure updates reach only the node requesting the lock.

3) A multiple-writer protocol is used to solve the problem of false sharing. It does so by allowing multiple processes to write to the same page at the same time and using synchronization to merge their non-overlapping changes.

Learned

A lazy implementation of a model might help reduce resource waste.

Confusion

The paper does not give implementation details for barriers. In addition, it is not clear whether the system takes any form of failure into account. How do locks and barriers behave during a failure?

Summary
The paper describes TreadMarks, which provides an API to access shared memory on a network of workstations relying on commodity hardware and software. The API enables writing parallel programs for this setup.

Problems to solve
Distributed shared memory must replicate data while providing the abstraction of a single shared memory, overcome data races and consistency issues, and reduce the number of messages and data exchanges over the network. False sharing is another problem to overcome: when two objects in the same page are being written, a ping-pong effect occurs as each writer invalidates the other's page.

Contributions
Describes the TreadMarks API, which includes locks and barriers and also a memory allocator for shared memory.
Use of lazy release consistency to propagate updates. Basically, the updates are propagated when a process tries to acquire a lock. All modified pages are invalidated, and a later access to such a page replaces it with the up-to-date page.
A multiple-writer protocol is used, which allows multiple processes to have writeable copies. This is done using diffs, which are smaller than pages, thus reducing network traffic. This protocol also takes care of false sharing.

Found confusing
How would it handle node failures? How are Locks and Barriers implemented for a distributed setting?

Learnings
Management of a distributed memory system with use of locks and barriers.


Summary:
- The paper presents TreadMarks, a system that facilitates sharing memory between multiple processors on a network. Shared memory supports larger data structures, which enables faster parallel computing.

Problems:
- Handling read faults and write faults.
- Consistency/validation issues when writing.
- Handling diffs and merges with multiple writers and race conditions.
- The ping-pong effect with page sharing, where processors repeatedly invalidate each other's copies of a page on unrelated writes.

Contributions:
- Distributed shared memory gives processes a globally shared virtual memory. This allows processes to operate on larger data structures without worrying about communication. It also makes the shared memory more scalable, because all processors can communicate through the network instead of needing sockets between every pair of processors.
- Since updates are only accessed within the critical section protected by a lock, processors only have to send invalidations to the next processor acquiring the lock. This makes the system further scalable because processors send only one invalidation per update.
- Other contributions such as lazy consistency, using diffs, and piggybacking modifications on lock release are taken from related works. These implementation details will help optimize a scalable distributed shared memory system.

Confusing:
- I found a problem shown in [Figure 8 of original paper, Figure 9 of alternative paper] to be confusing. What kind of workload decreases in performance when adding more memory (from 4 to 5 processors)? What kind of workload can have a 1,100% speedup when adding 20% more memory (from 5 to 6 processors)?

Learned:
- Shared memory can be scaled and treated as a distributed system.

Summary:
The paper talks about the design and implementation of TreadMarks, a distributed shared memory system over a network of workstations. TreadMarks is implemented as a user-level library on top of Unix, has high performance, and is portable. It is based on a lazy release consistency model, and its application to mixed integer programming and genetic linkage analysis is demonstrated.

Problem:
Networked workstations offer parallel processing at relatively low cost. The paper discusses the idea of building a shared memory processing system distributed over the network, providing an abstraction of globally shared virtual memory even though processes execute on nodes that do not physically share memory.

Existing DSMs like Ivy follow a sequential consistency model, which requires extensive and expensive communication between nodes and suffers from the problems of false sharing and the 'ping-pong' effect.

Contribution:
1. A lazy release consistency model, which assumes data-race-free programs; updates to a synchronized object are sent only to the next process that acquires the lock, which helps reduce message propagation.
2. A multi-writer protocol to reduce the effect of false sharing. A diff is maintained between the modified copy and its twin. As diffs are smaller than a page, this reduces bandwidth requirements.
3. Synchronization is done through barriers and locks.
4. Provides a user-level library; the API is configurable and portable.

Unclear concept:
Could not understand the concept of barriers completely.

Learning:
- How to implement a distributed shared memory system.
- Parallel programming and various challenges encountered such as data race conditions, choosing a consistency model, false sharing etc.

This paper talks about a system called TreadMarks, a distributed shared memory system for a network of workstations. The DSM presents a globally shared virtual memory and supports parallel processing. The paper deals mainly with the consistency model and the memory structure of the DSM.

Contributions :
1. TreadMarks prevents data races in the accesses to shared memory by providing synchronization primitives like barriers and exclusive locks. Barriers are global and the calling process stops until other processes arrive at the barrier. Locks are used to protect critical sections.
2. They illustrate the use of these synchronization primitives by implementing the Jacobi iteration method and the travelling salesman problem on the distributed shared memory system.
3. The system uses a release consistency model instead of the sequential consistency model used in Ivy, which incurs a lot of message traffic. The sequential consistency model handles read and write faults in the following manner:
Read faults: a copy of the page is fetched and replicated locally. Write faults: the page is invalidated on all the other processors, and the writer's copy becomes the authoritative copy.
4. The lazy release consistency model reduces traffic by sending a message about the change of values only to the process that acquires the lock next.
5. It uses an invalidate protocol over an update protocol wherein the modified page is invalidated and later fetched on a miss instead of the acquire message sending out the updated values.
6. Reduces false sharing by a multiple writer protocol. Multiple writer protocol allows processes to write concurrently to non-overlapping regions of a page. The changes to the pages are updated by applying diffs.
7. The usage and benefits of using DSM has been illustrated by applying it to problems of mixed integer programming and genetic linkage analysis.

Learning :

I learnt the concept of lazy release consistency and how message traffic in the DSM system could be reduced by using this model.

Confusion :

I did not quite understand how exactly the synchronization between processes in the multiple-writer protocol is done using barriers. There is no mention in the paper of how failures of processes are handled.

Summary:

TreadMarks supports parallel computing on a network of workstations, providing an abstraction of shared memory on the commodity hardware available at that time. TreadMarks is a distributed shared memory system that abstracts a globally shared memory so that the programmer can access whatever memory they need without knowing which processor holds that data. The proposed system eases the programmer's job by exposing an interface in which the programmer need not do the partitioning and communication between processors, which was the only approach available at the time. The authors also propose a relaxed consistency model, the release consistency model, and solve the false sharing problem using a multiple-writer protocol.

Problem:

A 'naive' distributed shared memory system uses a sequential consistency model, where the memory order is the same as the instruction order in the program. To maintain such ordering, the system needs a lot of synchronization, which involves a lot of communication. There is also the problem of false sharing, since the granularity of sharing is the page size: two writers working on different offsets within the same page repeatedly get the page invalidated for each other, which the paper calls "ping-pong".

Contributions:

1. The authors propose TreadMarks, a library for distributed shared memory that runs and communicates over the network.
2. They have implemented library API calls for barriers, shared malloc, free, and locks.
3. They use a relaxed consistency model with two variants, eager and lazy: in the eager (push) model, data is pushed to other nodes when the lock is released; in the lazy (pull) model, the node acquiring the lock gets the new data from the node that most recently released the lock.
4. The false sharing problem is solved by keeping unmodified twins alongside the copies on the writer nodes. At the barrier, a diff is calculated and sent to all the nodes with copies so they can update to a recent value.

Learned:

The concept is very similar to Unified Memory Access supported by Nvidia CUDA 6, which manages memory between CPU and GPU. The mechanism of multiple writers was interesting and could be used in UMA.

Confused:

When a processor faults, it looks up the page in the global virtual memory, but the authors do not say where that memory is actually stored.

Summary:
This paper proposes a shared memory computing system, TreadMarks, as well as its programming API. It then introduces the release consistency memory model and the multiple-writer protocol adopted by TreadMarks to solve the problems that a previous DSM system, IVY, had. In the end, operation overheads and applications are shown for TreadMarks.

Problem:
The major goal of a distributed shared memory system is to provide a unified view of memory to developers, so that developers can write programs in the regular way and have no need to worry about how to pass messages when parallelizing a big program. In some previous DSM systems like IVY, performance is restricted by:
1. The memory model. IVY's implementation of a sequential consistency model causes a large amount of communication.
2. The false sharing problem. Two unrelated data items in the same page cause a ping-pong effect when two processors want to update them at the same time.
3. The writer protocols. The writer protocols of most DSM systems are single-writer protocols, which reduce parallelism and cause a large amount of communication.

Contributions:
1. Analysis of the DSM system IVY. The authors analyze the first DSM system in depth and reveal several deficiencies in its memory consistency model, its implementation, and its write protocols. Based on these analyses, TreadMarks is developed.
2. The release consistency model. Two kinds of synchronization mechanisms are used in TreadMarks: barriers and exclusive locks.
3. Multiple-writer protocols and a diff implementation to deal with conflicts. Multiple writers are allowed to update the same pages as long as there is no overlapping portion. The diff mechanism reduces the effect of false sharing and the communication bandwidth requirement.
4. Scientific analysis applications that run on top of TreadMarks. Two large applications, MIP and ILINK, are shown with performance analyses to demonstrate TreadMarks.

Confusing:
1. How does this system do fault-tolerance? For example, if one machine or process goes down, how can others detect it when waiting on barriers?

Things I learned:
1. The lazy consistency model is the most interesting thing to me. I think this kind of model can be applied to many distributed systems, like distributed file system and NoSQL database.

Summary:

In the paper, the authors talk about TreadMarks, a distributed shared memory computing system used on networks of workstations. TreadMarks provides the application with an efficient shared memory abstraction that is easy to use and portable, and it improves performance by using relaxed memory models instead of sequential consistency. The paper discusses the techniques used and the experience with two large applications - mixed integer programming and genetic linkage analysis.

Problem:

With the advent of high-speed networks and rapidly improving microprocessor performance, networks of workstations were becoming an appealing vehicle for parallel computing; hence the class of applications that can be executed efficiently on a network of workstations was also expected to increase. Therefore there was a need for a distributed shared memory system for parallel computing on networks of workstations. Existing systems did not encapsulate memory and relied on the user (programmer) for effective memory distribution, and they also used sequential consistency, which was performance intensive. To address these issues the authors implemented TreadMarks, which was comparatively easy to use - it encapsulated the shared memory behind simple and powerful APIs, portable - no kernel changes were required, and provided good performance since it used relaxed memory models.


Contribution:

  • TreadMarks provides a layer of abstraction of a globally shared virtual memory in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value. A simple but powerful API is provided for this.
  • The use of a relaxed memory model (lazy release consistency) to alleviate the performance issues associated with sequential consistency.
  • The use of multiple writer protocol to alleviate the problems resulting from mismatch between the page size and the application's granularity of sharing.
  • Implementing applications (Mixed-Integer Programming & Genetic linkage analysis) using TreadMarks and reporting the performance benchmarks.

What I learnt:

It was interesting to see how TreadMarks achieves synchronization using its primitives - barriers and exclusive locks.

What I found confusing:

I am not sure how fault-tolerant this system is, or was it even considered in this context.

Summary: TreadMarks provides shared memory to a process that spans multiple nodes in a distributed system to facilitate parallel computing on commodity hardware. It uses locks to synchronize the use of shared memory.

Problem: most parallel computing jobs are performed on dedicated hardware, and there are inadequate solutions for commodity hardware. Programmers would have to deal with the problem of synchronizing processes and deciding how to share memory among them by passing messages. This would make it difficult to port existing code to this system, let alone write new applications for it.

Contributions:


  • shared memory is mostly transparent to the programmer

  • critical sections require some synchronization code

  • have more relaxed consistency to address the needs of the parallel processes

Confusing: for the multiple writer protocol, I understand the problem of two processes writing to discrete parts of memory that happen to be on the same page: they are not necessarily overwriting the same locations in memory, but the memory just happens to be on the same page. TreadMarks uses some kind of diff to copy the changes across, but how does it know that a diff is necessary and that it shouldn't copy the whole page?
Learned: how simple it can be to swap out the underlying memory management, transparently to the code. All they had to do was link it against their own memory management library that handled all the page faults in userspace.
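The user-space page-fault handling mentioned above can be illustrated with a small, self-contained sketch (Linux-style C, not TreadMarks code): a page is mapped with no access rights, the first write faults, and a SIGSEGV handler, standing in for the DSM library, enables access so the faulting instruction can be retried. A real DSM would fetch an up-to-date copy of the page, or the diffs for it, at that point.

    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long page_size;

    /* Stand-in for the DSM library's fault handler: a real system would fetch
       the latest page contents (or apply diffs) before enabling access. */
    static void on_fault(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t addr = (uintptr_t)info->si_addr & ~(uintptr_t)(page_size - 1);
        mprotect((void *)addr, page_size, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        page_size = sysconf(_SC_PAGESIZE);
        char *page = mmap(NULL, page_size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page[0] = 42;          /* faults, the handler enables access, the write retries */
        printf("page[0] = %d\n", page[0]);
        return 0;
    }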

Summary:
This paper discusses distributed shared memory systems and the authors’ improvements on them with their own system of TreadMarks. TreadMarks offers a portable user-level global memory system with an easy to use API for distributed parallel processing.

Problems:
The authors attempt to optimize some common problems with previous distributed shared memory systems, such as Ivy, while offering a simple API for synchronization.
• They try to avoid sequential consistency (“every read sees the last value written”), which can incur expensive read and write virtual page faults where replicas need to be replaced constantly
• False sharing, where writes to the same memory page, although not overlapping, cause faults and expensive overhead
• Minimize overall network traffic by saving roundtrips when possible

Contributions:
Some of the author’s novel contributions include:
• Relaxed consistency, or release consistency, whose underlying principle utilizes synchronization as the mechanism for managing memory access
• Memory diffs, where snapshots of virtual memory pages are taken before they are written and compared afterwards to produce a difference. This difference can be shared with replicas to update the global distributed memory. With correctly implemented synchronization, overlaps shouldn't occur.
• Instead of eager consistency, where global memory is updated when a process releases it, lazy release consistency updates a specific piece of memory only when a process acquires the lock for it from the last process that held the lock. This significantly reduces network overhead, from contacting all nodes after a memory update to contacting just one node (see the rough count after this list).
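As a rough, illustrative count of that saving (the numbers are not from the paper): with N processes caching a page, an eager release protocol sends on the order of N-1 consistency messages at every lock release, whereas lazy release consistency piggybacks the consistency information on the single lock-grant message to the next acquirer, so each critical section costs roughly one message exchange instead of N-1.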

Unclear:
What I find unclear is how this system, or other systems like it, fits in with the CAP theorem. Is availability being traded for consistency due to the locking mechanism? Is this system even partition tolerant?

Learned:
I learned about the usefulness of distributed memory systems and how a simple mutual exclusion API can be used to simplify massive parallel distributed processing. I am interested in how this works today in modern systems and how it compares to other distributed processing systems like map-reduce.


Summary:

In this paper, the author presents a system called TreadMarks,
which implements a shared memory on a network of (commodity)
workstations. The goal is to implement a programming interface
such that each process is coded as if it were executing on
a single machine, but the memory is actually distributed across
multiple machines (which is transparent to each processor).

Problem:

The key technical problem is how to build a distributed
shared memory system efficiently. Consider a baseline
system where each access to remote memory causes
communication over the network; it is clear that this
baseline could suffer from low efficiency. In this paper,
the author describes the assumptions they made about the
programming model and the techniques they exploit.

Contributions:

I think there are a couple of technical contributions that
are interesting.

First, TreadMarks pushes the correctness of using
synchronization to the end user, assuming that they
only synchronize with TreadMarks' primitives. This assumption
makes it possible to detect the start point and end
point of possible conflicts, and therefore allows
the "lazy release" that they use in the paper. It
is easy to imagine cases of inconsistency if these
assumptions are false.

Second, TreadMarks explores classic (I am actually not
sure whether it is that classic, dating back to 1996)
observations like false sharing and a lazy cache
coherence strategy.

What I Found Confusing:

I think the technical part is pretty clear to me. One
thing that is interesting to me is what will happen
if one node fails. Does it mean that the whole
memory server will be in a state of inconsistency
without any chance of recovering? In some sense, this
is kind of related to things like main-memory DBs,
which provide persistence guarantees but only
allow a limited set of algebraic operations. What
is the middle ground here?

What I Learned:

The idea of making assumptions about the user's coding
model and taking advantage of them to make consistency
maintenance more efficient is interesting to me.

TreadMarks: Distributed Shared Memory

Summary:
The authors have developed a system named TreadMarks, which provides the abstraction of distributed shared memory, using the local memory attached to individual computers which are connected as a network of workstations. This provides an easy application programming interface for the programmer to use the shared memory without worrying about the details like which node the memory is allocated in, what to do if multiple processes write to the same page at the same time, etc.

Problem:
In systems where the memory of one node can be accessed from any other node, deciding which node to place the memory in, how to retrieve the memory, and how to communicate with the other processors that own a particular piece of memory falls under the responsibility of the programmer. So, when programs grow bigger and bigger and complex data structures are being used, the programmer's task of controlling and maintaining this pool of shared memory becomes too complicated. To solve this problem, TreadMarks provides the user with a single consistent view of the global memory, which is made up of the local memory of multiple machines. TreadMarks provides simple APIs that programmers can use to run parallel programs on the distributed shared memory.

Contributions:
1. The idea of introducing barriers to make sure that all the processes running simultaneously wait for the other processes to reach the same point before proceeding to the next phase (see the sketch after this list).
2. Making release consistency behave in a lazy manner removes unnecessary communication between processes and requires communication only between the process that last held the lock and the next process that is going to grab it.
3. The multiple-writer protocol removes the extra communication for invalidating the copies at different processes and makes sure that multiple processes can write to the same pages at the same time; their writes are reconciled using barriers.
4. Illustrating the usefulness of distributed shared memory by using TreadMarks for practical problems like genetic linkage analysis was also a valuable contribution.
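A hedged sketch of the barrier pattern in point 1, loosely in the style of the paper's Jacobi example (the array names, the 1-D averaging, the even work split, and the Tmk_nprocs name following the paper's convention are all assumptions of the sketch; the shared allocation of grid by process 0 is omitted):

    #include "Tmk.h"                          /* hypothetical header name */

    #define N 1024                            /* assumes N is divisible by Tmk_nprocs */

    float *grid;                              /* shared array, Tmk_malloc'ed by process 0 */
    float  scratch[N];                        /* private per-process buffer */

    void relax(int iters)
    {
        int chunk = N / Tmk_nprocs;
        int lo = Tmk_proc_id * chunk, hi = lo + chunk;

        for (int it = 0; it < iters; it++) {
            for (int i = lo; i < hi; i++)     /* read shared data, write private buffer */
                scratch[i] = (grid[i > 0 ? i - 1 : i] +
                              grid[i < N - 1 ? i + 1 : i]) / 2.0f;

            Tmk_barrier(0);                   /* everyone has finished reading */

            for (int i = lo; i < hi; i++)     /* publish this process's slice */
                grid[i] = scratch[i];

            Tmk_barrier(1);                   /* everyone has finished writing */
        }
    }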

Thing I learnt from the paper:
I learnt that just as multiple NUMA nodes can be made to share memory within a single computer, we can also share main memory across machines, which seemed very interesting to me. I also learnt that primitives such as barriers (which are similar to a thread join) are very useful in distributed shared memory settings.

Thing I found confusing:
I didn't completely understand where the global virtual memory resides and what's the purpose of it. I'm not even sure if such a thing exists in TreadMarks.

Comment:
I felt that this paper was written very well and all the concepts explained well using good diagrams.

Summary:
The authors present TreadMarks, a user-level library providing distributed shared memory. They discuss the need for DSM, the problems with current DSM implementations, the design of TreadMarks, and its use in several applications.

Problem:
Sharing memory between processes on the same or different machines can allow for parallel execution of some types of programs. Good mechanisms exist for shared memory multi-processing, but those for distributed execution on networked workstations are less developed. Without a DSM abstraction, application writers must explicitly handle messages amongst the workstations to keep memory consistent. The Ivy implementation of DSM itself suffers from an excess of messages, particularly when one machine writing to a page causes invalidations to be sent to all other machines.

Contributions:
- User-space library for TreadMarks DSM which can be compiled into code with unmodified compilers, and run on an unmodified kernel.
- A set of concurrency primitives (locks and barriers) for use when controlling access to the shared memory address space provided by TreadMarks.
- Lazy release consistency, wherein the authors note that the pessimistic invalidations of the Ivy protocol are unnecessary when concurrency primitives are used to prevent data races. Thus, workstations only receive memory page updates when attempting to acquire a lock to the data. This reduces the number of messages required for correct consistency substantially.

Learned:
There is a trade-off between enforcing consistency at the release of an updated shared resource vs. at the acquisition of said resource. In the former case, more messages must be sent (because we do not know which workstation will acquire the lock next). In the latter, there may be a bit of delay in transmitting the updates, where otherwise it could have finished in the background of regular operation.

Confused:
It seems that read locks should be required to ensure that latest updates are seen on each read. I can imagine a correct concurrent program that doesn’t need to acquire a lock to read, unless there will be multiple coordinated reads or a read-update behavior. It wasn’t clear to me if locking for reads is expected by the authors or not.

Summary:

The paper presents the implementation and performance measurements of TreadMarks, a distributed system which abstracts the individual physical memories to appear as a shared global virtual memory. It achieves parallel computation without requiring any explicit intercommunication by the programmer.

Problem:

  • Given the rapidly improving hardware and efficient high-speed networks, their potential will not be effectively leveraged by sequential programming. Rather than using a single high-end system, existing hardware infrastructure can be utilized by means of parallel programming.
  • Distributed computation using shared memory leads to consistency issues, which can be solved by sequential consistency.
  • Sequential consistency is quite simple and easy to implement, but often leads to high communication costs and false sharing.

TreadMarks solves these problems by using a lazy release consistency method along with multiple-writer protocols.

Contributions:

  • An easily configurable API is provided to programmers, aiding in the creation and synchronization of shared memory.
  • Synchronization for critical section access is achieved using barriers and locks.
  • By using lazy release consistency, write updates are transmitted only to the next node that requests a lock, thereby reducing the communication overhead.
  • Multiple writer protocol aids in handling false sharing by allowing individual processes to make write changes on local copies and create local diffs; once all processes reach the barrier for diff creation, the diffs are exchanged among the processes.
  • Implementation is highly portable as the library is built on top of Unix.

Unclear concept:

In lazy release consistency, I do not understand how failure or recovery of the node which has written the data and released the lock can be handled. I also did not understand where exactly the global memory map is stored.

Learning:

I learned about consistency issues encountered during parallel programming and the different type of consistency solution methods. The concept of false sharing and its countering techniques was also new to me.

Summary

This paper discusses the design and techniques used to implement a distributed shared memory system called TreadMarks.

Problem

Shared memory can help transform sequential applications into parallel applications. TreadMarks is a system that can provide distributed shared memory to applications running on top of it. Applications are unaware of the actual location of the memory, unlike in message passing systems. There were a lot of problems with then-existing DSMs: a lot of communication overhead for invalidations, and unnecessary invalidations because of false sharing (unrelated objects sharing the same memory page).

Contributions

1. Unlike message passing systems, TreadMarks makes all communication transparent to the applications. Applications don't need to manage messy networking code to access remote objects; TreadMarks hides all those details.

2. Lazy Release consistency - The authors keenly identify that synchronization can help prevent or optimize sending invalidation messages to the copies. For example, if the program uses locks (a synchronization primitive) to enter a critical section, it is not necessary for the current writer to send invalidations immediately after it releases the lock. Instead, the next process trying to acquire the lock is notified only when it actually tries to acquire it.

3. Multi writer protocol - Most DSMs have a single-writer protocol, wherein the writer has to get an exclusive copy of the page that it is updating. For this, it has to invalidate all other copies, and this also leads to false sharing. To prevent these problems, TreadMarks allows multiple writers to proceed in parallel. TM uses the local VM hardware to detect accesses to a shared page and creates an unmodified twin copy. A diff is calculated between the twin and the modified copy. When the processes synchronize, they are informed about the parallel updates, and they obtain the diffs and apply them.

4. The authors ported two large applications, mixed integer programming and genetic linkage analysis, to TreadMarks and showed possible performance improvements.

What I didn’t understand

I didn't understand what happens when the current writer fails before releasing the lock. This might be a general synchronization problem but it looks like it will require some changes in the lazy release technique in TM.

What I learned

I learned how synchronization in the code can be taken advantage of to delay update propagation and still provide almost sequential consistency semantics.

Summary:
This paper presents a distributed, virtual, shared memory built on top of the individual memories of the workstations within a given network. The system, TreadMarks, exports a simple C API which allows programmers to build highly parallel programs which need to access shared resources in a synchronized manner.

Problem:
The goal is to present a single shared memory over the network to connected workstations so that they can execute parallel processes and access the shared resources of the parallel program. The issue with shared memories is how to handle synchronized access to shared resources, which is governed by the consistency model of the distributed shared memory.

Contributions:
- The TreadMarks consistency model, called lazy release consistency, which notifies only the next process attempting to acquire a lock on a shared resource of the changes made to that resource (the lock acquire implies only this process will use it next).
- The multiple-writer protocol, which creates twins of pages being written to and diffs of the modified page against the twin, so that one process only needs to pass its diff of the page to other processes that are writing to the same page concurrently. This also has the benefit of avoiding the problem of false sharing.
- The TreadMarks C API to easily build parallel programs using the barrier and lock synchronization primitives offered by the TreadMarks system.
- The lightweight network communication protocol to avoid message overhead between processes in the network. This was essential as it became a significant source of overhead in some workloads.

Learned:
I learned about the lazy relaxed consistency model which is quite clever because it is simple to understand and makes sense intuitively. The use of diffs to allow multiple writers was also straightforward and a new concept to me.

Confused:
How do other processes (or threads) of a parallel program learn about the memory locations of the shared data structures that are made by process 0 during initialization? I assume these details are handled under the hood by TreadMarks, but I didn't find any mention of it in the paper.

Summary: Paper presents TreadMarks, an implementation of distributed shared memory that relies on a lazy release consistency model with an easy interface to port programs over. Authors tout that distributed shared memory is a much better abstraction to write parallel programs rather than a message passing paradigm.

Problem: As parallel computing shifts toward sets of commodity hardware hooked up via a network instead of specialized parallel supercomputers, how we handle the memory of an application running on such a system becomes an interesting problem. It depends on where we want to shift the burden. Do we want to make applications more complicated by having explicit message passing logic, or do we want to make them easy to program by having a distributed shared memory system handle this for us? If we choose to use DSM, we have to choose a consistency model that does not flood the network with traffic, as workstation networks at this time had limited capacity. We also need to take into account who can write to pieces of data.

Contribution: The main contribution of this paper was the creation of TreadMarks and some of the novel techniques it used when it came to consistency and multiple-writers. They show that sequential programs are easy to parallelize using the API they created.

Their approach to consistency was to adopt a lazy release consistency model. That is, data is only synchronized when a process acquires a lock for some particular piece of data. This makes sense. Shared data should have lock synchronization to prevent race conditions. Therefore, the time you acquire a lock to a shared resource is the time when you need to have the updated value.

Finally, they implement a multiple writer’s protocol, which basically creates a twin of the page in order to compute the differences that were written to it. This solves the false sharing problem because multiple processes can write to the same page and the diff will make both changes show up. This, of course, assumes that proper synchronization is happening between data objects.

Confusion: Does their lazy release consistency model require that readers need a lock in order to read updated data? It seems that one could imagine correct situations where one only needs a lock on writes and reads can happen at any time. Was this a conscious tradeoff that the authors made?

Learned: They had an interesting take on how to design systems. That is, they approach it from the standpoint of assuming correct programs. With their release consistency model and their multiple writers' protocol, the system relies on the application programmer writing a correct program with no data races. If there is a data race, inconsistent things might happen in this system, but since you shouldn't have data races, they can use that assumption to solve the false sharing problem and reduce network traffic on writes.

Summary :
The authors discuss the implementation of a distributed shared memory system (DSM) on a network of workstations, which globally share virtual memory even though they are not on the same physical machine. This model uses the lazy release consistency paradigm.

Problem :
When systems replicate data across multiple nodes, there is a need to maintain consistency with respect to reads and writes. Sequential consistency, where there is a total ordering on all the memory accesses, and that total order is compatible with the program order of memory access in each individual process, is generally viewed as the “natural” consistency model. The authors state that the implementation of this model can cause a lot of messages to be communicated between nodes and this might hugely impact performance. The second problem is that of false sharing, which could be a serious problem when virtual memory pages are large. The authors give a solution for both the problems with a lazy consistency model.

Contributions :
1. The shared memory view is transparent to the user in that he doesn’t have to make decisions as to which machine to talk to, or what data to share and so on.
2. Synchronization is achieved using barriers and locks.
3. Consistency is handled in a lazy fashion, where the process modifying a page gives the update only to the process that next acquires a lock for it. This way, the process need not broadcast updates every time, and this drastically reduces the number of messages over the network.
4. False sharing is handled by using barriers: the processes make their own copies of the page being modified and, in the end, when all of them have synchronized at the barrier, they exchange their diffs and update.

What I learned :
I learned what problems could arise when multiple nodes are present in a shared memory framework and how this can be handled by relaxing the consistency model and using barriers.

What I found confusing :
I found that the authors do not handle failures that well. Writes are possible only on one node and what happens when this node is cut off from the rest or goes down? How is this handled?
I also don’t understand how they implement page maps and so on.

Summary:
In this paper the authors present TreadMarks, a portable, global distributed shared memory (DSM) abstraction implemented over a network of workstations, each having its own physical memory. The system leverages the advances in network and processor capabilities to provide a powerful platform for the execution of parallel programs. TreadMarks uses the lazy release consistency model to provide sufficient synchronization for the programmer to be able to prevent data race conditions in the code. It also uses a multiple-writer protocol with diffs to reduce the effect of false sharing and to reduce bandwidth consumption.

Problem:
The goal of TreadMarks is to provide a distributed shared memory library over networks of commodity workstations. DSM inherently involves communication between nodes, and minimizing that communication while maximizing parallel execution is key to higher performance. There are many problems with the consistency models of previous DSMs, which TreadMarks tries to solve:
- Unnecessary communication cost: Given that sending a message is expensive (it involves the OS kernel), it is worth eliminating unnecessary invalidation notifications.
- False sharing: Since the pages are large, it is very likely that multiple processors are sharing the same page. The write actions from different processors will cause the “ping-pong effect”.

Contributions:
-TreadMarks provides two synchronization primitives: barriers and locks. Barriers act as a common wait point for all the processors, and locks can be used to implement the critical sections of the program.
-The idea of using a release consistency approach, instead of sequential consistency, to reduce traffic between the nodes.
-Using a multiple-writer protocol with diffs to allow multiple processes to write to a page at the same time, which reduces the effect of false sharing and also reduces bandwidth consumption.
-Their idea to implement TreadMarks as a user-level library, which provides relatively more portability.
-The API provides the same programming environment as a shared-memory multiprocessor, so programs written for the DSM are easily portable to a shared-memory multiprocessor.
-Using UDP/IP with their own message-arrival guarantees instead of TCP/IP, to remove the unnecessary cost of connection setup required by TCP (see the sketch after this list).
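A minimal sketch of the kind of message-arrival guarantee the last point refers to, written as a generic retransmit-until-answered request over UDP (this is not TreadMarks' actual protocol; the function and its parameters are made up for illustration):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <arpa/inet.h>

    /* Send `req` to `dst` and wait for a reply, retransmitting on timeout.
       Returns the number of reply bytes, or -1 after max_tries attempts. */
    ssize_t request_reply(int sock, const struct sockaddr_in *dst,
                          const void *req, size_t req_len,
                          void *reply, size_t reply_cap, int max_tries)
    {
        struct timeval tv = { 1, 0 };                 /* 1-second retry timeout */
        setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

        for (int attempt = 0; attempt < max_tries; attempt++) {
            sendto(sock, req, req_len, 0,
                   (const struct sockaddr *)dst, sizeof *dst);
            ssize_t n = recvfrom(sock, reply, reply_cap, 0, NULL, NULL);
            if (n >= 0)
                return n;                             /* reply arrived */
            /* timed out: fall through and retransmit the request */
        }
        return -1;
    }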

Confusing:
-How does the page fetch happen when a page fault occurs? Does the system incurring the fault broadcast its request to all nodes?
-In any distributed system over commodity workstations, it is an inherent assumption that node failures will happen. Can the TreadMarks system handle node failures?

Learned:
-I learned about the distributed memory model and various previous models for providing a distributed programming environment.
-Learned that the lazy consistency model is an important idea for reducing message traffic.

Summary:
This paper talks about distributed parallel computing through a shared memory model. The model comprises a set of machines connected via a network that share a global notion of virtual memory, even though the machines do not physically share memory with one another. The authors have built an API called TreadMarks that implements the shared memory model. TreadMarks provides synchronization primitives and employs a relaxed consistency model. The authors also evaluate TreadMarks by plugging it into real-world applications like mixed integer programming.


Problem:
Having multiple commodity machines perform a task is much cheaper than buying a supercomputer to execute the same task. Also, parallelism reduces the time taken to solve a complex problem. Distributed shared memory can be an efficient way of parallel computing if we have good network and processor performance. The shared memory model should also be easy for the programmer to use.


Contributions:
1. Ease of use, since it reduces the programmer's burden of partitioning the data sets and communicating the values.

2. Allowing programs written for a distributed shared memory system to be used on a shared memory multiprocessor. Also, no kernel (Unix) changes were made while implementing TreadMarks, hence making the system portable.

3. Identifies the false sharing problem in traditional distributed shared memory implementations, which occurs when unrelated information is stored on the same page and two processors try to update data on that page concurrently. This is rectified by the use of the multiple-writer protocol.

4. Aids in avoiding race conditions by providing synchronization primitives like Tmk_barrier and Tmk_lock_acquire.

5. The number of messages communicated between the hosts during a write is reduced by notifying only when it is necessary. This forms the basis of the lazy release consistency model implemented in this system.

6. By making an update visible to a host only when it is required, the system reduces the cost incurred by sending a lot of messages across the network.

One thing I learnt:
I learnt the relaxed memory consistency model through this paper.

One thing I found confusing:
I did not quite understand who holds the page table for the global virtual space.

Summary: This paper introduced TreadMarks, a programming model and implementation for shared memory distributed applications. Two key features that made TreadMarks different from other programming models were shared memory and release consistency model.

Problem: The shared memory programming model had been used previously. However, most earlier systems used a sequential consistency model, which had performance issues in at least two respects: too much communication traffic, and unnecessary invalidations due to false sharing.

Contribution:
1. Used a shared memory model for distributed applications. In the shared memory model, processes share the whole address space. Possible alternatives are message passing protocols and shared structures. In the message passing protocol, processes don't share anything and pass messages to communicate instead. In the shared structures approach, only certain data structures are shared. The main advantage of the shared memory model is that it is much easier to write programs in.

2. Release consistency model. The old-fashioned sequential consistency model specifies that all writes and reads must have a total order, and that every read should return the value from the latest write. This is very inefficient to implement. The release consistency model only requires that an updated value propagate to other processes no later than the release of any locks. This is based on the observation that any potential data race should be addressed by the programmer using synchronization primitives such as locks. Therefore there is no point propagating the updated value when the lock has not yet been released: a correct program shouldn't read the value without acquiring the lock first.

3. Lazy implementation of the release consistency model. This is based on another observation that only after having acquired the lock should a correct program read from shared values. Therefore we only need to propagate the value when a lock is to be acquired, and only to the process that actually gets the lock.

Things that confused me: In the Jacobi program listed in the paper, the global variable grid was only initialized in process 0 but was used in all processes. How and when was that variable communicated to the other processes?

Things I learned: I learned the intuition behind the lazy release consistency model. This was a very reasonable observation.

Summary
This paper is an advertisement for distributed shared memory systems as a vehicle for parallel computing. The authors introduce TreadMarks, a distributed shared memory subsystem, describe its memory consistency and concurrency models, and demonstrate its use through two applications. By implementing TreadMarks with lazy release consistency and multiple-writer semantics, as a user-level library built on standard Unix APIs, they develop a system that is both performant and portable.

Problem
Existing approaches either had the programmer handle data distribution explicitly (through message passing) or relied on conservative, sequential-consistency-based distributed shared memory implementations that have high performance overheads. Another problem with existing systems was false sharing, where unrelated writes that map onto the same underlying page cause the page to 'ping-pong' between the writers.

Contributions

  • The observation that, if the programmer eliminates data races using synchronization, updates only need to be made visible at synchronization points. Consistency can therefore be enforced on synchronization operations, and lazy release consistency is chosen to lower communication overheads further.

  • To overcome the problem of false sharing, TreadMarks handles multiple writers to a page by keeping an unmodified 'twin' of the page; at synchronization time the modified page is diffed against its twin, and the diffs are exchanged to reconstruct a consistent page. This allows multiple concurrent writers to modify different parts of a page and be reconciled later. Also important is the observation that the DSM system need not do anything special for writes to overlapping regions of a page, as that amounts to a data race (see the sketch below).
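
    A rough sketch of the twin/diff idea in C is given below; the page size, the (offset, value) diff encoding, and the function names are my own simplifications for illustration, not TreadMarks' actual data structures.

    /* Illustrative twin/diff mechanics (simplified; not TreadMarks' code). */
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* On the first write to a write-shared page, save an unmodified copy. */
    unsigned char *make_twin(const unsigned char *page)
    {
        unsigned char *twin = malloc(PAGE_SIZE);
        memcpy(twin, page, PAGE_SIZE);
        return twin;
    }

    /* At synchronization time, record only the bytes that changed. */
    size_t make_diff(const unsigned char *page, const unsigned char *twin,
                     size_t offsets[], unsigned char values[])
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            if (page[i] != twin[i]) {      /* byte differs from the clean twin */
                offsets[n] = i;
                values[n]  = page[i];
                n++;
            }
        }
        return n;                          /* diff = n (offset, value) pairs */
    }

    /* A remote copy of the page is brought up to date by applying the diff;
       diffs from writers of disjoint regions can be applied in any order. */
    void apply_diff(unsigned char *page, const size_t offsets[],
                    const unsigned char values[], size_t n)
    {
        for (size_t i = 0; i < n; i++)
            page[offsets[i]] = values[i];
    }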

    What’s not clear
    A lot of the internal implementation details. What do the memory maps look like? How are the distributed locks and barriers implemented?

    Learning
    The two important observations that the authors made, described in the contributions section.

    Summary:

    The paper describes the techniques used to implement the TreadMarks distributed shared memory system, which allows parallel computing on a network of workstations. It also describes how TreadMarks was used to speed up two important applications.

    Problem:

    With existing message passing mechanisms, the programmer must specify which processor to communicate with, what data to send, and when the communication must take place. This is difficult when you have complex data structures with sophisticated parallelism semantics. The programmer is burdened with the responsibility of partitioning and managing data, and is distracted from the core problem. Another issue is that existing shared memory models are expensive in terms of performance and in the number of messages they may send across the network.

    Contributions:

    - Provides an abstraction of a global memory system that frees the programmer from managing the data.
    - The lazy release consistency model propagates updates only to the processor that needs them, which reduces the number of messages passed around.
    - The multiple writer mechanism of exchanging diffs only at a synchronization point reduces both the size of the message and the number of messages that are exchanged compared to conventional models.
    - Provides the same API as hardware shared memory systems, so programs port easily to a hardware shared-memory multiprocessor.
    - It's an exemplary paper for clarity in writing.

    Confused on:

    - The algorithm described for solving the travelling salesman problem.

    Learned:

    I learned how parallel programming is done and how memory can be shared across processors.


    Summary:
    The paper discusses the design and implementation of TreadMarks (TM), a platform that aims to provide shared memory abstractions for parallel programming on a network of physically separate computers. The main problem, keeping shared memory consistent across processors, is tackled by first considering sequential consistency and then adopting release consistency (a weaker model) because of the inherent overhead of sequential consistency.

    Problem:
    • The authors wish to achieve high computational power by using parallel programming over a network of computers that may have idle cycles otherwise.
    • The difficulty is with the programming model used to do so: the message passing model adds complexity and requires significant modification of the program code.

    Contributions:
    • Programs written with TM have two categories of memory: shared and local. Shared memory provides a unified view across processors; Tmk_malloc allocates shared memory.
    • TreadMarks provides two synchronization primitives: barriers and locks. Barriers pause all processors at a common point, and locks are used to implement critical sections of the program.
    • The sequential consistency model incurs excessive communication overhead because of frequent page exchanges, so TM uses the release consistency memory model: memory is synchronized only when explicit primitives (barriers, locks) are used; without them the program may observe accesses out of order.
    • TM uses lazy release consistency for updates made within locks: only the next process to acquire the lock gets the update, so there is lower communication overhead.
    • To overcome false sharing, TM uses the multiple-writer protocol: a process keeps an unmodified replica (twin) of a page it writes to and makes all its writes without synchronizing with other processes; when a synchronization primitive is reached, the processes exchange diffs of their copies and merge the updates (see the sketch after this list).
    • TreadMarks code is compatible with both a shared-memory multiprocessor and a distributed (non-shared) environment, so code is portable; the main consideration is that latency becomes much more significant in the non-shared scenario.
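
    The sketch below shows the access pattern the multiple-writer protocol tolerates: two processes write disjoint halves of an array whose elements share pages, and their diffs are merged at the barrier. The Tmk_* names follow these reviews; the array setup and the Tmk_proc_id identifier are assumptions on my part.

    /* Hypothetical sketch, not code from the paper. */
    #include "Tmk.h"          /* assumed header name for the Tmk_* declarations */

    #define N 1024

    extern int *a;            /* assumed: N ints allocated with Tmk_malloc, so
                                 many elements lie on the same page */

    void fill_my_half(void)
    {
        /* Each of two processes writes a disjoint range of the same page(s);
           Tmk_proc_id is assumed to be this process's id (0 or 1 here). */
        int lo = (Tmk_proc_id == 0) ? 0     : N / 2;
        int hi = (Tmk_proc_id == 0) ? N / 2 : N;

        for (int i = lo; i < hi; i++)
            a[i] = i * i;     /* concurrent writers, but never to the same word */

        /* With a single-writer protocol the page would ping-pong between the
           two hosts; here each writer's diff is simply merged at the barrier. */
        Tmk_barrier(0);
    }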

    Confusion:
    • Why was UDP chosen with a reliability wrapper around it? Why not use TCP instead?
    • What would happen in case a node fails in the middle of a program execution? How does it affect the program and is the task reassigned to a different node?

    Learning:
    I got to know about early attempts to create a convenient programming model for programming over a distributed environment. The ideas appear very similar to design concepts in GPUs. The tradeoffs in the consistency models are especially interesting.

    Summary:
    At Rice University, Amza et al. built a Distributed Shared Memory (DSM) system named TreadMarks, which runs as a user-level process. All nodes have limited private memory but access a shared global memory address space, and an interface is provided to the application developer to allow massively parallel computations that can potentially outperform a supercomputer.

    Problem:
    Assume some institution has many workstations, and some portion of these machines typically has unused compute cycles available. The TreadMarks DSM can exploit these otherwise wasted resources and allow an organization to run massively parallel computations on existing commodity machines. Running these kinds of computations might be a daunting task for a developer, so TreadMarks provides a DSM API that abstracts away much of the complexity and provides primitives for handling synchronization.

    In IVY, the first DSM, Sequential Consistency (a memory model) precisely defines “the last value written” for a distributed system; the memory appears to all processes as if they were on a single multi-programmed processor (i.e. there is total order on all memory accesses). This causes too much communication to occur (e.g. all processes are notified of invalidations). False sharing (unrelated objects on the same page & the “ping-pong effect”) is also a problem with IVY.

    Contributions:
    1-Lazy Release Consistency Algorithm: This algorithm enforces consistency at the time a lock is acquired, and only the acquiring process is notified of the latest invalidations (greatly reducing bandwidth costs). A subsequent access miss causes the missing page or diffs to be fetched on demand.
    2-Multiple Writer Protocols: with IVY, only one writer at a time can have access to a given page. In contrast, TreadMarks allows multiple writers to modify different parts of a page at the same time. When a write is attempted on a protected page, a twin (an unmodified copy) is created and the write is allowed to proceed on the page. At synchronization time, a diff between the page and its twin is generated and the twin is discarded. When another process later faults on the page, the diff(s) are applied to its copy. This alleviates false sharing problems and reduces bandwidth.
    3-TreadMarks Software runs at the user level; no kernel mods or special hardware is needed.
    -API Interface: create, destroy, synchronize, and allocate shared memory; Tmk_malloc() and Tmk_free(). Primitives are provided to control access to critical regions where race conditions are possible: Tmk_barrier() creates a global barrier, Tmk_lock_acquire() and Tmk_lock_release() respectively obtain and release locks.
    -mprotect: the mprotect system call controls access to shared pages; an attempted access to a protected page generates a SIGSEGV signal. The handler for this signal checks local data structures to determine whether the access was a read or a write, and if the local copy is invalid, the minimal set of machines is contacted to retrieve the diffs (a sketch of this mechanism follows).
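
    A rough sketch of that mprotect/SIGSEGV mechanism is shown below; the handler body is a drastic simplification (no remote communication, no read/write distinction) meant only to show how a user-level library can intercept accesses to shared pages.

    /* Simplified illustration of user-level access detection (not TreadMarks' code). */
    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void) sig;
        (void) ctx;

        /* Round the faulting address down to its page boundary. */
        void *page = (void *)((uintptr_t)info->si_addr & ~((uintptr_t)PAGE_SIZE - 1));

        /* A real DSM would consult its page state here: fetch missing diffs
           from other nodes if the copy is invalid, and create a twin before
           allowing a write. This sketch just unprotects the page. */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
        /* Returning from the handler retries the faulting instruction. */
    }

    static void install_dsm_handler(void)
    {
        struct sigaction sa;
        sa.sa_sigaction = dsm_fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
    }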

    What I Found Confusing:
    What happens when both P1 and P2 access the exact same part of the page?

    What I Learned:
    The lazy release consistency algorithm could probably be applied in other contexts, and the premise really makes sense. We only notify the process that acquires the lock about changes in the critical region (no one else can do anything with that region yet anyway). There are probably other situations in systems where we spend a lot of processing cycles and bandwidth on needless or wasteful updates.

    Summary:
    This paper presented the TreadMarks system, which implements distributed shared memory over a network of workstations. It provides a shared memory abstraction to the application developer. The novel features of TreadMarks are release consistency and the multiple-writer protocol.

    Contribution:
    1. The main contribution of the paper is the release consistency model. The author clearly defines the sequential consistency model and explains why it performs poorly for distributed shared memory. In the release consistency model, instead of invalidating all copies of a page, the writer invalidates only the copy at the process that will use the data next, which is known from the locks guarding it. Moreover, the invalidation happens when the lock is acquired, which is the lazy approach. Using release consistency reduces the communication between processes.
    2. Multiple writers can write to a page simultaneously if they write to different parts of it; it is the programmer's responsibility to make sure the writes do not overlap. If the developer wants only one writer, they can put locks around that code. The updates in different versions of a page are exchanged by passing just the diffs, not the whole page, which reduces network bandwidth usage.
    3. The paper also explains the implementation of two applications on TreadMarks and shows the speedups obtained.

    Confused: I'm confused about how the release consistency model works with barriers.

    Learned: I learned about distributed shared memory and the memory models implemented on top of it. The paper explains well the tradeoffs between consistency and performance in those consistency models.

    Summary:
    The paper presents TreadMarks - a distributed shared memory system for parallel programming on networks of workstations. It details some of the implementation techniques such as lazy release consistency and provides useful properties such as multiple-writer support.

    Problem:
    As networks have grown faster, using clusters of commodity machines for computing has become more and more appealing. Clusters provide advantages such as reduced cost, ease of scaling, and better fault tolerance. They also enable distributed shared memory (DSM), which allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. Techniques that enable efficient parallel programming in a DSM setup are what this paper is looking at.

    Solution + Contributions:
    - They use a relaxed memory model called release consistency, which, compared to the sequential consistency scheme used by IVY, results in much less communication.
    - They give due importance to synchronization and use primitives such as barriers and exclusive locks to achieve it.
    - They implement a multiple-writer protocol which allows multiple processes to have a writable copy of a page at the same time.
    - For multiple-writer, they have a way of reconciling the divergent copies by merging the diffs between the different copies which results in much less message traffic.
    - They show the implementation of applications like mixed integer programming and genetic linkage analysis as parallel programs on TreadMarks and show the speedups achieved as a result of parallelizing those sequential applications using DSM.
    - They implement TreadMarks as a user-level library on top of Unix; it does not require any special privileges or kernel modifications, thus facilitating easy adoption. It is very portable.
    - Programs written for DSM can also easily be ported to a shared memory multiprocessor, since both use the same programming environment. (The other way around is not trivial, due to the increased latency expected in a DSM.)

    What was not clear:
    Why is it limited to a homogeneous set of nodes?

    My Key takeaway:
    Lazy release consistency is an interesting idea. It is quite effective in reducing the message traffic.

    Summary:
    This paper describes some of the ideas behind TreadMarks, a distributed shared memory system, as well as some simple programs built using it and some benchmarks.

    Problem:
    As commodity hardware performance has increased - namely, network bandwidth and processor power - there is a need to utilize idle periods. In addition, when a programmer builds a system to run on these idle machines, she must explicitly describe messages to pass between them, making distributed programming difficult.

    Contributions:
    Shared memory seems like it would not be effective due to limitations on network bandwidth and latency. Here they show that a distributed shared memory system is indeed feasible. In fact, they built it at the user level, making it portable to almost any workstation. This is an important aspect of the system.

    They also describe several consistency models, some of which aim to reduce the amount of message passing. Indeed, they show that large applications can be used in a shared memory system, and that it can effectively utilize the unused machines. Finally, they show that building programs using their simple primitives is an easy task, as compared to building a complicated distributed system. This definitely makes the case for distributed shared memory systems.

    Didn't Understand:
    The release consistency model section isn't entirely clear. I'm just not sure if it's the same as a locking mechanism or not.

    Learned:
    It was interesting to read about the different memory structures, and how this relates to the consistency model. I had never thought about the relationship between the two.

    Flaws:
    Not necessarily in the system itself, but I would have liked to see comparisons against systems that don't assume shared memory. I would like to see what the overhead is of using the shared memory model.

    Summary:
    This paper introduces a distributed shared memory system implementation - TreadMarks, to support parallel computing on networks of workstations. TreadMarks is implemented in user space, and it provides shared memory as a linear array of bytes for applications. The memory model of TreadMarks is a relaxed memory model - release consistency.

    Problem:
    A cluster of workstations is more cost effective than a shared-memory multiprocessor architecture, but it is harder to program, because data partitioning and consistency need to be handled by software instead of hardware. One approach is message passing, in which programmers handle communication explicitly; this can be very complicated for complex programs. The alternative is to provide programmers with an abstraction layer that handles the communication between workstations for them. The desired features are ease of use, high portability, low communication and computation overhead, and near-linear scalability.

    Contribution:
    TreadMarks runs entirely at user level and supports multiple languages through simple APIs. The paper introduces the system's design strategy, explains the mechanisms used to solve several DSM implementation issues, and compares their performance against alternative solutions. These are the major contributions of the paper.
    Consistency: TreadMarks uses the release consistency model, which only requires memory to be consistent at specific synchronization points and therefore has lower communication overhead than sequential consistency. More specifically, TreadMarks uses lazy release consistency: it delays propagation until an acquire is issued. Compared to eager release, lazy release further reduces communication overhead.
    False sharing: because pages are large, different data items can lie on the same page. If two workstations then write to different data items on that page, the page is moved back and forth between their main memories. TreadMarks uses the multiple-writer protocol to solve this problem: whenever a process is granted access to write-shared data, the page containing that data is marked copy-on-write, and the first attempt to modify it creates an unmodified copy of the page, the twin. At release time, TreadMarks compares the page with its twin to generate a diff. Other processors request the diff the first time they access the page (a simplified sketch of this flow follows).
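
    A simplified sketch of that copy-on-write flow, under my own assumptions about the data structures and with the actual diff generation and write-notice bookkeeping elided:

    /* Illustrative copy-on-write flow (simplified; not TreadMarks' code):
       the page starts write-protected; the first write faults, a twin is
       made, and the page is unprotected for further writes. */
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    struct page_state {
        unsigned char *page;   /* the shared page itself */
        unsigned char *twin;   /* unmodified copy, NULL until the first write */
    };

    /* Called from the fault handler on the first write to the page. */
    void on_first_write(struct page_state *ps)
    {
        ps->twin = malloc(PAGE_SIZE);
        memcpy(ps->twin, ps->page, PAGE_SIZE);                 /* make the twin */
        mprotect(ps->page, PAGE_SIZE, PROT_READ | PROT_WRITE); /* let writes proceed */
    }

    /* Called at release time: the page is compared against its twin to build
       a diff (as sketched earlier in these comments), then the twin is freed. */
    void on_release(struct page_state *ps)
    {
        /* ... generate the diff from ps->page and ps->twin, record a write notice ... */
        free(ps->twin);
        ps->twin = NULL;
        mprotect(ps->page, PAGE_SIZE, PROT_READ);              /* write-protect again */
    }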

    Learned:
    Implementing the solution in user space is good for portability.
    Providing an abstraction layer that makes programming parallel applications easier is possible, and important.
    Giving the upper layer the weakest acceptable guarantee (e.g. in synchronization) maximizes the opportunity to optimize the performance of the lower-layer software.

    Discussion:
    I feel TreadMarks' design doesn't consider failure handling at all, and workstations can't be added on the fly after the application starts. Moreover, I think there is further potential in providing an even higher level of abstraction for writing parallel programs, without concern for portability, e.g. by designing new programming language semantics for DSM.
