Position Statements

Bradley C. Kuszmaul and Charles E. Leiserson, MIT
Workshop on Transactional Systems, April 8, 2005, at ORD

In a seminal paper a dozen years ago, Herlihy and Moss proposed transactional memory as a way to ease the writing of concurrent programs. The idea of transactional memory is to allow a program to read and modify multiple, disparate memory locations as a single atomic operation, much as occurs within a database transaction, but without necessarily the durability requirement of database transactions. With transactional memory, they argued, programmers could avoid such concurrency anomalies such as priority inversion, convoying, and deadlock. Herlihy and Moss proposed an extension to hardware cache-consistency mechanisms that could provide hardware transactional memory (HTM) efficiently, but their proposal only handles bounded transactions.

The goal of our research project is to develop a multithreaded computing environment in which transactional overheads are sufficiently low that user code always executes within some transaction, rather than treating transactions as expensive and infrequent. In addition, the computing environment should be able to support transactions of any size, rather than just the fixed size proposed by Herlihy and Moss.

The "transactions-everywhere" research project at MIT, sponsored by NSF, takes the point of view that code is always executing in some atomic section. Rather than transactions being expensive and seldom used, they are inexpensive and prevalent. Our research effort integrates hardware, software, and theory to address the following questions:

* How should transactional memory be exposed to the programmer in a high-level programming language?
* What are the semantics of transactional memory?
* How should forward progress be guaranteed, e.g., by backoff or aging?
* What is the impact of hardware transactional mechanisms on the processor pipeline?
* What should happen when interrupts, I/O, and context switching occur during a transaction?
* What transactional state should be visible to software?
* How should hardware transactional mechanisms interact with virtual memory?
* How "unbounded" do transactions need to be?
* What tools are needed to support the debugging of transactional code?

Transactional memory holds the promise of simplifying or eliminating many of the problems with coordination and synchronization that programmers now face when dealing with concurrency. Transactions everywhere will make it far easier for all programmers, not just those who are specialists in today's arcane practices of parallel computing, to write correct high-performance multithreaded programs, thereby allowing concurrency to be employed in routine applications for ordinary users. Our research project aims to confront and overcome the problems in developing transactions everywhere so that transactional memory may eventually become, like cache, an expected subsystem of any computer architecture.

Christos Kozyrakis

Christos Kozyrakis & Kunle Olukotun
Computer Systems Laboratory, Stanford University
http://tcc.stanford.edu

Continuous transactional execution provides an opportunity to create an intuitive programming model for the upcoming single-chip multiprocessors. Nevertheless, there is significant skepticism as to the applicability of transactions on general-purpose applications beyond the narrow scope of non-blocking synchronization. While we recognize the significant work is needed before transaction-based programming becomes practical, we provide the following arguments to rebut the skeptics and motive the necessary research.

The exponential growth in uniprocessors performance is over. For the foreseeable future, sequential execution means performance stagnation. Transactional semantics have been successfully used to manage concurrency in databases and persistent storage. It’s about time we try them as the basis for parallel programming. Transactional execution facilitates non-blocking sharing, coarse-grain consistency, and failure atomicity with a single abstraction. Our experience with transactional programming so far shows that correct code is almost automatic and feedback-based performance tuning is easy.

So what if it doesn’t? BD vendors can afford to hire tens of expert developers to debug/tune with existing thread models. The challenge is to find a parallel model that makes it easy for average ISV programmer to develop parallel code for all other apps, which are currently sequential! If a transactional model succeeds there, it will eventually find its way in DBs as well. By the way, transactions for parallel programming should not be confused with DB transactions.

Yes, we’ve tried it and it works well. Speculative transaction execution allows for performance similar or better than conventional models. In some ugly cases we revert to sequential execution to maintain current Java semantics. Now it’s time to work on a cleaner and simple model based on transactions as a first-class language primitive...

True. Fine-grain sharing requires frequent commits and reduces the performance and failure atomicity benefits of transactions. But true fine-grain sharing is unlikely to work fast with any parallel architecture, even on a single chip, as latency is only getting worse after all. Most apps with significant parallelism can be exploited with coarser grains. Let’s focus on them first.

Hardware resources are ever increasing. Of course, there will always be a need for a backup mechanism for the rare, pathological cases. There are already several proposals that address this issue. Our preference is for SW-heavy solutions as they place minimum limitations on the development of the programming model.

Transactional programming should support closed nesting. The hardware can easily the semantics but will only make the common case fast (no surprises here). Library calls can be executed correctly using nesting.

This is perhaps the most challenging issue. Some IO devices can be made transactional (e.g. log-based file system). We may have to do it just for failure atomicity reasons (check last 2-3 years of OSDI). For the rest, we’ll have to redefine the API to execute/commit IO operations at transactions boundaries (using call-back functions or messages to an IO processing system). IO devices used in performance critical apps (disks, net, display) work well with that model! Finally, we can always revert to sequential execution for difficult, legacy code cases... Overall, OS & IO is not a reason to drop
transactions. It is an opportunity to revisit what IO, scheduling, and other OS services mean in a highly parallel system. By the way, older systems had put a lot of thought into IO processing and concurrency (e.g. the IBM 801). One more chance to "re-invent" ideas from old IBM machines...

In the words of Albert Einstein: "The significant problems we face cannot be solved at the same level of thinking we were at when we created them." Parallel computing is a significant problem and transactional execution is different & promising. Let’s get busy.

Cormac Flanagan

Atomicity and isolation are two key properties of traditional database transactions that may also be extremely useful in a general-purpose programming language.

Traditionally, multithreaded application software achieved isolation (or serializability) between concurrent operations using low-level locking operations, but this approach is prone to race conditions and other synchronization errors that result in unintended interference between threads.

Over the last few years, we have developed static and dynamic analyses that verify serializability of critical sections by analyzing the synchronization structure of the application code. The dynamic analysis verifies atomicity by observing the lock operations and memory accesses performed by the application at runtime. The static analysis uses an appropriate extension of Java's type system that, by reasoning about which locks protect which fields, can guarantee that every well-typed program is free of serializability violations. Thus, in a well-typed program, each critical section is amenable to sequential reasoning, which significantly facilitates subsequent validation activities.

Application software must also correctly handle unexpected errors, such as I/O errors, null pointer dereferences, and assertion failures. (Ideally, some of these errors would never occur, but in reality defects persist even in well-tested systems.) In some cases, correctly handling such errors requires rolling the application state back to an earlier consistent state.

Currently, serializability and error recovery are implemented in the application code. There is a clear opportunity to reduce the burden on the application programmer by providing additional transactional support from the programming language, the operating system, or the underlying hardware.

Hardware-supported transactions may be quite practical, particularly for short, machine-local transactions. However, applications often need to execute long-running and possibly distributed transactions. One key challenge is how to achieve compatibality and interoperability between hardware-supported machine-local transactions on the one hand and long-running, distributed transactions on the other. For example, one possible approach is to notify the application when a machine-local transaction is rolled-back, so that it can send appropriate messages to other applications collaberating in a distributed transaction.

Dan Grossman

Position Statement for WORKSHOP ON TRANSACTIONAL SYSTEMS, 8 April 2005
Dan Grossman

(1) The software-engineering advantages that atomicity holds over locking are compelling reasons to make an atomic construct the concurrency primitive of choice for high-level general-purpose programming languages.

(2) We can realize suitable implementations of atomic on today's hardware using a purely software approach to logging-and-rollback. We already have a prototype system for a uniprocessor, a special-case worth considering separately.

From the perspective of high-level programming languages such as Java, atomicity is easy to explain: Whereas "synchronized e { s }" acquires (resp. releases) the lock e before (resp. after) execution of s, "atomic { s }" guarantees s executes as though no other thread has computation interleaved with s. For building reliable and modular software, atomicity has the following advantages (and probably more):

Although hardware support for atomic may help in the future, we can provide atomic to programmers today, especially for "low performance" concurrent applications such as desktop applications where concurrency primarily masks I/Olatency and provides GUI responsiveness. Rather than employ sophisticated data structures for software transactional memory, my research group is taking a logging-and-rollback approach in which code executing atomically logs the changes it makes so that they can be undone if necessary.

We have implemented and used a dialect of OCaml (a mostly functional language) that provides atomic. Like many high-level language implementations, OCaml supports concurrency but not true parallelism (i.e., it runs on a uniprocessor). This setting allows particularly efficient logging implementations because we never need to log reads to memory. Our prototype has allowed us to learn key programming idioms, such as how condition variables interact with atomic.

Looking forward, we intend to adapt our logging-and-rollback approach to the multiprocessor setting. We have designed an implementation that uses locks to implement atomic. The locks are used in a way that provably avoids deadlock and the programmer never sees the locks. By dynamically changing which objects areprotected by which locks, we can use locality to adapt a program's behavior automatically to reduce interprocessor communication.

David Gay

UC Berkeley and Intel Research Berkeley have started a project to design a replacement for C, tentatively called Gamma. Gamma will be a safe language suitable for the implementation of systems software. Early target applications will include web servers and Linux kernel modules.

Requirements for Gamma include:
-- An evolutionary path for existing systems written in C. We will write a C-Gamma translator; we already have implemented a replacement and translator for the C preprocessor.
-- Concurrency support; this is discussed in more depth below.
-- Memory safety, data layout control, API usage checking.

These requirements are supported by designing Gamma as an extensible, refactorable language. This:
-- allow for easier language experimentation during the design phase
-- supports a staged translation of C code to safe gamma code:
        o an initial translation can produce unsafe (but cleaned up) Gamma code
        o subsequent refactorings will transform the code to, e.g.,:
                -- use memory safety extensions
                -- recognise and translate concurrency primitives
        o the end result of this process, which may include some manual intervention, is safe Gamma code
-- supports specific-system needs

Gamma will support thread-based concurrency, and some form of atomicity-based concurrency control. We have had prior positive experience in implementing and using atomic statements in nesC [1], a C dialect for microcontroller-based sensor-network nodes.

nesC's implementation of atomic is simplistic, but appropriate for simple microcontrollers: it simply disables interrupts. This is clearly not a solution for Gamma. Instead, we will consider several implementations:
-- analyse the data objects accessed by a program's atomic statements and
o add appropriate locking, unlocking in the generated code
o or, control thread scheduling (i.e., don't schedule threads with concurrent accesses simultaneously)
-- Harris/Fraser-style [2]
-- use hardware transactional memory

Finally, as mentioned above, we want to refactor existing lock-based programs to use atomic statements instead. This will also require a reasonably precise analysis of the data objects accessed by the program.

Can we analyse a program's data accesses well enough to infer useful locking strategies? Similarly, can our refactoring to introduce atomic statements place these at a useful granularity? We hope that Gamma's memory safety combined with extra information from the programmer in the form of more detailed type information and declaration of explicitly non-shared data can help here.

How well can hardware transactional memory deal with very long transactions? With operating system calls (both blocking and non-blocking) within transactions? We think that it might be necessary to implement different atomic statements using different techniques depending on their body. How will the various techniques we're considering interact?

[1] The nesC Language: A Holistic Approach to Networked Embedded Systems, D. Gay, P. Levis, R. Behren, M. Welsh, E. Brewer, and David Culler. PLDI'03.
[2] Language support for lightweight transactions. T. Harris and K. Fraser, OOPSLA'03.

David Wood and Mark Hill

THESIS. Transactional memory promises to ease the burden of multi-threaded programming, provided it both (a) supplies programmers the appropriate level of abstraction and (b) enables efficient implementations across many platforms. Think: virtual memory.

THREAD-LEVEL TRANSACTIONS. Transactions should be abstracted from physical limitations, similar to how virtual memory abstracts away the physical memory hierarchy. Specifically, programmers should view a transaction as available to a thread operating in a shared virtual address space. Transactional correctness should not be affected by events such as context switches, SMT interleaving, cache capacity and conflict overflows, or even paging.

SYSTEM SUPPORT. Transactions should be provided to threads via a judicious combination of high-level software, low-level software, and hardware, analogous to virtual memory implementations (paging policy, TLB/page-table code, and TLBs/page-table-walking hardware, respectively). All-hardware solutions are too complex; all-software ones are too slow; the appropriate mix may vary. This implies that full transaction support is available to user-level threads, and possibly high-level operating system threads, but not the kernel.

COMMON-CASE FAST. Hardware should accelerate common transactions, much as a TLB hit allows an all-hardware virtual memory access. Hardware should optimize for small transactions by striving to keep both "old" and "new" state in fast hardware buffers (caches). Commit and abort should discard “old” and “new” state, respectively, with little or no delay or communication. Ideally, four SMT threads can perform repeated transactions on an in-cache work queue with no memory communication.

GRACEFUL DEGRADATION. When thread-level transactions exceed hardware boundaries, performance should degrade incrementally, much as a few TLB misses or page faults are acceptable. In particular, performance should be tolerant to cache conflict misses that are hard to foresee in high-level software, composed of independently-written modules.

STABLE INTERFACE TO ALTERNATIVE IMPLEMENTATIONS. The transactional memory interface must provide stable semantics and performance across a variety of implementations. For software vendors, stability helps protect their code development investment. For hardware vendors, implementation flexibility allows them to target different application domains and price/performance points (especially low-development cost initial implementations). The transactional interface should not dictate hardware that is too conservative, too optimistic, too simple, too complex, too cheap, too expensive, or not invented yet.

BEYOND SIMPLE MEMORY. Thread-level transactions must go beyond simple memory operations to support input, output, exceptions, and other user-defined extensions. While idempotent I/O may be performed immediately, non-idempotent I/O must be made to appear transactional, perhaps via buffering. An exception handler may need to examine a partially-executed transaction to deal with an exception, want to abort a transaction, or both. Finally, users may wish to extend transactions with properties not addressed by this interface (e.g., durability). Transactional memory systems should expose this functionality through low-level, but (largely) hardware-independent APIs, much as virtual memory systems expose memory protection mechanisms.

Kevin Moore, User-Level Transactional Memory, Presentation at University of Wisconsin Computer Architecture Affiliates Meeting, Oct. 2004, http://www.cs.wisc.edu/multifacet/papers/affiliates04_tltm.pdf

Eliot Moss

It is becoming clear from recent work on transactional memory, related hardware designs such as speculative lock elision, and programming language design efforts, that we may be close to realizing transactional memory semantics in the marketplace. There is growing recognition that this will bring great benefits, and also that there is a potential train-wreck brewing with increased use of concurrent threads by ordinary programmers.

There remain some important and significant problems we need to address, notably:

1) Dealing with the limited capacity of hardware: For a model to be broadly useful it must not place arbitrary, and certainly not very small, bounds on the size of transactions, in terms of number of distinct memory items, length of time, etc. There is probably widespread agreement on this point.

2) Making best use of a transaction facility demands support for a range of semantics, richer than simple flat transactions of read-write sets of memory words. This may be more controversial.

While there is undoubtedly a tradeoff between richness of semantics and complexity of the underlying hardware support, I advocate pursuing the richest models and applying our best efforts to determining how much can be achieved with reasonable hardware additions to modern processors.

In particular, by adding explicit transaction ids (analogous to address space ids on some processors) to (data) cache lines, and a small hardware table of active transactions (including transaction relationships, such as nesting and groups), I believe we can achieve quite rich semantics. This approach need not affect bus coherence protocols, begin visible only to those CPUs/threads that share a cache. Whether or not this particular strategy has merit, we should not give up quickly on supporting such features as closed and open nesting, undo/redo lists, checkpointing, lock coupling, predicated actions, ordered groups of transactions, and connection to durability support via the operating system.

There are also issues of how best to present these semantics to programmers in the combination of the programming language(s), the run-time system, and the operating system. Further, these parts of the system can and should cooperate to offer a seamless whole for programming, including support for performance measurement and tuning.

Jim Larus

It’s the Software, Stupid

Position Statement for Workshop on Transactional Systems
James Larus
larus@microsoft.com
Microsoft Research

Transactional memory is a promising idea that may make concurrent programming easier and less error prone. The imminent arrival of chip multiprocessors (CMPs) has created considerable anxiety in the commercial software world about the difficulty of effectively and correctly programming parallel machines, and consequently opened a window of opportunity for new programming models, such as transactions.

An increasing number of papers have started exploring software implementations and hardware support for transactions. Much of this work, however, has lost sight of larger software issues that will ultimately determine whether this programming model is successful or not. Below, I briefly describe four problems with existing work that are serious shortcomings.

Nested transactions. Existing hardware transactional memory systems “flatten” nested transactions. In these systems, if several transactions nest and the innermost one, which is executing, aborts, all the transactions are terminated. It is easy to argue that this behavior presents only a performance, not a semantic, problem since the outer transactions can be restarted. However, in real software, this behavior is unacceptable, since an abstraction—e.g., a library—must be able to hide implementation details from its clients. If a transaction in a library aborts and causes the client’s transactions to abort, the behavior of the two systems are too tightly coupled.
Garbage collection. Complex systems provide abstractions at many levels, and a transaction may be appropriate only at a specific level, even if lower levels are implemented in software. For example, consider garbage collection in a safe language (e.g. C# or Java) that supports transactions. The garbage collector is implemented as code running in the same process as a transaction, but its effects on memory are transparent to the application and presumably outside of the transaction. However, the garbage collector and transaction system are not entirely separate either, since the GC may relocate transacted objects, thereby requiring updates to hidden state.
Scope. Least I be accused of picking only on hardware, software transactional memory systems have a serious flaw that greatly reduces their value. Most (all?) systems are transactional only with respect to code in another transaction. For reasons of overhead, ordinary reads and writes are not tracked, so that a programming error about the placement of atomic blocks can subvert the protection offered by the transaction system.
IO. Real applications perform IO as well as computation, where IO is broadly defined as any communication with non-transacted state, such as the user, a disk, the network, or even another process. In general, it is not possible to roll back IO when a transaction aborts. On the other hand, excluding IO operations from transactions greatly reduce their value as a parallel programming construct. This strikes me as a fundamental weakness in the programming model.

These problems, with the possible exception of the last one, probably have reasonable engineering solutions. However, it is easy to lose sight of these complications when looking at the simple, tractable pieces of code that drive early research and provide compelling illustrations for papers. Ignoring the complexity of real software, however, reduces the likelihood of finding a programming model that will be widely used.

Jim Johnson

Hardware support for transactions has been focused on acceleration for transacted memory. This is an admirable goal. This should be viewed as one end of a spectrum of transaction topologies, ranging from simple local memory operations out to full distributed traditional transactions. Furthermore, there should be a single, common infrastructure that an application would use to declare and use transactions, regardless of where they fell in this range.

This is in line with the direction that we've been taking in Microsoft's Common Language Runtime (CLR) with System.Transactions. We have a single set of classes that provide transactions from purely local, with an existing performance of over 1 MTPS, to those that are fully distributed.

I'd want hardware support that would work well across these possible topologies. In other words, for hardware acceleration to be generally useful it must be careful not to overly constrain the semantics, such that only extremely short and lightweight transactions may make use of it.

Some ways in which this might impact hardware support are that the transaction may involve memory as well as database or queued messaging access as recoverable resources. An application should be able to bracket some memory operations, as well as operations through one or more databases or reliable queues. The fact that other resources were involved should not disable the hardware support.

If the hardware support is focused on memory, there will be a need for updates to be abandoned as well as committed. Furthermore, depending on the fault potentials, the memory may need the ability to offer traditional phase 1 guarantees.

In this common model, there may be opportunities in partially implementing transaction support in hardware. For instance, when declaring a transaction an application specifies how concerned it is about the currency of the data, and it specifies the point when consistency is to be maintained.

Strict isolation modes, such as serializable, leave little room for maneuver. However, less strict modes, such as read committed, do leave options. In those cases, the application has stated that it is designed with the expectation that the data it acts on may be out of date, and that it has a recovery model for handling this. Can hardware acceleration key off these statements?

John Crawford

Transactional Memory Position Paper

Motivation: For the next decade, a major part of improved processor performance will come from CMP – chip multi-processing. All the major vendors, including Intel, have announced plans for integrating larger and larger numbers of processors onto a single chip. This poses several challenges that must be met in order to deliver this performance.

Promises: Transactional memory promises to provide breakthroughs on both of these dimensions. Software development is improved by providing a synchronization mechanism that is a) intrinsic to the problem being solved, b) eases correctness, c) avoids the split-the-locks performance (and correctness) tuning cycle, and d) naturally provides atomicity – avoiding complex code for recovery of busted critical sections. Performance and scaling are helped by allowing improved parallel execution, and removing the bottleneck amplification feature of locks.

KISS: As a hardware developer, I would like to build a mechanism that is simple, small, yet meets the needs of software. I also hope that software developers would want to keep transactions small and simple in order to be able to reason (and maybe prove things) about them, and to keep good performance. I would also like to see transaction semantics limited to as few instructions as possible, such that the majority of instructions would be executed outside of a transaction. This will make it easier to provide efficient (shared) hardware mechanisms within the context of a multi-core and multi-threaded processor.

Konrad Lai

Tom Knight proposed something in the late 70s. Maurice Herlihy reawakened the topic in the early 90s. May be we could realize the dream in the late first decade of the 21st century. The last 40 years of parallel/concurrent programming is not at all a pleasant experience. With the advent of multicores, the academic and industry must solve the parallel programming problem for a broad range of applications with your typical good programmers, not just rocket scientists.

Although it is desirable to keep the hardware simple, but it is as important to make it no simpler, i.e. a comprehensive HW/SW solution. Issues raised by the software community, like truly nested transactions, will have significant impact as to what's the minimal hardware mechanism should be. Moore's law is still alive and kicking, let's not shy away by some incremental hardware complexity. Over 25 years ago, Intel worked with a few leading computer arithmetic visionaries and made the mathematically robust IEEE binary floating-point standard happen. We resisted the temptation to support something easier to save on transistors and area.

Kunle Olukotun

Kunle Olukotun & Christos Kozyrakis
Computer Systems Laboratory, Stanford University
http://tcc.stanford.edu

As single-chip multiprocessors become ubiquitous, it’s essential we make parallel programming the norm. Hardware-supported transactional memory presents an opportunity to address performance and semantic difficulties with multithreaded models. Nevertheless, most architectural proposals are selling transactions short: they add transactional memory as yet another layer on top of the thorny protocols for cache coherence and memory consistency and limit its usefulness to non-blocking synchronization. Anyone who has recently read the Java memory model can guess how much easier multithreaded programming can get by adding complexity to it. If parallelism is about to become mainstream, this is the right time to rethink the basic abstractions and start dealing with concurrency as the common case, not as the exception or as an afterthought. We believe that transactions will be the key technique in this effort

Our proposal, called the Transactional Coherence and Consistency (TCC) architecture, uses transactions as the exclusive abstraction for communication, coherence, consistency, and failure atomicity in a shared-memory multiprocessor. Each thread consists of a sequence of user-defined transactions. There is no execution outside of transactions. The processors continuously execute transactions in speculation, keeping all writes in local caches or separate buffers in main memory. When a transaction commits shared-memory is updated (communication), cache coherence is maintained, and atomicity violations are detected for other transactions causing re-execution. Hardware guarantees that transaction commits are seen by all processors in the same order (serializability).

TCC simplifies multiprocessor hardware by eliminating the need for any other communication, coherence, or consistency mechanism other than transaction commit. No overlaying of complicated protocols, no small, latency-sensitive control messages on individual loads and stores. It also provides an interesting hybrid between shared-memory and message-passing. Threads still have access to shared-memory but communication and coherence is maintained at software-selected points using bandwidth-efficient messages that implement transaction commit.

By running transactions continuously, TCC provides multithreaded software with more than a mechanism for non-blocking synchronization. Consistency and communication become coarser-grain and more intuitive as they happen at the boundaries of user-defined transactions. From the programmer's perspective, all memory references from a committed transaction occur ``before'' all the memory references of a transaction that commits afterward. There are no ordering rules for individual loads and stores. Performance is not compromised as transactions are executed speculatively in parallel. Transactions also provide the framework for failure atomicity and recovery that can handle both hardware and software errors. Finally, an all-transactional program is easier to debug and tune using feedback. The hardware can accurately pinpoint races and performance bottlenecks to a dynamic compiler or a programmer because it already tracks all addresses touched by concurrent transactions and orchestrates their commits. Debugging and tuning is also easier because the programmer must master and manipulate a single abstraction.

There are still hardware and software challenges to address before TCC is proven. Nevertheless, the vector points to the right direction: a clean & easy-to-use parallel model that that provides good performance and strong failure atomicity semantics for software to build upon...

Maurice Herlihy

Transactional hardware vs software
Maurice Herlihy
Brown University and Microsoft Research

We still don't know how we should split up work between transactional hardware and software, but it is not too early to speculate.

At one extreme, it seems implausible that we could by with all-hardware transactions, any more than we could get by with an all-hardware common language runtime. A full-fledged transactional API requires all kinds of capabilities poorly suited for direct hardware implementation. Obvious examples include first-class nested transactions, or the ability to suspend execution waiting for a condition to be come true (``retry'' in Transactional Haskell). There are undoubtedly many non-obvious features we haven't identified yet.

At the other extreme, we don't know yet the performance limits of software transactional memory (STM) systems, but eventually, just as with virtual memory, some kind of hardware support will probably be needed to alleviate copying and synchronization overheads.

One kind of hybrid vision is to provide dual paths: try the transaction in hardware, and if it fails, perhaps because it is too low or too big, then retry it in software. I find this approach tremendously unappealing because it requires writing complex software, contradicting the original motivation for transactional APIs, which is to simplify multi-threaded software.

I'm currently working on two projects that address this question from both ends. On one end, Ravi Rajwar, Konrad Lai, and I have recently written a paper ``virtual transactional memory'' (to appear in ISCA 05), which virtualizes hardware transactions, guaranteeing transactional properties even for transactions that exhaust hardware resources. At the other end, I am also working on SXM, a STM system written in C# that uses run-time code generation for synchronization. SXM is currently non-blocking, using interlocked compare-and-swap instructions for synchronization. I am working on an alternative design that uses short-lived critical sections (simulating hardware transactions) in an attempt to lower run-time costs. Ultimately, I hope to see these two directions converge on a high-level transactional common language runtime where hardware transactions make up the moving parts.

Michael Scott

Michael L. Scott
University of Rochester

Position Paper, Intel Workshop on Transactional Memory, April 2005

In comparison to lock-based concurrency, transactional memory offers compelling advantages in semantics and, potentially, performance. It won't become popular, however, unless it is (a) portable, and (b) convenient. Portability requires acceptable performance not only on new machines with special hardware, but also on old machines, with existing primitives. Convenience requires not only atomicity, but also condition synchronization. At Rochester we are addressing these challenges through (1) enhancements to software transactional memory, (2) dual data structures, and (3) integration with existing programming models.

(1) Building on the DSTM of Herlihy, Luchangco, Moir, and Scherer at Sun Labs, and the OSTM of Harris and Fraser at Cambridge University, we have developed an object-based ADAPTIVE SOFTWARE TRANSACTIONAL MEMORY (ASTM) system that combines the best features of both. Specifically, ASTM adjusts both the mechanism and the timing of object acquisition based on the ratio of reads and writes and the potential use of early release in the application workload. On stress-case microbenchmarks, ASTM matches the performance of DSTM (and outperforms OSTM by an order of magnitude) under write-heavy loads; it matches the performance of OSTM (and outperforms DSTM by a factor of two) under read-heavy loads. Like DSTM, ASTM is obstruction-free, rather than lock-free. We are pursuing several strategies in high performance contention management.

(2) While nonblocking concurrency, including transactional memory, provides an attractive alternative to mutual exclusion, it traditionally offers no alternative to condition synchronization. We believe we know how to get around that. Dual data structures are containers that may hold not only data, but also reservations. By dividing each potentially blocking operation into a "request" and a "follow-up", DDS methodology allows a container like a queue to determine the order in which data will be supplied not only to consumers that show up in the future, but also to consumers that have shown up in the past, while the container was empty.

Nonblocking DDSes extend the linearizability theory of Herlihy and Wing to objects with partial methods. They guarantee that both requests and follow-ups are nonblocking. They also guarantee that between its request and the successful follow-up, a thread refrains from interfering with the progress of its peers.

(3) Several researchers have suggested that transactional memory might be used to replace lock-based critical sections (e.g. Java/C# synchronized methods) in existing programs. A major open question is what to do about notify() and wait(): how can an algorithm be transactional if it blocks in the middle, or if dictates which thread must run next? Harris and Fraser have proposed using conditional critical regions, despite the potentially high overhead of repeatedly testing conditions. Hammond et al. have proposed a more radical model in which transactions are the fundamental means of accessing memory. While we are open to such ideas, we claim that DDS methodology can be used to build efficient, intuitive transactional programs with existing language primitives. One would need to modify the compiler, but not, we believe, the language or the virtual machine.

This work is supported in part by NSF grants numbers EIA-0080124, CCR-0204344, and CNS-0411127, and by financial and equipment grants from Sun Microsystems Laboratories.

Ravi Rajwar

Hardware transactional memory proposals exploit modern processor techniques and cache coherence. But these proposals are invariably tied to the hardware implementation. Software proposals assume (nearly) no hardware support but end up being generally difficult to use.

How do I define difficulty? By looking at how much of the underlying mechanisms are exposed to the programmer. For example, if the programmer has to manage special objects, that is too much work.

A compelling benefit I see of a hardware approach is the guarantee of atomicity of a hardware transaction, no matter how many other unprotected access requests occur. This guarantee of atomicity is typically missing in software transactional models. Why would I think this is important? Because we want thread programming to be accessible to common programmers, and guaranteeing the atomicity of a transaction, no matter what, goes a long way.

I contend that, if we decide to adopt transactions as a model (and it might be a big if), then we need to virtualize the implementation while maintaining the properties provided by transactional memory and maintaining the clean abstraction of begin and end to the common programmer. Of course, expert programmers may want more control and a richer interface.

Anything short would undermine the motivation of going towards transactional memory.

Why do I say "it might be a big if"... because many challenges remain, both at the software model and the system level. Virtualization is a non-trivial task.

But, implementing virtual memory was also considered difficult. Actually implementing an out-of-order processor was considered difficult. But good engineering and design made them possible. We should not shy away from addressing the hard problems.

Given that we are advocating a new software development model, the biggest hurdle to adoption might be simply an absence of a viable application development environment.

Seckin Unlu

I have been working on optimizations of multi-threaded operating systems and applications for many years. One observation is, the synchronization operations based on shared-memory consistency, e.g. semaphores, latches or locks, tend to get expensive as levels of concurrency increase in the implementations, especially as the number of processors increase and the processors speed up faster vs memory. The number of processors will keep increasing, especially with multi-core products, and the processors will speed up even more, in the foreseeable future. A more efficient way of synchronization, or a more efficient method of consistent execution of multiple threads, e.g. transactional execution, will likely be of great benefit to all multi-threaded applications.

Another important issue is that in practice the "success" case of execution is much more common than the failure case, as measured by empirical data, even on benchmarks that are running at full speed. E.g. modern DBMS's have locks that succeed 99+% of the time, even on 8-16 processor systems. Hence, any attempts to optimize the "failure" cases of locks or transactions, should not burden the "success" case by more work then what's classically/traditionally possible, in order to be a general performance improvement.

Suresh Jagannathan

Overspecifiying synchronization is often an undesirable by-product of guaranteeing safe access to shared data using mutual-exclusion. Unfortunately, reasoning about concurrent thread interaction remains a challenge for programmers. We advocate an alternative model of concurrent programming that does not penalize applications for overspecifying monitored regions of code. Our goal is to allow synchronization to be relaxed automatically whenever safety is not compromised.

While a transactional infrastructure offers the promise of improved performance by enhancing opportunities for concurrency, and simpler reasoning by formulating correctness in terms of serializability and atomicity, the realization of this promise is complicated by language features such as the sophisticated rules enforced by the language's memory model to govern when updates performed by one thread are available to other threads. Nested monitors complicate the issue further, and strongly influence how serializability constraints are defined.

For example, the Java memory model encourages aggressive compiler optimizations, and facilitates Java's implementation on scalable platforms because of its lazy release consistency semantics. Updates performed within a thread need only be made visible when control exits the guarded region in which the updates take place. While this model benefits from current trends in scalable multiprocessor design, its impact on program semantics especially with respect to notions of serializability and atomicity is not
clear.

These issues are central to a larger collection of concerns that form the focus of our work. Transactional computing of the kind we envision supports promiscuous concurrency: the ability of many tasks to simultaneously, transparently, and safely inhabit guarded regions of code. However, this freedom currently comes at a hefty price. These costs manifest themselves in several different ways: (a) read-write barriers to monitor access to shared data; (b) commit checks when control exits a guarded region to validate global consistency properties; (c) initialization and space overheads on data structures used for monitoring, when control enters a guarded region; (d) abort mechanisms triggered when consistency checks fail; (e) global update overheads when consistency checks succeed, and (f) adaptive monitoring overheads to determine when transactional mechanisms can be profitably applied.

Tim Harris

In ongoing work we're implementing a software transactional memory (STM) that offers a combination of key features: (1) transactional updates are made in-place, avoiding the cost of a look-aside on reads, (2) objects retain a "flat" format, that is, their fields are reached directly rather than through a "current version" pointer, (3) contention can be managed either by lock-based and non-blocking strategies.

We're using this STM to build atomic execution blocks in C# by integrating the STM into an optimizing bytecode-to-native code compiler, allowing the instruction sequences for transactional operations to be inlined in application code. This environment provides new opportunities for exploring analyses and optimizations -- both at the level of the STM's implementation (e.g. streamlining log management), and at the level of building atomic blocks (e.g. bypassing the transactional interface when reading from immutable objects).

This work will let us develop an understanding of how fast a software-only implementation of atomic blocks can run and thereby gain insights into (a) what seemingly-unavoidable overheads remain, (b) what form of hardware support might be appropriate to avoid these overheads.

For instance, when considering transactional hardware over which to build atomic execution blocks:

- Does it make sense for hardware to directly export a transactional interface, or are there lower level primitives over which fast software transactions can be built (much as work like the SOAR project showed that object oriented languages could run effectively on conventional or slightly-extended CPUs).

- How should the interface to a hardware transactional memory make a distinction between ordinary operations and transactional ones? Should this be at a per-instruction level (perhaps leading to code duplication), or should transactional storage areads be identified in some way? What optimization opportunities does a compiler lose by not being able to distinguish different kinds of access, e.g. to transaction-local, thread-local or immutable data?

- How can hardware and software transactional memories interact on the same data? There are at least three aspects of this question, (a) how to manage transactions that exceed some hardware resource, (b) how transactions interact with "transparent" OS and VM services (e.g. context switching or garbage collection), (c) how to implement transactional models of blocking and wake-up.

We'll hopefully have some actual numbers from our system by the time of the workshop.

Todd Mowry

Position Paper for the Transactional Memory Workshop

Todd C. Mowry

Computer Science Department, Carnegie Mellon University
& Director, Intel Research Pittsburgh

One of my graduate students (Chris Colohan) recently wrote a paper on applying thread-level speculation (TLS) to break up the "New Order" TPC-C transaction into smaller parallel chunks to be run on a chip multiprocessor (CMP). Although applying TLS to an existing transaction is a bit different from using a transactional model to write parallel programs, I think that we learned some very interesting things during this study that are relevant to hardware-based transactional programming in general.

Let me start with some background on the study, and then I'll get to the specific lessons that we learned. Chris implemented the "New Order" transaction directly on top of BerkeleyDB, and he parallelized the main loop using TLS. At the SQL level, this loop looks reasonable straightforward to parallelize, but it turns out that there are a huge number of data dependences across iterations due to the internals of BerkeleyDB. Part of the reason for this is that these "epochs" (or threads) are much larger (roughly 60,000 instructions) than what we typically saw in earlier studies (5000 instructions or less), and the code is also quite complex. These dependences prevented the code from speeding up on multiple processors when the code was simply translated directly to a TLS form. However, using feedback on which critical inter-thread data dependences were causing TLS to fail, Chris incrementally modified the code by hand to fix some problems that were preventing successful parallel execution (many of which were obvious fixes). Having no previous experience with BerkeleyDB's roughly 200,000 lines of code, it took Chris roughly a month's worth of effort to modify less than 1200 lines of code to ultimately achieve more than a twofold speedup on four processors. (Details of this study can be found in the CMU technical report CMU-CS-05-118 from March, 2005.)

The optimizations that Chris performed to eliminate inter-thread data dependences fell into three major categories: (i) converting centralized structures into per-processor structures (e.g., pools of free memory, counters for gathering statistics, etc.); (ii) postponing certain modifications until after the speculative work in a thread is completed, so that they can be serialized non-speculatively; and (iii) exploiting what Chris called "isolated undoable operations" (IUOs). The idea behind an IUO is that certain operations create inter-thread data dependences that will always cause TLS to fail (the same is true for hardware-based transactions); however, when you think about these operations at the level of program semantics, they don't really need to cause speculation to fail provided that there is a way to undo the isolated side-effects of the particular operation if speculation does in fact fail for real. For example, certain forms of resource allocation map well to IUOs. In our study, we used it to handle the pinning of pages in the buffer pool. From a programmer's perspective, an IUO operation must have an equivalent "undo" operation (e.g., "pin_page" and "unpin_page"). The hardware's dependence-tracking mechanism (for speculation) ignores any memory dependences that occur within these operations, and they have real side-effects that are seen by other threads. Those side-effects simply don't alter the semantics of the program. This idea is similar in spirit to "nested top actions" in ARIES, where we are allowing side-effects to occur that are observable by other threads, but this does not alter higher level program semantics.

To summarize the major take-home point, it seems to be important for the programmer to not simply rely on the hardware-based mechanism for tracking transactions (or TLS), but to also allow the programmer to remove certain operations from the view of this mechanism while preserving higher-level program semantics through software.

Victor Luchangco and Mark Moir and Nir Shavit

Hybrid Transactional Memory

Victor Luchangco, Mark Moir, and Nir Shavit
Sun Microsystems Laboratories.

Our position on transactional memory is that, given anything that can be respectably called hardware transactional memory, it is possible to insulate programmers from the limitations and idiosyncrasies of the particular hardware transactional memory implementation using a combination of compiler and library support. In particular, a software transactional memory can be built that uses hardware transactional memory (if available) to boost performance, but always retains the ability to revert to software transactions if the hardware transactional memory is ineffective in a particular case, or even if it does not exist in a particular architecture. We call this approach "Hybrid Transactional Memory". We believe that this approach eases the path towards widespread adoption of a transactional style of programming both because programs that use transactional memory can be developed and tested even before hardware transactional memory is available, and because it will make it easier to convince processor designers to include hardware transactional memory support because they will be less constrained than they would be if programmers were to use their implementations directly. This decoupling of programmers' code and hardware implementations is in our opinion key to fully realising the benefits of transactional memory to improve programmer productivity, code maintainability and understandability, etc., in contrast to proposals that require programmers to continue to program defensively with locks and therefore offer only performance benefits and no software engineering benefits.