CS 736 Reviews - Spring 2015: TxLinux: Using and Managing Transactional Memory in an Operating System

1. Summary
This paper proposes TxLinux, a Linux variant that uses HTM as a synchronization primitive. It introduces cxspinlocks (a primitive that allows locks and transactions to cooperate) and integrates HTM with the OS scheduler (eliminating priority inversion). It also provides insights and measurements from converting an operating system to use HTM.
2. Problem
The number of cores per chip is expected to scale up in the coming years. Programming these systems is a challenge, and achieving scalable operating system performance with locks and current synchronization options on systems with over a thousand cores comes at a significant programming and code maintenance cost. Transactional memory can help the OS maintain high performance while reducing coding complexity.
3. Contributions
The main contribution is the use of transactional memory as a synchronization primitive in the OS.
1) A mechanism for cooperation between transactional and lock-based synchronization of a critical region, using the cooperative transactional spinlock (cxspinlock).
2) Cxspinlocks enable a novel way of managing I/O within a transaction: the system dynamically and automatically restarts execution and acquires a conventional lock (cx_optimistic -> cx_exclusive).
3) Integration of HTM with the OS scheduler:
   a) An integer called the conflict priority communicates scheduling priorities and policies from the OS to the contention manager.
   b) The contention manager favors transactions in order of conflict priority, which prevents priority inversion and also improves overall system performance.
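As a rough illustration of the conflict priority idea, the sketch below packs a scheduling policy and a priority into one integer and lets a toy contention manager pick the winner of a conflict. The bit layout, the policy ranks, and the names `conflict_priority` and `resolve` are assumptions made for this example; the paper's actual encoding may differ.

```python
# Hypothetical policy ranks: a higher rank matters more to the scheduler.
POLICY_RANK = {"SCHED_OTHER": 0, "SCHED_RR": 1, "SCHED_FIFO": 2}
PRIO_BITS = 8  # assumed width of the priority field


def conflict_priority(policy: str, prio: int) -> int:
    """Pack policy and priority into one integer; the policy dominates."""
    if not 0 <= prio < (1 << PRIO_BITS):
        raise ValueError("priority out of range")
    return (POLICY_RANK[policy] << PRIO_BITS) | prio


def resolve(tx_a: dict, tx_b: dict) -> dict:
    """Toy contention manager: the higher conflict priority proceeds."""
    return tx_a if tx_a["cp"] >= tx_b["cp"] else tx_b


realtime = {"name": "rt", "cp": conflict_priority("SCHED_FIFO", 5)}
batch = {"name": "batch", "cp": conflict_priority("SCHED_OTHER", 200)}
winner = resolve(realtime, batch)
```

Because the policy sits in the high bits, a real-time transaction wins the conflict even against a batch transaction with a numerically larger priority, which is the policy inversion case.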
4. Evaluation
1) Time spent restarting transactions with cxspinlocks can be high. Relative to Linux, TxLinux wastes 57% less time synchronizing on 16 cores but 1% more on 32 cores. As the number of cores increases, the time wasted synchronizing under high contention in a large critical section becomes worse than with locking.
2) TxLinux allows more concurrency when transactions are small (up to 32 concurrent threads in the same critical region).
3) Policy and priority inversion are eliminated.
5. Confusion
I could not completely understand how transactions are virtualized.

Posted by: Prasanth Krishnan | March 27, 2015 11:49 PM

1. Summary
This paper discusses and measures TxLinux, a variant of Linux that uses hardware transactional memory (HTM) as a synchronization primitive, which requires cooperation between locks and transactions, and that manages HTM in the OS scheduler. A new primitive, the cooperative transactional spinlock (cxspinlock), is introduced; it provides the advantages of both locks and transactions while synchronizing access to shared data.

2. Motivation
Achieving scalable operating system performance with locks and current synchronization options on systems with over a thousand cores, which are expected to become common, comes at a significant programming and code maintenance cost. Transactional memory can help the operating system maintain high performance while reducing code complexity.

3. Contributions

TxLinux was the first operating system to use hardware transactional memory (HTM) as a synchronization primitive. It provides the greater parallelism enabled by transactions.

It introduced a new primitive, cxspinlock (cooperative transactional spinlock) that allows both locking and transactions to work together to protect the same data while maintaining both of their advantages.

The transactional memory proposals that existed at that time required every execution of a critical section to be protected by either a lock or a transaction, while cxspinlocks allow the system to choose dynamically between locks and transactions. For example, when a thread is executing a critical region in a transaction and encounters an I/O operation, control is transferred to the cxspinlock implementation so that the thread re-executes the critical region exclusively by acquiring a conventional lock.
cxspinlocks also allow different critical regions that access the same data structure to be protected by a transaction or by a conventional lock, which was not possible with the transactional memory designs of that time.
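The fallback path described above can be sketched as a small single-threaded simulation. This is only a model of the control flow: the `CxSpinlock` class, the `NeedExclusive` exception, and the `do_io` helper are hypothetical names, an exception stands in for the hardware detecting I/O, and the real primitive also has to coordinate with threads that already hold the lock.

```python
import threading


class NeedExclusive(Exception):
    """Raised by the toy 'hardware' when a transaction attempts I/O."""


def do_io(in_tx, out):
    if in_tx:
        raise NeedExclusive()      # I/O cannot be rolled back
    out.append("write to device")


class CxSpinlock:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the conventional spinlock
        self.log = []                   # records which path each attempt took

    def run(self, critical_region):
        """Try the region as a transaction; re-execute under the lock on I/O."""
        self.log.append("cx_optimistic")
        buffered = []                   # speculative effects, dropped on abort
        try:
            critical_region(True, buffered)
            return buffered             # commit
        except NeedExclusive:
            self.log.append("cx_exclusive")
            with self._lock:
                effects = []
                critical_region(False, effects)
                return effects


def region(in_tx, out):
    out.append("update shared counter")
    do_io(in_tx, out)


lock = CxSpinlock()
result = lock.run(region)
```

A region that performs no I/O commits on the optimistic path and never touches the lock, which is where the extra concurrency comes from.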

It was also the first to integrate the OS scheduler with HTM. This has several advantages: priority inversion and policy inversion are eliminated, and overall system throughput improves because the scheduler receives useful information about transactions.

4. Evaluation
Benchmark experiments show that for a 16 CPU configuration, TxLinux-SS wastes an average of 57% less time synchronizing than Linux does, and for 32 CPUs it wastes 1% more. Using several real-world benchmarks, it is shown that TxLinux has similar performance to Linux, exposing concurrency with as many as 32 concurrent threads on 32 CPUs in the same critical region.
Overall, introducing transactions as a synchronization primitive in the OS reduces the time wasted synchronizing on average, but it can cause pathologies that do not occur with traditional locks under very high contention or when critical regions are large enough for the overhead of HTM virtualization to become significant.
HTM-aware scheduling eliminates priority inversion, with a very slight performance overhead, for workloads that exhibit a degree of concurrency that cannot be achieved with traditional locks. However, under normal-contention workloads, it does not have a significant impact.

5. Confusion
Does an HTM synchronization system suit only a certain class of applications?

Posted by: Rohith Subramanyam | March 27, 2015 06:02 AM

Summary
This paper introduces TxLinux, a variant of Linux. It is the first operating system to use hardware transactional memory as a synchronization primitive, and it also manages transactions in the scheduler. The highlight of the paper is cxspinlocks, a new primitive (cooperative transactional spinlocks) that enables cooperation between locks and transactions, together with the handling of transactions by the OS scheduler.

Problem
The authors claim that using locks for synchronization leads to problems because locks are susceptible to deadlock. This puts a lot of pressure on the programmer to be very careful and to test all possible scenarios. Locks also suffer from the priority inversion problem. However, locks cannot simply be replaced with transactions, because transactions have serious issues when performing I/O operations.

Contribution
To enable a cooperative relationship between locks and transactions, new transactional spinlocks called cxspinlocks were introduced. Locks and transactions both protect the data while each keeps its individual advantages. The system can now decide dynamically when to use locks and when to use transactions. Regions of code that access shared data are executed atomically and in isolation. When the hardware detects that an I/O operation is required, control is transferred to the cxspinlock, which guarantees that the thread will re-execute the critical region exclusively. This allows for more parallelism. The TxLinux scheduler and the transactional memory hardware were also modified to interact with each other, which removed policy and priority inversions in the system.

Evaluation
TxLinux performed similarly to Linux on real-world benchmarks, with concurrency of up to 32 threads on 32 cores in the same critical region. However, it outperformed Linux on a 16-core system. On 32-core systems the kernel spent very little time synchronizing, so no speedups were observed for TxLinux. TxLinux also didn’t show improvements on workloads with normal contention profiles.

Confusions
I am not clear on why the time spent on the switching from transaction to locks and re-execution of critical section by cxspinlock does not cause any slowdown of the system.

Posted by: Nabarun Nag | March 27, 2015 04:09 AM

Summary:
This paper discusses TxLinux, a variant of Linux that uses hardware transactional memory as a synchronization primitive. Using both spinlocks and transactions solves many problems in concurrent programming that arise when using only spinlocks or only transactions.

Problem:
The authors claim that, as the number of cores increases, using conventional locks in parallel programming becomes difficult. Some of the notable problems are that locks do not scale well, are conservative, and are not robust. Locks are also difficult to use: small errors in a locking implementation easily lead to deadlock, and implementing error-free locking takes a lot of planning and testing. Transactional memory is fast and deadlock-free and does not suffer from priority inversion; however, transactions cannot be used when I/O is performed. The authors try to solve the problems of both and introduce a hybrid approach that switches dynamically depending on the operations performed.

Contribution:
The authors introduce a hybrid of locks and transactional memory called cooperative transactional spinlocks (cxspinlocks). Cxspinlocks allow a dynamic switch between transactional memory and spinlocks while ensuring correctness and fairness. If I/O is detected during a transaction, its state is changed and the transaction is restarted and run exclusively. OS scheduling information is included in the transaction, which helps prevent priority and policy inversion. In addition, system throughput is increased by raising the priority of processes with active transactions and by descheduling processes that would waste CPU time through repeated restarts.

Evaluation:
Performance was evaluated by comparing TxLinux with Linux on 16-core and 32-core systems. On the 16-core system, TxLinux outperformed Linux, whereas on the 32-core system its performance was comparable to Linux. For small transactions, about 67% concurrency is achieved, and restarts are reduced by 20% compared to spinlocks.

Confusion:
Is this model used in production systems?

Posted by: Anup Rathi | March 27, 2015 01:10 AM

Summary
This paper presents TxLinux, an OS that uses hardware transactional memory (HTM) as a synchronization primitive and presents innovative techniques for HTM-aware scheduling. It introduces a new synchronization primitive called cxspinlocks that allows locks and transactions to work together to protect the same data while maintaining both of their advantages. The paper also presents techniques to avoid priority and policy inversions during scheduling with HTM.

Problems
Implementing locking for a scalable OS is hard and comes with development and maintenance costs. Transactional memory is easier and less complex, improves concurrency, and helps the OS maintain high performance. However, it cannot replace locks, because some operations, such as I/O, cannot be performed inside transactions. In a large legacy system, it is practically difficult to convert every instance of locking to use transactions. Also, transaction conflict resolution leads to priority inversion. The paper addresses these issues by introducing new synchronization techniques and methods.

Contributions
Introduction of cxspinlocks that act as both transaction and lock primitive based on conflicts in the critical sections that they protect - major contribution in migration of parallel applications.
Handling I/O within transactions that allows a transaction that performs I/O to automatically restart execution and acquire a conventional lock.
HTM mechanisms to nearly eliminate priority inversion, and OS scheduling techniques that use information from HTM to increase system throughput.
The approach, experiments, and details about converting an OS to use HTM as a synchronization primitive are also valuable.

Evaluation
The paper presents a detailed evaluation of the system by measuring synchronization, concurrency, the performance of cxspinlocks, and scheduling. Results show that cxspinlocks are comparable to bare spinlocks and do not add much overhead. HTM does not add significant overhead due to restarts. The benefit of concurrency is not seen on the Linux kernel because the code is already optimized to avoid lock contention. The OS priority sharing results show near elimination of priority and policy inversion. Results also show that transaction-based scheduling is advantageous for conflict-intensive workloads.

Confusion
In our group discussion we were wondering what it means to virtualize a transaction.

Posted by: Yash Govind | March 26, 2015 03:04 PM

1. Summary
The authors discuss TxLinux, a variation of Linux that implements a new synchronization primitive to take advantage of transactional memory. These cxspinlocks allow a mix of locking and transactional memory to be leveraged as needed.

2. Problem
The paper argues that hardware transactional memory (HTM) makes OS design simpler and easier to maintain. However, it cannot be used in all situations, specifically when doing I/O. The authors note that current locking mechanisms do not perform well with HTM, so a new synchronization primitive is needed.

3. Contributions
The primary contribution is the introduction of cxspinlocks. This primitive allows the cooperative interaction of transactional and non-transactional threads. It maintains the high concurrency of transactional memory when non-transactional threads are not present, and allows a transaction manager to be notified when a non-transactional thread attempts to acquire the cxspinlock and arbitrate the lock access.

4. Evaluation
The authors include a lot of data about the performance of their system. With 16 processors, TxLinux wastes 57% less time synchronizing than Linux, but with 32 processors it wastes 1% more. Additionally, they mention that for some specific cases, like creating and deleting many small files in a single directory, transactional memory performs worse than a spinlock. The paper notes that in these cases the spinlock aspect of cxspinlocks could be substituted instead.

5. Confusion
What techniques can be used to stop the transactional memory state from overflowing the L1 cache that the authors are referring to?

Posted by: Alex Sherman | March 26, 2015 08:01 AM

1. Summary
This paper presents TxLinux, a variant of Linux that uses hardware transactional memory in the operating system as a synchronization primitive. To achieve cooperation between locks and transactions, this paper also introduces a new primitive called cxspinlocks, that allows the system to execute critical regions with transactions, and automatically roll back to use locking if the region performs I/O.

2. Problem
Transactional memory has become very popular because of its simplicity and modularity. However, there are several problems with using transactional memory in an operating system:

Transactions cannot simply replace or eliminate locks. Limitations exist, such as performing I/O.

In a large legacy system, there are practical difficulties in converting every instance of locking to use transactions.

Transactions are an optimistic primitive. Locks usually perform better for highly contended critical sections.

3. Contributions
This paper first provides an HTM primer, motivates transactions in the OS, describes the HTM model and explains the basic issues with using transactions in the OS. A transaction is started with xbegin, and commits at xend. The major advantage of synchronization with transactional memory is that it is more modular than locks, easing code maintenance and reducing the possibility of bugs.

The paper invents a novel mechanism for cooperation between transactional and lock-based synchronization of a critical region with the use of cxspinlock. These two different primitives work together to protect the same data with both of their advantages. An innovative primitive, cxspinlock, is introduced here, that allows the system to choose dynamically and automatically between mutually exclusive locks and transactions. The idea is to use cx_optimistic and cx_exclusive functions to acquire cxspinlocks with either transactions or mutual exclusion.
The mechanism also handles I/O within transactions: a transaction that performs I/O automatically restarts execution and acquires a conventional lock.

Furthermore, this paper introduces an HTM mechanism to nearly eliminate priority inversion. When a transactional conflict occurs, hardware/software logic called the contention manager determines which of the two conflicting transactions may proceed. The problem is solved by having the contention manager resolve the conflict in favor of the thread with the higher OS scheduling priority. This method also resolves the problem of policy inversion.

The paper also provides some insights and measurements from converting an operating system, TxLinux, to use hardware transactional memory as a synchronization primitive.

4. Evaluation
This paper evaluates TxLinux, an operating system that uses HTM, and concludes that using transactions as a synchronization primitive reduces the time wasted on synchronization on average, but that under very high contention it may cause pathologies that do not occur with traditional locks. Scheduling with HTM eliminates the priority inversion and policy inversion problems and enables better management of very high contention. With normal-contention workloads, however, it does not have a significant impact on performance.

5. Confusion
What is the difference between spinlocks and mutexes?

Posted by: Yiran Wang | March 26, 2015 08:00 AM

Summary:
The paper describes a synchronization primitive, hardware transactional memory (HTM), in the context of an operating system, TxLinux. The interaction between HTM and the scheduler is also discussed, including the scheduling policies involved. The authors claim that HTM provides a cleaner interface for the programmer and also gives correctness guarantees, such as avoiding priority inversion. The authors propose cooperative transactional spinlocks, which allow spinlocks and HTM to work together to protect critical regions of code.

Problem:
The authors try to address the issue of providing a synchronization mechanism for the OS on CMPs with a large number of cores, which they predict will be the norm in future systems. Existing mechanisms, such as spinlocks in Linux, are slow and are susceptible to deadlocks and livelocks, which makes them very difficult to reason about. This can lead programmers to use coarse-grained locks, which degrade performance.

Contributions:
The authors describe how HTM works and reason about its advantages, which justifies and motivates their work. They show how using HTM can provide composability with guaranteed correctness. The prime contribution of the paper is the concept of the cooperative transactional spinlock, which allows both HTM and spinlocks to protect critical sections. This is achieved via two functions, cx_optimistic and cx_exclusive. cx_optimistic lets the thread execute the critical region as a transaction, and upon detection of a mutual exclusion violation the transaction restarts with a lock. cx_exclusive executes the critical region with a lock held. To address the problem that HTM synchronization cannot support I/O within critical regions, the authors choose to decouple I/O calls from other system calls, which allows HTM to be extended to user level. Finally, the OS can use registers and counters to obtain information about transactions and make intelligent scheduling decisions.

Evaluations:
Simulations suggest that cxspinlocks perform well with 16 to 32 cores. They add little overhead over regular spinlocks. They provide the guaranteed protection, avoiding priority inversion and similar problems. Other performance benefits are not really seen.

Confusions:
What is transaction virtualization?

Posted by: Kishore Kumar Jagadeesha | March 26, 2015 08:00 AM

Summary:
This paper introduces TxLinux, an operating system that uses hardware transactional memory as a synchronization primitive and manages HTM in the scheduler.

Problem:
As the number of cores per chip goes up, achieving scalable OS performance with locks and current synchronization options comes at a significant programming cost. Transactional memory can make programming easier by executing shared regions atomically and in isolation, buffering the results of individual instructions and restarting execution if isolation is violated. However, HTM has limitations in scenarios such as performing I/O. So the authors find a way to allow locks and transactions to work together.

Contributions:
(1) The paper proposes the cooperative transactional spinlock (cxspinlock) to allow cooperation between transactional and lock-based synchronization of a critical region. It proposes a synchronization API that affords seamless integration, allowing transactional and non-transactional code to correctly use the same critical section while maintaining fairness and high concurrency.
(2) The paper proposes a novel mechanism for handling I/O within transactions. If MetaTM detects I/O during a transaction, the transaction state is set to NEED_EXCLUSIVE and the transaction is restarted.
(3) The paper proposes a mechanism to nearly eliminate priority inversion and OS scheduling uses information from HTM. MetaTM provides an interface for the OS to communicate scheduling priority and policy to the hardware contention manager. TxLinux encodes a process's dynamic scheduling priority and scheduling policy into a single integer called conflict priority.

Evaluation:
The paper compares the performance of TxLinux against Linux. It reports the number of times a spinlock is acquired, the number of cycles spent acquiring it, and the number of times a process had to spin before acquiring a lock, showing that TxLinux lowers lock contention and reduces the time spent on synchronization. The restart rate of transactions that perform I/O and the frequency of transactional priority inversion are also given.

Confusion:
Can you explain the eager version management (the old values are copied into an undo log managed by the processor) used by MetaTM?

Posted by: Jing Fan | March 26, 2015 07:48 AM

Summary:
This paper talks about the use of hardware transactional memory primitives in an operating system (TxLinux). More specifically, it talks about using a combination of locks and transactions to retain the advantages of both mechanisms, and about using transaction information in OS scheduling policies.

Motivation:
The authors argue that using HTM primitives in the OS (where using multiple locks leads to many synchronization bugs and to code that is not modular) will simplify the code, reduce bugs, and improve scalability by providing more concurrency. They also identify limitations of the transactional memory mechanisms of the time (such as the impossibility of doing I/O in a transaction) and the difficulty of migrating existing operating system code to the new primitives.

Contributions:
- Their most important contribution was the identification of a simple way to migrate existing OS code to use the newer primitives (by using a combination of transactions and locks).
- The cooperative transactional spinlocks (cxspinlocks) can, depending on need, alternate between the increased concurrency offered by transactions and the exclusivity provided by spinlocks.
- This allowed them to migrate the locks in Linux code that could not be modeled as transactions to use the 'exclusive' version of the cxspinlock, and the other locks to use the optimistic version of the cxspinlock (which attempts to protect the critical sections optimistically using transactions, which allow more concurrency).
- In scenarios such as I/O, which cannot be protected by transactions, the system can thus dynamically switch over by restarting the transaction as 'exclusive'. This technique affords increased concurrency where it is possible, but switches over to the older mechanisms to provide isolation where it is not.
- The other major ideas in the paper are the inclusion of OS scheduling information in the transaction contention management mechanism to prevent priority and policy inversion, and the use of transaction-related information in the OS scheduler to increase throughput, by increasing the effective priority of processes with active transactions and descheduling repeatedly restarting transactions for a while.
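The last point, transaction-aware scheduling, can be sketched as follows. The boost value, the restart threshold, and the `Task` fields are assumptions chosen for illustration, not values taken from the paper.

```python
from dataclasses import dataclass

BOOST = 5            # assumed bonus for a task with an active transaction
RESTART_LIMIT = 3    # assumed threshold before a task is descheduled


@dataclass
class Task:
    name: str
    base_prio: int          # larger means more urgent in this sketch
    in_transaction: bool = False
    restarts: int = 0


def effective_priority(task: Task) -> int:
    return task.base_prio + (BOOST if task.in_transaction else 0)


def pick_next(tasks):
    """Skip tasks that keep restarting, then take the highest priority."""
    runnable = [t for t in tasks if t.restarts < RESTART_LIMIT]
    candidates = runnable or list(tasks)   # never leave the CPU idle
    return max(candidates, key=effective_priority)


tasks = [
    Task("A", base_prio=10),
    Task("B", base_prio=8, in_transaction=True),
    Task("C", base_prio=20, in_transaction=True, restarts=4),
]
chosen = pick_next(tasks)
```

Here task C is passed over despite its high base priority because it keeps restarting, and B overtakes A thanks to the boost for its active transaction.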

Evaluation:
The authors implemented their hardware model (MetaTM) on the Simics machine simulator. Their study shows that TxLinux reduced lock contention (due to the increased concurrency of transactions) and reduced the number of test&set operations, which use the coherence hardware. The data does not show any definite increase in concurrency in the overall system, but the authors argue that this is because in Linux, due to the use of locks, the size of critical sections has been minimized, and that if critical sections grow, the increased concurrency of transactions will be more visible. They also show that cxspinlocks have minimal overhead (2.8-3.1%) compared to the original spinlocks and that transaction-aware process scheduling does improve performance (by around 8% for 4 CPUs).

Confusion:
The authors say that using priority inheritance to solve priority inversion requires us to change polling mechanisms such as spinlocks to blocking mechanisms such as mutexes. Could you please explain why this is so?

Posted by: Hariharan Gopalakrishnan | March 26, 2015 07:37 AM

Summary:

In this paper, the authors present TxLinux, which uses hardware transactional memory (HTM) as a synchronization primitive. The paper discusses how HTM is managed in the scheduler. It describes in detail the cxspinlock, a primitive that allows locks and transactions to work together. The paper also presents an evaluation of the proposed system.

Problem:

Concurrent programming with locks and similar synchronization primitives results in significant programming and maintenance cost. The paper tries a new approach to concurrent programming using transactional memory. In this approach the challenge lies in the cooperation of lock-based and transactional synchronization and in the integration of transactions with the OS scheduler. Using transactional memory brings up a new problem, conflict management. The HTM approach must also tackle the priority inversion problem.

Contribution:

The paper uses the cxspinlock, a cooperative transactional spinlock. Cxspinlocks allow lock-based or transaction-based synchronization for different executions of a critical section. A cxspinlock can be acquired with the cx_exclusive and cx_optimistic functions. cx_optimistic allows multiple transactional threads to execute, and when one requires mutual exclusion it restarts and switches to cx_exclusive. Contention between two transactions is handled by the contention manager. Fairness between transactional and non-transactional threads is achieved by subjecting both to the contention manager's policy. The contention manager is invoked when the write set of one transaction intersects the read set or write set of another. Deadlock and livelock are avoided by using the timestamp policy. MetaTM supports transaction-aware scheduling by providing the OS with mechanisms to query the state of transactions. Dynamic priority based on HTM state is achieved by awarding a priority boost to threads with an active transaction and penalizing those whose transactions restart often. Conflict-reactive descheduling also takes place: a thread that restarts very often is descheduled.
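The conflict condition and the timestamp policy can be illustrated with plain sets. In this sketch the `Tx` record and both function names are hypothetical, and "the older transaction wins" is the usual reading of a timestamp policy, assumed here rather than quoted from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Tx:
    name: str
    start_time: int                      # smaller means older
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def conflicts(a: Tx, b: Tx) -> bool:
    """True if either write set overlaps the other's read or write set."""
    return bool(a.write_set & (b.read_set | b.write_set)
                or b.write_set & (a.read_set | a.write_set))


def timestamp_winner(a: Tx, b: Tx) -> Tx:
    """The older transaction proceeds; the younger one restarts."""
    return a if a.start_time <= b.start_time else b


t1 = Tx("t1", start_time=1, read_set={0x10}, write_set={0x20})
t2 = Tx("t2", start_time=2, read_set={0x20}, write_set={0x30})
t3 = Tx("t3", start_time=3, read_set={0x10})
```

Two transactions that only read the same address (t1 and t3) do not conflict. Always favoring the older transaction means the oldest one in the system is never aborted by this rule, which is why a timestamp policy guards against livelock.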

Evaluation:

The paper presents the results of the evaluation of TxLinux. For the 16-CPU configuration, it wastes 57% less time synchronizing, but for the 32-CPU configuration it wastes 1% more. The results also show that TxLinux lowers lock contention by converting to cxspinlocks, as 37% of calls to the lock routine, 34% of calls to test loops, and 50% of calls to test-and-set are eliminated. The use of cxspinlocks accounts for only 2% to 3% overhead. Priority inversion and policy inversion are almost eliminated by conflict management, but this holds only for the workloads in the benchmarks used.

Confused about:

Is there any similar system in use today? What is the SizeMatters contention management?

Posted by: Ashwin Karthi Narayanaswamy | March 26, 2015 06:26 AM

1.Summary
This paper is about adapting the free Linux operating system kernel to transactional memory. The resulting operating system kernel is called TxLinux. The paper discusses two innovations in detail: cooperation between locks and transactions, and the integration of transactions with the OS scheduler. Mixing locks and transactions requires a new primitive, cooperative transactional spinlocks (cxspinlocks), that allows locks and transactions to protect the same data while maintaining the advantages of both synchronization primitives. Integrating the OS scheduler with the hardware transactional memory is paramount to avoiding priority inversion. In real-world benchmarks TxLinux is able to achieve concurrency with 32 threads in the same critical region.

2.Problem
When concurrent programming is done using locks, programmers face some obvious and unavoidable issues: threads cannot coexist in the same critical region even if they are not accessing the same bytes in memory, deadlocks can occur, and locks lack composability. Transactional memory can help the operating system maintain high performance while reducing coding complexity. Transactions suffer from some issues too. They cannot be used in situations in which rollback is not possible, especially for I/O. So a solution that can dynamically use locks or transactions depending on need has been devised. It is called cxspinlocks.

3.Contributions
The important contributions of this paper are as follows:

cxspinlocks are introduced, a primitive that allows locks and transactions to work together to protect the same data while maintaining the advantages of both. They can choose dynamically and automatically between locks and transactions. The primitive is implemented using cx_optimistic and cx_exclusive. The first enters a transaction, assuming one can be started. If true mutually exclusive access is required, the transaction is restarted with cx_exclusive and locks are used instead of transactions.
The Linux kernel was converted to TxLinux in two phases. The first was an ad hoc process that used information about highly contended locks to replace those locks with transactions. I/O cannot be covered by transactions. The second conversion used transactions via cxspinlocks.
To cover I/O under transactions, all I/O has to be done in caches first. The task of decoupling I/O from system calls reduces to making sure enough system resources are available for a user-initiated sequence of system calls to complete having updated only memory. If enough resources are not available, the user process is killed. MetaTM cannot support a transaction whose memory needs surpass the total memory available on the system.
MetaTM provides an interface for the OS to communicate priority and policy to the hardware contention manager. os_prio is the contention management policy used. It is a hybrid of three contention management policies. The first prefers transactions with the greatest scheduling value to the OS. In case of a tie, os_prio employs SizeMatters. If the transaction sizes are equal, os_prio employs timestamps.
If a process’ transactions restart repeatedly, it may make sense to make scheduling decisions that make future contention less likely. MetaTM provides mechanisms for the OS to query the hardware and communicate transaction state to the OS. TxLinux has a modified scheduler that uses this information while making scheduling decisions.
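The three-level tie-break of os_prio described above maps naturally onto tuple comparison. The sketch assumes that SizeMatters favors the transaction with the larger working set and that the timestamp step favors the older transaction; the `TxInfo` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TxInfo:
    name: str
    conflict_priority: int   # scheduling value supplied by the OS
    size: int                # working-set size, used by the SizeMatters step
    start_time: int          # smaller means older


def os_prio_key(tx: TxInfo):
    # Tuples compare left to right, so each field only breaks earlier ties.
    return (tx.conflict_priority, tx.size, -tx.start_time)


def os_prio_winner(a: TxInfo, b: TxInfo) -> TxInfo:
    return max(a, b, key=os_prio_key)


high = TxInfo("high", conflict_priority=9, size=1, start_time=50)
big = TxInfo("big", conflict_priority=3, size=64, start_time=20)
small = TxInfo("small", conflict_priority=3, size=8, start_time=10)
old = TxInfo("old", conflict_priority=3, size=8, start_time=5)
```

With these values, `high` beats `big` on scheduling value alone, `big` beats `small` on size, and `old` beats `small` only because everything else is equal.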
4.Evaluation
The Simics simulator is used to run TxLinux. All the required models and configurations are provided by Simics. For a 16-CPU configuration, TxLinux-SS wastes an average of 57% less time synchronizing than Linux does, and for 32 CPUs it wastes 1% more. Linux is highly optimized to avoid lock contention, so major gains in transaction concurrency were not observed. A detailed evaluation of cxspinlock performance and transaction-aware scheduling is given.

5.Confusion
Critical regions usually mean the same bytes in memory are read/written. If transactions allow multiple threads to run concurrently in the same critical region, won’t it result in most of the threads almost always restarting?

Posted by: Jyotiprakash Mishra | March 26, 2015 06:15 AM

1.Summary
The paper introduces a new variant of the Linux OS called TxLinux, which takes advantage of Hardware Transactional Memory (HTM). TxLinux implements a new synchronization primitive called cxspinlock, which integrates transactions and spinlocks and uses whichever the critical region needs. The paper discusses handling I/O in transactions, integrating HTM into the kernel scheduler, and how cxspinlocks can solve the priority inversion problem. The authors also evaluate TxLinux on the MetaTM processor and provide some performance results.

2.Problem
Concurrent programming with traditional locks has a lot of problems. Locks are not modular, lack composability, are susceptible to deadlocks, and force the programmer to reason about lock granularity. Locks also suffer from priority inversion, which can't be easily resolved. Transactional memory provides a better abstraction to programmers; however, rolling back or restarting I/O operations is not possible. The paper tries to address these problems by introducing the cooperative transactional spinlock (cxspinlock), allowing both spinlocks and transactions to work together across a critical section of code.

3.Contributions
One interesting analysis the paper makes is reasoning out how transactions and spinlocks would still create problems in OSes. The main contribution is the introduction of a new synchronization primitive, the cxspinlock, with API functions cx_optimistic and cx_exclusive. Transactional threads optimistically execute the critical region by calling cx_optimistic, but if a non-transactional thread conflicts or an I/O operation interrupts, the transactional threads restart by calling cx_exclusive to acquire a mutual exclusion lock. I/O in transactions is handled by monitoring the interrupt instructions or memory-mapped regions and then restarting the transaction by acquiring the lock. The paper also discusses the scheduling and contention management policies integrated into the contention manager, which tracks the transaction profile (conflict priority), including the scheduling policy of the thread and the sizes and timestamps of transactions. By communicating the process scheduling priority and policy to the contention manager, TxLinux allows transaction-aware scheduling to be done effectively across transactions.

4.Evaluation
TxLinux is evaluated with the MetaTM processor implemented in the Simics simulator. All the required models and configurations are provided by Simics. The synchronization overhead of TxLinux is compared with that of Linux: TxLinux wastes around 57% less time synchronizing on 16 CPUs, mainly by removing cache misses for locks. Maximum concurrency across transactions has also been evaluated, and the number of transactions restarted for I/O is shown. More results for transaction-aware scheduling are also reported. However, overall TxLinux performs slightly better on 2 benchmarks and the same on the others compared to Linux. More reasoning about TxLinux performance would be good.

5.Confusion
I did not get how open nested or closed nested transactions would make a difference in scheduling policy of OS? Also, what is transaction virtualization and how often does overflow in HTM occur?

Posted by: Vinay Gangadhar | March 26, 2015 05:12 AM

1. Summary
The paper provides an overview of hardware transactional memory (HTM), identifies issues with adding transactions to an OS, and proposes the cooperative transactional spinlock (cxspinlock), a primitive that lets the same critical section be synchronized using either locks or transactions. The authors incorporate cxspinlocks into a version of Linux (named TxLinux), and evaluate the OS to show its benefits.

2. Problem
Design a primitive to synchronize threads executing the same critical section under hardware transactional memory either using locks or transactions, while providing fairness (or policy enforcement) in contention management of contending threads.

3. Contributions
HTM provides a mechanism for synchronizing critical sections by recording the read and write sets (memory addresses) of each transaction, and rolling back all but one execution on a conflict (i.e., when the write set of one transaction overlaps the read or write set of another). Not all actions in OS code can be rolled back (I/O, for example), so those cases need exclusive access to the critical section. The authors design cooperative transactional spinlocks (cxspinlocks), which can synchronize the same critical section using either locks, for exclusive access, or transactions, for optimistic access. Cxspinlocks let multiple transactional threads execute inside a critical section if they don't conflict. However, they provide exclusive access to non-transactional threads (i.e., no other thread, transactional or not, can execute inside the critical section while one non-transactional thread is executing). Using cxspinlocks, TxLinux can handle I/O inside transactions by detecting the I/O and restarting the transaction in exclusive (non-transactional) mode. The authors propose a novel mechanism for OS code to convey the priority of threads to the hardware for contention resolution. The mechanism involves encoding the priority as an integer and communicating it to hardware by writing to a register.
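The conflict rule described above can be written down directly. In this sketch a transaction is represented, purely for illustration, as a pair of Python sets of addresses (read set, write set); real HTM tracks these in hardware, typically at cache-line granularity.

```python
def conflicts(tx_a, tx_b):
    """True if either transaction's write set overlaps the other's read or write set."""
    a_reads, a_writes = tx_a
    b_reads, b_writes = tx_b
    # Write/write and write/read overlaps from a's side, then read/write from b's side.
    return bool(a_writes & (b_reads | b_writes)) or bool(b_writes & a_reads)
```

Two threads inside the same critical region therefore only conflict if they touch overlapping addresses and at least one of them writes; concurrent readers of the same data do not force a restart.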

4. Evaluation
The authors modify Linux to use cxspinlocks wherever appropriate, creating TxLinux. They use various userspace application benchmarks to stress various OS subsystems, and compare their behavior under Linux and TxLinux on 16 and 32 processor setups. On average, TxLinux wastes 57% less time synchronizing than Linux on the 16 processor setup, but 1% more on the 32 processor setup.

5. Confusion
Optimistic concurrency control (the same idea as HTM) is an old idea in database management systems, and has been shown to be a purely academic idea. It performs very badly in high contention scenarios, as the evaluation here shows. Are HTMs even popular now or actively researched? Should application developers expect this in any new commercial processors in the near future?

Posted by: Shoban Chandrabose | March 26, 2015 04:22 AM

1. Summary
The paper describes the working of TxLinux, which is a variant of Linux that exploits the advantage offered by a then-recent development in hardware: Hardware Transactional Memory (HTM). HTM executes critical regions in hardware as if they were atomic transactions, which provides a greater advantage than spin locks on scalable multicore processors.

2. Problem
In existing systems that use spin locks in kernel critical regions, the critical regions are particularly slow as the shared data has to be kept consistent. Spin locks may also result in deadlocks and livelocks if not implemented properly. HTM, on the other hand, does atomic transactions thus relieving the system of deadlocks and coherency issues. This is specifically explained with the then existing Linux code which had a lot of bugs with respect to locking and synchronization.

3. Contributions
a) The authors develop cooperative transactional locking (cxspinlocks), which uses atomic transactions in place of spinlocks. In cases of I/O, where rollback is not possible, the system dynamically falls back to traditional spinlocks, thus ensuring correctness.
b) cx_optimistic() operates on the above principle: multiple transactional threads may work in the critical region at the same time. When two of them conflict, one of them may be restarted. While restarting, if the thread needs exclusive execution status, it can call cx_exclusive() to obtain it.
c) The problem of choosing which thread proceeds when threads contend in a critical region is solved by a contention manager, which selects the thread with the highest priority to run and so does not cause the code to deadlock.
d) The authors also decouple I/O from system calls, which lets user-level code use the transactional programming model while also modifying device state.

4. Evaluation
The authors test TxLinux on a simulated version of MetaTM by running a variety of benchmarks. The authors show that TxLinux wastes less time synchronizing, which is the prime purpose of implementing the new system. The os_prio policy, which gets rid of priority inversion, does so at a slowdown of only 2.5%. The authors also reduce restarts in the HTM by around 20% by using transaction-aware scheduling.

5. Confusion
Now that there are many scalable multicore processors common in existence, why is TxLinux not as popular as the authors motivate it to be?

Posted by: Naveen Anand Subramaniam | March 26, 2015 03:50 AM

Summary:

The paper implements a variant of the Linux operating system, namely TxLinux, which tries to solve the problems that come up with concurrency, such as priority inversion, while also providing increased concurrency. It combines the benefits of synchronization through transactions and through locks by implementing a combination of both named cxspinlocks. Cxspinlocks make use of transactions to achieve synchronization, but if there is any I/O involved in the critical region they switch over to using locks.

Problems:

The current approach of using locks to achieve concurrency results in problems like priority inversion, lack of composability and lack of scalability. The approach of using transactions is unable to perform I/O, since it is difficult to figure out I/O conflicts.

Contributions:

TxLinux came up with cooperative transactional locking, which allows a critical section to be protected either by a transaction or by a mutually exclusive lock. Cxspinlocks can be acquired using one of two functions: cx_optimistic or cx_exclusive. cx_optimistic attempts to protect a critical section using transactions. If a transactional thread executing within the critical section gets restarted because another thread accessed the same data, the restarted thread can acquire the lock with cx_exclusive, which prevents further restarts. The combination of these two primitives helps achieve maximum concurrency.

The underlying problem of I/O in synchronization through transactions is solved by decoupling the I/O from system calls. This is done by completing the actions required in I/O in memory and later moving it to disk. The problem of priority inversion is solved in transactional synchronization by terminating the process with the lower priority in case there is a conflict between write or read set of two different priority transactions. Transaction-aware scheduling has been implemented by trying to run the thread which was active in a transaction thereby reducing the chances of a conflict arising.

Evaluation:

The paper talks about an alternative for synchronization using a combination of locks and transactions. Though the paper explains how this new synchronization primitive solves the problems of priority inversion and composability, it does not give any proof that these cannot be solved through locks alone, since certain mechanisms already exist that try to solve the problem of priority inversion.

Confusions:

How exactly do they find the conflicts between the transactions of two threads? They would be accessing the critical sections mostly in their virtual memory right? In that case, what would be the methodology to detect the conflict in the read and write sets of two threads?

Posted by: Varun Joshi Kishanlal Joshi | March 26, 2015 02:46 AM

Summary: This paper introduces the design of TxLinux, a modified Linux version that supports hardware transactional memory. The authors propose a new lock called the cxspinlock that combines the functionality of a traditional spin lock and transactional memory.

Problem:
1. Spin locks have been used extensively in the Linux kernel as a synchronization primitive. However, spin locks do not exploit enough concurrency on modern multicore hardware.
2. Modern processors support hardware transactional memory. Though it brings more concurrency than traditional locks, its use is more limited. For example, I/O operations cannot occur in a memory transaction.

Contribution:
1. A mechanism that allows cooperation between transactional and traditional lock-based synchronization in a critical section. They implemented cxspinlocks, which operate in two modes: mutually exclusive or optimistic. When the critical section is entered by a traditional (non-transactional) thread, the cxspinlock operates in exclusive mode and behaves like a spin lock. Otherwise, the cxspinlock takes advantage of transactional memory and allows multiple threads to enter the critical section simultaneously. When a conflict is detected, the underlying hardware reverts all threads in conflict and picks one to restart. In both modes, the same memory location is used to indicate whether the lock is currently available.

2. Dealing with I/O operations in a critical section by running cxspinlocks in mutually exclusive mode. Since I/O operations cannot be reverted, it is impossible to run I/O in a transaction. TxLinux solves this problem by letting critical sections that require I/O use mutually exclusive cxspinlocks, which behave like a traditional spin lock.

3. Solving the priority inversion problem with the contention manager. When a transaction conflict is detected, the contention manager selects the process with the highest priority to proceed.
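
The shared lock location from point 1 can be modeled roughly as below. This is a toy model with hypothetical names (LockWord, enter_optimistic, try_exclusive), and it assumes one particular outcome of contention: an exclusive acquire succeeds only when no transaction is active, while transactions may enter freely as long as the word is not held.

```python
class LockWord:
    """Toy model of one lock word observed by both modes (illustrative only)."""

    def __init__(self):
        self.held = False
        self.active_tx = set()  # transactions currently inside the critical section

    def enter_optimistic(self, tx):
        # A transaction only reads the word, so many can enter while it is free.
        if self.held:
            return False
        self.active_tx.add(tx)
        return True

    def commit(self, tx):
        self.active_tx.discard(tx)

    def try_exclusive(self):
        # Assumed policy: exclusive entry needs a free word and no active transactions.
        if self.held or self.active_tx:
            return False
        self.held = True
        return True

    def release(self):
        self.held = False
```

With this model, two transactions can be inside at once, an exclusive acquire fails until both commit, and a held lock keeps new transactions out until it is released.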

Evaluation: The authors measured the performance of TxLinux under many workloads and compared it with the original Linux. They conclude that the performance gain is substantial.

Confusions: What can we do on a CPU that does not support hardware transactional memory?

Posted by: Menghui Wang | March 26, 2015 02:27 AM

Summary
The paper presents TxLinux, a Linux variant that uses hardware transactional memory (HTM) as a synchronization primitive instead of traditional locks. The paper also talks about integrating transactions with the OS scheduler, which avoids priority inversion. The paper points out the problems with purely transactional behavior and provides a primitive, the cxspinlock, as a solution: it allows locks and transactions to coexist while limiting concurrency only when there are conflicts.
Problem
Using transactions to facilitate concurrent access to shared resources is much easier to program than other techniques like semaphores, because one doesn't have to worry about properly acquiring and releasing locks, or about doing operations after acquiring a lock that might result in deadlock. Instead, one simply specifies what code should be run atomically. This simplicity helps reduce bugs and makes for more quickly understandable code. But locks cannot be completely eliminated in a real OS, since certain operations like I/O are irreversible. The authors wanted to develop a lock primitive which can cooperate with transactional memory.
Contribution
Transactions allow many threads to execute concurrently in a critical section unless they are accessing the same resource. The paper introduces a new synchronization primitive, the cxspinlock, that has the advantage of higher concurrency while retaining the capacity for exclusive access if required. Traditional concurrency handling is pessimistic in the sense that it assumes contention by default. The processes in this OS take an optimistic approach and obtain a non-exclusive lock initially, but when a conflict is detected, the transaction is restarted and the lock is upgraded to an exclusive one. Instead of writing many small critical sections protected by locks, we can have one long transaction without much loss in performance and with simpler code. A contention manager is used to resolve the priority inversion problem caused by locks. Transaction-aware scheduling also helps assign dynamic priorities that account for the impact of transaction state on system throughput.
Evaluation
The authors did an extensive performance evaluation of TxLinux. They ran various benchmarks to evaluate synchronization performance, concurrency, cxspinlock performance, contention management and scheduling. The os_prio policy eliminated priority inversion and policy inversion entirely at a performance cost of 2.5% for TxLinux-default and under 1% for TxLinux-sched. Transaction-aware scheduling reduced the total number of restarts and the total restart cycles wasted by 20.3% and 21.5% on average, respectively.
Confusion
Does the system have some intelligent way of handling resource conflicts? What if a huge transaction encounters a conflict just before it is about to end? Will it be restarted? If so, wouldn't that be a large overhead?

Posted by: Nikhil Collooru | March 26, 2015 02:20 AM

Summary:

This paper describes how an operating system could be built to use transactional memory rather than locks. Transactional memory improves code modularity while also allowing for better concurrency. The authors explain the mechanisms they implemented to deal with issues like priority inversion and handling I/O with transactional memory.

Problem:

The authors argue that handling concurrency in OS design leads to maintenance issues and potential lock related bugs like deadlocks. The authors attempt to solve these issues with transactional memory while incorporating locks and TM in the same system.

Contributions:

The biggest contributions of this paper are an isolation system that allows transactional and non-transactional processes to execute concurrently and a system that uses hardware and software to implement mutual exclusion. They implemented a contention manager to determine if a transaction is successful and to handle contention between transactional and non-transactional processes.

To enable all of this, they designed their system around cxspinlocks, which appear like locks to non-transactional processes and like markers for transactional regions of code to transactional processes.

Another interesting aspect of this paper is allowing processes to execute in multiple transactional contexts. This allows for transactional operations in system calls as well. However, this does raise the issue of I/O while executing in a transactional context.

The authors worked around this issue by removing the need of system calls in handling I/O.

They deal with priority inversion by making the contention manager priority aware and making the scheduling policy take the transactional status of a process into account. They implemented all of these ideas in a version of Linux that they dubbed TxLinux and compared the performance of the two.

Evaluation:

The authors compared the performance of their system, which consists of a simulated architecture with MetaTM and TxLinux with a baseline Linux distribution. They used a few benchmarks and simulated a number of processors to demonstrate the effectiveness of their system with increased contention.

While their performance numbers are decent, they aren't all that impressive. In fairness to the authors, it is still impressive for a first attempt and some of the issues they tackled (like reduced programming complexity and fewer kernel deadlock bugs) are tricky to quantify.

Confusion:
The authors talk very briefly about open-nested and closed-nested transactions. Could you please explain these and what advantages or disadvantages one implementation has over the other?

Posted by: Clint Lestourgeon | March 26, 2015 01:52 AM

Summary :
This paper talks about TxLinux, a version of Linux that also uses hardware transactional memory for synchronization. It mainly focuses on a new synchronization primitive called the cxspinlock, a cooperative transactional spin lock which dynamically switches between locks and transactions to achieve synchronization.

Problem :
Fine-grained locking is complex and also scales poorly in performance. Transactional memory follows the principles of atomicity and isolation and does not suffer from deadlock or lack of composability. The problem here is to design the OS with transactional memory for synchronization while also incorporating locking for scenarios like I/O where transactions are not suitable.

Contributions :
1. Makes use of the benefits of transactional memory as it is more modular than locks and provides scalability in performance by allowing multiple transactional threads to execute concurrently in the critical section.
2. Transactions need to be restarted when there are transactional conflicts. Cooperative transactional spinlocks are introduced that allow critical sections to be protected by either transactions or locking, whichever is suitable. Locking is mainly used for I/O, for protecting data structures read by hardware, and for critical sections that are under high contention and might cause restarts due to many transactional conflicts.
3. Two major functions - cx_optimistic and cx_exclusive are used to acquire cxspinlocks. cx_optimistic is used to protect critical sections using transactions and the transactional hardware takes care of isolation. cx_exclusive is used to provide true mutual exclusion. In the absence of active transactions, a non-transactional thread can acquire the spinlock and prevent any other thread from entering the critical section.
4. I/O is handled differently as it cannot be rolled back. Therefore, transactions that try to perform I/O are automatically restarted and acquire a normal lock.
5. Contention manager avoids priority inversion by using the os_prio policy that uses a conflict priority followed by size and age of transaction to select the transaction to run.
6. Per-thread transaction profiles enable the scheduler to provide both dynamic priority and conflict-reactive descheduling, which tends to deschedule threads that have a high probability of wasting CPU time due to restarts.

Evaluation :
An extensive performance analysis of TxLinux with multiple benchmarks has been done in the paper, tested with 16 and 32 CPUs. The following observations have been made: TxLinux wastes around 57% less time synchronizing than Linux for 16 CPUs and 1% more for 32 CPUs. It also reduces lock contention by eliminating around 37% of calls to lock routines. On measuring concurrency, 67% of the 284 critical regions have more than a single thread executing in parallel. Cxspinlocks have an extra overhead and cause slowdowns of 3.1% and 2.8% for 16 and 32 CPUs. The os_prio policy has been seen to get rid of priority and policy inversion at the cost of a 2.5% slowdown. On average, transaction-aware scheduling reduces the number of restarts by 20.3% and the restart cycles wasted by 21.5%.

Confusion :
Would like to know more about the deadlock that arises due to flat nesting of transactions when both spinlocks and transactions are used.

Posted by: Krishna Gayatri Kuchimanchi | March 26, 2015 01:12 AM

Summary:
The paper describes TxLinux, a variant of Linux which uses hardware transactional memory as a synchronization primitive and integrates it with the OS scheduler.

Problem:
With the increasing scale of chip multiprocessors, the challenge is to take advantage of multiple cores. Parallel programming is difficult with locks due to deadlocks, priority inversion, poor composability, complicated lock ordering and the performance-complexity trade-off. The authors propose transactional memory in the OS as a solution, as it benefits user programs and can simplify programming.

Contributions:
1. Cooperation between locks and transactions using cxspinlocks (Adv: concurrency of transactions and safety of locks):
- cxspinlocks are introduced as a synchronization primitive which uses both locks and transactions to protect the same shared data.
- the cx_optimistic and cx_exclusive primitives run the critical section as a transaction or with spin lock behavior, respectively.
- On I/O detection in a critical section, cx_optimistic is automatically changed to cx_exclusive.

2. Integration of Hardware Transaction Memory with OS Scheduler:
- Transactions are implemented using hardware support. Data is stored in hardware registers and cache, all actions are performed atomically in hardware, and results are written to main memory upon commit.
- Eliminates priority inversion: the contention manager favours the thread of higher priority on a conflict.

Evaluation:
Implemented on MetaTM simulator on Simics 3.0.27 machine, 16k 4-way tx L1 cache, 4MB 4-way L2, 1GB RAM, 1 cycle/inst, 16 cyc/L1 miss, 200 cyc/L2 miss, 16 and 32 processors. TxLinux with xspinlocks and 16 cpus is 2% slower than Linux. Pathological backoff in bonnie++ benchmark has 1.9% and 2% speed up with 16 and 32 cpus respectively. TxLinux with cxspinlocks showed 2.5% and 1% speedup over Linux with 16 and 32 cpus respectively.

Confusions:
"Virtualizing a transaction" concept is a bit unclear to me.

Posted by: Harneet Singh | March 26, 2015 01:10 AM

Summary:
The authors propose TxLinux as a variant of Linux which is the first OS to use hardware transactional memory for synchronization. This is primarily achieved using cxspinlocks, which dynamically decide between using locks and transactional memory for a particular operation by allowing a rollback if a lock is needed instead of a transaction.

Problem:
Programming complexity increases, and managing and debugging become more difficult, as core count increases. Traditional locks do not scale and are not robust. A lot of bugs in the Linux source code are due to synchronization. Using transactional memory solves some of the problems associated with locks, but transactions do not work with I/O requests, since these typically cannot be rolled back. The key problem being addressed here is to achieve the advantages of both transactions and locks by deciding between them dynamically. Since OS synchronization is a difficult and complex problem, the authors propose HTM as a solution to complex OS locking issues.

Contributions:

cxspinlocks: These allow critical regions to be protected by either transactions or locks, choosing dynamically between the two. In the optimistic mode, a critical section is first protected using a transaction, but if a conflict or I/O occurs, it rolls back and acquires a lock to perform the operation. Exclusive locks can be used for sections which always perform I/O. Since system calls and I/O are independent of each other, system calls make use of transactions explicitly. Cooperation between locks and transactions, and integrating transactions with the operating system: Using traditional locks sometimes causes higher priority threads to wait for lower priority threads, but the contention manager in TxLinux prevents this by resolving conflicts in favor of the high priority thread. The scheduler, for its part, uses the transaction state information to prioritize threads with current transactions to prevent contention.

Evaluation:
The paper provides comprehensive evaluation results for Linux vs. TxLinux on various parameters. For synchronization overhead, the authors compare spinlock acquires against transaction restarts on 16 and 32 CPUs, and the results show 57% less overhead for TxLinux on 16 CPUs, but slightly more overhead on 32 CPUs. In terms of priority inversion, TxLinux is able to eliminate the priority inversion problems associated with Linux.

Confusions:
In what cases will using locks and transactions together lead to deadlocks? Does the original claim of no deadlocks still stand, given that cxspinlocks can still lead to deadlocks under certain conditions?

Posted by: Tejaswi Agarwal | March 26, 2015 01:02 AM

1. Summary
This paper is about TxLinux, an operating system designed for use with transactional memory. It introduces a new primitive for mixing locks and transactions, the cxspinlock and evaluates its performance vs. standard Linux.
2. Problem
Multiprocessors with large numbers of nodes (thousands) are seen by the authors to be the way of the future. Unfortunately, parallel programming for these systems is significantly complex unless the fine grained nature of locking code is sacrificed.
3. Contributions
* The authors implement a new primitive, the cxspinlock, which allows locks and transactions to work together. Cxspinlocks allow code to be executed transactionally by default, and then dynamically transition to using exclusive locks if an I/O operation is performed.
* Additionally, transactions alleviate the priority inversion problem experienced with normal locks, and the TxLinux implementation even avoids the pitfalls of timestamp based transaction queuing by using a contention manager that respects OS priority.
* MetaTM is introduced, which is an interface for the OS to communicate scheduling priorities to the hardware conflict manager to avoid the hardware subverting the policies of the OS.
4. Evaluation
The authors built TxLinux and used the Simics machine simulator to test the hardware transactional memory support. In general, performance of the cxspinlock was very similar to standard Linux spin locks. The priority inversion protection tested by the authors was found to eliminate priority and policy inversion, but at a cost of ~1%-2.5% performance.
5. Confusion
I'm not quite clear on what situations trigger the switch from transaction to lock. I'd assume it's any operation that isn't hardware supported to be performed in a single transaction, which seems like most of them besides direct memory access?

Posted by: Peter Den Hartog | March 26, 2015 12:57 AM

1. Summary
The current paper discusses the implementation of HTM in Linux kernel and scheduler. The authors propose a new synchronization primitive that combines locks and transactions. In addition, they also discuss policies requiring cooperation between scheduler and HTM.
2. Problem
Transactional memory has several advantages including ease of programmability, composability and performance due to optimistic execution of the critical sections. But transactions have limitations for I/O, where the state cannot be undone on a restart, and for highly contended critical sections, which result in many restarts. On the other hand, locks have lower latency and perform well under contention. But they are unnecessary when executions of a critical region do not interfere. They also do not support composability, and they make it hard for the programmer to reason about deadlocks.
3. Contributions
A cooperative synchronization primitive called cxspinlock is proposed. It enables the use of HTM when concurrency is possible and uses locks when safety of locks is necessary, for example for I/O. Both transactional and non-transactional threads can co-exist. The execution state of the thread can be changed dynamically. A contention manager imposes policies on scheduling, priorities etc.
A mechanism is proposed to detect I/O operations and restart them with mutual exclusion required. This is done before any changes are done to the hardware.
As part of TxLinux implementation, a policy os_prio is proposed that enables more cooperation between the kernel scheduler and contention manager. This avoids problems of priority and policy inversion, asymmetric conflicts. The priority of a thread is computed before the transaction starts, based on three factors: the scheduler priorities, the working set sizes and transaction timestamps.
4. Evaluation
Evaluation is performed with changes to the Linux 2.6.16 scheduler on the Simics x86 machine simulator with 16-core and 32-core configurations. The synchronization overhead of spinning is measured for a set of benchmarks, and it shows that the spinning overhead is reduced, but the abort overheads are high. Another observation is that the concurrency in critical regions is low for the most part, indicating the overhead of TM for such code regions. I think the proposed mechanism works well under certain scenarios (high contention, concurrency), but the kernel code that the changes are made to does not quite represent this behavior. Thus, the experiments do not show any significant benefits.
5. Confusion
I did not clearly understand the part where the paper mentions flat-nesting and open-nesting. Can you please explain this?

Posted by: Bhuvana Kakunoori | March 26, 2015 12:34 AM

1. Summary
The paper describes the first implementation of HTM in an operating system (TxLinux). The main ideas introduced are 1) a new synchronization construct called cxspinlocks and 2) incorporating the HTM with the scheduler to avert priority inversion.

2. Problem
Current synchronization options (e.g, locks and semaphores) do not scale as well as transactional memory and are also not modular and lead to priority inversion. Transactions also have other advantages such as allowing for larger critical sections and greater concurrency. But current HTM designs are limited as they are not usable in certain scenarios such as I/O and require the co-existence of locks and transactions.

3. Contributions
cxspinlocks
-------------
- Allows for integration of transactions and locks as different executions of a critical section can be synchronized using a lock or a transaction.
- Multiple transactional threads may enter a critical region without conflicting.
- Transactional threads poll the cxspinlock without restarting. Important for nested transactions.
- When an I/O operation is performed, the current transaction is restarted with mutual exclusion.
- Kernel can use best-effort transactional hardware w/o virtualization. Only in the case of overflows is the transaction restarted in exclusive mode.

Scheduling
-------------
- MetaTM provides an interface for the OS to communicate scheduling priorities and policies (through a single integer called conflict priority).
- Contention manager chooses transactions in order of conflict priority, sizeMatters and timestamp policies.
- OS can query hardware to learn of transaction state. The OS can use this to make better scheduling decisions (e.g, re-schedule a process that has a current transaction soon.)
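
The ordering described in the Scheduling list can be sketched as a tuple comparison. This is only an illustrative sketch, not MetaTM's hardware logic: the `Txn` record, its field names, and the tie-break directions (larger working set wins, then the older timestamp wins) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """Hypothetical record of what a contention manager might see."""
    name: str
    conflict_priority: int  # set by the OS; higher means more important
    size: int               # working-set size, e.g. in cache lines
    start_time: int         # timestamp at transaction begin; smaller is older


def pick_winner(a: Txn, b: Txn) -> Txn:
    """Return the transaction that proceeds; the other one restarts."""
    def key(t: Txn):
        # Compare conflict priority first, then size, then age.
        # Negating start_time makes the older transaction compare higher.
        return (t.conflict_priority, t.size, -t.start_time)
    return a if key(a) >= key(b) else b


high = Txn("high_prio", conflict_priority=10, size=4, start_time=200)
low = Txn("low_prio", conflict_priority=1, size=64, start_time=100)
print(pick_winner(high, low).name)  # prints "high_prio"
```

Because conflict priority is compared first, the larger and older low-priority transaction still loses here, which is the behaviour that avoids priority inversion.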

4. Evaluation
- Time wasted due to synchronization is 57% less in TxLinux in the case of 16-core while 1% more in the case of 32-core
- TxLinux lowers lock contention and also eliminates 37% of calls to locking routines.
- Acquiring a cxspinlock adds additional overhead in the number of instructions and memory references. However, as the references are mostly to stack variables, which is optimized in x86, this overhead is not much (2%-3%).
- Around 9.5% of transactional conflicts result in priority inversion with the default scheduler. The modified scheduler is almost able to eliminate priority and policy inversion completely.
- The scheduler's knowledge of transaction state seems to have little effect on performance.

5. Confusion
How exactly is programmability enhanced through transactions? It seems that in the code, you just replace the locks with the transaction statements?
A comparison of transactions vs locks (or semaphores) in class would be useful.
I did not understand the part about decoupling I/O from system calls.
How are transactions different from the "synchronized" keyword in Java?

Posted by: Naveen Neelakandan | March 26, 2015 12:29 AM

Summary: This paper proposes a variant of Linux that uses hardware transactional memory (HTM) as a synchronization primitive and is the first to manage HTM in the scheduler. It achieves cooperation between locks and transactions (without sacrificing the advantages of either synchronization primitive) and the integration of transactions with the scheduler (eliminating priority inversion). Evaluation results show concurrency with 32 threads on 32 CPUs.

Problem: Transactional memory is a programming model that reduces parallel programming complexity while maintaining performance by executing critical regions atomically and in isolation, without locks. However, it is impractical to convert every instance of locking to a transaction. Therefore, cooperation between locks and transactions is necessary.

Contributions:
1. Implementing cxspinlocks, a mechanism for cooperation between transactional and lock-based synchronization of a critical region. It handles calls from transactional and non-transactional threads differently. Also, it handles I/O within transactions by allowing a transaction that performs I/O to automatically restart execution and acquire a conventional lock.

2. A HTM mechanism to nearly eliminate priority and policy inversion. The scheduling techniques use information from HTM to increase system throughput.

Evaluation:
The evaluation is on a 32-core machine simulator with 16KB L1 cache, 4MB L2 cache.
1. Benchmarks on basic OS routines (e.g. pmake, mab, find, config) show a reduction (57% on average) in time spent on synchronization.
2. Concurrency tests on the benchmarks show fewer spins on a cached value and fewer cache-coherent locked decrements.
3. Scheduler evaluation shows good concurrency.

Confusion: Are there any products that use transactional memory hardware?

Posted by: Shike Mei | March 26, 2015 12:22 AM

Summary

As multiprocessors scale to more and more cores, programming them becomes a bigger challenge; traditional lock primitives become increasingly complex to design correctly and also more prone to priority inversion. This paper proposes a hybrid hardware transactional memory (HTM) synchronization mechanism combined with a spinlock, which they implement in a Linux OS variant called TxLinux. HTM allows multiple threads into the same critical section; it kicks out threads on a write conflict, or falls back to a spinlock when I/O is required. Additionally, this hybrid solution uses a contention manager to solve conflicts, allowing it also to deal with difficult problems of priority and policy inversion.

Problem

Synchronization is complex, especially as the number of cores increases in a system. Locks are a common primitive used to achieve concurrency, but these suffer from many problems. One challenge is scalability and making sure they cover the minimum possible amount of data in critical sections. Another is lack of modularity; all system components using a certain lock need explicit knowledge about how other components use that lock. Additionally, the priority inversion problem can happen in situations where low-priority threads holding locks get starved.

Contributions

This paper makes several notable contributions. The first is the cooperative transactional spinlock (cxspinlock) which overcomes the one major problem of using HTMs as a synchronization primitive: managing I/O within a transaction. If the system notices an I/O operation, it transfers control to the cxspinlock which restarts the critical section under a traditional lock. HTM also provides a contention manager mechanism for eliminating priority inversion by simply resolving contention conflicts in favour of the higher priority thread.

Evaluation

To evaluate this new model, the authors built the TxLinux variant which implements their HTM + cxspinlock concurrency model and tested this on systems using 16 and 32 CPUs. They indicate that even at 32 cores the system only spends 12% of its time synchronizing, and so their results show that TxLinux and cxspinlocks perform at roughly the same speed as unmodified Linux and regular spin locks. Empirically, the modified system does not show any significant performance improvement.

One thing I wish they had measured is how many synchronization bugs exist under a cxspinlock system. I acknowledge this is a difficult metric to test and cannot be done in a short period of time; but this was a central tenet of the paper, and it seems they would have achieved a dramatic improvement in program correctness.

Confusions

I don't really understand how hardware transaction locks remove the programming complexity from synchronization. The programmer still has to choose the critical sections. So this model basically just allows sloppy code to not fail?

Posted by: Mark Coatsworth | March 26, 2015 12:04 AM

1. Summary
This paper introduces the API, implementation and performance of cxspinlocks, which are used to create a seamless connection between conventional locks and transactions to support better concurrent programming performance and easier and simpler building procedures.

2. Problems
To improve concurrent programming, hardware transactional memory and its hardware ISA were invented. It makes concurrent programming easier and simpler. However, it still cannot replace conventional locks. Its problems are: limitations in HTM design that prohibit transactions in certain scenarios such as performing I/O, because I/O cannot be rolled back; difficulty in converting every instance of locking to use transactions; and poorer performance when critical sections are highly contended. To solve these problems, this paper introduces the cxspinlock primitive, which enables a novel way of managing I/O within a transaction. It also ensures that the thread re-executes the critical region exclusively and provides a convenient API for converting lock-based code to transactions.

Spinlocks have a problem as well: in maintaining mutual exclusion, they lose the concurrency of transactions and lack fairness. This is solved by the cooperative transactional spinlock.

3. Contribution
From the paper's introduction, the authors state their contributions as follows:
1. A novel mechanism for cooperation between transactional and lock-based synchronization of a critical region.
2. A novel mechanism for handling I/O within transactions that allows a transaction that performs I/O to automatically restart execution and acquire a conventional lock.
3. An HTM mechanism to nearly eliminate priority inversion.
4. Insights and measurements from converting an operating system to use hardware transactional memory as a synchronization primitive.

4. Evaluation
The paper has a section evaluating the performance of TxLinux, which shows that transactions generally perform well on 16 and 32 CPUs. From this, the authors conclude that the limitation is scale. Besides, the results also show that the problem of priority inversion is eliminated in TxLinux.

5. Confusions
So “flat nesting” is flattening nested transactions into one big transaction, as is said in the paper. I can roughly imagine the procedure. However, the paper mentions there are other ways to nest transactions; what are they?

In my opinion, all the code finally becomes machine code. Why can the design improve the final machine code? For instance, when using locks, it is a basic rule to keep the critical section short, while this is not necessary with transactions. So there are other mechanisms that create the critical section for the only thread running in the critical section. Therefore it is still a critical section, isn’t it? Is it short? In other words, if this way of separating critical sections is efficient, why not use it directly with conventional locks? What’s the difference?

Posted by: Junhan Zhu | March 25, 2015 11:34 PM

Summary
TxLinux is an implementation of Linux that employs hardware transactional memory as a primitive for synchronization. TxLinux introduces cxspinlocks, which will either use HTM as the synchronization primitive or use locks if in the critical section I/O is performed. The advantage of using cxspinlocks as opposed to normal spin locks is that cxspinlocks allow for high concurrency, the transactions will only fail if the write set of one transaction intersects with the read or write set of another transaction. If a cxspinlock does fail, the transaction is simply restarted. If the transaction fails because I/O was encountered, the transaction will restart but will acquire a normal spin lock to prevent subsequent transactions from executing.
Problem
Synchronization programming using locks is typically hard to get correct and is often susceptible to deadlock and complicated code. It is hard to reason about the execution of a concurrent program using locks. Cxspinlocks attempt to solve this problem by using hardware transactional memory, which relieves the programmer of the difficulties of reasoning about locks. Cxspinlocks also offer higher concurrency with workloads that are low contention.
Contributions
A major contribution of this work is the concept of transactional memory, which is that the effects of a transaction are either all completed or none of them completed. No intermediate states are allowed to persist if the transaction fails. Transactions are denoted in the code by the programmer. When a transaction begins, the read set and write set are recorded. If the write set intersects with the read or write set of another transaction, the transactions are said to conflict and the conflict must be resolved by the contention manager. Different kinds of policies can be enforced by the contention manager as to which transaction fails. The policy of the contention manager can solve the priority inversion problem. Priority of a transaction is determined by its conflict priority, size, and then age. Using the concept of HTM, cxspinlocks were implemented to choose to either use transactions as the synchronization primitive or normal locks. Normal locks are needed when the critical section performs I/O because the effects of I/O cannot be undone.
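
The conflict rule described above (a write set intersecting another transaction's read or write set) can be written as a small set check. This is a sketch for illustration only: real HTM tracks these sets in hardware, typically at cache-line granularity, and the function name and the use of Python sets of addresses are assumptions.

```python
def conflicts(read_a: set, write_a: set, read_b: set, write_b: set) -> bool:
    """True if either transaction's write set overlaps the other's read or write set."""
    a_hits_b = bool(write_a & (read_b | write_b))
    b_hits_a = bool(write_b & (read_a | write_a))
    return a_hits_b or b_hits_a


# Two readers of the same address do not conflict.
print(conflicts({0x10}, set(), {0x10}, set()))    # False
# A writer conflicts with a reader of the same address.
print(conflicts({0x10}, {0x20}, {0x20}, set()))   # True
# Disjoint addresses: both can run in the same critical region.
print(conflicts({0x10}, {0x11}, {0x20}, {0x21}))  # False
```

The third case is the one a lock would serialize unnecessarily, which is where the extra concurrency of transactions comes from.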
Evaluation
Overall, the performance of TxLinux was favorable. The amount of time spent synchronizing in Linux compared to TxLinux was about the same at 1-14%. On a 16 CPU system, TxLinux performed considerably better than Linux, but slightly worse when the cores were increased to 32. This is mainly because with fewer cores, more cache misses for the lock variable are removed.
Confusions
How do transactions keep track of their read and write sets efficiently? Do transactions use locks to synchronize when accessing the read and write sets?

Posted by: Justin Moeller | March 25, 2015 11:14 PM

Summary :
This paper describes TxLinux, which was the first operating system to use hardware transactional memory as a synchronization primitive. The authors describe how concurrent programming can be simplified by using both spin locks and transactions as primitives, instead of just one of them.

Problem :
The authors point out that the number of cores on a chip has been scaling up and will continue to do so. In this setting, it is very difficult for the programmer to manage concurrency (if locking is used). But transactional memory can be very effective here because it helps the operating system maintain high performance and reduces the complexity for the programmer. Transactions also do not suffer from deadlock, do not cause priority inversions and are composable. However, transactions cannot be used in modules that perform I/O, and there locking works best. So, this paper takes the approach of combining the best of both worlds, using each in its appropriate environment and dynamically switching from one to the other.

Contributions :
1. Introducing cxspinlocks (Cooperative transactional spinlocks), a synchronization primitive that allows locks and transactions to work together to protect the same data while maintaining both of their advantages.
These can either be cx_optimistic (which runs the critical section as a transaction) or cx_exclusive (which provides a spin lock behavior). cx_optimistic primitive is converted to the cx_exclusive one when I/O is detected inside the module.
2. The system dynamically chooses between transactions and locks. A thread can execute in a critical region in a transaction, and when it performs I/O, the system will make the thread re-execute with a mutex lock.
3. HTM is very simple because it eliminates lock variables. Moreover, unlike a lock scenario, threads which are not accessing the same memory can concurrently run the same transaction thus improving performance.
4. Conflicts between transactions are detected and one of them is allowed to proceed. This decision is taken by the contention manager (some of which is implemented in hardware for performance reasons).
5. The system also provides information to the threads about their transactions using the "transaction status word".
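
The switch from cx_optimistic to cx_exclusive in items 1 and 2 can be illustrated with a toy, single-threaded model. This is a sketch under loud assumptions: `ToyCxLock`, `NeedsExclusive` and `do_io` are hypothetical names, the "transaction" is only a private copy of a dictionary, and nothing here models real HTM conflict detection or the actual TxLinux kernel API.

```python
import threading


class NeedsExclusive(Exception):
    """Raised when a transactional attempt reaches an operation that cannot be rolled back."""


class ToyCxLock:
    def __init__(self):
        self._lock = threading.Lock()

    def run(self, critical_section, shared: dict) -> str:
        # Optimistic attempt: work on a private copy so it can be discarded.
        buffered = dict(shared)
        try:
            critical_section(buffered, transactional=True)
        except NeedsExclusive:
            # Roll back (drop the copy) and re-execute under a conventional lock.
            with self._lock:
                critical_section(shared, transactional=False)
            return "exclusive"
        # Commit: publish the buffered writes.
        shared.update(buffered)
        return "optimistic"


def do_io(transactional: bool, log: list, msg: str) -> None:
    # I/O cannot be undone, so a transactional attempt must restart instead.
    if transactional:
        raise NeedsExclusive()
    log.append(msg)


io_log = []
state = {"count": 0}


def pure_update(data, transactional):
    data["count"] += 1


def update_with_io(data, transactional):
    data["count"] += 1
    do_io(transactional, io_log, "wrote count")


lock = ToyCxLock()
print(lock.run(pure_update, state))     # optimistic
print(lock.run(update_with_io, state))  # exclusive
print(state["count"], io_log)           # 2 ['wrote count']
```

Note that the counter ends at 2, not 3: the increment made during the failed optimistic attempt is discarded with the buffered copy, and the I/O happens exactly once, under the lock.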

Evaluations :
The authors provide results comparing synchronization in unmodified Linux and TxLinux. The system provides more concurrency (67%) when the transaction sizes are small. It is shown to reduce restarts by 20% compared to spin locks. Priority inversion is not always controlled here, but it decreases as the number of cores increases. Giving the scheduler knowledge of the current transactions is shown to improve performance by ~8%.

What I found confusing :
I would like to know more about virtualizing transactions.

Posted by: Anusha Dasarakothapalli | March 25, 2015 10:50 PM

1. Summary
TxLinux is a modification of the Linux kernel that uses hardware transactional memory as the synchronization primitive. It demonstrates that transaction based synchronization and spinlocks can co-exist in one system and presents transaction aware scheduling techniques.

2. Problem
Traditional locking mechanisms like spinlocks are generally slow and do not scale well to many core systems. This means a lot of time is spent in synchronization by the system and not doing useful work. Also spinlocks have the possibility of concurrency bugs like deadlocks.

3. Contributions
The main contribution of this paper is to show that transactions (in particular, hardware transactional memory) can be used in an operating system for synchronization between processes. In addition, this paper shows that transactions can exist along with spinlocks in the same system and that the process scheduling can be made transaction aware. The paper makes a case for using hardware transactional memory in the operating system. To support both transactions and spinlocks, TxLinux has co-operative transactional locks. It uses transactions as its synchronization primitive. However, if a thread does I/O, the thread is restarted and spinlocks are used for mutual exclusion. The authors make a distinction between I/O and system calls to take advantage of the fact that no actual device I/O happens during system calls and hence transactions can be used. Scheduling using transactions avoids the problems of priority and policy inversion to a large extent in TxLinux.

4. Evaluation
The paper presents evaluation of the system via measurements of synchronization, concurrency, performance of cxspinlock and scheduling. The authors use measurements on a set of workloads that are run on different number of CPUs in the system. Also, they present numbers of same measurements done in native linux for comparing the performance. The simulations do not show a significant difference with transaction aware scheduling in all workloads.

5. Confusion
I could not understand what the paper means by virtualizing a transaction.

Posted by: Mihir Patil | March 25, 2015 10:12 PM

Summary:
The paper looks at the use of hardware transactional memory as a synchronization primitive in Linux, and proposes cxspinlocks, a cooperative transactional spinlock. Extensive evaluation is done to study scheduling and which type of synchronization primitive to use in different scenarios.

Problem:
Using locks for scalable operating system performance with a large number of cores comes at a significant programming and code maintenance cost. Additionally, locks are not modular, are complex, and have disadvantages like priority inversion.

Solution:
HTM can be used as a synchronization primitive. The paper proposes cooperation between locks and transactions called cxspinlocks; and integration of transactions with the OS scheduler. cxspinlocks allow the system to attempt execution of critical regions with transactions and automatically roll back to use locking if the region performs I/O. Salient features of this solution are:
-MetaTM (transactional memory model used as part of the solution) is a standard cache-coherent shared memory multiprocessor that uses the cache-coherence mechanism.
-MetaTM supports only flat-nesting, useful for short transactions; it uses the transaction status word returned from xbegin to indicate first time or restart of a transaction.
-cxspinlocks are implemented for use at kernel level only without any virtualization, so if a transaction overflows the h/w limit, it is restarted in exclusive mode.
-transactional and non-transactional code can use the same critical section correctly while maintaining fairness and concurrency.
-system calls are decoupled from I/O by buffering the effect of the system call (started by the user) in memory.
-priority and policy inversion is avoided using a MetaTM interface for the OS to communicate scheduling priority and policy to the hardware contention manager. os_prio is used for contention management.

Evaluation:
The paper has included extensive evaluation for the proposed solution. The setup is based on Linux 2.6.16, and MetaTM is implemented as a hardware module. Overall the performance for transactions is good for 16 and 32 cpu cores. Some evaluation numbers are: for synchronization on 16 core setup, 57% less time than regular linux; 67% more concurrency; priority inversion trend is not strict but decreases with increase in processor numbers; restarts reduced by about 20%.

Concerns:
How will cxspinlock perform in case of directly accessible I/O devices, i.e. a user accessible cxspinlock environment?

Posted by: Nidhi Tyagi | March 25, 2015 09:48 PM

Summary - The paper introduces TxLinux, an operating system which uses a novel synchronization primitive called cooperative transactional spinlocks which effectively uses hardware support for transactional memory (HTM). TxLinux incorporates HTM statistics into its scheduling decisions to support priority scheduling and deadlock/livelock prevention concurrently.

Problem - Transactional memory simplifies code complexity for concurrent applications without sacrificing performance. HTM has proven to be a better alternative than locks for mutual exclusion in terms of dealing with issues such as deadlocks and composability. However, HTMs cannot handle operations that cannot be rolled back such as I/O operations, and suffer severe performance degradation in cases of high contention.

Contributions - The paper proposes a novel cooperative transactional spinlock mechanism to offer the programming simplicity and other virtues of HTM, while allowing a fallback option of using spinlocks when needed. The authors propose a scheme to be implemented in a contention manager to simultaneously balance preservation of priority scheduling and guaranteeing forward progress through the minimization of deadlocks/livelocks. Methods are also discussed to use the statistics available from HTM such as number of active transactions, read/write set sizes, number of restarts taken for making better scheduling decisions by being aware of contention so as to avoid scheduling conflicting transactions simultaneously.

Evaluation - The synchronization overheads in terms of the time spent in spinlock acquires and transaction restarts is analyzed for unmodified linux as well as TxLinux, and is comparable. The authors show the opportunity for concurrent execution in the Linux kernel code, by demonstrating that 67% of the critical sections have multiple transactions executing simultaneously. This seems flawed as the number of simultaneous transactions is logged on transaction entry, and doesn’t depict the number for successfully completing transactions. While the workloads selected do not show much performance impact with the improved transaction-aware scheduling, a self-written microbenchmark with greater contention and longer transactions demonstrates around 7% performance improvement.

Confusions - I had a question regarding the os_prio policy. Do you think it is a good idea to have a well defined order of parameters (i.e. the ordered tuple of {conflict priority, size, age}), given that even that doesn’t handle priority inversion in certain cases? This seems to just be a contention manager that prioritizes lack of priority inversion over making forward progress. Could an adaptive policy with dynamic weights for these metrics make more sense, with priority scheduling applied most of the time, but also allowing a lower priority transaction to make progress once in a while as its overall scheduling importance slowly increases? In their model, a low priority thread can never win a conflict while there exist high priority transactions.

Posted by: Swapnil Haria | March 25, 2015 09:23 PM

1. Summary
In the paper, "TxLinux: Using and Managing Transactional Memory in an Operating System", the authors give a model of Hardware Transactional Memory as an alternative for synchronization and show how HTM feedback mitigates priority and policy scheduling anomalies. They demonstrate how locks and transactions can be automatically and dynamically chosen by a system by using cooperative transactional spinlocks.

2. Problem
How to achieve high scalable OS performance using current synchronization primitives in a system with over a thousand cores while reducing coding complexity?

3. Contributions
- MetaTM: uses a flat nesting TM (nested transactions are flattened) with eager conflict detection (which means first memory access that causes the conflict makes one of the transactions to restart).
- Transactions: Lock based critical sections allow only a single thread to execute a critical section while other threads are waiting, thereby enforcing the programmer to make fine grained locks i.e short critical sections. Transactions, on the other hand, allows all threads to execute in a critical section, in parallel whenever possible, thereby reducing engineering effort.
- Cooperative transactional spinlocks: locking primitive to allow different executions of a single critical section to be synchronized with either locks or transactions thereby combining the concurrency benefits of transactions with the safety guarantee provided by locks.
- cx_exclusive: Threads which require mutual exclusion, including any thread which has been restarted because it requires mutual exclusion, enter this function; it excludes other threads from the critical section and internally uses the xcas instruction to obtain the spinlock without being unfair to a transactional thread.
- cx_optimistic: Threads call this function to attempt the critical section as a transaction; a transactional thread polls the lock using xtest rather than acquiring it.
- Contention Management: The contention manager of MetaTM, implemented by providing an interface through which the OS communicates scheduling priority and policy to the HW contention manager, almost eradicates the priority inversion problem caused by locks.

4. Evaluation
Experiments show that much time is spent on restarting transactions and on back-off in the case of cxspinlocks. As the number of CPUs increases, the performance of the cxspinlock in TxLinux becomes worse compared to that of Linux (32 processors). One advantage of the cxspinlock, compared to a conventional spinlock, is that it allows more concurrency when the average transaction sizes are small. The conflict management policy of TxLinux completely eliminates priority and policy inversion, with a small overhead. From the available metrics, we can see that the performance of cxspinlocks is comparable to locking. Though coding complexity is potentially reduced, cxspinlocks introduce new pathologies.

5. Confusion
Can you discuss virtualizing transactions in class?

Posted by: Shruthi Venkatesan | March 25, 2015 07:19 PM

1. Summary
This paper presents TxLinux, which uses Hardware Transactional Memory (HTM) as the synchronization primitive within the kernel. It introduces a new sync primitive called cxspinlock, which allows a single region of code to be synchronized through locks or txns on different executions. Information about scheduling priority is shared between the OS and the HW contention manager to avoid priority/policy inversions. Empirical evaluation suggests that TxLinux gets comparable performance to Linux with greatly simplified code dev and maintenance costs.

2. Problem
Lock-based coding is very hard to get right at fine granularity without significant development costs. Locks are not composable and preclude potentially safe concurrency. TM is modular, composable, avoids deadlocks & improves concurrency by allowing multiple threads into the critical region. However, txns can lead to excessive restarts during high contention and don’t work well with I/O within a txn. Txn conflict resolution can also lead to priority inversion. This work tries to integrate HTM into an OS by addressing its shortcomings.

3. Contributions
The cxspinlock primitive has two functions. cx_exclusive acquires an exclusive lock around the critical region, potentially rolling back any txn thread already active in the region. cx_optimistic tries to execute the region as a txn, but if a conflict or I/O is detected, it can be restarted and made to acquire an exclusive lock. I/O within a txn is handled by detecting accesses to an MMIO region or I/O instructions and causing all active txns to restart with an exclusive lock. Cxspinlocks, however, can lead to deadlocks and reduced concurrency. If multiple threads conflict within a txn, a contention manager (in HW/SW) decides which thread can progress. This manager can subvert OS scheduler policy/priority. Hence, TxLinux allows the OS scheduler to set a conflict priority for each process, which the contention manager uses first when resolving conflicts. Similarly, the OS can acquire txn information from HW counters and registers and make better scheduling decisions, for example by boosting the priority of a process with an active txn.
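
The optimistic-then-exclusive restart described above can be sketched as a toy, single-process model. This is only an illustration of the control flow, not the kernel code: `CxSpinlockModel`, `NeedExclusive`, `do_io`, and the two example bodies are hypothetical names, a `threading.Lock` stands in for the exclusive spinlock, and the rollback that real HTM hardware would perform on a restart is not modeled.

```python
import threading


class NeedExclusive(Exception):
    """Raised (in this model) when a transactional attempt hits I/O."""


class CxSpinlockModel:
    """Toy model of the optimistic-then-exclusive control flow."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the exclusive spinlock
        self.log = []                  # records the mode used by each attempt

    def run(self, body):
        # First attempt: behave like cx_optimistic and run the body as a txn.
        try:
            self.log.append("optimistic")
            return body(transactional=True)
        except NeedExclusive:
            # Real hardware would roll back the txn's memory writes here.
            pass
        # Restart: behave like cx_exclusive and hold the lock for the body.
        with self._lock:
            self.log.append("exclusive")
            return body(transactional=False)


def do_io(transactional):
    """Hypothetical I/O helper: I/O cannot be undone, so refuse inside a txn."""
    if transactional:
        raise NeedExclusive()
    return "io-done"


def update_counter(transactional):
    return "counter-updated"     # no I/O, so it completes as a txn


def write_device(transactional):
    return do_io(transactional)  # touches a device, so it needs the lock


cx = CxSpinlockModel()
print(cx.run(update_counter), cx.log)
print(cx.run(write_device), cx.log)
```

The caller never chooses the mode: the same critical section runs as a txn when it can and falls back to the lock only on the executions that actually perform I/O.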

4. Evaluation
Simulation experiments suggest that HTM does not add significant overhead due to restarts. Similarly, cxspinlock does not add much overhead over naked spinlocks. There is less benefit of TM’s concurrency in Linux kernel as the code is optimized to avoid lock contention. OS priority sharing reveals the near elimination of priority/policy inversions. Txn-aware scheduling has little benefit for realistic workloads, but seems good for conflict-intensive ones.

5. Confusions
It is said that cxspinlocks don’t execute in isolation. Isn’t this a deal-breaker?
Why have they considered adding spinlocks within txns in section 4.1?

Posted by: Aditya Venkataraman | March 25, 2015 07:13 PM

Summary
This paper describes the issues that transactional locking systems create, and how they can be solved to provide the concurrency and programmability of transactions with the performance of fine-grain locking.

Problem
Programming parallel systems (with many cores) is difficult. This is especially true with locks. Locking isn't modular, so global knowledge is necessary to avoid deadlock. Additionally, locking is used in cases where it is not necessary, and two different threads could pass through the same critical section concurrently.
Transactional memory holds the promise of easy programming while maintaining the fine-grained performance of locks. Unlike locks, it is modular and provides performance scalability. But transactions don't work for performing I/O, and locks perform more effectively in highly contended critical sections. Additionally, transactions are subject to priority/policy inversion, in which some lower-priority item can be preferred over the correct, higher-priority item.

Contributions
The solution to these problems is hybrid cxspinlocks, which allow the same critical sections to be executed by either locks or transactions. This means I/O and other lock-necessary operations can be executed safely, while concurrency can also be managed. There are two forms of cxspinlock functions introduced, which are used to achieve performance. cx_optimistic tries to execute concurrently. If mutual exclusion is required, the transaction will be restarted and the critical section will be executed exclusively. cx_exclusive is used when mutual exclusion is absolutely necessary. The xcas instruction is a mechanism that allows a contention manager to implement a specific policy for what access control should be used.

A number of interesting scheduling mechanisms are also introduced to solve the priority inversion problem and to improve performance. os_prio is a policy used to induce a total order on scheduling that is decided before xbegin is run. This order is determined by OS priority, then by transaction size, and finally by timestamp. Because the value is calculated before the transaction begins, deadlock and livelock are eliminated. The Linux scheduler is also modified to read transaction state provided by hardware. The scheduler uses these hardware values to boost the priority of threads currently running a transaction, and to temporarily deschedule threads that are wasting work due to transaction restarts. This can improve performance and remove many issues related to transaction restarts.
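
The three-level ordering can be sketched as a sort key. This is a minimal sketch under assumptions that the review does not state: that a larger conflict priority wins, that the larger transaction wins a priority tie, and that the older timestamp wins a size tie. `TxnInfo`, `rank`, and `winner` are hypothetical names, not the paper's interface.

```python
from typing import NamedTuple


class TxnInfo(NamedTuple):
    name: str
    conflict_priority: int  # set by the OS; assumed: larger = more important
    size: int               # assumed: e.g. cache lines touched so far
    timestamp: int          # start time; assumed: smaller = older


def rank(t: TxnInfo):
    """Sort key: higher priority first, then larger size, then older start."""
    return (-t.conflict_priority, -t.size, t.timestamp)


def winner(a: TxnInfo, b: TxnInfo) -> TxnInfo:
    """Pick which of two conflicting txns continues; the other restarts."""
    return min(a, b, key=rank)


high = TxnInfo("rt_task", conflict_priority=10, size=2, timestamp=105)
low = TxnInfo("batch", conflict_priority=1, size=50, timestamp=100)
print(winner(high, low).name)  # rt_task: OS priority outranks size and age
```

Because every pair of transactions with distinct timestamps compares strictly under this key, two transactions can never each decide that the other should win, which is the property the total order is meant to provide.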

Evaluation
The authors evaluate a number of workloads to compare their transactional system against basic Linux. They first (TxLinux-SS) converted Linux by hand, modifying sections that experienced high contention with care to avoid I/O transaction problems. Their second conversion (TxLinux-CX) converted spinlocks to cx_optimistic calls, and used cx_exclusive calls for the necessary difficult sections of code. They find that TxLinux-SS wastes 57% less time with 16 CPUs (likely due to fewer cache misses on locks), and wastes 1% more time with 32 CPUs. They find that one specific test, Bonnie++, doubles wait time due to a function which quickly changes variables and causes starvation through transaction restarts.
They also find that the overhead of using cxspinlocks instead of regular spinlocks is negligible.
Finally, the scheduling aspects of the system are evaluated. The authors measure that in default Linux, 9.5% of transactional conflicts result in priority inversion. The previously mentioned changes to the kernel scheduler eliminate this problem entirely, at less than a 3% overhead. Changing the scheduler to use knowledge of current transaction state improves performance by 8% and 6% for 16 and 32 cores, respectively. The authors have shown that transactions can provide similar performance to locks, and can be easily integrated into existing systems.

Questions
How does a transaction know whether mutual exclusion is required? Basically who defines that, or is it automatically detected (due to things like I/O)?

Posted by: Michael Bauer | March 25, 2015 04:24 PM
