Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Major Points:

* Reduce synchronization cost for fine-grained threads on SMT processors.

* Implementation of a thread-shared hardware "lock box".

- Simple and scalable hardware design.

- Each lock box entry holds the lock address, a pointer to the instruction acquiring the lock, and a valid bit.

- Guarantees starvation prevention and deadlock avoidance.
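
A minimal software sketch of how such a lock box could behave, assuming one entry per hardware context and a release that hands the lock directly to a blocked waiter. The names `LockBox`, `LockBoxEntry`, `acquire`, and `release` are hypothetical, the `held` set is only a stand-in for lock words in memory, and the fairness policy that would provide starvation prevention is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class LockBoxEntry:
    valid: bool = False   # valid bit: entry holds a blocked acquire
    lock_addr: int = 0    # address of the lock being waited on
    acquire_pc: int = 0   # pointer to the blocked acquire instruction


class LockBox:
    """Toy model with one entry per hardware context (assumed layout)."""

    def __init__(self, num_contexts):
        self.entries = [LockBoxEntry() for _ in range(num_contexts)]
        self.held = set()  # stand-in for lock words in memory that are held

    def acquire(self, ctx, lock_addr, pc):
        """Return True if the lock is granted, False if the context blocks."""
        if lock_addr not in self.held:
            self.held.add(lock_addr)
            return True
        # Lock is taken: record the waiting acquire in this context's entry.
        self.entries[ctx] = LockBoxEntry(True, lock_addr, pc)
        return False

    def release(self, lock_addr):
        """Hand the lock to a waiter if one exists; return its context or None."""
        for ctx, entry in enumerate(self.entries):
            if entry.valid and entry.lock_addr == lock_addr:
                entry.valid = False  # waiter now owns the lock; it stays held
                return ctx
        self.held.discard(lock_addr)  # no waiter: mark the lock free
        return None
```

In this sketch the search in `release` stands in for an associative compare of the released address against all entries; a blocked thread never spins, it simply waits until its entry is matched.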

* Speculative prediction of lock release.

- Prediction based on thread ID and PC history.

- Reduces the critical path from 15 cycles to 9 cycles.

- Yields an additional performance gain of 40%.

Synchronization Mechanisms:

* Spin locks built on test-and-set or load-locked/store-conditional instructions.

* Full/empty bits associated with each memory block.

* Full/empty bits associated with registers.

* Shared registers for synchronization.
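
For comparison with the first mechanism above, here is a rough sketch of a test-and-set spin lock. It is an illustration only: the class `TasWord` and the helpers `spin_acquire`, `spin_release`, and `run_counter` are hypothetical names, and the atomicity of test-and-set is emulated with a Python mutex rather than a hardware instruction.

```python
import threading
import time


class TasWord:
    """A memory word with test-and-set; atomicity is emulated with a mutex."""

    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()

    def test_and_set(self):
        # Atomically read the old value and set the word to 1.
        with self._atomic:
            old = self._value
            self._value = 1
            return old

    def clear(self):
        with self._atomic:
            self._value = 0


def spin_acquire(word):
    # Retry until the old value was 0, meaning this thread took the lock.
    while word.test_and_set() == 1:
        time.sleep(0)  # yield so the lock holder can make progress


def spin_release(word):
    word.clear()


def run_counter(num_threads=4, iters=200):
    """Increment a shared counter under the spin lock and return the total."""
    word = TasWord()
    total = [0]

    def worker():
        for _ in range(iters):
            spin_acquire(word)
            total[0] += 1
            spin_release(word)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The retry loop in `spin_acquire` is the cost that matters on an SMT: a waiting thread keeps executing instructions while it waits, competing with the lock holder for shared execution resources.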

Goals for SMT synchronization:

* High performance.

* Resource-conservative.

* Deadlock-free.

* Scalable.

Evaluation:

* Trace-driven simulation. Can this give accurate performance results?

* Are locks really as frequent in applications as the authors claim?

* The authors find only 6 loops across several benchmarks and base their conclusions on the performance of those loops alone.