

- Locks
  - Contention limits performance
  - Contention limits scalability
  - Cause unnecessary memory traffic
  - Involve long-latency memory accesses
- Programmers can use finer-grained locks
  - Time-consuming
  - Difficult and error-prone





























### **Buffering speculative state**

- Register state
  - No need to track inter-instruction dependences
  - One checkpoint already available for restoring
  - Speculatively retire instructions
- Memory state
  - Augment write buffer to store speculative data
    - Any other buffering technique is also acceptable
  - Speculative data never exposed until commit
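
The buffering scheme above can be sketched as a toy software model. This is an illustration only, not the hardware design from the talk: the class `SpeculativeCore`, its method names, and the dictionary-based memory are all hypothetical. It shows a single register checkpoint, stores held in a write buffer during speculation, and buffered data becoming visible only at commit.

```python
class SpeculativeCore:
    """Toy model of one core buffering speculative state (hypothetical names)."""

    def __init__(self, memory, registers):
        self.memory = memory              # shared dict: address -> value
        self.registers = dict(registers)  # architectural register state
        self.checkpoint = None            # the single register checkpoint
        self.write_buffer = {}            # speculative stores: address -> value

    def begin_speculation(self):
        # One checkpoint is enough; no per-instruction dependence tracking.
        self.checkpoint = dict(self.registers)
        self.write_buffer.clear()

    def store(self, addr, value):
        if self.checkpoint is not None:
            self.write_buffer[addr] = value  # hidden from other cores
        else:
            self.memory[addr] = value

    def load(self, addr):
        # Forward from the write buffer so the core sees its own stores.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory.get(addr, 0)

    def commit(self):
        # Expose all buffered stores together, then drop the checkpoint.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()
        self.checkpoint = None

    def abort(self):
        # Discard speculative stores and restore the register checkpoint.
        if self.checkpoint is None:
            return
        self.write_buffer.clear()
        self.registers = self.checkpoint
        self.checkpoint = None
```

In this model, a store made between `begin_speculation()` and `commit()` is readable by the core itself but leaves `memory` untouched, and `abort()` rolls both registers and buffered stores back.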

#### **Detecting data conflicts**

- Data conflicts
  - Two threads access the same data block and at least one of them writes it
  - Extend cache with speculative access bits
    - Track data block accesses
- Coherence protocol provides functionality
  - Writes to shared locations generate invalidations
  - Zero cost detection
- Lock kept in shared state
  - Automatically notified if anyone acquires lock
  - Misspeculation may result in explicit lock acquisition
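
A minimal sketch of the conflict check, assuming a software stand-in for the speculative access bits. The class `SpeculativeCache`, its methods, and the example addresses are hypothetical. The 64-byte block size matches the line size listed in the simulation parameters. An incoming write (invalidation) conflicts with any local speculative access to the block, while an incoming read conflicts only with a local speculative write.

```python
LINE_SIZE = 64  # bytes per data block, as in the simulated configuration


class SpeculativeCache:
    """Tracks which blocks were accessed speculatively (hypothetical model)."""

    def __init__(self):
        self.spec_read = set()     # blocks with the speculative-read bit set
        self.spec_written = set()  # blocks with the speculative-write bit set

    @staticmethod
    def line_of(addr):
        return addr // LINE_SIZE

    def record_load(self, addr):
        self.spec_read.add(self.line_of(addr))

    def record_store(self, addr):
        self.spec_written.add(self.line_of(addr))

    def snoop(self, addr, remote_is_write):
        """Return True if an incoming coherence request is a data conflict."""
        line = self.line_of(addr)
        if remote_is_write:
            # Remote write vs. any local speculative read or write.
            return line in self.spec_read or line in self.spec_written
        # Remote read conflicts only with a local speculative write.
        return line in self.spec_written
```

Because the elided lock is only read, recording it with `record_load` means a later acquisition by another thread (a write to the lock's block) shows up as a conflict through the same check.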












#### Outline

- Speculative Lock Elision concept
- Speculative Lock Elision details
- **Results**
- Concluding remarks





#### Outline

- Motivation
- Speculative Lock Elision concept
- Speculative Lock Elision details
- Results
- Concluding remarks
  - SLE advantages/limitations
  - SLE summary


#### **SLE advantages**

- Implementation
  - Uses well-understood speculation techniques
  - Much functionality already present: coherence protocol
  - Transparent to programmers
    - No software or instruction set support
  - No system-level changes
    - Completely in the microarchitecture
    - No coherence protocol changes
- Performance
  - Lock never written to
    - No serialization on lock
    - Kept locally in shared state
  - Concurrent execution
  - Reduced observed memory latencies
  - Reduced memory traffic

### **SLE limitations**

- Extra data paths and hardware
- Lock acquired for misspeculation conditions other than conflicts
  - Resource constraints, I/O, certain types of system calls
    - Get same performance and behavior as without SLE
- Sensitive to restart threshold
  - Coherence protocol interference
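
The fallback behavior can be sketched as a retry policy. This is an assumption-laden illustration: the function `run_critical_section`, its callbacks, and the reading of "restart threshold" as the number of speculative attempts allowed before the lock is acquired are not taken from the slides.

```python
def run_critical_section(try_speculatively, run_with_lock, restart_threshold=1):
    """Attempt elision up to restart_threshold times, then acquire the lock.

    try_speculatively() returns True if the speculative run committed and
    False on misspeculation. run_with_lock() executes the critical section
    with the lock actually acquired, i.e. the behavior without SLE.
    """
    for attempt in range(restart_threshold):
        if try_speculatively():
            return ("elided", attempt + 1)
    # Threshold reached: fall back to explicit lock acquisition.
    run_with_lock()
    return ("locked", restart_threshold)
```

A low threshold gives up on elision quickly under contention, while a high one keeps restarting and generating coherence traffic, which is one way to read the sensitivity noted above.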


# SLE contributions

- Enables highly concurrent multithreaded execution
  - Concurrently execute non-conflicting critical sections
  - Validation without writing to/acquiring lock
- Simplifies correct multithreaded code development
  - Programmers do not have to learn new techniques

- Easy implementation
  - Does not require software or instruction set support
  - Minor changes to modern processors
- Opens up optimization opportunities...
  - Can provide strong performance guarantees even with conflicts?

# Simulation parameters for SMP

| Parameter            | Value                                                                   |
|----------------------|-------------------------------------------------------------------------|
| Processor speed      | 1 GHz (1 ns clock)                                                      |
| Reorder buffer       | 128-entry with a 64-entry LSQ                                           |
| Issue mechanism      | Out-of-order issue/commit of 8 ops per cycle                            |
| Branch predictor     | 8-K entry combining predictor, 8-K 4-way BTB                            |
| Write buffer         | 64-entry                                                                |
| L1 instruction cache | 64-KB, 2-way, 1-cycle access, 8 pending misses                          |
| L1 data cache        | 128-KB, 4-way, write-back, 3 ports, 1-cycle access, 8 pending misses    |
| L2 unified cache     | 4-MB, 4-way, write-back, 12-cycle access, 16 pending misses             |
| Line size            | 64 bytes                                                                |
| Coherence protocol   | Sun Gigaplane-type MOESI protocol between L2s, split transaction        |
| Address network      | Broadcast snooping, 20 ns per snoop, 120 outstanding transactions       |
| Data network         | Point-to-point, pipelined, 70 ns transfer latency                       |
| DRAM memory module   | 8 bytes wide, ~70 ns access time for a 64-byte line                     |
| Memory consistency   | Total Store Ordering                                                    |

More configurations in paper.

backup

- Concurrency techniques
  - Lamport
  - Transactional memory
  - Load-Linked/Store-Conditional
- Database techniques
  - Optimistic concurrency control
- Speculative buffering
  - ARB (Franklin & Sohi)
  - SVC (Gopal et al.)
  - Speculative Retirement (Ranganathan et al., Lai & Falsafi)
  - Multiversion Memory (Sorin et al.)



