

# Piranha

A Scalable Architecture Based on Single-Chip Multiprocessing

Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, Ben Verghese



COMPAQ.

# Increasing Complexity of Processor Designs

- Pushing limits of instruction-level parallelism
  - multiple instruction issue
  - speculative out-of-order (OOO) execution
- Driven by applications such as SPEC
- Increasing design time and team size

| Processor (SGI MIPS) | Year Shipped | Transistor Count (millions) | Design Team Size | Design Time (months) | Verification Team Size (% of total) |
|----------------------|--------------|-----------------------------|------------------|----------------------|-------------------------------------|
| R2000                | 1985         | 0.10                        | 20               | 15                   | 15%                                 |
| R4000                | 1991         | 1.40                        | 55               | 24                   | 20%                                 |
| R10000               | 1996         | 6.80                        | >100             | 26                   | >35%                                |

Courtesy: John Hennessy, IEEE Computer, 32(8)

- Yielding diminishing returns in performance


# Importance of Commercial Applications

Approximate Breakdown of



- Total server market size in 1998: ~\$50-60B
  - technical applications: less than \$5B
  - commercial applications: over \$35B


# Challenges for Commercial Applications



- Memory system dominant factor in overall performance [Barroso et al., ISCA'98]
- Small gains from multiple instruction issue and OOO execution [Ranganathan et al., ASPLOS'98]
- No use for floating-point and multimedia functionality
- Further calls into question the viability of more complex processors




# Piranha Project

- Explore chip multiprocessing for scalable servers
- Focus on parallel commercial workloads
- Small team, modest investment, short design time
- Address complexity by using:
  - simple processor cores
  - standard ASIC methodology
- Piranha's CMP approach extremely compelling for server workloads with explicit thread-level parallelism

# Outline

- Background
- Piranha Architecture
- Performance Evaluation
- Summary

# Piranha Processing Node

- Alpha core: 1-issue, in-order, 500 MHz



















# L2 Cache and Intra-Node Coherence

- 8 banks based on cache line address interleaving
- No inclusion between L1s and L2 cache
  - total L1 capacity equals L2 capacity
  - L2 misses go directly to L1
  - L2 filled by L1 replacements
- L2 keeps track of all lines in the chip
  - sends Invalidates, Forwards
  - orchestrates L1-to-L2 write-backs to maximize chip-memory utilization
  - cooperates with Protocol Engines to enforce system-wide coherence

# Protocol Characteristics

- 'Stealing' ECC bits for memory directory (per 64-byte line: groups x (data bits + ECC bits + freed bits))
  - 8 x (64+8)
  - 4 x (128+9+7)
  - 2 x (256+10+22)
  - 1 x (512+11+53)
- Directory (2b state + 40b sharing info)
  - dual representation: limited pointer + coarse vector
- "Cruise Missile" Invalidations (CMI)
  - limit fan-out/fan-in serialization with CV
- Several new protocol optimizations
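
The ECC layouts listed under Protocol Characteristics, 8 x (64+8) through 1 x (512+11+53), can be checked with a short calculation. The sketch below assumes a standard SEC-DED Hamming code (the slide does not name the code) and 576 stored bits per 64-byte line, as in the conventional 8 x (64+8) layout. The function and constant names are illustrative only.

```python
LINE_BITS = 512          # one 64-byte cache line
STORED_BITS = 576        # 8 x (64 + 8): bits stored per line with 64-bit ECC words
DIRECTORY_BITS = 2 + 40  # 2b state + 40b sharing info


def secded_check_bits(data_bits: int) -> int:
    """Check bits for a SEC-DED Hamming code over `data_bits` (assumed code)."""
    r = 0
    # Smallest r such that r check bits can locate any single-bit error.
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit for double-error detection


def freed_bits(granularity: int):
    """Return (groups, ecc_bits, freed_per_group, freed_per_line)."""
    groups = LINE_BITS // granularity
    ecc = secded_check_bits(granularity)
    per_group = STORED_BITS // groups - granularity - ecc
    return groups, ecc, per_group, groups * per_group


if __name__ == "__main__":
    for g in (64, 128, 256, 512):
        groups, ecc, per_group, total = freed_bits(g)
        fits = total >= DIRECTORY_BITS
        print(f"{groups} x ({g}+{ecc}+{per_group}): "
              f"{total} bits freed per line, directory fits: {fits}")
```

Under these assumptions, computing ECC over 256-bit groups frees 2 x 22 = 44 bits per line, which is enough for the 42-bit directory entry (2b state + 40b sharing info).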

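The dual directory representation (limited pointer + coarse vector) can be illustrated with a minimal sketch. Only the 40-bit sharing field comes from the slide. The system size of 1024 nodes, the switch-over rule, and the names `encode_sharers` and `invalidation_targets` are assumptions made for illustration, not the Piranha encoding itself.

```python
SHARING_BITS = 40  # sharing-info field width from the slide


def encode_sharers(sharers, num_nodes=1024):
    """Encode a set of sharer node IDs (num_nodes is an assumed system size)."""
    ptr_bits = max(1, (num_nodes - 1).bit_length())  # bits per node pointer
    max_ptrs = SHARING_BITS // ptr_bits              # pointers that fit in 40 bits
    if len(sharers) <= max_ptrs:
        # Limited-pointer form: exact list of sharers.
        return ("pointers", sorted(sharers))
    # Coarse-vector form: each bit stands for a group of consecutive nodes.
    group = -(-num_nodes // SHARING_BITS)            # ceiling division
    vector = 0
    for node in sharers:
        vector |= 1 << (node // group)
    return ("coarse", vector)


def invalidation_targets(encoding, num_nodes=1024):
    """Nodes that must receive an invalidation for this encoding."""
    kind, payload = encoding
    if kind == "pointers":
        return list(payload)
    group = -(-num_nodes // SHARING_BITS)
    # A set bit covers every node in its group, so this is a superset of sharers.
    return [n for n in range(num_nodes) if (payload >> (n // group)) & 1]


if __name__ == "__main__":
    print(encode_sharers({3, 700}))
    few_too_many = encode_sharers({1, 2, 3, 4, 5})
    print(few_too_many, len(invalidation_targets(few_too_many)))
```

The pointer form is exact but holds only a few sharers. The coarse vector handles any number of sharers at the cost of sending invalidations to whole groups of nodes.
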
# Experimental Methodology

- Workloads
  - TPC-B (OLTP): 600 MB SGA, 500 transactions, 8 processes/CPU
  - TPC-D Q6 (DSS): 500 MB SGA, 4 processes/CPU
  - Oracle DBMS
- Simulation Environment: SimOS-Alpha
  - full system simulation (includes OS)
  - CPU models:
    - single-issue, in-order, blocking caches
    - speculative out-of-order (OOO), non-blocking caches
  - shared L2 cache
  - NUMA model


# Simulated Architectures

| Parameter               | Piranha (P8)  | Next-Generation Microprocessor (OOO) | Full-Custom Piranha (P8F) |
|-------------------------|---------------|--------------------------------------|---------------------------|
| Processor Speed         | 500 MHz       | 1 GHz                                | 1.25 GHz                  |
| Type                    | in-order      | out-of-order                         | in-order                  |
| Issue Width             | 1             | 4                                    | 1                         |
| Instruction Window Size | -             | 64                                   | -                         |
| Cache Line Size         | 64 bytes      | 64 bytes                             | 64 bytes                  |
| L1 Cache Size           | 64 KB         | 64 KB                                | 64 KB                     |
| L1 Cache Associativity  | 2-way         | 2-way                                | 2-way                     |
| L2 Cache Size           | 1 MB          | 1.5 MB                               | 1.5 MB                    |
| L2 Cache Associativity  | 8-way         | 6-way                                | 6-way                     |
| L2 Hit / L2 Fwd Latency | 16 ns / 24 ns | 12 ns / NA                           | 12 ns / 16 ns             |
| Local Memory Latency    | 80 ns         | 80 ns                                | 80 ns                     |
| Remote Memory Latency   | 120 ns        | 120 ns                               | 120 ns                    |
| Remote Dirty Latency    | 180 ns        | 180 ns                               | 180 ns                    |
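
The three designs run at different clock rates, so the same latency in nanoseconds costs a different number of core cycles on each. The sketch below converts the table's latencies to cycles. The dictionary layout and the helper name `ns_to_cycles` are illustrative, and only values from the table are used.

```python
# Clock rates in MHz, taken from the "Processor Speed" row.
CLOCK_MHZ = {"P8": 500, "OOO": 1000, "P8F": 1250}

# Latencies in ns, taken from the table (L2 hit differs per design).
L2_HIT_NS = {"P8": 16, "OOO": 12, "P8F": 12}
MEMORY_NS = {"Local Memory": 80, "Remote Memory": 120, "Remote Dirty": 180}


def ns_to_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert a latency in nanoseconds to core clock cycles."""
    return latency_ns * clock_mhz / 1000


if __name__ == "__main__":
    for design, mhz in CLOCK_MHZ.items():
        row = {"L2 Hit": ns_to_cycles(L2_HIT_NS[design], mhz)}
        for name, ns in MEMORY_NS.items():
            row[name] = ns_to_cycles(ns, mhz)
        print(design, row)
```

A local memory access of 80 ns is 40 cycles for P8 but 80 cycles for OOO and 100 cycles for P8F, so each miss stalls the faster cores for more cycles.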




- Piranha's performance margin over OOO is 3x for OLTP and 2.2x for DSS
- Piranha has more outstanding misses, so it better utilizes the memory system








# Implementation & Status

- RTL-level C++ simulator
  - allows mixed C++/Verilog execution
- ASIC methodology
  - using Verilog and industry-standard CAD tools
  - IBM ASIC process
- Engagement of Compaq's NonStop Hardware Division
- Current status
  - Alpha core (in Verilog) under debug and synthesis for timing
  - other modules at code completion stage and being translated to Verilog

# Related Work on CMP

- Hydra [Hammond ASPLOS'98] and other CMP TLDS work
  - thread-level data speculation not needed for commercial workloads
- MAJC [Tremblay Microprocessor Forum'99]
  - focuses on client appliances
- IBM Power4 [Diefendorff Microprocessor Report, Oct.'99]
  - most similar in focus to Piranha
  - opts for fewer, larger cores

# Summary

- Commercial workloads are rich in explicit thread-level parallelism and poor in ILP
- CMP is an excellent match to this application domain
- Piranha explores an extreme point in CMP design
  - use many simple cores
  - aggressively optimize memory and interconnect systems
- Piranha departs from increasing core complexity trends
- CMP is inevitable in future systems
  - key questions are:
    - number and complexity of CPU cores
    - best partitioning of on-chip memory hierarchy

# Additional Piranha Team Members

- Research: Joel McCormack, Mosur Ravishankar
- NonStop Hardware Division: Tom Heynemann, Dan Joyce, Harland Maxwell, Harold Miller, Brian Robinson, Sanjay Singh, Jeff Sprouse
- Former contributors: Basem Nayfeh, Joan Pendleton, Daniel Scales