













| Summary                                                                                                                                                              |       |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
| <ul> <li>Principled approach         <ul> <li>Programs encoded as blocks                 <ul></ul></li></ul></li></ul>                                               | works |
| <ul> <li>Results:         <ul> <li>Up to 5X better performance than best supersca<br/>processors</li> <li>Prototype system expected in summer</li> </ul> </li> </ul> | lar   |
|                                                                                                                                                                      | 8     |

























| Inter-Block                                                              | Concurrency                                                                   |
|--------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| Logically blocks execute sequ                                            | entially                                                                      |
| Fetch Execute Com                                                        | mit<br>Fetch Execute Commit                                                   |
| <ul> <li>Reservation stations and cont<br/>instruction window</li> </ul> | rol speculation enables deep                                                  |
| FetchExecuteFetchExecuteFetchFetchPredictFetch                           | Commit       Ite     Commit       Execute     Commit       Execute     Commit |
|                                                                          | 21                                                                            |













| Prototype Design                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Design <ul> <li>Modularity reduced complexity: Specification → Physical design</li> <li>SoC-like but tiles form one large uniprocessor</li> </ul> </li> <li>Verification <ul> <li>Hierarchical verification (265 bugs total)</li> <li>Tile-level, processor-level, chip-level</li> <li>Performance verification (16 bugs total)</li> </ul> </li> <li>Lessons <ul> <li>Clean predicate model and simple block exit path</li> <li>Register renaming design revised, full search done once</li> <li>H/W prototype design helped push s/w toolchain flow <ul> <li>Compiler heuristics, register allocator, scheduler</li> <li>Block predictor design complexity ⇒ 3-cycles to predict</li> <li>Significant router area (12%), routing logic on critical path</li> <li>LSQ replication consumed significant area</li> <li>Ongoing work addresses this challenge</li> </ul> </li> </ul></li></ul> |
| 28                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |



| • | Introduction                                                    |  |
|---|-----------------------------------------------------------------|--|
|   | <ul> <li>Scaling conventional microarchitectures</li> </ul>     |  |
|   | <ul> <li>Summary of contributions</li> </ul>                    |  |
| • | Irregular Concurrency                                           |  |
|   | - EDGE: A class of ISAs for concurrency                         |  |
|   | <ul> <li>TRIPS: Concurrency in the microarchitecture</li> </ul> |  |
|   | <ul> <li>Performance results</li> </ul>                         |  |
|   | <ul> <li>TRIPS Prototype chip</li> </ul>                        |  |
| • | Extensions for regular concurrency                              |  |
| • | Future work                                                     |  |
| • | Conclusions                                                     |  |















| Summary                                                                                                                                                                                                |       |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
| <ul> <li>Three design principles         <ul> <li>Rebalance compiler and hardware effort</li> <li>Amortize instruction level overheads</li> <li>Eliminate centralized resources</li> </ul> </li> </ul> |       |
| <ul> <li>EDGE: A new class of ISAs to match technology</li> <li>Block atomic execution</li> <li>Instruction-to-instruction communication, dependences exp</li> </ul>                                   | licit |
| <ul> <li>TRIPS microarchitecture         <ul> <li>Modular and tiled design with mesh network</li> </ul> </li> </ul>                                                                                    |       |
| <ul> <li>Up to 5X better performance than superscalar<br/>processors</li> </ul>                                                                                                                        |       |
|                                                                                                                                                                                                        | 38    |

## Future Work: Systems 10-15 years from now

- Energy efficiency and programmer productivity at high performance
- Technology
  - Much more unreliable
  - High process variability: each transistor is unique
- Software
  - Applications will change: mobility, modeling, ubiquitous connectivity
  - Concurrency will be expressed at language level also
    - Transactions, active objects, futures, parallel STL algorithms
  - Abundant latent concurrency in the application 1000 "threads"
- Architecture:
  - Exploiting concurrency in "threads": Where to run, when to run, how to run (voltage/frequency)?
  - What abstraction should compiler and language see?
  - Mechanisms and microarchitecture optimized for common case
  - Agile at many levels

39

| Acknowle                                                                                                       | edgements                                                                                                                                     |    |
|----------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|----|
| Adv                                                                                                            | risors                                                                                                                                        |    |
| Stephen W. Keckler                                                                                             | Doug Burger                                                                                                                                   |    |
| Architecture                                                                                                   | Contributors                                                                                                                                  |    |
| Raj Desikan<br>Saurabh Drolia<br>Sibi Govindan<br>Paul Gratz<br>Divya Gulati<br>Heather Hanson<br>Changkyu Kim | Haiming Liu<br>Robert McDonald<br><b>Ramdas Nagarajan</b><br>Simha Sethumadhavan<br>Premkishore Shivakumar<br>Nitya Ranganathan<br>Bill Yoder |    |
|                                                                                                                |                                                                                                                                               | 40 |





























|                 |      | Comp   | iler -O4 |
|-----------------|------|--------|----------|
| Benchmarks      | Min  | Max    | Mean     |
| SPECINT2000     | 8.8  | 29.2   | 18.4     |
| SPECFP2000      | 13.7 | 75.9   | 29.5     |
| EEMBC           | 8.2  | 49.9   | 22.7     |
| Microbenchmarks | 16.5 | 101.3  | 48.6     |
|                 |      | Hand o | ptimized |
| Microbenchmarks | 17   | 112.5  | 73.5     |





| Name   | TRIPS<br>Speedup | Alpha<br>IPC | TRIPS<br>IPC | TRIPS<br>Inst/Block | Description                                        |  |
|--------|------------------|--------------|--------------|---------------------|----------------------------------------------------|--|
| a2time | 5.05             | 0.81         | 4.05         | 77                  | Control, integer math                              |  |
| bezier | 3.30             | 1.05         | 3.20         | 76                  | Bezier curve, fixed-point math                     |  |
| dct8x8 | 2.66             | 1.70         | 4.70         | 90                  | 2D discrete cosine transform                       |  |
| matrix | 3.30             | 1.68         | 4.05         | 72                  | Matrix multiply                                    |  |
| sha    | 0.92             | 2.28         | 2.10         | 80                  | Secure hash<br>(mostly sequential algorithm        |  |
| vadd   | 1.92             | 3.04         | 6.51         | 74                  | Vector add<br>(limited by load/store<br>bandwidth) |  |

Alpha compilation with GEM compiler and maximum opts (O4 and tuned for 21264)

TRIPS compilation with in-development compiler plus some hand-tuning Speedup measured by comparing Alpha cycles to TRIPS cycles

58





























| TASL | TASL (target format) Object Code |            |        |    |      |      |    |
|------|----------------------------------|------------|--------|----|------|------|----|
| [R1] | \$g1                             | [2]        | v reg  | Т  | 1    | T0   |    |
| [R2] | \$g2                             | [1] [4]    | v reg  | ٦  | -1   | Т0   |    |
| [1]  | ld                               | L[1] 4 [2] | opcode | pr | LSID | imm  | ТО |
| [2]  | add                              | [3] [4]    | opcode | pr | хор  | T1   | TO |
| [3]  | mov                              | [5] [6]    | opcode | pr | хор  | T1   | ТО |
| [4]  | st                               | S[2] 4     | opcode | pr | LSID | imm  | 0  |
| [5]  | addi                             | 2 [W1]     | opcode | pr | хор  | imm  | Т0 |
| [6]  | teqz                             | [7] [8]    | opcode | pr | хор  | T1   | Т0 |
| [7]  | b_t                              | block3     | opcode | pr | exit | offs | et |
| [8]  | b_f                              | block2     | opcode | pr | exit | offs | et |
| [W1] | \$q5                             |            | v reg  |    |      |      |    |

| TASL | (target | t format)  |   |         |        | Objec | ct Code |      |
|------|---------|------------|---|---------|--------|-------|---------|------|
| [R1] | \$g1    | [2]        | [ | 1 00001 | [2     | ]     |         | ]    |
| [R2] | \$g2    | [1] [4]    | [ | 1 00010 | [1     | ]     | [4]     | ]    |
| [1]  | ld      | L[1] 4 [2] |   | ld      | 00     | 00001 | 4       | [2]  |
| [2]  | add     | [3] [4]    | [ | add     | 00     |       | [3]     | [4]  |
| [3]  | mov     | [5] [6]    | [ | mov     | 00     |       | [5]     | [6]  |
| [4]  | st      | S[2] 4     | [ | st      | 00     | 00010 | 4       |      |
| [5]  | addi    | 2 [W1]     | [ | addi    | 00     |       | 2       | [W1] |
| [6]  | teqz    | [7] [8]    |   | teqz    | 00     |       | [7]     | [8]  |
| [7]  | b_t     | block3     |   | b       | 11 (t) | E0    | block3  | - PC |
| [8]  | b_f     | block2     | [ | b       | 10 (f) | E1    | block2  | - PC |
| [W1] | \$g5    |            | [ | 1 00101 |        |       |         |      |



