(3.3.4) Piranha

Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Proc. International Symposium on Computer Architecture, June 2000, pp. 282-293.

many simple in-order Alpha cores on one chip.

commercial workloads: large memory-stall component, data-dependent computation, little ILP. they make little use of high-performance floating-point and multimedia functionality.

SMT: superior single-thread performance, but best suited to very wide-issue processors, which are more complex to design.

CMP: simpler processor cores, at a potential loss in single-thread performance.

Piranha: extremely simple processor cores (single-issue, in-order, eight-stage pipeline).

processing node: 8 CPUs, intra-chip switch, system control, home and remote engines, packet switch and router, and L2 backed by memory.

I/O chips: same building blocks as the processing node, but with only 1 CPU.

intra-chip switch: back-to-back transfers without dead cycles. to reduce latency, a module can issue the target destination of a future request ahead of the request itself (a hint); the switch uses the hint to pre-allocate datapaths and to speculatively assert the requester's grant signal. two lanes: high priority and low priority.

fully ordered.

L2: 8 banks, each 8-way set-associative.
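
a minimal sketch of how a physical address could map onto a banked, set-associative cache like the one noted above. the 8 banks and 8 ways come from the notes; the 64-byte line, the 1 MB capacity, and interleaving banks on the low-order line-address bits are assumptions made for illustration, and `l2_location` is a hypothetical helper name.

```python
LINE_BYTES = 64        # assumption: cache-line size
L2_BYTES = 1 << 20     # assumption: total L2 capacity (1 MB)
NUM_BANKS = 8          # from the notes
WAYS = 8               # from the notes

# Sets per bank follow from capacity / (banks * ways * line size).
SETS_PER_BANK = L2_BYTES // (NUM_BANKS * WAYS * LINE_BYTES)


def l2_location(phys_addr):
    """Return (bank, set_index, tag) for a physical address.

    Assumes consecutive cache lines are spread round-robin over the banks.
    """
    line = phys_addr // LINE_BYTES      # drop the byte offset within the line
    bank = line % NUM_BANKS             # low-order line bits pick the bank
    within_bank = line // NUM_BANKS     # remaining bits address inside the bank
    set_index = within_bank % SETS_PER_BANK
    tag = within_bank // SETS_PER_BANK
    return bank, set_index, tag
```

with these assumed sizes, consecutive lines land in different banks, so a sequential stream of misses spreads its load across all 8 banks.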

duplicate copy of L1 tags and state at the L2 controllers.

L2 acts as a very large victim cache: it is filled when lines are replaced from the L1s.

with multiple sharers, a write-back to L2 happens only when the owner of the data replaces it.

intra-chip coherence is, in effect, a full-map centralized directory-based scheme.
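
a toy model of the points above: the duplicate L1 tags tell the L2 controller who holds each line, and only the owner's replacement fills the victim L2. `ToyChip` and its methods are hypothetical names; making the first L1 to load a line its owner, and leaving ownership with the L2 once it holds the line, are simplifying assumptions, not details from the notes.

```python
class ToyChip:
    """Tracks L1 sharers per line (duplicate tags) and a victim-style L2."""

    def __init__(self):
        self.sharers = {}   # line -> set of CPU ids holding it in L1
        self.owner = {}     # line -> CPU id whose eviction triggers write-back
        self.l2 = set()     # lines currently held in the L2

    def l1_fill(self, cpu, line):
        """Record that `cpu` now holds `line` in its L1."""
        self.sharers.setdefault(line, set()).add(cpu)
        # Assumption: the first L1 to load the line becomes its owner,
        # unless the L2 already holds a copy.
        if line not in self.l2:
            self.owner.setdefault(line, cpu)

    def l1_evict(self, cpu, line):
        """Evict `line` from `cpu`'s L1; return True if it is written to L2."""
        self.sharers[line].discard(cpu)
        wrote_back = self.owner.get(line) == cpu
        if wrote_back:
            self.l2.add(line)       # the L2 is filled only on an owner eviction
            del self.owner[line]
        if not self.sharers[line]:
            del self.sharers[line]
        return wrote_back


chip = ToyChip()
chip.l1_fill(0, 0x40)                      # CPU 0 loads the line and owns it
chip.l1_fill(1, 0x40)                      # CPU 1 becomes a second sharer
non_owner_wrote = chip.l1_evict(1, 0x40)   # sharer eviction: no write-back
owner_wrote = chip.l1_evict(0, 0x40)       # owner eviction: line moves to L2
```

because every sharer is recorded per line, the lookup structure behaves like a full-map directory kept in one place on the chip.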

home engine: exports memory whose home is at the local node.

remote engine: imports memory whose home is at a remote node.

directory storage

limited pointers

coarse vector.
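
a sketch of how a directory entry might combine the two representations: exact node pointers while the sharers fit, then a fallback to a vector in which each bit stands for a group of nodes. `DirEntry`, its parameters, and the default sizes are hypothetical; the notes give no field widths.

```python
import math


class DirEntry:
    """Sharer set stored as limited pointers, falling back to a coarse vector."""

    def __init__(self, num_nodes=1024, max_pointers=4, vector_bits=32):
        self.num_nodes = num_nodes
        self.max_pointers = max_pointers
        self.vector_bits = vector_bits
        self.nodes_per_bit = math.ceil(num_nodes / vector_bits)
        self.pointers = []   # exact node ids, used while they fit
        self.vector = None   # int bitmask, used after pointer overflow

    def add_sharer(self, node):
        if self.vector is None:
            if node in self.pointers:
                return
            if len(self.pointers) < self.max_pointers:
                self.pointers.append(node)
                return
            # Overflow: convert the existing pointers to coarse-vector bits.
            self.vector = 0
            for p in self.pointers:
                self.vector |= 1 << (p // self.nodes_per_bit)
            self.pointers = []
        self.vector |= 1 << (node // self.nodes_per_bit)

    def invalidation_targets(self):
        """Nodes that must receive an invalidation (a superset once coarse)."""
        if self.vector is None:
            return sorted(self.pointers)
        targets = []
        for bit in range(self.vector_bits):
            if (self.vector >> bit) & 1:
                start = bit * self.nodes_per_bit
                stop = min(start + self.nodes_per_bit, self.num_nodes)
                targets.extend(range(start, stop))
        return targets
```

the trade-off shows up in `invalidation_targets`: pointers are exact, while the coarse vector over-approximates and so invalidates some nodes that never held the line.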

inter node coherence protocol

invalidation based directory

does not depend on point-to-point ordering in the network, which allows adaptive routing.

no NAKs, and so no retries; protocol and network deadlocks are avoided without them.

no livelock or starvation.

hot-potato routing: as a message's age increases, so does its priority.
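
a small sketch of age-prioritized hot-potato (deflection) routing for a single router cycle. everything here is illustrative: `route_cycle`, the message fields, ageing a message each time it is deflected, and handing a deflected message the lowest-numbered free port are assumptions, not details from the notes.

```python
def route_cycle(messages, num_ports):
    """Assign every incoming message an output port for this cycle.

    messages: list of dicts with 'id', 'age', and 'want' (preferred port).
    Nothing is buffered: a message that loses its preferred port is
    deflected to a free port and its age is incremented.
    Returns a dict mapping message id to the assigned port.
    """
    assert len(messages) <= num_ports, "hot-potato needs a port per message"
    free = set(range(num_ports))
    assignment = {}
    deflected = []
    # Oldest messages choose first, so repeated deflections raise priority.
    for msg in sorted(messages, key=lambda m: -m["age"]):
        if msg["want"] in free:
            free.remove(msg["want"])
            assignment[msg["id"]] = msg["want"]
        else:
            deflected.append(msg)
    # Losers leave on whatever ports remain once all winners are placed.
    for msg in deflected:
        port = min(free)
        free.remove(port)
        assignment[msg["id"]] = port
        msg["age"] += 1
    return assignment


msgs = [
    {"id": "a", "age": 0, "want": 1},
    {"id": "b", "age": 3, "want": 1},
    {"id": "c", "age": 0, "want": 2},
]
ports = route_cycle(msgs, num_ports=4)
```

in this run the older message `b` wins port 1, while `a` is deflected and comes out one step older, which improves its chances at the next router.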

bound the number of messages that can be injected into the network as a result of a single request.

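cruise-missile invalidates (CMI): a handful of invalidation messages, each visiting a predetermined set of nodes and producing a single acknowledgment.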
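
one way to picture cruise-missile invalidates alongside the bound on messages per request: the sharers are split into a small fixed number of chains, each chain is walked by one invalidation message, and only the last node of a chain sends an acknowledgment. `plan_cmi`, `run_cmi`, the `max_messages` limit, and the stride-based split are hypothetical; a real design would choose the chains from the network topology.

```python
def plan_cmi(sharers, max_messages=4):
    """Split the sharer set into at most `max_messages` visit chains."""
    nodes = sorted(set(sharers))
    if not nodes:
        return []
    k = min(max_messages, len(nodes))
    # Assumption: a simple stride split stands in for topology-aware chains.
    return [nodes[i::k] for i in range(k)]


def run_cmi(chains):
    """Walk each chain; return (invalidated nodes, acknowledgment count)."""
    invalidated = []
    acks = 0
    for chain in chains:
        for node in chain:
            invalidated.append(node)   # the message invalidates, then moves on
        acks += 1                      # only the final node in the chain acks
    return invalidated, acks


chains = plan_cmi(range(10), max_messages=4)
invalidated, acks = run_cmi(chains)
```

here 10 sharers are invalidated with 4 injected messages and 4 acknowledgments, where one message per sharer would need 10 of each.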

in-band clock distribution (S-Connect).