--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

------------
ICN 3
------------

Outline
* Finish H&P
* Mukherjee et al.
* Kim et al. (optional and brief)

------------------------------

Use selected slides from
http://www.cs.wisc.edu/~markhill/restricted/AQA4e_AppendixE_ICN_slides.ppt

E.6 - end (switch microarchitecture & miscellaneous)

------------------------------

@InProceedings(mukherjee:alpha21364-network:hot-interconnects:2001,
  author = "Shubhendu S. Mukherjee and Peter Bannon and Steven Lang and Aaron Spink and David Webb",
  title = "The Alpha 21364 Network Architecture",
  crossref = "hot-interconnects:2001",
  topic = "interconnect, compaq",
)

Reviews
* Daniel, Tony, Andy N., Marc: deadlock, more on VCs
* Eric: 7 bits ECC on 32 bits (SECDED)
* Guoiang: fourth rectangle in Fig. 3 donut?
* Brian: cache coh modified for VCs?
* Andy E.: RAMBUS?
* Syed: hill climbing for VCs
* Marc: arb -- too complex

Comments by Mark D. Hill, 26 March 2004.

Up to 128 glueless MPs

Each node interfaces to Rambus Memory, I/O and four neighbors in 2D torus

Directory protocol

Flit -- really Phit -- 32 bits + 7b SECDED ECC

Packet Classes
  request (12 bytes) (not counting ECC)
  forward (12 bytes)
  block response (72 or 76 bytes)
  nonblock response (8 or 12 bytes)
  write I/O (76 bytes)
  read I/O (12 bytes)
  special (4, 8, or 25 bytes) -- no-ops, buffer de-alloc, etc.

Virtual Cut-Through -- can fully buffer packet

Each of the above classes (except special) has 3 virtual channels
per physical channel (6*3 + 1 = 19)
  adaptive
  VC0
  VC1

As per Duato, use adaptive channels in free-for-all.
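
Aside (a sketch, not taken from the paper): the VC count and the packet
sizes above can be checked with a few lines of Python. The helper names
are hypothetical. Two assumptions are made: every 32-bit phit carries
its own 7 ECC bits on the link, and a packet that does not fill a whole
number of phits is rounded up.

```python
# Sketch only: helper names are hypothetical; the numbers come from the notes.
PHIT_DATA_BITS = 32  # payload bits per phit
PHIT_ECC_BITS = 7    # SECDED check bits per phit

# Packet sizes in bytes, not counting ECC.
PACKET_BYTES = {
    "request": (12,),
    "forward": (12,),
    "block response": (72, 76),
    "nonblock response": (8, 12),
    "write I/O": (76,),
    "read I/O": (12,),
    "special": (4, 8, 25),
}


def vcs_per_physical_channel(classes=PACKET_BYTES):
    """Adaptive + VC0 + VC1 for every class except special, which gets one."""
    three_vc_classes = [name for name in classes if name != "special"]
    return 3 * len(three_vc_classes) + 1


def phits(packet_bytes):
    """Number of 32-bit phits, rounding a partial phit up (assumption)."""
    bits = packet_bytes * 8
    return (bits + PHIT_DATA_BITS - 1) // PHIT_DATA_BITS


def wire_bits(packet_bytes):
    """Bits on the link, assuming each phit carries its own 7 ECC bits."""
    return phits(packet_bytes) * (PHIT_DATA_BITS + PHIT_ECC_BITS)


if __name__ == "__main__":
    print("VCs per physical channel:", vcs_per_physical_channel())
    for name, sizes in PACKET_BYTES.items():
        for size in sizes:
            print(f"{name:18s} {size:2d} B -> {phits(size):2d} phits, "
                  f"{wire_bits(size)} wire bits")
```

Under these assumptions a 12-byte request is 3 phits (117 bits on the
link) and a 76-byte block response is 19 phits (741 bits).
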
  requires buffering so packet can't block both
  allow adaptivity only in minimal rectangle
  can go back to adaptive if free buffers

If apparent deadlock, move to deadlock-free routing
  VC1 before VC0 in each dimension
  Fixed dimension order
  Since naive VC1/VC0 use is unbalanced, balance it offline

Router
  Has fancy pipeline
  Packet always accesses configuration table with 128 24b entries:
    14b header routing info (less VC)
    6b access control
    3b for routing in incomplete torus
    1b parity
  Thus, header routing info is:
    2b direction
    4b+4b destination
    1b if adaptivity is allowed (e.g., not if incomplete torus)
    1b of I/O
    2b reserved
    2b VC

At each hop, ECC checked and regenerated (esp. for changing header).

------------------------------

John Kim, William Dally, Brian Towles, Amit Gupta
"Microarchitecture of a High-radix Router"
32nd International Symposium on Computer Architecture (ISCA 2005)

1980s -- high-radix (e.g., hypercube) to reduce hops with store and forward

1990s -- low-radix (e.g., 2/3D tori) for wide channels since
  wormhole/virtual cut-through tolerates more hops (Dally's Ph.D. thesis)

Future? -- back to high radix due to high total pin bandwidth
  making many narrow links look good again
  (also chip area to do switching?)

See Section 2
  Unloaded Latency = number of hops * hop delay + packet length / link BW

If switch has higher radix
  + number of hops goes down
  - link BW goes down (fixed switch BW / more links)

Figure 3 optimal radices
* 2003 has minimum at, say, 30 (which is still high)
* 2010 minimum is larger than 250

Practical considerations favor radices smaller than these theoretical optima

Nevertheless, it looks like radices should grow a