--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Outline
-------

Larrabee
(Data Center on a Chip)
(Cell)
Final Thoughts
As time permits: (DySER)

------------------------

Larrabee

Seiler et al., SIGGRAPH 2008

Show Figure 1

8 in-order cores w/ private L1 caches, 32KB I + 32KB D
Private 256KB L2 cache on bidirectional 512b (= 64B) ring ICN
Fixed-function logic, memory, & I/O also on ring
(Fixed-function logic for graphics-specific stuff (rasterization, interpolation))

Each core -- asymmetric 2-way issue
- 1st way: any x86 instruction
- 2nd way: simple subset
- 4-way multi-threaded

Vector Processing Unit
- 16 single-precision lanes (512-bit wide)
- vector registers
- vector mask
- vector ISA allows one operand in memory (cache)
- gather/scatter (semantics sketched at end of notes)
  (also, more generally, cache-control ISA, e.g., prefetch and "give low
  priority" so sweeping thru data doesn't flush the cache of other stuff)

Graphics?, GP?, Both?, Neither? --> cancelled

Cf. Weiser's Valley, slide 39
http://www.cs.wisc.edu/~markhill/restricted/weiser_wisconsin09.pdf
X = #threads, Y = Performance
Conventional cache-based peaks
GPU upper-bounded by BW
Beware the Valley

But new Larrabee vector ISA carries on in MIC (Many Integrated Core)?

------------------------

Data Center on a Chip

(reference)
http://hothardware.com/News/Intel-Unveils-48Core-SingleChip-Cloud-Computer/

48 Pentium cores
2D mesh ICN
"Energy Efficiency"

I hear:
* Shared memory space but no coherence
* Nominally each core works in its own part of the shared space
* but can let others access it with software coordination
* Rumors of alternative communication mechanisms, e.g., block-copy
  engines and message queues, for experimentation

But:
* Pentium cores make the N/W look really fast
* Better than the 80 "joke" cores of the Tera-scale chip
  http://techresearch.intel.com/articles/Tera-Scale/1449.htm

------------------------

Cell

Kahle et al., IBM JRD, Jul-Sep 2005

Run with Figure 1

PowerPC + 8 SPEs

Synergistic Processing Element
- hyped name
- no i-cache
- overlays
- explicitly-managed DMA (sketch at end of notes)
- makes

------------------------

Final Thoughts

10-100x -- programmers will stand on their heads
1.x-3x -- they will not; time-to-market matters
Between?

Amdahl's Law -- overall speedup, not kernel speedup (worked example at end of notes)

20th Century
* Moore's Law (w/ Dennard scaling) provided
* Higher levels harnessed benefits

21st Century
* Lack of Dennard scaling makes energy central
* Moore's Law in trouble (economically)
* Need to re-visit the 20th Century
* Maybe new technology will save us, but not for a decade+

What to do? Wring out the inefficiencies the 20th Century used to manage complexity
* HW/SW specialization -- see H.264 ISCA 2010 and DySER (reader)
* Reduce SW bloat -- recall matrix multiply in Java >1000x slower than
  hand-tuned C (blocked matmul sketch at end of notes)
* Consider approximate and analog operations
==> All hard, as they require a "vertical cut" thru SW/HW

Parallelism is necessary, but not sufficient
* Must exploit locality-aware parallelism
* Otherwise communication's energy cost dominates
* 10x-100x efficiency gains over naive parallelism

Our work is not done!

------------------------

DySER -- as time permits -- use slides
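
------------------------

Code Sketches

Larrabee's masked gather/scatter point above: a minimal sketch, in scalar C, of
what a 16-lane masked gather does. The vgather name, VLEN, and the mask encoding
are illustrative assumptions, not actual Larrabee ISA mnemonics.

    #include <stdint.h>

    #define VLEN 16   /* matches Larrabee's 16-wide VPU */

    /* dst[lane] = base[idx[lane]] for every lane whose mask bit is set */
    static void vgather(float dst[VLEN], const float *base,
                        const int32_t idx[VLEN], uint16_t mask)
    {
        for (int lane = 0; lane < VLEN; lane++)
            if (mask & (1u << lane))          /* predicated: skip masked-off lanes */
                dst[lane] = base[idx[lane]];  /* indexed load = gather */
    }

The same mask/index idea applies to scatter (indexed stores), which is what lets
sparse or strided accesses use the full vector width.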
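Cell's explicitly-managed DMA point: a minimal SPE-side sketch, assuming the Cell
SDK's spu_mfcio.h composite intrinsics (mfc_get, mfc_put, tag status). The chunk
size, tag number, and the doubling loop are illustrative placeholders, and this
only compiles with the SPU toolchain.

    #include <stdint.h>
    #include <spu_mfcio.h>   /* SPE MFC intrinsics; assumes the Cell SDK toolchain */

    #define CHUNK 4096       /* illustrative DMA size */

    static float ls_buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

    /* Pull a chunk from main memory into the local store, touch it, push it back.
     * No hardware cache: software issues the DMA and waits on the tag explicitly. */
    void process_chunk(uint64_t ea)   /* ea = effective address in main memory */
    {
        mfc_get(ls_buf, ea, CHUNK, 0, 0, 0);   /* DMA in on tag 0 */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();             /* block until DMA in completes */

        for (unsigned i = 0; i < CHUNK / sizeof(float); i++)
            ls_buf[i] *= 2.0f;                 /* placeholder compute */

        mfc_put(ls_buf, ea, CHUNK, 0, 0, 0);   /* DMA out on tag 0 */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();             /* block until DMA out completes */
    }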
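The Amdahl's Law point: a tiny worked example, in C, of why kernel speedup and
overall speedup differ. The 90% fraction and the kernel speedups are made-up
inputs for illustration.

    #include <stdio.h>

    /* Amdahl's Law: if fraction f of the runtime is sped up by s,
     * overall speedup = 1 / ((1 - f) + f/s). */
    static double amdahl(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void)
    {
        printf("90%% parallel, 8x kernel speedup  -> %.2fx overall\n", amdahl(0.90, 8.0));
        printf("90%% parallel, huge kernel speedup -> %.2fx overall (caps at 10x)\n", amdahl(0.90, 1e12));
        return 0;
    }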
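The SW-bloat / locality point: a minimal sketch of a cache-blocked matrix multiply
in C, the style of hand-tuned code the >1000x-over-naive-Java comparison alludes
to. N and the block size BS are illustrative, untuned choices (N must be a
multiple of BS), and the caller is assumed to zero C first.

    #define N  512   /* matrix dimension, illustrative */
    #define BS 64    /* block (tile) size, illustrative and untuned */

    /* C += A * B, tiled so each BSxBS block stays cache-resident while it is
     * reused, instead of streaming the whole matrices through the cache. */
    void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double a = A[i][k];             /* reused across the j loop */
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += a * B[k][j];     /* unit-stride inner loop */
                        }
    }

Blocking is the single-core version of the locality-aware-parallelism point: the
same reuse argument applies when tiles are distributed across cores.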