--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Outline
-------

Larrabee
(Data Center on a Chip)
(Cell)
Final Thoughts
As time permits: (DySER)

------------------------

Larrabee

Seiler et al., SIGGRAPH 2008

Show Figure 1

8 in-order cores w/ private L1 caches, 32KB I + 32KB D
Private 256KB L2 cache on bidirectional 512b (= 64B) ring ICN
Fixed-function logic, memory, & I/O also on ring
(Fixed-function logic for graphics-specific stuff (rasterization, interpolation))

Each core -- asymmetric 2-way issue
- 1st way: any x86 instruction
- 2nd way: simple subset
- 4-way multi-threaded

Vector Processing Unit
- 16 single-precision lanes (512-bit wide)
- vector registers
- vector mask
- vector ISA allows one operand in memory (cache)
- gather/scatter (semantics sketched at end of notes)
  (also, more generally, cache-control ISA, e.g., prefetch and "give low
  priority" so sweeping thru data doesn't flush the cache of other stuff)

Graphics?, GP?, Both?, Neither? --> cancelled

Cf. Weiser's Valley, slide 39
http://www.cs.wisc.edu/~markhill/restricted/weiser_wisconsin09.pdf
X = #threads, Y = Performance
Conventional cache-based peaks
GPU upper-bounded by BW
Beware the Valley

But new Larrabee vector ISA carries on in MIC (Many Integrated Core)?

------------------------

Data Center on a Chip

(reference)
http://hothardware.com/News/Intel-Unveils-48Core-SingleChip-Cloud-Computer/

48 Pentium cores
2D mesh ICN
"Energy Efficiency"

I hear:
* Shared memory space but no coherence
* Nominally each core works in its own part of the shared space
* but can let others access it with software coordination
* Rumors of alternative communication mechanisms, e.g., block-copy
  engines and message queues, for experimentation

But:
* Pentium cores make the N/W look really fast
* Better than the 80 "joke" cores of the Tera-scale chip
  http://techresearch.intel.com/articles/Tera-Scale/1449.htm

------------------------

Cell

Kahle et al., IBM JRD, Jul-Sep 2005

Run with Figure 1

PowerPC + 8 SPEs

Synergistic Processing Element
- hyped name
- no i-cache
- overlays
- explicitly-managed DMA (sketch at end of notes)
- makes

------------------------

Final Thoughts

10-100x -- programmers will stand on their heads
1.x-3x -- they will not; time-to-market matters
Between?

Amdahl's Law -- overall speedup, not kernel speedup (worked example at end of notes)

20th Century
* Moore's Law (w/ Dennard scaling) provided
* Higher levels harnessed benefits

21st Century
* Lack of Dennard scaling makes energy central
* Moore's Law in trouble (economically)
* Need to re-visit the 20th Century
* Maybe new technology will save us, but not for a decade+

What to do? Wring out the inefficiencies the 20th Century used to manage complexity
* HW/SW specialization -- see H.264 ISCA 2010 and DySER (reader)
* Reduce SW bloat -- recall matrix multiply in Java >1000x slower than
  hand-tuned C (blocked matmul sketch at end of notes)
* Consider approximate and analog operations
==> All hard, as they require a "vertical cut" thru SW/HW

Parallelism is necessary, but not sufficient
* Must exploit locality-aware parallelism
* Otherwise communication's energy cost dominates
* 10x-100x efficiency gains over naive parallelism

Our work is not done!

------------------------

DySER -- as time permits -- use slides
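
------------------------

Code Sketches

Larrabee's masked gather/scatter point above: a minimal sketch, in scalar C, of
what a 16-lane masked gather does. The vgather name, VLEN, and the mask encoding
are illustrative assumptions, not actual Larrabee ISA mnemonics.

    #include <stdint.h>

    #define VLEN 16   /* matches Larrabee's 16-wide VPU */

    /* dst[lane] = base[idx[lane]] for every lane whose mask bit is set */
    static void vgather(float dst[VLEN], const float *base,
                        const int32_t idx[VLEN], uint16_t mask)
    {
        for (int lane = 0; lane < VLEN; lane++)
            if (mask & (1u << lane))          /* predicated: skip masked-off lanes */
                dst[lane] = base[idx[lane]];  /* indexed load = gather */
    }

The same mask/index idea applies to scatter (indexed stores), which is what lets
sparse or strided accesses use the full vector width.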
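Cell's explicitly-managed DMA point: a minimal SPE-side sketch, assuming the Cell
SDK's spu_mfcio.h composite intrinsics (mfc_get, mfc_put, tag status). The chunk
size, tag number, and the doubling loop are illustrative placeholders, and this
only compiles with the SPU toolchain.

    #include <stdint.h>
    #include <spu_mfcio.h>   /* SPE MFC intrinsics; assumes the Cell SDK toolchain */

    #define CHUNK 4096       /* illustrative DMA size */

    static float ls_buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

    /* Pull a chunk from main memory into the local store, touch it, push it back.
     * No hardware cache: software issues the DMA and waits on the tag explicitly. */
    void process_chunk(uint64_t ea)   /* ea = effective address in main memory */
    {
        mfc_get(ls_buf, ea, CHUNK, 0, 0, 0);   /* DMA in on tag 0 */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();             /* block until DMA in completes */

        for (unsigned i = 0; i < CHUNK / sizeof(float); i++)
            ls_buf[i] *= 2.0f;                 /* placeholder compute */

        mfc_put(ls_buf, ea, CHUNK, 0, 0, 0);   /* DMA out on tag 0 */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();             /* block until DMA out completes */
    }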
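The Amdahl's Law point: a tiny worked example, in C, of why kernel speedup and
overall speedup differ. The 90% fraction and the kernel speedups are made-up
inputs for illustration.

    #include <stdio.h>

    /* Amdahl's Law: if fraction f of the runtime is sped up by s,
     * overall speedup = 1 / ((1 - f) + f/s). */
    static double amdahl(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void)
    {
        printf("90%% parallel, 8x kernel speedup  -> %.2fx overall\n", amdahl(0.90, 8.0));
        printf("90%% parallel, huge kernel speedup -> %.2fx overall (caps at 10x)\n", amdahl(0.90, 1e12));
        return 0;
    }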
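The SW-bloat / locality point: a minimal sketch of a cache-blocked matrix multiply
in C, the style of hand-tuned code the >1000x-over-naive-Java comparison alludes
to. N and the block size BS are illustrative, untuned choices (N must be a
multiple of BS), and the caller is assumed to zero C first.

    #define N  512   /* matrix dimension, illustrative */
    #define BS 64    /* block (tile) size, illustrative and untuned */

    /* C += A * B, tiled so each BSxBS block stays cache-resident while it is
     * reused, instead of streaming the whole matrices through the cache. */
    void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double a = A[i][k];             /* reused across the j loop */
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += a * B[k][j];     /* unit-stride inner loop */
                        }
    }

Blocking is the single-core version of the locality-aware-parallelism point: the
same reuse argument applies when tiles are distributed across cores.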