--------------------------------------------------------------------
CS 757 Parallel Computer Architecture
Spring 2012 Section 1
Instructor Mark D. Hill
--------------------------------------------------------------------

Outline

Clusters

Getting Smaller
* Motivation & Packaging Background
* Blades
* SeaMicro

Getting Larger
* Shipping Containers
* Warehouse-scale computers
* (Google Paper)
* (MapReduce and beyond)

Acknowledgment
* Paul "Beebs" Beebe for CSL tutorial

----------------------------

Motivation

* What is a computer system? How do we design one?
* Formerly, this meant processor, memory, and I/O under a single OS
* Now, it also means designing "racks" or even data centers

GETTING SMALLER ....

----------------------------

Computer Packaging Background

Let U = 1.75" = 44.4mm = 4.44cm

DRAW PICTURE AS I GO

Non-blade server computers go in racks
* 11U wide (~19")
* ~42U tall (~72" = 6')
* 16U deep (varies?)
* rack is passive

Computers go HORIZONTALLY in rack
* 11U wide
* 4U --> 2U --> 1U today (think: pizza box)
* Each one is a complete computer: proc, mem, disk, ethernet,
  misc. I/O cards, power supply, fans, etc.
* To each, connect: power, ethernet, service network (serial --> ethernet)

Computer Room must
* Provide AC power
* Provide air conditioning

In the long run, changing power is easier than changing A/C.
Beebs guesstimates that doubling
* CSL power == $30K
* CSL A/C == $5M
==> Over-provision data center A/C for an uncertain future

----------------------------

Blades -- Double density and reduce cost (perhaps not price :-) )

Numbers from Desai et al., BladeCenter, IBM JR&D, 11/2005

DRAW PICTURE AS I GO

Same rack: 11U * 42U * 16U

Blade Chassis: 7U tall (still 11U wide and 16U deep)
* Active midplane: 2 ethernet, RGB video bus, DC power, ...
* Back has "bays" for (redundant) power supplies/fans, switching (ethernet?)
* Front top has misc: CD-ROM, floppy, USB, lights, service processor (?)
* Rest of front has vertical slots for 14 "blades" at 30mm
* Each blade slot: 0.65U/1.14"/29mm wide * 6U tall * 10U deep
* Everything "hot pluggable"
* (So a 42U rack holds 6 chassis * 14 blades = 84 blades,
  vs. 42 1U pizza boxes -- hence double density)

Blade
* n*0.65U wide * 6U tall * 10U deep, where n=1,2,...
* n > 1 if needed for size, power, or cooling

Processor Blade
* midplane connectors, voltage regulators
* processor, memory, disks, ethernet (?)
* A "logical" PC with its own OS, etc. -- only physical packaging/service changes

Other Blades
* I/O Expansion to PCI-X, Infiniband, Fibre Channel, etc.
* Expansion (that talks to base blade via PCI-Express) for SCSI disks, PCI, etc.

----------------------------

SeaMicro

HotChips 2011 slides

Start w/ slide 14 -- big picture
slide 11 -- four virtual devices
  PCIe ethernet, 4 SATA disks
  BIOS UART

GETTING LARGER ....

----------------------------

Shipping Containers

40' Dry Freight Container
Outside: 40' x 8' x 8.6' (height last)

20' Dry Freight Container
Outside: 19'10" x 8' x 8.6' (height last)

and others.

Three plugs:
* Power: 220V or higher, three-phase?
* Chilled water in, then out
* Network

Advantages
* Little on-site assembly (time)
* Measurement isolation: e.g., performance/watt is clear
* Gives vendor more degrees of freedom
* Service-level agreement (SLA) regarding failures?
* Bid vendors on a big item
* Become a commodity?

Disadvantages
* Large-grain (can't buy small)
* Hard to service (but see SLA) ???

----------------------------

Warehouse-Scale Computers

----------------------------

Luiz Andre Barroso, Jeffrey Dean, Urs Holzle,
Web Search For a Planet: The Google Cluster Architecture,
IEEE Micro, 23(2):22-28, March-April 2003

Comments by Mark D. Hill, 15 March 2004

Google Query
------------

DNS lookup with load balancing

Phase 1: inverted-index lookup for pages that match
* Split among shards, with several machines per shard, selected by load balancing
* Produces docids -- document IDs

Phase 2: look up the docs themselves (at least their beginnings)
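To make the two-phase flow concrete, here is a minimal single-process
sketch (not from the paper, and certainly not Google's code): the shard
count, the hash-based doc-to-shard assignment, and all names and toy
documents below are illustrative assumptions.

# Minimal sketch of the two-phase query flow described above.
# Assumptions: hash-based doc->shard assignment, 4 shards, toy docs.
import hashlib
from collections import defaultdict

NUM_SHARDS = 4

def shard_of(docid):
    # Assign each document to one index shard (here: by hashing the docid).
    return int(hashlib.md5(docid.encode()).hexdigest(), 16) % NUM_SHARDS

class IndexShard:
    # Inverted index over this shard's slice of the document set.
    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of docids

    def add(self, docid, text):
        for word in text.lower().split():
            self.postings[word].add(docid)

    def lookup(self, word):
        return self.postings.get(word, set())

docs = {"d1": "parallel computer architecture",
        "d2": "warehouse scale computer design",
        "d3": "blade server packaging"}
shards = [IndexShard() for _ in range(NUM_SHARDS)]
for docid, text in docs.items():
    shards[shard_of(docid)].add(docid, text)

def query(word):
    # Phase 1: ask every index shard for matching docids.
    # (The real system load-balances across several replicas per shard.)
    docids = set().union(*(s.lookup(word) for s in shards))
    # Phase 2: fetch the matching documents (here, just their beginnings).
    return {d: docs[d][:20] for d in sorted(docids)}

print(query("computer"))
# -> {'d1': 'parallel computer ar', 'd2': 'warehouse scale comp'}

Note the design point this exposes: phase 1 is embarrassingly parallel
across shards, and replicas per shard buy both throughput and fault
tolerance, which is why cheap commodity machines suffice.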
Commodity Parts
---------------
(Note the absurd prices in this 2003 article)

Mid-range PCs with big disks; carefully amortize cost
Lots of thread-level parallelism, but SMP prices don't make sense

Pentium 3's run at CPI 1.1 >> 1/3 (ideal)
I$ works great; D$ sees spatial locality, not temporal
5% branch mispredictions
P4 doesn't help -- pipeline is too long for this purpose
Want SMT and/or CMPs with short pipelines

Other applications? Web servers?
* Read-mostly
* Easy correctness

Sun: The Network is the Computer
Google: The Data Center is the Computer

Reviews
* Guoliang: Does the Google design generalize? Whither server HW?
* Brian: Still commodity x86s?
* Marc: Low computation-to-communication ratio?
* Syed: Updates?
* Aditya: Actual search technique --> inverted index
* Daniel: Maintain copies of the web? Yes, 10 static ones
* Andrew N: Power? Some trends: efficient power supplies, making power
  linear with performance
* Eric: GPUs?

-------------

Dean and Ghemawat, MapReduce, OSDI 2004 (superficial coverage)

Motivation
* Google needs BS grads to write programs for their clusters
* Want to abstract parallelism, distribution, fault tolerance

Example: Counting Word Frequency in a Large Corpus

MAP (String key, String value)
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

REDUCE (String key, Iterator values)
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

(A runnable single-machine sketch of this example appears at the end of
these notes.)

Implementation Notes
* Done by gurus
* MAP is completely parallel
* REDUCE must be associative
* REDUCE requires gathering map records with like keys (e.g., by hashing)
* Load balancing, fault tolerance, etc.

Cf. Hadoop -- open-source distributed implementation
Cf. Phoenix, Ranger et al., HPCA 2007 -- shared-memory study & code for multicore

How will we do parallel programming models?
* Note: Data parallel, MapReduce, SQL, LINQ, etc. --> very high level
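As promised above, here is the word-count example in runnable
single-machine form, with the shuffle step (gathering records with like
keys) made explicit. This is a sketch of the programming model only --
the names are assumptions, and the real system distributes the phases
across machines and adds load balancing and fault tolerance.

# Runnable single-machine sketch of the word-count MapReduce above.
# Illustrative only: the real system runs map and reduce tasks on many
# machines, with load balancing and fault tolerance.
from collections import defaultdict

def map_fn(doc_name, contents):
    # MAP: emit an intermediate (word, "1") pair for every word.
    return [(w, "1") for w in contents.split()]

def reduce_fn(word, counts):
    # REDUCE: sum the counts for one word (associative, as required).
    return str(sum(int(c) for c in counts))

def map_reduce(docs):
    # Map phase: completely parallel across documents.
    intermediate = []
    for name, contents in docs.items():
        intermediate.extend(map_fn(name, contents))
    # Shuffle: gather map records with like keys (a dict here; the
    # distributed version hashes each key to a reducer machine).
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # Reduce phase: one call per distinct key.
    return {word: reduce_fn(word, counts) for word, counts in groups.items()}

docs = {"doc1": "the data center is the computer",
        "doc2": "the network is the computer"}
print(map_reduce(docs))
# -> {'the': '4', 'data': '1', 'center': '1', 'is': '2',
#     'computer': '2', 'network': '1'}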