Disclaimer
These are notes I made from papers in the reading list. At times I might have included screenshots of figures; please shoot me an email if you want those taken down, though you can download a free copy of almost all of these papers (scholar.google.com). Also, it's all my take on the papers, and depends on my mood, level of understanding, etc. These are not reviews, just points I thought were important.
Feel free to copy.
Uploaded : Feb, 2012


J. E. Smith and A. R. Pleszkun. Implementing Precise Interrupts in Pipelined Processors, IEEE Trans. on Computers, May 1988, pp. 562-573. ACM DL Link


Precise Interrupts

1) All instructions preceding the instruction indicated by the saved program counter have been executed and have modified the processor state correctly

2) All instr following the instruction indicated by the saved program counter are unexecuted and have not modified the processor state

3) Saved program counter points to the interrupted instruction. 


Classification of interrupts

1) program interrupts

2) external interrupts. 


I/O timer interrupts (for restarting), virtual memory page faults, software debugging, arithmetic exceptions, unimplemented opcodes. Virtual machines can be implemented if privileged instruction faults cause precise interrupts. 





Joseph A. Fisher and B. Ramakrishna Rau. Instruction-Level Parallel Processing, Science, 13 September 1991, pp. 1233-1241. Science Link

1991

For ILP 
  1. The dependences between operations must be determined
  2. The operations that are independent of any other operation that has not yet completed must be determined
  3. These independent operations must be scheduled to execute at some particular time, on some specific funct unit, and must be assigned a register in which the result may be deposited. 


Hardware and Software Techniques for ILP Exec
HW
Pipelining / multiple functional units 
+ ILP increase 
-  interconn network/reg file cost goes up as the square of the number of funct units.
Superpipelining > eg : int ALU is pipelined. 
Superscalar : multiple instruction issue of seq programs
Horizontal microprogramming > 
Compiler decides
VLIW issues
 
Problem : small size of basic blocks
Dynamic schemes for speculative execution
terminate unnec speculative computation once the branch has been resolved
undo the effects of operations that should not have been executed
ensure no exceptions from squashed speculative operations are reported
preserve enuf exec state to resume
 
Compile time > trace scheduling
 
Branch prediction to guide speculative scheduling
gather execution statistics
compile time profiling (Trace sched/superblock sched)
Dynamic branch prediction ++
Predicated execution : to selectively squash. 
 
ILP Compilation
Scheduling > NP complete problem
Only a problem on VLIW kind processors (non sequential)
Local Scheduling
a scheduling barrier is assumed to exist between adjacent basic blocks in the control flow graph
(Local code compaction)
+ computationally inexpensive (one-pass)
+ near optimal
 
Global Acyclic Scheduling
Trace Sched
the compiler can often improve program performance by rearranging its generated machine instructions for faster execution. Trace scheduling is one of many known techniques for doing so.

Trace scheduling was originally developed for Very Long Instruction Word, or VLIW machines, and is a form of global code motion. It works by converting a loop into a long straight-line code sequence using loop unrolling and static branch prediction. This process separates out "unlikely" code and adds handlers for exits from the trace. The goal is to have the most common case executed as a sequential set of instructions without branches.

Superblock Sched 

A simplified form of trace scheduling which does not attempt to merge control flow paths at trace "side entrances". Instead, code can be implemented by more than one schedule, vastly simplifying the code generator.
Cyclic Scheduling 
Software pipelining
Take loops, unroll them, find repeating patterns and re-roll them. Modulo constraint : the initiation interval is limited by recurrences or resource constraints. 
>> wiki : Instruction scheduling
Register Allocation
register allocation first?
+ adequate registers more important than optimizing schedule. 
+ shorter schedules > higher register pressure. 
- Antidependences and output dependences 
prepass scheduling (sched before reg allocation)
graph coloring
best : schedule first, register allocate, schedule again. 
 


T-Y. Yeh and Y. Patt, Two-level Adaptive Training Branch Prediction, Proc. 24th Annual International Symposium on Microarchitecture, November 1991, pp. 51-61. ACM DL Link


Gurindar S. Sohi, Scott E. Breach, and T.N. Vijaykumar, Multiscalar Processors, Proc. 22nd Annual Symposium on Computer Architecture, June 1995, pp. 414-425. ACM DL Link


Static program > Control Flow Graph. 
Dynamic program execution > walking thru the prog CFG, generating a dyn seq of basic blocks. 

Speculation : To achieve high ILP, loads are speculated (if stored by a predecessor task). 
1) control speculation
2) data

Multiscalar programs 
The actual code
details of the structure of the CFG
communication characteristics of individual codes 
task descriptor : what registers a task may produce (create mask) : conservative. 
operate-and-forward instruction : one task done, can forward results to following processing units. 
dead memory value analysis to release registers/memory for future tasks

Hardware required :
Scheduler
Address Resolution Buffer : before data cache : holds speculative reads/writes
 
What happens to cycles ?
Non-Useful compute cycles 
Synchronization and data communication
Early validation of prediction required. 
No computation cycles
Inter and intra task dependencies
load balancing
 
 
How multiscalar differs ?
Branch prediction accuracy does not limit performance, because the compiler has an idea of the CFG; almost 100% prediction accuracy can be achieved for inter-task sequencing (forget inside-task prediction). 
n instructions => n^2 complexity to perform dependence cross checks (not quite necessary across tasks)
load stores can be whatever memory model inside tasks. 
 
vs multithreaded : multiple threads or loci of control which are control independent and (typically) data independent. 
Tasks executing on a multiscalar proc are related as different parts of a sequential walk through the same program, and are not control and data independent.

granularity of tasks matter

Summary
exploiting fine-grain ILP
combination of hw/sw
dividing the program CFG into tasks, stepping thru the CFG speculatively, taking large steps, a task at a time, without pausing to inspect the contents of a task. 
processor complex uses multiple PC to sequence through different parts of the CFG simultaneously.
 




Subbarao Palacharla, Norman P. Jouppi and J. E. Smith. Complexity-effective Superscalar Processors, Proc. 24th Annual International Symposium on Computer Architecture, June 1997, pp. 206-218.



paper looks at complexity (in delay of critical path)

logic associated with issue window and data bypasses cause max delay

R10000 and DEC 21264 : register commit, active list
PowerPC, HP PA-8000, HAL SPARC 64 : rob holds non-committed renamed register values. 
> diff : size of register file.

IW == issue width

checkpointing rename logic 
     RAM : logical reg to phy reg, direct lookup 
          Trename = Tdecode + Twordline + Tbitline + Tsenseamp
          Tdecode, Twordline, Tbitline = c0 + c1 x IW + c2 x IW^2
     CAM : one entry per phy reg holding the logical reg it maps; rename is an associative search 

Wakeup logic
     issue window > CAM array holding one instr per entry. 
     Delay = Ttagdrive + Ttagmatch + TmatchOR > all grow quadratically (with window size and issue width).

Selection logic
     Tselc = c0 + c1 x log4(WINSIZE)

Data Bypass Logic
     S pipestages > (2 x IW^2 x S) bypass paths (for 2-input func units)
     datapath and control
     wire delay = 0.5 x R x C x L^2 (L = length of wire)
Even if alternative layouts are used, the quadratic delay growth of bypasses remains
reg file/bypass within a cluster is single cycle, across clusters multi-cycle.
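
To make the scaling above concrete, here is a minimal back-of-the-envelope sketch in C. The c0/c1/c2 constants and the window-size-per-issue-width relationship are invented placeholders, not the paper's fitted technology values; only the functional forms come from the notes above.

#include <stdio.h>
#include <math.h>

/* Rough sketch of how the delay terms scale with issue width (IW) and window
   size; all constants are made-up placeholders. */
static double quad(double c0, double c1, double c2, double x) {
    return c0 + c1 * x + c2 * x * x;
}

int main(void) {
    for (int iw = 2; iw <= 8; iw *= 2) {
        int winsize = 8 * iw;                                   /* assumed window size */
        double rename = quad(1.0, 0.5, 0.05, iw);               /* Tdecode/Twordline/Tbitline ~ c0 + c1*IW + c2*IW^2 */
        double wakeup = quad(1.0, 0.4, 0.08, iw)
                      + quad(0.5, 0.2, 0.04, winsize);          /* tag drive + tag match, quadratic terms */
        double select = 1.0 + 0.6 * (log(winsize) / log(4.0));  /* Tselc = c0 + c1*log4(WINSIZE) */
        double wirelen = 2.0 * iw;                              /* bypass wire length grows with datapath width */
        double bypass = 0.5 * 0.1 * 0.2 * wirelen * wirelen;    /* 0.5 * R * C * L^2 */
        printf("IW=%d rename=%.2f wakeup=%.2f select=%.2f bypass=%.2f\n",
               iw, rename, wakeup, select, bypass);
    }
    return 0;
}
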
     
>> biggest problems > window logic and bypasses

>> Atomic operations
     wakeup and select
     data bypassing

New Microarch defn
     dependence based microarchitecture
          issue window > number of FIFOs. 
          inorder issue of dependent instructions
          bypass is propagated to the head of the instr queue
          
     FIFO entry : 
          all operands of I are ready : steer to a new empty FIFO
          1 outstanding operand : steer to the FIFO whose tail is the producer of that operand, else a new FIFO
          two outstanding : try both operands' producer FIFOs (sketched below).
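
A minimal sketch of that steering heuristic, assuming a small table of FIFO tails; the structure and function names are my own illustration, not the paper's hardware.

#define NFIFO 8

typedef struct { int tail_dest; int occupied; } fifo_t;  /* tail_dest = dest reg of the instr at the tail */

static fifo_t fifos[NFIFO];                              /* zero-initialized: all FIFOs start empty */

static int first_empty(void) {
    for (int i = 0; i < NFIFO; i++) if (!fifos[i].occupied) return i;
    return -1;                                           /* none free: stall dispatch */
}

/* Pick a FIFO for an instr writing `dest`; readyN says whether source N is available. */
int steer(int dest, int src1, int ready1, int src2, int ready2) {
    int target = -1;
    if (ready1 && ready2) {
        target = first_empty();                          /* all operands ready -> new empty FIFO */
    } else {
        int pending[2], n = 0;
        if (!ready1) pending[n++] = src1;
        if (!ready2) pending[n++] = src2;
        for (int k = 0; k < n && target < 0; k++)        /* try the FIFO whose tail produces the operand */
            for (int i = 0; i < NFIFO; i++)
                if (fifos[i].occupied && fifos[i].tail_dest == pending[k]) { target = i; break; }
        if (target < 0) target = first_empty();          /* otherwise fall back to a new FIFO */
    }
    if (target >= 0) { fifos[target].tail_dest = dest; fifos[target].occupied = 1; }
    return target;
}
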

     clustering based dependent 
          group issue and exec units into clusters (these can be FIFO- or window-based)
          PEW : parallel execution windows     

Conclusions : 
     parameters that affect delay > issue width and issue window size.
     data bypass logic
     may affect in future > register files and caches


Andreas Moshovos, Scott E. Breach, T. N. Vijaykumar, Gurindar S. Sohi, Dynamic Speculation and Synchronization of Data Dependences. ISCA 1997: 181-193. ACM DL Link



> predict if the execution of an instruction is likely to result in a data dependence mis-speculation
> provide sync needed to avoid mis-speculation. 

Control speculation and data speculation

data > value and dependence speculation.

With the small instr window sizes of modern dynamically scheduled processors, the probability of mis-speculation is small, so the perf loss due to erroneous data dependence speculation is small. This paper : what if the instr windows are large?

> predict instr whose immediate exec is going to violate true data dependence
> delay the execution of those instr as long as it is nec to avoid the mis-spec. 

Dependences > unambiguous or ambiguous. If 2 instr linked via an ambiguous dependence resolve to no dependence => speculation is okay. 
Problem is most acute for production and consumption of data through memory; this paper only looks at that.

a load is allowed to execute before a store on which it is ambiguously dependent (normally processors do this blindly). 
As window sizes grow larger, minimizing the net cost of mis-speculation becomes important. 
     min the amount of work lost due to misspec
     reduce the time required to redo the work that is lost on misspeculation
     reduce the prob of mis-spec. 

to min the impact 
     predict if the immediate exec is likely to violate true data dependence
     predict the store/stores that the load depends upon
     enforce synchronization between the dependent instr.

selective data dependence speculation > choose which loads should be speculated 

static store-load instr pairs that cause most of the dynamic data mis-speculations are relatively few and exhibit temporal locality > build the association, e.g. save the address accessed, or use the dependence edge as a handle : PC of the store and load. But a static dependence between a given store-load pair > multiple dynamic dependences. 

differentiate dynamic instances of the same static dependence > a tag for each instance; the dependence distance can be used

Implementation 
     Memory dependence prediction table
          valid flag, LD PC, ST PC, DIST, optional prediction
     Memory dependence synchronization table
          Valid, LD PC, ST PC, LD ID, ST ID, INST, full/empty flag (conditional variable).
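
A sketch of the two table entries as C structs, built from the fields listed above; the field widths and the optional predictor encoding are my assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

typedef struct {            /* Memory Dependence Prediction Table (MDPT) entry */
    bool     valid;
    uint32_t ld_pc;         /* PC of the load */
    uint32_t st_pc;         /* PC of the store it mis-speculated against */
    uint16_t dist;          /* dependence distance, to tell dynamic instances apart */
    uint8_t  prediction;    /* optional prediction state (e.g. a small counter) */
} mdpt_entry_t;

typedef struct {            /* Memory Dependence Synchronization Table (MDST) entry */
    bool     valid;
    uint32_t ld_pc, st_pc;
    uint32_t ld_id, st_id;  /* identify the specific dynamic load/store instances */
    uint32_t inst;          /* instance tag derived from the dependence distance */
    bool     full;          /* full/empty flag used as a condition variable:
                               the load waits while empty, the matching store sets it full */
} mdst_entry_t;
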

Issues
     Intelligent prediction required
     Incomplete synchronization
          avoid deadlock
          free MDST entry allocated for sync that will never occur (pipeline flush say)
     Misspeculations
          squash MDST entries. 
     Multiple Table Entry matches
          discard all but most frequent mis-speculations. 
     
Conclusions
     larger dynamic window sizes > net performance loss due to data dependence mis-speculation is higher.
     min work lost, min time required to redo, improve spec accuracy
     static data dependencies responsible for majority of mis-speculations are few and dynamically exhibit temporal locality.
     dynamic dependence prediction and synchronization in hardware

Srikanth T Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton, Continual flow pipelines. ASPLOS 2004. ACM DL Link



High system throughput
     Many cores per chip
     constant chip size and power envelopes > many large cores cannot fit on a single chip / small cores traditionally do not provide high single-thread performance

CFP : efficiently manages cycle-critical resources > continues processing instr even in the presence of a long-latency data cache miss to memory.

CFP does : 
     drains out the long-latency-dependent slice (along with ready source values) while releasing scheduler entries and registers
     reacquires these resources upon reinsertion into the pipeline
     integrating results of independent instr without their re-execution

     splits instr into those dependent on the miss and those free to go

decoupling critical resources 
     Scheduler 
          take out L2-miss-dependent instr and store them in a small first-in, first-out Slice Data Buffer
     Register file
          Completed source register : instr that have executed (to be read by slice) > save it in the FIFO
          Dependent destination registers : destination operand of slice instr 

key actions
     identify and drain the slice
     process and retire indep instr
     exec the slice when miss returns
     merge results of indep instr and slice instr > not difficult, the instrs are independent anyway

Perf
     isolated cache misses handled effectively
     memory level parallelism by going ahead after a cache miss



Wen-Hann Wang, Jean-Loup Baer, and Henry M. Levy. Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy, ISCA 1989. ACM DL Link




Virtually addressed cache
     For rapid cache access

Problems
     synonyms (different virtual addresses > same physical address)
     addr translation anyway required for a miss
     cache coherence gets complicated.
     I/O addressed physically
     context switch needs invalidation

Write through for the first level (coherence is simplified) and write back for the second level preferred. 

Writes tend to occur in bursts (in procedure calls) > use write buffers (lots needed) > cache coherence becomes a problem > use write back. 

reverse translation table for detecting synonyms (at the second level). On a v-cache miss > go down > hit > check if a synonym exists > invalidate the old one and copy over under the new virtual address

for using virtual addressing : 
     + memory bandwidth to processor
     -  write backs come in bursts when a context switch occurs (paper suggests dirty dirty bit).

for physical : 
     cache coherence simple
     synonyms headache avoided.
     
     says hit rates for combined I & D cache comparable to split. 

Bruce Jacob and Trevor Mudge. Virtual Memory on Contemporary Processors, IEEE Micro, vol. 18, no. 4, 1998. IEEE Xplore link



Classic Memory Management Unit > Translation look-aside buffer (cache of page table entries), an FSM that walks the page table 

fully associative TLB > bad timing; set associative popular (power consumption/clock speed)

problem with software-controlled TLBs > the OS is typically written generically > not tuned to the hardware on which it operates > unlikely to live up to its potential performance. 

MIPS <<<<
OS handles TLB misses. software fills the TLB, OS defines TLB replacement policy.

64 bit > top 2 bits select user, supervisor and kernel spaces (in 32-bit R2000/R3000 : bit 31 : user/kernel, bits 30:29 : cached/uncached and mapped/unmapped regions). 
virtual address extended with an address space identifier (ASID) (6 bits = 64 processes). 

TLB filled randomly (TLBWR instruction) or the OS tells which slot to replace (TLBWI).

N noncacheable, D dirty (writable) V valid, G global (TLB ignores ASID match > shared memory). 

ALPHA <<<<
Software managed, split TLB/cache.
Not the OS but PALcode (Privileged Architecture Library code). Not-most-recently-used policy for replacement. 
Address space match (ASID indep match). 
Granularity Hint : 2 bits > 8^GH pages = 1 super page. 

PowerPC <<<<
inverted page table  (scalability : page table size tracks physical memory, not virtual)
paged segmentation : TLBs and page table map an extended virtual address space, not application's effective address space. 
segment = 256MB contiguous region of virtual space. 
>> basically things are not contiguous. Each process sees the same 4 GB effective address space, but as segments selected through segment registers (pointers into a much larger virtual space). 
no ASIDs : the OS can instead modify protection on segment registers.
Superpage > block address translation mechanism : BAT takes precedence over TLB when BAT signals hit. 
     BAT has only OS/User priv distinguish > if context switch happens and no sharing is desired, explicitly flush the BAT.

PA-RISC <<<<
segmentation like PowerPC.
Protection ID of the running process compared with the access ID of the virtual page. 
user-level access to space registers > processes can change their address space > additional protection mechanism required. 
each page has an associated protection ID > 2 processes can share parts of their space (not all-or-nothing).

UltraSPARC <<<<
ASIs (vs ASIDs) > don't identify contexts, but identify data formats and privileges. 

IA-32 <<<<
amalgamation of several techniques.
Pentium II
segmented, with no explicit ASIDs. 
segmentation is just not used by OSs; they typically flush the TLB on context switch to provide protection. 
Caches are physically indexed and tagged > no flush on context switch
TLB miss : hardware walks the page table (from CR3 : context page table root). 
sharing : duplicating mapping information across page tables. 
1 byte to 4GB segments : size set by software and can differ for every segment. 
segment descriptor : privilege, granularity (size), etc. 

-----
virtual vs physical caches
protection methods (ASIDs, segments, multiple IDs)
address-space organization (segmented/flat, page table support, superpage support)

virtual cache > data consistency issues managed by OS. 
protection mechanism > must fit the OS's model of object allocation : allocate and destroy large/small objects, 

best of the lot : hardware-walked, inverted page table. 

Vinod Cuppu, Bruce Jacob, Brian Davis, and Trevor Mudge, A performance comparison of contemporary DRAM architectures, ISCA 1999. IEEE Xplore link



Summary paper of commercial architectures. 

Observations
     One time tradeoff between cost, bandwidth and latency
          Latency min > ganging together multiple DRAMs into a wide structure
          (Page mode, interleaving etc.,)
     Widening buses will present new optimization opportunities
          (Locality)
     Buses wide as L2 cache yield best mem latency, but they cannot halve the latency of a bus half as wide. 
          N/2 is a good design choice
     Critical-word first does not work well with burst mode
     Choice of refresh mechanism can alter ave mem latency.

Conventional : 
     Muxed row address, column address, strobes, data out
FPM Fast page mode DRAM
     row address, column, dataout, column, dataout, column, dataout
Extended Data Out DRAM
     dataout happens along with next column address (understand!)
Sync DRAM
     sync
Extended Sync
Rambus DRAM : 
     Banked 4, > 4 rows remain open with each row address. 
Direct Rambus
     1.6Gbytes/s. 17 half-row buffers > 16 full row buffers (Area consideration). 

Conclusions
> address bandwidth problem but not latency problem
> bus speed >> latency
> Locality exists even to DRAM accesses
> use this for future speedups
presently > page mode and internal interleaving to achieve one time performance boost. 


Changkyu Kim, Doug Burger, Stephen W. Keckler: An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS 2002. ACM DL Link


Wire Delay dominated cache > overall access latency is high. 

Design space for Caches
     Mapping    > how many addressable banks and how lines are mapped
     Search      > set of possible places a block can be?
     Movement > always in the same bank?

Structures considered
UCA : Uniform Cache Access.
ML-UCA : multilevel UCA
S-NUCA-1 : mapping predetermined based on block index.
S-NUCA-2 : 2-D switched logic instead of private per bank channels > large number of smaller faster banks.
D-NUCA : each bank form one way of a set, frequently accessed data in the closest set. 

Eval
UCA Contention > bank contention / channel contention. 
S-NUCA-1 : Private Channels > each bank can be accessed indep at high speed. 
S-NUCA-2 : Switched Channels > no large wires. lightweight, wormhole routed 2D mesh 
Dynamic NUCA : spread sets : each set is spread across multiple banks (one way per bank). 
     mapping : simple (start below, go up every bank), fair (distance from focus equal), shared(closest banks split to everyone). 
     Locating : incremental search : closest bank searched first
                    multicast search : some or all banks searched (happens in ||)
                    limited multicast search : first M of N banks.
partitioned multicast : bank set is broken down into subsets of banks.
     search :   partial tag comparison : smart search : ss-performance, ss-energy.
     dynamic movement : generational promotion > on each hit, the block is swapped with the one in the next bank closer to the cache controller. 
     insertion : which location to fill in / replace (zero copy/one copy policy)
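
A toy model of incremental search plus generational promotion within one D-NUCA bank set (one way per bank); names and sizes are illustrative only.

#include <stdint.h>

#define NBANKS 8                                  /* bank 0 is closest to the cache controller */

typedef struct { int valid; uint64_t tag; } line_t;
static line_t bankset[NBANKS];

/* Returns the bank that ends up holding the block after the access, or -1 on a miss. */
int dnuca_access(uint64_t tag) {
    for (int b = 0; b < NBANKS; b++) {            /* incremental search: closest bank first */
        if (bankset[b].valid && bankset[b].tag == tag) {
            if (b > 0) {                          /* generational promotion: swap with the */
                line_t tmp = bankset[b - 1];      /* next-closer bank so hot lines migrate */
                bankset[b - 1] = bankset[b];      /* toward the controller, one hop per hit */
                bankset[b] = tmp;
                return b - 1;
            }
            return b;
        }
    }
    return -1;                                    /* miss: the insertion policy picks the fill bank */
}
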


Viji Srinivasan, Davidson, E.S., Tyson, G.S., "A prefetch taxonomy," Computers, IEEE Transactions on , vol.53, no.2, pp. 126-140, Feb 2004. IEEE Xplore link


Prefetching : access time reduction
     - mis-prediction => increased mem traffic / cache pollution
     - is it early enuf? >> the address might be right, but the data might still arrive late 
     - if too early, the prefetched line might pollute the cache: it might replace something useful / may itself be replaced before use. 

metrics to evaluate : achieved performance (number of misses, total traffic and IPC), coverage and accuracy
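
The two metrics computed the way they are commonly defined (my phrasing; the paper builds a finer-grained taxonomy on top of them): accuracy = useful prefetches / prefetches issued, coverage = misses removed / misses without prefetching.

#include <stdio.h>

int main(void) {
    double issued = 1000, useful = 620;     /* hypothetical prefetch counts */
    double base_misses = 900;               /* demand misses with prefetching disabled */
    printf("accuracy = %.2f\n", useful / issued);
    printf("coverage = %.2f\n", useful / base_misses);  /* assumes each useful prefetch removes one demand miss */
    return 0;
}
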

Arvind, Rishiyur S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. Computers 39(3): 300-318 (1990). ACM DL Link


Programming languages > bad parallelism

     loss of determinacy > significant complexity to establishing correctness

     explicit parallelism management > boring
hardware : von Neumann 

     long memory and communication latency > happens in ||

     synchronization?

     sequential scheduling (PC based)


Tagged-Token Dataflow Architecture


Id : high level language with fine grained parallelism implicit in operational semantics


A good parallel programming lang should

     insulate programmer from no_of_procs, topology/speed of comm network

     identify parallelism auto

     must be determinate (if algo is then program is)


problem with dataflow > loops and recursions > use contexts

W. W. Hwu, R. E. Hank, D. M. Gallagher, S. A. Mahlke, D. M. Lavery, G. E. Haab, J. C. Gyllenhaal, and D. I. August. Compiler Technology for Future Microprocessors. Proceedings of the IEEE 83(12), December 1995. IEEE Xplore link



Paper talks about : 1995
Critical components of the technology which deal with the difficult problems encountered when compiling programs for a high degree of ILP.  [ In particular : predicated execution support and dependence info maintenance ]

Compilers improve program exec speed by
     eliminating unnec instr
     putting the most-used stuff in registers (reduce memory accesses). 
     Making use of the hardware parallelism as required.

To parallelise a program (ILP compilation)
     Dependency extraction
     ILP optimizations
     Core Scheduling (reorder the instructions) 

Definitions (and why)
basic blocks : control in control out block
Control flow graph : control dependence
Register data dependence :  flow dependence (RAW), anti-dependence (WAR), output dependence (WAW); each can be loop-independent (within the same iteration of a loop) or loop-carried (across the loop boundary).
Memory data dependence :  loop-independent memory anti-dependence : unnecessary waiting inside the loop for fear of overwrite.
loop-carried memory anti-dependence. memory input dependence : 2 instr read from the same memory location (can remove redundant loads)

Go beyond basic blocks : Scheduling

Superblock : formed from the trace by duplicating blocks to remove merge points (side entrances) and their associated complications from the trace.
reorder instr within the superblock : global acyclic (in the control flow graph) scheduling.
Speculative code motion  : move instructions above branches.

Techniques : register renaming / loop unrolling / software pipelining (cyclic scheduling). 
for i = 1 to bignumber { A(i); B(i); C(i) } >> overlap iterations in software: in the steady state issue A(i+2), B(i+1), C(i) together (see the sketch below). 
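
A hand software-pipelined version of that loop, a minimal sketch assuming A, B and C only depend on each other through the same element i and that n >= 2.

extern void A(int i), B(int i), C(int i);

void pipelined(int n) {
    A(0);                         /* prologue: fill the pipeline */
    A(1); B(0);
    for (int i = 2; i < n; i++) { /* kernel: one stage from each of three iterations per pass */
        A(i); B(i - 1); C(i - 2);
    }
    B(n - 1); C(n - 2);           /* epilogue: drain */
    C(n - 1);
}
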

Predicated Execution
Transformation of conditional branches into predicates : control dependences > data dependences. 
>>> Predicated exec allows the compiler to trade instr fetch efficiency for the capability to expose ILP to the hardware along multiple execution paths. 
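
If-conversion in miniature: the first version has a control dependence on (a < b); the second turns it into a data dependence through the predicate p, which a predicated ISA can map onto guarded/select instructions. Plain C, just to illustrate the transformation.

int branchy(int a, int b, int y, int z) {
    int x;
    if (a < b) x = y;       /* control dependence: which assignment runs depends on the branch */
    else       x = z;
    return x;
}

int predicated(int a, int b, int y, int z) {
    int p = (a < b);        /* the predicate */
    int x = p ? y : z;      /* both paths become straight-line code guarded by p */
    return x;
}
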

Hyperblock : connected basic blocks where control may only enter at the first block, but may exit anywhere. 
(tradeoff : branch performance with resource usage, hazard conditions and critical path lengths).

Branch combining replaces a group of exit branches with a corresponding group of predicate define instr with OR-type semantics; once you reach the single combined exit branch, find out why you got there and jump to that exit destination. 

Loop peeling : compiler peels away the first several iterations of a loop and combines them with the outer loop. 
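
A tiny sketch of peeling one iteration (a hypothetical body() stands in for the loop contents): the special first iteration is pulled out so the remaining loop body is uniform and can be merged/scheduled with surrounding code.

extern void body(int i);

void peeled(int n) {
    if (n > 0) {
        body(0);                        /* peeled first iteration */
        for (int i = 1; i < n; i++)     /* remaining iterations, now free of the i == 0 case */
            body(i);
    }
}
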

Control height reduction : shortens the dependence chain length needed to compute the execution conditions of instr 

 


Jerry Huck, Dale Morris, Jonathan Ross, Allan Knies, Hans Mulder, Rumi Zahir. Introducing the IA-64 Architecture. IEEE Micro, vol. 20, no. 5, pp. 12-23, Sep./Oct. 2000. IEEE Xplore link

Intel-HP Itanium architecture

Instr encoding
     128 registers => 7 bits to specify each of 3 reg operands
     128-bit bundle encoding : 3 instr (3 x 7-bit reg specifiers each) + 5-bit template > marks starts and stops of groups of instr that can exec in parallel, helps decode and route the instr

Sequential semantics : compiler knows things commit in order
IA-64 : parallel semantics : 
     ILP : instruction group (safely exec in parallel)
     Control flow parallelism : 
          combine something like if ((a==0) || (b<=5) || (c!=d)) then r8 = 4 into one parallel instr group
          used specifically in multiway branch pred
     predication 

     Control Speculation
           speculative load (ld.s) > propagate the possible exception to all dependent instr; a chk.s at the very end finds out if the speculation is okay or branches to fix-up code
     Data Speculation
          because of pointers, data speculation is pretty difficult
          use speculation similar to above : load speculatively (ld.a), store the address and size in an Advanced Load Address Table (ALAT). 
          subsequent stores check this table for a possible match and remove matching entries; at the check (chk.a), if the entry is still present the speculation succeeded, else branch to fix-up code.
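
A C-level illustration of what the advanced load plus check buys the compiler. The ALAT bookkeeping is simulated here with a plain pointer comparison, which is an assumption for illustration only, not the real hardware mechanism.

int hoisted_load(int *p, int *q, int a) {
    int x = *p;     /* advanced load (ld.a): hoisted above the possibly-aliasing store,
                       the address of p is recorded in the ALAT */
    *q = a;         /* the store snoops the ALAT and removes a matching entry */
    if (q == p)     /* chk.a: if the entry is gone, fall into fix-up code */
        x = *p;     /* fix-up: redo the load (and anything that consumed it) */
    return x;
}
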
     Register model
          First 32 are static, the remaining registers are stacked (and can rotate). 
          when you enter a procedure, alloc local and output regs; on exit the register renaming reverts.
          when overflow happens, the RSE (Register Stack Engine) spills regs to mem so a restore can take place later.
     Software pipelining 
          loop-level parallelism
          Loop count register, epilog count
          rotating registers help, predicates

Virtual memory
     core of the OS's multitasking and protection mech
     because of 64 bits (16 billion GB), exhaustive hardware translation structures are not feasible
     Regions 
          on context switch, OS only flushes exit region
     Protection Keys 
          page granular control (shared between OS/user kinds)
     multiple page sizes 
     

John Goodacre and Andrew N. Sloss, Parallelism and the ARM instruction set architecture. Computer, July 2005. IEEE Xplore


Typically : Performance and Efficiency methods
     Variable execution time
     subword parallelism
     DSP (Specialization)
     TLP, exception handling
     multiprocessing

Variable execution time
     multiple loads/stores on single instr
          prologue and epilogue of subroutines
          code density

Inline barrel shifter
Conditional execution (predication)
16-bit thumb instr set (read separately!)

Data Level Parallelism
     sub-word SIMD (divide the 32 bits into 4x8-bit / 2x16-bit lanes and operate in parallel)
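
A software (SWAR) sketch of the same idea: add four packed 8-bit lanes inside one 32-bit word without carries crossing lane boundaries, which is what the ARM packed-arithmetic instructions do in hardware.

#include <stdint.h>

uint32_t add4x8(uint32_t a, uint32_t b) {
    uint32_t low  = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);  /* add the low 7 bits of every lane */
    uint32_t high = (a ^ b) & 0x80808080u;                  /* top bit of each lane, added without carry-out */
    return low ^ high;                                      /* per-lane modulo-256 sum */
}
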
Thread Level Parallelism
     ? Improve exception handling
     increases complexity in interrupt handler, scheduler, context switch
     > Special instructions : 
          CPS : Change processor state
          RFE : Return from exception
          SRS : Save return state
Multiprocessor Atomic instructions
     LDREX (load exclusive) / STREX (store exclusive)
     >> Physically tagged cache over virtually tagged : 20% improvement in overall performance.
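
A minimal spin lock built on a compare-and-swap in C11 atomics; on ARMv6+/ARMv7 the compiler lowers the CAS to an LDREX/STREX retry loop. A sketch, not a tuned lock.

#include <stdatomic.h>

typedef atomic_int spinlock_t;

void spin_lock(spinlock_t *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(l, &expected, 1)) {
        expected = 0;                                  /* CAS overwrote it with the observed value */
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;                                          /* read-only spin to cut exclusive-access traffic */
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
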
Instruction level parallelism

2004 : cost-performance-through-MHz wall
Why multicores ? 
     High MHZ > costly
     ILP is complex and costly (Extracting)
     Programming multiple independent processors > non portable and inefficient
ARM 11 : multiprocessor 
     Generic interrupt controller
     Snoop Control Unit

Enhanced atomic instructions
Lock-free synchronization > wake-up/sleep spin locks
CPU number and context registers with privileges
Weakly ordered memory consistency
     wmb() > write memory barrier
     rmb()  > read mem barrier
     DSB() > drain store buffer
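
A sketch of why the barriers matter, written with C11 fences (the kernel macros above play the equivalent role): the producer orders the payload store before the flag store (wmb-style), the consumer orders the flag load before the payload load (rmb-style).

#include <stdatomic.h>

static int payload;
static atomic_int ready;

void producer(void) {
    payload = 42;                                            /* plain data store */
    atomic_thread_fence(memory_order_release);               /* ~ wmb(): keeps the data store ahead of the flag */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);  /* publish the flag */
}

int consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                                    /* wait for the flag */
    atomic_thread_fence(memory_order_acquire);               /* ~ rmb(): keeps the data load after the flag */
    return payload;                                          /* guaranteed to see 42 */
}
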
    
SMP performance
     Cache Coherence (SCU : at CPU speed)
     Inter processor communication (Software initiated interprocessor interrupt, async)
     Load Balanced interrupt handling
     SCU : 
          copy of physical address tag for fast access
          migratory line > if a line moved from shared to modified and another processor requests it, it's assumed that processor will eventually write it, so it's moved directly to M. Cool for locks and stuff



Richard M. Russell. The Cray-1 Computer System, Communications of the ACM, January 1978, pp. 63-72. ACM DL Link


1978

Vector processing engine
     128 MFLOPS! cool. 
     Cylindrical chassis to keep wiring distances small

Banked memory

Registers
     A > Address(24 bit), B > cache for A
     S > Data (64 bit), T > cache for S
     V > Vector, VM > vector mask
     Supporting : VL > vector length, VM, PC, base address, limit address, flag, mode, and exchange address registers

Chaining
     result of one func unit as input to other 
     > bypass basically
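
A minimal loop of the kind chaining speeds up (plain C for illustration): the multiply unit's results stream straight into the adder element by element, instead of the add waiting for the whole multiply vector to finish.

void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* vector multiply chained into vector add, then store */
}
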
Interrupt 
     stop PC inc, complete mem operations, vector proc
     call exchanges by OS to handler

Software/Vector processing
     Cray OS, Fortran compiler
     2 or 4 or more elements in vectors > better than scalar (earlier 100s required)
     complex vector routines > scalar loop calls
     array increments = integer, start point and array. 

Front End 
     at least a minicomputer, connected by a Cray link

Cooling problems
Circuit boards
     make wires equal length, else standing waves appear on the ground line; prop delay balancing > purely resistive load. 
     5 layer PCB


Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996, pp. 28-40. IEEE Xplore link

Look at the paper for figures. This paper is the coolest, in that it pretty much explains all that goes into an out-of-order processor in enuf detail (hardware-wise). You should read this paper fully 100 times. 

Specs
     Fetch and decode 4 instr per cycle. 
     speculative exec with a four-entry branch stack. 
     dynamic out-of-order exec
     register renaming (map tables)
     in-order graduation (precise exceptions)
Exec units Pipeline
     Nonblocking load/store unit. 
     dual 64-bit int ALU
     FPU
     pipelined adder with 2 cycle latency
     multi with 2 cycle latency
hierarchical, nonblocking mem subsystem 
     on-chip 2-way associative primary caches
     external 2-way assoc secondary cache
     64-bit multiprocessor system interface with split-transaction protocol


Pipeline
Stage 1 : Fetches and Aligns (4 instr)
Stage 2 : Decode and rename (target address for branch and jump)
Stage 3 : Move to Queue and check if all operands ready (Wait here till ready)
Stage 4-7 : Exec 
Int and FP have independent register files. 
Instruction Fetch 
fetch and decode at higher bandwidth than exec
keep the queues full.
and many instrs are discarded when mispredictions happen.
16-word cache line => fetch 16 instr, aligner picks 4, in 0,1,2,3 order with a rotate (minimizes dependency logic)
Branch Unit
addr[11:3] > 512 entry branch history table, 2-bit algo. 
87% on SPEC92.
delay slot instr executed. No advantage in OoO, but backward compatibility.
Branch Stack
> Saves state in a 4-entry branch stack. 
(i) Alternate branch address
(ii) complete! copies of integer and fp map tables.
(iii) misc control bits
Branch Verif
Fetching along mispredicted paths > unneeded cache refills! It's kind of okay because eventually these might be used.
4-bit branch mask accompanies each instr (in the queues and exec pipes), marking pending verifications. If mispredicted, abort.

Decoding Logic
Stops when queue/active list is full. 
Hi and Lo for multiply and divide => 2 register writes => exception handling is difficult.
control register modifications execute serially (mostly in the OS kernel). 
Register Mapping
dependencies : register operands, memory address, condition bits. 
logical register numebrs into physcial-register numbers. 
instruction order is taken care of. 
Each physical register is written exactly once after assignment from the free list. 
Until it's written, it's busy. 
When the new mapping graduates => the old mapping's register is freed. 
33 logical and 64 physical. 
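
A sketch of the rename bookkeeping described above: map table read, free-list pop, old mapping recorded for the active list. Sizes and function names are illustrative, not the exact R10000 structures.

#define NLOG  33
#define NPHYS 64

static int map[NLOG];                /* logical -> physical */
static int freelist[NPHYS], nfree;   /* unassigned physical registers */
static int busy[NPHYS];              /* set from assignment until the value is written */

typedef struct { int logical; int old_phys; } active_entry_t;   /* what graduation/rollback needs */

void rename_init(void) {             /* initially logical i maps to physical i; the rest are free */
    for (int i = 0; i < NLOG; i++) map[i] = i;
    nfree = 0;
    for (int p = NLOG; p < NPHYS; p++) freelist[nfree++] = p;
}

active_entry_t rename_one(int src1, int src2, int dest,
                          int *psrc1, int *psrc2, int *pdest) {
    *psrc1 = map[src1];              /* read current mappings for the operands */
    *psrc2 = map[src2];
    int newp = freelist[--nfree];    /* new physical register for the destination */
    busy[newp] = 1;                  /* busy until the instruction actually writes it */
    active_entry_t e = { dest, map[dest] };   /* old mapping goes to the active list */
    map[dest] = newp;
    *pdest = newp;
    return e;
}

void graduate(active_entry_t e) {    /* commit: the old physical register is freed */
    freelist[nfree++] = e.old_phys;
}

void rollback(active_entry_t e) {    /* exception rollback: undo the mapping */
    freelist[nfree++] = map[e.logical];
    map[e.logical] = e.old_phys;
}
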
Register Map table 
Reads the mappings for 3 operands
Writes the current operand mappings and the new destination mapping into the instr queues.
Active list saves the previous destination mapping. 
Free List
Unassigned physical registers. 
4 parallel lists (graduates 4 at a time), each an 8-deep circular FIFO. 
Active list
All instr currently active. 
Up to 32 instr can be active > 4 parallel, 8-deep circular FIFOs. 
5-bit tag that equals an address in the active list. 
exec complete => tags sent here and marked complete.
Also has : logical-destination register number and old physical register number. 
An instruction's graduation commits its new mapping. > Old physical register goes to free list.
Branch rollback > from the branch stack (mispredictions can be repaired early)
Exception rollback > undo the mappings from the active list (because you can't checkpoint at every instr)
Busy-bit tables
when register exits the free list, set bit. 
used for dependency between integer and floating point pipes. 
Instruction Queues
Integer Queue
     16 entries, no specific order
Address Queue
     Preserves program order (others need not)    
     Basically the load/store queue.
Two matrices (16 x 16) track dependencies between memory accesses. 
1: stop unnecessary cache thrashing (older instr obviously gets preference)
2: Load bypass : if pending store has the value. 
Sequential memory consistency (this part is in order). But external access to memory? If a jackass from outside writes into the memory after it's been read from the cache but before the register is updated, take a soft exception and restart from the load instr.
Store instr : writes into the cache precisely when it graduates from the active list.
Atomic instr : Load-Link, Store-Conditional pair. 
Floating Point Queue
Memory Hier
The usuals. Non blocking 2 level set associative cache. 
44-bit virtual address > 40-bit physical address. 
iCache has predecoded instructions (makes decoding easier later on) : 4 extra bits say which functional unit executes it, opcodes modified slightly. 
9 bit ECC and 1 bit parity. 

Timothy J. Slegel, et al. IBM's S/390 G5 Microprocessor, IEEE Micro, Mar/Apr 1999, pp. 12-23. IEEE Xplore link

S/390 G5 processor : 1999 
     it took 10 years for CMOS to reach the performance of the last bipolar system
     L2 ran at half speed (500 MHz core clock at 1.7V)
     full custom 

microarch
     not superscalar
     ESA/390 : older ISA :      numerous, relatively commonly used instr that require tens/hundreds of clk cycles to execute. 
                     not load-store
                     too complex : difficult to implement : decimal data instr, addressing modes, multiple address spaces, precise interrupts, VM emulation and 2 different floating point archs. 
     G5 uses millicode 




     L1 cache : buffer control element : the cache itself, cache directory, TLB, address translate logic
     I unit : instr fetch, decode, addr gen, queue of instr waiting
     E unit : exec, local working copy of registers
     R unit : recovery unit : checkpointed copy of the entire microarch state, timing facility

L1 cache is unified (instr, operand, millicode data) and is store-through. 2-way interleaved

>> 256 bytes : ideal cache line size : compromise between fetch time of the last byte and perf improvement

Absolute address history table : predicts the translated (absolute) address ahead of the TLB 
TLB : dynamic address translation and access register translation 
ART : access register translation, with an ART lookaside buffer (ALB) 

2-way set associative BTB, 2048 entries
Decimal unit for financial data 

Exception handler : 
     I unit tries to find out which instructions will cause an exception 
     does something called single-instruction mode : all previous instructions are cleared and this particular instr is sent thru alone
     normal speed detection : gross, pessimistic check
     single instr mode : actual precise check

Registers
     E unit has local copies
     R unit has a RAM-type master copy (architectural state) of the registers
     on commit, R unit is written (with ECC)
     R unit used for recovery (checkpoint)

Millicode : executing a complex instr becomes like a hardwired subroutine call
     > uses a completely indep set of registers 
     > also service functions : hardware error logs, scrubbing memory for correctable errors, supporting operator console functions and controlling low-level I/O operations
     
Virtual machine emulation
     hw support : 3 complete copies of all architected control registers, 3 copies of timing facility registers (host mode, first-level guest and second-level guest)

2 types of floating point

symmetric multiprocessor : memory : uniform access time. 

lots of reliability features
     error checking, parity, state checking, local duplication of control logic and so on. 20-30% of the logic is for error detection/correction!
     fully duplicated I unit and E unit. If the signals don't match, hardware error recovery is invoked.
     in an array of processors : a delete mechanism removes a failing processor completely.
     memory : redundant word lines to automatically replace defective sections (even at the customer site)
     Processor availability facility : scans out the latches from a check-stopped processor, stores them in a special area, the OS then resumes the work on another processor
     PAF + concurrent sparing => completely transparent to the customer. Impressive.


T. Mudge. Power: A first class design constraint. Computer, vol. 34, no. 4, April 2001, pp. 52-57. IEEE Xplore link

Power equations for CMOS
     P = A x C x V^2 x f + A x V x Ishort + V x Ileak
     fmax ∝ (V - Vt)^2 / V
     Ileak ∝ exp(-Vt / 35mV)

     
     reducing voltage has a significant effect on power consumption : P ∝ V^2
     reducing activity does as well : clock gating
     parallel processing is good if it can be done efficiently > if a computation can be split in two and run as two parallel independent tasks at half the rate (with lowered voltage) => power roughly cut in half, no perf loss! (sketched below)
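
A back-of-the-envelope use of the equations above: split the work over two units at half the rate each, lower the voltage accordingly, and compare dynamic power. All constants are arbitrary; only the proportionalities P ~ A*C*V^2*f and fmax ~ (V - Vt)^2 / V are taken from the notes.

#include <stdio.h>

static double pdyn(double A, double C, double V, double f) { return A * C * V * V * f; }
static double fmax(double V, double Vt)                    { return (V - Vt) * (V - Vt) / V; }

int main(void) {
    double A = 0.5, C = 1.0, Vt = 0.3;
    double V1 = 1.2, f1 = fmax(V1, Vt);          /* single unit at full voltage and speed */
    double V2 = 0.9;                             /* lower voltage that still supports f1/2 */
    double P1 = pdyn(A, C, V1, f1);
    double P2 = 2.0 * pdyn(A, C, V2, f1 / 2.0);  /* two units, each doing half the work */

    printf("fmax at V2 = %.3f (needs >= %.3f)\n", fmax(V2, Vt), f1 / 2.0);
    printf("single unit power = %.3f, parallel pair power = %.3f\n", P1, P2);
    return 0;
}
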

worry about
     peak power
     dynamic power
     total energy
     
     >> Measure MIPS/W
     >> or energy*delay

Tech for reducing
     Logic
          Clock gating
          Half frequency and half swing clocks
          async logic
     Architecture
          Memory systems
          buses
               gray code
          parallel processing and pipelining
     OS
          Voltage scaling
          scheduling
with High MIPS/W
     do scheduling such that applications don't complete well before their deadline.

Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor V. Zyuban, Philip N. Strenski, Philip G. Emma: Optimizing pipelines for power and performance. MICRO 2002: 333-344. IEEE Xplore link

Develop an analytical model to understand power and perf tradeoffs for super scalar pipelines.

contributions
    energy models for dynamic and leakage power to capture the scaling of different power components as a function of pipeline depth
    analytical perf model that can predict optimal pipeline depth
    cycle accurate detailed power-performance simulation (with sensitivity analysis of optimal pipeline depth).


someother paper : 7.7 FO4 = optimal pipeline depth for SPEC2000, 5.5 FO4 for traditional and Java/C++. (+latch insertion of 3 FO4).

P = total power of the pipeline, G = performance > energy-delay product ED = P/G^2; more generally the optimal depth satisfies d(BIPS^γ/W)/ds = 0.

Hold power primarily consists of power dissipated in latches, local clock buffers, global clock network, data independent fraction of arrays.
switching power : combo logic, data-dependent array power dissipation, switching factors.

Dynamic power = C x V^2 x f x (a + b) x CGF, where a = functional switching factor, b = glitching factor, CGF = clock gating factor.
Leakage power = V x Ileak

LatchScale = LatchRatio x (FO4 of base design / FO4 of logic per stage)^LGF
  LatchRatio adjusts hold power dissipation relative to total power; LGF = growth of latch count due to logic shape functions, typically > 1

Dynamic power increases more than quadratically with increased pipeline depth.

Metrics with less emphasis on performance (BIPS/W, BIPS^2/W) favor shallower pipelines; deeper pipelines give more perf. Optimal depth for BIPS^3/W is 19-24 FO4.


Using BIPS^3/W :
high LGF favors shallower pipelines, lower values of latch ratio favor deeper pipelines, low-power latches favor shallower pipelines, higher glitch factors favor deeper pipelines, and high leakage > deeper pipelines.

depends on workload characteristics.



Dan Ernst, et al., A Low-Power Pipeline Based on Circuit-Level Timing Speculation, MICRO 2003. IEEE Xplore link

Physical Design paper MICRO 03

A Low-Power Pipeline Based on Circuit-Level Timing Speculation

Background 
     Dynamic Voltage scaling
     Critical supply voltage < V  
     OCV to account for process variations 
     or delay chains (inverters) to measure it after chip manufacturing

Razor
     Shadow latch clocked by a delayed clock, guaranteed to hold the correct value
     compare the actual FF vs the shadow to detect an error 
     combination of architectural and circuit-level tricks
recovery > one cycle penalty.
     Do something like instruction stall/clock gating to account for that one cycle
     Only the flops affected at the PVT corner are replaced by Razor flops
     metastability handling > takes 2 cycles => pipeline flush required.
Recovery methods
     Clock gating
     Counter flow pipelining
    
Proportional control system 
     Voltage scaling based on measured (sampled) error rates

     


Shubhendu S. Mukherjee, Christoper Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin, "Measuring Architectural Vulnerability Factors," Top picks of 2003 in IEEE Micro, Nov/Dec 2003. IEEE Xplore link

Focus 
     Energy consumption
     Power dissipation
     Power delivery
Later
     Stuck at faults, single-event upsets (cosmic rays, 8% increase over generation)
     device degradation

Cause of variations
     random dopant variations (in channel > threshold variations)
     sub-wavelength lithography
     heat flux (dynamic variation)

tolerant design
     leakage invp speed (body bias to control)
     freq + active/leakage power distribution > synopsys
     deterministic to probabilistic /statistic 
     
burn-in > high voltage, high temp for short periods just after the chips are manufactured, to accelerate aging > faults can be caught fast

synthesis : 
     design (multivariate optimization) for perf, active and leakage power, reliability, yield and bin splits.
Physical : 
     Razor
Arch :      
     Reliability and security engine
     small checker hardware
     resilient hardware > multicore redundant


D. Burger et al. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer 2004. Volume: 37, Issue: 7. ACM DL Link

Cool paper

what drove comp arch
1970 : Memory was expensive
1980 : no of devices that could fit in chip  
now : MHz wall

4 major emerging tech characteristics
Pipeline depth limits > fine-grained concurrency mechanisms to improve performance
Power
on-chip communication dominant execution
polymorphism (multi function)

EDGE : Explicit data flow graph execution
Groups of instructions > hyperblock that can be run in parallel

Architecture + compile balance : 
RISC : Arch has to look for (reconstruct) dependencies (dynamic placement, dynamic issue)
VLIW : very difficult for compiler.  (static p, static i)
EDGE : static placement, dynamic issue  

Compiler's jobs
>> Large hyperblock formation
>> Predicated execution (just create a dependency on the first of the instructions; dependence-following will automatically block the rest)
>> Generating legal hyperblocks (device constraints)
>> Physical placement

Parallelism
Data level P is kind of obvious
TLP : 4 threads > evenly distribute exec blocks to threads


Shekhar Y. Borkar: Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro 25(6): 10-16 (2005). IEEE Xplore link

Soft-errors single event upsets

Special radiation hardened circuits
architectural redundancy
localized ECC

Soft error rate
     Silent Data corruption (SDC)
     Detected unrecoverable error (DUE)
Architectural vulnerability factor > what effect does a soft error have on a program's actual output?
     branch predictor's AVF = 0
ACE : Architecturally Correct Execution bits : bits whose corruption would damage the architectural state / program output
     Use Little's law to calculate the average number of ACE bits (conservatively) for a program
     unACE sources
          no-op
          performance enhancing instructions
          predicated false instructions
          dynamically dead code
          logically masked values

We translate Little's law as N = B x L, where N is the average number of bits in a processor structure, B is the average bandwidth of bits per cycle into the structure, and L is the average residence time of an individual bit in the structure.
Find AVFs using performance modeling and SPEC benchmarks
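
A worked use of N = B x L for one structure, plus the resulting AVF; all numbers here are made up for illustration.

#include <stdio.h>

int main(void) {
    double B = 4 * 32;              /* ACE bits entering per cycle (e.g. 4 instr x 32 ACE bits each) */
    double L = 10;                  /* average residence time of a bit, in cycles */
    double N = B * L;               /* Little's law: average ACE bits resident in the structure */
    double total_bits = 32 * 1024;  /* total bits the structure holds */
    printf("average ACE bits = %.0f, AVF = %.1f%%\n", N, 100.0 * N / total_bits);
    return 0;
}
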



Gabriel H. Loh, Yuan Xie, Bryan Black, Processor Design in Three-Dimensional Die-Stacking Technologies, In IEEE Micro, vol. 27(3), pp. 31-48, May-June, 2007. IEEE Xplore link

two natural topologies : face-to-face or face-to-back

copper-copper bonding process builds an interdie connection > die-to-die (d2d) or 3D via. 

d2d via pitch is significantly larger than an individual transistor.
     > size determines how finely processor blocks and funct units can be split across dies.
     > latency.
     > RC delay : 35% of a full stack of vias connecting metal 1 to metal 9. 

30% of pins used for power

high power density

face-to-face > requires through-silicon vias (TSVs) for I/O and power. Low inductance, so no problem.
     3D => 50% footprint of two 2D chips => 50% of the area for pins => power delivery into half the area
     ++ because wire loads are low : power demand goes down

thermals
     successive layers are farther away from the heat sink
     power density
     sim result : 2d processors ~ 3d config worst case temp!
     wire reduction 3D placement
     
Partitioning granularity (Layer 1 | Layer 2)
     Entire cores : CPU | L2 
     Functional block units : ROB | ALU
     Logic gates : Mux[31:16] | Mux[15:0]
     Transistor level : PMOS | NMOS

3D cache
     cores on one layer, cache on another
     2x cores + 2x cache on one, 2x core + 2x cache on other
     within cache
          stacked bit lines
          stacked word lines
          word lines have more delay than bit lines => stacking word lines provides lower overall latency.
          but for power => split the bit lines (they are much longer)

(1) Eliminating critical wires > latency + power reductions
(2) different partitioning strategies to match communication density of a given d2d via interface 
     d2d via pitch does not scale at the same rate as feature size.
(3) partitioning > power, performance, area.

Mixed process integration
     DRAM
     onstack DC to DC convertors
     decoupling capacitors

Cool places to use
     eliminate pipeline wires
     higher timing margins : clock skew/jitter is low
     better performance/watt ratio
     clock frequency improvements > most effective for power
     reduce the number of pipeline stages (wires)
     
NUCA : inherent problem : managing data that cores share.
     Dynamic NUCA
     3D version of L2 : 90% reductions in cache migrations

- - needs 3D place and route, floorplanning tools, 3D visualization and layout.
     a fault on a single layer > complete waste of the entire stack
     DFT > difficult in the presence of finely partitioned 3D structs; a die might have only 50% of a complete circuit. 


Taeho Kgil, David Roberts, Trevor Mudge, "Improving NAND Flash Based Disk Caches," ISCA 2008. ACM DL link


NOR and NAND flash differ in two important ways:

  • the connections of the individual memory cells are different
  • the interface provided for reading and writing the memory is different (NOR allows random-access for reading, NAND allows only page access)

NAND problems
     operations have to performed in pages
     erases have to be made in blocks
     erase before writing. 
     out-of-place write policy 

overcome with ECC as the flash degrades, and that's about it.

Garbage collection
     because you can only write pages and must erase whole blocks, you pretty much need to GC every now and then. At 80% utilization, 20% of the time goes in GC
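
A minimal sketch of out-of-place writes behind a page-level mapping table; block and page counts are arbitrary, and garbage collection is only described in the comment, not implemented.

#include <stdint.h>

#define PAGES_PER_BLOCK 64
#define NBLOCKS         128
#define NPAGES          (PAGES_PER_BLOCK * NBLOCKS)

static int32_t l2p[NPAGES];          /* logical page -> physical page, -1 if unmapped */
static uint8_t live[NPAGES];         /* which physical pages still hold valid data */
static int32_t write_point;          /* next free physical page (log-structured) */

void flash_init(void) {
    for (int i = 0; i < NPAGES; i++) l2p[i] = -1;
    write_point = 0;
}

void flash_write(int32_t lpage) {
    if (l2p[lpage] >= 0)
        live[l2p[lpage]] = 0;        /* out-of-place: the old copy just becomes garbage */
    l2p[lpage] = write_point;
    live[write_point] = 1;
    write_point++;                   /* when free pages run out, GC picks a victim block,
                                        copies its still-live pages to the write point,
                                        then erases the whole block */
}
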



W. Daniel Hillis and Guy L. Steele. Data Parallel Algorithms, Communications of the ACM, December 1986, pp. 1170-1183. ACM DL Link

Programming styles.
     Control parallel style (multiprocessing)
     Data Parallel

Connection Machine Architecture
     16384/65536 processors with 4k bits of memory each. 
     Front end processor (von Neumann style) and array of the above proc.
     Proc array extends the ISA of the front-end processor. 
     SIMD. 

Processors : 
     State bit > whether an instr will be executed.
     Context : the set of processors whose context flag (state bit) is set.
     Immediate values broadcast     
     ILLIAC IV, Goodyear MPP, Non-Von ICL DAP, CM > data parallel.

pointer-based communication > SEND instr : proc address + memory address. kind of indirect store. 

virtual processors > microcoded controller between frontend and array : helps program abstraction from hardware details.

1+ same program for different hardware

2+ processors may be regarded as expandable => Fortran preallocates all memory, but other prog languages > procedure call to allocate a "fresh" processor (e.g. malloc in C) => processor-cons > memory comes with a processor. 

Paper goes on and on about algorithm adaptations to data parallel archs, and yes they can be done. 
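
One of those adaptations is the log-step partial-sums ("scan") idea, simulated sequentially below: at step d, every virtual processor i adds in the value held by processor i - d, with d doubling each step, so log2(N) steps replace N - 1 serial additions.

#include <stdio.h>
#include <string.h>

#define N 8

void inclusive_scan(int x[N]) {
    int prev[N];
    for (int d = 1; d < N; d *= 2) {
        memcpy(prev, x, sizeof(prev));   /* SIMD step: everyone reads before anyone writes */
        for (int i = d; i < N; i++)
            x[i] = prev[i] + prev[i - d];
    }
}

int main(void) {
    int x[N] = {3, 1, 4, 1, 5, 9, 2, 6};
    inclusive_scan(x);
    for (int i = 0; i < N; i++) printf("%d ", x[i]);  /* running sums: 3 4 8 9 14 23 25 31 */
    printf("\n");
    return 0;
}
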

Coolest conclusion :

When operating on lots of data : The parallelism to be gained by concurrently operating on multiple data elements will be greater than the parallelism to be gained by concurrently executing lines of code.      


Review from cs757

 “Data parallel algorithms” talks about algorithms that efficiently run on a large array of parallel processors and introduces techniques for parallelising seemingly serial algorithms with large input data. The system considered executes the front end processor’s (VAX or Symbolics 3600) ISA plus floating point arithmetic on large data in parallel (supported by an array of Connection Machine processors). The introduction of a flexible network is necessary to make programming/compiling easier (it hides the architecture). By using 2 types of instructions, one that executes on all processors (conditionally, based on a context) and another that specifies the interconnect, this system retains the advantages of SIMD albeit with multiple processors. The authors give implementations for algorithms that are inherently parallel such as the sum of an array of numbers, partial sums on an array, sorting etc. Furthermore, adaptations of serial tasks to make full use of the parallel architecture are explained. For example, parsing of a string by starting with each character in parallel and forming larger tokens by analysing neighbors recursively is explained. In the 80s, processing of pointers was part of most tasks, and the paper explains how this can be done over parallel processors.


The paper provides data on run time measured on the 65536 processor Connection Machine system for certain algorithms. The model used assumes unit time for all instructions including data transfer between processors over the interconnect. Also, resource availability is assumed to be infinite. Parallel algorithms can effectively be mapped to the Connection Machine System. The paper tries to prove that many applications that involve a large enough data set can be efficiently mapped to a parallel system by converting it to a binary tree. The key idea is to look at increasing performance by employing one processor per data element.

The paper is more of a marketing white paper for the Connection Machine system. Several examples given are efficiently executed on a parallel processor architecture. However, a few inherently serial operations are tweaked to fit the parallel architecture; the resulting efficiency is highly questionable, and extra complexity is introduced just to find and fit algorithms that run efficiently on this system. Written in 1986, it discusses one of the first vector-data style processors. Unlike current-day implementations of large vector processors, it uses several small processors that synchronously execute the same instruction on different pieces of data. This idea of looking at computing as operating on groups of data rather than sequential execution, by employing a group of simple processors connected by a flexible interconnect network, still inspires new research like Explicit Data Graph Execution. 



John M. Mellor-Crummey, Michael L. Scott: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. Comput. Syst. 9(1): 21-65 (1991). ACM DL link

review from 757

Modern processors include hardware support, atomic instructions in particular, to help correctness in shared memory systems. As the number of processors accessing a data element increases, so do the wait times and the network traffic, resulting in a decrease in performance. The paper first talks about spin locks typically used by busy-wait type programs. At a bare minimum, with the help of atomic instructions, a lock can be test_and_set by a requesting thread before the shared memory is accessed. However this is not scalable: the test_and_set operation is expensive. To reduce this read-modify-write operation, the lock is first read (tested) before test_and_set-ing it. As no request order is maintained, this method can lead to starvation. Instead, counters indicating the number of access requests and releases can be compared to acquire a “ticket” lock. Backoffs now cause a problem: a waiting thread does not know when to re-check availability, and this directly causes other threads to wait. Array-based queuing locks maintain a FIFO of requests. This eliminates starvation (assuming the required FIFO depth is available) but a central FIFO causes network traffic. The authors propose a new “MCS” lock, where each waiter spins on a locally stored node of the lock's queue; a release, however, must explicitly hand the lock to the next waiter and remove the exiting processor from the list.
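
A sketch of the ticket lock mentioned above, using C11 atomics: one fetch-and-increment hands out tickets and threads spin until the "now serving" counter reaches theirs, so service is FIFO and starvation-free. Not tuned (e.g. no backoff).

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;                                                   /* spin on the shared counter */
}

void ticket_release(ticket_lock_t *l) {
    unsigned next = atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
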



Barriers are essential synchronization constructs. They ensure that threads proceed ahead only when all threads have reached the barrier. The location or data structure that maintains barrier information is shared. The most obvious is a counter that is incremented when each thread reaches the barrier; the last thread to arrive resets it, triggering a go-ahead. This is not scalable, as too many processors will overload the counter. A combining-tree-like structure ensures that each leaf has only a small number of processors updating it. Dissemination barriers work by processors signaling others about their completion, and tournament barriers forward a winner processor up every node. The authors propose a new tree-based barrier where each processor spins on a local copy of the barrier flag that is woken up by a winner processor.



The authors measure the performance of spin locks and barrier algorithms on the BBN Butterfly and the Sequent Symmetry multiprocessor machines with distributed shared memory. At every lock acquisition or barrier, the time taken is saved, averaged out and reported versus the number of processors contending. More than the new MCS lock and the barrier proposed, the paper excels in the criteria that are defined for evaluation.  



When the locks are locally cached, heavy network traffic occurs each time any change occurs, to maintain coherency. Also, all these algorithms are built on top of atomic instructions (notable exception : Lamport's bakery algo). This would also mean strict in-order completion of loads and stores. Lock behavior when precise interrupts or exceptions occur must also be defined. This comprehensive paper talks about the evolution of ideas in locks and barriers till 1991. The criteria and flowchart it suggests for choosing an approach given constraints are its key contribution.

Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, Anoop Gupta: The SPLASH-2 Programs: Characterization and Methodological Considerations. ISCA 1995: 24-36. ACM DL Link



SPLASH-2 programs look at

     computational load balance

     communication to computation ratio and traffic needs

     important working set sizes

     issues related to spatial locality

     how these scale with problem size 

     number of processors


benchmark intro paper : 

     characterize the programs : basic properties, architectural interactions

     user help : which parameter values to choose



Leonardo Dagum and Ramesh Menon, OpenMP: An Industry Standard API for Shared Memory Programming. IEEE Computational Science and Engineering, Jan-Mar, 1998. IEEE Xplore link, Local uw link

Scalable hardware exists : k-ary n-cube, fat tree, multistage interconnect
MPI typical scalable approach. 
zeroth order scalable : no cache coherence, just scalable interconnect network (MPI to sync, coh)

SSMP : scalable Shared memory multiprocessor. (MPI can be built on)
writing code? : not portable!

Why SSMP : fast shared memory locks, directly access mem throughout the system

MPI 
     -- need to explicitly partition data structures. 

pthreads : task parallelism, not data p.

OpenMP 
     compiler primitives + callable runtime library routines
     fork/join execution model.

Control Structure     > parallel, do and single
Data env              > shared, private or reduction (e.g. sum = sum + x), copyin etc.,
synchronization       > Implicit (at the beginning and end of parallel), removed with the nowait clause
                      > Explicit : atomic, flush, barrier, critical
runtime library       > query and lock functions etc.,
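
A small C illustration of the constructs listed above (parallel region, work-shared loop, reduction, nowait plus an explicit barrier); generic OpenMP usage, not from the paper:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double a[1000], sum = 0.0;

        #pragma omp parallel                  /* fork: a team of threads */
        {
            #pragma omp for nowait            /* work-shared loop, implicit barrier removed */
            for (int i = 0; i < 1000; i++)
                a[i] = i * 0.5;

            #pragma omp barrier               /* explicit synchronization instead */

            #pragma omp for reduction(+:sum)  /* private partial sums combined at the end */
            for (int i = 0; i < 1000; i++)
                sum += a[i];
        }                                     /* join */

        printf("sum = %f, threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }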

A.R. Alameldeen and D.A. Wood, "Variability in Architectural Simulations of Multi-threaded Workloads," HPCA 2003. IEEE Xplore


Space Variation 
     two runs exhibit different performance characteristics (OS scheduling decision, lock orders, measurement interval etc.,)
Time variability
     workloads exhibit different performance char over time

Accounting for space v
     confidence interval : +/- 1 sigma
     enuf sample size : n : CLT application
     Hypothesis testing
Accounting for Time V
     start at different points
     ANOVA : analysis of variance (start at same point go to n, n+x n+2x and see if variance changes) and then do some dagajchi
     

L. Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs,” IEEE Transactions on Computers, vol. 28, no. 9, pp. 690-691, Sept. 1979. IEEE Xplore link


Sequential consistency
Requirement R1: Each processor issues memory requests in the order specified by its program
Requirement R2: Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of entering the request on this queue

Anoop Gupta, Wolf-Dietrich Weber: Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Trans. Computers 41(7): 794-810 (1992). IEEE Xplore link

Conclusion : Directory based schemes with 3-4 pointers per entry should work well for executing well-designed || programs.

cache line sizes
    increase > larger invalidations
                 > data traffic goes up
                 > coherence traffic comes down
                 > overall traffic min when line size = 32.
         
Classification of data objects
    Code and read-only data
    Migratory data                                     high proportion of single invalidations
    Mostly-read data                                 small invalidations
    Freq read/written objects                     large inv, eg : number of processors waiting in a global queue
    sync objects                                         locks and barriers
         low contention sync objects             distributed locks, easy to implement, optimal for directory based
         high contention sync objects             

distinguish : large invalidation > write to a line cached in many processors ; frequent invalidation > ...

Effect of Cache line size
large size :
    > better hardware efficiency, prefetching, inc in message traffic (increases min communication granularity between processors). 
    > parallel programs exhibit less spatial locality than sequential programs.
    > false sharing significant. 
    > increase the number of proc sharing a cache line (false sharing) > increase in size of invalidations. 
    > spatiality depends on classes of objects
    > fewer messages of each type (control/data), but size of each data increases. 

Of course : the best case is when the cache line size == the size of the object that's shared. 
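
A hedged C sketch of the false-sharing point above: two counters that are never logically shared but land in the same cache line cause invalidation ping-pong; padding to an assumed 64-byte line separates them:

    #include <pthread.h>

    /* Unpadded: a and b share one cache line, so each write by one thread
       invalidates the line in the other thread's cache (false sharing). */
    struct { long a; long b; } unpadded;

    /* Padded to an assumed 64-byte line: the counters no longer interfere. */
    struct { long a; char pad[56]; long b; } padded;

    void *bump_a(void *arg) { for (long i = 0; i < 100000000; i++) unpadded.a++; return 0; }
    void *bump_b(void *arg) { for (long i = 0; i < 100000000; i++) unpadded.b++; return 0; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, 0, bump_a, 0);
        pthread_create(&t2, 0, bump_b, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;   /* switching the threads to the padded struct removes the ping-pong */
    }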

Sarita V. Adve and Kourosh Gharachorloo. Shared Memory Consistency Models: A Tutorial, IEEE Computer, December 1996, pp. 66-76. IEEE Xplore link

Memory consistency model
     specifies the constraints on the order in which memory operations must appear to be performed with respect to other processors

Sequential consistency
     1) The memory operations of each processor appear in its program order
     2) A single sequential (total) order exists among all processors' memory operations
     3) catch : for processor 0, if a write of A is followed by its own read of A, the read should wait till the write of A has been seen by all the other processors

(1) All cores insert their loads and stores into the memory order <m respecting their program order <p, regardless of whether they are to the same or different addresses (i.e., a==b or a!=b). There are four cases:

     If L(a) <p L(b) ==> L(a) <m L(b)   /* Load -> Load */
     If L(a) <p S(b) ==> L(a) <m S(b)   /* Load -> Store */
     If S(a) <p S(b) ==> S(a) <m S(b)   /* Store -> Store */
     If S(a) <p L(b) ==> S(a) <m L(b)   /* Store -> Load */

(2) Every load gets its value from the last store before it (in global memory order) to the same address:

     Value of L(a) = Value of MAX <m { S(a) | S(a) <m L(a) }, where MAX <m denotes "latest in memory order".

Coherence makes caches invisible; consistency can make shared memory look like a single memory module.

Total Store Order (SPARC)
     A TSO execution requires:

(1) All cores insert their loads and stores into the memory order <m respecting their program order, regardless of whether they are to the same or different addresses (i.e., a==b or a!=b):

     If L(a) <p L(b) ==> L(a) <m L(b)   /* Load -> Load */
     If L(a) <p S(b) ==> L(a) <m S(b)   /* Load -> Store */
     If S(a) <p S(b) ==> S(a) <m S(b)   /* Store -> Store */
     The Store -> Load case is dropped   /* Change 1: this is what enables a FIFO write buffer */

(2) Every load gets its value from the last store before it to the same address, where "before" now also includes the core's own still-buffered stores:

     Value of L(a) = Value of MAX <m { S(a) | S(a) <m L(a) or S(a) <p L(a) }   /* Change 2: need bypassing */
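
The classic store-buffering (Dekker-style) litmus test that separates SC from TSO; a sketch with plain accesses, so actual behaviour also depends on the compiler:

    #include <pthread.h>
    #include <stdio.h>

    volatile int x = 0, y = 0;
    volatile int r1, r2;

    void *p0(void *arg) { x = 1; r1 = y; return 0; }   /* S(x); L(y) */
    void *p1(void *arg) { y = 1; r2 = x; return 0; }   /* S(y); L(x) */

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, 0, p0, 0);
        pthread_create(&t1, 0, p1, 0);
        pthread_join(t0, 0);
        pthread_join(t1, 0);
        /* SC forbids r1 == 0 && r2 == 0; TSO allows it, because each store can
           sit in its core's write buffer while the following load goes ahead.
           A fence between the store and the load restores the SC outcome. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }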

Weak Consistency
     Synchronization operations are sequentially consistent; ordinary accesses between them may be reordered

Release consistency
Lock acquire/release operations do the ordering; two flavors:
     eager, where all coherence actions are performed on release operations, and
     lazy, where all coherence actions are delayed until after a subsequent acquire.
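
A small C11 acquire/release sketch of the idea: the producer's ordinary writes only have to be visible by the time its releasing store is seen, and the consumer only needs them after its acquiring load:

    #include <stdatomic.h>

    int data;                      /* ordinary shared data */
    atomic_int flag = 0;           /* plays the role of the lock/flag */

    void producer(void) {
        data = 42;                                               /* plain write */
        atomic_store_explicit(&flag, 1, memory_order_release);   /* release: prior writes visible */
    }

    void consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)  /* acquire */
            ;
        int v = data;              /* guaranteed to see 42 */
        (void)v;
    }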



Michael Zhang, Krste Asanovic: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. ISCA 2005: 336-345. ACM DL Link

ISCA 05

Problem tackled : in a tiled CMP, what's the best way to organise the L2? 
     Private L2 => - replication | + fast access | - directory (coherence)
     Shared L2 => - network congestion hops. each address has home tile. 

Proposed
     Victim Replication L2VR : 
          Capture evictions from the local primary cache in the local L2 slice. 
          Retained victim is a local L2 replica of a line that already exists in the L2 of the remote home tile. 
          Whats replaced : (1) An invalid line (2) A global line with no sharers (3) An existing replica. 
          - Area overhead : L2 tags must be wide enuf (lines can exist in non homes). 

Related Work 
     Piranha : Compaq's architecture : uses a shared, non-inclusive L2. snoopy bus.
     MIT Raw architecture : claimed non coherent L2 private distributed evenly among 16 processors



Milo M. K. Martin, Mark D. Hill, and David A. Wood, "Token Coherence: Decoupling Performance and Correctness," International Symposium on Computer Architecture (ISCA), June 2003. IEEE Xplore Link

Technology trend
     workload (good thread level parallelism, low latency preferred) > snooping
     Technology trend > directory

+ Fast cache to cache misses
+ No bus like interconnect
     avoid virtual bus ordering
+ Bandwidth efficiency

Token counting > ensures safety
Persistent requests > prevent starvation
Decoupling correctness and performance

Broadcast with direct responses
use unordered interconnect

Need one token to read, all tokens to write.
To avoid data sent with all tokens : owner token (clean/dirty)
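
A toy sketch of the safety invariant (not the paper's implementation): with T tokens per block, a reader needs at least one token and a writer needs all of them, so a writer can never coexist with any other reader or writer:

    #define T 16    /* total tokens per block, e.g. one per cache (assumed) */

    struct block_state {
        int tokens;   /* tokens this cache currently holds for the block */
        int owner;    /* nonzero if it also holds the owner token (clean/dirty) */
    };

    int can_read(const struct block_state *b)  { return b->tokens >= 1; }
    int can_write(const struct block_state *b) { return b->tokens == T; }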

- Non-silent eviction 

Prevent starvation
     invoke after a timeout
     send to all components
     all components remember it in a table and continually redirect all tokens to the requestor
     Deactivate when complete
What if many processors issue persistent requests simultaneously?
     Use starvation-free arbiter
     Single/Banked/Distributed
need to prevent reordering of activate and deactivate messages

- Scalability of persistent requests

Predictor based (reduce broadcasts)
     broadcast-if-shared. 
     Owner
     Groups



Steven L. Scott, Synchronization and Communication in the T3E Multiprocessor, Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996, pp. 26-36. ACM DL Link

Parallelism 
     Explicit : MPI
     Implicit : Shared memory (better suited for irregular/ dynamic par)

Low sync overhead > lots of processors, finer granularity of work split.

Commodity microprocessors inefficient for multiproc
     memory access is cache line based > stride access/vector access pretty bad. 
     Address space limited     (TLB)
     helpful to have non cached memory access (MPI to other proc)
     latency reduction rather than latency toleration

T3D <<<
     Shared address space
     bidirectional 3D torus wiring
     barrier network : 4 wire wide, degree four spanning tree
     remote memory access
          Block transfer engine : bulk, async data transfer between proc memories
          prefetch queue
          DTB annex : extend the memory space outside the processor
     
T3E overview
     PE : DEC 21164 + shell(control + router + local memory)
     self-hosted : Unicos/mk 
     I/O : GigaRing channel
     E registers : shared area
     virtual eureka/barrier network

Global virtual address
     any part can be masked to produce virtual PE 
     Get and put operations to write to E-registers
Atomic memory operations (swap, fetch&inc, fetch&add, compare&swap, masked_swap); a compare&swap-based sketch appears at the end of these notes
Messaging
     SEND command
     message queue control word with tail, limit and threshold to interrupt
Eureka/Barrier synchronization
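
Going back to the E-register atomic memory operations listed above, a generic C11 sketch of how a fetch&add can be built from compare&swap (illustrative only; the T3E performs AMOs in the memory system, not in processor code):

    #include <stdatomic.h>

    /* fetch&add emulated with a compare&swap retry loop */
    int fetch_and_add(atomic_int *addr, int inc) {
        int old = atomic_load(addr);
        while (!atomic_compare_exchange_weak(addr, &old, old + inc))
            ;   /* on failure 'old' is refreshed with the current value; retry */
        return old;
    }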
     

James Laudon, Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. ISCA 1997: 241-251. ACM DL Link

ccNUMA Highly scalable server

Cache-coherent globally addressable memory 
Distributed shared memory <> Directory based protocol

minimize latency between remote and local memory > hardware/software support to ensure most references are local
     also > easy migration for existing SMP software (by keeping the ratio of remote to local latency at a minimum)

Effective page migration and replication
     per-page hardware memory ref counters
     block copy engine (at near peak mem speed)
     reduce TLB update cost

interconnect > multi-level fat-hypercube topology.

sync help > fetch-and-op primitives (MIPS already has load-linked/store-conditional)

Network 
     6x 2x links per router
     wormhole routing
     global arbitration to max utilization under load
     4 virtual channels
     adaptive switch between 2 virtual channels
     CRC checking and retransmission on error

Cache Coherence
     non-blocking
     request forwarding (like DASH)
     silent CEX replacement : clean exclusive state
     network ordering does not matter!
          detect and resolve all ooo deliveries. 
     adaptive routing to deal with congestion
     deadlock
          negative acknowledgements to all requests (DASH does this)
          SGI : backoff message : target of intervention or the list of shares to invalidate. 
          > invalidations + intervention backoff > better forward progress (on heavily loaded systems)

Coherent request buffer >     read and write requests tracked for both processors

Page migration > the requestor's count and the home count are compared; if the difference exceeds a software-programmable threshold register, an interrupt is generated to the home node > the potential migration is handled by the OS.
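
A sketch of that threshold check with hypothetical field names (the real counters are per-page hardware registers and the interrupt is handled by the home node's OS):

    struct page_counters {
        unsigned requester_count;   /* references to this page by the requesting node */
        unsigned home_count;        /* references by the page's current home node */
    };

    /* Nonzero if the hardware would raise the page-migration interrupt. */
    int should_interrupt(const struct page_counters *p, unsigned threshold) {
        return p->requester_count > p->home_count &&
               (p->requester_count - p->home_count) > threshold;
    }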

Directory poisoning 
     Avoid the TLB update problem (shootdown > one processor's action forcing the TLBs to be flushed on other processors is called a TLB shootdown) 
     during read phase > latest copy is written back to memory, directory in Poison state. 
     another processor access > synchronous bus error. 
     error handler > error due to page migration, invalidates the TLB entry for that page. 
     >> low cost to migrate

OTHER architectures
Stanford DASH
     4 processor SMP cluster is one node.
     Origin : 2 proc node with coherence handled by directory-based protocol.
     > + cache to cache sharing within the node.
     > -- intranode sharing significant > more processors required > bus will become huge.
     > -- remote latency
     > -- remote bandwidth = half local bandwidth
     SGI Origin : 2:1 remote to local access.
Convex Exemplar X
     512 proc in 8x4 torus config
     third level cluster cache
     
DSM Systems comparison
     overhead of SMP nodes sets minimum on how effective machines are in small configs
     workload throughput oriented?
     > SMP has lower communication costs between processors within the same SMP node +++
     > parallelism beyond the size of the SMP node is important > latency + bandwidth overhead shall be very large.

Conclusions
     highly modular system
     fat hypercube network high bisection bandwidth, low latency interconnect.
     low latency to both remote memory and local memory
     hw/sw page migration and fast synchronization


Erik Hagersten and Michael Koster, WildFire: A Scalable Path for SMPs, Proc. 5th IEEE Symposium on High-Performance Computer Architecture, January 1999, 172-181. IEEE Xplore link

SUN

Distributed shared-memory (DSM) prototype

Origin is a DSM optimized architecture > reduces remote memory latency

While cc-NUMA has greater potential for scalability, it is less optimal for the access patterns caused by real communication (producer-consumer and migratory data).
OS optimization for capacity miss and conflict miss. 
scheduling is difficult.

SMP : does not care where code/data is placed; a suspended process can be rescheduled on any other proc.

problem : not possible to build SMP with huge number of CPUs spanning several physical boxes. 

DSM protocol : keeps only a handful of nodes coherent. (complexity/latency of nodes reduced by small number of nodes). large node has memory banks > higher bandwidth (interleaving). better node locality. Node is aware of load balancer. 

Wild fire ccNUMA built from unusually large nodes. 

when possible, local memory is used. 
only when a process has more threads than processors, multiple nodes 
> CMR : coherent memory replication (version of S-COMA, simple cache only memory arch). 
   Allocates local "shadow" physical pages. 
   Solaris OS uses integrated hw counters to determine which pages to switch from ccNUMA to CMR. 
   responding to memory access patterns. (adaptive algo)
   -- Memory-resident pages and "large" physical pages cannot be replicated automatically. > can be explicitly replicated.

interface implementation
   NIAC : Network Interface Address controller and NIDC.
   4 NIDC chips controlled by 1NIAC chip (high bandwidth required on the coherent interface).
   NIAC : bus interface and global coherence layer. 

   WFI > 2 translations : local physical CMR address to global physical address and back.

deterministic directory : state of cache and state of directory are always in agreement. 
   > blocking directory (only one outstanding transaction per cache line)
   > three-phase writebacks

Hierarchical affinity scheduling > OS tries to schedule a process first on the processor it last ran. then on some processor on the same node.

>>>>>>>>
SGI Origin : 2 R10000 connected by Hub, Mem + Directory (the 2 processors share the local cache, fast access to remote cache) page migration. 
Wildfire : large number of SMP nodes connected to distributed directory network
NUMA-Q node : i dont know. 

>>>>>>>>

created locality : doing something other than just responding to accesses, in order to create locality 

Coherent Memory Replication and Hierarchical Affinity Scheduling control are implemented as daemon processes (OS)




Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing, Proc. International Symposium on Computer Architecture, June 2000, pp. 282-293. ACM DL link



lots of simple in-order Alpha cores on one chip

commercial workloads : large memory stall component. data-dependent nature, lack of ILP. little use for floating-point and multimedia functionality.

SMT > superior single-thread performance, but needs very wide-issue processors which are more complex to design
CMP > simpler processor cores at a potential loss in single-thread performance. 

Piranha : extremely simple processor : single-issue, in-order, eight-stage pipeline. 

processing node : 8 CPUs, intra-chip switch, system control, home and remote engines, packet switch router and L2 -> memory.
IO chips : same as above, just 1 CPU inside

Intrachip switch : back-to-back transfers without dead cycles. To reduce latency, modules issue the target destinations of future requests (hints) > pre-allocate datapaths and speculatively assert the requester's grant signal. (high priority and low priority). 
fully ordered.

L2 
     8 banks, 8 way set associative. 
     duplicate copy of L1 tags and state at the L2 controllers.
     L2 is a very large victim cache. 
     in case of multiple sharers write back happens only when owner of data replaces it.
     full map centralized directory-based. 
     
home engine
     exports memory whose home is the local node; the remote engine handles imports. 
     
directory storage
     limited pointers 
     coarse vector. 

inter node coherence protocol
     invalidation based directory
     does not depend on point to point ordering => adaptive routing. 
     no NAKs (no protocol deadlocks/network deadlocks)
     no livelock and starvation.

hot potato routing : as age increases, so does priority.
bound the number of messages that can be inserted as a result of a single request.
cruise-missile-invalidates 

inband clock distribution/S-connect. 


Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, pp. 21-29, Mar./Apr. 2005. IEEE Xplore

Commercial server applications (databases and Web services) have huge TLP. 

60 W of power

Data center racks limited by power supply envelope. supplying power and dissipating generated heat

32 way threaded > OS layer abstracting away hardware sharing. 

performance metric : sustained throughput of client requests.

good performance per watt. 

commercial server applications : low ILP : large working sets and poor locality of reference on memory access. data dependent branch very difficult to predict. 

Niagara : simple cores aggregated on a single die shared on chip cache and high bandwidth off-chip memory. 

Pipeline : 4 threads into a thread group > shares the processing pipeline > Sparc pipe. 8 such groups =  32 
crossbar interconnect provides communication link between Sparc pipes, L2 cache and shared resources. (also the point of memory ordering). 

each thread : unique set of registers, instr and store buffers. 
share : L1 caches, TLBs, exec units, most pipeline registers. (critical path by 64 entry fully associative ITLB). 




Register windows
     each thread has 8 register windows. (4 such files for 4 threads)
     set of registers visible to a thread is the working set (register file cells)
     complete set : architectural (SRAM)
          copy happens when window is changed. also the thread issue is stopped. cool!
     
Memory 
     simple cache coherence with write through policy. 
     10 % miss rate. 
     allocate on load, no-allocate on store. 
     stores do not update the local caches until they have updated the L2 cache. 
          copy-back policy, writing back dirty evicts and dropping clean evicts. 
------

Hardware multithreading primer
     Single issue single thread machine 
     Coarse grained multithreading / switch on event multithreading > thread has full use of CPU until long-latency event such as DRAM miss occurs. 
     Fine grained multithreading / interleaved multithreading > cycle boundary.

     fine-grained is also called time-sliced or vertical multithreading; SMT issues from multiple threads in the same cycle
     Chip multiprocessing > each processor executes one thread. 


Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA TESLA: A Unified Graphics and Computing Architecture", IEEE Micro Volume 28, Issue 2, Date: March-April 2008, Pages: 39-55. IEEE Xplore

Compute Unified Device Architecture (CUDA) parallel programming model and development tools

Tesla = unified graphics and computing architecture

Old GeForce : 
     vertex transform
     lighting processor
     fixed-function integer pixel fragment pipeline
     (OpenGL DirectX)
     Vertex shaders, floating point fragment pipeline
Vertex processors 
     > operate on vertices of primitives such as points, lines, triangles
     > transform : coordinates to screen space
     > lighting and texture parameters 
     > output to setup unit and rasterizer (convert lines to dots/pixels)
     > low-latency, high precision math operations
     > programmable typically
Pixel fragment processors
     > interior of primitives/interpolation
     > high latency lower precision texture filtering
     > more pixels than vertices => more of this
Tesla > Vertex + Pixel into one unified processor architecture.

scalable processor array 
     
memory > DRAM + fixed-function raster operation processors (ROPs) that perform color and depth frame-buffer operations directly in memory

Command processing
     > interface to CPU > responds to commands from CPU, fetches data from system memory, checks command consistency and performs context switching
     > input assembler > collects geometric primitives and fetches associated vertex input attribute data.
     > execute vertex/geometry/pixel shader/compute




barrier synchronization

SIMT (multiple thread)
unit creates, manages, schedules and executes threads in groups of 32 || threads called warps. 
each thread executes independently. individual threads can be inactive due to independent branching or predication
full potential > if all threads take same exec path.

SIMD/SIMT
     SIMT : applies one instr to multiple independent threads in ||, not just to multiple data lanes > controls the exec and branching behavior
     SIMD : controls a vector of multiple data lanes. 

warp : 32 threads of the same type (vertex/g/p/c)

Implementing zero overhead > scoreboard qualifies each warp for issue each cycle. instr scheduler prioritizes all ready warps... fairness

intermediate languages use virtual registers, optimizer analyzes data dependencies, allocates real registers, eliminate dead code, folds instr together, optimizes SIMT branch divergence and convergence points. 

Memory access 
     > local/shared (SM)/global
     > coalescing memory over separate requests. 

Texture unit input > texture coordinates outputs > filtered samples, typically four-component (RGBA) color. 
Rasterization > viewport > clip > setup > raster > zcull > block. 

GPU ^
------------------------------------------- 
Parallel computing Architecture

Throughput applications
     extensive data parallelism
     modest task ||
     intensive floating point arithmetic
     latency tolerance (perf is amount of work done)
     streaming data flow (relatively low data reuse)
     modest inter-thread sync

Difference for GPU computing : requires that || threads synchronize, communicate, share data and cooperate > thread block : Cooperative thread array (CTA)
thread : computes result elements selected by its TID
CTA : computes result blocks selected by CTA ID
grid computes result blocks and sequential grids compute seq dep application steps.
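
A plain-C sketch of that decomposition, with the CTA id selecting a block of the result and each thread id selecting an element inside it (this only emulates the indexing; on the GPU every (cta, tid) pair runs as a hardware thread):

    #define CTA_SIZE 256   /* threads per cooperative thread array (illustrative) */

    /* One "thread" of a grid computing c = a + b over n elements. */
    void thread_body(int cta, int tid, const float *a, const float *b, float *c, int n) {
        int i = cta * CTA_SIZE + tid;   /* element selected by (CTA id, thread id) */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* The grid: these loops are what the hardware runs in parallel. */
    void grid_add(const float *a, const float *b, float *c, int n) {
        int ncta = (n + CTA_SIZE - 1) / CTA_SIZE;
        for (int cta = 0; cta < ncta; cta++)
            for (int tid = 0; tid < CTA_SIZE; tid++)
                thread_body(cta, tid, a, b, c, n);
    }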

relaxed memory order : preserves order of reads and writes to the same memory address.
Scaling required. 

Scalability : varying number of SMs, TPCs, ROPs, caches and memory partitions.
future development : scheduling, load-balancing, enhanced scalability for derivative products, reduced sync and comm overhead, new graphics features, increased mem bandwidth and power efficiency. 

     

Charles E. Leiserson, et al., The Network Architecture of the Connection Machine CM-5, Proc. ACM Symposium on Parallel Algorithms and Architectures, June 1992, pp. 272-295. ACM DL link


Thinking Machines Corp
1996

economy of mechanism > machine should have only 1 comm network to convey info. 
CM 5 has 3 networks

Synchronized MIMD machine, no shared memory; procs communicate via message passing
data network > send messages
control network > sync and multiparty comm primitives

32-16384 32 MHz SPARC procs, control procs : Sun Micro work station front ends

processors can be split into user partitions, privileged or non. users time muxed. low os overhead to user task communication.

Network interface
     + simple and uniform view of the network
     + support for time sharing, space sharing and mapping out of failed components
     + decouples design decisions made for the networks
     > memory mapped registers on protected pages << MMU takes care of privilege.
     > Context switching : automatic checkpointing of user tasks. 
     > user's view of the networks is independent of network topology.

CM-5 Data Network
     balancing message loads > fetch deadlock problem
     fat-tree. user partition = subtree in network.
     route to least common ancestor of src and dest. pseudo random choice at each level > load balancing
     differential pair comm > noise immunity and reduced overall power requirements.
     similar to cut-through/wormhole routing
     >>> data network is bound by a contract with the processors to guarantee that deadlock never occurs. 
          typically > reservation mechanism > a max no of messages are outstanding between 2 processors.
               substantial overhead
          CM-5 > left port and right port (virtual channels)
               send the request on the left port always (response can be on any port)
     good efficiency as user controls directly

     all-fall-down mode : user time slice over : just drop the messages down the tree to any node; when the user resumes, that node forwards them to the actual destination

CM-5 Control Network
     split-phase barrier mechanism for synchronization
     broadcasting
          collision > error
     combining
          reduction, forward scan (parallel prefix), backward scan (parallel suffix), router done. 
          first 3 : bitwise logical OR, XOR, signed max, max, addition etc.,
          router done : when data messages have completed.
     async OR of all procs.
     
     implemented as binary tree. 
     message sent up, broadcast to all nodes in that partition.

CM-5 diagnostic Network
     user does not know this exists.
     detects program functionality dependent and independent (DFT)
     JTAG DFT is connected on back plane on diagnostic network > geographical address (cabinets, backplane, slot type slot etc., and network address)

Dally and Towles, "Route packets, not wires: on-chip interconnection networks" , DAC 2001. IEEE Xplore link, http://cva.stanford.edu/publications/2001/onchip_dac01.pdf

DAC 2001

Use something like PCI on chip : all modules talk to each other through a standardized interface, and the network routes packets to destinations. 

Suggested network
     input port : 256 bit data + control field. 
          Type (Flow control digit or FLIT) : 2 bits : new packet (head), tail, middle or idle.  
          Size (4 bits) log of data size.
          Virtual Channel (8 bits) bit mask.
          Route (16 bits) : encoded as 2 bits for each hop till the destination. (NEWS).
          Ready : a signal from the network back to the client indicating network is ready. 

     Output ports

     Registers : ex : time-slot reserv for certain classes of traffic. 

Higher level protocols can be piggybacked on this. 
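
A rough C encoding of the input-port control fields listed above (widths taken from the notes; the layout is illustrative, not the paper's exact interface):

    #include <stdint.h>

    struct flit {
        unsigned type : 2;    /* flow control digit type: head, middle, tail or idle */
        unsigned size : 4;    /* log of the data size */
        uint8_t  vc;          /* 8-bit virtual-channel bit mask */
        uint16_t route;       /* 2 bits per hop: N/E/W/S */
        uint64_t data[4];     /* 256-bit payload */
    };
    /* 'ready' is not part of the flit: it is a signal from the network back to the client. */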

Router : 5 input controllers (NEWS + self), 5 output controllers. 

Fault tolerance : can find out which routers are crappy. and adapt accordingly. 

Pre-scheduled and dynamic traffic. (using virtual channels)

Differences between intra and interchip networks : 
1) lots of pins available. => wide flits possible. folded torus topology is employed. (2x wire demand, 2x bandwidth demand).
2) reduced buffer storage requirements.
3) Specialized transceivers possible.

Area overhead exists (for routers) but 
1) predictable electrical parameters => tweakable
2) reuse with universal interface
3) reuse to network
4) network => duty factor of wires increased.

Structural, performance and modularity advantages if you replace top-level wiring with on-chip interconnection networks.  


Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, David Webb, "The Alpha 21364 Network Architecture," IEEE Micro, vol. 22, no. 1, pp. 26-35, Jan./Feb. 2002. IEEE Xplore link
152 mil transistors 22.4Gbytes/s router bandwidth.

paper talks about the router : 1.2GHz (same speed as the processor)
    using aggressive routing algos, distributed arbiters, large on-chip buffering and a pipelined router implementation
    support for directory-based cache coherence (virtual packet classes)
    
Classes : 
    flit > packet portion transported in parallel on a single clock edge
    Request
    forward
    block response (with data)
    nonblock response (ack)
    write I/O
    Read I/O
    special (no-op packets), buffer dealloc info between routers etc.,

2D torus

Virtual cut through routing
    flits of a packet proceed through multiple routers until a router blocks the header flit. 
    blocking router buffers all the packet's flits until congestion clears
    blocking router schedules the packet to destination. [buffer space for 316 packets]

Adaptive routing 
    minimum rectangle (diagonal distance)
    either continue on the same direction or turn. 
Deadlock avoidance
    because of cyclic dependences
    eg : request packets can fill up a network and prevent block response packets
    > solve : virtual channels with priority. 
    Adaptive routing can create intra-dimension and inter-dimension deadlock
    Jose Duato's theory : as long as packets can drain via a deadlock free path, no deadlock will occur. 
      intra-dimension :     VC0 and VC1 (non-adaptive, horizontal first, vertical next) on the primary axis depend on the secondary axis, but not vice versa. when packets change dimension, they recompute their virtual channels in the new dimension. 
    Dally scheme : incremental number of all processors in a dimension : if source's number is < destination : VC0 else VC1.
         but VC0, VC1 not well balanced.
    VC0, VC1 can return to adaptive if a subsequent router is not blocked. 

Router Architecture
    input and output ports : local (cache and memory controller), interprocessor (off chip network), I/O.
    clock forwarding
    
Local arbiters : various readiness tests to determine if a packet can be speculatively dispatched via the router. 
         valid input, output port is free, next buffer has a free buffer, target output port is free, anti-starvation mechanism, read I/O packet does not pass write I/O packet.

fairness. : coherence dependence priority mode and rotary rule. 

John Kim, James Balfour, and William J Dally. Flattened butterfly topology for on-chip networks. In MICRO 40: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007. ACM DL Link

Most OCNs are low radix (each router connects one node), 2-D mesh : RAW processor, TRIPS, Intel's 80-node Teraflops chip and Tilera 64

Off chip networks : cost increases with channel count; channel count increases with hop count
On chip networks : bandwidth is plentiful but buffers are expensive; reducing the diameter of the OCN lowers power consumption
                              concentration is practical (reduces wiring complexity)

Latency on chip 
     Header latency + serialization latency + time of flight on wires
     Hop count * router delay + packet size / channel bandwidth + time of flight on wires
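
Plugging in made-up numbers just to exercise the formula: with 4 hops, a 3-cycle router, a 512-bit packet on 128-bit/cycle channels and 2 cycles of wire flight, latency = 4*3 + 512/128 + 2 = 18 cycles; halving the hop count (which is what the flattened butterfly's higher radix buys) saves 6 of those cycles.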

Butterfly 
     topology
          on a 2D layout, all routers in a column and row are connected. (direct connections to nth router in a row allowed)
          long wire impact : reduced by optimally inserting repeaters and pipeline register to preserve channel bandwidth > but traversal takes several cycles
     routing and deadlock
          VCs may be needed to prevent deadlock
          Dimension Order routing (go right first then up/down)
          non minimal routing allows path diversity available on flattened butterfly to improve load balance and performance.
          UGAL : 2 steps, minimal to an intermediate node and then minimal to destination
     Bypass channels 
          router bypass arch
          mux arbiter
               direct input is given priority at input of the mux
               bypass channel is given priority at output of the mux. 
          switch architecture
          flow control and routing
Paper has cool numbers about power and area of flattened butterfly high radix networks : 
     Power = Channel + Router + Memory
     Area (flattened butterfly) smaller < due to high radix, several connections possible between 2 nodes => bisection bandwidth constant => thinner wires => lower power
     Smaller buffers required : wormhole : Virtual channel to prevent deadlock. Because of multiple paths : a packet can block another packet from the same source and destination.

Future
     optical signalling
     virtual channels to break deadlock
     ideal latency ?
     flit-reservation flow control 

J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. A Scalable Approach to Thread-Level Speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000. ACM DL link

parallelizing non-numeric and irregular numeric codes is difficult (complex control flow and memory access patterns)

TLS : auto parallelize code 
Epochs : time-stamped with an epoch number (ordering)
Track data dependency between epochs           
homefree token :  when guaranteed no violations, this epoch can commit.

Objectives
     large-scale parallel machine (single-chip MP or SMT) should seamlessly perform TLS across the entire machine > communication is the difficulty
     no recompilation 

Detect data dependence violation at run time > leverage invalidation based cache coherence.
     if an invalidation arrives from a logically-earlier epoch for a line that we have speculatively loaded > bonkers.

Speculation level > Cache at which speculation occurs

whats required
     (i) notion of whether a cache line has been speculatively loaded and/or modified
     (ii) guarantee that a speculative cache line will not be propagated to regular memory
          spec fails if the cache line is replaced!
     (iii) ordering of all spec memory references (epoch numbers and homefree token)

Hardware 
Cache states
     Apart from Dirty, Shared, Exclusive and Invalid : speculatively loaded and speculatively modified. 
     no kicking out till that epoch becomes homefree. (if eviction is unavoidable, speculation fails)
Messages 
     read-exclusive-speculative, invalidation speculative, upgrade request speculative 
     + epoch number of the requester 
     > only hints, no real need to oblige.

when speculation succeeds
     instead of scanning all lines and changing spec states to normal
     Ownership required buffer > when a line becomes both speculatively modified and shared.
     when home free token arrives, generate upgrade request  for each entry in ORB. 

Optimizations
     Forwarding data between epochs : have wait-signal synchronization
     Dirty and spec loaded state : anyway speculated dirty is never evicted, so if you load and then modify, just store it as DSpL state
     suspend the epochs that have violations (resume when u get homefree token)
     Support for multiple writers : combine results (ah!) fine grained SM bits (bytes/words in cache line)

support in SMT machines
     (i) two epochs may not modify the same line
     (ii) t- epoch cannot see t+ modification
      

Conclusions
     8-75% speedups! 
     ORB overhead low (used to commit speculative modifications faster) 


Ravi Rajwar, James R. Goodman: Speculative lock elision: enabling highly concurrent multithreaded execution. MICRO 2001: 294-305. ACM DL link

Speculative Lock Elision

Try to eliminate locks because they cause delays and serialize parallel code. Instead, detect and recover from critical-section sharing violations. 

Claims 
+ enables highly concurrent multithreaded execution
+ simplifies coding (conservatively/randomly put locks, dude method will remove them anyway)
+ Can be implemented easily (store squashing with hardware, Coherence protocol : detection)

Attacks : 
Conservative locking
locking granularity
Old thread unsafe libraries. 



Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill and David A. Wood. LogTM: Log-based Transactional Memory. HPCA 2006. IEEE Xplore link


Version Management
>> Eager : in place on store
>> Lazy  : temporarily leave old value in place 

Conflict detection
? overlap between the write set of trans 1 with the write/read sets of trans 2,3,4...
>> Eager : detect offending loads/stores immediately
>> Lazy   : when transactions commit.

Lazy, Lazy : OCC DBMSs, Stanford TCC
Lazy conflict, Eager Version : None (zombies)
Eager, Lazy : MIT LTM, Intel/Brown VTM (conflict using cache conflicts)
Eager Eager : MIT UTM, CCC (conservative concurrency control) DBMSs, LogTM

Log TM

Version management
>> per-thread before image log in cacheable virtual memory  
>> Last in first out undo.
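
A toy C sketch of eager versioning with an undo log (illustrative; LogTM does this in hardware, logging the before-image of a block on the first transactional store to it):

    #include <string.h>

    #define LOG_MAX 1024

    struct log_entry { void *addr; unsigned char old[8]; int len; };  /* len assumed <= 8 */
    struct tx_log    { struct log_entry e[LOG_MAX]; int top; };

    /* Eager versioning: save the before-image, then write the new value in place. */
    void tx_store(struct tx_log *log, void *addr, const void *newval, int len) {
        struct log_entry *ent = &log->e[log->top++];
        ent->addr = addr;
        ent->len  = len;
        memcpy(ent->old, addr, len);   /* before-image into the log */
        memcpy(addr, newval, len);     /* update memory in place */
    }

    /* Commit is cheap: just discard the log. */
    void tx_commit(struct tx_log *log) { log->top = 0; }

    /* Abort: walk the log last-in-first-out, restoring before-images. */
    void tx_abort(struct tx_log *log) {
        while (log->top > 0) {
            struct log_entry *ent = &log->e[--log->top];
            memcpy(ent->addr, ent->old, ent->len);
        }
    }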

Conflict detection
>> coherence request sent to the directory
>> Directory responds and possibly forwards the request to one or more procs
>> each responding processor examines some local state to detect a conflict
>> the responding processors ack (no conflict) or nack (conflict) the req
>> the requesting processor resolves any conflict
??  What if the cache block is evicted?     
>>> directory still forwards to all "appropriate processors"
>>> Go from M to sticky M [on overflow bit set, check if even sticky M violation happened] 
>>> Lazy cleaning of sticky state (when overflow is not set and its sticky => it was sticky sometime ago, not now) 


From Prof Hill's talk in MSR :       
What happens when conflict happens ?
     wait >> could lead to deadlock
     abort >> could lead to livelock
     possible soln >> contention manager (requesting processor traps to software contention manager)



Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, Andreas Moshovos: Mechanisms for store-wait-free multiprocessors. ISCA 2007: 266-277. ACM DL link

2007

This is a cool paper
eliminates capacity stalls > scalable store buffer (SSB); ordering stalls > atomic sequence ordering (ASO)

Store stalls occur in hardware that supports pretty much all consistency models
Why?
Strongly-ordered
     retirement must stall at first load following a store miss. 
     store buffer to L1D, but useless  as few stores occur between consecutive loads.
     store prefetching and speculative load exec improve performance
     store stalls account for 1/3rd of exec time
models that relax store-to-load order (TSO, processor consistency)
     buffer all retired store
     loads bypass these buffered. 
     CAM to associatively search 
     store coalescing > pretty useless
     RMW and memory fence > drain store buffer < stall till
models that relax all order
     (weak ordering, release consistency, SPARC relaxed memory order (RMO), intel IA-64, HP alpha, IBM PowerPC)
     no store buffer pressure. freely coalesced stores
     memory sync > store buffer has to drain.
     thread sync > ordering delay (unnecessary when no data race actually occurs)

RMW
     
Scalable Store Buffering << 
     closes gap between TSO and RMO
     Conventional store buffer
          forwards values to matching loads
          maintain order among stores for memory consistency
          >> CAMs are used
     Total Store Order Buffer (TSOB)
          no value forwarding
          >> FIFO
          If 2 processors update a cache block L1 may be partially invalidated. 
               CPU private writes (uncommitted stores still in TSOB) must be preserved
               merge uncommitted stores with any remote updates. 
               reloading affected line from memory system and replaying uncommitted stores.

Atomic Sequence Ordering (ASO)
     dynamically groups accesses into atomic sequences
     coarse grain enforcement of ordering constraints
     relaxes ordering among individual accesses. 
          algo : If (ordering required by processor)
               (a) checkpoint, speculatively retire all stores, RMWs, memory fences. > write permission for entire set
               (b) hardware creates new checkpoint and new atomic sequence
               (c) when all write permissions arrive, the first atomic sequence transitions to commit (hold W till complete commit)
          if another proc writes (data race)
               roll back to start of the sequence. 
          complete at least one store before initiating another sequence. 
     Implementation
          (1) must detect when a read in an atomic sequence returns violation
          (2) when this occurs check point recovery
          (3) buffer all of the seq's write and atomically commit or discard (use SSB)

Hardware Implementation
     SSB
          TSOB > address, value and sizes of all stores in RAM structure
          L1 with sub-blocked valid bits.
          16 entry victim buffer for store conflicts
          Invalidations > TSOB entries are replayed to update the affected line
     ASO
          Atomicity violations : Speculatively read bits 

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Proc. 23rd Annual International Symposium on Computer Architecture, May 1996, pp. 191-202. ACM DL link



SMT
     permits multiple independent threads to issue multiple instructions each cycle to a super scalar processor's functional units. 
     combines the multiple-instruction-issue feature of superscalars with the latency-hiding ability of multithreaded archs
    
conventional multithreaded : fast context switching to share processor exec resources. (SMT : dynamic sharing)

impediments to processor utilization : long latencies and limited per-thread parallelism. 

This paper : 
    throughput gains of SMT are possible without extensive changes to a wide-issue superscalar proc.
    SMT need not compromise on single-thread performance. 
    analyze and relieve bottlenecks
    advantage : choose best instr from all threads to fetch and issue in each cycle. 

Changes needed to support SMT in hardware 
    multiple PCs, choose logic
    separate return stacks
    per-thread instr retirement, instr queue flush, trap mechanisms
    thread id
    large register file : logical registers for all threads + additional registers for reg renaming
          this determines the number of instr that can be in the processor between the rename stage and the commit stage. 

Fetch unit partitioning
    frequency of branches, misalignment of branch destinations make it difficult to fill entire fetch bandwidth from one thread.

    RR.1.8 (round robin, 1 thread, 8 instr)
         indistinguishable from super scalar
    RR.2.4, RR.4.2
         cache banks are single ported > bank conflict logic is needed. 
         RR.2.4 : two cache outputs, 4.2 has 4. 
         >> address mux, multiple address buses, selection logic on the output drivers, bank conflict logic, multiple hit/miss calculations. access time affected. 
    RR.2.8 
         ideal.

Thread choice : how to choose which thread to issue from :
    IQ clog : too many instr that block for a long time, IQ is filled. 
    Size of IQ limited by search 
    Algos to choose 
    BRCOUNT : favor threads with fewest unresolved branches. 
    MISSCOUNT : fewest outstanding D cache misses
    ICOUNT : threads with fewest instr in decode, rename and IQs. 
         (1) prevents one thread from filling IQ, 
         (2) highest priority to threads moving instr through IQ most. 
         (3) even mix of instr, maximizing parallelism
     IQPOSN : lowest priority to instr closest to the head of either the integer or floating point IQs (the oldest instr)

    instruction counting provides the best gain. 
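
A sketch of the ICOUNT choice (illustrative only): each cycle, fetch from the ready thread with the fewest instructions sitting in decode, rename and the IQs:

    #define NTHREADS 8

    /* in_flight[t] = instrs of thread t currently in decode, rename or the IQs */
    int icount_pick(const int in_flight[NTHREADS], const int ready[NTHREADS]) {
        int best = -1;
        for (int t = 0; t < NTHREADS; t++) {
            if (!ready[t]) continue;                 /* e.g. waiting on an I-cache miss */
            if (best < 0 || in_flight[t] < in_flight[best])
                best = t;
        }
        return best;   /* -1 if no thread can fetch this cycle */
    }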

Unblocking the fetch unit
    BIGQ : increase IQ size (without area penalty : search only first 32 items)
    ITAG : pre-search the tags to find misses early (more ports required)

wasted instr : wrong-path instr (branch mispred), optimistically issued load-dependent instr. 

bottlenecks : 
    Issue bandwidth x
    IQ size               x
     Fetch bandwidth > branch frequency and PC alignment problems
    Branch prediction y
    Speculative Execution y
    Memory Throughput > SMT hides memory latency but not memory throughput. 
    Register file size
          Fetch throughput is the bottleneck. 




Mark D. Hill, Michael R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, pp. 33-38, July 2008. IEEE Xplore link


Base core equivalents : 1 BCE
Chip area/power limitations == n BCEs can be active on a chip
If you combine r BCEs into one core, lets say we get a single-thread performance of perf(r) < r.

Amdahl's law : Speedup (f, S) = 1 / [ (1-f) + f/S ] 

Symmetric multicore chips : S(f, n, r) = 1 / ( (1-f)/perf(r) + f*r/(perf(r)*n) )
typically perf(r) = sqrt(r)
Asymmetric multicore : S(f, n, r) = 1 / ( (1-f)/perf(r) + f/(perf(r) + n - r) )

Dynamic multicore : S(f, n, r) = 1 / ( (1-f)/perf(r) + f/n )
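
A small C transcription of the three formulas (with the perf(r) = sqrt(r) assumption above), handy for reproducing the curves; the example parameters are arbitrary:

    #include <math.h>
    #include <stdio.h>

    static double perf(double r) { return sqrt(r); }   /* assumed perf(r) */

    double symmetric(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }
    double asymmetric(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }
    double dynamic(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / n);
    }

    int main(void) {
        double f = 0.975, n = 256, r = 16;   /* arbitrary example point */
        printf("sym %.1f  asym %.1f  dyn %.1f\n",
               symmetric(f, n, r), asymmetric(f, n, r), dynamic(f, n, r));
        return 0;
    }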