Disclaimer
These are notes I made from papers in the reading list. At times I might have included screenshots of figures; please shoot me an email if you want those taken down, though you can download a free copy of almost all of these papers (scholar.google.com). Also, it's all my take on the papers, and depends on my mood, level of understanding, etc. These are not reviews, just points I thought were important.
Feel free to copy.
Uploaded : Feb, 2012


J. E. Smith and A. R. Pleszkun. Implementing Precise Interrupts in Pipelined Processors, IEEE Trans. on Computers, May 1988, pp. 562-573. ACM DL Link


Precise Interrupts

1) All instructions preceding the instruction indicated by the saved program counter have been executed and have modified the processor state correctly

2) All instr following the instruction indicated by the saved program counter are unexecuted and have not modified the processor state

3) Saved program counter points to the interrupted instruction. 


Classification of interrupts

1) program interrupts

2) external interrupts. 


I/O timer interrupts (for restarting), virtual memory page faults, software debugging, arithmetic exceptions, unimplemented opcodes. Virtual machines can be implemented if privileged instruction faults cause precise interrupts. 





Joseph A. Fisher and B. Ramakrishna Rau. Instruction-Level Parallel Processing, Science, 13 September 1991, pp. 1233-1241. Science Link

1991

For ILP 
  1. The dependences between operations must be determined
  2. The operations that are independent of any other operation that has not yet completed must be determined
  3. These independent operations must be scheduled to execute at some particular time, on some specific funct unit, and must be assigned a register in which the result may be deposited. 


Hardware and Software Techniques for ILP Exec
HW
Pipelining / multiple functional units 
+ ILP increase 
-  interconn network/reg file cost goes up as the square of the number of funct units.
Superpipelining > eg : int ALU is pipelined. 
Superscalar : multiple instruction issue of seq programs
Horizontal microprogramming > 
Compiler decides
VLIW issues
 
Problem : small size of basic blocks
Dynamic schemes for speculative execution
terminate unnec speculative computation once the branch has been resolved
undo the effects of operations that should not have been executed
ensure no exceptions from squashed speculative operations are reported
preserve enuf exec state to resume
 
Compile time > trace scheduling
 
Branch prediction to guide speculative scheduling
gather execution statistics
compile time profiling (Trace sched/superblock sched)
Dynamic branch prediction ++
Predicated execution : to selectively squash. 
 
ILP Compilation
Scheduling > NP complete problem
Only a problem on VLIW kind processors (non sequential)
Local Scheduling
a scheduling barrier is assumed to exist between adjacent basic blocks in the control flow graph
(Local code compaction)
+ computationally inexpensive (one-pass)
+ near optimal
 
Global Acyclic Scheduling
Trace Sched
the compiler can often improve program performance by rearranging its generated machine instructions for faster execution. Trace scheduling is one of many known techniques for doing so.

Trace scheduling was originally developed for Very Long Instruction Word, or VLIW machines, and is a form of global code motion. It works by converting a loop into a long straight-line code sequence using loop unrolling and static branch prediction. This process separates out "unlikely" code and adds handlers for exits from the trace. The goal is to have the most common case executed as a sequential set of instructions without branches.

Superblock Sched 

A simplified form of trace scheduling which does not attempt to merge control flow paths at trace "side entrances". Instead, code can be implemented by more than one schedule, vastly simplifying the code generator.
Cyclic Scheduling 
Software pipelining
Take loops, unroll them, find repeating patterns and re-roll them. Modulo constraint : the initiation interval is limited by recurrences or resource constraints. 
>> wiki : Instruction scheduling
Register Allocation
register allocation first?
+ adequate registers more important than optimizing schedule. 
+ shorter schedules > higher register pressure. 
- Antidependences and output dependences 
prepass scheduling (sched before reg allocation)
graph coloring
best : schedule first, register allocate, schedule again. 
 


T-Y. Yeh and Y. Patt, Two-level Adaptive Training Branch Prediction, Proc. 24th Annual International Symposium on Microarchitecture, November 1991, pp. 51-61. ACM DL Link


Gurindar S. Sohi, Scott E. Breach, and T.N. Vijaykumar, Multiscalar Processors, Proc. 22nd Annual Symposium on Computer Architecture, June 1995, pp. 414-425. ACM DL Link


Static program > Control Flow Graph. 
Dynamic program execution > walking thru the prog CFG, generating a dyn seq of basic blocks. 

Speculation : To achieve high ILP, loads are speculated (if stored by a predecessor task). 
1) control speculation
2) data

Multiscalar programs 
The actual code
details of the structure of the CFG
communication characteristics of individual codes 
task descriptor : what registers a task may produce (create mask) : conservative. 
operate-and-forward instruction : one task done, can forward results to following processing units. 
dead memory value analysis to release registers/memory for future tasks

Hardware required :
Scheduler
Address Resolution Buffer : before data cache : holds speculative reads/writes
 
What happens to cycles ?
Non-Useful compute cycles 
Synchronization and data communication
Early validation of prediction required. 
No computation cycles
Inter and intra task dependencies
load balancing
 
 
How multiscalar differs ?
Branch prediction accuracy does not limit performance, because the compiler has an idea of the CFG; almost 100% prediction accuracy can be achieved for inter-task sequencing (forget inside-task prediction). 
n instructions => n^2 complexity to perform dependence cross checks (not quite necessary across tasks)
load stores can be whatever memory model inside tasks. 
 
vs multithreaded : multiple threads or loci of control which are control independent and (typically) data independent. 
Tasks executing on a multiscalar proc are related as different parts of a sequential walk through the same program, and are not control and data independent.

granularity of tasks matter

Summary
exploiting fine-grain ILP
combination of hw/sw
dividing the program CFG into tasks, stepping thru the CFG speculatively, taking large steps, a task at a time, without pausing to inspect the contents of a task. 
processor complex uses multiple PC to sequence through different parts of the CFG simultaneously.
 




Subbarao Palacharla, Norman P. Jouppi and J. E. Smith. Complexity-effective Superscalar Processors, Proc. 24th Annual International Symposium on Computer Architecture, June 1997, pp. 206-218.



paper looks at complexity (in delay of critical path)

logic associated with issue window and data bypasses cause max delay

R10000 and DEC 21264 : register commit, active list
PowerPC, HP PA-8000, HAL SPARC 64 : rob holds non-committed renamed register values. 
> diff : size of register file.

IW == issue width

checkpointing rename logic 
     RAM : logical reg to phy reg, direct lookup 
          Trename = Tdecode + Twordline + Tbitline + Tsenseamp
          Tdecode, Twordline, Tbitline = c0 + c1 x IW + c2 x IW^2
     CAM : one entry per phy reg holding the logical reg it maps; rename is an associative search 

Wakeup logic
     issue window > CAM array holding one instr per entry. 
     Delay = Ttagdrive + Ttagmatch + TmatchOR > all grow quadratically (with window size and issue width).

Selection logic
     Tselc = c0 + c1 x log4(WINSIZE)

Data Bypass Logic
     S pipestages > (2 x IW^2 x S) bypass paths (for 2-input func units)
     datapath and control
     wire delay = 0.5 x R x C x L^2 (L = length of wire)
Even if alternative layouts are used, the quadratic delay growth of bypasses remains
reg file/bypass within a cluster is single cycle, across clusters multi-cycle.
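
To make the scaling above concrete, here is a minimal back-of-the-envelope sketch in C. The c0/c1/c2 constants and the window-size-per-issue-width relationship are invented placeholders, not the paper's fitted technology values; only the functional forms come from the notes above.

#include <stdio.h>
#include <math.h>

/* Rough sketch of how the delay terms scale with issue width (IW) and window
   size; all constants are made-up placeholders. */
static double quad(double c0, double c1, double c2, double x) {
    return c0 + c1 * x + c2 * x * x;
}

int main(void) {
    for (int iw = 2; iw <= 8; iw *= 2) {
        int winsize = 8 * iw;                                   /* assumed window size */
        double rename = quad(1.0, 0.5, 0.05, iw);               /* Tdecode/Twordline/Tbitline ~ c0 + c1*IW + c2*IW^2 */
        double wakeup = quad(1.0, 0.4, 0.08, iw)
                      + quad(0.5, 0.2, 0.04, winsize);          /* tag drive + tag match, quadratic terms */
        double select = 1.0 + 0.6 * (log(winsize) / log(4.0));  /* Tselc = c0 + c1*log4(WINSIZE) */
        double wirelen = 2.0 * iw;                              /* bypass wire length grows with datapath width */
        double bypass = 0.5 * 0.1 * 0.2 * wirelen * wirelen;    /* 0.5 * R * C * L^2 */
        printf("IW=%d rename=%.2f wakeup=%.2f select=%.2f bypass=%.2f\n",
               iw, rename, wakeup, select, bypass);
    }
    return 0;
}
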
     
>> biggest problems > window logic and bypasses

>> Atomic operations
     wakeup and select
     data bypassing

New Microarch defn
     dependence based microarchitecture
          issue window > number of FIFOs. 
          inorder issue of dependent instructions
          bypass is propagated to the head of the instr queue
          
     FIFO entry : 
          all operands of I are ready : steer to a new empty FIFO
          1 outstanding operand : steer to the FIFO whose tail is the producer of that operand, else a new FIFO
          two outstanding : try both operands' producer FIFOs (sketched below).
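
A minimal sketch of that steering heuristic, assuming a small table of FIFO tails; the structure and function names are my own illustration, not the paper's hardware.

#define NFIFO 8

typedef struct { int tail_dest; int occupied; } fifo_t;  /* tail_dest = dest reg of the instr at the tail */

static fifo_t fifos[NFIFO];                              /* zero-initialized: all FIFOs start empty */

static int first_empty(void) {
    for (int i = 0; i < NFIFO; i++) if (!fifos[i].occupied) return i;
    return -1;                                           /* none free: stall dispatch */
}

/* Pick a FIFO for an instr writing `dest`; readyN says whether source N is available. */
int steer(int dest, int src1, int ready1, int src2, int ready2) {
    int target = -1;
    if (ready1 && ready2) {
        target = first_empty();                          /* all operands ready -> new empty FIFO */
    } else {
        int pending[2], n = 0;
        if (!ready1) pending[n++] = src1;
        if (!ready2) pending[n++] = src2;
        for (int k = 0; k < n && target < 0; k++)        /* try the FIFO whose tail produces the operand */
            for (int i = 0; i < NFIFO; i++)
                if (fifos[i].occupied && fifos[i].tail_dest == pending[k]) { target = i; break; }
        if (target < 0) target = first_empty();          /* otherwise fall back to a new FIFO */
    }
    if (target >= 0) { fifos[target].tail_dest = dest; fifos[target].occupied = 1; }
    return target;
}
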

     clustering based dependent 
          group issue and exec units into clusters (these can be FIFO- or window-based)
          PEW : parallel execution windows     

Conclusions : 
     parameters that affect delay > issue width and issue window size.
     data bypass logic
     may affect in future > register files and caches


Andreas Moshovos, Scott E. Breach, T. N. Vijaykumar, Gurindar S. Sohi, Dynamic Speculation and Synchronization of Data Dependences. ISCA 1997: 181-193. ACM DL Link



> predict if the execution of an instruction is likely to result in a data dependence mis-speculation
> provide sync needed to avoid mis-speculation. 

Control speculation and data speculation

data > value and dependence speculation.

With the small instr window sizes of modern dynamically scheduled processors, the probability of mis-speculation is small, so the perf loss due to erroneous data dependence speculation is small. This paper : what if the instr windows are large?

> predict instr whose immediate exec is going to violate true data dependence
> delay the execution of those instr as long as it is nec to avoid the mis-spec. 

Dependences > unambiguous or ambiguous. If 2 instr linked via an ambiguous dependence resolve to no dependence => speculation is okay. 
Problem is most acute for production and consumption of data through memory; this paper only looks at that.

a load is allowed to execute before a store on which it is ambiguously dependent (normally processors do this blindly). 
As window sizes grow larger, minimizing the net cost of mis-speculation becomes important. 
     min the amount of work lost due to misspec
     reduce the time required to redo the work that is lost on misspeculation
     reduce the prob of mis-spec. 

to min the impact 
     predict if the immediate exec is likely to violate true data dependence
     predict the store/stores that the load depends upon
     enforce synchronization between the dependent instr.

selective data dependence speculation > choose which loads should be speculated 

static store-load instr pairs that cause most of the dynamic data mis-speculations are relatively few and exhibit temporal locality > build the association, e.g. save the address accessed, or use the dependence edge as a handle : PC of the store and load. But a static dependence between a given store-load pair > multiple dynamic dependences. 

differentiate dynamic instances of the same static dependence > a tag for each instance; the dependence distance can be used

Implementation 
     Memory dependence prediction table
          valid flag, LD PC, ST PC, DIST, optional prediction
     Memory dependence synchronization table
          Valid, LD PC, ST PC, LD ID, ST ID, INST, full/empty flag (conditional variable).
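
A sketch of the two table entries as C structs, built from the fields listed above; the field widths and the optional predictor encoding are my assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

typedef struct {            /* Memory Dependence Prediction Table (MDPT) entry */
    bool     valid;
    uint32_t ld_pc;         /* PC of the load */
    uint32_t st_pc;         /* PC of the store it mis-speculated against */
    uint16_t dist;          /* dependence distance, to tell dynamic instances apart */
    uint8_t  prediction;    /* optional prediction state (e.g. a small counter) */
} mdpt_entry_t;

typedef struct {            /* Memory Dependence Synchronization Table (MDST) entry */
    bool     valid;
    uint32_t ld_pc, st_pc;
    uint32_t ld_id, st_id;  /* identify the specific dynamic load/store instances */
    uint32_t inst;          /* instance tag derived from the dependence distance */
    bool     full;          /* full/empty flag used as a condition variable:
                               the load waits while empty, the matching store sets it full */
} mdst_entry_t;
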

Issues
     Intelligent prediction required
     Incomplete synchronization
          avoid deadlock
          free MDST entry allocated for sync that will never occur (pipeline flush say)
     Misspeculations
          squash MDST entries. 
     Multiple Table Entry matches
          discard all but most frequent mis-speculations. 
     
Conclusions
     larger dynamic window sizes > net performance loss due to data dependence mis-speculation is higher.
     min work lost, min time required to redo, improve spec accuracy
     static data dependencies responsible for majority of mis-speculations are few and dynamically exhibit temporal locality.
     dynamic dependence prediction and synchronization in hardware

Srikanth T Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton, Continual flow pipelines. ASPLOS 2004. ACM DL Link



High system throughput
     Many cores per chip
     constant chip size and power envelopes > many large cores cannot fit on a single chip / small cores traditionally do not provide high single-thread performance

CFP : efficiently manages cycle-critical resources > continues processing instr even in the presence of a long-latency data cache miss to memory.

CFP does : 
     drains out the long-latency-dependent slice (along with ready source values) while releasing scheduler entries and registers
     reacquires these resources upon reinsertion into the pipeline
     integrating results of independent instr without their re-execution

     splits instr into those dependent on the miss and those free to go

decoupling critical resources 
     Scheduler 
          take out L2-miss-dependent instr and store them in a small first-in, first-out Slice Data Buffer
     Register file
          Completed source register : instr that have executed (to be read by slice) > save it in the FIFO
          Dependent destination registers : destination operand of slice instr 

key actions
     identify and drain the slice
     process and retire indep instr
     exec the slice when miss returns
     merge results of indep instr and slice instr > not difficult, the instrs are independent anyway

Perf
     isolated cache misses handled effectively
     memory level parallelism by going ahead after a cache miss



Wen-Hann Wang, Jean-Loup Baer, and Henry M. Levy. Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy, ISCA 1989. ACM DL Link




Virtually addressed cache
     For rapid cache access

Problems
     synonyms (different virtual addresses > same physical address)
     addr translation anyway required for a miss
     cache coherence gets complicated.
     I/O addressed physically
     context switch needs invalidation

Write through for the first level (coherence is simplified) and write back for the second level preferred. 

Writes tend to occur in bursts (in procedure calls) > use write buffers (lots needed) > cache coherence becomes a problem > use write back. 

reverse translation table for detecting synonyms (at the second level). On a v-cache miss > go down > hit > check if a synonym exists > invalidate the old one and copy over under the new virtual address

for using virtual addressing : 
     + memory bandwidth to processor
     -  write backs come in bursts when a context switch occurs (paper suggests dirty dirty bit).

for physical : 
     cache coherence simple
     synonyms headache avoided.
     
     says hit rates for combined I & D cache comparable to split. 

Bruce Jacob and Trevor Mudge. Virtual Memory on Contemporary Processors, IEEE Micro, vol. 18, no. 4, 1998. IEEE Xplore link



Classic Memory Management Unit > Translation look-aside buffer (cache of page table entries), an FSM that walks the page table 

fully associative TLB > bad timing; set associative popular (power consumption/clock speed)

problem with software-controlled TLBs > the OS is typically written generically > not tuned to the hardware on which it operates > unlikely to live up to its potential performance. 

MIPS <<<<
OS handles TLB misses. software fills the TLB, OS defines TLB replacement policy.

64 bit > top 2 bits select user, supervisor and kernel spaces (in 32-bit R2000/R3000 : bit 31 : user/kernel, bits 30:29 : cached/uncached and mapped/unmapped regions). 
virtual address extended with an address space identifier (ASID) (6 bits = 64 processes). 

TLB filled randomly (TLBWR instruction) or the OS tells which slot to replace (TLBWI).

N noncacheable, D dirty (writable) V valid, G global (TLB ignores ASID match > shared memory). 

ALPHA <<<<
Software managed, split TLB/cache.
Not the OS but PALcode (Privileged Architecture Library code). Not-most-recently-used policy for replacement. 
Address space match (ASID indep match). 
Granularity Hint : 2 bits > 8^GH pages = 1 super page. 

PowerPC <<<<
inverted page table  (scalability : page table size tracks physical memory, not virtual)
paged segmentation : TLBs and page table map an extended virtual address space, not application's effective address space. 
segment = 256MB contiguous region of virtual space. 
>> basically things are not contiguous. Each process sees the same 4 GB effective address space, but as segments selected through segment registers (pointers into a much larger virtual space). 
no ASIDs : the OS can instead modify protection on segment registers.
Superpage > block address translation mechanism : BAT takes precedence over TLB when BAT signals hit. 
     BAT has only OS/User priv distinguish > if context switch happens and no sharing is desired, explicitly flush the BAT.

PA-RISC <<<<
segmentation like PowerPC.
Protection ID of the running process compared with the access ID of the virtual page. 
user-level access to space registers > processes can change their address space > additional protection mechanism required. 
each page has an associated protection ID > 2 processes can share parts of their space (not all-or-nothing).

UltraSPARC <<<<
ASIs (vs ASIDs) > don't identify contexts, but identify data formats and privileges. 

IA-32 <<<<
amalgamation of several techniques.
Pentium II
segmented, with no explicit ASIDs. 
segmentation is just not used by OSs; they typically flush the TLB on context switch to provide protection. 
Caches are physically indexed and tagged > no flush on context switch
TLB miss : hardware walks the page table (from CR3 : context page table root). 
sharing : duplicating mapping information across page tables. 
1 byte to 4GB segments : size set by software and can differ for every segment. 
segment descriptor : privilege, granularity (size), etc. 

-----
virtual vs physical caches
protection methods (ASIDs, segments, multiple IDs)
address-space organization (segmented/flat, page table support, superpage support)

virtual cache > data consistency issues managed by OS. 
protection mechanism > must fit the OS's model of object allocation : allocate and destroy large/small objects, 

best of the lot : hardware-walked, inverted page table. 

Vinod Cuppu, Bruce Jacob, Brian Davis, and Trevor Mudge, A performance comparison of contemporary DRAM architectures, ISCA 1999. IEEE Xplore link



Summary paper of commercial architectures. 

Observations
     One time tradeoff between cost, bandwidth and latency
          Latency min > ganging together multiple DRAMs into a wide structure
          (Page mode, interleaving etc.,)
     Widening buses will present new optimization opportunities
          (Locality)
     Buses wide as L2 cache yield best mem latency, but they cannot halve the latency of a bus half as wide. 
          N/2 is a good design choice
     Critical-word first does not work well with burst mode
     Choice of refresh mechanism can alter ave mem latency.

Conventional : 
     Muxed row address, column address, strobes, data out
FPM Fast page mode DRAM
     row address, column, dataout, column, dataout, column, dataout
Extended Data Out DRAM
     dataout happens along with next column address (understand!)
Sync DRAM
     sync
Extended Sync
Rambus DRAM : 
     Banked 4, > 4 rows remain open with each row address. 
Direct Rambus
     1.6Gbytes/s. 17 half-row buffers > 16 full row buffers (Area consideration). 

Conclusions
> address bandwidth problem but not latency problem
> bus speed >> latency
> Locality exists even to DRAM accesses
> use this for future speedups
presently > page mode and internal interleaving to achieve one time performance boost. 


Changkyu Kim, Doug Burger, Stephen W. Keckler: An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS 2002. ACM DL Link


Wire Delay dominated cache > overall access latency is high. 

Design space for Caches
     Mapping    > how many addressable banks and how lines are mapped
     Search      > set of possible places a block can be?
     Movement > always in the same bank?

Structures considered
UCA : Uniform Cache Access.
ML-UCA : multilevel UCA
S-NUCA-1 : mapping predetermined based on block index.
S-NUCA-2 : 2-D switched logic instead of private per bank channels > large number of smaller faster banks.
D-NUCA : each bank form one way of a set, frequently accessed data in the closest set. 

Eval
UCA Contention > bank contention / channel contention. 
S-NUCA-1 : Private Channels > each bank can be accessed indep at high speed. 
S-NUCA-2 : Switched Channels > no large wires. lightweight, wormhole routed 2D mesh 
Dynamic NUCA : spread sets : each set is spread across multiple banks (one way per bank). 
     mapping : simple (start below, go up every bank), fair (distance from focus equal), shared(closest banks split to everyone). 
     Locating : incremental search : closest bank searched first
                    multicast search : some or all banks searched (happens in ||)
                    limited multicast search : first M of N banks.
partitioned multicast : bank set is broken down into subsets of banks.
     search :   partial tag comparison : smart search : ss-performance, ss-energy.
     dynamic movement : generational promotion > on each hit, the block is swapped with the one in the next bank closer to the cache controller. 
     insertion : which location to fill in / replace (zero copy/one copy policy)
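
A toy model of incremental search plus generational promotion within one D-NUCA bank set (one way per bank); names and sizes are illustrative only.

#include <stdint.h>

#define NBANKS 8                                  /* bank 0 is closest to the cache controller */

typedef struct { int valid; uint64_t tag; } line_t;
static line_t bankset[NBANKS];

/* Returns the bank that ends up holding the block after the access, or -1 on a miss. */
int dnuca_access(uint64_t tag) {
    for (int b = 0; b < NBANKS; b++) {            /* incremental search: closest bank first */
        if (bankset[b].valid && bankset[b].tag == tag) {
            if (b > 0) {                          /* generational promotion: swap with the */
                line_t tmp = bankset[b - 1];      /* next-closer bank so hot lines migrate */
                bankset[b - 1] = bankset[b];      /* toward the controller, one hop per hit */
                bankset[b] = tmp;
                return b - 1;
            }
            return b;
        }
    }
    return -1;                                    /* miss: the insertion policy picks the fill bank */
}
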


Viji Srinivasan, Davidson, E.S., Tyson, G.S., "A prefetch taxonomy," Computers, IEEE Transactions on , vol.53, no.2, pp. 126-140, Feb 2004. IEEE Xplore link


Prefetching : access time reduction
     - mis-prediction => increased mem traffic / cache pollution
     - is it early enuf? >> the address might be right, but the data might still arrive late 
     - if too early, the prefetched line might pollute the cache: it might replace something useful / may itself be replaced before use. 

metrics to evaluate : achieved performance (number of misses, total traffic and IPC), coverage and accuracy
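
The two metrics computed the way they are commonly defined (my phrasing; the paper builds a finer-grained taxonomy on top of them): accuracy = useful prefetches / prefetches issued, coverage = misses removed / misses without prefetching.

#include <stdio.h>

int main(void) {
    double issued = 1000, useful = 620;     /* hypothetical prefetch counts */
    double base_misses = 900;               /* demand misses with prefetching disabled */
    printf("accuracy = %.2f\n", useful / issued);
    printf("coverage = %.2f\n", useful / base_misses);  /* assumes each useful prefetch removes one demand miss */
    return 0;
}
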

Arvind, Rishiyur S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. Computers 39(3): 300-318 (1990). ACM DL Link


Programming languages > bad parallelism

     loss of determinacy > significant complexity to establishing correctness

     explicit parallelism management > boring
hardware : von Neumann 

     long memory and communication latency > happens in ||

     synchronization?

     sequential scheduling (PC based)


Tagged-Token Dataflow Architecture


Id : high level language with fine grained parallelism implicit in operational semantics


A good parallel programming lang should

     insulate programmer from no_of_procs, topology/speed of comm network

     identify parallelism auto

     must be determinate (if algo is then program is)


problem with dataflow > loops and recursions > use contexts

W. W. Hwu, R. E. Hank, D. M. Gallagher, S. A. Mahlke, D. M. Lavery, G. E. Haab, J. C. Gyllenhaal, and D. I. August. Compiler Technology for Future Microprocessors. Proceedings of the IEEE 83(12), December 1995. IEEE Xplore link



Paper talks about : 1995
Critical components of the technology which deal with the difficult problems encountered when compiling programs for a high degree of ILP.  [ In particular : predicated execution support and dependence info maintenance ]

Compilers improve program exec speed by
     eliminating unnec instr
     putting the most-used stuff in registers (reduce memory accesses). 
     Making use of the hardware parallelism as required.

To parallelise a program (ILP compilation)
     Dependency extraction
     ILP optimizations
     Core Scheduling (reorder the instructions) 

Definitions (and why)
basic blocks : control in control out block
Control flow graph : control dependence
Register data dependence :  flow dependence (RAW), anti-dependence (WAR), output dependence (WAW); each can be loop-independent (within the same iteration of a loop) or loop-carried (across the loop boundary).
Memory data dependence :  loop-independent memory anti-dependence : unnecessary waiting inside the loop for fear of overwrite.
loop-carried memory anti-dependence. memory input dependence : 2 instr read from the same memory location (can remove redundant loads)

Go beyond basic blocks : Scheduling

Superblock : formed from the trace by duplicating blocks to remove merge points (side entrances) and their associated complications from the trace.
reorder instr within the superblock : global acyclic (in the control flow graph) scheduling.
Speculative code motion  : move instructions above branches.

Techniques : register renaming / loop unrolling / software pipelining (cyclic scheduling). 
for i = 1 to bignumber { A(i); B(i); C(i) } >> overlap iterations in software: in the steady state issue A(i+2), B(i+1), C(i) together (see the sketch below). 
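
A hand software-pipelined version of that loop, a minimal sketch assuming A, B and C only depend on each other through the same element i and that n >= 2.

extern void A(int i), B(int i), C(int i);

void pipelined(int n) {
    A(0);                         /* prologue: fill the pipeline */
    A(1); B(0);
    for (int i = 2; i < n; i++) { /* kernel: one stage from each of three iterations per pass */
        A(i); B(i - 1); C(i - 2);
    }
    B(n - 1); C(n - 2);           /* epilogue: drain */
    C(n - 1);
}
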

Predicated Execution
Transformation of conditional branches into predicates : control dependences > data dependences. 
>>> Predicated exec allows the compiler to trade instr fetch efficiency for the capability to expose ILP to the hardware along multiple execution paths. 
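
If-conversion in miniature: the first version has a control dependence on (a < b); the second turns it into a data dependence through the predicate p, which a predicated ISA can map onto guarded/select instructions. Plain C, just to illustrate the transformation.

int branchy(int a, int b, int y, int z) {
    int x;
    if (a < b) x = y;       /* control dependence: which assignment runs depends on the branch */
    else       x = z;
    return x;
}

int predicated(int a, int b, int y, int z) {
    int p = (a < b);        /* the predicate */
    int x = p ? y : z;      /* both paths become straight-line code guarded by p */
    return x;
}
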

Hyperblock : connected basic blocks where control may only enter at the first block, but may exit anywhere. 
(tradeoff : branch performance with resource usage, hazard conditions and critical path lengths).

Branch combining replaces a group of exit branches with a corresponding group of predicate define instr with OR-type semantics; once you reach the single combined exit branch, find out why you got there and jump to that exit destination. 

Loop peeling : compiler peels away the first several iterations of a loop and combines them with the outer loop. 
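
A tiny sketch of peeling one iteration (a hypothetical body() stands in for the loop contents): the special first iteration is pulled out so the remaining loop body is uniform and can be merged/scheduled with surrounding code.

extern void body(int i);

void peeled(int n) {
    if (n > 0) {
        body(0);                        /* peeled first iteration */
        for (int i = 1; i < n; i++)     /* remaining iterations, now free of the i == 0 case */
            body(i);
    }
}
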

Control height reduction : shortens the dependence chain length needed to compute the execution conditions of instr 

 


Jerry Huck, Dale Morris, Jonathan Ross, Allan Knies, Hans Mulder, Rumi Zahir. Introducing the IA-64 Architecture. IEEE Micro, vol. 20, no. 5, pp. 12-23, Sep./Oct. 2000. IEEE Xplore link

Intel-HP Itanium architecture

Instr encoding
     128 registers => 7 bits to specify each of 3 reg operands
     128-bit bundle encoding : 3 instr (3 x 7-bit reg specifiers each) + 5-bit template > marks starts and stops of groups of instr that can exec in parallel, helps decode and route the instr

Sequential semantics : compiler knows things commit in order
IA-64 : parallel semantics : 
     ILP : instruction group (safely exec in parallel)
     Control flow parallelism : 
          combine something like if ((a==0) || (b<=5) || (c!=d)) then r8 = 4 into one parallel instr group
          used specifically in multiway branch pred
     predication 

     Control Speculation
           speculative load (ld.s) > propagate the possible exception to all dependent instr; a chk.s at the very end finds out if the speculation is okay or branches to fix-up code
     Data Speculation
          because of pointers, data speculation is pretty difficult
          use speculation similar to above : load speculatively (ld.a), store the address and size in an Advanced Load Address Table (ALAT). 
          subsequent stores check this table for a possible match and remove matching entries; at the check (chk.a), if the entry is still present the speculation succeeded, else branch to fix-up code.
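
A C-level illustration of what the advanced load plus check buys the compiler. The ALAT bookkeeping is simulated here with a plain pointer comparison, which is an assumption for illustration only, not the real hardware mechanism.

int hoisted_load(int *p, int *q, int a) {
    int x = *p;     /* advanced load (ld.a): hoisted above the possibly-aliasing store,
                       the address of p is recorded in the ALAT */
    *q = a;         /* the store snoops the ALAT and removes a matching entry */
    if (q == p)     /* chk.a: if the entry is gone, fall into fix-up code */
        x = *p;     /* fix-up: redo the load (and anything that consumed it) */
    return x;
}
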
     Register model
          First 32 are static, the remaining registers are stacked (and can rotate). 
          when you enter a procedure, alloc local and output regs; on exit the register renaming reverts.
          when overflow happens, the RSE (Register Stack Engine) spills regs to mem so a restore can take place later.
     Software pipelining 
          loop-level parallelism
          Loop count register, epilog count
          rotating registers help, predicates

Virtual memory
     core of the OS's multitasking and protection mech
     because of 64 bits (16 billion GB), exhaustive hardware translation structures are not feasible
     Regions 
          on context switch, OS only flushes exit region
     Protection Keys 
          page granular control (shared between OS/user kinds)
     multiple page sizes 
     

John Goodacre and Andrew N. Sloss, Parallelism and the ARM instruction set architecture. Computer, July 2005. IEEE Xplore


Typically : Performance and Efficiency methods
     Variable execution time
     subword parallelism
     DSP (Specialization)
     TLP, exception handling
     multiprocessing

Variable execution time
     multiple loads/stores on single instr
          prologue and epilogue of subroutines
          code density

Inline barrel shifter
Conditional execution (predication)
16-bit thumb instr set (read separately!)

Data Level Parallelism
     sub-word SIMD (divide the 32 bits into 4x8-bit / 2x16-bit lanes and operate in parallel)
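
A software (SWAR) sketch of the same idea: add four packed 8-bit lanes inside one 32-bit word without carries crossing lane boundaries, which is what the ARM packed-arithmetic instructions do in hardware.

#include <stdint.h>

uint32_t add4x8(uint32_t a, uint32_t b) {
    uint32_t low  = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);  /* add the low 7 bits of every lane */
    uint32_t high = (a ^ b) & 0x80808080u;                  /* top bit of each lane, added without carry-out */
    return low ^ high;                                      /* per-lane modulo-256 sum */
}
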
Thread Level Parallelism
     ? Improve exception handling
     increases complexity in interrupt handler, scheduler, context switch
     > Special instructions : 
          CPS : Change processor state
          RFE : Return from exception
          SRS : Save return state
Multiprocessor Atomic instructions
     LDREX (load exclusive) / STREX (store exclusive)
     >> Physically tagged cache over virtually tagged : 20% improvement in overall performance.
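
A minimal spin lock built on a compare-and-swap in C11 atomics; on ARMv6+/ARMv7 the compiler lowers the CAS to an LDREX/STREX retry loop. A sketch, not a tuned lock.

#include <stdatomic.h>

typedef atomic_int spinlock_t;

void spin_lock(spinlock_t *l) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(l, &expected, 1)) {
        expected = 0;                                  /* CAS overwrote it with the observed value */
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;                                          /* read-only spin to cut exclusive-access traffic */
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
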
Instruction level parallelism

2004 : cost-performance-through-MHz wall
Why multicores ? 
     High MHZ > costly
     ILP is complex and costly (Extracting)
     Programming multiple independent processors > non portable and inefficient
ARM 11 : multiprocessor 
     Generic interrupt controller
     Snoop Control Unit

Enhanced atomic instructions
Lock-free synchronization > wake-up/sleep spin locks
CPU number and context registers with privileges
Weakly ordered memory consistency
     wmb() > write memory barrier
     rmb()  > read mem barrier
     DSB() > drain store buffer
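
A sketch of why the barriers matter, written with C11 fences (the kernel macros above play the equivalent role): the producer orders the payload store before the flag store (wmb-style), the consumer orders the flag load before the payload load (rmb-style).

#include <stdatomic.h>

static int payload;
static atomic_int ready;

void producer(void) {
    payload = 42;                                            /* plain data store */
    atomic_thread_fence(memory_order_release);               /* ~ wmb(): keeps the data store ahead of the flag */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);  /* publish the flag */
}

int consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                                    /* wait for the flag */
    atomic_thread_fence(memory_order_acquire);               /* ~ rmb(): keeps the data load after the flag */
    return payload;                                          /* guaranteed to see 42 */
}
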
    
SMP performance
     Cache Coherence (SCU : at CPU speed)
     Inter processor communication (Software initiated interprocessor interrupt, async)
     Load Balanced interrupt handling
     SCU : 
          copy of physical address tag for fast access
          migratory line > if a line moved from shared to modified and another processor requests it, it's assumed that processor will eventually write it, so it's moved directly to M. Cool for locks and stuff



Richard M. Russell. The Cray-1 Computer System, Communications of the ACM, January 1978, pp. 63-72. ACM DL Link


1978

Vector processing engine
     128 MFLOPS! cool. 
     Cylindrical chassis to keep wiring distances small

Banked memory

Registers
     A > Address(24 bit), B > cache for A
     S > Data (64 bit), T > cache for S
     V > Vector, VM > vector mask
     Supporting : VL > vector length, VM, PC, base address, limit address, flag, mode, and exchange address registers

Chaining
     result of one func unit as input to other 
     > bypass basically
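
A minimal loop of the kind chaining speeds up (plain C for illustration): the multiply unit's results stream straight into the adder element by element, instead of the add waiting for the whole multiply vector to finish.

void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* vector multiply chained into vector add, then store */
}
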
Interrupt 
     stop PC inc, complete mem operations, vector proc
     call exchanges by OS to handler

Software/Vector processing
     Cray OS, Fortran compiler
     2 or 4 or more elements in vectors > better than scalar (earlier 100s required)
     complex vector routines > scalar loop calls
     array increments = integer, start point and array. 

Front End 
     at least a minicomputer, connected by a Cray link

Cooling problems
Circuit boards
     make wires equal length, else standing waves appear on the ground line; prop delay balancing > purely resistive load. 
     5 layer PCB


Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996, pp. 28-40. IEEE Xplore link

Look at the paper for figures. This paper is the coolest, in that it pretty much explains all that goes into an out-of-order processor in enuf detail (hardware-wise). You should read this paper fully 100 times. 

Specs
     Fetch and decode 4 instr per cycle. 
     speculative exec with a four-entry branch stack. 
     dynamic out-of-order exec
     register renaming (map tables)
     in-order graduation (precise exceptions)
Exec units Pipeline
     Nonblocking load/store unit. 
     dual 64-bit int ALU
     FPU
     pipelined adder with 2 cycle latency
     multi with 2 cycle latency
hierarchical, nonblocking mem subsystem 
     on-chip 2-way associative primary caches
     external 2-way assoc secondary cache
     64-bit multiprocessor system interface with split-transaction protocol


Pipeline
Stage 1 : Fetches and Aligns (4 instr)
Stage 2 : Decode and rename (target address for branch and jump)
Stage 3 : Move to Queue and check if all operands ready (Wait here till ready)
Stage 4-7 : Exec 
Int and FP have independent register files. 
Instruction Fetch 
fetch and decode at higher bandwidth than exec
keep the queues full.
and many instrs are discarded when mispredictions happen.
16-word cache line => fetch 16 instr, aligner picks 4, in 0,1,2,3 order with a rotate (minimizes dependency logic)
Branch Unit
addr[11:3] > 512 entry branch history table, 2-bit algo. 
87% on SPEC92.
delay slot instr executed. No advantage in OoO, but backward compatibility.
Branch Stack
> Saves state in a 4-entry branch stack. 
(i) Alternate branch address
(ii) complete! copies of integer and fp map tables.
(iii) misc control bits
Branch Verif
Fetching along mispredicted paths > unneeded cache refills! It's kind of okay because eventually these might be used.
4-bit branch mask accompanies each instr (in the queues and exec pipes), marking pending verifications. If mispredicted, abort.

Decoding Logic
Stops when queue/active list is full. 
Hi and Lo for multiply and divide => 2 register writes => exception handling is difficult.
control register modifications execute serially (mostly in the OS kernel). 
Register Mapping
dependencies : register operands, memory address, condition bits. 
logical register numebrs into physcial-register numbers. 
instruction order is taken care of. 
Each physical register is written exactly once after assignment from the free list. 
Until it's written, it's busy. 
When the new mapping graduates => the old mapping's register is freed. 
33 logical and 64 physical. 
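
A sketch of the rename bookkeeping described above: map table read, free-list pop, old mapping recorded for the active list. Sizes and function names are illustrative, not the exact R10000 structures.

#define NLOG  33
#define NPHYS 64

static int map[NLOG];                /* logical -> physical */
static int freelist[NPHYS], nfree;   /* unassigned physical registers */
static int busy[NPHYS];              /* set from assignment until the value is written */

typedef struct { int logical; int old_phys; } active_entry_t;   /* what graduation/rollback needs */

void rename_init(void) {             /* initially logical i maps to physical i; the rest are free */
    for (int i = 0; i < NLOG; i++) map[i] = i;
    nfree = 0;
    for (int p = NLOG; p < NPHYS; p++) freelist[nfree++] = p;
}

active_entry_t rename_one(int src1, int src2, int dest,
                          int *psrc1, int *psrc2, int *pdest) {
    *psrc1 = map[src1];              /* read current mappings for the operands */
    *psrc2 = map[src2];
    int newp = freelist[--nfree];    /* new physical register for the destination */
    busy[newp] = 1;                  /* busy until the instruction actually writes it */
    active_entry_t e = { dest, map[dest] };   /* old mapping goes to the active list */
    map[dest] = newp;
    *pdest = newp;
    return e;
}

void graduate(active_entry_t e) {    /* commit: the old physical register is freed */
    freelist[nfree++] = e.old_phys;
}

void rollback(active_entry_t e) {    /* exception rollback: undo the mapping */
    freelist[nfree++] = map[e.logical];
    map[e.logical] = e.old_phys;
}
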
Register Map table 
Reads the mappings for 3 operands
Writes the current operand mappings and the new destination mapping into the instr queues.
Active list saves the previous destination mapping. 
Free List
Unassigned physical registers. 
4 parallel lists (graduates 4 at a time), each an 8-deep circular FIFO. 
Active list
All instr currently active. 
Up to 32 instr can be active > 4 parallel, 8-deep circular FIFOs. 
5-bit tag that equals an address in the active list. 
exec complete => tags sent here and marked complete.
Also has : logical-destination register number and old physical register number. 
An instruction's graduation commits its new mapping. > Old physical register goes to free list.
Branch rollback > from the branch stack (mispredictions can be repaired early)
Exception rollback > undo the mappings from the active list (because you can't checkpoint at every instr)
Busy-bit tables
when register exits the free list, set bit. 
used for dependency between integer and floating point pipes. 
Instruction Queues
Integer Queue
     16 entries, no specific order
Address Queue
     Preserves program order (others need not)    
     Basically the load/store queue.
Two matrices (16 x 16) track dependencies between memory accesses. 
1: stop unnecessary cache thrashing (older instr obviously gets preference)
2: Load bypass : if pending store has the value. 
Sequential memory consistency (this part is in order). But external access to memory? If a jackass from outside writes into the memory after it's been read from the cache but before the register is updated, take a soft exception and restart from the load instr.
Store instr : writes into the cache precisely when it graduates from the active list.
Atomic instr : Load-Link, Store-Conditional pair. 
Floating Point Queue
Memory Hier
The usuals. Non blocking 2 level set associative cache. 
44-bit virtual address > 40-bit physical address. 
iCache has predecoded instructions (makes decoding easier later on) : 4 extra bits say which functional unit executes it, opcodes modified slightly. 
9 bit ECC and 1 bit parity. 

Timothy J. Slegel, et al. IBM's S/390 G5 Microprocessor, IEEE Micro, Mar/Apr 1999, pp. 12-23. IEEE Xplore link

S/390 G5 processor : 1999 
     it took 10 years for CMOS to reach the performance of the last bipolar system
     L2 ran at half speed (500 MHz core clock at 1.7V)
     full custom 

microarch
     not superscalar
     ESA/390 : older ISA :      numerous, relatively commonly used instr that require tens/hundreds of clk cycles to execute. 
                     not load-store
                     too complex : difficult to implement : decimal data instr, addressing modes, multiple address spaces, precise interrupts, VM emulation and 2 different floating point archs. 
     G5 uses millicode 




     L1 cache : buffer control element : the cache itself, cache directory, TLB, address translate logic
     I unit : instr fetch, decode, addr gen, queue of instr waiting
     E unit : exec, local working copy of registers
     R unit : recovery unit : checkpointed copy of the entire microarch state, timing facility

L1 cache is unified (instr, operand, millicode data) and is store-through. 2-way interleaved

>> 256 bytes : ideal cache line size : compromise between fetch time of the last byte and perf improvement

Absolute address history table : predicts the translated (absolute) address ahead of the TLB 
TLB : dynamic address translation and access register translation 
ART : access register translation, with an ART lookaside buffer (ALB) 

2-way set associative BTB, 2048 entries
Decimal unit for financial data 

Exception handler : 
     I unit tries to find out which instructions will cause an exception 
     does something called single-instruction mode : all previous instructions are cleared and this particular instr is sent thru alone
     normal speed detection : gross, pessimistic check
     single instr mode : actual precise check

Registers
     E unit has local copies
     R unit has a RAM-type master copy (architectural state) of the registers
     on commit, R unit is written (with ECC)
     R unit used for recovery (checkpoint)

Millicode : executing a complex instr becomes like a hardwired subroutine call
     > uses a completely indep set of registers 
     > also service functions : hardware error logs, scrubbing memory for correctable errors, supporting operator console functions and controlling low-level I/O operations
     
Virtual machine emulation
     hw support : 3 complete copies of all architected control registers, 3 copies of timing facility registers (host mode, first-level guest and second-level guest)

2 types of floating point

symmetric multiprocessor : memory : uniform access time. 

lots of reliability features
     error checking, parity, state checking, local duplication of control logic and so on. 20-30% of the logic is for error detection/correction!
     fully duplicated I unit and E unit. If the signals don't match, hardware error recovery is invoked.
     in an array of processors : a delete mechanism removes a failing processor completely.
     memory : redundant word lines to automatically replace defective sections (even at the customer site)
     Processor availability facility : scans out the latches from a check-stopped processor, stores them in a special area, the OS then resumes the work on another processor
     PAF + concurrent sparing => completely transparent to the customer. Impressive.


T. Mudge. Power: A first class design constraint. Computer, vol. 34, no. 4, April 2001, pp. 52-57. IEEE Xplore link

Power equations for CMOS
     P = A x C x V^2 x f + A x V x Ishort + V x Ileak
     fmax ∝ (V - Vt)^2 / V
     Ileak ∝ exp(-Vt / 35mV)

     
     reducing voltage has a significant effect on power consumption : P ∝ V^2
     reducing activity does as well : clock gating
     parallel processing is good if it can be done efficiently > if a computation can be split in two and run as two parallel independent tasks at half the rate (with lowered voltage) => power roughly cut in half, no perf loss! (sketched below)
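
A back-of-the-envelope use of the equations above: split the work over two units at half the rate each, lower the voltage accordingly, and compare dynamic power. All constants are arbitrary; only the proportionalities P ~ A*C*V^2*f and fmax ~ (V - Vt)^2 / V are taken from the notes.

#include <stdio.h>

static double pdyn(double A, double C, double V, double f) { return A * C * V * V * f; }
static double fmax(double V, double Vt)                    { return (V - Vt) * (V - Vt) / V; }

int main(void) {
    double A = 0.5, C = 1.0, Vt = 0.3;
    double V1 = 1.2, f1 = fmax(V1, Vt);          /* single unit at full voltage and speed */
    double V2 = 0.9;                             /* lower voltage that still supports f1/2 */
    double P1 = pdyn(A, C, V1, f1);
    double P2 = 2.0 * pdyn(A, C, V2, f1 / 2.0);  /* two units, each doing half the work */

    printf("fmax at V2 = %.3f (needs >= %.3f)\n", fmax(V2, Vt), f1 / 2.0);
    printf("single unit power = %.3f, parallel pair power = %.3f\n", P1, P2);
    return 0;
}
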

worry about
     peak power
     dynamic power
     total energy
     
     >> Measure MIPS/W
     >> or energy*delay

Tech for reducing
     Logic
          Clock gating
          Half frequency and half swing clocks
          async logic
     Architecture
          Memory systems
          buses
               gray code
          parallel processing and pipelining
     OS
          Voltage scaling
          scheduling
with High MIPS/W
     do scheduling such that applications don't complete well before their deadline.

Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor V. Zyuban, Philip N. Strenski, Philip G. Emma: Optimizing pipelines for power and performance. MICRO 2002: 333-344. IEEE Xplore link

Develop an analytical model to understand power and perf tradeoffs for super scalar pipelines.

contributions
    energy models for dynamic and leakage power to capture the scaling of different power components as a function of pipeline depth
    analytical perf model that can predict optimal pipeline depth
    cycle accurate detailed power-performance simulation (with sensitivity analysis of optimal pipeline depth).


someother paper : 7.7 FO4 = optimal pipeline depth for SPEC2000, 5.5 FO4 for traditional and Java/C++. (+latch insertion of 3 FO4).

P = total power of the pipeline, G = performance > energy-delay product ED = P/G^2; more generally the optimal depth satisfies d(BIPS^γ/W)/ds = 0.

Hold power primarily consists of power dissipated in latches, local clock buffers, global clock network, data independent fraction of arrays.
switching power : combo logic, data-dependent array power dissipation, switching factors.

Dynamic power = C x V^2 x f x (a + b) x CGF, where a = functional switching factor, b = glitching factor, CGF = clock gating factor.
Leakage power = V x Ileak

LatchScale = LatchRatio x (FO4 of base design / FO4 of logic per stage)^LGF
  LatchRatio adjusts hold power dissipation relative to total power; LGF = growth of latch count due to logic shape functions, typically > 1

Dynamic power increases more than quadratically with increased pipeline depth.

Metrics with less emphasis on performance (BIPS/W, BIPS^2/W) favor shallower pipelines; deeper pipelines give more perf. Optimal depth for BIPS^3/W is 19-24 FO4.


Using BIPS^3/W :
high LGF favors shallower pipelines, lower values of latch ratio favor deeper pipelines, low-power latches favor shallower pipelines, higher glitch factors favor deeper pipelines, and high leakage > deeper pipelines.

depends on workload characteristics.



Dan Ernst, et al., A Low-Power Pipeline Based on Circuit-Level Timing Speculation, MICRO 2003. IEEE Xplore link

Physical Design paper MICRO 03

A Low-Power Pipeline Based on Circuit-Level Timing Speculation

Background 
     Dynamic Voltage scaling
     Critical supply voltage < V  
     OCV to account for process variations 
     or delay chains (inverters) to measure it after chip manufacturing

Razor
     Shadow latch clocked by a delayed clock, guaranteed to hold the correct value
     compare the actual FF vs the shadow to detect an error 
     combination of architectural and circuit-level tricks
recovery > one cycle penalty.
     Do something like instruction stall/clock gating to account for that one cycle
     Only the flops affected at the PVT corner are replaced by Razor flops
     metastability handling > takes 2 cycles => pipeline flush required.
Recovery methods
     Clock gating
     Counter flow pipelining
    
Proportional control system 
     Voltage scaling based on measured (sampled) error rates

     


Shubhendu S. Mukherjee, Christoper Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin, "Measuring Architectural Vulnerability Factors," Top picks of 2003 in IEEE Micro, Nov/Dec 2003. IEEE Xplore link

Focus 
     Energy consumption
     Power dissipation
     Power delivery
Later
     Stuck at faults, single-event upsets (cosmic rays, 8% increase over generation)
     device degradation

Cause of variations
     random dopant variations (in channel > threshold variations)
     sub-wavelength lithography
     heat flux (dynamic variation)

tolerant design
     leakage invp speed (body bias to control)
     freq + active/leakage power distribution > synopsys
     deterministic to probabilistic /statistic 
     
burn-in > high voltage, high temp for short periods just after the chips are manufactured, to accelerate aging > faults can be caught fast

synthesis : 
     design (multivariate optimization) for perf, active and leakage power, reliability, yield and bin splits.
Physical : 
     Razor
Arch :      
     Reliability and security engine
     small checker hardware
     resilient hardware > multicore redundant


D. Burger et al. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer 2004. Volume: 37, Issue: 7. ACM DL Link

Cool paper

what drove comp arch
1970 : Memory was expensive
1980 : no of devices that could fit in chip  
now : MHz wall

4 major emerging tech characteristics
Pipeline depth limits > fine-grained concurrency mechanisms to improve performance
Power
on-chip communication dominant execution
polymorphism (multi function)

EDGE : Explicit data flow graph execution
Groups of instructions > hyperblock that can be run in parallel

Architecture + compile balance : 
RISC : Arch has to look for (reconstruct) dependencies (dynamic placement, dynamic issue)
VLIW : very difficult for compiler.  (static p, static i)
EDGE : static placement, dynamic issue  

Compiler's jobs
>> Large hyperblock formation
>> Predicated execution (just create a dependency on the first of the instructions; dependence-following will automatically block the rest)
>> Generating legal hyperblocks (device constraints)
>> Physical placement

Parallelism
Data level P is kind of obvious
TLP : 4 threads > evenly distribute exec blocks to threads


Shekhar Y. Borkar: Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro 25(6): 10-16 (2005). IEEE Xplore link

Soft-errors single event upsets

Special radiation hardened circuits
architectural redundancy
localized ECC

Soft error rate
     Silent Data corruption (SDC)
     Detected unrecoverable error (DUE)
Architectural vulnerability factor > what effect does a soft error have on a program's actual output?
     branch predictor's AVF = 0
ACE : Architecturally Correct Execution bits : bits whose corruption would damage the architectural state / program output
     Use Little's law to calculate the average number of ACE bits (conservatively) for a program
     unACE sources
          no-op
          performance enhancing instructions
          predicated false instructions
          dynamically dead code
          logically masked values

We translate Little's law as N = B x L, where N is the average number of bits in a processor structure, B is the average bandwidth of bits per cycle into the structure, and L is the average residence time of an individual bit in the structure.
Find AVFs using performance modeling and SPEC benchmarks
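
A worked use of N = B x L for one structure, plus the resulting AVF; all numbers here are made up for illustration.

#include <stdio.h>

int main(void) {
    double B = 4 * 32;              /* ACE bits entering per cycle (e.g. 4 instr x 32 ACE bits each) */
    double L = 10;                  /* average residence time of a bit, in cycles */
    double N = B * L;               /* Little's law: average ACE bits resident in the structure */
    double total_bits = 32 * 1024;  /* total bits the structure holds */
    printf("average ACE bits = %.0f, AVF = %.1f%%\n", N, 100.0 * N / total_bits);
    return 0;
}
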



Gabriel H. Loh, Yuan Xie, Bryan Black, Processor Design in Three-Dimensional Die-Stacking Technologies, In IEEE Micro, vol. 27(3), pp. 31-48, May-June, 2007. IEEE Xplore link

two natural topologies : face-to-face or face-to-back

copper-copper bonding process builds an interdie connection > die-to-die (d2d) or 3D via. 

d2d via pitch is significantly larger than an individual transistor.
     > size determines how finely processor blocks and funct units can be split across dies.
     > latency.
     > RC delay : 35% of a full stack of vias connecting metal 1 to metal 9. 

30% of pins used for power

high power density

face-to-face > requires through-silicon vias (TSVs) for I/O and power. Low inductance, so no problem.
     3D => 50% footprint of two 2D chips => 50% of the area for pins => power delivery into half the area
     ++ because wire loads are low : power demand goes down

thermals
     successive layers are farther away from the heat sink
     power density
     sim result : 2d processors ~ 3d config worst case temp!
     wire reduction 3D placement
     
Partitioning granularity (Layer 1 | Layer 2)
     Entire cores : CPU | L2 
     Functional block units : ROB | ALU
     Logic gates : Mux[31:16] | Mux[15:0]
     Transistor level : PMOS | NMOS

3D cache
     cores on one layer, cache on another
     2x cores + 2x cache on one, 2x core + 2x cache on other
     within cache
          stacked bit lines
          stacked word lines
          word lines have more delay than bit lines => stacking word lines provides lower overall latency.
          but for power => split the bit lines (they are much longer)

(1) Eliminating critical wires > latency + power reductions
(2) different partitioning strategies to match communication density of a given d2d via interface 
     d2d via pitch does not scale at the same rate as feature size.
(3) partitioning > power, performance, area.

Mixed process integration
     DRAM
     onstack DC to DC convertors
     decoupling capacitors

Cool places to use
     eliminate pipeline wires
     higher timing margins : clock skew/jitter is low
     better performance/watt ratio
     clock frequency improvements > most effective for power
     reduce the number of pipeline stages (wires)
     
NUCA : inherent problem : managing data that cores share.
     Dynamic NUCA
     3D version of L2 : 90% reductions in cache migrations

- - needs 3D place and route, floorplanning tools, 3D visualization and layout.
     a fault on a single layer > complete waste of the entire stack
     DFT > difficult in the presence of finely partitioned 3D structs; a die might have only 50% of a complete circuit. 


Taeho Kgil, David Roberts, Trevor Mudge, "Improving NAND Flash Based Disk Caches," ISCA 2008. ACM DL link


NOR and NAND flash differ in two important ways:

  • the connections of the individual memory cells are different
  • the interface provided for reading and writing the memory is different (NOR allows random-access for reading, NAND allows only page access)

NAND problems
     operations have to performed in pages
     erases have to be made in blocks
     erase before writing. 
     out-of-place write policy 

overcome with ECC as the flash degrades, and that's about it.

Garbage collection
     because you can only write pages and must erase whole blocks, you pretty much need to GC every now and then. At 80% utilization, 20% of the time goes in GC
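
A minimal sketch of out-of-place writes behind a page-level mapping table; block and page counts are arbitrary, and garbage collection is only described in the comment, not implemented.

#include <stdint.h>

#define PAGES_PER_BLOCK 64
#define NBLOCKS         128
#define NPAGES          (PAGES_PER_BLOCK * NBLOCKS)

static int32_t l2p[NPAGES];          /* logical page -> physical page, -1 if unmapped */
static uint8_t live[NPAGES];         /* which physical pages still hold valid data */
static int32_t write_point;          /* next free physical page (log-structured) */

void flash_init(void) {
    for (int i = 0; i < NPAGES; i++) l2p[i] = -1;
    write_point = 0;
}

void flash_write(int32_t lpage) {
    if (l2p[lpage] >= 0)
        live[l2p[lpage]] = 0;        /* out-of-place: the old copy just becomes garbage */
    l2p[lpage] = write_point;
    live[write_point] = 1;
    write_point++;                   /* when free pages run out, GC picks a victim block,
                                        copies its still-live pages to the write point,
                                        then erases the whole block */
}
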



W. Daniel Hillis and Guy L. Steele. Data Parallel Algorithms, Communications of the ACM, December 1986, pp. 1170-1183. ACM DL Link

Programming styles.
     Control parallel style (multiprocessing)
     Data Parallel

Connection Machine Architecture
     16384/65536 processors with 4k bits of memory each. 
     Front end processor (von Neumann style) and array of the above proc.
     Proc array extends the ISA of the front-end processor. 
     SIMD. 

Processors : 
     State bit > whether an instr will be executed.
     Context : the set of processors whose context flag (state bit) is set.
     Immediate values broadcast     
     ILLIAC IV, Goodyear MPP, Non-Von ICL DAP, CM > data parallel.

pointer-based communication > SEND instr : proc address + memory address. kind of indirect store. 

virtual processors > microcoded controller between frontend and array : helps program abstraction from hardware details.

1+ same program for different hardware

2+ processors may be regarded as expandable => Fortran preallocates all memory, but other prog languages > procedure call to allocate a "fresh" processor (e.g. malloc in C) => processor-cons > memory comes with a processor. 

Paper goes on and on about algorithm adaptations to data parallel archs, and yes they can be done. 
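
One of those adaptations is the log-step partial-sums ("scan") idea, simulated sequentially below: at step d, every virtual processor i adds in the value held by processor i - d, with d doubling each step, so log2(N) steps replace N - 1 serial additions.

#include <stdio.h>
#include <string.h>

#define N 8

void inclusive_scan(int x[N]) {
    int prev[N];
    for (int d = 1; d < N; d *= 2) {
        memcpy(prev, x, sizeof(prev));   /* SIMD step: everyone reads before anyone writes */
        for (int i = d; i < N; i++)
            x[i] = prev[i] + prev[i - d];
    }
}

int main(void) {
    int x[N] = {3, 1, 4, 1, 5, 9, 2, 6};
    inclusive_scan(x);
    for (int i = 0; i < N; i++) printf("%d ", x[i]);  /* running sums: 3 4 8 9 14 23 25 31 */
    printf("\n");
    return 0;
}
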

Coolest conclusion :

When operating on lots of data : The parallelism to be gained by concurrently operating on multiple data elements will be greater than the parallelism to be gained by concurrently executing lines of code.      


Review from cs757

 “Data parallel algorithms” talks about algorithms that efficiently run on a large array of parallel processors and introduces techniques for parallelising seemingly serial algorithms with large input data. The system considered executes the front end processor’s (VAX or Symbolics 3600) ISA plus floating point arithmetic on large data in parallel (supported by an array of Connection Machine processors). The introduction of a flexible network is necessary to make programming/compiling easier (it hides the architecture). By using 2 types of instructions, one that executes on all processors (conditionally, based on a context) and another that specifies the interconnect, this system retains the advantages of SIMD albeit with multiple processors. The authors give implementations for algorithms that are inherently parallel such as the sum of an array of numbers, partial sums on an array, sorting etc. Furthermore, adaptations of serial tasks to make full use of the parallel architecture are explained. For example, parsing of a string by starting with each character in parallel and forming larger tokens by analysing neighbors recursively is explained. In the 80s, processing of pointers was part of most tasks, and the paper explains how this can be done over parallel processors.


The paper provides data on run time measured on the 65536 processor Connection Machine system for certain algorithms. The model used assumes unit time for all instructions including data transfer between processors over the interconnect. Also, resource availability is assumed to be infinite. Parallel algorithms can effectively be mapped to the Connection Machine System. The paper tries to prove that many applications that involve a large enough data set can be efficiently mapped to a parallel system by converting it to a binary tree. The key idea is to look at increasing performance by employing one processor per data element.

The paper is more of a marketing white paper for the Connection Machine system. Several examples given are efficiently executed on a parallel processor architecture. However, a few inherently serial operations are tweaked to fit the parallel architecture; the resulting efficiency is highly questionable, and extra complexity is introduced just to find and fit algorithms that run efficiently on this system. Written in 1986, it discusses one of the first vector-data style processors. Unlike current-day implementations of large vector processors, it uses several small processors that synchronously execute the same instruction on different pieces of data. This idea of looking at computing as operating on groups of data rather than sequential execution, by employing a group of simple processors connected by a flexible interconnect network, still inspires new research like Explicit Data Graph Execution. 



John M. Mellor-Crummey, Michael L. Scott: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. Comput. Syst. 9(1): 21-65 (1991). ACM DL link

review from 757

Modern processors include hardware support, atomic instructions in particular, to help correctness in shared memory systems. As the number of processors accessing a data element increases, so do the wait times and the network traffic, resulting in a decrease in performance. The paper first talks about spin locks typically used by busy-wait type programs. At a bare minimum, with the help of atomic instructions, a lock can be test_and_set by a requesting thread before the shared memory is accessed. However this is not scalable: the test_and_set operation is expensive. To reduce this read-modify-write operation, the lock is first read (tested) before test_and_set-ing it. As no request order is maintained, this method can lead to starvation. Instead, counters indicating the number of access requests and releases can be compared to acquire a “ticket” lock. Backoffs now cause a problem: a waiting thread does not know when to re-check availability, and this directly causes other threads to wait. Array-based queuing locks maintain a FIFO of requests. This eliminates starvation (assuming the required FIFO depth is available) but a central FIFO causes network traffic. The authors propose a new “MCS” lock, where each waiter spins on a locally stored node of the lock's queue; a release, however, must explicitly hand the lock to the next waiter and remove the exiting processor from the list.
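
A sketch of the ticket lock mentioned above, using C11 atomics: one fetch-and-increment hands out tickets and threads spin until the "now serving" counter reaches theirs, so service is FIFO and starvation-free. Not tuned (e.g. no backoff).

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

void ticket_acquire(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;                                                   /* spin on the shared counter */
}

void ticket_release(ticket_lock_t *l) {
    unsigned next = atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
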



Barriers are essential synchronization constructs. They ensure that threads proceed ahead only when all threads have reached the barrier. The location or data structure that maintains barrier information is shared. The most obvious is a counter that is incremented when each thread reaches the barrier; the last thread to arrive resets it, triggering a go-ahead. This is not scalable, as too many processors will overload the counter. A combining-tree-like structure ensures that each leaf has only a small number of processors updating it. Dissemination barriers work by processors signaling others about their completion, and tournament barriers forward a winner processor up every node. The authors propose a new tree-based barrier where each processor spins on a local copy of the barrier flag that is woken up by a winner processor.



The authors measure the performance of spin locks and barrier algorithms on the BBN Butterfly and the Sequent Symmetry multiprocessor machines with distributed shared memory. At every lock acquisition or barrier, the time taken is saved, averaged out and reported versus the number of processors contending. More than the new MCS lock and the barrier proposed, the paper excels in the criteria that are defined for evaluation.  



When the locks are locally cached, heavy network traffic occurs each time any change occurs, to maintain coherency. Also, all these algorithms are built on top of atomic instructions (notable exception : Lamport's bakery algo). This would also mean strict in-order completion of loads and stores. Lock behavior when precise interrupts or exceptions occur must also be defined. This comprehensive paper talks about the evolution of ideas in locks and barriers till 1991. The criteria and flowchart it suggests for choosing an approach given constraints are its key contribution.

Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, Anoop Gupta: The SPLASH-2 Programs: Characterization and Methodological Considerations. ISCA 1995: 24-36. ACM DL Link



SPLASH-2 programs look at

     computational load balance

     communication to computation ratio and traffic needs

     important working set sizes

     issues related to spatial locality

     how these scale with problem size 

     number of processors


benchmark intro paper : 

     characterize the programs : basic properties, architectural interactions

     user help : which parameter values to choose



Leonardo Dagum and Ramesh Menon, OpenMP: An Industry Standard API for Shared Memory Programming. IEEE Computational Science and Engineering, Jan-Mar, 1998. IEEE Xplore link, Local uw link

Scalable hardware exists : k-ary n-cube, fat tree, multistage interconnect
MPI typical scalable approach. 
zeroth order scalable : no cache coherence, just scalable interconnect network (MPI to sync, coh)

SSMP : scalable Shared memory multiprocessor. (MPI can be built on)
writing code? : not portable!

Why SSMP : fast shared memory locks, directly access mem throughout the system

MPI 
     -- need to explicitly partition data structures. 

pthreads : task parallelism, not data p.

OpenMP 
     compiler primitives + callable runtime library routines
     fork/join execution model.

Control Structure     > parallel, do and single
Data env              > shared, private or reduction (e.g. sum = sum + x), copyin etc.,
synchronization       > Implicit (at the beginning and end of parallel), removed with the nowait clause
                      > Explicit : atomic, flush, barrier, critical
runtime library       > query and lock functions etc.,
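
A small C illustration of the constructs listed above (parallel region, work-shared loop, reduction, nowait plus an explicit barrier); generic OpenMP usage, not from the paper:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double a[1000], sum = 0.0;

        #pragma omp parallel                  /* fork: a team of threads */
        {
            #pragma omp for nowait            /* work-shared loop, implicit barrier removed */
            for (int i = 0; i < 1000; i++)
                a[i] = i * 0.5;

            #pragma omp barrier               /* explicit synchronization instead */

            #pragma omp for reduction(+:sum)  /* private partial sums combined at the end */
            for (int i = 0; i < 1000; i++)
                sum += a[i];
        }                                     /* join */

        printf("sum = %f, threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }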

A.R. Alameldeen and D.A. Wood, "Variability in Architectural Simulations of Multi-threaded Workloads," HPCA 2003. IEEE Xplore


Space Variation 
     two runs exhibit different performance characteristics (OS scheduling decision, lock orders, measurement interval etc.,)
Time variability
     workloads exhibit different performance char over time

Accounting for space v
     confidence interval : +/- 1 sigma
     enuf sample size : n : CLT application
     Hypothesis testing
Accounting for Time V
     start at different points
     ANOVA : analysis of variance (start at same point go to n, n+x n+2x and see if variance changes) and then do some dagajchi
     

L. Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs,” IEEE Transactions on Computers, vol. 28, no. 9, pp. 690-691, Sept. 1979. IEEE Xplore link


Sequential consistency
Requirement R1: Each processor issues memory requests in the order specified by its program
Requirement R2: Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of entering the request on this queue

Anoop Gupta, Wolf-Dietrich Weber: Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Trans. Computers 41(7): 794-810 (1992). IEEE Xplore link

Conclusion : Directory based schemes with 3-4 pointers per entry should work well for executing well-designed || programs.

cache line sizes
    increase > larger invalidations
                 > data traffic goes up
                 > coherence traffic comes down
                 > overall traffic min when line size = 32.
         
Classification of data objects
    Code and read-only data
    Migratory data                                     high proportion of single invalidations
    Mostly-read data                                 small invalidations
    Freq read/written objects                     large inv, eg : number of processors waiting in a global queue
    sync objects                                         locks and barriers
         low contention sync objects             distributed locks, easy to implement, optimal for directory based
         high contention sync objects             

distinguish : large invalidation > write to a line cached in many processors ; frequent invalidation > ...

Effect of Cache line size
large size :
    > better hardware efficiency, prefetching, inc in message traffic (increases min communication granularity between processors). 
    > parallel programs exhibit less spatial locality than sequential programs.
    > false sharing significant. 
    > increase the number of proc sharing a cache line (false sharing) > increase in size of invalidations. 
    > spatiality depends on classes of objects
    > fewer messages of each type (control/data), but size of each data increases. 

Of course : the best case is when the cache line size == the size of the object that's shared. 
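
A hedged C sketch of the false-sharing point above: two counters that are never logically shared but land in the same cache line cause invalidation ping-pong; padding to an assumed 64-byte line separates them:

    #include <pthread.h>

    /* Unpadded: a and b share one cache line, so each write by one thread
       invalidates the line in the other thread's cache (false sharing). */
    struct { long a; long b; } unpadded;

    /* Padded to an assumed 64-byte line: the counters no longer interfere. */
    struct { long a; char pad[56]; long b; } padded;

    void *bump_a(void *arg) { for (long i = 0; i < 100000000; i++) unpadded.a++; return 0; }
    void *bump_b(void *arg) { for (long i = 0; i < 100000000; i++) unpadded.b++; return 0; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, 0, bump_a, 0);
        pthread_create(&t2, 0, bump_b, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;   /* switching the threads to the padded struct removes the ping-pong */
    }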

Sarita V. Adve and Kourosh Gharachorloo. Shared Memory Consistency Models: A Tutorial, IEEE Computer, December 1996, pp. 66-76. IEEE Xplore link

Memory consistency model
     specifies the constraints on the order in which memory operations must appear to be performed with respect to other processors

Sequential consistency
     1) The memory operations of each processor appear in its program order
     2) A single sequential (total) order exists among all processors' memory operations
     3) catch : for processor 0, if a write of A is followed by its own read of A, the read should wait till the write of A has been seen by all the other processors

(1) All cores insert their loads and stores into the memory order <m respecting their program order <p, regardless of whether they are to the same or different addresses (i.e., a==b or a!=b). There are four cases:

     If L(a) <p L(b) ==> L(a) <m L(b)   /* Load -> Load */
     If L(a) <p S(b) ==> L(a) <m S(b)   /* Load -> Store */
     If S(a) <p S(b) ==> S(a) <m S(b)   /* Store -> Store */
     If S(a) <p L(b) ==> S(a) <m L(b)   /* Store -> Load */

(2) Every load gets its value from the last store before it (in global memory order) to the same address:

     Value of L(a) = Value of MAX <m { S(a) | S(a) <m L(a) }, where MAX <m denotes "latest in memory order".

Coherence makes caches invisible; consistency can make shared memory look like a single memory module.

Total Store Order (SPARC)
     A TSO execution requires:

(1) All cores insert their loads and stores into the memory order <m respecting their program order, regardless of whether they are to the same or different addresses (i.e., a==b or a!=b):

     If L(a) <p L(b) ==> L(a) <m L(b)   /* Load -> Load */
     If L(a) <p S(b) ==> L(a) <m S(b)   /* Load -> Store */
     If S(a) <p S(b) ==> S(a) <m S(b)   /* Store -> Store */
     The Store -> Load case is dropped   /* Change 1: this is what enables a FIFO write buffer */

(2) Every load gets its value from the last store before it to the same address, where "before" now also includes the core's own still-buffered stores:

     Value of L(a) = Value of MAX <m { S(a) | S(a) <m L(a) or S(a) <p L(a) }   /* Change 2: need bypassing */
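
The classic store-buffering (Dekker-style) litmus test that separates SC from TSO; a sketch with plain accesses, so actual behaviour also depends on the compiler:

    #include <pthread.h>
    #include <stdio.h>

    volatile int x = 0, y = 0;
    volatile int r1, r2;

    void *p0(void *arg) { x = 1; r1 = y; return 0; }   /* S(x); L(y) */
    void *p1(void *arg) { y = 1; r2 = x; return 0; }   /* S(y); L(x) */

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, 0, p0, 0);
        pthread_create(&t1, 0, p1, 0);
        pthread_join(t0, 0);
        pthread_join(t1, 0);
        /* SC forbids r1 == 0 && r2 == 0; TSO allows it, because each store can
           sit in its core's write buffer while the following load goes ahead.
           A fence between the store and the load restores the SC outcome. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }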

Weak Consistency
     Synchronization operations are sequentially consistent; ordinary accesses between them may be reordered

Release consistency
Lock acquire/release operations do the ordering; two flavors:
     eager, where all coherence actions are performed on release operations, and
     lazy, where all coherence actions are delayed until after a subsequent acquire.
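
A small C11 acquire/release sketch of the idea: the producer's ordinary writes only have to be visible by the time its releasing store is seen, and the consumer only needs them after its acquiring load:

    #include <stdatomic.h>

    int data;                      /* ordinary shared data */
    atomic_int flag = 0;           /* plays the role of the lock/flag */

    void producer(void) {
        data = 42;                                               /* plain write */
        atomic_store_explicit(&flag, 1, memory_order_release);   /* release: prior writes visible */
    }

    void consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)  /* acquire */
            ;
        int v = data;              /* guaranteed to see 42 */
        (void)v;
    }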



Michael Zhang, Krste Asanovic: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. ISCA 2005: 336-345. ACM DL Link

ISCA 05

Problem tackled : in a tiled CMP, what's the best way to organise the L2? 
     Private L2 => - replication | + fast access | - directory (coherence)
     Shared L2 => - network congestion hops. each address has home tile. 

Proposed
     Victim Replication L2VR : 
          Capture evictions from the local primary cache in the local L2 slice. 
          Retained victim is a local L2 replica of a line that already exists in the L2 of the remote home tile. 
          Whats replaced : (1) An invalid line (2) A global line with no sharers (3) An existing replica. 
          - Area overhead : L2 tags must be wide enuf (lines can exist in non homes). 

Related Work 
     Piranha : Compaq's architecture : uses a shared, non-inclusive L2. snoopy bus.
     MIT Raw architecture : claimed non coherent L2 private distributed evenly among 16 processors



Milo M. K. Martin, Mark D. Hill, and David A. Wood, "Token Coherence: Decoupling Performance and Correctness," International Symposium on Computer Architecture (ISCA), June 2003. IEEE Xplore Link

Technology trend
     workload (good thread level parallelism, low latency preferred) > snooping
     Technology trend > directory

+ Fast cache to cache misses
+ No bus like interconnect
     avoid virtual bus ordering
+ Bandwidth efficiency

Token counting > ensures safety
Persistent requests > prevent starvation
Decoupling correctness and performance

Broadcast with direct responses
use unordered interconnect

Need one token to read, all tokens to write.
To avoid data sent with all tokens : owner token (clean/dirty)
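
A toy sketch of the safety invariant (not the paper's implementation): with T tokens per block, a reader needs at least one token and a writer needs all of them, so a writer can never coexist with any other reader or writer:

    #define T 16    /* total tokens per block, e.g. one per cache (assumed) */

    struct block_state {
        int tokens;   /* tokens this cache currently holds for the block */
        int owner;    /* nonzero if it also holds the owner token (clean/dirty) */
    };

    int can_read(const struct block_state *b)  { return b->tokens >= 1; }
    int can_write(const struct block_state *b) { return b->tokens == T; }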

- Non-silent eviction 

Prevent starvation
     invoke after a timeout
     send to all components
     all components remember it in a table and continually redirect all tokens to the requestor
     Deactivate when complete
What if many processors issue persistent requests simultaneously?
     Use starvation-free arbiter
     Single/Banked/Distributed
need to prevent reordering of activate and deactivate messages

- Scalability of persistent requests

Predictor based (reduce broadcasts)
     broadcast-if-shared. 
     Owner
     Groups



Steven L. Scott, Synchronization and Communication in the T3E Multiprocessor, Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996, pp. 26-36. ACM DL Link

Parallelism 
     Explicit : MPI
     Implicit : Shared memory (better suited for irregular/ dynamic par)

Low sync overhead > lots of processors, finer granularity of work split.

Commodity microprocessors inefficient for multiproc
     memory access is cache line based > stride access/vector access pretty bad. 
     Address space limited     (TLB)
     helpful to have non cached memory access (MPI to other proc)
     latency reduction rather than latency toleration

T3D <<<
     Shared address space
     bidirectional 3D torus wiring
     barrier network : 4 wire wide, degree four spanning tree
     remote memory access
          Block transfer engine : bulk, async data transfer between proc memories
          prefetch queue
          DTB annex : extend the memory space outside the processor
     
T3E overview
     PE : DEC 21164 + shell(control + router + local memory)
     self-hosted : Unicos/mk 
     I/O : GigaRing channel
     E registers : shared area
     virtual eureka/barrier network

Global virtual address
     any part can be masked to produce virtual PE 
     Get and put operations to write to E-registers
Atomic memory operations (swap, fetch&inc, fetch&add, compare&swap, masked_swap); a compare&swap-based sketch appears at the end of these notes
Messaging
     SEND command
     message queue control word with tail, limit and threshold to interrupt
Eureka/Barrier synchronization
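
Going back to the E-register atomic memory operations listed above, a generic C11 sketch of how a fetch&add can be built from compare&swap (illustrative only; the T3E performs AMOs in the memory system, not in processor code):

    #include <stdatomic.h>

    /* fetch&add emulated with a compare&swap retry loop */
    int fetch_and_add(atomic_int *addr, int inc) {
        int old = atomic_load(addr);
        while (!atomic_compare_exchange_weak(addr, &old, old + inc))
            ;   /* on failure 'old' is refreshed with the current value; retry */
        return old;
    }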
     

James Laudon, Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. ISCA 1997: 241-251. ACM DL Link

ccNUMA Highly scalable server

Cache-coherent globally addressable memory 
Distributed shared memory <> Directory based protocol

minimize latency between remote and local memory > hardware/software support to ensure most references are local
     also > easy migration for existing SMP software (by keeping the ratio of remote to local latency at a minimum)

Effective page migration and replication
     per-page hardware memory ref counters
     block copy engine (at near peak mem speed)
     reduce TLB update cost

interconnect > multi-level fat-hypercube topology.

sync help > fetch-and-op primitives (MIPS already has load-linked/store-conditional)

Network 
     6x 2x links per router
     wormhole routing
     global arbitration to max utilization under load
     4 virtual channels
     adaptive switch between 2 virtual channels
     CRC checking and retransmission on error

Cache Coherence
     non-blocking
     request forwarding (like DASH)
     silent CEX replacement : clean exclusive state
     network ordering does not matter!
          detect and resolve all ooo deliveries. 
     adaptive routing to deal with congestion
     deadlock
          negative acknowledgements to all requests (DASH does this)
          SGI : backoff message : target of intervention or the list of shares to invalidate. 
          > invalidations + intervention backoff > better forward progress (on heavily loaded systems)

Coherent request buffer >     read and write requests tracked for both processors

Page migration > the requestor's count and the home count are compared; if the difference exceeds a software-programmable threshold register, an interrupt is generated to the home node > the potential migration is handled by the OS.
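
A sketch of that threshold check with hypothetical field names (the real counters are per-page hardware registers and the interrupt is handled by the home node's OS):

    struct page_counters {
        unsigned requester_count;   /* references to this page by the requesting node */
        unsigned home_count;        /* references by the page's current home node */
    };

    /* Nonzero if the hardware would raise the page-migration interrupt. */
    int should_interrupt(const struct page_counters *p, unsigned threshold) {
        return p->requester_count > p->home_count &&
               (p->requester_count - p->home_count) > threshold;
    }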

Directory poisoning 
     Avoid the TLB update problem (shootdown > one processor's action forcing the TLBs to be flushed on other processors is called a TLB shootdown) 
     during read phase > latest copy is written back to memory, directory in Poison state. 
     another processor access > synchronous bus error. 
     error handler > error due to page migration, invalidates the TLB entry for that page. 
     >> low cost to migrate

OTHER architectures
Stanford DASH
     4 processor SMP cluster is one node.
     Origin : 2 proc node with coherence handled by directory-based protocol.
     > + cache to cache sharing within the node.
     > -- intranode sharing significant > more processors required > bus will become huge.
     > -- remote latency
     > -- remote bandwidth = half local bandwidth
     SGI Origin : 2:1 remote to local access.
Convex Exemplar X
     512 proc in 8x4 torus config
     third level cluster cache
     
DSM Systems comparison
     overhead of SMP nodes sets minimum on how effective machines are in small configs
     workload throughput oriented?
     > SMP has lower communication costs between processors within the same SMP node +++
     > parallelism beyond the size of the SMP node is important > latency + bandwidth overhead shall be very large.

Conclusions
     highly modular system
     fat hypercube network high bisection bandwidth, low latency interconnect.
     low latency to both remote memory and local memory
     hw/sw page migration and fast synchronization


Erik Hagersten and Michael Koster, WildFire: A Scalable Path for SMPs, Proc. 5th IEEE Symposium on High-Performance Computer Architecture, January 1999, 172-181. IEEE Xplore link

SUN

Distributed shared-memory (DSM) prototype

Origin is a DSM optimized architecture > reduces remote memory latency

While cc-NUMA has greater potential for scalability, it is less optimal for the access patterns caused by real communication (producer-consumer and migratory data).
OS optimization for capacity miss and conflict miss. 
scheduling is difficult.

SMP : does not care where code/data is placed; a suspended process can be rescheduled on any other proc.

problem : not possible to build SMP with huge number of CPUs spanning several physical boxes. 

DSM protocol : keeps only a handful of nodes coherent. (complexity/latency of nodes reduced by small number of nodes). large node has memory banks > higher bandwidth (interleaving). better node locality. Node is aware of load balancer. 

Wild fire ccNUMA built from unusually large nodes. 

when possible, local memory is used. 
only when a process has more threads than processors, multiple nodes 
> CMR : coherent memory replication (version of S-COMA, simple cache only memory arch). 
   Allocates local "shadow" physical pages. 
   Solaris OS uses integrated hw counters to determine which pages to switch from ccNUMA to CMR. 
   responding to memory access patterns. (adaptive algo)
   -- Memory-resident pages and "large" physical pages cannot be replicated automatically. > can be explicitly replicated.

interface implementation
   NIAC : Network Interface Address controller and NIDC.
   4 NIDC chips controlled by 1NIAC chip (high bandwidth required on the coherent interface).
   NIAC : bus interface and global coherence layer. 

   WFI > 2 translations : local physical CMR address to global physical address and back.

deterministic directory : state of cache and state of directory are always in agreement. 
   > blocking directory (only one outstanding transaction per cache line)
   > three-phase writebacks

Hierarchical affinity scheduling > OS tries to schedule a process first on the processor it last ran. then on some processor on the same node.

>>>>>>>>
SGI Origin : 2 R10000 connected by Hub, Mem + Directory (the 2 processors share the local cache, fast access to remote cache) page migration. 
Wildfire : large number of SMP nodes connected to distributed directory network
NUMA-Q node : i dont know. 

>>>>>>>>

created locality : doing something other than just responding to accesses, in order to create locality 

Coherent Memory Replication and Hierarchical Affinity Scheduling control are implemented as daemon processes (OS)




Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing, Proc. International Symposium on Computer Architecture, June 2000, pp. 282-293. ACM DL link



lots of simple in-order Alpha cores on one chip

commercial workloads : large memory stall component. data-dependent nature, lack of ILP. little use for floating-point and multimedia functionality.

SMT > superior single-thread performance, but needs very wide-issue processors which are more complex to design
CMP > simpler processor cores at a potential loss in single-thread performance. 

Piranha : extremely simple processor : single-issue, in-order, eight-stage pipeline. 

processing node : 8 CPUs, intra-chip switch, system control, home and remote engines, packet switch router and L2 -> memory.
IO chips : same as above, just 1 CPU inside

Intrachip switch : back-to-back transfers without dead cycles. To reduce latency, modules issue the target destinations of future requests (hints) > pre-allocate datapaths and speculatively assert the requester's grant signal. (high priority and low priority). 
fully ordered.

L2 
     8 banks, 8 way set associative. 
     duplicate copy of L1 tags and state at the L2 controllers.
     L2 is a very large victim cache. 
     in case of multiple sharers write back happens only when owner of data replaces it.
     full map centralized directory-based. 
     
home engine
     exports memory whose home is the local node; the remote engine handles imports. 
     
directory storage
     limited pointers 
     coarse vector. 

inter node coherence protocol
     invalidation based directory
     does not depend on point to point ordering => adaptive routing. 
     no NAKs (no protocol deadlocks/network deadlocks)
     no livelock and starvation.

hot potato routing : as age increases, so does priority.
bound the number of messages that can be inserted as a result of a single request.
cruise-missile-invalidates 

inband clock distribution/S-connect. 


Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, pp. 21-29, Mar./Apr. 2005. IEEE Xplore

Commercial server applications (databases and Web services) have huge TLP. 

60 W of power

Data center racks limited by power supply envelope. supplying power and dissipating generated heat

32 way threaded > OS layer abstracting away hardware sharing. 

performance metric : sustained throughput of client requests.

good performance per watt. 

commercial server applications : low ILP : large working sets and poor locality of reference on memory access. data dependent branch very difficult to predict. 

Niagara : simple cores aggregated on a single die shared on chip cache and high bandwidth off-chip memory. 

Pipeline : 4 threads into a thread group > shares the processing pipeline > Sparc pipe. 8 such groups =  32 
crossbar interconnect provides communication link between Sparc pipes, L2 cache and shared resources. (also the point of memory ordering). 

each thread : unique set of registers, instr and store buffers. 
share : L1 caches, TLBs, exec units, most pipeline registers. (critical path by 64 entry fully associative ITLB). 




Register windows
     each thread has 8 register windows. (4 such files for 4 threads)
     set of registers visible to a thread is the working set (register file cells)
     complete set : architectural (SRAM)
          copy happens when window is changed. also the thread issue is stopped. cool!
     
Memory 
     simple cache coherence with write through policy. 
     10 % miss rate. 
     allocate on load, no-allocate on store. 
     stores do not update the local caches until they have updated the L2 cache. 
          copy-back policy, writing back dirty evicts and dropping clean evicts. 
------

Hardware multithreading primer
     Single issue single thread machine 
     Coarse grained multithreading / switch on event multithreading > thread has full use of CPU until long-latency event such as DRAM miss occurs. 
     Fine grained multithreading / interleaved multithreading > cycle boundary.

     fine-grained is also called time-sliced or vertical multithreading; SMT issues from multiple threads in the same cycle
     Chip multiprocessing > each processor executes one thread. 


Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA TESLA: A Unified Graphics and Computing Architecture", IEEE Micro Volume 28, Issue 2, Date: March-April 2008, Pages: 39-55. IEEE Xplore

Compute Unified Device Architecture (CUDA) parallel programming model and development tools

Tesla = unified graphics and computing architecture

Old GeForce : 
     vertex transform
     lighting processor
     fixed-function integer pixel fragment pipeline
     (OpenGL DirectX)
     Vertex shaders, floating point fragment pipeline
Vertex processors 
     > operate on vertices of primitives such as points, lines, triangles
     > transform : coordinates to screen space
     > lighting and texture parameters 
     > output to setup unit and rasterizer (convert lines to dots/pixels)
     > low-latency, high precision math operations
     > programmable typically
Pixel fragment processors
     > interior of primitives/interpolation
     > high latency lower precision texture filtering
     > more pixels than vertices => more of this
Tesla > Vertex + Pixel into one unified processor architecture.

scalable processor array 
     
memory > DRAM + fixed-function raster operation processors (ROPs) that perform color and depth frame-buffer operations directly in memory

Command processing
     > interface to CPU > responds to commands from CPU, fetches data from system memory, checks command consistency and performs context switching
     > input assembler > collects geometric primitives and fetches associated vertex input attribute data.
     > execute vertex/geometry/pixel shader/compute




barrier synchronization

SIMT (multiple thread)
unit creates, manages, schedules and executes threads in groups of 32 || threads called warps. 
each thread executes independently. individual threads can be inactive due to independent branching or predication
full potential > if all threads take same exec path.

SIMD/SIMT
     SIMT : applies one instr to multiple independent threads in ||, not just to multiple data lanes > controls the exec and branching behavior
     SIMD : controls a vector of multiple data lanes. 

warp : 32 threads of the same type (vertex/g/p/c)

Implementing zero overhead > scoreboard qualifies each warp for issue each cycle. instr scheduler prioritizes all ready warps... fairness

intermediate languages use virtual registers, optimizer analyzes data dependencies, allocates real registers, eliminate dead code, folds instr together, optimizes SIMT branch divergence and convergence points. 

Memory access 
     > local/shared (SM)/global
     > coalescing memory over separate requests. 

Texture unit input > texture coordinates outputs > filtered samples, typically four-component (RGBA) color. 
Rasterization > viewport > clip > setup > raster > zcull > block. 

GPU ^
------------------------------------------- 
Parallel computing Architecture

Throughput applications
     extensive data parallelism
     modest task ||
     intensive floating point arithmetic
     latency tolerance (perf is amount of work done)
     streaming data flow (relatively low data reuse)
     modest inter-thread sync

Difference for GPU computing : requires that || threads synchronize, communicate, share data and cooperate > thread block : Cooperative thread array (CTA)
thread : computes result elements selected by its TID
CTA : computes result blocks selected by CTA ID
grid computes result blocks and sequential grids compute seq dep application steps.
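
A plain-C sketch of that decomposition, with the CTA id selecting a block of the result and each thread id selecting an element inside it (this only emulates the indexing; on the GPU every (cta, tid) pair runs as a hardware thread):

    #define CTA_SIZE 256   /* threads per cooperative thread array (illustrative) */

    /* One "thread" of a grid computing c = a + b over n elements. */
    void thread_body(int cta, int tid, const float *a, const float *b, float *c, int n) {
        int i = cta * CTA_SIZE + tid;   /* element selected by (CTA id, thread id) */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* The grid: these loops are what the hardware runs in parallel. */
    void grid_add(const float *a, const float *b, float *c, int n) {
        int ncta = (n + CTA_SIZE - 1) / CTA_SIZE;
        for (int cta = 0; cta < ncta; cta++)
            for (int tid = 0; tid < CTA_SIZE; tid++)
                thread_body(cta, tid, a, b, c, n);
    }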

relaxed memory order : preserves order of reads and writes to the same memory address.
Scaling required. 

Scalability : varying number of SMs, TPCs, ROPs, caches and memory partitions.
future development : scheduling, load-balancing, enhanced scalability for derivative products, reduced sync and comm overhead, new graphics features, increased mem bandwidth and power efficiency. 

     

Charles E. Leiserson, et al., The Network Architecture of the Connection Machine CM-5, Proc. ACM Symposium on Parallel Algorithms and Architectures, June 1992, pp. 272-295. ACM DL link


Thinking Machines Corp
1996

economy of mechanism > machine should have only 1 comm network to convey info. 
CM 5 has 3 networks

Synchronized MIMD machine, no shared memory; procs communicate via message passing
data network > send messages
control network > sync and multiparty comm primitives

32-16384 32 MHz SPARC procs, control procs : Sun Micro work station front ends

processors can be split into user partitions, privileged or non. users time muxed. low os overhead to user task communication.

Network interface
     + simple and uniform view of the network
     + support for time sharing, space sharing and mapping out of failed components
     + decouples design decisions made for the networks
     > memory mapped registers on protected pages << MMU takes care of privilege.
     > Context switching : automatic checkpointing of user tasks. 
     > user's view of the networks is independent of network topology.

CM-5 Data Network
     balancing message loads > fetch deadlock problem
     fat-tree. user partition = subtree in network.
     route to least common ancestor of src and dest. pseudo random choice at each level > load balancing
     differential pair comm > noise immunity and reduced overall power requirements.
     similar to cut-through/wormhole routing
     >>> data network is bound by a contract with the processors to guarantee that deadlock never occurs. 
          typically > reservation mechanism > a max no of messages are outstanding between 2 processors.
               substantial overhead
          CM-5 > left port and right port (virtual channels)
               send the request on the left port always (response can be on any port)
     good efficiency as user controls directly

     all-fall-down mode : user time slice over : just drop the messages down the tree to any node; when the user resumes, that node forwards them to the actual destination

CM-5 Control Network
     split-phase barrier mechanism for synchronization
     broadcasting
          collision > error
     combining
          reduction, forward scan (parallel prefix), backward scan (parallel suffix), router done. 
          first 3 : bitwise logical OR, XOR, signed max, max, addition etc.,
          router done : when data messages have completed.
     async OR of all procs.
     
     implemented as binary tree. 
     message sent up, broadcast to all nodes in that partition.

CM-5 diagnostic Network
     user does not know this exists.
     detects program functionality dependent and independent (DFT)
     JTAG DFT is connected on back plane on diagnostic network > geographical address (cabinets, backplane, slot type slot etc., and network address)

Dally and Towles, "Route packets, not wires: on-chip interconnection networks" , DAC 2001. IEEE Xplore link, http://cva.stanford.edu/publications/2001/onchip_dac01.pdf

DAC 2001

Use something like PCI on chip : all modules talk to each other through a standardized interface, and the network routes packets to destinations. 

Suggested network
     input port : 256 bit data + control field. 
          Type (Flow control digit or FLIT) : 2 bits : new packet (head), tail, middle or idle.  
          Size (4 bits) log of data size.
          Virtual Channel (8 bits) bit mask.
          Route (16 bits) : encoded as 2 bits for each hop till the destination. (NEWS).
          Ready : a signal from the network back to the client indicating network is ready. 

     Output ports

     Registers : ex : time-slot reserv for certain classes of traffic. 

Higher level protocols can be piggybacked on this. 
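
A rough C encoding of the input-port control fields listed above (widths taken from the notes; the layout is illustrative, not the paper's exact interface):

    #include <stdint.h>

    struct flit {
        unsigned type : 2;    /* flow control digit type: head, middle, tail or idle */
        unsigned size : 4;    /* log of the data size */
        uint8_t  vc;          /* 8-bit virtual-channel bit mask */
        uint16_t route;       /* 2 bits per hop: N/E/W/S */
        uint64_t data[4];     /* 256-bit payload */
    };
    /* 'ready' is not part of the flit: it is a signal from the network back to the client. */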

Router : 5 input controllers (NEWS + self), 5 output controllers. 

Fault tolerance : can find out which routers are crappy. and adapt accordingly. 

Pre-scheduled and dynamic traffic. (using virtual channels)

Differences between intra and interchip networks : 
1) lots of pins available. => wide flits possible. folded torus topology is employed. (2x wire demand, 2x bandwidth demand).
2) reduced buffer storage requirements.
3) Specialized transceivers possible.

Area overhead exists (for routers) but 
1) predictable electrical parameters => tweakable
2) reuse with universal interface
3) reuse to network
4) network => duty factor of wires increased.

Structural, performance and modularity advantages if you replace top-level wiring with on-chip interconnection networks.  


Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, David Webb, "The Alpha 21364 Network Architecture," IEEE Micro, vol. 22, no. 1, pp. 26-35, Jan./Feb. 2002. IEEE Xplore link
152 mil transistors 22.4Gbytes/s router bandwidth.

paper talks about the router : 1.2GHz (same speed as the processor)
    using aggressive routing algos, distributed arbiters, large on-chip buffering and a pipelined router implementation
    support for directory-based cache coherence (virtual packet classes)
    
Classes : 
    flit > packet portion transported in parallel on a single clock edge
    Request
    forward
    block response (with data)
    nonblock response (ack)
    write I/O
    Read I/O
    special (no-op packets), buffer dealloc info between routers etc.,

2D torus

Virtual cut through routing
    flits of a packet proceed through multiple routers until a router blocks the header flit. 
    blocking router buffers all the packet's flits until congestion clears
    blocking router schedules the packet to destination. [buffer space for 316 packets]

Adaptive routing 
    minimum rectangle (diagonal distance)
    either continue on the same direction or turn. 
Deadlock avoidance
    because of cyclic dependences
    eg : request packets can fill up a network and prevent block response packets
    > solve : virtual channels with priority. 
    Adaptive routing can create intra-dimension and inter-dimension deadlock
    Jose Duato's theory : as long as packets can drain via a deadlock free path, no deadlock will occur. 
      intra-dimension :     VC0 and VC1 (non-adaptive, horizontal first, vertical next) on the primary axis depend on the secondary axis, but not vice versa. when packets change dimension, they recompute their virtual channels in the new dimension. 
    Dally scheme : incremental number of all processors in a dimension : if source's number is < destination : VC0 else VC1.
         but VC0, VC1 not well balanced.
    VC0, VC1 can return to adaptive if a subsequent router is not blocked. 

Router Architecture
    input and output ports : local (cache and memory controller), interprocessor (off chip network), I/O.
    clock forwarding
    
Local arbiters : various readiness tests to determine if a packet can be speculatively dispatched via the router. 
         valid input, output port is free, next buffer has a free buffer, target output port is free, anti-starvation mechanism, read I/O packet does not pass write I/O packet.

fairness. : coherence dependence priority mode and rotary rule. 

John Kim, James Balfour, and William J Dally. Flattened butterfly topology for on-chip networks. In MICRO 40: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007. ACM DL Link

Most OCNs are low radix (each router connects one node), 2-D mesh : RAW processor, TRIPS, Intel's 80-node Teraflops chip and Tilera 64

Off chip networks : cost increases with channel count; channel count increases with hop count
On chip networks : bandwidth is plentiful but buffers are expensive; reducing the diameter of the OCN lowers power consumption
                              concentration is practical (reduces wiring complexity)

Latency on chip 
     Header latency + serialization latency + time of flight on wires
     Hop count * router delay + packet size / channel bandwidth + time of flight on wires
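
Plugging in made-up numbers just to exercise the formula: with 4 hops, a 3-cycle router, a 512-bit packet on 128-bit/cycle channels and 2 cycles of wire flight, latency = 4*3 + 512/128 + 2 = 18 cycles; halving the hop count (which is what the flattened butterfly's higher radix buys) saves 6 of those cycles.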

Butterfly 
     topology
          on a 2D layout, all routers in a column and row are connected. (direct connections to nth router in a row allowed)
          long wire impact : reduced by optimally inserting repeaters and pipeline register to preserve channel bandwidth > but traversal takes several cycles
     routing and deadlock
          VCs may be needed to prevent deadlock
          Dimension Order routing (go right first then up/down)
          non minimal routing allows path diversity available on flattened butterfly to improve load balance and performance.
          UGAL : 2 steps, minimal to an intermediate node and then minimal to destination
     Bypass channels 
          router bypass arch
          mux arbiter
               direct input is given priority at input of the mux
               bypass channel is given priority at output of the mux. 
          switch architecture
          flow control and routing
Paper has cool numbers about power and area of flattened butterfly high radix networks : 
     Power = Channel + Router + Memory
     Area (flattened butterfly) smaller < due to high radix, several connections possible between 2 nodes => bisection bandwidth constant => thinner wires => lower power
     Smaller buffers required : wormhole : Virtual channel to prevent deadlock. Because of multiple paths : a packet can block another packet from the same source and destination.

Future
     optical signalling
     virtual channels to break deadlock
     ideal latency ?
     flit-reservation flow control 

J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. A Scalable Approach to Thread-Level Speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000. ACM DL link

parallelizing non-numeric and irregular numeric codes is difficult (complex control flow and memory access patterns)

TLS : auto parallelize code 
Epochs : time-stamped with an epoch number (ordering)
Track data dependency between epochs           
homefree token :  when guaranteed no violations, this epoch can commit.

Objectives
     large-scale parallel machine (single-chip MP or SMT) should seamlessly perform TLS across the entire machine > communication is the difficulty
     no recompilation 

Detect data dependence violation at run time > leverage invalidation based cache coherence.
     if an invalidation arrives from a logically-earlier epoch for a line that we have speculatively loaded > bonkers.

Speculation level > Cache at which speculation occurs

whats required
     (i) notion of whether a cache line has been speculatively loaded and/or modified
     (ii) guarantee that a speculative cache line will not be propagated to regular memory
          spec fails if the cache line is replaced!
     (iii) ordering of all spec memory references (epoch numbers and homefree token)

Hardware 
Cache states
     Apart from Dirty, Shared, Exclusive and Invalid : speculatively loaded and speculatively modified. 
     no kicking out till that epoch becomes homefree. (if eviction is unavoidable, speculation fails)
Messages 
     read-exclusive-speculative, invalidation speculative, upgrade request speculative 
     + epoch number of the requester 
     > only hints, no real need to oblige.

when speculation succeeds
     instead of scanning all lines and changing spec states to normal
     Ownership required buffer > when a line becomes both speculatively modified and shared.
     when home free token arrives, generate upgrade request  for each entry in ORB. 

Optimizations
     Forwarding data between epochs : have wait-signal synchronization
     Dirty and spec loaded state : anyway speculated dirty is never evicted, so if you load and then modify, just store it as DSpL state
     suspend the epochs that have violations (resume when u get homefree token)
     Support for multiple writers : combine results (ah!) fine grained SM bits (bytes/words in cache line)

support in SMT machines
     (i) two epochs may not modify the same line
     (ii) t- epoch cannot see t+ modification
      

Conclusions
     8-75% speedups! 
     ORB overhead low (used to commit speculative modifications faster) 


Ravi Rajwar, James R. Goodman: Speculative lock elision: enabling highly concurrent multithreaded execution. MICRO 2001: 294-305. ACM DL link

Speculative Lock Elision

Try to eliminate locks because they cause delays and serialize parallel code. Instead, detect and recover from critical-section sharing violations. 

Claims 
+ enables highly concurrent multithreaded execution
+ simplifies coding (conservatively/randomly put locks, dude method will remove them anyway)
+ Can be implemented easily (store squashing with hardware, Coherence protocol : detection)

Attacks : 
Conservative locking
locking granularity
Old thread unsafe libraries. 



Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill and David A. Wood. LogTM: Log-based Transactional Memory. HPCA 2006. IEEE Xplore link


Version Management
>> Eager : in place on store
>> Lazy  : temporarily leave old value in place 

Conflict detection
? overlap between the write set of trans 1 with the write/read sets of trans 2,3,4...
>> Eager : detect offending loads/stores immediately
>> Lazy   : when transactions commit.

Lazy, Lazy : OCC DBMSs, Stanford TCC
Lazy conflict, Eager Version : None (zombies)
Eager, Lazy : MIT LTM, Intel/Brown VTM (conflict using cache conflicts)
Eager Eager : MIT UTM, CCC (conservative concurrency control) DBMSs, LogTM

Log TM

Version management
>> per-thread before image log in cacheable virtual memory  
>> Last in first out undo.
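
A toy C sketch of eager versioning with an undo log (illustrative; LogTM does this in hardware, logging the before-image of a block on the first transactional store to it):

    #include <string.h>

    #define LOG_MAX 1024

    struct log_entry { void *addr; unsigned char old[8]; int len; };  /* len assumed <= 8 */
    struct tx_log    { struct log_entry e[LOG_MAX]; int top; };

    /* Eager versioning: save the before-image, then write the new value in place. */
    void tx_store(struct tx_log *log, void *addr, const void *newval, int len) {
        struct log_entry *ent = &log->e[log->top++];
        ent->addr = addr;
        ent->len  = len;
        memcpy(ent->old, addr, len);   /* before-image into the log */
        memcpy(addr, newval, len);     /* update memory in place */
    }

    /* Commit is cheap: just discard the log. */
    void tx_commit(struct tx_log *log) { log->top = 0; }

    /* Abort: walk the log last-in-first-out, restoring before-images. */
    void tx_abort(struct tx_log *log) {
        while (log->top > 0) {
            struct log_entry *ent = &log->e[--log->top];
            memcpy(ent->addr, ent->old, ent->len);
        }
    }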

Conflict detection
>> coherence request sent to the directory
>> Directory responds and possibly forwards the request to one or more procs
>> each responding processor examines some local state to detect a conflict
>> the responding processors ack (no conflict) or nack (conflict) the req
>> the requesting processor resolves any conflict
??  What if the cache block is evicted?     
>>> directory still forwards to all "appropriate processors"
>>> Go from M to sticky M [on overflow bit set, check if even sticky M violation happened] 
>>> Lazy cleaning of sticky state (when overflow is not set and its sticky => it was sticky sometime ago, not now) 


From Prof Hill's talk in MSR :       
What happens when conflict happens ?
     wait >> could lead to deadlock
     abort >> could lead to livelock
     possible soln >> contention manager (requesting processor traps to software contention manager)



Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, Andreas Moshovos: Mechanisms for store-wait-free multiprocessors. ISCA 2007: 266-277. ACM DL link

2007

This is a cool paper
eliminates capacity stalls > scalable store buffer (SSB); ordering stalls > atomic sequence ordering (ASO)

Store stalls occur in hardware that supports pretty much all consistency models
Why?
Strongly-ordered
     retirement must stall at first load following a store miss. 
     store buffer to L1D, but useless  as few stores occur between consecutive loads.
     store prefetching and speculative load exec improve performance
     store stalls account for 1/3rd of exec time
models that relax store-to-load order (TSO, processor consistency)
     buffer all retired store
     loads bypass these buffered. 
     CAM to associatively search 
     store coalescing > pretty useless
     RMW and memory fence > drain store buffer < stall till
models that relax all order
     (weak ordering, release consistency, SPARC relaxed memory order (RMO), intel IA-64, HP alpha, IBM PowerPC)
     no store buffer pressure. freely coalesced stores
     memory sync > store buffer has to drain.
     thread sync > ordering delay (unnecessary when no data race actually occurs)

RMW
     
Scalable Store Buffering << 
     closes gap between TSO and RMO
     Conventional store buffer
          forwards values to matching loads
          maintain order among stores for memory consistency
          >> CAMs are used
     Total Store Order Buffer (TSOB)
          no value forwarding
          >> FIFO
          If 2 processors update a cache block L1 may be partially invalidated. 
               CPU private writes (uncommitted stores still in TSOB) must be preserved
               merge uncommitted stores with any remote updates. 
               reloading affected line from memory system and replaying uncommitted stores.

Atomic Sequence Ordering (ASO)
     dynamically groups accesses into atomic sequences
     coarse grain enforcement of ordering constraints
     relaxes ordering among individual accesses. 
          algo : If (ordering required by processor)
               (a) checkpoint, speculatively retire all stores, RMWs, memory fences. > write permission for entire set
               (b) hardware creates new checkpoint and new atomic sequence
               (c) when all write permissions arrive, the first atomic sequence transitions to commit (hold W till complete commit)
          if another proc writes (data race)
               roll back to start of the sequence. 
          complete at least one store before initiating another sequence. 
     Implementation
          (1) must detect when a read in an atomic sequence returns violation
          (2) when this occurs check point recovery
          (3) buffer all of the seq's write and atomically commit or discard (use SSB)

Hardware Implementation
     SSB
          TSOB > address, value and sizes of all stores in RAM structure
          L1 with sub-blocked valid bits.
          16 entry victim buffer for store conflicts
          Invalidations > TSOB entries are replayed to update the affected line
     ASO
          Atomicity violations : Speculatively read bits 

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Proc. 23rd Annual International Symposium on Computer Architecture, May 1996, pp. 191-202. ACM DL link



SMT
     permits multiple independent threads to issue multiple instructions each cycle to a super scalar processor's functional units. 
     combines the multiple-instruction-issue feature of superscalars with the latency-hiding ability of multithreaded archs
    
conventional multithreaded : fast context switching to share processor exec resources. (SMT : dynamic sharing)

impediments to processor utilization : long latencies and limited per-thread parallelism. 

This paper : 
    throughput gains of SMT are possible without extensive changes to a wide-issue superscalar proc.
    SMT need not compromise on single-thread performance. 
    analyze and relieve bottlenecks
    advantage : choose best instr from all threads to fetch and issue in each cycle. 

Changes needed to support SMT in hardware 
    multiple PCs, choose logic
    separate return stacks
    per-thread instr retirement, instr queue flush, trap mechanisms
    thread id
    large register file : logical registers for all threads + additional registers for reg renaming
          this determines the number of instr that can be in the processor between the rename stage and the commit stage. 

Fetch unit partitioning
    frequency of branches, misalignment of branch destinations make it difficult to fill entire fetch bandwidth from one thread.

    RR.1.8 (round robin, 1 thread, 8 instr)
         indistinguishable from super scalar
    RR.2.4, RR.4.2
         cache banks are single ported > bank conflict logic is needed. 
         RR.2.4 : two cache outputs, 4.2 has 4. 
         >> address mux, multiple address buses, selection logic on the output drivers, bank conflict logic, multiple hit/miss calculations. access time affected. 
    RR.2.8 
         ideal.

Thread choice : how to choose which thread to issue from :
    IQ clog : too many instr that block for a long time, IQ is filled. 
    Size of IQ limited by search 
    Algos to choose 
    BRCOUNT : favor threads with fewest unresolved branches. 
    MISSCOUNT : fewest outstanding D cache misses
    ICOUNT : threads with fewest instr in decode, rename and IQs. 
         (1) prevents one thread from filling IQ, 
         (2) highest priority to threads moving instr through IQ most. 
         (3) even mix of instr, maximizing parallelism
     IQPOSN : lowest priority to instr closest to the head of either the integer or floating point IQs (the oldest instr)

    instruction counting provides the best gain. 
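
A sketch of the ICOUNT choice (illustrative only): each cycle, fetch from the ready thread with the fewest instructions sitting in decode, rename and the IQs:

    #define NTHREADS 8

    /* in_flight[t] = instrs of thread t currently in decode, rename or the IQs */
    int icount_pick(const int in_flight[NTHREADS], const int ready[NTHREADS]) {
        int best = -1;
        for (int t = 0; t < NTHREADS; t++) {
            if (!ready[t]) continue;                 /* e.g. waiting on an I-cache miss */
            if (best < 0 || in_flight[t] < in_flight[best])
                best = t;
        }
        return best;   /* -1 if no thread can fetch this cycle */
    }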

Unblocking the fetch unit
    BIGQ : increase IQ size (without area penalty : search only first 32 items)
    ITAG : pre-search the tags to find misses early (more ports required)

wasted instr : wrong-path instr (branch mispred), optimistically issued load-dependent instr. 

bottlenecks : 
    Issue bandwidth x
    IQ size               x
     Fetch bandwidth > branch frequency and PC alignment problems
    Branch prediction y
    Speculative Execution y
    Memory Throughput > SMT hides memory latency but not memory throughput. 
    Register file size
          Fetch throughput is the bottleneck. 




Mark D. Hill, Michael R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, pp. 33-38, July 2008. IEEE Xplore link


Base core equivalents : 1 BCE
Chip area/power limitations == n BCEs can be active on a chip
If you combine r BCEs into one core, lets say we get a single-thread performance of perf(r) < r.

Amdahl's law : Speedup (f, S) = 1 / [ (1-f) + f/S ] 

Symmetric multicore chips : S(f, n, r) = 1 / ( (1-f)/perf(r) + f*r/(perf(r)*n) )
typically perf(r) = sqrt(r)
Asymmetric multicore : S(f, n, r) = 1 / ( (1-f)/perf(r) + f/(perf(r) + n - r) )

Dynamic multicore : S(f, n, r) = 1 / ( (1-f)/perf(r) + f/n )
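
A small C transcription of the three formulas (with the perf(r) = sqrt(r) assumption above), handy for reproducing the curves; the example parameters are arbitrary:

    #include <math.h>
    #include <stdio.h>

    static double perf(double r) { return sqrt(r); }   /* assumed perf(r) */

    double symmetric(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }
    double asymmetric(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }
    double dynamic(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / n);
    }

    int main(void) {
        double f = 0.975, n = 256, r = 16;   /* arbitrary example point */
        printf("sym %.1f  asym %.1f  dyn %.1f\n",
               symmetric(f, n, r), asymmetric(f, n, r), dynamic(f, n, r));
        return 0;
    }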