Disclaimer
These are notes I made from papers in the reading list. At times I might have included screenshots of figures. Please shoot me an email if you want those taken down, though you can download a free copy of almost all of these papers (scholar.google.com). Also, it's all my take on the papers; it depends on my mood, level of understanding, etc. These are not reviews, just points I thought were important.
Feel free to copy.
Uploaded : Feb, 2012
J. E. Smith and A. R. Pleszkun. Implementing Precise Interrupts in Pipelined Processors, IEEE Trans. on Computers, May 1988, pp. 562-573. ACM DL Link |
1) All instructions preceding the instruction indicated by the saved program counter have been executed and have modified the processor state correctly.
2) All instructions following the instruction indicated by the saved program counter are unexecuted and have not modified the processor state.
3) The saved program counter points to the interrupted instruction.
Classification of interrupts
1) program interrupts
2) external interrupts.
I/O and timer interrupts (for restarting), virtual memory page faults, software debugging, arithmetic exceptions, unimplemented opcodes. Virtual machines can be implemented if privileged instruction faults cause precise interrupts.
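A toy model (mine, not from the paper) of how in-order commit from a reorder buffer satisfies the three conditions above: instructions may finish out of order, but architectural state advances only at the buffer head, so at an exception everything before the faulting instruction has committed and nothing after it has.

```python
# Reorder buffer sketch: issue in program order, finish in any order,
# commit strictly in order from the head. On an exception at the head,
# squash the faulting instruction and everything younger.
from collections import deque

class ROB:
    def __init__(self):
        self.buf = deque()                     # entries in program order

    def issue(self, pc):
        self.buf.append({'pc': pc, 'done': False, 'exc': False})

    def finish(self, pc, exc=False):           # may happen out of order
        for e in self.buf:
            if e['pc'] == pc:
                e['done'], e['exc'] = True, exc

    def commit(self):
        committed = []
        while self.buf and self.buf[0]['done']:
            if self.buf[0]['exc']:
                faulting = self.buf[0]['pc']   # saved PC -> faulting instr
                self.buf.clear()               # younger instrs never commit
                return committed, faulting
            committed.append(self.buf.popleft()['pc'])
        return committed, None
```

Instruction 2 finishing first (even with an exception) still only takes effect once 0 and 1 have committed, which is exactly conditions 1-3.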
Joseph A. Fisher and B. Ramakrishna Rau. Instruction-Level Parallel Processing, Science, 13 September 1991, pp. 1233-1241. Science Link |
Trace scheduling was originally developed for Very Long Instruction Word, or VLIW, machines and is a form of global code motion. It works by converting a loop into a long straight-line code sequence using loop unrolling and static branch prediction. This process separates out "unlikely" code and adds handlers for exits from the trace. The goal is to have the most common case executed as a sequential set of instructions without branches.
SuperBlock Scheduling
T-Y. Yeh and Y. Patt, Two-level Adaptive Training Branch Prediction, Proc. 24th Annual International Symposium on Microarchitecture, November 1991, pp. 51-61. ACM DL Link |
Gurindar S. Sohi, Scott E. Breach, and T.N. Vijaykumar, Multiscalar Processors, Proc. 22nd Annual Symposium on Computer Architecture, June 1995, pp. 414-425. ACM DL Link |
Subbarao Palacharla, Norman P. Jouppi and J. E. Smith. Complexity-effective Superscalar Processors, Proc. 24th Annual International Symposium on Computer Architecture, June 1997, pp. 206-218. |
Andreas Moshovos, Scott E. Breach, T. N. Vijaykumar, Gurindar S. Sohi, Dynamic Speculation and Synchronization of Data Dependences. ISCA 1997: 181-193. ACM DL Link |
Srikanth T Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton, Continual flow pipelines. ASPLOS 2004. ACM DL Link |
Wen-Hann Wang, Jean-Loup Baer, and Henry M. Levy. Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy, ISCA 1989. ACM DL Link |
Bruce Jacob and Trevor Mudge. Virtual Memory on Contemporary Processors, IEEE Micro, vol. 18, no. 4, 1998. IEEE Xplore link |
Vinod Cuppu, Bruce Jacob, Brian Davis, and Trevor Mudge, A performance comparison of contemporary DRAM architectures, ISCA 1999. IEEE Xplore link |
Changkyu Kim, Doug Burger, Stephen W. Keckler: An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS 2002. ACM DL Link |
Viji Srinivasan, Davidson, E.S., Tyson, G.S., "A prefetch taxonomy," Computers, IEEE Transactions on , vol.53, no.2, pp. 126-140, Feb 2004. IEEE Xplore link |
Arvind, Rishiyur S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. Computers 39(3): 300-318 (1990). ACM DL Link |
loss of determinacy > significant complexity in establishing correctness
long memory and communication latencies > inherent in parallel machines
synchronization?
sequential scheduling (PC based)
Tagged-Token Dataflow Architecture
Id : high level language with fine grained parallelism implicit in operational semantics
A good parallel programming lang should
insulate programmer from no_of_procs, topology/speed of comm network
identify parallelism auto
must be determinate (if algo is then program is)
W. W. Hwu, R. E. Hank, D. M. Gallagher, S. A. Mahlke, D. M. Lavery, G. E. Haab, J. C. Gyllenhaal, and D. I. August. Compiler Technology for Future Microprocessors. Proceedings of the IEEE 83(12), December 1995. IEEE Xplore link |
Compilers improve program exec speed by
eliminating unnec instr
putting stuff used most in registers. (reduce memory accesses).
making use of hardware parallelism where required.
To parallelise a program (ILP compilation)
Dependency extraction
ILP optimizations
Core Scheduling (reorder the instructions)
Definitions (and why)
basic block : single-entry, single-exit block of instructions (control in at the top, out at the bottom)
Control flow graph : control dependence
Register data dependence : flow dependence (RAW), anti-dependence (WAR), output dependence (WAW); each can be loop-independent (within the same iteration of a loop) or loop-carried (across the loop boundary).
Memory data dependence : loop-independent memory anti-dependence : inside a loop, unnecessary waiting for fear of an overwrite.
loop-carried memory anti-dependence. memory input dependence : 2 instrs read from the same memory location (can remove redundant loads)
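A tiny example (my own, not from the paper) of the register dependence kinds above; reordering any dependent pair changes the result:

```python
# Flow (RAW), anti (WAR), and output (WAW) dependences on
# straight-line "register" code.
def dependences():
    a = 1          # S1: writes a
    b = a + 2      # S2: reads a  -> flow (RAW) dependence S1 -> S2
    a = 5          # S3: rewrites a -> anti (WAR) S2 -> S3, output (WAW) S1 -> S3
    return a, b

# Loop-carried flow dependence: iteration i reads what iteration i-1 wrote,
# so iterations cannot run independently.
def loop_carried(n):
    x = [0] * (n + 1)
    for i in range(1, n + 1):
        x[i] = x[i - 1] + 1   # produced by the previous iteration
    return x[n]
```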
Go beyond basic blocks : Scheduling
Superblock : formed from the trace by duplicating blocks to remove merge points and their associated complications from the trace.
reorder instr within the superblock : global acyclic (in the control flow graph) scheduling.
Speculative code motion : move instructions.
Techniques : Register renaming/Loop unrolling / software pipelining (cyclic scheduling).
for (i=1) to bignumber A(i); B(i); C(i) end >> A(i) A(i+1) A(i+2) B(i)... in software.
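A sketch (mine, not the paper's) of the software-pipelined schedule above, simulated serially so the steady-state overlap of A(i+2), B(i+1), C(i) is visible:

```python
# Software pipelining "in software": instead of running A(i); B(i); C(i)
# to completion each iteration, the kernel overlaps stage A of iteration
# i+2, B of i+1, and C of i. trace records issue order. Assumes n >= 2.
def pipelined(n, trace):
    # prologue: fill the pipeline
    trace.append(('A', 0)); trace.append(('A', 1)); trace.append(('B', 0))
    for i in range(n - 2):                # steady state (kernel)
        trace.append(('A', i + 2))
        trace.append(('B', i + 1))
        trace.append(('C', i))
    # epilogue: drain the pipeline
    trace.append(('B', n - 1)); trace.append(('C', n - 2)); trace.append(('C', n - 1))
```

Each iteration still sees A before B before C, but neighbouring iterations execute concurrently in the kernel.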
Predicated Execution
Transformation of conditional branches into predicates : control dependences > data dependences.
>>> Predicated exec allows the compiler to trade instr fetch efficiency for the capability to expose ILP to the hardware along multiple execution paths.
Hyperblock : connected basic blocks where control may only enter in the first block, and exit anywhere.
(tradeoff : branch performance with resource usage, hazard conditions and critical path lengths).
Branch combining replaces a group of exit branches with a corresponding group of predicate define instrs; now OR-type semantics can be used: once you hit the exit code, find out why you got there and jump to that exit destination.
Loop peeling : the compiler peels away the first several iterations of a loop and combines them with the outer loop.
Control height reduction : shortens the dependence chain length needed to compute execution conditions of instrs.
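A minimal if-conversion sketch (my code, not the paper's) of the predicated-execution idea above: the branch's control dependence becomes a data dependence on a predicate p, and both paths occupy execution slots, which is exactly the fetch-efficiency-for-ILP trade-off:

```python
# Branchy version: everything under the if is control dependent.
def with_branch(a, b, c):
    if a > 0:
        x = b + 1
    else:
        x = c - 1
    return x

# If-converted version: a predicate define instruction, both paths
# executed, and the result committed according to the predicate.
def if_converted(a, b, c):
    p = a > 0            # predicate define "instruction"
    t1 = b + 1           # body guarded by p
    t2 = c - 1           # body guarded by not-p
    x = t1 if p else t2  # only the selected result is kept
    return x
```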
Jerry Huck, Dale Morris, Jonathan Ross, Allan Knies, Hans Mulder, Rumi Zahir. Introducing the IA-64 Architecture. IEEE Micro, vol. 20, no. 5, pp. 12-23, Sep./Oct. 2000. IEEE Xplore link |
John Goodacre and Andrew N. Sloss, Parallelism and the ARM instruction set architecture. Computer, July 2005. IEEE Xplore |
Richard M. Russell. The Cray-1 Computer System, Communications of the ACM, January 1978, pp. 63-72. ACM DL Link |
Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996, pp. 28-40. IEEE Xplore link |
Timothy J. Slegel, et al. IBM's S/390 G5 Microprocessor, IEEE Micro, Mar/Apr 1999, pp. 12-23. IEEE Xplore link |
T. Mudge. Power: A first class design constraint. Computer, vol. 34, no. 4, April 2001, pp. 52-57. IEEE Xplore link |
Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor V. Zyuban, Philip N. Strenski, Philip G. Emma: Optimizing pipelines for power and performance. MICRO 2002: 333-344. IEEE Xplore link |
Dan Ernst, et al., A Low-Power Pipeline Based on Circuit-Level Timing Speculation, MICRO 2003. IEEE Xplore link |
Shubhendu S. Mukherjee, Christoper Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin, "Measuring Architectural Vulnerability Factors," Top picks of 2003 in IEEE Micro, Nov/Dec 2003. IEEE Xplore link |
D. Burger et al. Scaling to the End of Silicon with EDGE Architectures. IEEE Computer 2004. Volume: 37, Issue: 7. ACM DL Link |
what drove comp arch
1970 : Memory was expensive
1980 : no of devices that could fit in chip
now : MHz wall
4 major emerging tech characteristics
Pipeline depth limits > fine-grained concurrency mechanisms to improve performance
Power
on-chip communication dominant execution
polymorphism (multi function)
EDGE : Explicit data flow graph execution
Groups of instructions > hyperblock that can be run in parallel
Architecture + compile balance :
RISC : Arch has to look for (reconstruct) dependencies (dynamic placement, dynamic issue)
VLIW : very difficult for the compiler. (static placement, static issue)
EDGE : static placement, dynamic issue
Compiler's jobs
>> Large hyperblock formation
>> Predicated execution (just create a dependency on the first of the instructions, dependency following will auto block rest)
>> Generating legal hyperblocks (device constraints)
>> Physical placement
Parallelism
Data level P is kind of obvious
TLP : 4 threads > evenly distribute exec blocks to threads
Shekhar Y. Borkar: Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro 25(6): 10-16 (2005). IEEE Xplore link |
Gabriel H. Loh, Yuan Xie, Bryan Black, Processor Design in Three-Dimensional Die-Stacking Technologies, In IEEE Micro, vol. 27(3), pp. 31-48, May-June, 2007. IEEE Xplore link |
Taeho Kgil, David Roberts, Trevor Mudge, "Improving NAND Flash Based Disk Caches," ISCA 2008. ACM DL link |
W. Daniel Hillis and Guy L. Steele. Data Parallel Algorithms, Communications of the ACM, December 1986, pp. 1170-1183. ACM DL Link |
Connection Machine Architecture
16384/65536 processors with 4k bits of memory each.
Front end processor (von Neumann style) and array of the above proc.
Proc array extends the ISA of the front-end processor.
SIMD.
Processors :
State bit > if an instr will be executed.
Context : set of processors > context flag > statebit.
Immediate values broadcast
ILLIAC IV, Goodyear MPP, Non-Von, ICL DAP, CM > data parallel.
pointer-based communication > SEND instr : proc address + memory address. kind of indirect store.
virtual processors > microcoded controller between frontend and array : helps program abstraction from hardware details.
1+ same program for different hardware
2+ processors may be regarded as expandable => (Fortran preallocates all memory), other prog languages > procedure call to allocate a "fresh" processor (like malloc in C) => processor-cons > memory comes with a processor.
Paper goes on and on about algorithm adaptations to data parallel archs. and yes they can be done.
Coolest conclusion :
When operating on lots of data : The parallelism to be gained by concurrently operating on multiple data elements will be greater than the parallelism to be gained by concurrently executing lines of code.
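The canonical example of that conclusion is the paper's log-step sum/scan; here is a serial simulation of the data-parallel version (my sketch):

```python
# Data-parallel inclusive prefix sum (scan) in O(log n) steps:
# at step d, every (virtual) processor j >= d adds the value held
# d positions to its left, with d doubling each step. The list
# comprehension mimics lockstep SIMD semantics: all reads happen
# before all writes.
def scan(xs):
    xs = list(xs)
    d = 1
    while d < len(xs):
        xs = [xs[j] + xs[j - d] if j >= d else xs[j] for j in range(len(xs))]
        d *= 2
    return xs   # xs[j] = sum of original xs[0..j]
```

n data elements finish in ceil(log2 n) steps, versus n-1 steps for the sequential loop: parallelism over data, not over lines of code.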
John M. Mellor-Crummey, Michael L. Scott: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. Comput. Syst. 9(1): 21-65 (1991). ACM DL link |
Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, Anoop Gupta: The SPLASH-2 Programs: Characterization and Methodological Considerations. ISCA 1995: 24-36. ACM DL Link |
computational load balance
communication to computation ratio and traffic needs
important working set sizes
issues related to spatial locality
how these scale with problem size
number of processors
benchmark intro paper :
characterize the programs : basic properties, architectural interactions
user help : what real params to choose
Leonardo Dagum and Ramesh Menon, OpenMP: An Industry Standard API for Shared Memory Programming. IEEE Computational Science and Engineering, Jan-Mar 1998. IEEE Xplore link |
A.R. Alameldeen and D.A. Wood, "Variability in Architectural Simulations of Multi-threaded Workloads," HPCA 2003. IEEE Xplore |
L. Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs,” IEEE Transactions on Computers, vol. C-28, no. 9, pp. 690-691, Sept. 1979. IEEE Xplore link |
Anoop Gupta, Wolf-Dietrich Weber: Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Trans. Computers 41(7): 794-810 (1992). IEEE Xplore link |
Sarita V. Adve and Kourosh Gharachorloo. Shared Memory Consistency Models: A Tutorial, IEEE Computer, December 1996, pp. 66-76. IEEE Xplore link |
(1) All cores insert their loads and stores into the order <m respecting their program order,
regardless of whether they are to the same or different addresses (i.e., a=b or a≠b). There are four
cases:
If L(a) <p L(b) ⇒ L(a) <m L(b) /* Load→Load */
If L(a) <p S(b) ⇒ L(a) <m S(b) /* Load→Store */
If S(a) <p S(b) ⇒ S(a) <m S(b) /* Store→Store */
If S(a) <p L(b) ⇒ S(a) <m L(b) /* Store→Load */
(2) Every load gets its value from the last store before it (in global memory order) to the same
address:
Value of L(a) = Value of MAX <m {S(a) | S(a) <m L(a)}, where MAX <m denotes “latest in memory order <m”.
Coherence makes caches invisible. Consistency can make shared memory appear like a single memory module.
TSO (SC with two changes): (1) All cores insert their loads and stores into the memory order <m respecting their program
order, regardless of whether they are to the same or different addresses (i.e., a==b or a!=b). There
are four cases:
If L(a) <p L(b) ⇒ L(a) <m L(b) /* Load → Load */
If L(a) <p S(b) ⇒ L(a) <m S(b) /* Load → Store */
If S(a) <p S(b) ⇒ S(a) <m S(b) /* Store → Store */
If S(a) <p L(b), then S(a) <m L(b) is no longer required /* Change 1: Enable FIFO Write Buffer */
(2) Every load gets its value from the last store before it to the same address:
Value of L(a) = Value of MAX <m {S(a) | S(a) <m L(a)} /* Change 2: Need Bypassing */
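These rules can be checked mechanically. This sketch (mine) enumerates every SC-legal interleaving of the store-buffering litmus test and shows (r1, r2) = (0, 0) never occurs; dropping the Store→Load rule (Change 1) is precisely what makes that outcome reachable on TSO:

```python
# Store-buffering litmus test under SC:
#   T1: S(x)=1; r1=L(y)      T2: S(y)=1; r2=L(x)
# Rule (1): each interleaving respects per-thread program order <p.
# Rule (2): each load returns the value of the latest store in <m.
from itertools import permutations

def sc_outcomes():
    ops = {'a': ('S', 'x'), 'b': ('L', 'y'),   # thread 1: a <p b
           'c': ('S', 'y'), 'd': ('L', 'x')}   # thread 2: c <p d
    outcomes = set()
    for order in permutations('abcd'):
        if order.index('a') > order.index('b'):  # enforce a <m b
            continue
        if order.index('c') > order.index('d'):  # enforce c <m d
            continue
        mem = {'x': 0, 'y': 0}
        regs = {}
        for label in order:
            kind, addr = ops[label]
            if kind == 'S':
                mem[addr] = 1
            else:
                regs[label] = mem[addr]          # latest store in <m
        outcomes.add((regs['b'], regs['d']))
    return outcomes
```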
Michael Zhang, Krste Asanovic: Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. ISCA 2005: 336-345. ACM DL Link |
Milo M. K. Martin, Mark D. Hill, and David A. Wood, "Token Coherence: Decoupling Performance and Correctness," International Symposium on Computer Architecture (ISCA), June 2003. IEEE Xplore Link |
Steven L. Scott, Synchronization and Communication in the T3E Multiprocessor, Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996, pp. 26-36. ACM DL Link |
James Laudon, Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. ISCA 1997: 241-251. ACM DL Link |
Erik Hagersten and Michael Koster, WildFire: A Scalable Path for SMPs, Proc. 5th IEEE Symposium on High-Performance Computer Architecture, January 1999, 172-181. IEEE Xplore link |
Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing, Proc. International Symposium on Computer Architecture, June 2000, pp. 282-293. ACM DL link |
Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, pp. 21-29, Mar./Apr. 2005. IEEE Xplore |
Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA TESLA: A Unified Graphics and Computing Architecture", IEEE Micro Volume 28, Issue 2, Date: March-April 2008, Pages: 39-55. IEEE Xplore |
Charles E. Leiserson, et al., The Network Architecture of the Connection Machine CM-5, Proc. ACM Symposium on Parallel Algorithms and Architectures, June 1992, pp. 272-295. ACM DL link
Dally and Towles, "Route packets, not wires: on-chip interconnection networks" , DAC 2001. IEEE Xplore link, http://cva.stanford.edu/publications/2001/onchip_dac01.pdf |
Use something like PCI on chip: all modules talk to each other through a standardized interface, and the network routes packets to destinations.
Suggested network
input port : 256 bit data + control field.
Type (Flow control digit or FLIT) : 2 bits : new packet (head), tail, middle or idle.
Size (4 bits) log of data size.
Virtual Channel (8 bits) bit mask.
Route (16 bits) : encoded as 2 bits for each hop till the destination. (NEWS).
Ready : a signal from the network back to the client indicating network is ready.
Output ports
Registers : e.g., time-slot reservation for certain classes of traffic.
Higher-level protocols can be piggybacked on this.
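A sketch of packing the suggested header fields (field widths are from the paper; the bit layout and field order are my assumption, not specified there):

```python
# Field widths from the suggested network: 2-bit flit type, 4-bit
# log2(size), 8-bit virtual-channel bitmask, 16-bit route encoded
# as 2 bits per hop (NEWS), up to 8 hops.
TYPE_BITS, SIZE_BITS, VC_BITS, ROUTE_BITS = 2, 4, 8, 16
HEAD, MIDDLE, TAIL, IDLE = 0, 1, 2, 3      # type encoding: my assumption
NEWS = {'N': 0, 'E': 1, 'W': 2, 'S': 3}    # hop encoding: my assumption

def pack_flit(ftype, log2_size, vc_mask, hops):
    assert len(hops) <= ROUTE_BITS // 2, "route field holds at most 8 hops"
    route = 0
    for i, h in enumerate(hops):           # first hop in the low-order bits
        route |= NEWS[h] << (2 * i)
    word = ftype
    word = (word << SIZE_BITS) | log2_size
    word = (word << VC_BITS) | vc_mask
    word = (word << ROUTE_BITS) | route
    return word                            # 30-bit header word
```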
Router : 5 input controllers (NEWS + self), 5 output controllers.
Fault tolerance : can find out which routers are crappy and adapt accordingly.
Pre-scheduled and dynamic traffic. (using virtual channels)
Differences between intra and interchip networks :
1) lots of wiring resources available => wide flits possible. Folded torus topology is employed (2x wire demand, 2x bandwidth demand).
2) reduced buffer storage requirements.
3) Specialized transceivers possible.
Area overhead exists (for routers) but
1) predictable electrical parameters => tweakable
2) reuse with universal interface
3) reuse to network
4) network => duty factor of wires increased.
Structural, performance, and modularity advantages if you replace top-level wiring with on-chip interconnection networks.
Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, David Webb, "The Alpha 21364 Network Architecture," IEEE Micro, vol. 22, no. 1, pp. 26-35, Jan./Feb. 2002. IEEE Xplore link |
John Kim, James Balfour, and William J Dally. Flattened butterfly topology for on-chip networks. In MICRO 40: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007. ACM DL Link |
J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. A Scalable Approach to Thread-Level Speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000. ACM DL link |
TLS : auto parallelize code
Epochs : time stamped with a epoch number (ordering)
Track data dependency between epochs
homefree token : when guaranteed no violations, this epoch can commit.
Objectives
large-scale parallel machines (single-chip MPs or SMT) should seamlessly perform TLS across the entire machine > communication is the difficulty
no recompilation
Detect data dependence violation at run time > leverage invalidation based cache coherence.
if an invalidation arrives from a logically-earlier epoch for a line that we have speculatively loaded > bonkers.
Speculation level > Cache at which speculation occurs
whats required
(i) notion of whether a cache line has been speculatively loaded and/or modified
(ii) guarantee that a speculative cache line will not be propagated to regular memory
spec fails if the cache line is replaced!
(iii) ordering of all spec memory references (epoch numbers and homefree token)
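A toy model (my simplification, not the paper's hardware) of requirement (i) plus violation detection through invalidations:

```python
# Each epoch's cache keeps per-line SpL (speculatively loaded) and
# SpM (speculatively modified) bits. An invalidation arriving from a
# logically-earlier epoch for a line we speculatively loaded means we
# read too early (a RAW dependence was violated) -> squash.
class EpochCache:
    def __init__(self, epoch):
        self.epoch = epoch          # epoch number gives the logical order
        self.spl, self.spm = set(), set()
        self.violated = False

    def spec_load(self, line):
        self.spl.add(line)

    def spec_store(self, line):
        self.spm.add(line)

    def on_invalidation(self, line, sender_epoch):
        # only logically-earlier epochs can violate our speculative loads
        if sender_epoch < self.epoch and line in self.spl:
            self.violated = True
```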
Hardware
Cache states
Apart from Dirty, Shared, Exclusive and Invalid: speculatively loaded and speculatively modified.
no kicking out till that epoch becomes homefree. (if eviction is a must, speculation fails)
Messages
read-exclusive-speculative, invalidation speculative, upgrade request speculative
+ epoch number of the requester
> only hints, no real need to oblige.
when speculation succeeds
instead of scanning all lines and changing spec states to normal
Ownership required buffer (ORB) > tracks lines that become both speculatively modified and shared.
when home free token arrives, generate upgrade request for each entry in ORB.
Optimizations
Forwarding data between epochs : have wait-signal synchronization
Dirty and spec loaded state : anyway speculated dirty is never evicted, so if you load and then modify, just store it as DSpL state
suspend the epochs that have violations (resume when you get the homefree token)
Support for multiple writers : combine results (ah!) fine grained SM bits (bytes/words in cache line)
support in SMT machines
(i) two epochs may not modify the same line
(ii) t- epoch cannot see t+ modification
Conclusions
8-75% speedups!
ORB overhead low (used to commit speculative modifications faster)
Ravi Rajwar, James R. Goodman: Speculative lock elision: enabling highly concurrent multithreaded execution. MICRO 2001: 294-305. ACM DL link |
Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill and David A. Wood. LogTM: Log-based Transactional Memory. HPCA 2006. IEEE Xplore link |
Conflict detection
? overlap between the write set of transaction 1 and the write/read sets of transactions 2, 3, 4, ...
>> Eager : detect offending loads/stores immediately
>> Lazy : when transactions commit.
Lazy, Lazy : OCC DBMSs, Stanford TCC
Lazy conflict, Eager Version : None (zombies)
Eager, Lazy : MIT LTM, Intel/Brown VTM (conflict using cache conflicts)
Eager Eager : MIT UTM, CCC (conservative concurrency control) DBMSs, LogTM
Log TM
Version management
>> per-thread before image log in cacheable virtual memory
>> Last in first out undo.
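A sketch (mine) of that eager version management: stores update memory in place after saving a before-image to a per-thread log, commit just discards the log, and abort restores it last-in first-out:

```python
# LogTM-style eager versioning with a per-thread undo log.
class LogTMThread:
    def __init__(self, memory):
        self.memory = memory           # shared memory (a dict here)
        self.log = []                  # per-thread before-image log

    def tx_store(self, addr, value):
        self.log.append((addr, self.memory[addr]))  # log old value first
        self.memory[addr] = value                   # then update in place

    def commit(self):
        self.log.clear()               # commit is cheap: throw the log away

    def abort(self):
        while self.log:                # LIFO undo restores the before-images
            addr, old = self.log.pop()
            self.memory[addr] = old
```

The design bet is visible: commits (the common case) are nearly free, while aborts pay to walk the log.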
Conflict detection
>> coherence request to directory
>> Directory responds and possibly forwards the request to one or more procs
>> each responding processor examines some local state to detect a conflict
>> the responding processors ack (no conflict) or nack (conflict) the req
>> the requesting processor resolves any conflict
?? What if cache block is evicted?
>>> directory still forwards to all "appropriate processors"
>>> Go from M to sticky M [with the overflow bit set, a request hitting sticky M flags a possible conflict]
>>> Lazy cleaning of sticky state (when overflow is not set and it's sticky => it was sticky some time ago, not now)
Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, Andreas Moshovos: Mechanisms for store-wait-free multiprocessors. ISCA 2007: 266-277. ACM DL link |
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Proc. 23rd Annual International Symposium on Computer Architecture, May 1996, pp. 191-202. ACM DL link |
Mark D. Hill, Michael R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, pp. 33-38, July 2008. IEEE Xplore link |