John Goodacre and Andrew N. Sloss, Parallelism and the ARM instruction set architecture. Computer, July 2005.
Typically : Performance and Efficiency methods
Variable execution time
subword parallelism
DSP (Specialization)
TLP, exception handling
multiprocessing
Variable execution time
multiple loads/stores on single instr
prologue and epilogue of subroutines
code density
Inline barrel shifter
Conditional execution (predication)
16-bit Thumb instruction set (read separately!)
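A quick sketch to make the barrel shifter and conditional execution notes concrete. The C is portable; the ARM assembly in the comments is only what a compiler could plausibly emit in ARM state, not verified output. The function names `scaled_offset` and `gcd` are my own, and `gcd` assumes both inputs are positive.

```c
#include <stdint.h>

/* Inline barrel shifter: the shift rides along with the ALU operation,
 * so this can plausibly become a single instruction such as
 *     ADD r0, r0, r1, LSL #2
 * (assumed compiler output, not checked). */
static uint32_t scaled_offset(uint32_t base, uint32_t index) {
    return base + (index << 2);
}

/* Conditional execution: the classic subtraction-based GCD loop.
 * With predicated instructions the if/else needs no branches:
 *     loop: CMP   r0, r1
 *           SUBGT r0, r0, r1   ; runs only if r0 > r1
 *           SUBLT r1, r1, r0   ; runs only if r0 < r1
 *           BNE   loop
 * Assumes a > 0 and b > 0, otherwise the loop does not terminate. */
static int gcd(int a, int b) {
    while (a != b) {
        if (a > b) {
            a -= b;
        } else {
            b -= a;
        }
    }
    return a;
}
```

Only the loop-closing branch remains, which is where the code density and pipeline benefit comes from.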
Data Level Parallelism
subword SIMD (split a 32-bit register into four 8-bit or two 16-bit lanes and operate on all lanes in parallel)
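To see what "lanes in one register" means, here is a portable C emulation of lane-wise addition with wraparound per lane. It is a software sketch of the idea, not the ARM instructions themselves; the names `add_4x8` and `add_2x16` are hypothetical.

```c
#include <stdint.h>

/* Add four independent 8-bit lanes packed in a 32-bit word.
 * The top bit of each lane is masked off so the partial sums cannot
 * carry into the neighbouring lane, then restored with XOR. */
static uint32_t add_4x8(uint32_t a, uint32_t b) {
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low7 ^ ((a ^ b) & 0x80808080u);
}

/* Same trick for two independent 16-bit lanes. */
static uint32_t add_2x16(uint32_t a, uint32_t b) {
    uint32_t low15 = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    return low15 ^ ((a ^ b) & 0x80008000u);
}
```

For example, `add_4x8(0x01FF7F10, 0x01010101)` gives `0x02008011`: the `0xFF` lane wraps to `0x00` without disturbing the lane above it. The hardware does this in one instruction instead of the mask, add, XOR sequence.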
Thread Level Parallelism
? Improve exception handling
increases complexity in interrupt handler, scheduler, context switch
> Special instructions :
CPS : Change processor state
RFE : Return from exception
SRS : Save return state
Multiprocessor Atomic instructions
LDREX (load exclusive)/ STREX
>> Physically tagged cache over virtually tagged : 20% improvement in overall performance.
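For the LDREX/STREX pair noted above, a portable way to get the same load, modify, conditionally store pattern is a C11 compare-exchange loop. This is a sketch under the assumption that the compiler targets a core with exclusive accesses, where such a loop is typically lowered to LDREX/STREX; `atomic_increment` is my own name.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically increment *counter and return the value it held before.
 * The "weak" compare-exchange may fail spuriously, much like a STREX
 * that fails because the exclusive reservation was lost, so it sits
 * in a retry loop. */
static uint32_t atomic_increment(_Atomic uint32_t *counter) {
    uint32_t old = atomic_load_explicit(counter, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               counter, &old, old + 1,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* On failure `old` now holds the current value; just retry. */
    }
    return old;
}
```

No lock is taken: a processor that loses the race simply retries with the fresh value.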
Instruction level parallelism
2004 : cost-performance-through-MHz wall
Why multicores ?
High MHz > costly
ILP is complex and costly (Extracting)
Programming multiple independent processors > non portable and inefficient
ARM 11 : multiprocessor
Generic interrupt controller
Snoop Control Unit
Enhanced atomic instructions
Lock-free synchronization > wake up/sleep spin locks
CPU number and context registers with privileges
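A minimal spin lock sketch for the synchronization item above, written with C11 `atomic_flag` so it stays portable. The type and function names (`demo_spinlock`, `demo_lock`, `demo_trylock`, `demo_unlock`) are hypothetical. The sleep/wake-up part is only indicated in a comment, on the assumption that the waiting core can idle until another core signals an event.

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag held;
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once; returns 1 if the lock was acquired, 0 if already held. */
static int demo_trylock(demo_spinlock *l) {
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until the lock is acquired. */
static void demo_lock(demo_spinlock *l) {
    while (!demo_trylock(l)) {
        /* Busy-wait. On hardware with a wait-for-event facility the core
         * could sleep here until the holder signals (assumption). */
    }
}

/* Release the lock; a wake-up event for sleeping waiters would go here. */
static void demo_unlock(demo_spinlock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Acquire on lock and release on unlock keep the critical section's memory accesses from leaking outside it on a weakly ordered machine.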
Weakly ordered memory consistency
wmb() > write memory barrier
rmb() > read mem barrier
DSB() > drain store buffer
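Why the barriers matter under weak ordering: without them, a flag can become visible before the data it guards. The sketch below uses C11 fences as rough stand-ins for wmb() and rmb(); the mapping is approximate (a release fence orders more than stores alone), and `publish`, `try_consume`, `payload`, and `ready` are names I made up.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain data being handed over */
static atomic_bool ready;  /* flag telling the reader the data is valid */

/* Writer: store the data, then a write-side barrier, then set the flag. */
static void publish(int value) {
    payload = value;
    atomic_thread_fence(memory_order_release);  /* plays the role of wmb() */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: check the flag, then a read-side barrier, then read the data.
 * Returns false if nothing has been published yet. */
static bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_relaxed)) {
        return false;
    }
    atomic_thread_fence(memory_order_acquire);  /* plays the role of rmb() */
    *out = payload;
    return true;
}
```

If either fence is dropped, a weakly ordered machine is free to let the reader see `ready` set while still observing a stale `payload`.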
SMP performance
Cache Coherence (SCU : at CPU speed)
Inter processor communication (Software initiated interprocessor interrupt, async)
Load Balanced interrupt handling
SCU :
copy of physical address tag for fast access
migratory line > if one processor moves a line from shared to modified and another processor then requests it, it's assumed the requester will eventually write it too, so the line is handed over directly in the modified (M) state. Good for locks and similar data that bounce between CPUs.