(2.3.5) ARM ISA

John Goodacre and Andrew N. Sloss, Parallelism and the ARM instruction set architecture. Computer, July 2005. IEEE Xplore

Typically : Performance and Efficiency methods

Variable execution time

subword parallelism

DSP (Specialization)

TLP, exception handling

multiprocessing

Variable execution time

multiple loads/stores in a single instr

prologue and epilogue of subroutines

code density

Inline barrel shifter

Conditional execution (predication)

16-bit Thumb instr set (read separately!)
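
A rough sketch of what the inline barrel shifter and conditional execution buy, written as a Python model rather than real ARM code. The helper names (`lsl`, `add_shifted`, `mov_if`) are made up for these notes; the point is only that one instruction such as `ADD r0, r1, r2, LSL #2` does a shift and an add together, and that a predicated instruction such as `MOVEQ` replaces a branch around a move.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit unsigned values


def lsl(value, amount):
    """Logical shift left, truncated to 32 bits."""
    return (value << amount) & MASK32


def add_shifted(rn, rm, shift):
    """Model of ADD rd, rn, rm, LSL #shift: shift and add in one instruction."""
    return (rn + lsl(rm, shift)) & MASK32


def mov_if(cond, rd_old, rm):
    """Model of a predicated MOV (e.g. MOVEQ): rd only changes when cond holds."""
    return (rm & MASK32) if cond else rd_old


# Address of element 3 in an array of 32-bit words starting at 0x1000.
element_addr = add_shifted(0x1000, 3, 2)
```

Without the shifter operand, the address computation would need a separate shift instruction, and without predication the conditional move would need a compare plus a branch, so both features help code density.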

Data Level Parallelism

sub-word SIMD (split the 32-bit register into 4x8-bit or 2x16-bit lanes and operate on them in parallel)
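
A minimal sketch of the lane-wise add, assuming plain wrap-around (modulo) arithmetic per lane. The function names echo the ARMv6 `UADD8` and `UADD16` instructions, but this is only a software model: it ignores the GE flags the real instructions set, and the masking trick is a generic way to stop carries crossing lane boundaries, not how the hardware does it.

```python
def _lane_add(a, b, low_mask, high_mask):
    """Add packed lanes: sum the low bits of each lane, then patch the top bit."""
    partial = (a & low_mask) + (b & low_mask)   # carries cannot leave a lane
    return (partial ^ ((a ^ b) & high_mask)) & 0xFFFFFFFF


def uadd8(a, b):
    """Four independent 8-bit adds inside one 32-bit word."""
    return _lane_add(a, b, 0x7F7F7F7F, 0x80808080)


def uadd16(a, b):
    """Two independent 16-bit adds inside one 32-bit word."""
    return _lane_add(a, b, 0x7FFF7FFF, 0x80008000)


# Lanes (high to low): 0x01+0x01, 0xFF+0x01, 0x7F+0x01, 0x80+0x80
packed_sum = uadd8(0x01FF7F80, 0x01010180)
```

Each lane wraps on its own, so the 0xFF+0x01 lane becomes 0x00 without disturbing its neighbour.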

Thread Level Parallelism

Improve exception handling

increases complexity in interrupt handler, scheduler, context switch

> Special instructions :

CPS : Change processor state

RFE : Return from exception

SRS : Save return state

Multiprocessor Atomic instructions

LDREX (load exclusive)/ STREX
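
A toy model of how an LDREX/STREX pair gives an atomic read-modify-write, assuming a much simplified exclusive monitor (one tagged address per CPU, cleared by any successful exclusive store to that address). The class and method names are invented for these notes; the only detail taken from the real instruction is that STREX reports 0 on success and 1 on failure.

```python
class ExclusiveMemory:
    """Shared memory plus a per-CPU reservation, as a stand-in for the monitor."""

    def __init__(self):
        self.mem = {}
        self.reserved = {}  # cpu id -> address tagged by that CPU's last LDREX

    def ldrex(self, cpu, addr):
        """Load and tag the address for exclusive access by this CPU."""
        self.reserved[cpu] = addr
        return self.mem.get(addr, 0)

    def strex(self, cpu, addr, value):
        """Store only if this CPU still holds the reservation; 0 = ok, 1 = failed."""
        if self.reserved.get(cpu) != addr:
            return 1
        self.mem[addr] = value
        # A successful store clears every reservation on this address.
        self.reserved = {c: a for c, a in self.reserved.items() if a != addr}
        return 0


def atomic_increment(memory, cpu, addr):
    """Retry loop: reload and try again whenever another CPU got in first."""
    while True:
        old = memory.ldrex(cpu, addr)
        if memory.strex(cpu, addr, old + 1) == 0:
            return old + 1
```

If two CPUs both LDREX the same address, only the first STREX succeeds; the loser sees a failure code and goes round the loop again, which is what makes the update atomic without locking the bus.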

>> Physically tagged cache over virtually tagged : 20% improvement in overall performance.

Instruction level parallelism

2004 : cost-performance-through-MHz wall

Why multicores ?

High MHz > costly

ILP is complex and costly (Extracting)

Programming multiple independent processors > non portable and inefficient

ARM 11 : multiprocessor

Generic interrupt controller

Snoop Control Unit

Enhanced atomic instructions

Lock-free synchronization > wake up/sleep spin locks

CPU number and context registers with privileges

Weakly ordered memory consistency

wmb() > write memory barrier

rmb() > read mem barrier

DSB() > drain store buffer

SMP performance

Cache Coherence (SCU : at CPU speed)

Inter processor communication (Software initiated interprocessor interrupt, async)

Load Balanced interrupt handling

SCU :

copy of physical address tag for fast access

migratory line > if a line moves from shared to modified on one processor and another processor then requests it, it's assumed the requester will eventually write it too, so the line is moved to it directly in the 'M' (modified) state. Cool for locks and stuff.
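
A toy two-CPU model of why the migratory-line trick helps lock-like data, assuming a bare M/S/I protocol and counting one coherence transaction per miss or upgrade. Everything here (class name, the `migratory` flag, the counting rule) is a simplification invented for these notes and not the SCU's actual implementation.

```python
class TwoCpuLine:
    """One cache line seen by CPU 0 and CPU 1, with states 'M', 'S', 'I'."""

    def __init__(self, migratory_opt):
        self.state = ['I', 'I']
        self.migratory_opt = migratory_opt
        self.migratory = False   # set once a shared -> modified upgrade is seen
        self.transactions = 0

    def read(self, cpu):
        other = 1 - cpu
        if self.state[cpu] != 'I':
            return                                # hit
        self.transactions += 1                    # read miss
        if self.state[other] == 'M' and self.migratory_opt and self.migratory:
            self.state[other] = 'I'               # hand the line over in M
            self.state[cpu] = 'M'
        else:
            if self.state[other] == 'M':
                self.state[other] = 'S'
            self.state[cpu] = 'S'

    def write(self, cpu):
        if self.state[cpu] == 'M':
            return                                # already owned, no traffic
        self.transactions += 1                    # upgrade or write miss
        if self.state[cpu] == 'S':
            self.migratory = True
        self.state[1 - cpu] = 'I'
        self.state[cpu] = 'M'


def ping_pong(rounds, migratory_opt):
    """CPUs alternate read-then-write on the line, like passing a lock around."""
    line = TwoCpuLine(migratory_opt)
    for i in range(rounds):
        line.read(i % 2)
        line.write(i % 2)
    return line.transactions
```

In this model four rounds cost 8 transactions without the optimisation and 5 with it, because after the first upgrade each hand-over delivers the line already writable and the separate upgrade step disappears.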