Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proc. 23rd Annual International Symposium on Computer Architecture (ISCA), May 1996, pp. 191-202.
SMT
permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units.
combines the multiple-instr-issue feature of superscalars with the latency-hiding ability of multithreaded archs
conventional multithreading : fast context switching to share processor exec resources. (SMT : dynamic, per-cycle sharing)
impediments to processor utilization : long latencies and limited per-thread parallelism.
This paper :
throughput gains of SMT are possible without extensive changes to a wide-issue superscalar proc.
SMT need not compromise on single-thread performance.
analyze and relieve bottlenecks
advantage : choose best instr from all threads to fetch and issue in each cycle.
Changes needed to support SMT in hardware
multiple PCs, logic to choose which PC(s) to fetch from each cycle
separate return stacks
per-thread instr retirement, instr queue flush, trap mechanisms
thread id
large register file : logical registers for all threads + additional registers for reg renaming
the number of renaming registers determines how many instr can be in the processor between the rename stage and the commit stage.
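As a back-of-the-envelope sketch of the sizing, using the paper's simulated configuration (8 hardware contexts, 32 architectural registers per thread, 100 renaming registers — treat these numbers as illustrative assumptions):

```python
# Illustrative SMT register-file sizing (assumed configuration: 8 threads,
# 32 architectural registers each, 100 extra physical registers for renaming).
THREADS = 8
ARCH_REGS_PER_THREAD = 32
RENAME_REGS = 100

# Total physical registers the file must hold.
total_physical_regs = THREADS * ARCH_REGS_PER_THREAD + RENAME_REGS  # = 356

# Each in-flight register-writing instr holds one renaming register from
# rename until commit, so RENAME_REGS bounds the in-flight window.
max_inflight_writers = RENAME_REGS
```

This is why the renaming-register count, not the IQ alone, caps how many instr can sit between rename and commit.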
Fetch unit partitioning
the frequency of branches and the misalignment of branch destinations make it difficult to fill the entire fetch bandwidth from one thread.
RR.1.8 (round robin, 1 thread, 8 instr)
indistinguishable from a superscalar (up to 8 instr from a single thread each cycle)
RR.2.4, RR.4.2
cache banks are single-ported, so bank-conflict logic is needed.
RR.2.4 : two cache output buses; RR.4.2 has four.
costs : address mux, multiple address buses, selection logic on the output drivers, bank-conflict logic, multiple hit/miss calculations. access time is affected.
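A minimal model of why bank-conflict logic is needed with single-ported banks (bank count and line size below are assumptions for illustration, not the paper's exact parameters):

```python
def bank_conflicts(addresses, n_banks=8, line_bytes=32):
    """Return True if any two simultaneous cache accesses map to the same
    single-ported bank. Illustrative sketch: banks are interleaved by
    cache-line address; n_banks and line_bytes are assumed parameters."""
    banks = [(addr // line_bytes) % n_banks for addr in addresses]
    # A single-ported bank can serve one access per cycle, so any
    # duplicate bank number means the accesses must be serialized.
    return len(banks) != len(set(banks))
```

E.g. two fetch addresses 256 bytes apart land in the same bank under these parameters and conflict, while addresses one line apart do not.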
RR.2.8
best of these schemes : two threads can together supply up to 8 instr.
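The RR.n.w family above can be sketched as a per-cycle fetch loop: visit up to n threads in round-robin order, taking instr until w total slots are filled. This is an illustrative software model, not the hardware design; the function name and interface are assumptions.

```python
from collections import deque

def rr_fetch(threads, start, n_threads, total_width):
    """One cycle of an RR.<n>.<w>-style fetch (sketch).

    threads: dict mapping thread id -> deque of pending instr.
    Visits up to n_threads threads round-robin starting at `start`,
    filling at most total_width fetch slots in total. When the first
    thread runs short (e.g. a taken branch ends its fetch block),
    the next thread fills the remaining slots.
    """
    fetched = []
    ids = list(threads)
    i, visited = start, 0
    while visited < n_threads and len(fetched) < total_width:
        tid = ids[i % len(ids)]
        q = threads[tid]
        while q and len(fetched) < total_width:
            fetched.append((tid, q.popleft()))
        visited += 1
        i += 1
    return fetched
```

With RR.2.8 (n_threads=2, total_width=8), a thread that can only supply 3 instr this cycle lets the second thread contribute the other 5 — the "exploiting choice" advantage over RR.1.8.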
Thread choice : how to choose which thread to issue from :
IQ clog : one thread's long-blocking instr fill the IQ, stalling all threads.
size of IQ is limited by the issue logic's search time
Algos to choose
BRCOUNT : favor threads with fewest unresolved branches.
MISSCOUNT : fewest outstanding D cache misses
ICOUNT : favor threads with fewest instr in the decode, rename, and IQ stages.
(1) prevents any one thread from filling the IQ,
(2) gives highest priority to threads moving instr through the IQ most efficiently,
(3) provides an even mix of instr from all threads, maximizing parallelism
IQPOSN : lowest priority to threads with instr closest to the head of either the integer or floating-point IQ (the oldest instr)
instruction counting (ICOUNT) provides the best gain.
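The ICOUNT policy reduces to a sort over per-thread counters; a minimal sketch (the interface is assumed — in hardware these counts come from per-thread counters incremented at fetch and decremented at issue):

```python
def icount_priority(pre_issue_counts):
    """ICOUNT fetch-priority sketch: rank threads by how few instr they
    have in the decode, rename, and instruction-queue stages.

    pre_issue_counts: dict mapping thread id -> number of pre-issue instr.
    Returns thread ids from highest fetch priority (fewest instr) to lowest.
    """
    return sorted(pre_issue_counts, key=lambda tid: pre_issue_counts[tid])
```

A thread stalled on long-latency loads accumulates pre-issue instr and sinks in priority, which is exactly how ICOUNT prevents IQ clog.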
Unblocking the fetch unit
BIGQ : increase IQ size without a search-time penalty : issue logic searches only the first 32 entries
ITAG : do the cache tag lookup a cycle early so misses are known before fetch (more tag ports required)
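The BIGQ idea can be sketched as a bounded scan: the queue may be large, but the select logic only examines a fixed window, so its critical path does not grow with queue size (illustrative model; entry format and function name are assumptions):

```python
def select_issue(iq, issue_width, search_limit=32):
    """BIGQ-style issue selection sketch: scan only the first
    `search_limit` IQ entries (oldest first) for ready instr, issuing
    up to `issue_width` of them. Entries beyond the window wait, but
    the larger queue still buffers fetched instr.
    """
    chosen = []
    for instr in iq[:search_limit]:
        if instr["ready"]:
            chosen.append(instr)
            if len(chosen) == issue_width:
                break
    return chosen
```

A ready instr sitting past entry 32 is invisible to this cycle's select, which is the trade-off BIGQ accepts in exchange for a bigger buffer.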
wasted instr : wrong-path instr (branch mispredictions), optimistically issued load-dependent instr
bottlenecks :
Issue bandwidth x (not a limiting factor)
IQ size x (enlarging it gains little)
Fetch bandwidth : branch frequency and PC alignment problems
Branch prediction y
Speculative Execution y
Memory throughput : SMT hides memory latency but not limited memory throughput
Register file size
Fetch throughput is the bottleneck.