Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proc. 23rd Annual International Symposium on Computer Architecture (ISCA), May 1996, pp. 191-202.
SMT
permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units.
combines the multiple-instr-issue feature of superscalars with the latency-hiding ability of multithreaded archs
conventional multithreading : fast context switching to share processor exec resources. (SMT : dynamic, per-cycle sharing)
impediments to processor utilization : long latencies and limited per-thread parallelism.
This paper :
throughput gains of SMT are possible without extensive changes to a wide-issue superscalar proc.
SMT need not compromise on single-thread performance.
analyze and relieve bottlenecks
advantage : choose best instr from all threads to fetch and issue in each cycle.
Changes needed to support SMT in hardware
multiple PCs, logic to choose which PC(s) to fetch from each cycle
separate return stacks
per-thread instr retirement, instr queue flush, trap mechanisms
thread id
large register file : logical registers for all threads + additional registers for reg renaming
the number of renaming registers determines how many instr can be in the processor between the rename stage and the commit stage.
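As a back-of-the-envelope sketch of the sizing, using the paper's simulated configuration (8 hardware contexts, 32 architectural registers per thread, 100 renaming registers — treat these numbers as illustrative assumptions):

```python
# Illustrative SMT register-file sizing (assumed configuration: 8 threads,
# 32 architectural registers each, 100 extra physical registers for renaming).
THREADS = 8
ARCH_REGS_PER_THREAD = 32
RENAME_REGS = 100

# Total physical registers the file must hold.
total_physical_regs = THREADS * ARCH_REGS_PER_THREAD + RENAME_REGS  # = 356

# Each in-flight register-writing instr holds one renaming register from
# rename until commit, so RENAME_REGS bounds the in-flight window.
max_inflight_writers = RENAME_REGS
```

This is why the renaming-register count, not the IQ alone, caps how many instr can sit between rename and commit.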
Fetch unit partitioning
the frequency of branches and the misalignment of branch destinations make it difficult to fill the entire fetch bandwidth from one thread.
RR.1.8 (round robin, 1 thread, 8 instr)
indistinguishable from a superscalar (up to 8 instr from a single thread each cycle)
RR.2.4, RR.4.2
cache banks are single-ported, so bank-conflict logic is needed.
RR.2.4 : two cache output buses; RR.4.2 has four.
costs : address mux, multiple address buses, selection logic on the output drivers, bank-conflict logic, multiple hit/miss calculations. access time is affected.
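A minimal model of why bank-conflict logic is needed with single-ported banks (bank count and line size below are assumptions for illustration, not the paper's exact parameters):

```python
def bank_conflicts(addresses, n_banks=8, line_bytes=32):
    """Return True if any two simultaneous cache accesses map to the same
    single-ported bank. Illustrative sketch: banks are interleaved by
    cache-line address; n_banks and line_bytes are assumed parameters."""
    banks = [(addr // line_bytes) % n_banks for addr in addresses]
    # A single-ported bank can serve one access per cycle, so any
    # duplicate bank number means the accesses must be serialized.
    return len(banks) != len(set(banks))
```

E.g. two fetch addresses 256 bytes apart land in the same bank under these parameters and conflict, while addresses one line apart do not.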
RR.2.8
best of these schemes : two threads can together supply up to 8 instr.
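The RR.n.w family above can be sketched as a per-cycle fetch loop: visit up to n threads in round-robin order, taking instr until w total slots are filled. This is an illustrative software model, not the hardware design; the function name and interface are assumptions.

```python
from collections import deque

def rr_fetch(threads, start, n_threads, total_width):
    """One cycle of an RR.<n>.<w>-style fetch (sketch).

    threads: dict mapping thread id -> deque of pending instr.
    Visits up to n_threads threads round-robin starting at `start`,
    filling at most total_width fetch slots in total. When the first
    thread runs short (e.g. a taken branch ends its fetch block),
    the next thread fills the remaining slots.
    """
    fetched = []
    ids = list(threads)
    i, visited = start, 0
    while visited < n_threads and len(fetched) < total_width:
        tid = ids[i % len(ids)]
        q = threads[tid]
        while q and len(fetched) < total_width:
            fetched.append((tid, q.popleft()))
        visited += 1
        i += 1
    return fetched
```

With RR.2.8 (n_threads=2, total_width=8), a thread that can only supply 3 instr this cycle lets the second thread contribute the other 5 — the "exploiting choice" advantage over RR.1.8.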
Thread choice : how to choose which thread to issue from :
IQ clog : one thread's long-blocking instr fill the IQ, stalling all threads.
size of IQ is limited by the issue logic's search time
Algos to choose
BRCOUNT : favor threads with fewest unresolved branches.
MISSCOUNT : fewest outstanding D cache misses
ICOUNT : favor threads with fewest instr in the decode, rename, and IQ stages.
(1) prevents any one thread from filling the IQ,
(2) gives highest priority to threads moving instr through the IQ most efficiently,
(3) provides an even mix of instr from all threads, maximizing parallelism
IQPOSN : lowest priority to threads with instr closest to the head of either the integer or floating-point IQ (the oldest instr)
instruction counting (ICOUNT) provides the best gain.
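The ICOUNT policy reduces to a sort over per-thread counters; a minimal sketch (the interface is assumed — in hardware these counts come from per-thread counters incremented at fetch and decremented at issue):

```python
def icount_priority(pre_issue_counts):
    """ICOUNT fetch-priority sketch: rank threads by how few instr they
    have in the decode, rename, and instruction-queue stages.

    pre_issue_counts: dict mapping thread id -> number of pre-issue instr.
    Returns thread ids from highest fetch priority (fewest instr) to lowest.
    """
    return sorted(pre_issue_counts, key=lambda tid: pre_issue_counts[tid])
```

A thread stalled on long-latency loads accumulates pre-issue instr and sinks in priority, which is exactly how ICOUNT prevents IQ clog.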
Unblocking the fetch unit
BIGQ : increase IQ size without a search-time penalty : issue logic searches only the first 32 entries
ITAG : do the cache tag lookup a cycle early so misses are known before fetch (more tag ports required)
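The BIGQ idea can be sketched as a bounded scan: the queue may be large, but the select logic only examines a fixed window, so its critical path does not grow with queue size (illustrative model; entry format and function name are assumptions):

```python
def select_issue(iq, issue_width, search_limit=32):
    """BIGQ-style issue selection sketch: scan only the first
    `search_limit` IQ entries (oldest first) for ready instr, issuing
    up to `issue_width` of them. Entries beyond the window wait, but
    the larger queue still buffers fetched instr.
    """
    chosen = []
    for instr in iq[:search_limit]:
        if instr["ready"]:
            chosen.append(instr)
            if len(chosen) == issue_width:
                break
    return chosen
```

A ready instr sitting past entry 32 is invisible to this cycle's select, which is the trade-off BIGQ accepts in exchange for a bigger buffer.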
wasted instr : wrong-path instr (branch mispredictions), optimistically issued load-dependent instr
bottlenecks :
Issue bandwidth x (not a limiting factor)
IQ size x (enlarging it gains little)
Fetch bandwidth : branch frequency and PC alignment problems
Branch prediction y
Speculative Execution y
Memory throughput : SMT hides memory latency but not limited memory throughput
Register file size
Fetch throughput is the bottleneck.