(3.6.1) Instruction Fetch and Issue on an SMT

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proc. 23rd Annual International Symposium on Computer Architecture, May 1996, pp. 191-202.



SMT
     permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units. 
     combines the multiple-instr issue feature of superscalars with the latency-hiding ability of multithreaded archs
    
conventional multithreaded : fast context switching to share processor exec resources. (SMT : all threads share resources dynamically, every cycle)

impediments to processor utilization : long latencies and limited per-thread parallelism. 

This paper : 
    throughput gains of SMT possible without extensive changes to a wide-issue superscalar proc.
    SMT need not compromise on single-thread performance. 
    analyze and relieve bottlenecks
    advantage : choose best instr from all threads to fetch and issue in each cycle. 

Changes needed to support SMT in hardware 
    multiple PCs, choose logic
    separate return stacks
    per-thread instr retirement, instr queue flush, trap mechanisms
    thread id
    large register file : logical registers for all threads + additional registers for reg renaming
         this determines the number of instr that can be in the processor between the rename stage and the commit stage. 
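The sizing arithmetic above can be sketched in a few lines; the constants follow the paper's 8-thread configuration (32 architectural registers per thread, 100 shared renaming registers), used here as illustrative assumptions:

```python
# Register-file sizing sketch; constants are illustrative assumptions
# mirroring the paper's 8-thread configuration.
THREADS = 8
LOGICAL_REGS = 32       # architectural registers per thread
RENAME_REGS = 100       # shared physical registers for renaming

total_physical_regs = THREADS * LOGICAL_REGS + RENAME_REGS

# The renaming registers bound how many register-writing instructions
# can be in flight between the rename and commit stages.
max_inflight_writers = RENAME_REGS

print(total_physical_regs)  # → 356
```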

Fetch unit partitioning
    frequency of branches, misalignment of branch destinations make it difficult to fill entire fetch bandwidth from one thread.

    RR.1.8 (round robin, 1 thread, 8 instr)
         indistinguishable from a conventional superscalar
    RR.2.4, RR.4.2
         cache banks are single-ported > bank-conflict logic is needed. 
         RR.2.4 : two cache outputs, RR.4.2 has 4. 
         >> address mux, multiple address buses, selection logic on the output drivers, bank conflict logic, multiple hit/miss calculations. access time affected. 
    RR.2.8 
         best performer : two threads can together fill the full 8-instr fetch bandwidth.
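One fetch cycle of an RR.n.w scheme can be sketched as below; the queue-of-instructions interface is an assumption for illustration (a real fetch unit pulls from cache lines, not software queues):

```python
from collections import deque

def rr_fetch(threads, start, n_threads=2, width=8):
    """One cycle of an RR.n_threads.width fetch scheme: visit threads
    round-robin starting at `start`, taking instructions from at most
    n_threads threads until `width` fetch slots are filled."""
    fetched = []
    used = 0
    for i in range(len(threads)):
        if used == n_threads or len(fetched) == width:
            break
        t = (start + i) % len(threads)
        if threads[t]:
            take = min(width - len(fetched), len(threads[t]))
            for _ in range(take):
                fetched.append((t, threads[t].popleft()))
            used += 1
    return fetched
```

With RR.2.8, a thread that can supply only 5 instructions leaves 3 slots for the next thread in round-robin order, which is why it recovers most of the lost fetch bandwidth.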

Thread choice : how to choose which thread to issue from :
    IQ clog : a thread whose instrs block for a long time fills the IQ. 
    Size of IQ limited by issue-selection search time 
    Algos to choose 
    BRCOUNT : favor threads with fewest unresolved branches. 
    MISSCOUNT : fewest outstanding D cache misses
    ICOUNT : threads with fewest instr in decode, rename and IQs. 
         (1) prevents one thread from filling IQ, 
         (2) highest priority to threads moving instr through IQ most. 
         (3) even mix of instr, maximizing parallelism
    IQPOSN : lowest priority to threads with instr closest to the head of either the integer or floating-point IQ (the oldest instr, most likely to clog)

    instruction counting provides the best gain. 
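ICOUNT reduces to an arg-min over per-thread front-end occupancy; a minimal sketch, assuming a dict of per-thread counts as the interface:

```python
# ICOUNT fetch-priority sketch: favor the thread with the fewest
# instructions in the decode, rename, and issue-queue stages.
def icount_pick(front_end_counts):
    """front_end_counts: thread_id -> #instrs between fetch and issue.
    Returns the thread id to fetch from next."""
    return min(front_end_counts, key=front_end_counts.get)

# Thread 2 has the emptiest front end, so it gets fetch priority.
print(icount_pick({0: 12, 1: 7, 2: 3, 3: 9}))  # → 2
```

A thread hoarding IQ slots accumulates a high count and loses fetch priority automatically, which is what prevents IQ clog.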

Unblocking the fetch unit
    BIGQ : increase IQ size without a search-time penalty : only the first 32 entries are searched for issue
    ITAG : do the I-cache tag lookup a cycle early so threads that will miss are not selected for fetch (extra tag ports required)

wasted instr : wrong-path instr (branch mispred), optimistically issued load-dependent instr. 

bottlenecks : 
    Issue bandwidth : not a bottleneck
    IQ size : not a bottleneck
    Fetch bandwidth : still a bottleneck : branch frequency and PC alignment problems
    Branch prediction : SMT less sensitive to prediction accuracy than a single-threaded proc
    Speculative execution : not critical to SMT throughput
    Memory throughput : SMT hides memory latency but not limited memory throughput
    Register file size : a larger file lengthens register access time / the pipeline
    >> fetch throughput is the bottleneck.