Converting Thread-Level Parallelism to Instruction-Level
Parallelism via Simultaneous Multithreading
"The most compelling reason for running parallel
applications on an SMT processor is its ability to use thread-level parallelism
and instruction-level parallelism interchangeably."
SMT,
or simultaneous multithreading, is a way to exploit both thread-level
parallelism (TLP) and instruction-level parallelism (ILP) by
partitioning processor resources dynamically according to the workload.
Main Idea
- By allowing multiple threads to share the processor's
functional units simultaneously, TLP is essentially converted into ILP
- An SMT processor can exploit whichever type of parallelism
is currently available in the workload, utilizing the functional units
more effectively to achieve greater throughput and program speedup
- Since an SMT is a single processor, its hardware
resources are not statically partitioned (a limiting factor in CMPs and
other multiprocessor systems) and can thus be exploited fully
- Amdahl's Law strikes again: with a multiprocessor or CMP,
one can only achieve speedup on workloads that have heavy TLP, but with
SMT, speedups from both TLP and ILP can be realized, giving it greater
potential for versatility
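The Amdahl's Law point can be made concrete with a short sketch (the parallel fraction and core count below are illustrative numbers, not figures from the paper):

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Amdahl's Law: overall speedup when only the parallelizable
    fraction of a workload benefits from n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# On a CMP/multiprocessor, only the TLP-heavy fraction speeds up:
# if half the work is parallel, 4 cores yield just 1.6x overall.
print(amdahl_speedup(0.5, 4))  # 1.6
```

An SMT machine can still profit during the serial fraction by spending its issue width on that one thread's ILP, which is the versatility the bullet above describes.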
SMT vs CMPs and Multiprocessors
- With multiple threads, there is a lot of inter-thread ILP,
since instructions from different threads are essentially independent
(more independent instructions can issue per cycle)
- With a CMP or multiprocessor, multiple threads can execute
efficiently because there are multiple banks of functional units to be
utilized for each thread
- A CMP or multiprocessor falls behind SMT in performance on
programs that don't have much TLP, because the hardware of a
CMP/multiprocessor is statically partitioned and thus cannot be shared
when there aren't multiple threads
- An SMT processor is able to fully utilize its
functional units even when only a small number of threads are available,
whereas on a CMP/multiprocessor there may be functional units sitting
idle, waiting for work and wasting processing resources
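A toy single-cycle issue model makes the static-partitioning penalty visible (the slot counts and per-thread ILP values are made-up assumptions for illustration, not the paper's simulator):

```python
def cmp_issue(per_thread_ilp, cores, slots_per_core):
    """Static partition (CMP): each thread runs on its own core and
    can use at most that core's issue slots; spare cores sit idle."""
    return sum(min(ilp, slots_per_core) for ilp in per_thread_ilp[:cores])

def smt_issue(per_thread_ilp, total_slots):
    """Dynamic sharing (SMT): all threads draw from one shared pool
    of issue slots until the pool is exhausted."""
    used = 0
    for ilp in per_thread_ilp:
        used += min(ilp, total_slots - used)
    return used

# A single thread with 6 independent instructions this cycle:
# a 4-core, 2-wide CMP issues only 2, an 8-wide SMT issues all 6.
print(cmp_issue([6], cores=4, slots_per_core=2))  # 2
print(smt_issue([6], total_slots=8))              # 6
```

With four threads of ILP 2 each, both models issue 8, which matches the point that CMPs do fine when TLP is plentiful; the gap only opens when threads are scarce.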
- The paper does not try to argue that SMT is any absolute "x
percent" better than multiprocessors, but instead tries simply to
demonstrate that SMT can overcome some fundamental limitations of
multiprocessors
- The paper does not take into consideration chip size, SMT
processor design and implementation (which has the potential to be quite
complex), or power
- It focuses only on highly parallel applications, and
doesn't look at multiprogrammed workloads where a lot of stress is
put on the caches in a shared-memory environment
Inaccuracies
- At one point, the paper uses some pretty strange "inefficiency
metrics" for comparing performance on a multiprocessor: the percentages
given measure how often functional units sit idle on some processors
while the same functional units are swamped on another processor
within the same system
- None of these metrics was well explained, and it was
difficult to get a good sense of what the paper was trying to
accomplish with such vague statistics
Do we think that SMT is better than CMP/MP?
- No: we're not convinced that SMT is ever going to
completely take over everything that a CMP system can do, and we are
skeptical about the market that is out there for SMT processors
- Is the complexity of building a wide-issue superscalar SMT
really worth it to get a system that is noticeably better in some
applications and otherwise not much better?