Converting Thread-Level Parallelism to Instruction-Level
Parallelism via Simultaneous Multithreading
"The most compelling reason for running parallel
applications on an SMT processor is its ability to use thread-level parallelism
and instruction-level parallelism interchangeably."
SMT,
or simultaneous multithreading, is a way to exploit both thread-level
parallelism (TLP) and instruction-level parallelism (ILP) by
partitioning processor resources dynamically according to the workload.
Main Idea
- By allowing multiple threads to share the processor's
functional units simultaneously, TLP is essentially converted into ILP
- An SMT processor can exploit whichever type of parallelism
is currently available in the workload, utilizing the functional units
more effectively to achieve greater throughput and program speedup
- Since an SMT is a single processor, its hardware
resources are not statically partitioned (a limiting factor in CMPs and
other multiprocessor systems) and can thus be exploited fully
- Amdahl's Law strikes again: with a multiprocessor or CMP,
one can only achieve speedup on workloads that have heavy TLP, but with
SMT, speedups from both TLP and ILP can be realized, giving it greater
potential for versatility
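The Amdahl's Law point can be made concrete with a short sketch (the parallel fraction and core count below are illustrative numbers, not figures from the paper):

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Amdahl's Law: overall speedup when only the parallelizable
    fraction of a workload benefits from n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# On a CMP/multiprocessor, only the TLP-heavy fraction speeds up:
# if half the work is parallel, 4 cores yield just 1.6x overall.
print(amdahl_speedup(0.5, 4))  # 1.6
```

An SMT machine can still profit during the serial fraction by spending its issue width on that one thread's ILP, which is the versatility the bullet above describes.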
SMT vs CMPs and Multiprocessors
- With multiple threads, there is a lot of inter-thread ILP,
since instructions from different threads are essentially independent
(more independent instructions can issue per cycle)
- With a CMP or multiprocessor, multiple threads can execute
efficiently because there are multiple banks of functional units to be
utilized for each thread
- A CMP or multiprocessor falls behind SMT in performance on
programs that don't have much TLP, because the hardware of a
CMP/multiprocessor is statically partitioned and thus cannot be shared
when there aren't multiple threads
- An SMT processor is able to fully utilize its
functional units even when only a small number of threads are available,
whereas on a CMP/multiprocessor there may be functional units sitting
idle, waiting for work and wasting processing resources
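A toy single-cycle issue model makes the static-partitioning penalty visible (the slot counts and per-thread ILP values are made-up assumptions for illustration, not the paper's simulator):

```python
def cmp_issue(per_thread_ilp, cores, slots_per_core):
    """Static partition (CMP): each thread runs on its own core and
    can use at most that core's issue slots; spare cores sit idle."""
    return sum(min(ilp, slots_per_core) for ilp in per_thread_ilp[:cores])

def smt_issue(per_thread_ilp, total_slots):
    """Dynamic sharing (SMT): all threads draw from one shared pool
    of issue slots until the pool is exhausted."""
    used = 0
    for ilp in per_thread_ilp:
        used += min(ilp, total_slots - used)
    return used

# A single thread with 6 independent instructions this cycle:
# a 4-core, 2-wide CMP issues only 2, an 8-wide SMT issues all 6.
print(cmp_issue([6], cores=4, slots_per_core=2))  # 2
print(smt_issue([6], total_slots=8))              # 6
```

With four threads of ILP 2 each, both models issue 8, which matches the point that CMPs do fine when TLP is plentiful; the gap only opens when threads are scarce.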
- The paper does not try to argue that SMT is any absolute "x
percent" better than multiprocessors, but instead tries simply to
demonstrate that SMT can overcome some fundamental limitations of
multiprocessors
- The paper does not take into consideration chip size, SMT
processor design and implementation (which has the potential to be quite
complex), or power
- It focuses only on highly parallel applications, and
doesn't look at multiprogrammed workloads where a lot of stress is
put on the caches in a shared-memory environment
Inaccuracies
- At one point, the paper uses some pretty strange "inefficiency
metrics" for comparing performance on a multiprocessor: the percentages
given measure how often functional units sit idle on some processors
while the same functional units are swamped on another processor
within the same system
- None of these metrics was well explained, and it was
difficult to get a good sense of what the paper was trying to
accomplish with such vague statistics
Do we think that SMT is better than CMP/MP?
- No: we're not convinced that SMT is ever going to
completely take over everything that a CMP system can do, and we are
skeptical about the market that is out there for SMT processors
- Is the complexity of building a wide-issue superscalar SMT
really worth it to get a system that is noticeably better in some
applications and otherwise not much better?