Exploring the Design Space of Future CMPs - Notes
Goals
- Determine which CMP organization yields the highest throughput (in terms of IPC).
- Explore how the optimal configurations change across successive technology generations.
Factors
- Processor organization - how many cores should be on the chip, and whether those cores should be in-order or out-of-order.
- Cache capacity - the amount of cache memory available to each processor.
- Off-chip bandwidth - the memory bandwidth available through the chip's pins.
- Application characteristics - whether the workloads are processor bound, cache sensitive, or bandwidth bound.
- The category a benchmark falls into depends on its working set size: processor-bound workloads have the smallest working sets and bandwidth-bound workloads the largest (a rough sketch of this split follows the list).
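A minimal sketch of this three-way split, assuming the classification can be approximated by comparing working set size to L2 capacity. The function name, thresholds, and example sizes are all illustrative; the paper categorized benchmarks from measured behavior, not a fixed formula.
```python
def classify(working_set_kb: float, l2_kb: float) -> str:
    """Bucket a benchmark into one of the paper's three rough categories.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    if working_set_kb <= l2_kb:       # working set fits in cache: compute limited
        return "processor bound"
    if working_set_kb <= 8 * l2_kb:   # performance tracks available cache size
        return "cache sensitive"
    return "bandwidth bound"          # streams through memory regardless of cache

for ws_kb in (256, 4096, 65536):
    print(f"{ws_kb:6d} KB working set -> {classify(ws_kb, l2_kb=1024)}")
```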
Modeling
- Assumes a private L1 and L2 cache per processor (issues with this are discussed under Limitations below)
- Chip pin counts are scaled according to SIA roadmap projections
- A Cache Byte Equivalent (CBE) area model is used for area estimation, expressing core area in units of cache bytes (see the sketch after this list)
- Rambus memory channels are modeled in detail, with 60 pins used per channel.
- Throughput-oriented benchmarks are run at a range of cache sizes to reason about area/performance tradeoffs and to explore the effects of limited bandwidth
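The sketch below combines the CBE area model with the 60-pins-per-channel Rambus constraint to show how a candidate configuration trades cores against cache, and how the pin budget caps bandwidth. All constants are assumed for illustration except the 60 pins per channel quoted above.
```python
# Back-of-the-envelope CMP sizing in the paper's style. All constants below are
# illustrative assumptions except PINS_PER_CHANNEL, the figure quoted in the notes.
CHIP_AREA_CBE = 32 * 1024 * 1024                   # die budget in cache byte equivalents (assumed)
CORE_AREA_CBE = {"in-order": 1 * 1024 * 1024,      # assumed core costs, in CBE
                 "out-of-order": 4 * 1024 * 1024}
SIGNAL_PINS = 1024                                 # pins available for memory channels (assumed)
PINS_PER_CHANNEL = 60                              # Rambus channel width, per the notes above
CHANNEL_BW_GBPS = 1.6                              # assumed bandwidth of one channel

def cmp_config(core_type: str, n_cores: int):
    """Return (private L2 bytes per core, total off-chip GB/s) for a candidate CMP."""
    cache_cbe = CHIP_AREA_CBE - n_cores * CORE_AREA_CBE[core_type]
    if cache_cbe <= 0:
        raise ValueError("cores alone exceed the die budget")
    l2_bytes_per_core = cache_cbe // n_cores        # 1 CBE buys 1 byte of cache
    n_channels = SIGNAL_PINS // PINS_PER_CHANNEL    # pin budget caps channel count
    return l2_bytes_per_core, n_channels * CHANNEL_BW_GBPS

for n in (2, 4, 6):
    l2, bw = cmp_config("out-of-order", n)
    print(f"{n} OoO cores -> {l2 // 1024:6d} KB private L2 each, {bw:.1f} GB/s off-chip")
```
Note that off-chip bandwidth is flat across configurations: it is set by the pin budget rather than core count, which is exactly the imbalance the conclusions below highlight.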
Conclusions
- Transistor counts are growing significantly faster than pin counts. If this trend continues, both the number of cores in a CMP and the maximum achievable throughput will be limited. Beyond raw pin counts, the signaling speeds of I/O pins are not increasing at the same rate as processor clock rates, which will further constrain CMP throughput (see the scaling sketch after this list).
- Out-of-order cores are more efficient than in-order cores in terms of performance per unit area.
- The size of throughput-optimal L2 caches continues to grow with each succeeding technology generation.
- It may be useful in future studies to model CMP performance in terms of bandwidth demand.
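To make the first conclusion concrete, the sketch below grows an assumed core budget faster than an assumed bandwidth budget and reports the effective throughput cap at each generation. The growth rates and per-core bandwidth demand are invented for illustration, not taken from the paper or the SIA roadmap.
```python
# Illustrative generation-by-generation scaling. All rates are assumptions.
TRANSISTOR_GROWTH = 2.0   # core budget roughly doubles per generation (assumed)
PIN_BW_GROWTH = 1.3       # off-chip bandwidth grows much more slowly (assumed)
BW_PER_CORE_GBPS = 1.0    # miss bandwidth one busy core demands (assumed)

cores, bw_gbps = 4.0, 6.4  # assumed starting point
for gen in range(5):
    # Throughput is capped by whichever is scarcer: cores, or bandwidth to feed them.
    sustainable = min(cores, bw_gbps / BW_PER_CORE_GBPS)
    print(f"gen {gen}: {cores:4.0f} cores, {bw_gbps:5.1f} GB/s "
          f"-> ~{sustainable:4.1f} cores' worth of throughput")
    cores *= TRANSISTOR_GROWTH
    bw_gbps *= PIN_BW_GROWTH
```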
Limitations
- “Throughput-oriented” spec2k benchmarks were used as workloads
- spec2k benchmarks are uniprocessor benchmarks, so the effects of synchronization and data sharing are not measured.
- The chosen workloads are not representative of those that would run on a real CMP, which limits the applicability of the conclusions.
- The authors claim that DSS workloads resemble the cache-sensitive benchmarks studied in the paper, and that OLTP workloads resemble the bandwidth-bound benchmarks. This is not quite accurate, because OLTP workloads are more cache sensitive than the bandwidth-bound benchmarks.
- Instead of using a memory hierarchy with a shared L2 and a private L1 per processor, this paper assumed private L1 and L2 caches for each processor. The authors argued that the performance of a shared L2 will be limited as CMPs scale, due to global wiring delays and large cache bandwidth requirements (a latency sketch follows this list).
- It is more likely that this memory hierarchy choice was driven by the simulation infrastructure the authors used.
- A pipelined, switched interconnect could be used to reduce delays.
- The Piranha CMP used a non-inclusive cache hierarchy to address these scalability problems.
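A hypothetical sketch of the shared-L2 wiring-delay concern: with a banked shared L2 reached over a pipelined, switched mesh, average access latency grows with hop distance as core counts rise, whereas a private L2 keeps a fixed local latency. The topology, cycle counts, and mean-distance approximation are all assumptions, not figures from the paper or from Piranha.
```python
import math

PRIVATE_L2_CYCLES = 12    # assumed fixed latency of a smaller private L2
SHARED_BANK_CYCLES = 16   # assumed latency of one bank of a large shared L2
HOP_CYCLES = 2            # assumed per-hop cost on a pipelined, switched mesh

def shared_l2_latency(n_cores: int) -> float:
    """Average round-trip latency to a random bank of a shared L2 on a square mesh."""
    side = math.isqrt(n_cores)   # assume cores/banks form a side x side mesh
    avg_hops = 2 * side / 3      # ~ mean Manhattan distance between two tiles
    return SHARED_BANK_CYCLES + 2 * avg_hops * HOP_CYCLES  # request + reply traversal

for n in (4, 16, 64):
    print(f"{n:2d} cores: private L2 = {PRIVATE_L2_CYCLES} cycles, "
          f"shared L2 ~= {shared_l2_latency(n):.1f} cycles")
```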