(Hierarchical Trace Processors)
Indian Institute of Science
The HiT Processors project designs hierarchical trace processors that simultaneously achieve high performance (or cost-performance) and relative simplicity of implementation. The designs rely on novel, elegant microarchitectures that use (i) multiple levels of architectural hierarchy and organizational hierarchy, and (ii) localized communication. Architectural hierarchy is used to access and exploit far-flung ILP in programs, the source of large performance improvements over current superscalar processors. Organizational hierarchy and locality of communication result naturally from the architectural hierarchy, allowing relatively simple components to be replicated to alleviate implementation complexity. Current trace processor designs show the potential to outperform comparable superscalar processors by factors of roughly 2 to 4, a large improvement in an area where improvements of a few percent to a few tens of percent are more typical. We expect trace processors to be the choice for the next generation of microarchitectures, replacing current superscalars.
Trace processors, proposed in our ISCA'97 paper, are the project's first and primary contribution to date. They make two major contributions:
· a processor design centred around traces rather than around individual instructions, and
· hardware-based dynamic (re)compilation of traces.
The former contribution enables replication of simpler sub-processors to build a wide-issue processor. It builds on the raw trace cache proposed by Eric Rotenberg and Jim Smith as a high-bandwidth, low-latency instruction supply mechanism for traditional superscalar processors.
The latter contribution introduces the idea of preprocessing ("cooking") a trace on first encounter and reusing the results on subsequent visits, i.e., hardware-based dynamic recompilation at runtime. This idea is the key to trace processors: it enables the processing of a 16-instruction trace in a single clock. It has also opened up the possibility of a whole range of ancillary runtime trace optimizations to further improve performance.
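The preprocess-once, reuse-afterwards idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the `TraceCache` class, the `toy_preprocess` stand-in, and the choice to identify a trace by its start PC plus branch outcomes are all hypothetical and introduced here for illustration.

```python
class TraceCache:
    """Toy model: preprocess a trace once, then reuse the result on later visits."""

    def __init__(self, preprocess):
        self._preprocess = preprocess  # hypothetical preprocessing step
        self._cooked = {}              # trace id -> preprocessed form
        self.hits = 0
        self.misses = 0

    def fetch(self, trace_id, raw_instructions):
        """Return the preprocessed trace, doing the work only on a miss."""
        if trace_id in self._cooked:
            self.hits += 1
        else:
            self.misses += 1
            self._cooked[trace_id] = self._preprocess(raw_instructions)
        return self._cooked[trace_id]


def toy_preprocess(raw_instructions):
    # Stand-in for the real work (for example, dependence analysis);
    # here it only freezes the instruction list.
    return tuple(raw_instructions)


cache = TraceCache(toy_preprocess)
trace_id = (0x400, (True, False))  # assumed id: start PC plus branch outcomes
raw = ["add r1, r2, r3", "sub r4, r1, r2"]
for _ in range(3):
    cooked = cache.fetch(trace_id, raw)
print(cache.misses, cache.hits)  # 1 2
```

The first visit pays the preprocessing cost; the two later visits reuse the stored result, which is the property the text relies on.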
Dynamic Vectorization, proposed in our ISCA'99 paper, adds a further level of hierarchy to trace processors. The major contribution of this work is the idea of dynamically detecting repetitive executions of a trace and converting the trace to vector form. Such conversion overcomes compile-time hurdles to vectorization, enables the processing of post-loop code in parallel with loop processing (effectively increasing the logical instruction window size), and uses a small physical instruction window to capture the resulting large logical instruction window.
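As a rough illustration of detecting repetitive executions of a trace, the sketch below collapses runs of the same trace in a dispatch stream into a single vector entry. It is a deliberately simplified, offline stand-in: the function name `group_repeats`, the `threshold` parameter, and the tuple encoding are assumptions made here, and the actual hardware mechanism would detect repetition online and must also check operand patterns, which this sketch ignores.

```python
import itertools


def group_repeats(trace_stream, threshold=3):
    """Collapse runs of one trace id into a ('vector', id, n) entry.

    Runs shorter than `threshold` (an assumed tuning knob) stay scalar.
    """
    out = []
    for trace_id, run in itertools.groupby(trace_stream):
        n = len(list(run))
        if n >= threshold:
            out.append(("vector", trace_id, n))
        else:
            out.extend(("scalar", trace_id, 1) for _ in range(n))
    return out


# 'L' stands for a loop-body trace executed four times in a row.
stream = ["A", "L", "L", "L", "L", "B"]
grouped = group_repeats(stream)
print(grouped)
# [('scalar', 'A', 1), ('vector', 'L', 4), ('scalar', 'B', 1)]
```

In this model the four loop iterations occupy one entry rather than four, which loosely mirrors how a small physical window can represent a larger logical one.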
Trace processors employ multiple levels of architectural hierarchy. Small sequences of basic blocks are dynamically partitioned into traces, and each trace is handled as a CISC-like instruction with multiple sources and multiple destinations. Further, repetitive executions of a trace are captured as a vector trace; these have vector (as well as scalar) source and destination operands, and are handled as higher-level CISC-like instructions.
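Partitioning a dynamic sequence of basic blocks into traces can be sketched as a greedy packing problem. The version below assumes a 16-instruction limit per trace (the trace size mentioned above) and assumes that a trace always ends on a basic-block boundary; the function name `form_traces` and the `(block_id, instruction_count)` encoding are hypothetical, and the real trace-selection rules may differ.

```python
def form_traces(basic_blocks, max_instructions=16):
    """Greedily pack basic blocks, in dynamic order, into traces.

    basic_blocks: iterable of (block_id, instruction_count) pairs.
    Blocks larger than the limit are rejected to keep the sketch simple.
    """
    traces = []
    current, size = [], 0
    for block_id, count in basic_blocks:
        if count > max_instructions:
            raise ValueError(f"block {block_id} exceeds the trace size limit")
        if current and size + count > max_instructions:
            traces.append(current)  # close the trace at a block boundary
            current, size = [], 0
        current.append(block_id)
        size += count
    if current:
        traces.append(current)
    return traces


blocks = [("A", 6), ("B", 5), ("C", 7), ("D", 9), ("E", 3)]
traces = form_traces(blocks)
print(traces)  # [['A', 'B'], ['C', 'D'], ['E']]
```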
ILP is exploited hierarchically: Loop and post-loop code are executed in parallel by executing vector and scalar traces in parallel; multiple scalar traces execute in parallel; multiple instructions of a trace execute in parallel.
Register Value Communication is hierarchical: Register values are classified into those visible only within a trace, only within a vector trace, and across traces. Communication of register values is appropriately structured hierarchically. When traces are dispatched, only those registers visible across traces need to be handled. Registers visible only within a trace are handled separately, at trace processing time.
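The distinction between values visible only within a trace and values visible across traces can be sketched for a single scalar trace as follows. The instruction encoding `(dest, sources)` and the function name `classify_registers` are assumptions for illustration. The analysis is conservative: the last write to each register in the trace is treated as potentially visible to later traces, since whether it is actually read cannot be known from the trace alone.

```python
def classify_registers(trace):
    """Classify register values of one trace.

    trace: list of (dest, sources) pairs; dest may be None.
    Returns (live_ins, live_outs, local_value_indices):
      live_ins   - registers read before being written in the trace
      live_outs  - registers whose final value may be seen by later traces
      local_value_indices - instructions whose result is overwritten
                            inside the trace, so it never leaves the trace
    """
    written = set()
    live_ins = set()
    last_writer = {}
    for i, (dest, sources) in enumerate(trace):
        for src in sources:
            if src not in written:
                live_ins.add(src)
        if dest is not None:
            written.add(dest)
            last_writer[dest] = i
    local_value_indices = [
        i for i, (dest, _) in enumerate(trace)
        if dest is not None and last_writer[dest] != i
    ]
    return live_ins, set(last_writer), local_value_indices


trace = [
    ("r1", ("r2", "r3")),  # r1 = r2 + r3
    ("r4", ("r1", "r2")),  # r4 = r1 + r2
    ("r1", ("r4", "r5")),  # r1 = r4 + r5 (overwrites the first r1)
]
live_ins, live_outs, local_values = classify_registers(trace)
print(sorted(live_ins), sorted(live_outs), local_values)
# ['r2', 'r3', 'r5'] ['r1', 'r4'] [0]
```

Only the live-in and live-out registers would need handling at trace dispatch time in this model; the value produced by instruction 0 stays local to the trace.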
Architectural hierarchy naturally induces organizational hierarchy and locality of communication.
Organizational hierarchy in trace processors is a natural result of architectural hierarchy. The instruction window, register file, functional units, and bypass logic are partitioned and grouped around traces. Each trace executes in a small, relatively simple 2-way superscalar processor that has its own small instruction subwindow, relatively simple local instruction-issue logic, a small (16-entry), limited-port local register file, a small number of local functional units, and full local bypassing of results. (A vector trace executes in a PE that has the additional capability of handling vector issue and vector operands.) Multiple copies of these simple PEs are tied together by global trace dispatch logic and a shared global register file to form the trace processor.
Locality of communication in trace processors is a natural result of architectural hierarchy. A large fraction of the register values produced by a 16-instruction trace is typically consumed fully within the trace and is not visible to other traces. As trace size increases, the proportion of register values consumed locally increases. Further, a good fraction of register values that are visible across loop iterations but not beyond the loop (as in recurrences) are consumed within a vector trace. Such locality of communication results in a hierarchy of register files: multiple local register files and a global register file.
For vector traces, memory value communication also exhibits a significant amount of locality.
Current work on HiT processors attempts to address some of the following issues:
· detailed performance evaluation of a fleshed-out hierarchical trace processor design
· design of a high-bandwidth data memory system that also efficiently supports large-scale speculative memory accesses
· adapting HiT processors to cost-conscious environments (such as embedded processors)
· further extending the ability of HiT processors to build larger logical instruction windows and thus access more far-flung ILP
Primary Contributions
· Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences, Sriram Vajapeyam, Tulika Mitra, Int'l Symp. on Comp. Architecture, June 1997, Denver, Colorado, USA.
This paper proposes trace processors.
· Dynamic Vectorization: A Mechanism for Exploiting Far-Flung ILP in Ordinary Programs, Sriram Vajapeyam, P. J. Joseph, Tulika Mitra, Int'l Symp. on Comp. Architecture, May 1999, Atlanta, GA, USA.
This paper proposes runtime vectorization of programs.
Advocacy papers
· Trace Processors: Moving to Fourth-Generation Microarchitectures, James E. Smith, Sriram Vajapeyam, IEEE Computer, Special Issue on The Future of Microprocessors, September 1997.
This paper advocates trace processors as the next-generation microarchitecture, replacing current superscalar processors.
Related projects by Masters students of Sriram Vajapeyam:
· A Decoupled Control/Execute Architecture for Speculative Execution, S. Jayashree, August 1993-to-June 1994.
· Performance Evaluation Of Register Renaming Feature Of RS/6000, D. Srinivasa Kumar, May-to-December 1994.
· Characterizing Benchmark Programs for Parallelism in a Multiple Execution Flow Machine, Shankar Velayudhan, May-to-December 1995.
· Study of ILP Techniques used in the PowerPC Microarchitecture, A. B. Kishor, May-to-December 1995.
· A Study of JAVA Trace Processors, V. Radhakrishna, May-to-December 1997.
· Characterization of Hierarchical ILP, Valsaraj, May-to-December 1997.
· University of Wisconsin-Madison
· University of Michigan-Ann Arbor
Related Commercial Processors