%Z ------------------------------------------------------------------------- %Z %Z Refer/bib bibliographic entries for the 21st %Z INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE %Z (1994). Created by Julie Fingerson and Mark D. Hill. %Z Original abstracts typed by Mari Sailer in 9/96. %Z %Z These entries are correct to the best of our knowledge, %Z but we accept no responsibility for the consequences of %Z any errors. Email corrections to hoffman@cs.wisc.edu. %Z Last change: Sat Aug 31 21:49:19 CDT 1996 %Z %Z ------------------------------------------------------------------------- %Z %T Fast and Accurate Instruction Fetch and Branch Prediction %A Brad Calder %A Dirk Grunwald %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 2-11 %X Accurate branch prediction is critical to performance; mispredicted branches mean that ten's of cycles may be wasted in superscalar architectures. Architectures combining very effective branch prediction mechanisms coupled with modified branch target buffers (BTB's) have been proposed for wide-issue processors. These mechanisms require considerable processor resources. Concurrently, the larger address space of 64-bit architectures introduce new obstacles and opportunities. A larger address space means branch target buffers become more expensive. In this paper, we show how a combination of less expensive mechanisms can achieve better performance that BTB's. This combination relies on a number of design choices described in the paper. We used trace-driven simulation to show that our proposed design, which uses fewer resources, offers better performance than previously proposed alternatives for most programs, and indicate how to further improve this design. %T The Impact of Unresolved Branches on Branch Prediction Scheme Performance %A Adam R. Talcott %A Wayne Yamamoto %A Mauricio J. Serrano %A Roger C. Wood %A Mario Nemirovsky %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 12-21 %X In this paper, we examine the benefits of the early resolution of branch instructions and the impact of unresolved branches on history-based branch prediction schemes by using two new metrics that are more revealing than branch prediction accuracy alone. We first briefly review a number of branch prediction schemes and introduce two new branch prediction scheme performance metrics. We then utilize these metrics to gauge the improvement in branch prediction scheme performance when only the outcomes of unresolved branches are predicted. Finally, we examine two approaches for handling multiple unresolved branches in history-based branch prediction schemes, and determine that prediction accuracy remains quite stable when older branch histories are used. %T Evaluating Stream Buffers as a Secondary Cache Replacement %A Subbarao Palacharla %A R. E. Kessler %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 24-33 %X Today's commodity microprocessors require a low latency memory system to achieve high sustained performance. The conventional high-performance memory system provides fast data access via a large secondary cache. But large secondary caches can be expensive, particularly in large-scale parallel systems with many processors (and thus many caches). We evaluate a memory system design that can be both cost-effective as well as provide better performance, particularly for scientific workloads: a single level of (on-chip) cache backed up only by Joupi's stream buffers [10] and a main memory. This memory system requires very little hardware compared to a large secondary cache and doesn't require modifications to commodity processors. We use trace-driven simulation of fifteen scientific applications from the NAS and PERFECT suites in our evaluation. We present two techniques to enhance the effectiveness of Jouppi's original stream buffers: filtering schemes to reduce their memory bandwidth requirement and a scheme that enables stream buffers to prefetch data being accessed in large strides. Our results show that, for the majority of our benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches. Also, we find that as the data-set size of the scientific workload increases the performance of the streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes. %T Tradeoffs in Two-Level On-Chip Caching %A Norman P. Jouppi %A Steven J. E. Wilton %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 34-45 %X The performance of two-level on-chip caching is investigated for a range of technology and architecture assumptions. The area and access time of each level of cache is modeled in detail. The results indicate that for most workloads, two-level cache configurations (with a set-associative second level) perform marginally better than single-level cache configurations that require the same chip area once the first-level cache sizes are 64KB or larger. Two-level configurations become even more important in systems with no off-chip cache and in systems in which the memory cells in the first-level caches are multiported and hence larger than those in the second-level cache. Finally, a new replacement policy called two-level exclusive caching is introduced. Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity. %T Architectural Support for Performance Tuning: A Case Study on the SPARCcenter2000 %A Ashok Singhal %A Aaron J. Goldberg %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 48-59 %X Latency hiding techniques such as multilevel cache hierarchies yield high performance when applications map well onto hierarchy implementations, but performance can suffer drastically when they do not. Identifying and reducing mismatches between an application and the memory hierarchy is difficult without insight into the actual behavior of the hardware implementation. We advocate the use of hardware event counters, as a cheap, effective and practical way to tune applications for a given hardware platform. We take a case study approach, focussing on the counters available on the SPARCcenter 2000, a 20 processor, shared-bus based multiprocessor. We describe the tools we built to relate hardware event counts to user applications and give examples to illustrate how these tools are useful in practice. We conclude with a critique of the current hardware counters, offering a user's perspective on how they could be redesigned to be more effective. %T Characterization of Alpha AXP Performance Using TP and SPEC Workloads %A Zarka Cvetanovic %A Dileep Bhandarkar %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 60-70 %X The characteristics of several commercial and technical workloads on the DEC 7000 AXP system are compared using built-in hardware monitors. The data analyzed include total instructions, cycles, multiple-issued instructions, stall components, cache misses, and instruction types. The data indicates that the two classes of workloads have vastly different characteristics and impose different requirements on the system design. Compared to VAX, Alpha AXP takes advantage of lower cycles per instruction and cycle time to achieve a significant performance advantage. The cache and memory interconnect subsystems are expected to play a crucial role in the performance of future systems. A simple model for evaluating the effects of various design tradeoffs based on the data collected by using hardware monitors is proposed. %T Measurement-Based Characterization of Global Memory and Network Contention, Operating System and Parallelization Overheads: A Case Study on Shared-Memory Multiprocessor %A Chitra Natarajan %A Sanjay Sharma %A Ravishankar K. Iyer %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 71-80 %X This study presents a characterization of (1) the global memory and interconnection network contention overhead, (2) the operating system overheads, and (3) the runtime system parallelization overheads for the Cedar shared-memory multiprocessor. The measurements were obtained using five representative compute-intensive, scientific, loop parallel applications from the Perfect Benchmark Suite. The overheads were measured for a range of Cedar configurations from 1 processor to the full 4-cluster/32-processor configuration, thus characterizing the effect of this scaling on the overheads. For the full 4-cluster Cedar, the operating system overhead was found to constitute 5-21% of the total completion time of an application. The parallelization overhead accounts for 10-25% of the application completion time, and the overhead due to global memory and network contention contributes 8-21% of the application completion time. %T Evaluating the Memory Overhead Required for COMA Architectures %A Truman Joe %A John L. Hennessy %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 82-93 %X Cache only memory architectures (COMA) have an inherent memory overhead due to the organization of main memory as a large cache called an attraction memory. This overhead consists of memory left unallocated for performance reasons as well as additional physical memory required due to the cache organization of memory. In this work, we examine the effect of data reshuffling and data replication on the memory overhead. Data reshuffling occurs when space needs to be allocated to store a remote memory line in the local memory. Data that is reshuffled is sent between memories via replacement messages. A simple mathematical model predicts the frequency of data reshuffling as a function of the attraction memory parameters. Simulation data shows that the frequency of data reshuffling is sensitive to the allocation policy and associativity of the memory but is relatively unaffected by the block size chosen. The simulation data also shows that data replication in the attraction memory is important for good performance, but most gains can be achieved through replication in the processor caches. %T A Comparison of Message Passing and Shared Memory Architectures for Data Parallel Programs %A Alexander C. Klaiber %A Henry M. Levy %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 94-105 %X Shared memory and message passing are two opposing communication models for parallel multicomputer architectures. Comparing such architectures has been difficult, because applications must be hand-crafted for each architecture, often resulting in radically different sources for comparison. While it is clear that shared memory machines are currently easier to program, in the future, programs will be written in high-level languages and compiled to the specific parallel target, thus eliminating this difference. In this paper, we evaluate several parallel architecture alternatives -- message passing, NUMA, and cache-coherent shared memory -- for a collection of scientific benchmarks written in C*, a data-parallel language. Using a single suite of C* source programs, we compile each benchmark and simulate the interconnect for the alternative models. Our objective is to examine underlying, technology-independent costs inherent in each alternative. Our results show the relative work required to execute these data parallel programs on the different architectures, and point out where some models have inherent advantages for particular data-parallel program styles. %T Software Versus Hardware Shared-Memory Implementation: A Case Study %A Alan L. Cox %A Sandhya Dwarkadas %A Pete Keleher %A Honghui Lu %A Ramakrishnan Rajamony %A Willy Zwaenepoel %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 106-117 %X We compare the performance of software-supported shared memory on a general-purpose network to hardware-supported shared memory on a dedicated interconnect. Up to eight processors, our results are based on the execution of a set of application programs on a SGI 4D/480 multiprocessor and on TreadMarks, a distributed shared memory system that runs on a Fore ATM LAN of DECstation-5000/240s. Since the DECstation and the 4D/480 use the same processor, primary cache, and compiler, the shared-memory implementation is the principal difference between the systems. Our results show that TreadMarks performs comparably to the 4D/480 for applications with moderate amounts of synchronization, but the difference in performance grows as the synchronization frequency increases. For applications that require a large amount of memory bandwidth, TreadMarks can perform better than the SGI 4D/480. Beyond eight processors, our results are based on execution-driven simulation. Specifically, we compare a software implementation on a general-purpose network of uniprocessor nodes, a hardware implementation using a directory-based protocol on a dedicated interconnect, and a combined implementation using software to provide shared memory between multiprocessor nodes with hardware implementing shared memory within a node. For the modest size of the problems that we can simulate, the hardware implementation scales well and the software implementation scales poorly. The combined approach delivers performance close to that of the hardware implementation for applications with small to moderate synchronization rates and good locality. Reductions in communication overhead improve the performance of the software and the combined approach, but synchronization remains a bottleneck. %T Guarded Executing and Branch Prediction in Dynamic ILP Processors %A Dionisios N. Pnevmatikatos %A Gurindar S. Sohi %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 120-129 %X We evaluate the effects of guarded (or conditional, or predicated) execution on the performance of an instruction level parallel processor employing dynamic branch prediction. First, we assess the utility of guarded execution, both qualitatively and quantitatively, using a variety of application programs. Our assessment shows that guarded execution significantly increases the opportunities, for both compiler and dynamic hardware, to extract and exploit parallelism. However, existing methods of specifying guarded execution have several drawbacks that limit its use. Second, we study the interaction of guarded execution and dynamic branch prediction and show that the use of guarded execution significantly increases the number of instructions between mispredicted branches. Third, we propose a new method of specifying guarded execution. The proposed method uses a special GUARD instruction, which can be used to incorporate guarded execution into existing instruction sets. GUARD instructions realize the full power of guarded execution, without the drawbacks of existing methods of specifying guarded execution. %T Branch with Masked Squashing in Superpipelined Processors %A Ching-Long Su %A Alvin M. Despain %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 130-140 %X The performance of a superpipeline processor heavily relies on its branch performance. Traditional branch strategies used in pipelined processors are delayed branches and branch with squashing. Delayed branches use safe instructions to fill delay slots. However, for a deeply pipelined processor, a compiler may not be able to find sufficient safe instructions to fill the branch delay slots. Branch with squashing takes advantage of using instructions in target basic blocks to fill the branch delay slots. However, the penalty of branch misprediction is large in superpipelined processors. In this paper, we proposed a novel branch scheme, named branch with masked squashing, which takes advantage of both delayed branch and branch with squashing. The basic idea is to fill delay slots with safe instructions which may come from above or after the branch. For the remaining unfilled delay slots, we fill with instructions from the predicted target path. In the case of misprediction, only unsafe instructions are annulled. The safe instructions in branch delay slots are always executed. Simulation results show that this branch strategy performs better than traditional delayed branch and branch with squashing. %T Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer %A Matthias A. Blumrich %A Kai Li %A Richard Alpert %A Cezary Dubnicki %A Edward W. Felten %A Jonathan Sandberg %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 142-153 %X The network interfaces of existing multicomputers require a significant amount of software overhead to provide protection and to implement message passing protocols. This paper describes the design of a low-latency, high-bandwidth, virtual memory-mapped network interface for the SHRIMP multicomputer project at Princeton University. Without sacrificing protection, the network interface achieves low latency by using virtual memory mapping and write-latency hiding techniques, and obtains high bandwidth by providing a user-level block data transfer mechanism. We have implemented several message passing primitives in an experimental environment, demonstrating that our approach can reduce the message passing overhead to a few user-level instructions. %T Architecture and Evaluation of High-Speed Networking Subsystem for Distributed-Memory Systems %A Peter Steenkiste %A Michael Hemy %A Todd Mummert %A Brian Zill %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 154-163 %X Achieving high-speed network I/O on distributed-memory systems is difficult because their architecture is in general ill-suited for communication processing. Some of the common problems are: inability to do protocol processing, inefficient handling of data distribution, and poor management of the I/O. In this paper we present an I/O architecture that addresses these problems and supports high-speed network I/O on distributed-memory systems. The key to good performance is to partition the work appropriately between the system and the network interface. We perform some communication tasks on the distributed-memory parallel system since it is more powerful, and less likely to become a bottleneck than the network interface. Tasks that do not parallelize well are performed on the network interface and hardware support is provided for the most time-critical operations. We emphasize the use of simple I/O mechanisms that can be used by programming tools that map applications on the distributed-memory system to implement efficient I/O for the class of applications they support. This architecture has been implemented for the iWarp distributed-memory system. We describe this implementation and present performance results. %T Exploring the Design Space for a Shared-Cache Multiprocessor %A Basem A. Nayfeh %A Kunle Olukotun %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 166-175 %X In the near future, semiconductor technology will allow the integration of multiple processors on a chip or multichip-module (MCM). In this paper we investigate the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors. We study the performance of a cluster-based multiprocessor architecture in which processors within a cluster are tightly coupled via a shared cluster cache for various processor-cache configurations. Our results show that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system, without increasing the number of invalidations. Combining these results with cost estimates for shared cluster cache implementations leads to two conclusions: 1) For a four cluster multiprocessor with single chip clusters, two processors per cluster with a smaller cache provides higher performance and better cost/performance than a single processor with a larger cache and 2) this four cluster configuration can be scaled linearly in performance by adding processors to each cluster using MCM packaging techniques. %T Impact of Sharing-Based Thread Placement on Multithreaded Architectures %A Radhika Thekkath %A Susan J. Eggers %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 176-186 %X Multithreaded architectures context switch between instruction streams to hide memory access latency. Although this improves processor utilization, it can increase cache interference and degrade overall performance. One technique to reduce the interconnect traffic is to co-locate threads that share data on the same processor. Multiple threads sharing in the cache should reduce compulsory and invalidation misses, thereby improving execution time. To test this hypothesis, we compared a variety of thread placement algorithms via trace-driven simulation of fourteen coarse- and medium-grain parallel applications on several multithreaded architectures. Our results contradict the hypothesis. Rather than decreasing, compulsory and invalidation misses remained nearly constant across all placement algorithms, for all processor configurations, even with an infinite cache. That is, sharing-based placement had no (positive) effect on execution time. Instead, load balancing was the critical factor that affected performance. Our results were due to one or both of the following reasons: (1) the sequential and uniform access of shared data by the application's threads and (2) the insignificant number of data references that require interconnect access, relative to the total number of instructions. %T Combined Performance Gains of Simple Cache Protocol Extensions %A Fredrik Dahlgren %A Michel Dubois %A Per Stenstrom %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 187-197 %X We consider three simple extensions to directory-based cache coherence protocols in shared-memory multiprocessors. These extensions are aimed at reducing the penalties associated with memory accesses and include a hardware prefetching scheme, a migratory sharing optimization, and a competitive-update mechanism. Since they target different components of the read and write penalties, they can be combined effectively. Detailed architectural simulations using five benchmarks show substantial combined performance gains obtained at a modest additional hardware cost. Prefetching in combination with competitive-update is the best combination under release consistency in systems with sufficient network bandwidth. By contrast, prefetching plus the migratory sharing optimization is advantageous under sequential consistency and/or in systems with limited network bandwidth. %T Speculative Disambiguation: A Compilation Technique for Dynamic Memory Disambiguation %A Andrew S. Huang %A Gert Slavenburg %A John Paul Shen %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 200-210 %X Ambiguous memory references have always been one of the main sources of performance bottlenecks. Many papers have addressed this problem using static disambiguation. These methods work extremely well when the memory access pattern is linear and predictable. However they are ineffective when the memory access pattern is nonlinear or when the access pattern cannot be determined statically. For these difficult problems, this paper presents speculative disambiguation, a compilation technique for architectures supporting instruction level parallelism and either speculative execution or conditional execution (or both). This technique produces specialized code at compile time to disambiguate memory references at run time. It is shown that on machines with sufficient resources, the technique will always result in lower execution time. Speculative disambiguation has been implemented for a VLIW architecture with guarded execution. Preliminary results indicate that it can help bridge a significant fraction of the performance gap between a good and a perfect static disambiguator. Occasionally it can outperform the perfect static disambiguator. %T Complexity/Performance Tradeoffs with Non-Blocking Loads %A Keith I. Farkas %A Norman P. Jouppi %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 211-222 %X Non-blocking loads are a very effective technique for tolerating the cache-miss latency on data cache references. In this paper, we describe several methods for implementing non-blocking loads. A range of resulting hardware complexity/performance tradeoffs are investigated using an object-code translation and instrumentation system. We have investigated the SPEC92 benchmarks and have found that for the integer benchmarks, a simple hit-under-miss implementation achieves almost all of the available performance improvement for relatively little cost. However, for most of the numeric benchmarks, more expensive implementations are worthwhile. The results also point out the importance of using a compiler capable of scheduling load instructions for cache misses rather than cache hits in non-blocking systems. %T A Performance Study of Software and Hardware Data Prefetching Schemes %A Tien-Fu Chen %A Jean-Loup Baer %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 223-232 %X Prefetching, i.e. exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be either hardware-based or software-directed or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et. at.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with least overhead. %T RAID-II: A High-Bandwidth Network File Server %A Ann L. Drapeau %A Ken W. Shirriff %A John H Hartman %A Ethan L. Miller %A Srinivasan Seshan %A Randy H. Katz %A Ken Lutz %A David A. Patterson %A Edward K. Lee %A Peter M. Chen %A Garth A. Gibson %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 234-244 %X In 1989, the RAID (Redundant Arrays of Inexpensive Disks) group at U. C. Berkeley built a prototype disk array called RAID-I. The bandwidth delivered to clients by RAID-I was severely limited by the memory system bandwidth of the disk array's host workstation. We designed our second prototype, RAID-II, to deliver more of the disk array bandwidth to file server clients. A custom-built crossbar memory system called the XBUS board connects the disks directly to the high-speed network, allowing data for large requests to bypass the server workstation. RAID-II runs Log-Structured File System (LFS) software to optimize performance for bandwidth-intensive applications. The RAID-II hardware with a single XBUS controller board delivers 20 megabytes/second for large, random read operations and up to 31 megabytes/second for sequential read operations. A preliminary implementation of LFS on RAID-II delivers 21 megabytes/second on large read requests and 15 megabytes/second on large write operations. %T EVENODD: An Optimal Scheme for Tolerating Double Disk Failures in RAID Architectures %A Mario Blaum %A Jim Brady %A Jehoshua Bruck %A Jai Menon %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 245-254 %X We present a novel method, that we call EVENODD, for tolerating up to two disk failures in RAID architectures. EVENODD is the first known scheme for tolerating double disk failures that is optimal with regard to both storage and performance. EVENODD employs the addition of only two redundant disks and consists of simple exclusive-OR computations. A major advantage of EVENODD is that it only requires parity hardware, which is typically present in standard RAID-5 controllers. Hence, EVENODD can be implemented on standard RAID-5 controllers without any hardware changes. The only previously known scheme that employs optimal redundant storage (i.e. two extra disks) is based on Reed-Solomon (RS) error-correcting codes, requires computation over finite fields and results in a more complex implementation. For example, we show that the number of exclusive-OR operations involved in implementing EVENODD in a disk array with 15 disks is about 50% of the number required when using the RS scheme. %T Crosshatch Disk Array for Improved Reliability and Performance %A Spencer W. Ng %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 255-264 %X Redundant disk array architecture provides fault tolerance against disk drive failures. However, a storage subsystem consists of more than just disk drives. There must also be controllers for interfacing with the disk drives, cabling for providing data/control paths, power supplies, etc. For the subsystem to be fault tolerant, it must also be able to tolerate failure in any one of these components. While currently known array architectures can be designed to be tolerant to [a] single failure in the support hardware, little attention has been paid to performance in the event of such failures. The recovery procedures required after such failures have been repaired were also not fully considered. In this paper a new array architecture is presented which outperforms other currently known array architectures when some supporting hardware such as a controller or cable has failed. Furthermore, this architecture is fault tolerant to at least double failures in the support hardware and is, therefore, a more reliable and robust architecture. %T METRO: A Router Architecture for High-Performance, Short-Haul Routing Networks %A Frederic Chong %A Henry Minsky %A Andre DeHon %A Matthew Becker %A Samuel Peretz %A Eran Egozy %A Thomas F. Knight, Jr. %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 266-277 %X The Multipath Enhanced Transit Router Organization (METRO) is a flexible routing architecture for high-performance, tightly-coupled, multiprocessors and routing hubs. A MERTO router is a dilated crossbar routing component supporting half-duplex bidirectional, pipelined, circuit-switched connections. Each METRO router is self-routing and supports dynamic message traffic. The routers works [sic] in conjunction with source-responsible network interfaces to achieve reliable end-to-end data transmission in the presence of heavy network congestion and dynamic faults. METRO separates the fundamental architectural characteristics from implementation parameters. Simplicity of routing function coupled with freedom in the implementation parameters allows METRO implementations to fully exploit available technology to achieve low-latency and high-bandwidth. We illustrate the effects of this implementation freedom by summarizing the performance which various METRO configurations can extract from some modern CMOS technologies. Included in our illustrations is METROJR-ORBIT, a minimal instance of the METRO architecture we constructed in a 1.2micron gate-array technology. %T Ariadne - An Adaptive Router for Fault-Tolerant Multicomputers %A James D. Allen %A Patrick T. Gaughan %A David E. Schimmel %A Sudhakar Yalamanchili %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 278-288 %X Adaptive routing has been proposed as a means of improving performance and fault-tolerance in multicomputer networks. While a number of algorithms have been proposed, few adaptive routers have been implemented in hardware. This paper presents the design and implementation of Ariadne -- a prototype single chip, hardware router. The primary motivation is tolerance to link and router failures, while reconciling conflicting demands on performance. This is achieved by implementing the m-misroute backtracking protocol (MB-m) using the pipelined circuit-switching (PCS) communication mechanism[17]. Ariadne implements two virtual data channels and one virtual control channel per physical link. The router is self-timed with single flit buffering at the input and output ports, and is fully adaptive. %T Compressionless Routing: A Framework for Adaptive and Fault-Tolerant Routing %A Jae H. Kim %A Ziqiang Liu %A Andrew A. Chien %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 289-300 %X Compressionless Routing (CR) is a new adaptive routing framework which provides a unified framework for efficient deadlock-free adaptive routing and fault-tolerance. CR exploits the tight-coupling between wormhole routers for flow control to detect potential deadlock situations and recover from them. Fault-tolerant Compressionless Routing (FCR) extends Compressionless Routing to support end-to-end fault-tolerant delivery. Detailed routing algorithms, implementation complexity and performance simulation results for CR and FCR are presented. CR has the following advantages: deadlock-free adaptive routing in torus networks with no virtual channels, simple router designs, order-preserving message transmission, applicability to a wide variety of network topologies, and elimination of the need for buffer allocation messages. FCR has the following advantages: tolerates transient faults while maintaining data integrity (nonstop fault-tolerance), tolerates permanent faults, can be applied to a wide variety of network topologies, and eliminates the need for software buffering and retry for reliability. These advantages of CR and FCR not only simplify hardware support for adaptive routing and fault-tolerance, they also simplify communication software layers. %T The Stanford FLASH Multiprocessor %A Jeffrey Kuskin %A David Ofelt %A Mark Heinrich %A John Heinlein %A Richard Simoni %A Kourosh Gharachorloo %A John Chapin %A David Nakahira %A Joel Baxter %A Mark Horowitz %A Anoop Gupta %A Mendel Rosenblum %A John Hennessy %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 302-313 %X The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible--it can support a variety of different communication mechanisms--and simplifies the design and implementation. This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH's current status. %T Software-Extended Coherent Shared Memory: Performance and Cost %A David Chiken %A Anant Agarwal %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 314-324 %X This paper evaluates the tradeoffs involved in the design of the software-extended memory system of Alewife, a multiprocessor architecture that implements coherent shared memory through a combination of hardware and software mechanisms. For each block of memory, Alewife implements between zero and five coherence directory pointers in hardware and allows software to handle requests when the pointers are exhausted. The software includes a flexible coherence interface that facilitates protocol software implementation. This interface is indispensable for conducting experiments and has proven important for implementing enhancements to the basic system. Simulations of a number of applications running on a complete system (with up to 256 processors) demonstrate that the hybrid architecture with five pointers achieves between 71% and 100% of full-map directory performance at a constant cost per processing element. Our experience in designing the software protocol interfaces and experiments with a variety of system configurations leads to a detailed understanding of the interaction of the hardware and software components of the system. The results show that a small amount of shared memory hardware provides adequate performance: One-pointer systems reach between 42% and 100% of full-map performance on our parallel benchmarks. A software-only directory architecture with no hardware pointers has lower performance but minimal cost. %T Tempest and Typhoon: User-Level Shared Memory %A Steven K. Reinhardt %A James R. Larus %A David A. Wood %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 325-336 %X Future parallel computers must efficiently execute not only hand-coded applications but also programs written in high-level, parallel programming languages. Today's machines limit these programs to a single communication paradigm, either message-passing or shared-memory, which results in uneven performance. This paper addresses this problem by defining an interface, Tempest, that exposes low-level communication and memory-system mechanisms so programmers and compilers can customize policies for a given application. Typhoon is a proposed hardware platform that implements these mechanisms with a fully-programmable, user-level processor in the network interface. We demonstrate the utility of Tempest with two examples. First, the Stache protocol uses Tempest's fine-grain access control mechanisms to manage part of a processor's local memory as a large, fully-associative cache for remote data. We simulated Typhoon on the Wisconsin Wind Tunnel and found that Stache running on Typhoon performs comparably (+/-30%) to an all-hardware Dir(N)NB cache-coherence protocol for five shared-memory programs. Second, we illustrate how programmers or compilers can use Tempest's flexibility to exploit an application's sharing patterns with a custom protocol. For the EM3D application, the custom protocol improves performance up to 35% over the all-hardware protocol. %T A Study of Single-Chip Processor/Cache Organizations for Large Numbers of Transistors %A Matthew Farrens %A Gary Tyson %A Andrew R. Pleszkun %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 338-347 %X This paper presents a trace-driven simulation-based study of a wide range of cache configurations and processor counts. This study was undertaken in an attempt to help answer the question of how best to allocate large numbers of transistors, a question that is rapidly increasing in importance as transistor densities continue to climb. At what point does continuing to increase the size of the on-chip first level cache cease to provide sufficient increases in hit rate and become prohibitively difficult to access in a single cycle? In order to compare different configurations, the concept of an Equivalent Cache Transistor is presented. Results indicate that the access time of the first-level data cache is more important than the size. In addition, it appears that once approximately 15 million transistors become available, a two processor configuration is preferable to a single processor with correspondingly larger caches. %T A Unified Architectural Tradeoff Methodology %A Chung-Ho Chen %A Arun K. Somani %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 348-357 %X We present a unified approach to assess the trade-off of architecture techniques that affect mean memory access time. The architectural features we consider include cache hit ratio, processor stalling features, line size, memory cycle time, the external data bus width of a processor, pipelined memory system, and read bypassing write buffers. We demonstrate how each of these features can be traded off to achieve the desired performance. The performance of an architecture feature is quantified in terms of cache hit ratio based on the equivalence of mean memory delay time. This paper investigates the implication [sic.] of architectural trade-offs on the pin count, memory system design, and on-chip cache area for microprocessor systems. %T Optimal Allocation of On-Chip Memory for Multiple-API Operating Systems %A David Nagle %A Richard Uhlig %A Trevor Mudge %A Stuart Sechrest %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 358-369 %X The allocation of die area to different processor components is a central issue in the design of single-chip microprocessors. Chip area is occupied by both core execution logic, such as ALU and FPU datapaths, and memory structures, such as caches, TLBs, and write buffers. This work focuses on the allocations of die area to memory structures through a cost-benefit analysis. The cost of memory structures with different sizes and associativities is estimated by using an established area model for on-chip memory. The performance benefits of selecting a given structure are measured through a collection of methods including on-the-fly hardware monitoring, trace-driven simulation and kernel-based analysis. Special consideration is given to operating systems that support multiple application programming interfaces (APIs), a software trend that substantially affects on-chip memory allocations decisions. Results: Small adjustments in cache and TLB design parameters can significantly impact overall performance. Operating systems that support multiple APIs, such as Mach 3.0, increase the relative importance of on-chip instruction caches and TLBs when compared against single-API systems such as Ultrix. %T Expected I-Cache Miss Rates via the Gap Model %A Russell W. Quong %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 372-383 %X To evaluate the performance of a memory system, computer architects must determine the miss rate m of a cache C when running program P. Typically, the measured miss rate depends on the specific address mapping M of P set arbitrarily by the compiler and linker. In this paper, we remove the effect of the address-mapping on the miss rate by analyzing a symbolic trace T of basic blocks. By assuming each basic block has an equal probability of ending up anywhere in the address map, we determine the expected miss rate averaged over all possible address mappings. Our resulting gap model gives the expected miss rate for instruction caches of varying cache size, line size, and set associativity. Our model is simple but robust, and turns out to be the familiar LRU stack model with a statistical viewpoint. Our model allows a trace of arbitrary length to be compactly summarized in a few thousand bytes of information. Our model also predicts how an intervening trace, such as an operating system call or a task switch, will affect the miss rate. Comparisons to measured miss rates from SPEC 92 instruction traces show that our model typically has relative differences of less than 20% for a variety of cache parameters. %T Decoupled Sectored Caches: Conciliating Low Tag Implementation Cost and Low Miss Ratio %A Andre Seznec %J Proc. 21st Annual Symposium on Computer Architecture %D April 1994 %P 384-393 %X Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting on several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Maintaining a low tag array size is a major issue in many cache designs (e.g. L2 caches). Using a sectored cache is a design trade-off between a low size of the tag array which is possible with large line size and a low memory traffic which requires a small line size. This technique has been used in many cache designs including small on-chip microprocessor caches and large external second level caches. Unfortunately, as on some applications, the miss ratio on a sectored cache is significantly higher than the miss ratio on a non-sectored cache (factors higher than two are commonly observed), a significant part of the potential performance may be wasted in miss penalties. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache we introduce in this paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as in the tag volume in a traditional sectored cache; but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.