Readings in Computer Architecture: Web Component

Mark D. Hill, University of Wisconsin-Madison Computer Sciences,
Norman P. Jouppi, Compaq Western Research Laboratory, and
Gurindar S. Sohi, University of Wisconsin-Madison Computer Sciences

Version 2.0 Released August 8, 2000

Most recently released version:
http://www.mkp.com/architecture-readings/wc/index.html
All versions and change log:
http://www.mkp.com/architecture-readings/wc/all_versions.html

Preface
1. Classic Machines: Technology, Implementation, and Economics
2. Methods
3. Instruction Sets
4. Instruction Level Parallelism (ILP)
5. Dataflow and Multithreading
6. Memory Systems
7. I/O: Storage Systems, Networks, and Graphics
8. Single-Instruction Multiple Data (SIMD) Parallelism
9. Multiprocessors and Multicomputers
10. Recent Implementations and Future Prospects
11. Power Considerations
12. Reliability, Availability, and Serviceability/Scalability Considerations

Preface

Welcome to the web component that complements our book Readings in Computer Architecture [1]! The book, available from Morgan Kaufmann Publishers [2], contains more than 50 computer architecture papers that have stood the test of time. It is organized into ten chapters that reflect the key topics in the field of computer architecture. In the first section of each chapter, we introduce the selected papers, placing them in context and often offering additional comments and insights that provide a broader background for the chapter topic.

The cutting edge of computer architecture changes too rapidly to be captured only in a book whose editions can come out no more frequently than every few years. There are many current papers that are worth reading, because they represent the current state-of-the-art (e.g., a new microprocessor) or because they may become classics. In addition, there are other useful web resources (e.g., the web page for a benchmark suite) whose existence is too transitory to cite in a book.

This web component will be updated at least annually to address these dynamic needs. It is modeled after the book with the same ten chapters plus two new chapters: 11. Power Considerations and 12. Reliability, Availability, and Serviceability/Scalability Considerations. Each chapter contains pointers to recent papers and web resources together with our notes. These notes introduce the papers and put them in context with each other and current work. Whenever possible, we include Web uniform resource locators (URLs). The papers and resources selected are reviewed by our colleagues in academia and industry to help ensure we have considered the most important recently published papers.

You can always obtain the most recently released version of this web component at URL: http://www.mkp.com/architecture-readings/wc/index.html

Some scholars, however, are displeased by how web pages are easily changed and not archived. To address this concern, released versions of this web component will be permanent and unchanging (except to update outgoing URLs or fix trivial errors). To obtain the version released on September 9, 1999, for example, use URL: http://www.mkp.com/architecture-readings/wc/version_99_09_09.html

To look at all previously released versions and a log of changes, see: http://www.mkp.com/architecture-readings/wc/all_versions.html

There are, however, at least three challenges to the dynamic world of the Web. First, URLs can become obsolete. Please email us at carbugs@mkp.com if you notice this occurring. Second, web content can be extensive and can change. We point to what we find interesting. We do not and cannot vouch for the accuracy and balance of the web content to the same degree as can be achieved through peer review. Third, the reader may be responsible for conforming to copyrights on some web resources. To help readers we call attention to the copyrights we are aware of through "click boxes" that appear when dereferencing selected links.

We conclude this preface by calling your attention to the WWW Computer Architecture Home Page at Wisconsin [3]. Since 1995 this page has been a clearinghouse of computer architecture information. The current version includes URLs for calls for papers and participation for conferences (e.g., ISCA), research groups, researchers, organizations (e.g., ACM, DARPA, IEEE, and NSF), commercial pages (both technical and less technical), books, on-line publications, and internet newsgroups.

References

Mark D. Hill, Norman P. Jouppi, and Gurindar S. Sohi. Readings in Computer Architecture. Morgan Kaufmann Publishers, 2000.
Morgan Kaufmann Publishers. URL: http://www.mkp.com (html).
WWW Computer Architecture Home Page. URL: http://www.cs.wisc.edu/~arch/www (html).

1. Classic Machines: Technology, Implementation, and Economics

In this section we update the reader with more recent references on the technology, implementation, and economics of classic machines. Material already in the book is not duplicated here.

Gordon Bell has placed complete PDF copies of his books online [1]. This is very useful since Bell and Newell [2]; Bell, Mudge, and McNamara [3]; and Siewiorek, Bell, and Newell are all valuable references that are out of print. Bell has also put the slides of his recent presentations online, including a talk on Seymour Cray's contributions to computing [4] that contains many photos of classic machines from the Computer History Center.

The long-anticipated book by Blaauw and Brooks [5], the two IBM 360 architects who went to academia, has been published. It contains many details on the instruction set architecture of early machines. They make a strong distinction between "computer architecture" and "computer organization." Their use of those terms corresponds to the terms "instruction-set architecture" and "microarchitecture", respectively, which are more commonly used today.

Ceruzzi has recently written a great book on the history of computers [6] that discusses both the machines themselves and the marketplace. It has excellent coverage from the early 1960s until the PC.

The Computer Museum in Boston has a web site with historical information [7]. The Computer Museum's Computer History Center also has good coverage of historical machines [8].

Mark Smotherman has web pages that cover some aspects of computer history in great detail [9]. Of particular interest are his discussion of the IBM Stretch and his chronology of IBM's ACS project. ACS was a response to IBM's loss of leadership in the high-performance computing segment. It evaluated such advanced concepts as multiple instruction issue and multithreading in the mid-1960s, but it was cancelled by IBM.

References

Online copies of Gordon Bell's out-of-print books. URL: http://www.research.microsoft.com/users/gbell/Pubs.htm (html).
Bell and Newell, Computer Structures: Readings and Examples. URL: http://www.research.microsoft.com/users/gbell/Computer_Structures__Readings_and_Examples/index.html
Bell, Mudge, and McNamara, Computer Engineering. URL: http://www.research.microsoft.com/users/gbell/Computer_Engineering/index.html
Gordon Bell's talk on Seymour Cray's contributions to Computing. URL: http://www.research.microsoft.com/users/gbell/craytalk/index.htm (html).
G. A. Blaauw and F. P. Brooks, Jr. Computer Architecture: Concepts and Evolution. Addison Wesley, Reading, MA, 1997.
Paul E. Ceruzzi, A History of Modern Computing. MIT press, 1998.
The web site of the Computer Museum in Boston. URL: http://www.tcm.org (html).
The web site of the Computer Museum's Computer History Center. URL: http://www.tcm.org/html/history/histcenter.html (html).
Mark Smotherman's historical pages. URL: http://www.cs.clemson.edu/~mark/hist.html (html).

2. Methods

This section updates the book's "Methods" chapter in two ways. First, it gives pointers to three benchmark suites in common current use (SPEC, SPLASH-2, and TPC). Second, it gives examples of state-of-the-art methods and tools for analytic modeling (CMOS cache access times), simulation (SimpleScalar, RSIM, Simics, and SimOS), and system monitoring (executable editing with ATOM).

Benchmark suites provide a way for comparing system alternatives that is much less work than porting your important workloads to each alternative machine. The answers they give, however, may not be completely representative of your workloads, especially when vendors tune their designs for popular benchmark suites. Here we point to the Web pages of three currently popular benchmark suites.

Standard Performance Evaluation Corporation (SPEC) [4] provides a variety of benchmarks for evaluating systems. SPEC recently announced SPEC CPU2000 to replace the widely used but aging SPEC CPU95. Other SPEC suites include those for evaluating network file systems, Java virtual machines, and Web servers.

Stanford Parallel Applications for Shared Memory (SPLASH-2) [5] is the most popular benchmark suite for academic architects evaluating shared-memory multiprocessors. In the early 1990s, SPLASH-2's predecessor, SPLASH, provided a tremendous boost to the field by replacing benchmarks like matrix multiply and FFT. Today, however, SPLASH-2 has aged and is not clearly representative of what deployed shared-memory multiprocessors actually run.

Transaction Processing Performance Council (TPC) [7] provides benchmarks and results for several important database management system (DBMS) scenarios. Currently, the most popular are TPC-C for on-line transaction processing (OLTP) and TPC-D for decision support systems (DSS). OLTP is characterized by small, well-defined queries (e.g., at an automated bank teller machine), while DSS has longer, ad hoc queries (e.g., how do sales correlate with salesperson). Metrics measure throughput and cost-effectiveness (dollars per unit of throughput). TPC benchmarks are highly regarded for the specific scenarios they address.

A significant part of the book's "Methods" chapter focuses on the three basic methods computer architects use to study systems: analytic modeling, simulation, and system monitoring. We next point to examples that illustrate each.

CACTI [8] is an analytic model embodied in a downloadable tool. It allows one to estimate the access time of alternative cache structures on CMOS microprocessors within 10% of Hspice results. One can, for example, determine the access times for direct-mapped and two-way set-associative caches, obtain miss ratio data from other tools, and decide which alternative is preferred.
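
As a minimal sketch of that decision, the fragment below combines an access time with a miss ratio into an average memory access time. The access times, miss ratios, and miss penalty are made-up illustrative numbers, not CACTI output, and the function name `amat_ns` is hypothetical.

```python
def amat_ns(hit_time_ns, miss_ratio, miss_penalty_ns):
    """Average memory access time: hit time plus the expected cost of misses."""
    return hit_time_ns + miss_ratio * miss_penalty_ns


# Assumed inputs for illustration only: access times as an analytic model
# might report them, miss ratios as a cache simulator might report them.
MISS_PENALTY_NS = 50.0
candidates = {
    "direct-mapped": {"hit_time_ns": 2.0, "miss_ratio": 0.050},
    "two-way": {"hit_time_ns": 2.4, "miss_ratio": 0.040},
}

results = {
    name: amat_ns(c["hit_time_ns"], c["miss_ratio"], MISS_PENALTY_NS)
    for name, c in candidates.items()
}
best = min(results, key=results.get)  # alternative with the lowest average time
```

With these assumed numbers the two-way cache's lower miss ratio slightly outweighs its slower access time; different inputs can reverse the outcome, which is why both the access-time model and the miss-ratio data are needed.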

Micro-architectural simulators remain popular tools with computer architects. Wisconsin SimpleScalar [9] and Rice RSIM [2] are widely used in both research and teaching. SimpleScalar is a uniprocessor simulator, while RSIM focuses on multiprocessor issues. Stanford SimOS [1] [6] and Virtutech Simics [10] (a commercial product) provide a more complete system model (including operating system and input/output behavior) at the natural cost of being more challenging to use.

Finally, a significant step forward on the system monitoring front is executable editing, first embodied in ATOM [3]. ATOM (for Alpha executables), Wisconsin EEL (SPARC), and Washington Etch (IA-32) allow users to modify executables so that when they run they can perform auxiliary functions (e.g., count basic block executions) and still run at near full speed (for modest auxiliary functions).

References

Mendel Rosenblum, Edouard Bugnion, Scott Devine, and Stephen Herrod. Using the SimOS Machine Simulator to Study Complex Computer Systems. ACM Trans. on Modeling and Computer Simulation, 7(1):78-103, January 1997. URL: ftp://www-flash.stanford.edu/pub/hive/TOMACS96-simos.pdf (PDF).
The RSIM Project. URL: http://www.ece.rice.edu/~rsim/ (html).
Amitabh Srivastava and Alan Eustace. ATOM: A System for Building Customized Program Analysis Tools. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation (PLDI), pages 196-205, June 1994. URL: http://www.research.digital.com/wrl/publications/abstracts/94.2.html (html pointer to postscript and PDF).
Standard Performance Evaluation Corporation (SPEC). URL: http://www.specbench.org/ (html).
Stanford Parallel Applications for Shared Memory (SPLASH-2). URL: http://www-flash.stanford.edu/apps/SPLASH/ (html).
Stanford SimOS: The Complete Machine Simulator. URL: http://simos.stanford.edu/ (html).
Transaction Processing Performance Council (TPC). URL: http://www.tpc.org (html).
Steven J. E. Wilton and Norman P. Jouppi. An Enhanced Access and Cycle Time Model for On-Chip Caches. Compaq WRL Research Report 93/5, July 1994. URL: http://www.research.digital.com/wrl/techreports/abstracts/93.5.html (html pointer to postscript and PDF).
SimpleScalar: Simulation Tools for Microprocessor & System Evaluation. URL: http://www.simplescalar.org (html).
Virtutech Simics. URL: http://www.simics.com (html).

3. Instruction Sets

Instruction sets that are in common use today include RISC instruction sets such as the Compaq Alpha [1], Apple/IBM/Motorola PowerPC [7], MIPS [5], and Sun SPARC [8], as well as CISC instruction sets such as the Motorola 68K [6] and the Intel IA-32 [2]. Implementations of these account for most of the volume of general-purpose microprocessors shipped today.

Recently HP and Intel have disclosed parts of the IA-64 architecture [3]. IA-64 is similar in many aspects to the PlayDoh architecture, a research vehicle developed at Hewlett-Packard Laboratories. The technical report by Kathail, Schlansker and Rau [4] describes the PlayDoh architecture.

Multimedia extensions to instruction sets are discussed in Chapter 8.

References

Compaq Alpha OEM Documentation Library. URL: http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html (html).
Intel IA-32 Architecture Reference Shelf. URL: http://developer.intel.com/vtune/cbts/refman.htm (html).
Intel IA-64 Architecture. URL: http://developer.intel.com/design/ia64/architecture.htm (html).
V. Kathail, M. S. Schlansker and B. R. Rau. HPL PlayDoh Architecture Specification: Version 1.0. Technical Report HPL-93-80, Hewlett-Packard Laboratories, Palo Alto, California, February 1993. URL: http://www.hpl.hp.com/techreports/93/HPL-93-80.html (html).
MIPS Technologies Inc. Publications. URL: http://www.mips.com/publications/index.html (html).
Motorola 68K/Coldfire Documentation Library. URL: http://www.mot.com/SPS/HPESD/prod/docframe/docs_frame.html (html).
Motorola PowerPC documents home page. URL: http://www.mot.com/SPS/PowerPC/index.html (html).
UltraSPARC User's Manual. URL: http://www.sun.com/microelectronics/manuals/ultrasparc/802-7220-02.pdf (pdf).

4. Instruction Level Parallelism (ILP)

Instruction level parallelism continues to play a very important role in both industrial processor designs and academic research. This section looks at some recent trends.

An important subject of investigation is the exploitation of control independence. Current implementations of dynamically-scheduled processors squash all speculative instructions on a branch misprediction, including control independent instructions. The paper by Lam and Wilson [2] presents a limits study of how control flow constrains parallelism. In addition to being an example of how to carry out a limits study, it highlights the importance of exploiting control independence.

A processor's sustained instruction execution rate is limited by the rate at which instructions are fetched, and this is in turn limited by branch prediction (among other things such as instruction cache organization). Since the early 1990s there has been a lot of work in branch prediction, including work on innovative branch predictors and on understanding why they work. The paper by McFarling [4] suggests that multiple branch predictors could be combined to achieve prediction rates better than those achievable with a single predictor. The paper by Young, Gloy and Smith [9] provides a framework for characterizing different branch prediction schemes. The paper by Emer and Gloy [1] presents a language for describing predictors. A formal description allows the design space to be explored in a systematic manner, and also allows predictors to be synthesized.
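
The combining idea can be sketched as follows. This is a minimal illustration, assuming 2-bit saturating counters, one component indexed by the branch address alone, a second component indexed by the address XORed with global history, and a chooser table that learns which component to trust. The table sizes and class names are hypothetical and are not taken from the cited paper.

```python
class CounterTable:
    """Table of 2-bit saturating counters (0..3); a value >= 2 means 'yes'."""

    def __init__(self, bits):
        self.mask = (1 << bits) - 1
        self.counters = [1] * (1 << bits)  # start in the weak 'no' state

    def predict(self, index):
        return self.counters[index & self.mask] >= 2

    def update(self, index, up):
        i = index & self.mask
        if up:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class CombiningPredictor:
    def __init__(self, bits=10):
        self.bits = bits
        self.by_pc = CounterTable(bits)       # indexed by branch address only
        self.by_history = CounterTable(bits)  # indexed by address XOR history
        self.chooser = CounterTable(bits)     # >= 2 means trust by_history
        self.history = 0                      # recent outcomes, newest in bit 0

    def predict(self, pc):
        if self.chooser.predict(pc):
            return self.by_history.predict(pc ^ self.history)
        return self.by_pc.predict(pc)

    def update(self, pc, taken):
        p_pc = self.by_pc.predict(pc)
        p_hist = self.by_history.predict(pc ^ self.history)
        # Train the chooser only when the two components disagree.
        if p_pc != p_hist:
            self.chooser.update(pc, p_hist == taken)
        self.by_pc.update(pc, taken)
        self.by_history.update(pc ^ self.history, taken)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.bits) - 1)


def accuracy(predictor, trace):
    """trace: list of (pc, taken) pairs; returns the fraction predicted correctly."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)
```

On a strictly alternating branch, for example, the address-indexed component keeps mispredicting while the history-indexed component learns the pattern, so the chooser drifts toward the latter; on an always-taken branch the address-indexed component suffices.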

In dynamically-scheduled processors, dependencies between instructions involving register operands can easily be determined since the register names are available for the decoder to see. Determining dependencies between memory instructions is more complicated since an address calculation is required. When addresses are not available (or cannot be used in an efficient manner) to determine the (in)dependence relationship between two memory operations, guesses need to be made about the relationship. (The conservative guess is that a dependence exists whereas an aggressive guess is that no dependence exists.) The paper by Moshovos, Breach, Vijaykumar and Sohi [5] discusses the subject of data dependence speculation. As processors become more aggressive, the use of data dependence speculation techniques is likely to become more pervasive.

With branch prediction being used to overcome control dependence constraints, renaming being used to overcome name dependence constraints, and data dependence speculation being used to overcome ambiguous dependence constraints, we are left with true dependence constraints. Value speculation is a technique to overcome true dependencies: the output of an instruction (or alternatively the inputs of an instruction) is predicted, and instructions are executed with the predicted value, in parallel with the instructions that produce the actual value. The paper by Lipasti and Shen [3] discusses how value speculation can be used to exceed the dataflow limit inherent in a program's execution. In recent years there has been a lot of research on understanding the causes of value predictability, and on building better value predictors.
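
To illustrate what predicting an instruction's output can mean, here is a minimal sketch of a last-value scheme: each static instruction is predicted to produce the same value it produced the previous time. The unbounded table, the class name, and the `run` helper are simplifying assumptions for illustration and are not claimed to match the design evaluated in the cited paper.

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""

    def __init__(self):
        self.table = {}  # instruction address -> last observed result

    def predict(self, pc):
        return self.table.get(pc)  # None means "no prediction available"

    def train(self, pc, actual):
        self.table[pc] = actual


def run(trace):
    """trace: iterable of (pc, actual_value). Returns (correct, attempted)."""
    predictor = LastValuePredictor()
    correct = attempted = 0
    for pc, actual in trace:
        guess = predictor.predict(pc)
        if guess is not None:
            attempted += 1
            correct += guess == actual
        # A real pipeline would squash and re-execute the consumers of a
        # wrong guess; this sketch only counts hits and misses.
        predictor.train(pc, actual)
    return correct, attempted
```

Dependent instructions can start early only when the guess is available and turns out to be right, so the ratio of correct to attempted predictions bounds the benefit.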

As shrinking feature sizes allow increased logic speeds, the importance of wire delays increases. Many microprocessor designers believe that wire delays will dominate in future microprocessor designs, and that building superscalar processors with centralized instruction windows will not be feasible. The paper by Palacharla, Jouppi and Smith [6] analyzes the important logic elements of a dynamic issue processor and provides quantitative evidence to support this sentiment. Designers and researchers are investigating decentralized or distributed microarchitectures in which critical functional elements are decentralized. The Compaq Alpha 21264 microprocessor is on the leading edge of this trend: it divides the instruction selection and execution logic into two. Future microprocessors are likely to see a decentralization of all aspects of instruction processing.

Finally, we come to one model for future processors that is being investigated heavily: distributed microarchitectures that combine ILP and thread-level parallelism (TLP), using thread-level speculation. The paper by Sohi, Breach, and Vijaykumar [8] describes the multiscalar architecture in which all important functions are implemented in a distributed fashion. A serial program is speculatively parallelized using thread-level speculation and executed on a parallel microarchitecture with support for recovering from misspeculations.

Statically-scheduled ILP processors have also come back to the forefront with the announcement of HP and Intel's Explicitly Parallel Instruction Computing (EPIC) technology, as epitomized by the recently-announced IA-64 architecture. Research papers on technologies related to EPIC architectures can be found on the Web site of the Illinois IMPACT research group. A recent paper by Schlansker and Rau [7] describes the EPIC philosophy and the set of architectural features that characterize the EPIC style of ILP architectures.

References

Joel Emer and Nikolas Gloy. A Language for Describing Predictors and its Application to Automatic Synthesis, Proc. 24th Annual International Symposium on Computer Architecture, June 1997. URL: http://www.dec.com/semiconductor/alpha/papers/bpgp-abstract.htm (html).
Monica S. Lam and Robert P. Wilson. Limits of Control Flow on Parallelism, Proc. 19th Annual International Symposium on Computer Architecture, May 1992. URL: http://suif.stanford.edu/papers/lam92/paper.html (html).
M. H. Lipasti and J. P. Shen. Exceeding the Dataflow Limit via Value Prediction, Proc. 29th Annual International Symposium on Microarchitecture, December 1996. URL: http://www.ece.cmu.edu/~mhl/www/micro29.ps.gz (postscript).
Scott McFarling. Combining Branch Predictors, WRL Technical Note TN-36, Digital Equipment Corp., June 1993. URL: http://www.research.digital.com/wrl/techreports/abstracts/TN-36.html (html).
A. Moshovos, S. E. Breach, T. N. Vijaykumar and G. S. Sohi. Dynamic Speculation and Synchronization of Data Dependences, Proc. 24th Annual International Symposium on Computer Architecture, June 1997. URL: ftp://ftp.cs.wisc.edu/sohi/papers/1997/isca.data-dep-spec.pdf (PDF).
Subbarao Palacharla, Norman P. Jouppi and J. E. Smith. Complexity-effective superscalar processors, Proc. 24th Annual International Symposium on Computer Architecture, June 1997. URL: ftp://ftp.cs.wisc.edu/sohi/papers/1997/isca.complexity.pdf (PDF).
M. Schlansker and B. R. Rau. EPIC: An Architecture for Instruction-Level Parallel Processors. Technical Report HPL-1999-111, Hewlett-Packard Laboratories, Palo Alto, California, February 2000. URL: http://www.hpl.hp.com/techreports/1999/HPL-1999-111.html (abstract in html).
G. S. Sohi, S. E. Breach and T. N. Vijaykumar. Multiscalar Processors, Proc. 22nd Annual International Symposium on Computer Architecture, June 1995. URL: ftp://ftp.cs.wisc.edu/sohi/papers/1995/isca.multiscalar.pdf (PDF).
Cliff Young, Nikolas Gloy and Michael D. Smith. A Comparative Analysis of Schemes for Correlated Branch Prediction, Proc. 22nd Annual International Symposium on Computer Architecture, June 1995. URL: http://www.eecs.harvard.edu/machsuif/publications/isca95.pdf (postscript).

5. Dataflow and Multithreading

Much of the work in dataflow and multithreading is finding its way into processor designs through the route of instruction-level parallelism. Dynamically-scheduled superscalar processors strive to achieve dataflow-like execution (albeit of control- and data-speculative instructions). The instruction window and instruction wakeup logic of current generation superscalar processors can be considered analogous to the token store of a classical tagged-token dataflow machine. As might be expected, researchers are investigating alternative ways of implementing this functionality (analogous to the explicit token store simplification of the tagged token store).

Many processors currently in the design process (circa 1999) are rumored to support the simultaneous or parallel variety of multithreading. If such support is available, an obvious question is how to use this functionality in non-traditional ways. Two recent proposals try to exploit the available multithreaded support to assist the main computation thread. In the first approach, software exception handlers (e.g., TLB miss handlers or profiling exception handlers) are executed as a separate thread so that the main computation thread does not have to be switched out. A paper by Zilles, Emer, and Sohi describes this approach [3]. In the second approach, "microthreads" are created, either from the program itself, or via a separate specification. These microthreads are then executed either speculatively or non-speculatively to carry out functions that assist the main thread. Roth, Moshovos and Sohi illustrate how such a speculative thread could be extracted from the program and assist with the task of prefetching linked data structures [2]. Chappell, Stark, Kim, Reinhardt and Patt show how a subordinate thread could be specified as a piece of microcode and give an example of how it could be used to assist the branch prediction process [1]. The use of such "microthreads" (also called "subordinate threads", "scout threads", or "helper threads") is likely to grow in future microprocessors as the available chip real estate allows multiple threads to be supported.

References

Robert S. Chappell, Jared Stark, Sangwook P. Kim, Steven K. Reinhardt and Yale N. Patt. Simultaneous Subordinate Microthreading (SSMT), In Proc. 26th Annual International Symposium on Computer Architecture, May 1999. URL: http://www.eecs.umich.edu/HPS/pub/ssmt_isca26.ps (postscript).
Amir Roth, Andreas Moshovos, and Gurindar S. Sohi. Dependence Based Prefetching for Linked Data Structures, In 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Oct 1998. URL: ftp://ftp.cs.wisc.edu/sohi/papers/1998/asplos-prefetch-lds.pdf (PDF).
Craig Zilles, Joel Emer, and Gurindar S. Sohi. The Use of Multithreading for Software Exception Handling, In Proc. 32nd Annual International Symposium on Microarchitecture, November 1999. URL: ftp://ftp.cs.wisc.edu/sohi/papers/1999/except-thrd.micro.pdf (PDF).

6. Memory Systems

Technology and application trends continue to drive memory system changes. This section looks at four issues: instruction delivery, prefetching, translation lookaside buffers, and DRAM interface changes. These and related issues are discussed in the February 1999 special issue of IEEE Transactions on Computers [3].

Delivering instructions to a modern out-of-order pipeline is both critical to performance and challenging to do. For a pipeline to sustain R retired instructions per cycle, it must be fueled with I > R instructions per cycle (since some instructions are squashed). Since instruction cache misses cause cycles with no instructions delivered, one must both make cache misses rare and deliver J > I instructions when misses are not occurring. A key additional challenge occurs when J becomes larger than the number of instructions in a typical basic block.
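
To make the relationship among R, I, and J concrete, the following back-of-the-envelope sketch assumes that a fixed fraction of fetched instructions is squashed and that a fixed fraction of cycles delivers no instructions because of instruction cache misses. Both fractions, the example numbers, and the function name are illustrative assumptions.

```python
def required_fetch_rates(retire_rate, squash_fraction, miss_cycle_fraction):
    """Return (I, J) needed to sustain R retired instructions per cycle.

    I is the average fetch rate over all cycles; J is the rate needed in
    the cycles that are not lost to instruction cache misses.
    """
    fetch_avg = retire_rate / (1.0 - squash_fraction)        # I > R
    fetch_peak = fetch_avg / (1.0 - miss_cycle_fraction)     # J > I
    return fetch_avg, fetch_peak


# Assumed example: retire 4 per cycle, 20% of fetched instructions squashed,
# 10% of cycles deliver nothing because of instruction cache misses.
I, J = required_fetch_rates(retire_rate=4.0, squash_fraction=0.2,
                            miss_cycle_fraction=0.1)
```

Under these assumptions I is 5 and J is about 5.6 instructions per cycle, so even a modest retire rate calls for fetching more instructions per cycle than R alone suggests.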

Here we point to two papers that give some current thinking on instruction delivery and specifically address how to fetch from more than one basic block per cycle. Yeh et al. [9] propose a design that makes multiple branch predictions per cycle and then uses them to access an instruction cache that permits multiple fetches to different basic blocks each cycle. Rotenberg et al. [6] instead propose coalescing consecutive dynamic instructions (even across branches) into a trace and then caching the traces in a trace cache that need only provide one trace per cycle.

Prefetching attempts to obtain data or instructions from memory prior to their use. This typically reduces average memory latency at a cost of additional bandwidth used (e.g., for items incorrectly prefetched). An important type of prefetching is called non-binding prefetching. Non-binding prefetching moves data (or instructions) into caches but leaves them under the jurisdiction of coherence protocols so that prefetching cannot cause program correctness violations.

Here we point to two papers that highlight the two key ways of initiating non-binding prefetches: software and hardware. Mowry [4] discusses non-binding data prefetches initiated by software after being inserted in applications by a compiler. He reviews a uniprocessor compiler algorithm and then delves into the more challenging problem of prefetching in a shared-memory multiprocessor. Joseph and Grunwald [2] investigate hardware-initiated prefetching. They review well-established prior work, such as stride prefetchers and stream buffers, and then develop their aggressive Markovian prefetcher which will, at minimum, creatively use millions of transistors.
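
As a rough illustration of hardware-initiated prefetching driven by miss history, the sketch below remembers, for each miss address, the addresses that most recently missed right after it, and proposes them as prefetch candidates the next time that address misses. The table organization (a few most-recent successors per entry, unbounded number of entries) and all names are simplifying assumptions and are not claimed to match the cited design.

```python
from collections import OrderedDict, defaultdict


class MarkovPrefetcher:
    def __init__(self, successors_per_entry=2):
        self.width = successors_per_entry
        self.table = defaultdict(OrderedDict)  # miss addr -> recent successors
        self.last_miss = None

    def on_miss(self, addr):
        """Record the transition last_miss -> addr; return prefetch candidates."""
        if self.last_miss is not None:
            succ = self.table[self.last_miss]
            succ.pop(addr, None)          # re-insert so addr becomes most recent
            succ[addr] = True
            while len(succ) > self.width:
                succ.popitem(last=False)  # drop the least recently seen successor
        self.last_miss = addr
        if addr not in self.table:
            return []
        # Most recently seen successors first.
        return list(reversed(self.table[addr]))
```

For the miss sequence A, B, C, A, the second miss on A returns B as a candidate, since B followed A the previous time.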

Another challenge is designing memory management units (MMUs) to implement virtual memory. MMU designs must support applications with very large memory footprints (e.g., databases with gigabyte buffer pools) and poor locality (e.g., some object-oriented programs) and yet keep costs down since the same microprocessor design is often deployed in both large servers and small desktops. Talluri and Hill [8] discuss translation lookaside buffer (TLB) alternatives to the brute-force approach of just increasing the number of TLB entries. In particular, they explore using large aligned pages called superpages and/or having multiple translations per TLB entry (called subblocking). Jacob and Mudge [1] survey six commercial MMU designs.
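
One way to see why these alternatives help is to compute how much memory a TLB can map at once. The sketch below is simple arithmetic; the entry count, page sizes, and subblock factor are assumed example values, and the function name is hypothetical.

```python
def tlb_reach_bytes(entries, page_bytes, translations_per_entry=1):
    """Memory mapped by a full TLB, ignoring alignment and sharing effects."""
    return entries * translations_per_entry * page_bytes


KB = 1024
# Assumed example: a 64-entry TLB with 4 KB base pages.
base = tlb_reach_bytes(64, 4 * KB)
# Same TLB holding only 64 KB superpages.
superpage = tlb_reach_bytes(64, 64 * KB)
# Same TLB with 16 base-page translations per entry (subblocking).
subblocked = tlb_reach_bytes(64, 4 * KB, translations_per_entry=16)
```

With these numbers the base design maps 256 KB, while either alternative maps 4 MB without adding entries; the two differ in what they require of the operating system, which is the subject of the cited paper.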

Finally, memory systems are being significantly changed through evolution and revolution of the interface between processors and DRAM memory. Driving these changes is the trend that the memory chip capacity is growing faster than typical memory system sizes. The impact of this trend is to reduce the number of memory chips in a system needed to provide memory capacity to far fewer than are needed to provide adequate memory bandwidth across the traditional DRAM interface. To address this problem, researchers and practitioners have proposed modest changes to DRAM chip interfaces, more radical changes, and even merging processing and DRAM on the same chip.
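
The capacity-versus-bandwidth mismatch can be illustrated with simple arithmetic. In the sketch below every number is an assumption chosen for illustration (system size, per-chip capacity, bandwidth demand, and a per-chip bandwidth held constant across chip generations), and the function name is hypothetical.

```python
import math


def chips_needed(capacity_mb, chip_capacity_mb, bandwidth_mb_s, chip_bandwidth_mb_s):
    """Return (chips needed for capacity, chips needed for bandwidth)."""
    for_capacity = math.ceil(capacity_mb / chip_capacity_mb)
    for_bandwidth = math.ceil(bandwidth_mb_s / chip_bandwidth_mb_s)
    return for_capacity, for_bandwidth


# Assumed example: a 128 MB system whose processor wants 1600 MB/s,
# built from chips that each supply 100 MB/s over the traditional interface.
older = chips_needed(128, 8, 1600, 100)    # 8 MB chips
newer = chips_needed(128, 32, 1600, 100)   # 32 MB chips
```

With the smaller chips, capacity and bandwidth both call for 16 chips; with the larger chips, capacity needs only 4 while bandwidth still needs 16, which is the gap that faster chip interfaces aim to close.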

The November/December 1997 issue of IEEE Micro [7] provides a place to start to understand the issues forcing DRAM interface changes. It includes discussions of semiconductor issues, Rambus, SLDRAM, and merged logic and DRAM, but alas it is already two years old in this fast-changing area. More recently, Rambus [5] provides a cogent (but naturally biased) case for its products. Readers should stay tuned, as next year's picture of DRAM interfaces will likely be different from this year's.

References

Bruce Jacob and Trevor Mudge. Virtual Memory in Contemporary Microprocessors. IEEE Micro, 18(4):60-75, July/August 1998. URL: http://www.eecs.umich.edu/~tnm/papers/vm-hardware.pdf (PDF).
Doug Joseph and Dirk Grunwald. Prefetching Using Markov Predictors. In Proceedings of the International Symposium on Computer Architecture, pages 252-263, June 1997. URL: http://www.cs.colorado.edu/~grunwald/Papers/ISCA97-MarkovPrefetch/paper.pdf (PDF).
Veljko Milutinovic and Mateo Valero, Editors. Special Issue on Cache Memory and Related Problems. IEEE Transactions on Computers, 48(2), February 1999. Web table of contents: http://computer.org/tc/tc1999/t2toc.htm (html). (IEEE members with Digital Library subscriptions can download PDF for all papers.)
Todd C. Mowry. Tolerating Latency in Multiprocessors Through Compiler-Inserted Prefetching. ACM Transactions on Computer Systems, 16(1):55-92, February 1998. URL: http://www.acm.org/pubs/articles/journals/tocs/1998-16-1/p55-mowry/p55-mowry.pdf (PDF).
Rambus Technology Overview. February 1999. URL: http://www.rambus.com/docs/techover.pdf (PDF).
Eric Rotenberg, Steve Bennett, J. E. Smith. Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. In Proceedings of International Symposium on Microarchitecture, pages 24-34, December 1996. URL: ftp://ftp.cs.wisc.edu/sohi/papers/1996/micro.trace-cache.pdf (PDF).
Ken Sakamura. Special Issue on Advanced DRAM Technology. IEEE Micro, 17(6), November/December 1997. Web table of contents: http://computer.org/micro/mi1997/m6toc.htm (html). (IEEE members with Digital Library subscriptions can download PDF for all papers.)
Madhusudhan Talluri and Mark D. Hill. Surpassing the TLB Performance of Superpages with Less Operating System Support. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, pages 171-182, October 1994. URL: ftp://ftp.cs.wisc.edu/markhill/Papers/asplos6_superpages.pdf (PDF).
Tse-Yu Yeh, D. Marr, and Yale N. Patt. Increasing the Instruction Fetch Rate Via Multiple Branch Prediction and Branch Address Cache. In Proceedings of the 1993 ACM International Conference on Supercomputing, pages 51-61, July 1993. URL: http://www.eecs.umich.edu/HPS/pub/bac_ics93.ps (postscript).

7. I/O: Storage Systems, Networks, and Graphics

In this section we update the reader with more recent references on storage systems, cluster and multiprocessor networks, and computer graphics. Material already in the book is not duplicated here.

There have been a number of developments in I/O standards in 1999-2000. Although the PCI bus has been extended to the PCI-X bus [14], within several years many future I/O systems will be based around the Infiniband [10] industry standard. Infiniband is the merger of several competing industry standards. It uses high-speed (2.5Gb/sec) point-to-point serial links as a basic building block.

Other recent developments include further expansion of Fibre Channel [5] and storage-area networks (SANs). Network-attached secure disks have been quickly moving from research [7] to high-volume products.

Adaptec's I/O Connection [2] is a pictorial reference on SCSI and PC cabling.

More recent work on RAID file servers appears in [3] and [12].

Dunning et al. [4] present Compaq/Intel/Microsoft's Virtual Interface Architecture (VIA).

Galles [6] describes the SGI Origin system interconnect.

Gillet [8] describes Compaq's memory channel, used for building clusters.

Horowitz et al. [9] is a good reference for electrical issues involved in high-speed signalling.

Intel's developer web pages [11] contain useful information on PC standards and initiatives.

The design of a 50-Gb/s IP router is described by Partridge et al. [13].

The PC Webopedia [15] and the PC Guide [16] contain lots of information on PC I/O standards.

The proceedings of the 1998 and 1999 Eurographics/SIGGRAPH Workshops on Graphics Hardware [17] [18] are both online. Recent proceedings of SIGGRAPH [19] also have a limited number of papers on graphics hardware, but they are not available online except through the ACM Digital Library [1].

References

The ACM Digital Library. URL: http://www.acm.org/dl/ (html).
Adaptec's "The I/O Connection". URL: http://www.adaptec.com/products/guide/ioposter.html (html).
Ann Drapeau et al. RAID-II: A High-Bandwidth Network File Server. In Proceedings of the 21st Annual International Symposium on Computer Architecture, Chicago, April, 1994, pp 234-244.
Dave Dunning et al. The Virtual Interface Architecture. IEEE Micro, 18(2), March/April 1998, pp 66-76.
Fibre Channel Industry Association. URL: http://www.fibrechannel.com/ (html).
Mike Galles. Spider: A High-Speed Network Interconnect. IEEE Micro, February 1997, pp 34-39.
Garth Gibson et al. Network-Attached Secure Disks. In the Proceedings of ASPLOS 1998 and also online at URL: http://www.pdl.cmu.edu/PDL-FTP/NASD/asplos98.ps.
Richard B. Gillett. Memory Channel Network for PCI. IEEE Micro, 16(1), February 1996, pp 12-18.
Mark Horowitz, Chih-Kong Ken Yang, and Stefanos Sideropoulos. High-Speed Electrical Signalling: Overview and Limitations. IEEE Micro, February 1998, pp 12-24.
Infiniband Trade Association. URL: http://www.infinibandta.org/ (html).
Intel Developer Web Site. URL: http://developer.intel.com/ (html).
Jai Menon and Jim Cortney, The Architecture of a Fault-Tolerant RAID Controller. In the proceedings of the International Symposium on Computer Architecture (ISCA), 1993, pages 76-86.
Craig Partridge et al. A 50-Gb/s IP Router. In IEEE/ACM Transactions on Networking, October 1998, pages 237-248.
PCI-X industry enablement program. URL: http://www.compaq.com/products/servers/technology/pci-x-enablement.html (html).
PC Webopedia. URL: http://www.pcwebopedia.com/ (html).
PC Guide. URL: http://www.pcguide.com/ (html).
SIGGRAPH/Eurographics 1998 Workshop on Graphics Hardware. URL: http://www.merl.com/hwws98/ (html).
SIGGRAPH/Eurographics 1999 Workshop on Graphics Hardware. URL: http://www.merl.com/hwws99/ (html).
1999 SIGGRAPH proceedings, ACM Press.

8. Single-Instruction Multiple Data (SIMD) Parallelism

SIMD principles are being employed heavily in the design of multimedia extensions to the base instruction sets of many architectures. These instruction sets are designed to carry out parallel (SIMD) operations on voice, image, and video data streams. Most of the incumbent architectures have adopted multimedia extensions. Intel started out with the MMX extensions [1] in the Pentium microprocessor, and later extended them to the Intel Internet Streaming SIMD [2] extensions in the Pentium III. Sun Microsystems added the VIS [4] extensions to the UltraSPARC architecture, and Apple/IBM/Motorola have added the AltiVec [3] extensions to the PowerPC architecture.

References

Intel Technology Journal, Q3 1997 issue. URL: http://developer.intel.com/technology/itj/q31997.htm (html).
Intel Technology Journal, Q2 1999 issue. URL: http://developer.intel.com/technology/itj/q21999.htm (html).
Motorola AltiVec Technology. URL: http://www.motorola.com/SPS/PowerPC/AltiVec/ (html).
VIS Instruction Set User's Manual. URL: http://www.sun.com/microelectronics/manuals/805-1394.pdf (pdf).

9. Multiprocessors and Multicomputers

This section first discusses advances in shared-memory multiprocessors and then examines some multicomputer/cluster software issues.

Recall that multiprocessors can implement cache coherence (designated CC) or not (NCC) and may have uniform (UMA) or non-uniform (NUMA) memory access times. This leads to four multiprocessor classifications for which we next discuss three recent notable machines: CC-UMA Sun Ultra 10000, CC-NUMA SGI Origin 2000, NCC-UMA (no machine discussed), and NCC-NUMA Cray T3E. We also discuss Compaq Piranha, which combines CC-UMA on-chip with CC-NUMA between chips.

Sun Ultra 10000 [6], code-named Starfire, is a symmetric multiprocessor (SMP) or CC-UMA that pushes the limits of an SMP to support 64 processors and a memory bandwidth over 10 gigabytes per second. Starfire uses snooping coherence but does not have a bus implemented with shared wires. Address transactions travel on four interleaved "buses," implemented with totally-ordered pipeline broadcast trees. Data blocks move on a separate unordered point-to-point crossbar.

Silicon Graphics (SGI) Origin 2000 [9] is a CC-NUMA that scales to 1024 processors and a terabyte of memory. It maintains coherence with a distributed directory protocol derived from that of Stanford DASH, but uses a hypercube-like interconnection network that is more scalable than DASH's mesh.

Cray/SGI T3E [12] is an NCC-NUMA that scales to 2048 DEC Alpha 21164 processors and 4 terabytes of memory. The T3E must overcome the fact that the 21164 was designed for systems with more modest requirements for physical address space, outstanding cache misses, TLB reach, etc. It does so with a set of elegant mechanisms, in part learned from the more ad hoc first-generation T3D. The mechanisms added to the "shell" outside each 21164 include a set of E-registers that replace the level two cache with something like vector registers, an address centrifuge for creating a large interleaved physical address space, barrier and eureka hardware, and message queue management in memory.

Compaq Piranha [4] is a research prototype that explores cache-coherent multiprocessing within and between chips. Each chip includes eight simple in-order processors (with private level one instruction and data caches), a shared level two cache that is eight-way interleaved and directly connected to eight banks of Rambus RDRAMs, two microprogrammable protocol engines for inter-chip coherence, and an interconnection router. Performance simulations with two commercial workloads show one chip multiprocessor (CMP) can significantly outperform a system using a single complex out-of-order processor chip.

One issue that has made programming and implementing shared-memory multiprocessors challenging is dealing with memory consistency models, which formally define the memory interface between hardware and software. Adve and Gharachorloo [2] concisely review the state of the art of memory consistency models, including key differences between well-known academic and commercial models. Gniady et al. [8], on the other hand, present evidence that speculative execution may encourage hardware designers to return to the relatively simple model of sequential consistency. Nevertheless, compiler writers will likely continue to target relaxed models when considering optimizations like code motion and register allocation, particularly on statically scheduled machines like the Merced implementation of IA-64.
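To make the notion of sequential consistency concrete, the following minimal sketch enumerates every sequentially consistent execution of a classic two-thread litmus test. The program, the variable names, and the `sc_outcomes` helper are illustrative assumptions and are not taken from [2] or [8].

```python
from itertools import permutations

# Two threads in program order (a hypothetical litmus test).
# Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
THREADS = [
    [("store", "x", 1), ("load", "y", "r0")],
    [("store", "y", 1), ("load", "x", "r1")],
]

def sc_outcomes(threads):
    """Return the set of (r0, r1) results over all sequentially consistent runs."""
    # One slot per operation, labeled with its thread id. Any ordering of the
    # slots is an interleaving that still respects each thread's program order.
    slots = [tid for tid, ops in enumerate(threads) for _ in ops]
    outcomes = set()
    for order in set(permutations(slots)):
        memory = {"x": 0, "y": 0}      # all locations start at zero
        regs = {}
        next_op = [0] * len(threads)   # per-thread program counter
        for tid in order:
            kind, var, arg = threads[tid][next_op[tid]]
            next_op[tid] += 1
            if kind == "store":
                memory[var] = arg
            else:
                regs[arg] = memory[var]
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes

print(sorted(sc_outcomes(THREADS)))
```

Under sequential consistency the result (0, 0) never appears, because some store must come first in any single global order. A relaxed model that lets a load bypass an earlier store to a different address can additionally produce (0, 0), which is one reason the choice of model matters to programmers.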

Multicomputers and clusters are parallel systems with nodes joined by a network usually connected to a node's I/O system. Each node is logically like a common off-the-shelf (COTS) computer and, in some cases, is just a COTS computer. Multicomputer research, therefore, often focuses on using COTS parts and innovating in the software instead. Here we point out five software developments.

The Message Passing Interface (MPI) [10] is a highly-successful effort to unify the many vendor-specific messaging systems that existed before it. It facilitates portable codes and good performance for applications that can be structured to send coarse-grain messages (kilobytes to megabytes).

Several research proposals have looked at supporting fine-grained messages (e.g., the few words needed in a request control message). Notable among these is Berkeley's Active Messages [7] that allows a short message to specify a handler for receiving the message and possibly generating a low-overhead response message. Active Messages, for example, allow a reasonable implementation of the shared-memory Split-C programming model on non-shared-memory multicomputers.
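The core idea, a short message that names the handler to run on arrival, can be sketched as follows. This is only a toy model under stated assumptions: the `Node` class, the handler names, and the in-process queue standing in for the network are all hypothetical and do not reflect the Berkeley implementation.

```python
from collections import deque

class Node:
    """A hypothetical cluster node with a table of message handlers."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network   # shared queue standing in for the interconnect
        self.memory = {}         # this node's local data
        self.replies = []        # values returned by remote reads
        self.handlers = {
            "read_request": self.on_read_request,
            "read_reply": self.on_read_reply,
        }

    def send(self, dest, handler, *args):
        # The message carries the name of the handler to run at the receiver.
        self.network.append((dest, handler, (self.node_id,) + args))

    def on_read_request(self, src, key):
        # The handler runs on arrival and immediately issues a short reply.
        self.send(src, "read_reply", key, self.memory.get(key))

    def on_read_reply(self, src, key, value):
        self.replies.append((key, value))

def run(network, nodes):
    """Deliver messages until the network is idle."""
    while network:
        dest, handler, args = network.popleft()
        nodes[dest].handlers[handler](*args)

def demo():
    network = deque()
    nodes = {0: Node(0, network), 1: Node(1, network)}
    nodes[1].memory["a"] = 42
    nodes[0].send(1, "read_request", "a")   # remote read of "a" on node 1
    run(network, nodes)
    return nodes[0].replies

print(demo())
```

The request/reply pair here resembles a remote read, the kind of primitive a shared-memory programming model needs when the underlying machine offers only messages.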

Considerable progress has been made on software distributed shared memory systems that implement a shared memory application binary interface (ABI) on a cluster of computers. Rice Treadmarks [3] tolerates long network latencies with techniques like lazy release consistency, while Compaq (formerly DEC) Shasta [11] implements its ABI with sufficient fidelity to run a commercial database management system.

NASA Beowulf [5] is an effort to "put it all together." Beowulf provides software (or pointers to software) that allows non-experts to put together clusters of PCs using the Linux operating system to run parallel applications with middleware like MPI. There is no magic here. Beowulf is notable because people find that it works.

Finally, the gigantic end of parallel computing is being probed by the U.S. Department of Energy's Accelerated Strategic Computing Initiative (ASCI) Platforms [1]. Unfortunately, these machines include little long-term computer architecture research and cost so much that it is difficult to separate reality from the need to declare success.

References

Accelerated Strategic Computing Initiative (ASCI) Platforms. URL: http://www.llnl.gov/asci/platforms/ (html).
Sarita V. Adve and Kourosh Gharachorloo. Shared Memory Consistency Models: A Tutorial. IEEE Computer, 29(12):66-76, December 1996. Web technical report: http://rsim.cs.uiuc.edu/~sadve/Publications/models_tutorial.ps (postscript).
Cristiana Amza, Alan L. Cox, Sandhya Dwarkadas, Pete Keleher, Honghui Lu, Ramakrishnan Rajamony, Weimin Yu, and Willy Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, 29(2):18-28, February 1996. URL: http://www.cs.rice.edu/~willy/papers/computer96.ps.gz (gzipped postscript).
Luiz Andre Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of International Symposium on Computer Architecture, pages 282-293, June 2000. URL: http://research.compaq.com/wrl/projects/Database/isca00.pdf (PDF).
Beowulf. URL: http://www.beowulf.org/ (html).
Alan Charlesworth. Starfire: Extending the SMP Envelope. IEEE Micro, 18(1):39-49, January/February 1998. Web: http://computer.org/micro/mi1998/m1039abs.htm (html abstract. IEEE members with Digital Library subscriptions can download PDF.)
Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active Messages: a Mechanism for Integrating Communication and Computation. In Proceedings of International Symposium on Computer Architecture, pages 256-266, May 1992. URL: http://www.cs.cornell.edu/Info/Projects/CAM/isca92.ps (postscript).
Chris Gniady, Babak Falsafi, and T. N. Vijaykumar. Is SC + ILP = RC? In Proceedings of International Symposium on Computer Architecture, pages 162-171, May 1999. URL: http://www.ece.purdue.edu/~babak/papers/isca99_scrc.ps (postscript).
James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of International Symposium on Computer Architecture, pages 241-251, June 1997. URL: http://www.sgi.com/origin/images/isca.pdf (PDF).
Message Passing Interface (MPI) Standard. URL: http://www-unix.mcs.anl.gov/mpi/index.html (html).
Daniel Scales and Kourosh Gharachorloo, Towards Transparent and Efficient Software Distributed Shared Memory. In Proceedings of International Symposium on Operating System Principles, pages 157-169, October 1997. URL: http://www.research.digital.com/wrl/techreports/abstracts/98.7.html (html abstract with postscript and PDF full text).
Steven L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, pages 26-36, October 1996. URL: http://www.cs.wisc.edu/~markhill/Misc/asplos96_t3e_comm.pdf (PDF).

10. Recent Implementations and Future Prospects

In this section we update the reader with more recent references on both recent computer system implementations as well as their future prospects. Material already in the book is not duplicated here.

Recently Dally and Poulton have written an excellent text on many of the important EE issues in building large computer systems [1].

The microarchitecture of Intel's Itanium (IA-64) processor is described in [2].

The Intel Technology Journal [3] is the premier web technical journal. It contains articles on the implementation of recent processors, including the Pentium III [4]. It also contains excellent references for computer architects on extreme UV lithography [5] and limits to CMOS scaling [6].

Fred Pollack (of Intel research) presents his thoughts on important areas for future computer architecture research in a set of slides from his Micro keynote [7].

The Sematech International Technology Roadmap for Semiconductors [8] provides a set of goals for future semiconductor technology.

Smith and Sohi [9] is a comparative evaluation of microarchitectures used in the implementation of several important architectures, circa 1995.

Perspectives on the relative importance of the RISC market, the x86 market, and embedded computing, including Tredennick's famous "The bb and the beach ball" analogy, appear in [10].

References

William J. Dally and John W. Poulton. Digital Systems Engineering. Cambridge University Press, 1998.
Intel Itanium (IA-64) microarchitecture implementation. URL: http://developer.intel.com/design/IA-64/microarch_ovw/index.htm (html).
Intel technology journal. URL: http://developer.intel.com/technology/itj/ (html).
Intel technology journal article on the Pentium III implementation. URL: http://developer.intel.com/technology/itj/q21999/articles/art_2.htm (html).
Intel technology journal article on extreme UV lithography. URL: http://developer.intel.com/technology/itj/q31998/articles/art_4.htm (html).
Intel technology journal article on limits to CMOS scaling. URL: http://developer.intel.com/technology/itj/q31998/articles/art_3.htm (html).
Fred Pollack's Micro keynote slides. URL: http://huron.cs.ucdavis.edu/micro32/presentations/fred.ps (postscript).
Sematech International Technology Roadmap for Semiconductors URL: http://notes.sematech.org/ntrs/PublNTRS.nsf (html).
James E. Smith and Gurindar Sohi. The Microarchitecture of Superscalar Processors. In Proceedings of the IEEE, vol 83, no. 12, December 1995, pp 1609-1624.
Nick Tredennick. Technology and Business: Forces Driving Microprocessor Evolution. In Proceedings of the IEEE, vol 83, no. 12, December 1995, pp 1641-1652.

11. Power Considerations (New Section)

Since the publication of our reader, power considerations in computer architecture have become more important for a number of reasons. First, various types of portable and ubiquitous computing have become more popular in both research and practice. Second, limitations of device performance with scaling (see Chapter 10) have made the power dissipation of future high-performance microprocessors a potentially serious problem. This is because dynamic power dissipation scales as the square of the chip voltage, so for many years lowering the voltage kept power in check, but now various device leakage modes (e.g., subthreshold conduction) may prevent chip voltages from scaling to less than 1 volt. Thus we are adding a new section to the web companion on power considerations.

In recent years, power has also been reduced by techniques known to Benjamin Franklin: "Waste not, want not". For example, instead of clocking functional units that have no work to do in a given cycle, clock gating is used so that unused functional units do not waste dynamic power. It will be interesting to see if future techniques of power conservation can result in further significant power reductions once the voltage has been scaled and unused components have been turned off.
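As a rough illustration of why both voltage scaling and clock gating matter, the sketch below uses the standard first-order model of CMOS dynamic power, P = a * C * V^2 * f, where a is the fraction of cycles in which a unit actually switches. The capacitance, voltage, and frequency values are made-up assumptions chosen only to show the ratios, and treating clock gating as a lower activity factor is a simplification.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz, activity=1.0):
    """First-order CMOS dynamic power in watts: P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz

# Illustrative numbers only (assumptions, not measurements of any real chip).
C = 1e-9      # switched capacitance in farads
F = 200e6     # clock frequency in hertz

baseline = dynamic_power(C, 3.3, F)                # every unit clocked each cycle
scaled = dynamic_power(C, 1.65, F)                 # supply voltage halved
gated = dynamic_power(C, 1.65, F, activity=0.5)    # idle units gated half the time

print(f"voltage halved: {scaled / baseline:.2f} of baseline power")
print(f"plus clock gating: {gated / baseline:.3f} of baseline power")
```

Halving the voltage cuts dynamic power to a quarter, which is why the end of voltage scaling is significant. Clock gating reduces the activity term instead, so it remains useful even when the voltage can no longer be lowered.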

Chandrakasan et al. [1] is an excellent reference that provides an overview of low power circuits and systems. The energy dissipation of general-purpose processors and opportunities for energy reduction are presented by Gonzalez and Horowitz [3].

There are two good case studies in low-power processor design. In the embedded space, the design of the StrongARM 110 is described by Montanaro et al. [4]. In the general-purpose space, the design of the PowerPC 603 is described by Gary et al. [2].

References

Anantha P. Chandrakasan, Samuel Sheng, and Robert W. Brodersen, Low-Power CMOS Digital Design, IEEE Journal of Solid State Circuits, April 1992, pp 473-484.
Sonya Gary, et al., PowerPC 603, A Microprocessor for Portable Computers, IEEE Design and Test of Computers, Winter, 1994, pp 14-23.
Ricardo Gonzalez and Mark Horowitz, Energy Dissipation in General Purpose Processors, IEEE Journal of Solid-State Circuits, Sept. 1996, pp 1277-1284.
James Montanaro et al., A 160MHz, 32b, 0.5W CMOS RISC Microprocessor, IEEE Journal of Solid-State Circuits, Nov. 1996, pp 1703-1714.

12. Reliability, Availability, and Serviceability/Scalability (New Section)

As John Hennessy pointed out in his 1999 FCRC keynote [2], in an increasing number of applications the performance (as measured by program execution time) of computers is not the key metric. In computers that are clients or servers for the internet, metrics such as reliability, availability, and serviceability/scalability (RAS) are much more crucial. For example, the web servers of a major internet company such as Yahoo must be online 24 hours a day, 7 days a week. Similarly, people expect their PDAs to work whenever they pick them up, without crashing or needing a reboot.
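Availability is commonly quantified as the fraction of time a system is up, MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The short sketch below, with made-up numbers that are assumptions rather than data from [2], shows how quickly "always online" turns into a tight downtime budget.

```python
HOURS_PER_YEAR = 24 * 365

def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_hours_per_year(avail):
    """Expected hours of downtime in a (non-leap) year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR

# Hypothetical server: fails about every 999 hours, takes 1 hour to repair.
a = availability(999, 1)
print(f"availability = {a:.3f}")
print(f"downtime = {downtime_hours_per_year(a):.2f} hours per year")
```

Even at 99.9% availability this hypothetical server is down for most of a working day each year, so both lengthening the time between failures and shortening repair time are worthwhile design targets.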

RAS issues have always been important in certain markets (such as finance and banking). Over the years, large enterprise computing systems, such as those from Tandem and IBM, have developed many techniques that are now of interest to a much wider range of computing.

In this section we highlight several references related to RAS issues. Pradhan [3] is a classic text on fault-tolerant computer design. More recently, Slegel et al. [4] describe many RAS issues in the design of an IBM microprocessor used to build highly available systems. An interesting new approach was proposed by Austin [1].

References

Todd M. Austin, DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design, In the Proceedings of Micro-32, 1999, pp 196-207.
John Hennessy, The Future of Systems Research, Computer, August 1999, pp 27-33.
Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
Timothy J. Slegel et al., IBM's S/390 G5 Microprocessor, IEEE Micro Mar/Apr 1999, pages 12-23.