Mark D. Hill, University of Wisconsin-Madison Computer Sciences, Norman P. Jouppi, Compaq Western Research Laboratory, and Gurindar S. Sohi, University of Wisconsin-Madison Computer Sciences Version 2.0 Released August 8, 2000
Preface

Welcome to the web component that complements our book Readings in Computer Architecture [1]! The book, available from Morgan Kaufmann Publishers [2], contains more than 50 computer architecture papers that have stood the test of time. It is organized into ten chapters that reflect the key topics in the field of computer architecture. In the first section of each chapter, we introduce the selected papers, placing them in context and often offering additional comments and insights that provide a broader background for the chapter topic.

The cutting edge of computer architecture changes too rapidly to be captured only in a book whose editions can come out no more frequently than every few years. There are many current papers that are worth reading, either because they represent the current state of the art (e.g., a new microprocessor) or because they may become classics. In addition, there are other useful web resources (e.g., the web page for a benchmark suite) whose existence is too transitory to cite in a book. This web component will be updated at least annually to address these dynamic needs. It is modeled after the book, with the same ten chapters plus two new chapters: 11. Power Considerations and 12. Reliability, Availability, and Serviceability/Scalability Considerations. Each chapter contains pointers to recent papers and web resources, together with our notes. These comments introduce the papers and put them in context with each other and with current work. Whenever possible, we include Web uniform resource locators (URLs). The papers and resources selected are reviewed by our colleagues in academia and industry to help ensure we have considered the most important recently published papers.
You can always obtain the most recently released version
of this web component at URL:
Some scholars, however, are displeased by how web pages are easily
changed and not archived. To address this concern, released versions of
this web component will be permanent and unchanging (except to update
outgoing URLs or fix trivial errors). To obtain the version released
on September 9, 1999, for example, use URL:
To look at all previously released versions and a log of changes, see:
There are, however, at least three challenges inherent in the dynamic
world of the Web. First, URLs can become obsolete. Please email us at
carbugs@mkp.com if you notice this
occurring. Second, web content can be extensive and can change. We point
to what we find interesting. We do not and cannot vouch for
the accuracy and balance of the web content to the same degree as can be
achieved through peer review. Third, the reader may be responsible
for conforming to copyrights on some web resources.
To help readers we call attention to the copyrights we are aware of
through "click boxes" that appear when dereferencing selected links.
We conclude this preface by calling your attention to the
WWW Computer Architecture Home Page at Wisconsin
[3].
Since 1995 this page has
been a clearinghouse of computer architecture information.
The current version includes URLs for
calls for papers and participation for conferences (e.g., ISCA),
research groups,
researchers,
organizations (e.g., ACM, DARPA, IEEE, and NSF),
commercial pages (both technical and less technical),
books,
on-line publications, and internet newsgroups.
Gordon Bell has placed complete pdf copies of his books online
[1].
This is very useful since Bell and Newell
[2],
Bell, Mudge, and McNamara
[3],
and Siewiorek, Bell, and Newell are all very valuable references that
are out of print. Bell has also put the slides of his recent presentations
online, including a talk on Seymour Cray's contributions to computing
[4]
that contains lots of photos of classic machines from the computer history
center.
The long anticipated book by the two IBM 360 architects who went to
academia [5] (Blaauw and Brooks) has been published. It contains many
details on the instruction set architecture of early machines.
They make a strong distinction between "computer architecture" and "computer
organization." Their use of those terms corresponds to the terms
"instruction-set architecture" and "microarchitecture", respectively,
which are more commonly used today.
Recently Ceruzzi has written a great book on the history of computers [6],
covering both the machines themselves
and the marketplace. It has excellent coverage
from the early 1960s until the PC.
The Computer Museum in Boston has a web site with historical information
[7].
The Computer Museum's Computer History Center also has good coverage
of historical machines [8].
Mark Smotherman has some web pages that cover some aspects of computer
history in great detail
[9].
Of particular interest are his discussion of the
IBM Stretch and his chronology of IBM's ACS project.
ACS was a response to IBM's loss of leadership
in the high-performance computing segment. It evaluated such advanced
concepts as multiple instruction issue and multithreading in the
mid-1960s, but it was cancelled by IBM.
This section updates the book's "Methods" chapter in two ways. First, it
gives pointers to three benchmark suites in common current use (SPEC,
SPLASH-2, and TPC). Second, it gives examples of state-of-the-art methods
and tools for analytic modeling (CMOS cache access times), simulation
(SimpleScalar, RSIM, Simics, and SimOS), and system monitoring (executable
editing with ATOM).
Benchmark suites provide a way for comparing system alternatives
that is much less work than porting your important workloads to
each alternative machine. The answers they give, however, may not be
completely representative of your workloads, especially when vendors tune
their designs for popular benchmark suites.
Here we point to the Web pages of three currently popular benchmark suites.
Standard Performance Evaluation Corporation (SPEC)
[5]
provides a variety of benchmarks for evaluating systems.
SPEC recently announced SPEC CPU2000 to replace the
widely-used but aging SPEC CPU95.
Other SPEC suites include those for
evaluating network file systems, Java virtual machines, and Web servers.
Stanford Parallel Applications for Shared Memory (SPLASH-2)
[6]
is the most popular benchmark suite for academic architects
evaluating shared-memory multiprocessors. In the early 1990s, SPLASH-2's
predecessor, SPLASH, provided a tremendous boost to the field
by replacing benchmarks like matrix multiply and FFT. Today, however,
SPLASH-2 has aged and is not clearly representative of what deployed
shared-memory multiprocessors actually run.
Transaction Processing Performance Council (TPC)
[8]
provides
benchmarks and results for several important database management system
(DBMS) scenarios. Currently, the most popular are TPC-C for
on-line transaction processing (OLTP) and TPC-D for decision
support systems (DSS). OLTP is characterized by small, well-defined
queries (e.g., at an automated bank teller machine), while DSS has
longer, ad hoc queries (e.g., how do sales correlate with
salesperson). Metrics measure throughput and cost-effectiveness (dollars
per throughput). TPC benchmarks are highly-regarded for the specific
scenarios they address.
A significant part of the book's "Methods" chapter focuses on the three basic
methods computer architects use to study systems: analytic modeling,
simulation, and system monitoring. We next point to examples that
illustrate each.
CACTI
[9]
is an analytic model embodied in a downloadable tool.
It allows one to estimate the access time of alternative cache structures
on CMOS microprocessors within 10% of Hspice results.
One can, for example, determine the access
times for a direct-mapped and a two-way set-associative cache, obtain miss
ratio data from other tools, and decide which alternative is preferred.
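To make the decision process concrete, here is a toy sketch of the trade-off just described: combining an analytic access-time estimate with miss ratios from a separate tool to choose between two cache organizations. All numbers are made-up placeholders, not CACTI output.

```python
# Toy illustration of the decision CACTI supports: average memory access
# time = hit time + miss ratio * miss penalty. Numbers below are invented.

def avg_access_time(hit_time_ns, miss_ratio, miss_penalty_ns):
    """Average memory access time for a single-level cache."""
    return hit_time_ns + miss_ratio * miss_penalty_ns

# Hypothetical inputs: the 2-way cache is slower to access but misses less.
direct_mapped = avg_access_time(hit_time_ns=2.0, miss_ratio=0.05, miss_penalty_ns=50.0)
two_way       = avg_access_time(hit_time_ns=2.4, miss_ratio=0.03, miss_penalty_ns=50.0)

print(f"direct-mapped: {direct_mapped:.2f} ns")   # 2.0 + 0.05*50 = 4.50 ns
print(f"two-way:       {two_way:.2f} ns")         # 2.4 + 0.03*50 = 3.90 ns
```

Here the two-way cache wins despite its longer hit time; with a different miss penalty the conclusion could flip, which is exactly why both tools are needed.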
Micro-architectural simulators remain popular tools with computer
architects.
Wisconsin SimpleScalar
[3]
and Rice RSIM
[2]
are widely used in both research and teaching. SimpleScalar is a
uniprocessor simulator, while RSIM focuses on multiprocessor issues.
Stanford SimOS
[1]
[7]
and Virtutech Simics
[10]
(a commercial product) provide a more complete system model
(including operating system and input/output behavior) at a natural cost
of being more challenging to use.
Finally, a significant step forward on the system monitoring front
is executable editing, first embodied in ATOM [3].
ATOM (for Alpha executables), Wisconsin EEL (SPARC), and Washington Etch
(IA-32) allow users to modify executables so that, when run, they
perform auxiliary functions (e.g., counting basic block executions) and
still run at near full speed (for modest auxiliary functions).
Instruction sets that are in common use today include RISC instruction sets
such as the
Compaq Alpha [1] ,
Apple/IBM/Motorola PowerPC [7] ,
MIPS [5] ,
Sun SPARC [8],
as well as CISC instruction sets such as the
Motorola 68K [6]
and the
Intel IA-32 [2] .
Implementations of these account for most of the volume of
general-purpose microprocessors shipped today.
Recently HP and Intel have disclosed parts of the
IA-64 architecture [3] .
IA-64 is similar in many aspects to the PlayDoh architecture,
a research vehicle developed at Hewlett-Packard Laboratories.
The technical report by Kathail, Schlansker, and Rau
[4]
describes the PlayDoh architecture.
Multimedia extensions to instruction sets are discussed in Chapter 8.
Instruction-level parallelism continues to play a very important
role in both industrial processor designs and academic
research.
This section looks at some of these recent trends.
An important subject of investigation is the exploitation
of control independence. Current implementations of dynamically-scheduled
processors squash all speculative instructions on a branch misprediction,
including control independent instructions.
The paper by Lam and Wilson
[2]
presents a limits study
of the effects of control flow on available parallelism.
In addition to being an example of how to carry out a limits
study, it highlights the importance of exploiting control independence.
A processor's sustained instruction execution rate is limited by
the rate at which instructions are fetched, and this is in turn
limited by branch prediction (among other things such as
instruction cache organization). Since the early 1990s there has
been a lot of work in branch prediction, including work
on innovative branch predictors and on understanding why they work.
The paper by McFarling
[4]
suggests that multiple branch predictors can be combined to
achieve prediction rates better than what could be achieved
with any single predictor.
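The combining idea can be sketched roughly as follows (a simplified illustration, not McFarling's exact design; the two component predictors here are simple last-outcome stand-ins): a table of two-bit "chooser" counters selects, per branch, which component predictor to believe, and is nudged toward whichever component was correct.

```python
# Sketch of a combining ("tournament") branch predictor.
class CombinedPredictor:
    def __init__(self, size=1024):
        self.chooser = [1] * size   # 2-bit counters: 0-1 favor p1, 2-3 favor p2
        self.p1 = [True] * size     # stand-in for a bimodal predictor
        self.p2 = [True] * size     # stand-in for a gshare-style predictor

    def predict(self, pc):
        i = pc % len(self.chooser)
        return self.p1[i] if self.chooser[i] < 2 else self.p2[i]

    def update(self, pc, taken):
        i = pc % len(self.chooser)
        ok1, ok2 = self.p1[i] == taken, self.p2[i] == taken
        if ok1 and not ok2:
            self.chooser[i] = max(0, self.chooser[i] - 1)   # trust p1 more
        elif ok2 and not ok1:
            self.chooser[i] = min(3, self.chooser[i] + 1)   # trust p2 more
        self.p1[i] = self.p2[i] = taken   # last-outcome update for both

bp = CombinedPredictor()
bp.update(0x40, taken=True)
print(bp.predict(0x40))   # True
```

The chooser lets each static branch gravitate to whichever component predicts it best, which is the source of the combined predictor's advantage.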
The paper by Young, Gloy and Smith
[9]
provides a framework for characterizing different branch
prediction schemes. The paper by Emer and Gloy
[1]
presents a language for describing predictors.
A formal description allows the design space to be
explored in a systematic manner, and also allows
predictors to be synthesized.
In dynamically-scheduled processors, dependencies between
instructions involving register operands can easily be determined
since the register names are available for the decoder to see.
Dependencies between memory instructions are harder to determine,
since an address calculation is required. When addresses
are not available (or cannot be used in an efficient manner)
to determine the (in)dependence relationship between two memory
operations, guesses need to be made about the relationship.
(The conservative guess is that a dependence exists whereas
an aggressive guess is that no dependence exists.)
The paper by Moshovos, Breach, Vijaykumar and Sohi
[5]
discusses the subject of data dependence speculation.
As processors become more aggressive, the use of
data dependence speculation techniques is likely to become more pervasive.
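The flavor of such a scheme can be sketched as follows (an assumed simplification for illustration, not the specific mechanism of the paper): by default, a load is allowed to issue aggressively past earlier stores with unresolved addresses, but loads that have caused misspeculations before are remembered and made to wait.

```python
# Sketch of history-based data dependence speculation.
class DependencePredictor:
    def __init__(self):
        self.conflicting_loads = set()   # PCs of loads seen to misspeculate

    def may_issue_early(self, load_pc):
        # Aggressive guess: assume no dependence, unless history disagrees.
        return load_pc not in self.conflicting_loads

    def record_misspeculation(self, load_pc):
        # A store later wrote an address this load already read:
        # the load was squashed, so remember it for next time.
        self.conflicting_loads.add(load_pc)

pred = DependencePredictor()
assert pred.may_issue_early(0x40)        # first encounter: speculate
pred.record_misspeculation(0x40)         # squash and learn
assert not pred.may_issue_early(0x40)    # now wait for earlier stores
```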
With branch prediction being used to overcome control dependence
constraints, renaming being used to overcome name dependence constraints,
and data dependence speculation being used to
overcome ambiguous dependence constraints, we are left with
true dependence constraints.
Value speculation is a technique to overcome true dependencies:
the output of an instruction (or alternately the inputs of
an instruction) are predicted, and instructions executed
with the predicted value, in parallel with instructions
that produce the actual value.
The paper by Lipasti and Shen
[3]
discusses how value speculation can be used to exceed
the dataflow limit inherent in a program's execution.
In recent years there has been a lot of research on understanding
the causes of value predictability, and on building better
value predictors.
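A last-value predictor, one of the simplest points in this design space, can be sketched as follows (illustrative only): predict that an instruction will produce the same value it produced last time, and let dependent instructions start executing with that guess while the real value is computed and verified.

```python
# Toy last-value predictor for value speculation.
class LastValuePredictor:
    def __init__(self):
        self.last = {}   # instruction PC -> last result observed

    def predict(self, pc):
        # None means no prediction is available yet.
        return self.last.get(pc)

    def update(self, pc, value):
        # Called when the actual result arrives (also when verifying).
        self.last[pc] = value

vp = LastValuePredictor()
vp.update(0x100, 7)
print(vp.predict(0x100))   # 7: dependents may begin with this guess
```

If the guess turns out wrong, the speculatively executed dependents must be squashed and re-executed, just as with a branch misprediction.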
As shrinking feature sizes allow increased logic speeds,
the importance of wire delays increases. Many microprocessor
designers believe that wire delays will dominate in future
microprocessor designs, and that building superscalar
processors with centralized instruction windows will not be feasible.
The paper by Palacharla, Jouppi and Smith
[6]
analyzes the important logic elements of a dynamic issue processor and
provides quantitative evidence to support this sentiment.
Designers and researchers are investigating decentralized
or distributed microarchitectures in which critical functional
elements are decentralized. The Compaq Alpha 21264 microprocessor
is on the leading edge of this trend: it divides its
instruction selection and execution logic into two clusters.
Future microprocessors are likely to see a decentralization
of all aspects of instruction processing.
Finally, we come to one model for future processors
that is being investigated heavily: distributed microarchitectures
that combine ILP and thread-level parallelism (TLP) using
thread-level speculation.
The paper by Sohi, Breach, and Vijaykumar
[8]
describes the multiscalar architecture in which
all important functions are implemented in a distributed fashion.
A serial program is speculatively parallelized using
thread-level speculation and executed on a parallel microarchitecture
with support for recovering from misspeculations.
Statically-scheduled ILP processors have also come back to the
forefront with the announcement of HP and Intel's
Explicitly Parallel Instruction Computing (EPIC)
technology, as epitomized by the recently-announced
IA-64
architecture. Research papers on technologies related to
EPIC architectures can be found on the Web site of the
Illinois IMPACT
research group.
A recent paper by Schlansker and Rau
[7]
describes the EPIC philosophy and the set of architectural
features that characterize the EPIC style of ILP architectures.
Much of the work in dataflow and multithreading is finding its way
into processor designs through the route of instruction-level parallelism.
Dynamically-scheduled superscalar processors strive to achieve
dataflow-like execution (albeit of control- and data-speculative
instructions). The instruction window and instruction
wakeup logic of current generation superscalar processors can
be considered to be analogous to the token store of a classical
tagged-token dataflow machine. As one might expect, researchers are
investigating alternative ways of implementing this functionality
(analogous to the explicit token store simplification
of the tagged-token store).
Many processors currently in the design process (circa 1999) are
rumored to support the simultaneous or parallel
variety of multithreading. If such support is available,
an obvious question is how to use this functionality
in non-traditional ways. Two recent proposals try to exploit
the available multithreaded support to assist the main computation thread.
In the first approach, software exception handlers (e.g., TLB miss handlers
or profiling exception handlers) are executed as a separate thread
so that the main computation thread does not have to be switched out.
A paper by Zilles, Emer, and Sohi
[3]
describes this approach.
In the second approach, "microthreads" are created, either from the program
itself, or via a separate specification.
These microthreads are then executed either speculatively or
non-speculatively
to carry out functions that assist the main thread.
Roth, Moshovos and Sohi illustrate how such a speculative thread
could be extracted from the program and assist with the task
of prefetching linked data structures
[2].
Chappell, Stark, Kim, Reinhardt and Patt show how a subordinate thread can
be specified as a piece of microcode and give an example of how it can
be used to assist the branch prediction process
[1].
The use of such "microthreads" (also called "subordinate threads"
or "scout threads" or "helper threads") is likely to grow in future
microprocessors as the available chip real estate allows multiple
threads to be supported.
Technology and application trends continue to drive memory system
changes. This section looks at four issues:
instruction delivery,
prefetching,
translation lookaside buffers,
and DRAM interface changes.
These and related issues are discussed in the February 1999 special issue of
IEEE Transactions on Computers
[3].
Delivering instructions to a modern out-of-order pipeline is both critical
to performance and challenging to do. For a pipeline to sustain R
retired instructions per cycle, it must be fueled with I > R
instructions per cycle (since some instructions are squashed). Since
instruction cache misses cause cycles with no instructions delivered, one
must both make cache misses rare and deliver J > I instructions
when misses are not occurring. A key additional challenge occurs when J
becomes larger than the number of instructions in a typical
basic block.
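A small worked example makes the inequalities R &lt; I &lt; J concrete (the squash and miss-cycle rates below are illustrative assumptions, not measurements):

```python
# Worked example of the fetch-bandwidth inequalities R < I < J.
R = 4.0              # target retired instructions per cycle
squashed = 0.20      # assumed fraction of fetched instructions later squashed
I = R / (1 - squashed)        # must fetch I > R to retire R

miss_cycles = 0.10   # assumed fraction of cycles delivering no instructions
J = I / (1 - miss_cycles)     # non-miss cycles must deliver J > I

print(f"I = {I:.2f}, J = {J:.2f} instructions/cycle")
# With typical basic blocks of 4-6 instructions, J already approaches or
# exceeds one basic block per cycle, motivating multi-block fetch.
```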
Here we point to two papers that give some current thinking on instruction
delivery and specifically address how to fetch from more than one basic
block per cycle. Yeh et al.
[9]
propose a design that makes multiple
branch predictions per cycle and then uses them to access an instruction cache
that permits multiple fetches to different basic blocks each cycle.
Rotenberg et al.
[6]
instead propose coalescing consecutive dynamic
instructions (even across branches) into a trace and then caching the
traces in a trace cache that need only provide one trace per cycle.
Prefetching attempts to obtain data or instructions from memory prior
to their use. This typically reduces average memory latency at a cost
of additional bandwidth (e.g., for items incorrectly prefetched).
An important type of prefetching is called non-binding prefetching.
Non-binding prefetching moves data (or instructions) into caches but leaves
it under the jurisdiction of coherence protocols so that prefetching
cannot cause program correctness violations.
Here we point to two papers that highlight the two key ways of initiating
non-binding prefetches: software and hardware. Mowry
[4]
discusses
non-binding data prefetches initiated by software after being inserted in
applications by a compiler. He reviews a uniprocessor compiler algorithm
and then delves into the more challenging problem of prefetching in
a shared-memory multiprocessor. Joseph and Grunwald
[2]
investigate
hardware-initiated prefetching. They review well-established prior
work, such as stride prefetchers and stream buffers, and then develop
their aggressive Markovian prefetcher which will, at minimum,
creatively use millions of transistors.
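The core idea of a Markovian prefetcher can be sketched in a few lines (an assumed simplification of Joseph and Grunwald's design, with a bounded fan-out per entry): record which miss addresses have followed each miss address, and prefetch those successors the next time that address misses.

```python
from collections import defaultdict

# Sketch of a Markov prefetcher keyed on cache-miss addresses.
class MarkovPrefetcher:
    def __init__(self, fanout=2):
        self.successors = defaultdict(list)  # miss addr -> recent next misses
        self.fanout = fanout
        self.prev_miss = None

    def on_miss(self, addr):
        if self.prev_miss is not None:
            succ = self.successors[self.prev_miss]
            if addr in succ:
                succ.remove(addr)
            succ.insert(0, addr)     # most recent successor first
            del succ[self.fanout:]   # keep a bounded fan-out per entry
        self.prev_miss = addr
        return list(self.successors[addr])   # candidate prefetch addresses

pf = MarkovPrefetcher()
for a in [0x10, 0x20, 0x30, 0x10, 0x20]:
    pf.on_miss(a)
print(pf.successors[0x10])   # [32], i.e., a miss on 0x10 now prefetches 0x20
```

Unlike stride prefetchers, this table can follow arbitrary (e.g., pointer-chasing) miss patterns, at the cost of substantial state, hence the remark about millions of transistors.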
Another challenge is designing memory management
units (MMUs) to implement virtual memory.
MMU designs must support applications with very
large memory footprints (e.g., databases with gigabyte buffer pools)
and poor locality (e.g., some object-oriented programs) and yet keep
costs down since the same microprocessor design is often deployed in
both large servers and small desktops. Talluri and Hill
[8]
discuss translation lookaside buffer (TLB)
alternatives to the brute-force approach of just increasing the number
of TLB entries. In particular, they explore using large aligned pages
called superpages and/or having multiple translations per TLB entry
(called subblocking).
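Subblocking can be illustrated with a toy layout (invented for illustration, not a specific commercial design): one TLB tag covers a group of consecutive base pages, with a separate translation per subblock, so a single entry can map several pages.

```python
PAGE = 4096
SUBBLOCKS = 4   # base pages sharing one TLB tag

# Sketch of a subblocked TLB: one tag, several per-page translations.
class SubblockedTLB:
    def __init__(self):
        self.entries = {}   # group tag -> list of PPNs (None = invalid)

    def lookup(self, vaddr):
        vpn = vaddr // PAGE
        tag, idx = vpn // SUBBLOCKS, vpn % SUBBLOCKS
        ppns = self.entries.get(tag)
        if ppns is None or ppns[idx] is None:
            return None                          # TLB miss
        return ppns[idx] * PAGE + vaddr % PAGE   # physical address

    def fill(self, vaddr, ppn):
        vpn = vaddr // PAGE
        tag, idx = vpn // SUBBLOCKS, vpn % SUBBLOCKS
        self.entries.setdefault(tag, [None] * SUBBLOCKS)[idx] = ppn

tlb = SubblockedTLB()
tlb.fill(0x5000, ppn=7)                  # map virtual page 5
print(hex(tlb.lookup(0x5123)))           # 0x7123
assert tlb.lookup(0x6123) is None        # same tag group, subblock invalid
```

With good spatial locality, one subblocked entry covers SUBBLOCKS pages for roughly the cost of one conventional entry's tag, which is the attraction over simply adding entries.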
Jacob and Mudge
[1]
survey six commercial MMU designs.
Finally, memory systems are being significantly changed through evolution
and revolution of the interface between processors and DRAM memory.
Driving these changes is the trend
that memory chip capacity is growing faster
than typical memory system sizes. As a result, the number of memory
chips needed to provide a system's memory capacity has fallen far
below the number needed to provide adequate memory bandwidth across the
traditional DRAM interface. To address this problem, researchers and
practitioners have proposed modest changes to DRAM chip interfaces, more
radical changes, and even merging processing and DRAM on the same chip.
The November/December 1997 issue of IEEE Micro
[7]
provides a
place to start to understand the issues forcing DRAM interface changes.
It includes discussions of semiconductor issues, Rambus, SLDRAM, and
merged logic and DRAM, but alas it is already two years old in this
fast-changing area. More recently, Rambus
[5]
provides a cogent (but naturally
biased) case for its products. Readers should stay tuned, as next year's
picture of DRAM interfaces will likely be different from this year's.
There have been a number of developments in I/O standards in 1999-2000.
Although the PCI bus has been extended to the PCI-X bus
[14], within several
years many future I/O systems will be based around the Infiniband
[10]
industry standard.
Infiniband is the merger of several competing industry standards.
It uses high-speed (2.5Gb/sec) point-to-point serial links as a basic building block.
Other recent developments include further expansion of Fibre Channel
[5] and storage-area networks (SANs).
Network-attached secure disks have been quickly moving from research
[7]
to high-volume products.
Adaptec's I/O Connection
[2]
is a pictorial reference on SCSI and PC cabling.
More recent work on RAID file servers has appeared in [3] and [12].
Dunning et al. [4] present Compaq/Intel/Microsoft's Virtual Interface Architecture (VIA).
Galles [6] describes the SGI Origin system interconnect.
Gillett [8] describes Compaq's Memory Channel, used for building clusters.
Horowitz et al. [9] is a good reference for electrical issues involved in high-speed
signalling.
Intel's developer web pages
[11]
contain useful information on PC standards and initiatives.
The design of a 50-Gb/s IP router is described by Partridge et al. [13].
The PC Webopedia [15]
and the PC Guide
[16]
contain lots of information on PC I/O standards.
The proceedings of the 1998 and 1999 Eurographics/SIGGRAPH Workshops on
Graphics Hardware
[17]
[18]
are both online. Recent proceedings of SIGGRAPH [19]
also have a limited number of papers on graphics hardware, but they are not
available online except through the ACM Digital Library
[1].
SIMD principles are being employed heavily in the design of multimedia
extensions to the base instruction sets of many architectures.
These instruction sets are designed to carry out parallel (SIMD)
operations on voice, image, and video data streams.
Most of the incumbent architectures have adopted multimedia
extensions. Intel started out with the
MMX extensions [1] in the Pentium microprocessor,
and later extended them to the
Intel Internet Streaming SIMD [2] extensions in the Pentium III.
Sun Microsystems added the
VIS [4] extensions to the UltraSPARC architecture,
and Apple/IBM/Motorola have added the
AltiVec [3] extensions to the PowerPC architecture.
This section first discusses advances in shared-memory multiprocessors and
then examines some multicomputer/cluster software issues.
Recall that multiprocessors can implement cache coherence (designated
CC) or not (NCC) and may have uniform (UMA) or
non-uniform (NUMA) memory access times. This leads to four
multiprocessor classifications for which we next discuss three recent
notable machines: CC-UMA Sun Ultra 10000, CC-NUMA SGI Origin 2000, NCC-UMA
(no machine discussed), and NCC-NUMA Cray T3E.
We also discuss Compaq Piranha, which
combines CC-UMA on-chip with CC-NUMA between chips.
Sun Ultra 10000 [6],
code-named Starfire, is a symmetric
multiprocessor (SMP) or CC-UMA that pushes the limits of an SMP to
support 64 processors and a memory bandwidth over 10 gigabytes per second.
Starfire uses snooping coherence but does not have a bus implemented
with shared wires. Address transactions travel on four interleaved
"buses," implemented with totally-ordered pipeline broadcast trees.
Data blocks move on a separate unordered point-to-point crossbar.
Silicon Graphics (SGI) Origin 2000
[9]
is a CC-NUMA that scales to 1024
processors and a terabyte of memory. It maintains coherence with a
distributed directory protocol derived from that of Stanford DASH, but
uses a hypercube-like interconnection network that is more scalable than
DASH's mesh.
Cray/SGI T3E
[12]
is a NCC-NUMA that scales to 2048 DEC Alpha 21164
processors and 4 terabytes of memory. The T3E must overcome the fact that
the 21164 was designed to be used in systems with more modest requirements
for physical address space, outstanding cache misses, TLB reach, etc. The T3E
overcomes these limits with a set of elegant mechanisms, in part learned
from the more ad hoc first-generation T3D. The mechanisms, added to a
"shell" outside each 21164, include a set of E-registers that replace
the level-two cache with something like vector registers, an address
centrifuge for creating a large interleaved physical address space,
barrier and eureka hardware, and message queue management in memory.
Compaq Piranha
[4]
is a research prototype that explores cache-coherent multiprocessing
within and between chips. Each chip includes eight simple in-order
processors (with private level one instruction and data caches), a
shared level two cache that is eight-way interleaved and directly
connected to eight banks of Rambus RDRAMs, two microprogrammable
protocol engines for inter-chip coherence, and an interconnection router.
Performance simulations with two commercial workloads show one chip
multiprocessor (CMP) can significantly outperform a system using a
single complex out-of-order processor chip.
One issue that has made programming and implementing shared-memory
multiprocessors challenging is dealing with memory consistency
models that formally define the memory interface between hardware and
software. Adve and Gharachorloo
[2]
concisely review the state of the
art of memory consistency models, including key differences between
well-known academic and commercial models. Gniady et al.
[8],
on the other hand, present evidence that speculative execution may
encourage hardware designers to return to the relatively-simple model
of sequential consistency. Nevertheless, compiler writers will likely
continue to target relaxed models when considering optimizations like
code motion and register allocation, particularly on statically-scheduled
machines like the Merced implementation of IA-64.
Multicomputers and clusters are parallel systems with nodes joined by a
network usually connected to a node's I/O system. Each node is logically
like a common off-the-shelf (COTS) computer and, in some cases, is just
a COTS computer. Multicomputer research, therefore, often
focuses on using COTS parts and innovating in the software instead.
Here we point out five software developments.
The Message Passing Interface (MPI)
[10]
is a highly-successful
effort to unify the many vendor-specific messaging systems that existed
before it. It facilitates portable codes and good performance
for applications that can be structured to send coarse-grain messages
(kilobytes to megabytes).
Several research proposals have looked at supporting fine-grained messages (e.g.,
the few words needed in a request control message). Notable among these
is Berkeley's Active Messages
[7]
that allows a short message
to specify a handler for receiving the message and possibly generating
a low-overhead response message. Active Messages, for example, allow
a reasonable implementation of the shared-memory Split-C programming
model on non-shared-memory multicomputers.
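The essence of the active-message idea can be sketched as follows (a toy single-process simulation; the handler names and API are invented for illustration): each message names a handler that runs immediately on receipt, possibly generating a short reply itself, rather than sitting in a queue waiting for an explicit receive call.

```python
# Toy simulation of active-message dispatch.
handlers = {}

def register(name):
    """Decorator recording a handler under a short message name."""
    def deco(fn):
        handlers[name] = fn
        return fn
    return deco

@register("get")
def handle_get(node, key):
    # The handler runs in the receiver and generates the
    # low-overhead reply itself, with no intermediate queueing.
    return ("reply", node.store.get(key))

class Node:
    def __init__(self, store):
        self.store = store

    def deliver(self, handler_name, *args):
        # Arrival of a message immediately invokes the named handler.
        return handlers[handler_name](self, *args)

server = Node({"x": 42})
print(server.deliver("get", "x"))   # ('reply', 42)
```

Because the handler runs directly on arrival, a request/response pair needs no buffering or scheduling overhead, which is what makes fine-grained communication like Split-C's remote accesses affordable.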
Considerable progress has been made on software
distributed shared memory systems that implement a shared
memory application binary interface (ABI) on a cluster of
computers. Rice Treadmarks [3]
tolerates long network latencies with techniques
like lazy release consistency, while Compaq
(formerly DEC) Shasta [11]
implements its ABI with sufficient fidelity to run a commercial database
management system.
NASA Beowulf
[5]
is an effort to "put it all together."
Beowulf provides software (or pointers to software) that allow non-experts
to put together clusters of PCs using the Linux operating system to run
parallel applications with middleware like MPI. There is no magic here.
Beowulf is notable, because people find it works.
Finally, the gigantic end of parallel computing is being probed by the U.S.
Department of Energy's Accelerated Strategic Computing Initiative (ASCI)
Platforms
[1].
Unfortunately, these machines include little long-term computer architecture
research and cost so much that it is difficult to separate reality
from the need to declare success.
In this section we update the reader with more recent
references on both recent computer system implementations as well as
their future prospects. Material already in the book is not duplicated here.
Recently Dally and Poulton have written an excellent text on many of the
important EE issues in building large computer systems [1].
The microarchitecture of Intel's Itanium (IA-64) processor
is described in
[2].
The Intel Technology Journal
[3]
is the premier web technical journal.
It contains articles on the implementation of recent processors, including
the Pentium III
[4].
It also contains excellent
references for computer architects on extreme UV lithography
[5]
and
limits to CMOS scaling
[6].
Fred Pollack (of Intel research) presents his thoughts on important areas for future computer architecture research in a set of slides from his Micro keynote
[7].
The Sematech International Technology Roadmap for Semiconductors
[8] provides
a set of goals for future semiconductor technology.
Smith and Sohi [9] is a comparative evaluation of microarchitectures used
in the implementation of several important architectures, circa 1995.
Perspectives on the relative importance of the RISC market, the x86 market,
and embedded computing, including Tredennick's famous
"The bb and the beach ball" analogy appear in [10].
Since the publication of our reader, power considerations in computer
architecture have become more important for a number of reasons.
First, various types of portable and ubiquitous computing have become
more popular in both research and practice. Second, limitations of
device performance with scaling (see Chapter 10) have made the power
dissipation of future high-performance microprocessors a potentially
serious problem. This is because power dissipation scaled down with the
square of the chip supply voltage for many years, but now various
device leakage modes (e.g., subthreshold conduction) may prevent
chip voltages from scaling to less than 1 volt.
Thus we are adding a new section to the web companion on power
considerations.
In recent years, power has also been reduced by techniques known to
Benjamin Franklin: "Waste not, want not". For example, instead of
clocking functional units that have no work to do in a given cycle,
clock gating is used so that unused functional units do not waste
dynamic power. It will be interesting to see if future techniques of
power conservation can result in further significant power reductions once
the voltage has been scaled and unused components have been turned off.
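The two effects described above can be made concrete with the standard dynamic-power relation P = alpha * C * V^2 * f (all parameter values below are made up for illustration): lowering the supply voltage V cuts power quadratically, while clock gating lowers the activity factor alpha.

```python
# Dynamic power: P = alpha * C * V^2 * f
def dynamic_power(alpha, C, V, f):
    return alpha * C * V**2 * f

# Hypothetical chip: 20 nF switched capacitance, 500 MHz clock.
base   = dynamic_power(alpha=0.5,  C=20e-9, V=2.5, f=500e6)   # ~31 W
scaled = dynamic_power(alpha=0.5,  C=20e-9, V=1.2, f=500e6)   # voltage scaling
gated  = dynamic_power(alpha=0.25, C=20e-9, V=1.2, f=500e6)   # + clock gating

print(f"{base:.1f} W -> {scaled:.1f} W -> {gated:.1f} W")
# Once V nears 1 volt, leakage limits further voltage scaling, leaving
# activity reduction (gating) as the main remaining lever, as noted above.
```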
Chandrakasan et al. [1] is an excellent reference that provides an overview of low power circuits
and systems. The energy dissipation of general-purpose processors and
opportunities for energy reduction are presented by Gonzalez and Horowitz [3].
There are two good case studies in low-power processor design. In
the embedded space, the design of the Strong-ARM 110 is described by Montanaro et al. [4].
In the general-purpose space, the design of the PowerPC 603 is
described by Gary et al. [2].
As John Hennessy pointed out in his 1999 FCRC keynote [2],
in an increasing number of applications
the performance (as measured by program execution time) of computers is
not the key metric. In computers that are clients or servers for the
internet, metrics such as reliability, availability, and
serviceability/scalability
(RAS) are much more crucial. For example, the web servers
of a major internet company such as Yahoo must be online 24 hours a day,
7 days a week. Similarly, people expect their PDAs to work when they
go to use them, without having to reboot them or to have them crash.
RAS issues have always been important in certain markets (such as finance
and banking). Over the years, large enterprise computing systems,
such as those from
Tandem and IBM, have developed many techniques that are now of interest
to a much wider range of computing.
In this section we highlight several references related to RAS issues.
Pradhan [3] is a classic text on fault-tolerant computer design. More recently,
Slegel et al. [4] describe many RAS issues in the design of an IBM microprocessor
used to build highly available systems.
An interesting new approach was proposed by Austin [1].