Performance Features

What is a "better" computer? What is the "best" computer? The factors involved are generally cost and performance.

COST FACTORS:

cost of hardware design
cost of software design (OS, compiler, applications)
cost of hardware manufacture
cost to end purchaser

PERFORMANCE FACTORS:

what programs will be run?
how frequently will they be run?
how big are the programs?
how many users?
how sophisticated are the users?
what I/O devices are necessary?

This section discusses ways of increasing performance.

There are two ways to make computers go faster.

Wait a year. Implement in a faster/better/newer technology. More transistors will fit on a single chip. More pins can be placed around the IC. The process used will have electronic devices (transistors) that switch faster.
new/innovative architectures and architectural features.

There are two standard features that dramatically impact performance. Each of these is covered: memory hierarchy and instruction-level parallelism.

MEMORY HIERARCHIES

Known in current technologies: the time to access data from memory is an order of magnitude greater than a CPU operation.

For example: if a 32-bit 2's complement addition takes 1 time unit, then a load of a 32-bit word takes about 10 time units.

Since every instruction takes at least one memory access (for the instruction fetch), the performance of computer is dominated by its memory access time.

To try to help this difficulty, we have load/store architectures, where most instructions take operands only from registers. We also try to have fixed size, SMALL size, instructions.

what we really want:

very fast memory -- of the same speed as the CPU
very large capacity -- 512 Mbytes
low cost -- $50

These are mutually incompatible. The faster the memory, the more expensive it becomes. The larger the amount of memory, the slower it becomes.

What we can do is to compromise. Take advantage of the fact (fact, by looking at many real programs) that memory accesses are not random. They tend to exhibit locality.

Locality means "nearby."

2 kinds:

Locality in time (temporal locality) If data has been referenced recently, it is likely to be referenced again (soon!).

Example: the instructions with in a loop. The loop is likely to be executed more than once. Therefore, each instruction gets referenced repeatedly in a short period of time.

Example: The top of stack is repeatedly referenced within a program.
Locality in space (spatial locality) If data has been referenced recently, then data nearby (in memory) is likely to be referenced soon.

Example: array access. The elements of an array are neighbors in memory, and are likely to be referenced one after the other.

Example: instruction streams. Instructions are located in memory next to each other. Our model for program execution says that unless the PC is explicitly changed (like a branch or jump instruction) sequential instructions are fetched and executed.

We can use these tendencies to advantage by keeping likely to be referenced (soon) data in a faster memory than main memory. This faster memory is called a cache.


	CPU-cache   <----------------> memory

It is located very close to the processor (CPU). It contains COPIES of PARTS of memory.

A standard way of accessing memory, for a system with a cache:
(The programmer does not see or know about any of this)

Memory access is sent to the cache (first). It will be a READ or a WRITE, to accomplish an instruction fetch, a load, or a store.

If the data is in the cache, then we have a HIT. For a read, the data is returned to the processor, and the memory access is completed.

If the data is not in the cache, then we have a MISS. The memory access must be sent on to main memory.

On average, the time to do a memory access for a machine with a single cache is

       AMAT = cache access time + (% misses  *  memory access time)

AMAT stands for Average Memory Access Time.

This average (mean) access time will change for each program. It depends on the program, and its reference pattern, and how that pattern interracts with the cache parameters.

A cache is managed by hardware.

Keep recently-accessed blocks in the cache -- this exploits temporal locality.
Break memory into aligned blocks (lines) of more than one word, to exploit spatial locality.
Transfer data to/from the cache in blocks.
Put a block in a "block frame." A simplified cache with 4 block frames may look like this:

	  ---------------------------------
	  |    one block fits here        |
	  ---------------------------------
	  |    a second block fits here   |
	  ---------------------------------
	  |    a third block fits here    |
	  ---------------------------------
	  |    a fourth block fits here   |
	  ---------------------------------

Any cache has far fewer block frames (places to put blocks) than main memory has blocks. So, we need a way of knowing which block from main memory goes in which block frame.

A simple implementation uses part of the address (remember that all memory accesses include an address) to map each block from main memory to a specific block frame in the cache. In this way, the address can be used to determine the block frame that a block would be in (if the block is actually in the cache).

Here is a simple diagram of which main memory block maps to which block frame, if there are just 4 block frames in the cache:

	    main memory blocks                        cache
	  --------------------                 --------------------
	  | maps to frame #1 |                 | frame #1         |
	  --------------------                 --------------------
	  | maps to frame #2 |                 | frame #2         |
	  --------------------                 --------------------
	  | maps to frame #3 |                 | frame #3         |
	  --------------------                 --------------------
	  | maps to frame #4 |                 | frame #4         |
	  --------------------                  --------------------
	  | maps to frame #1 |
	  --------------------
	  | maps to frame #2 |
	  --------------------
	  | maps to frame #3 |
	  --------------------
	  | maps to frame #4 |
	  --------------------
	  | maps to frame #1 |
	  --------------------
	  | maps to frame #2 |
	  --------------------
	  | maps to frame #3 |
	  --------------------
	        etc. (there are many more main memory blocks)

For this simple example, where the cache has 4 block frames, we will need 2 bits of address as an INDEX # or LINE #.

And, since a block contains more than one byte, and more than one word (to exploit spatial locality), we will need some of the bits from within the address to determine which byte or word of the block is desired.

The address may be used by the cache as

             -------------------------------------------------
	     |   ?     |  INDEX #   | BYTE/WORD within BLOCK |
             -------------------------------------------------

The remaining bits of the address will be used by the cache as a TAG. Each main memory block (and cache block frame) has a TAG associated with it. The TAG distinguishes which main memory block is in a specific cache block frame. This is necessary because many main memory blocks map to the same block frame (many main memory blocks have the same INDEX #). The cache needs to know which one (of the many) is in the cache.

The address is used by the cache as

             -------------------------------------------------
	     |  TAG    |  INDEX #   | BYTE/WORD within BLOCK |
             -------------------------------------------------

Last item, then we have the whole picture. It is possible (as when a program first starts), that nothing is in a cache's block frame. The cache needs a way of distinguishing between an empty block frame and a full one. A single bit per block called a VALID bit (or a PRESENT bit), determines if the block is currently empty or full.

A diagram:

   address
   -------------------------------------------
   | TAG  | INDEX # | BYTE/WORD within BLOCK |
   -------------------------------------------
      |         |
      |         |     INDEX  VALID   TAG   DATA (BLOCK)
      |         |            ----------------------------------------
      |         |       00   |   |       |                          |
      |         |            ----------------------------------------
      |         |       01   |   |       |                          |
      |         |            ----------------------------------------
      |         ------> 10   |   |   .   |                          |
      |                      --------|-------------------------------
      |                 11   |   |   |   |                          |
      |                      --------|-------------------------------
      |                              |
      |                              |
      |    |--------------------------
      |    |
     \ /  \ /

     comparison

Using this diagram, here is how the address and cache are used:

The processor sends a memory access to the cache. This access includes an address.
The INDEX field within the address is used to determine the block frame of the data.
The VALID bit of the correct block frame is checked to see if the frame actually contains data.
If VALID bit active, then the TAG of the block that is in the cache is checked to see if it matches the TAG within the address.
- If the TAGs match, then there is a HIT, and the least significant bits of the address are used to get the correct byte/word from within the block.
- If the TAGs do NOT match, then there is a MISS.
If VALID bit is NOT active, then there is no data in the block frame, and there is a MISS.

In the case of a MISS, the memory access is sent to main memory, which responds by providing the block corresponding to the address. The block is placed in the cache, the VALID bit set, the data is placed in the block frame, and the TAG of the block is placed in the TAG portion of the block frame.

An Unrealistic, Simplified Example to promote understanding of how a cache works:

Addresses are 5 bits.
Blocks are 4 bytes.
Memory is byte addressable.
There are 4 blocks in the cache.
Assume that all accesses are to READ a single byte.

Assume the cache is empty at the start of the example.

     (index #)
     (line #)     valid  tag  data (in hex)
       00           0     ?   0x?? ?? ?? ??
       01           0     ?   0x?? ?? ?? ??
       10           0     ?   0x?? ?? ?? ??
       11           0     ?   0x?? ?? ?? ??

Memory is small enough that we can make up a complete example. Assume little endian byte numbering.

     address   contents
     (binary)   (hex)
      00000    aa bb cc dd
      00100    00 11 22 33
      01000    ff ee 01 23
      01100    45 67 89 0a
      10000    bc de f0 1a
      10100    2a 3a 4a 5a
      11000    6a 7a 8a 9a
      11100    1b 2b 3b 4b

(1)
First memory reference is to the byte at address 01101.
The address is broken into 3 fields:

     tag   index #       byte within block
      0        11             01

On line 11, the block is marked as invalid, therefore we have a cache MISS.

The block that address 01101 belongs to (4 bytes starting at address 01100) is brought into the cache, and the valid bit is set.

   (line number)  valid  tag  data (in hex)
       00           0     ?   0x?? ?? ?? ??
       01           0     ?   0x?? ?? ?? ??
       10           0     ?   0x?? ?? ?? ??
       11           1     0   0x45 67 89 0a

And, now the data requested can be supplied to the processor. It is the value 0x89.

(2)
Second memory reference is to the byte at address 01010.

The address is broken into 3 fields:

     tag   index #       byte within block
      0        10             10

On line 10, the block is marked as invalid, therefore we have a cache MISS.

The block that address 01010 belongs to (4 bytes starting at address 01000) is brought into the cache, and the valid bit is set.

   (line number)  valid  tag  data (in hex)
       00           0     ?   0x?? ?? ?? ??
       01           0     ?   0x?? ?? ?? ??
       10           1     0   0xff ee 01 23
       11           1     0   0x45 67 89 0a

And, now the data requested can be supplied to the processor. It is the value 0xee.

(3)
Third memory reference is to the byte at address 01111.

The address is broken into 3 fields:

     tag   index #       byte within block
      0        11             11

This line within the cache has its valid bit set, so there is a block (from memory) in the cache. BUT, is it the block that we want? The tag of the desired byte is checked against the tag of the block currently in the cache. They match, and therefore we have a HIT.

The value 0x45 (byte 11 within the block) is supplied to the processor.

(4)
Fourth memory reference is to the byte at address 11010.

The address is broken into 3 fields:

     tag   index #       byte within block
      1        10             10

   (line number)  valid  tag  data (in hex)
       00           0     ?   0x?? ?? ?? ??
       01           0     ?   0x?? ?? ?? ??
       10           1     1   0x6a 7a 8a 9a
       11           1     0   0x45 67 89 0a

The value 0x7a (byte 10 within the block) is supplied to the processor.

(5)
Fifth memory reference is to the byte at address 11011.

The address is broken into 3 fields:

     tag   index #       byte within block
      1        10             11

The value 0x6a (byte 11 within the block) is supplied to the processor.

Terminology

miss ratio = fraction of total memory accesses that miss

hit ratio = fraction of total memory accesses that hit = 1 - miss ratio

Often

cache: instruction cache 1 cycle
data cache 1 cycle
main memory 20 cycles

 average memory access time
                  (AMAT) = cache-access + miss-ratio * memory-access
			 =       1     +   0.02     *  20
			 =       1.4

Beyond the scope of this class:

block and block frames divided in "sets" (equivalence classes) to speed lookup.
terms:
- fully-associative
- set-associative
- direct-mapped

Typical cache size is 64K bytes, given a 64Mbyte memory
The cache is 20 times faster than main memory
The cache has 1/1000 the capacity of main memory
The cache often hits on 98% of the references!

Remember:

recently accessed blocks are in the cache (temporal locality)
the cache is smaller than main memory, so not all blocks are in the cache
blocks are larger than 1 word (spatial locality)

This idea of exploiting locality is (can be) done at many levels. Implement a hierarchical memory system:

smallest, fastest, most expensive memory: registers
relatively small, fast, expensive memory: CACHE
large, fast as possible, cheaper memory: main memory
largest, slowest, cheapest (per bit) memory: disk

registers are managed/assigned by compiler or asm. lang programmer
cache is managed/assigned by hardware or partially by OS
main memory is managed/assigned by OS
disk managed by OS

This hierarchical scheme works so well (reducing the AMAT), that many systems now include 2 levels of caches!

 -----------|----         ------        -------------
 |processor | L1|<------->| L2 |<------>|main memory|
 -----------|----         ------        -------------

Another common cache enhancement: use not one L1 cache, but two special purpose caches. One cache exclusively holds instructions (code), and the other holds data (everything else).

Benefits:

Each of the two caches can be designed to work in parallel. This means that both an instruction fetch memory request, and a data load/store can occur at the same time. This overlap results in a performance improvement.
The instruction cache (often called I-cache) can be relatively small, and still exhibit a rather high hit ratio. When an entire loop's code fits into the cache, and the loop is executed many times, (after the initial misses that bring the code into the cache) the hit ratio is 1 (perfect).

Amdahl's Law

Or, why the common case matters most.

speedup = new rate / old rate 

        = old execution time / new execution time

We program in some enhancement to part of our program. The fraction of time spent in that part of the code is f. The speedup of that part of the code (f) is S.

Let an enhancement speedup f fraction of the time by speedup S:

speedup = [(1-f)+f]*old time / (1-f) * old time + f/S * old time

	=    1
	  ---------
	  1-f + f/S

Values:

	    f	    S		speedup
	   ---	   ---		-------
	   95%	   1.10		1.094
	    5%	   10		1.047
	    5%	   inf		1.052

lim		   1
		---------	=  1/ (1-f)
S --> inf	1-f + f/S

	 f	speedup
	---	-------
	1%      1.01
	2%      1.02
	5%      1.05
	10%     1.11
	20%     1.25
	50%     2.00