DRAFT IN PROGRESS

CS/ECE 755 Project
Prof. David Wood

Motivation
===========
Most current-generation processors use prediction and speculation to
execute multiple instructions in parallel, thereby increasing performance.
Prediction is when a processor "guesses" something, e.g., whether or not
a branch will be taken. Speculation is when a processor performs an 
operation (e.g., executes an instruction) that will have to be undone 
if the prediction proves incorrect.

Current processors use a variety of "speculation recovery mechanisms"
to recover the correct architectural state when a misspeculation occurs.
For example, the MIPS R10000 checkpoints its register rename maps whenever
it predicts a branch; when a branch misprediction occurs, the processor
restores the maps to the appropriate checkpoint.

A common limitation of current speculation recovery mechanisms is that
they only allow speculation of a relatively small number of instructions.  
For example, the MIPS R10000's active list is only 32 entries deep, limiting
speculation to at most 32 instructions.  As memory latencies increase
relative to processor speed, many researchers believe that we need
new mechanisms that can recover from misspeculations spanning
substantially larger numbers of instructions.  For example, a level-2
cache miss on a current high-performance shared-memory multiprocessor
can take 250ns.  If the processor runs at 1 gigahertz and issues 4
instructions per cycle, the miss lasts 250 cycles, so there are 1000
instruction issue "opportunities" during the cache miss.

Fundamentally new mechanisms are needed to recover from misspeculations
of 1000s of instructions.  One idea is to simply checkpoint the registers
periodically, by implementing each register as a stack (or shift register).
Checkpoints can be taken by doing a "push".  Restores are done by doing
a "pop".  Because only the top of each stack is attached to the bit lines,
the bit line loading is nearly the same as for a normal register file.
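
The push/pop checkpoint idea can be illustrated with a small behavioral
model.  This is only a sketch: the class name, the method names, and the
depth limit of 4 are assumptions made for illustration, not part of the
project specification.

```python
class CheckpointedRegister:
    """One register modeled as a small stack of values (hypothetical model)."""

    def __init__(self, value=0, depth=4):
        self.depth = depth            # assumed limit on stacked values
        self.stack = [value]          # stack[-1] is the top that reads and writes see

    def read(self):
        return self.stack[-1]

    def write(self, value):
        self.stack[-1] = value        # only the top entry is ever written

    def checkpoint(self):
        """Push: duplicate the top so the old value is preserved below it."""
        if len(self.stack) == self.depth:
            raise OverflowError("no free checkpoint slot; the processor would stall")
        self.stack.append(self.stack[-1])

    def restore(self):
        """Pop: discard the speculative top and expose the checkpointed value."""
        if len(self.stack) == 1:
            raise IndexError("no checkpoint to restore")
        self.stack.pop()


r = CheckpointedRegister(value=7)
r.checkpoint()      # take a checkpoint before speculating
r.write(99)         # speculative update
r.restore()         # misspeculation detected
print(r.read())     # prints 7
```

Reads and writes touch only the top of the stack, which mirrors the
observation that only the top entry needs to be attached to the bit lines.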

Unfortunately, this register checkpoint scheme only handles registers;
it does nothing about memory.  If stores are allowed to update memory,
we must be able to undo these stores when a misspeculation is detected.

One possible scheme for dealing with this problem is called the 
Version Buffer (VB).  A version buffer is a special buffer that can 
maintain multiple versions of the same cache block.  In one proposed
implementation, the VB lives between the L1 and L2 caches.  The L2 cache
is completely normal and only holds "committed" or non-speculative state.
The L1 cache is a normal cache, except that each cache tag is extended
with a version number that is used to determine the appropriate action.
The VB sits between these two caches, and must be able to supply the
correct copy or invalidate the speculative copies as needed.

In this project, you will do the detailed design and layout of a small and
somewhat simplified version buffer.   You will need to use a combination of
full-custom and semi-custom (e.g., standard-cell) design in your
implementation.  More details will be forthcoming.

Requirements
------------

The VB holds cache blocks in a per-block write version (WV) order. When 
a new speculative version of a block is created, the old version is 
added to the tail of the VB.  When a speculative block is replaced 
from the L1 cache, it is inserted at the tail of the VB.  The VB head can be 
written to the L2 only if it is non-speculative.  Key VB mechanisms:

* Insert block at tail
* Remove non-speculative block from head
* Find the most-recently WRITTEN version of a block.
* Find the most-recently COMMITTED version of a block.
* Remove ALL versions of a block.
* Make clean ALL versions of a block.
* Make version i non-speculative
* Invalidate version i (and greater)

A real VB might have 128 entries, 8 possible versions, and 64 byte
data blocks. For this project, your VB will have 16 entries, 4 possible
versions, and 8 bit data blocks.  Addresses will also be 8 bits 
(rather than a more typical 40-50 bits).

L1 cache blocks are presented to the VB with:

* valid bit (V)
* address (8 bits)
* dirty bit (D)
* write version (WV) (2 bit binary encoding)
* read version set (RVS) (4 bit unary encoding)
* speculative flag (SF)
* data (8 bits)

Assume all addresses are physical. If SF=0 then WV is ignored.
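
The field list above can be written down as a record with a bit-level
packing.  This is a sketch under stated assumptions: the packing order
(valid first, data last) is arbitrary, "unary" is taken to mean one bit per
version in the RVS, and all names (L1Block, pack, unpack,
rvs_from_versions) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class L1Block:
    valid: int      # V, 1 bit
    address: int    # 8 bits
    dirty: int      # D, 1 bit
    wv: int         # write version, 2-bit binary encoding
    rvs: int        # read version set, 4 bits (assumed: bit v set means version v read)
    sf: int         # speculative flag, 1 bit
    data: int       # 8 bits


# (name, width) pairs in an assumed packing order, most significant field first.
FIELDS = [("valid", 1), ("address", 8), ("dirty", 1), ("wv", 2),
          ("rvs", 4), ("sf", 1), ("data", 8)]
TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(block):
    """Concatenate all fields into one integer, checking each field width."""
    word = 0
    for name, width in FIELDS:
        value = getattr(block, name)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Inverse of pack: peel fields off starting from the least significant end."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return L1Block(**values)


def rvs_from_versions(versions):
    """Build the 4-bit RVS from a set of version numbers in the range 0..3."""
    rvs = 0
    for v in versions:
        rvs |= 1 << v
    return rvs


blk = L1Block(valid=1, address=0x3C, dirty=1, wv=2,
              rvs=rvs_from_versions({0, 2}), sf=1, data=0xA5)
print(TOTAL_BITS, unpack(pack(blk)) == blk)
```

Under these assumptions a block presented by the L1 occupies 25 bits, which
gives a rough idea of the width of one VB entry in the project-sized design.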

The speed of the above mechanisms is not critical since they are
invoked only on creating an old version of a block, on L1 misses, and on
coherence requests not filtered by the L2 cache.  Creating new versions
is most common and occurs at a rate of once per 1/(S*C) instructions.
With S = 15% and C = 50%, 1/(0.15*0.50) is about 13 instructions; this is
too fast, but C is probably << 50%.
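
The rate estimate can be checked with a few lines of arithmetic.  S and C
are used here simply as the two fractions from the formula above; the
second call uses C = 5%, an assumed value chosen only to show how quickly
the interval grows once C drops well below 50%.

```python
def instructions_per_new_version(s, c):
    """Average number of instructions between new versions, 1/(S*C)."""
    return 1.0 / (s * c)


print(round(instructions_per_new_version(0.15, 0.50), 1))   # prints 13.3
print(round(instructions_per_new_version(0.15, 0.05), 1))   # prints 133.3
```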


Implementation 0: Partition by Version
--------------------------------------

Have V banks for V versions.  A reasonable design is 8 banks of 32
entries each.

Big picture.  Speculative old blocks and cache replacements with WV=w
are directed to bank w.  Non-speculative old blocks (SF=0) go to the
bank of the most-recently-committed version and overwrite a copy of the
same block if it is present.  This ensures that an address is present
at most once per bank (making standard CAMs sufficient).  Stalls are
used whenever resources are exhausted.


Each bank maintains:

* WV (possibly implicit)
* SF
* tail pointer (to write new block)
* head pointer (to read block into L2 cache)
* a block array.

The block array has 32 entries that can be accessed via (1) a
fully-associative valid-address match and (2) direct addressing via
the head or tail pointer.

Each entry has:

* valid bit (V) (1b)
* address (50b)
* dirty bit (D) (1b)
* read version set (RVS) (8b)
* data (64B)

Address and valid bit are implemented as CAM. Must support commands to
clear (1) all valid bits, (2) the valid bit of an address that matches
and (3) the dirty bit of an address that matches.


VB mechanisms are implemented as follows.

* Insert block at tail

    Pick an appropriate bank for speculative or non-speculative block B.
    If B is already in the bank, overwrite it
	(only possible for non-speculative blocks).
    If space is available, increment tail pointer and write B.
    Stall otherwise.
    
* Remove non-speculative block from head

   Operate on "oldest" bank.
   If bank is speculative, do nothing.
   If tail!=head, examine head:
	 if valid, write to L2 cache and invalidate;
	 in any case, increment head.
   If tail==head and bank is not the most-recently committed,
       move to next newer bank.

* Find the most-recently WRITTEN version of a block.

    Search all (or all active) banks in parallel with address B.
    Return result from most recent bank (Do we delete it in VB?)
    If no result found, use block from L2 cache or memory

* Find the most-recently COMMITTED version of a block.

    The same as above except only consider non-speculative banks.

* Remove ALL versions of a block.

    Search all (or all active) banks in parallel with address B.
    Ask those blocks to clear their valid bit.

* Make clean ALL versions of a block.

    Ditto for dirty bit.

* Make version i non-speculative

    Reset SF for bank i.

* Invalidate version i (and greater)

    Clear the valid bits in all blocks of bank i (and greater) and
    reset head and tail pointers.  (This could be done lazily but is
    necessary to re-initialize a bank.)
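
The mechanisms above can be prototyped as a behavioral model before any
circuit design.  The sketch below makes several assumptions that are not
in the text: project-sized banks (4 banks of 4 entries), version numbers
that do not wrap (bank i is older than bank i+1), version 0 starting out
committed, wrap-around head/tail pointers with an occupancy count, and no
RVS field.  All class and method names are hypothetical.

```python
class Bank:
    """One bank holding the blocks of a single write version."""

    def __init__(self, size, sf):
        self.size, self.sf = size, sf
        self.reset()

    def reset(self):
        self.slots = [None] * self.size       # None models a cleared valid bit
        self.head = self.tail = self.count = 0

    def match(self, addr):
        """CAM lookup: index of the valid slot holding addr, or None."""
        for i, slot in enumerate(self.slots):
            if slot is not None and slot["addr"] == addr:
                return i
        return None


class VersionBuffer:
    def __init__(self, num_banks=4, bank_size=4):
        # Assumed: version 0 starts committed, later versions start speculative.
        self.banks = [Bank(bank_size, sf=(v != 0)) for v in range(num_banks)]
        self.oldest = 0                       # bank currently draining to L2

    def newest_committed(self):
        return max(v for v, b in enumerate(self.banks) if not b.sf)

    def insert(self, addr, data, dirty, sf, wv=0):
        """Insert block at tail; False means the caller must stall."""
        bank = self.banks[wv if sf else self.newest_committed()]
        entry = {"addr": addr, "data": data, "dirty": dirty}
        hit = bank.match(addr)
        if hit is not None:                   # only expected for non-speculative blocks
            bank.slots[hit] = entry
            return True
        if bank.count == bank.size:
            return False
        bank.slots[bank.tail] = entry
        bank.tail = (bank.tail + 1) % bank.size
        bank.count += 1
        return True

    def remove_head(self):
        """Return the entry to write to L2, or None if nothing is written."""
        bank = self.banks[self.oldest]
        if bank.sf:
            return None
        if bank.count:                        # tail != head
            entry = bank.slots[bank.head]     # None if already invalidated
            bank.slots[bank.head] = None
            bank.head = (bank.head + 1) % bank.size
            bank.count -= 1
            return entry
        if self.oldest != self.newest_committed():
            self.oldest += 1                  # move to the next newer bank
        return None

    def find(self, addr, committed_only=False):
        """Most recently written (or committed) version; None means go to L2."""
        for bank in reversed(self.banks):     # newest version first
            if committed_only and bank.sf:
                continue
            hit = bank.match(addr)
            if hit is not None:
                return bank.slots[hit]
        return None

    def remove_all(self, addr):
        for bank in self.banks:
            hit = bank.match(addr)
            if hit is not None:
                bank.slots[hit] = None

    def make_clean(self, addr):
        for bank in self.banks:
            hit = bank.match(addr)
            if hit is not None:
                bank.slots[hit]["dirty"] = False

    def commit(self, version):
        self.banks[version].sf = False        # reset SF for bank i

    def invalidate_from(self, version):
        for bank in self.banks[version:]:     # bank i and greater
            bank.reset()
```

A short usage example: insert a committed copy and a speculative version 1
of the same address, then squash version 1.

```python
vb = VersionBuffer()
vb.insert(0x10, 0xAA, True, sf=False)         # goes to committed bank 0
vb.insert(0x10, 0xBB, True, sf=True, wv=1)    # goes to bank 1
print(vb.find(0x10)["data"] == 0xBB)          # prints True
vb.invalidate_from(1)
print(vb.find(0x10)["data"] == 0xAA)          # prints True
```

The model leaves the open question above (whether a found block is deleted
from the VB) unanswered: find() returns the entry without removing it.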


Implementation Note: The above implementation uses one bank per version.
This could be inefficient if not all possible versions are used or
there is great burstiness in the number of blocks created per version.

An alternative is to allow a version to use more than one bank.  When
version i fills up bank i it does not stall but begins filling bank i+1.
When version i+1 starts it uses bank i+2 instead.  This alternative adds a
little complexity on mapping blocks to versions but may help on bursts.
Please note that most VB mechanisms are not affected by this change,
because each bank only contains one version and a block at most once
(e.g., the search logic for speculative or committed blocks is the same).
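
One way to picture the extra mapping logic is a small allocator that hands
out banks in order and records which version owns each one.  This is a
hypothetical sketch: the class and method names are invented, banks are
never recycled, and a full VB is reported by returning None (a stall).

```python
class BankAllocator:
    """Hands out banks in order and remembers which version owns each one."""

    def __init__(self, num_banks):
        self.num_banks = num_banks
        self.owner = []                 # owner[b] is the version using bank b

    def _grab(self, version):
        if len(self.owner) == self.num_banks:
            return None                 # no free bank: the caller must stall
        self.owner.append(version)
        return len(self.owner) - 1

    def start_version(self, version):
        """A new version begins in the next unused bank."""
        return self._grab(version)

    def overflow(self, version):
        """The version's current bank is full, so it spills into the next bank."""
        return self._grab(version)

    def banks_of(self, version):
        return [b for b, v in enumerate(self.owner) if v == version]


alloc = BankAllocator(num_banks=8)
alloc.start_version(0)      # version 0 starts in bank 0
alloc.overflow(0)           # bank 0 fills up, version 0 continues in bank 1
alloc.start_version(1)      # version 1 therefore starts in bank 2
print(alloc.banks_of(0), alloc.banks_of(1))   # prints [0, 1] [2]
```

Because each bank still belongs to exactly one version, per-bank search,
commit, and invalidate can be applied to every bank in banks_of(i).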


Implementation 1: Partition by Address
--------------------------------------

Partition the VB into several banks, say 8, by address, since no ordering
is needed between blocks of different addresses.

TODO.


Interface Definition:
----------------------

There are two interfaces to the version buffer, one from the
L1 cache and one from the L2 cache. 

L1 Interface:
	input	L1_command<2>
	input	L1_address<8>
	io	L1_valid<1>
	io	L1_dirty<1>
	io	L1_WV<2>
	io	L1_RVS<4>
	io	L1_SF<1>
	io	L1_data<8>
	output	L1_status<1>

The L1_commands are:
	0	No Operation
	1	Insert block at tail
	2	Find the most-recently WRITTEN version of a block.

L1_status responses are:
	0	Not OK (no room or miss)
	1	OK


L2 Interface:
	input	L2_command<3>
	input	L2_VN<2>
	io	L2_address<8>
	output	L2_valid<1>
	output	L2_dirty<1>
	output	L2_data<8>
	output	L2_status<1>

L2_commands:
	0	No Operation
	1	Remove ALL versions of a block specified by L2_address
	2	Remove non-speculative block from head
	3	Find the most-recently COMMITTED version of a block.
	4	Make clean ALL versions of a block specified by L2_address 
	5	Make all blocks in version i (L2_VN) non-speculative
	6	Invalidate all blocks in version i (L2_VN) and greater

L2_status:
	0	Not OK (miss, or no non-speculative block)
	1	OK
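
For a testbench or behavioral simulator, the command encodings above can
be captured as enumerations.  The numeric values come from the tables; the
member names, the helper functions, and the choice to treat the unused L2
encoding 7 as "no command" are assumptions made for this sketch.

```python
from enum import IntEnum


class L1Command(IntEnum):
    NOP = 0
    INSERT_AT_TAIL = 1
    FIND_WRITTEN = 2


class L2Command(IntEnum):
    NOP = 0
    REMOVE_ALL = 1
    REMOVE_HEAD = 2
    FIND_COMMITTED = 3
    MAKE_CLEAN = 4
    COMMIT_VERSION = 5        # uses L2_VN
    INVALIDATE_FROM = 6       # uses L2_VN


L1_COMMAND_BITS = 2           # width of L1_command
L2_COMMAND_BITS = 3           # width of L2_command


def fits(enum_cls, bits):
    """True if every command code fits in the given bus width."""
    return all(0 <= int(c) < (1 << bits) for c in enum_cls)


def decode_l2(raw):
    """Map a raw 3-bit value to an L2Command, or None for an unused code."""
    try:
        return L2Command(raw)
    except ValueError:
        return None


print(fits(L1Command, L1_COMMAND_BITS), fits(L2Command, L2_COMMAND_BITS))
print(decode_l2(5).name, decode_l2(7))
```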