DRAFT IN PROGRESS

CS/ECE 755 Project
Prof. David Wood

Motivation
===========

Most current-generation processors use prediction and speculation to execute multiple instructions in parallel, thereby increasing performance. Prediction is when a processor "guesses" something, e.g., whether or not a branch will be taken. Speculation is when a processor performs an operation (e.g., executes an instruction) that will have to be undone if the prediction proves incorrect. Current processors use a variety of "speculation recovery mechanisms" to recover the correct architectural state when a misspeculation occurs. For example, the MIPS R10000 checkpoints its register rename maps whenever it takes a branch; when a branch misprediction occurs, the processor restores the maps to the appropriate checkpoint.

A common limitation of current speculation recovery mechanisms is that they only allow speculation over a relatively small number of instructions. For example, the MIPS R10000's active list is only 32 entries deep, limiting speculation to at most 32 instructions. As memory latencies increase relative to processor speed, many researchers believe that we need new mechanisms that can recover from misspeculations spanning substantially larger numbers of instructions. For example, a level-2 cache miss on a current high-performance shared-memory multiprocessor can take 250ns. If the processor runs at 1 gigahertz and issues 4 instructions per cycle, the miss lasts 250 cycles, which means there are 250 * 4 = 1000 instruction issue "opportunities" during the cache miss. Fundamentally new mechanisms are needed to recover from misspeculations of 1000s of instructions.

One idea is to simply checkpoint the registers periodically, by implementing each register as a stack (or shift register). Checkpoints are taken by doing a "push". Restores are done by doing a "pop". Because only the top of each stack is attached to the bit lines, the bit line loading is nearly the same as for a normal register file.
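The arithmetic and the push/pop checkpoint idea above can be sketched in a few lines of Python. This is only a behavioral illustration, not a circuit description: the names `issue_opportunities` and `CheckpointedRegister`, the stack depth of 4, and the choice to drop the oldest checkpoint when the stack is full are all assumptions made for the example.

```python
def issue_opportunities(miss_latency_ns, clock_ghz, issue_width):
    """Instruction issue slots that elapse during one cache miss."""
    cycles = miss_latency_ns * clock_ghz   # ns * (cycles per ns)
    return int(cycles * issue_width)


class CheckpointedRegister:
    """One register modeled as a fixed-depth stack (hypothetical model).

    Slot 0 is the top of the stack, the only slot that normal reads
    and writes touch (the only one "attached to the bit lines").
    """

    def __init__(self, depth=4, value=0):
        self.slots = [value] * depth

    def read(self):
        return self.slots[0]

    def write(self, value):
        self.slots[0] = value

    def checkpoint(self):
        # "push": shift every slot down by one and keep a copy of the
        # current value on top. The oldest checkpoint falls off the end
        # (an assumption; a real design might stall instead).
        self.slots = [self.slots[0]] + self.slots[:-1]

    def restore(self):
        # "pop": discard the speculative top and shift every slot up.
        self.slots = self.slots[1:] + [self.slots[-1]]
```

With the numbers from the text, `issue_opportunities(250, 1, 4)` evaluates to 1000. A register that is checkpointed, overwritten speculatively, and then restored reads back its pre-checkpoint value.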
Unfortunately, this register checkpoint scheme only handles registers; it does nothing about memory. If stores are allowed to update memory, we must be able to undo these stores when a misspeculation is detected.

One possible scheme for dealing with this problem is called the Version Buffer (VB). A version buffer is a special buffer that can maintain multiple versions of the same cache block. In one proposed implementation, the VB lives between the L1 and L2 caches. The L2 cache is completely normal and only holds "committed" or non-speculative state. The L1 cache is a normal cache, except that each cache tag is extended with a version number that is used to determine the appropriate action. The VB sits between these two caches, and must be able to supply the correct copy or invalidate the speculative copies as needed.

In this project, you will do the detailed design and layout of a small and somewhat simplified version buffer. You will need to use a combination of full-custom and semi-custom (e.g., standard-cell) design in your implementation. More details will be forthcoming soon.

Requirements
------------

The VB holds cache blocks in a per-block write version (WV) order. When a new speculative version of a block is created, the old version is added to the tail of the VB. When a speculative block is replaced from the L1 cache, it is inserted at the tail of the VB. The VB head can be written to the L2 only if it is non-speculative.

Key VB mechanisms:

* Insert block at tail
* Remove non-speculative block from head
* Find the most-recently WRITTEN version of a block.
* Find the most-recently COMMITTED version of a block.
* Remove ALL versions of a block.
* Make clean ALL versions of a block.
* Make version i non-speculative
* Invalidate version i (and greater)

A real VB might have 128 entries, 8 possible versions, and 64-byte data blocks. For this project, your VB will have 16 entries, 4 possible versions, and 8-bit data blocks.
Addresses will also be 8 bits (rather than a more typical 40-50 bits).

L1 cache blocks are presented to the VB with:

* valid bit (V)
* address (8 bits)
* dirty bit (D)
* write version (WV) (2 bit binary encoding)
* read version set (RVS) (4 bit unary encoding)
* speculative flag (SF)
* data (8 bits)

Assume all addresses are physical. If SF=0 then WV is ignored.

The speed of the above mechanisms is not critical, since they are invoked only on creating an old version of a block, on L1 misses, and on coherence requests not filtered by the L2 cache. Creating new versions is the most common case and occurs at a rate of once per 1/(S*C) instructions (1/(15%*50%) = about 13 instructions; this is too fast, but C is probably << 50%).

Implementation 0: Partition by Version
--------------------------------------

Have V banks for V versions. A reasonable design is 8 banks of 32 entries each.

Big picture: Speculative old blocks and cache replacements with WV=w are directed to bank w. Non-speculative old blocks (SF=0) go to the bank of the most-recently-committed version and overwrite a copy of the same block if it is present. This ensures that an address is present at most once per bank (making standard CAMs sufficient). Stalls are used whenever resources are exhausted.

Each bank maintains:

* WV (possibly implicit)
* SF
* tail pointer (to write new block)
* head pointer (to read block into L2 cache)
* a block array.

The block array has 32 entries that can be accessed via (1) a fully-associative valid-address match and (2) direct addressing via the head or tail pointer. Each entry has:

* valid bit (V) (1b)
* address (50b)
* dirty bit (D) (1b)
* read version set (RVS) (8b)
* data (64B)

The address and valid bit are implemented as a CAM. The CAM must support commands to clear (1) all valid bits, (2) the valid bit of an address that matches, and (3) the dirty bit of an address that matches.

VB mechanisms are implemented as follows.
* Insert block at tail
    Pick an appropriate bank for speculative or non-speculative block B.
    If B is already in the bank, overwrite it (only possible for non-speculative blocks).
    If space is available, increment the tail pointer and write B.
    Stall otherwise.

* Remove non-speculative block from head
    Operate on the "oldest" bank.
    If the bank is speculative, do nothing.
    If tail!=head, examine the head entry: if it is valid, write it to the L2 cache and invalidate it; in any case, increment head.
    If tail==head and the bank is not the most-recently committed, move to the next newer bank.

* Find the most-recently WRITTEN version of a block.
    Search all (or all active) banks in parallel with address B.
    Return the result from the most recent bank. (Do we delete it in VB?)
    If no result is found, use the block from the L2 cache or memory.

* Find the most-recently COMMITTED version of a block.
    The same as above, except only consider non-speculative banks.

* Remove ALL versions of a block.
    Search all (or all active) banks in parallel with address B.
    Ask the matching blocks to clear their valid bit.

* Make clean ALL versions of a block.
    Ditto for the dirty bit.

* Make version i non-speculative
    Reset SF for bank i.

* Invalidate version i (and greater)
    Clear the valid bits in all blocks of bank i (and greater) and reset the head and tail pointers. (This could be done lazily but is necessary to re-initialize a bank.)

Implementation Note:

The above implementation uses one bank per version. This could be inefficient if not all possible versions are used or if there is great burstiness in the number of blocks created per version. An alternative is to allow a version to use more than one bank. When version i fills up bank i, it does not stall but begins filling bank i+1. When version i+1 starts, it uses bank i+2 instead. This alternative adds a little complexity in mapping blocks to versions but may help on bursts.
Please note that most VB mechanisms are not affected by this change, because each bank only contains one version and a block at most once (e.g., the search logic for speculative or committed blocks is the same).

Implementation 1: Partition by Address
--------------------------------------

Partition the VB into several banks, say 8, by address, since no ordering is needed between blocks of different addresses. TODO.

Interface Definition:
----------------------

There are two interfaces to the version buffer, one from the L1 cache and one from the L2 cache.

L1 Interface:

    input   L1_command<2>
    input   L1_address<8>
    io      L1_valid<1>
    io      L1_dirty<1>
    io      L1_WV<2>
    io      L1_RVS<4>
    io      L1_SF<1>
    io      L1_data<8>
    output  L1_status<1>

The L1_commands are:

    0  No Operation
    1  Insert block at tail
    2  Find the most-recently WRITTEN version of a block.

L1_status responses are:

    0  Not OK (no room or miss)
    1  OK

L2 Interface:

    input   L2_command<3>
    input   L2_VN<2>
    io      L2_address<8>
    output  L2_valid<1>
    output  L2_dirty<1>
    output  L2_data<8>
    output  L2_status<1>

L2_commands:

    0  No Operation
    1  Remove ALL versions of a block specified by L2_address
    2  Remove non-speculative block from head
    3  Find the most-recently COMMITTED version of a block.
    4  Make clean ALL versions of a block specified by L2_address
    5  Make all blocks in version i (L2_VN) non-speculative
    6  Invalidate all blocks in version i (L2_VN) and greater

L2_status:

    0  Not OK (miss, or no non-speculative block)
    1  OK
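As a way to check one's understanding of the commands above before starting the layout, the partition-by-version scheme (Implementation 0) can be modeled behaviorally in Python. This is a sketch under several assumptions that are not part of the project definition: the 16 entries are split evenly into 4 banks of 4, version numbers are ordered 0 < 1 < 2 < 3 with no wraparound, bank 0 starts out non-speculative, the RVS field is omitted, and `remove_head` skips over invalidated entries in one call rather than one entry per command. All class and method names are hypothetical.

```python
from dataclasses import dataclass

NUM_VERSIONS = 4       # project size: 4 possible versions, one bank each
ENTRIES_PER_BANK = 4   # assumption: 16 entries split evenly across banks


@dataclass
class Entry:
    address: int
    data: int
    dirty: bool
    valid: bool = True


class Bank:
    def __init__(self, speculative=True):
        self.fifo = []                  # index 0 is the head, the end is the tail
        self.speculative = speculative  # per-bank SF

    def match(self, address):
        """CAM-style lookup: the valid entry with this address, or None."""
        for e in self.fifo:
            if e.valid and e.address == address:
                return e
        return None


class VersionBuffer:
    def __init__(self):
        # Assumption: version 0 is already committed at reset.
        self.banks = [Bank(speculative=(v != 0)) for v in range(NUM_VERSIONS)]
        self.oldest = 0                 # bank currently drained toward the L2

    def _committed(self):
        return [i for i, b in enumerate(self.banks) if not b.speculative]

    def insert(self, address, data, dirty, wv, sf):
        """L1 command 1. Returns 1 (OK) or 0 (no room, caller stalls)."""
        # SF=0 ignores WV and targets the most-recently-committed bank.
        bank = self.banks[wv] if sf else self.banks[self._committed()[-1]]
        hit = bank.match(address)
        if hit is not None:             # expected only for non-speculative blocks
            hit.data, hit.dirty = data, dirty
            return 1
        if len(bank.fifo) >= ENTRIES_PER_BANK:
            return 0
        bank.fifo.append(Entry(address, data, dirty))
        return 1

    def find(self, address, committed_only=False):
        """L1 command 2 (WRITTEN) or L2 command 3 (COMMITTED)."""
        for bank in reversed(self.banks):        # newest version first
            if committed_only and bank.speculative:
                continue
            hit = bank.match(address)
            if hit is not None:
                return hit
        return None                              # miss: use L2 cache or memory

    def remove_all(self, address):               # L2 command 1
        for bank in self.banks:
            hit = bank.match(address)
            if hit is not None:
                hit.valid = False

    def make_clean(self, address):               # L2 command 4
        for bank in self.banks:
            hit = bank.match(address)
            if hit is not None:
                hit.dirty = False

    def commit(self, version):                   # L2 command 5
        self.banks[version].speculative = False

    def invalidate_from(self, version):          # L2 command 6
        for bank in self.banks[version:]:
            bank.fifo.clear()                    # clears entries, resets head/tail

    def remove_head(self):                       # L2 command 2
        """Return the next valid committed entry for the L2, or None."""
        while True:
            bank = self.banks[self.oldest]
            if bank.speculative:
                return None
            if bank.fifo:
                entry = bank.fifo.pop(0)         # "increment head"
                if entry.valid:
                    return entry
                continue                         # skip an invalidated entry
            if self.oldest == self._committed()[-1]:
                return None
            self.oldest += 1                     # move to the next newer bank
```

For example, after `vb.insert(0x10, 0xAA, True, wv=0, sf=0)` and `vb.insert(0x10, 0xBB, True, wv=1, sf=1)`, `vb.find(0x10)` returns the version-1 entry while `vb.find(0x10, committed_only=True)` returns the committed copy in bank 0. Returning `None` corresponds to a status of 0 (Not OK) on the hardware interface.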