CS/ECE 552: Introduction to Computer Architecture
Spring 2006
Professor:  David Wood
Teaching Assistant:  Andy Phelps
 
Final Project 
WISC-SP06 Architecture & Implementation
Instruction Set Specification

Description

WISC-SP06 is a load/store architecture, similar to the MIPS R2000 architecture, but restricted to 16-bit words and a smaller instruction set.  The architecture has 8 registers, R0 through R7.  All instructions and registers are 16 bits wide and use two's complement arithmetic. R0 is not always zero; it acts like any other register. Memory is  word addressable. The program counter is separate from the general purpose registers.  The register R7 is treated as the link register.  When a subroutine call is made using the JAL or JALR instructions, the address of the next instruction after the jump (i.e., PC+1) is saved in R7.

Finally, for extra credit, a specification is provided for a mechanism to generate an interrupt after executing some count of instructions. The Instruction Issue Counter, IIC, is a 16-bit register which is decremented for each instruction issued and stops counting when it reaches zero. When the register is decremented from one to zero, an interrupt is generated, and the PC is saved in a register called EPC. To return from the interrupt, there is an RTI instruction which loads PC from EPC.
 
 

Instruction formats

WISC-SP06 supports four instruction formats: the J-format, two I-formats, and the R-format. These are described below.

J-format

The J-format is used for jump instructions that need a large displacement.
 
 

J-Format

5 bits    11 bits
Op Code   Displacement

Jump Instructions

The Jump instruction loads the PC with the value found by adding the PC of the next instruction (PC+1, not PC+4 as in MIPS) to the sign-extended displacement.

The Jump-And-Link instruction loads the PC with the same value and also saves the address of the next sequential instruction (i.e., PC+1) in the link register R7.
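The target calculation for J and JAL can be sketched as follows. This is a minimal illustration, not part of the specification: the helper names (sign_extend, jump_target, jal) are hypothetical, and the wrap to 16 bits assumes the PC is 16 bits wide like the other registers.

```python
MASK16 = 0xFFFF  # assumption: the PC is 16 bits wide

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def jump_target(pc, disp11):
    """J: PC <- PC + 1 + D(sign ext.), where D is the 11-bit displacement."""
    return (pc + 1 + sign_extend(disp11, 11)) & MASK16

def jal(pc, disp11):
    """JAL: return (new_pc, r7), where R7 receives PC + 1."""
    return jump_target(pc, disp11), (pc + 1) & MASK16
```

A displacement of 0x7FF is -1 after sign extension, so a J with that displacement jumps back to itself.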

The syntax of the jump instructions is:

I-format

I-format instructions use either a destination register, a source register, and a 5-bit immediate value; or a destination register and an 8-bit immediate value.  The two types of I-format instructions are described below.

I-format 1 Instructions

 

I-format 1

5 bits    3 bits   3 bits   5 bits
Op Code   Rs       Rd       Immediate

The I-format 1 instructions include XOR-Immediate, ANDN-Immediate, Add-Immediate, Subtract-Immediate, Rotate-Left-Immediate, Shift-Left-Logical-Immediate, Shift-Right-Arithmetic-Immediate, Shift-Right-Logical-Immediate, Load, Store, and Store with Update.

The ANDNI instruction loads register Rd with the value of the register Rs AND-ed with the one's complement of the zero-extended immediate value. (It may be thought of as a bit-clear instruction.)  ADDI loads register Rd with the sum of the value of the register Rs plus the sign-extended immediate value.  SUBI loads register Rd with the result of subtracting register Rs from the sign-extended immediate value.  (That is, immed - Rs, not Rs - immed.)  Similar instructions have similar semantics, i.e. the logical instructions have zero-extended values and the arithmetic instructions have sign-extended values.
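The difference between sign-extended and zero-extended immediates, and the reversed operand order of SUBI, can be checked with a small model. This is a sketch only; the function names are hypothetical and results are masked to 16 bits to mimic the register width.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def zero_extend(value, bits):
    """Keep only the low `bits` bits; the upper bits become zero."""
    return value & ((1 << bits) - 1)

def addi(rs, imm5):
    return (rs + sign_extend(imm5, 5)) & 0xFFFF

def subi(rs, imm5):
    # Note the order: immediate minus Rs.
    return (sign_extend(imm5, 5) - rs) & 0xFFFF

def xori(rs, imm5):
    return (rs ^ zero_extend(imm5, 5)) & 0xFFFF

def andni(rs, imm5):
    # Clears every bit of Rs that is set in the immediate.
    return rs & ~zero_extend(imm5, 5) & 0xFFFF
```

For example, subi(3, 10) yields 7, while subi(10, 3) yields 0xFFF9 (the 16-bit encoding of -7).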

For Load and Store instructions, the effective address of the operand to be read or written is calculated by adding the sign-extended immediate value to the value in register Rs.  The value is loaded to or stored from register Rd. The STU instruction, Store with Update, acts like Store but also writes the effective address back to Rs.
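A behavioral sketch of LD, ST, and STU is given here. The register file is modeled as a Python list and memory as a dictionary keyed by word address; both are illustrative assumptions, as is the choice to read uninitialized memory as zero.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def effective_address(regs, rs, imm5):
    """Rs + I(sign ext.), wrapped to a 16-bit word address."""
    return (regs[rs] + sign_extend(imm5, 5)) & 0xFFFF

def ld(regs, mem, rd, rs, imm5):
    # Assumption: uninitialized memory reads as zero in this model.
    regs[rd] = mem.get(effective_address(regs, rs, imm5), 0)

def st(regs, mem, rd, rs, imm5):
    mem[effective_address(regs, rs, imm5)] = regs[rd]

def stu(regs, mem, rd, rs, imm5):
    # Store first, then write the effective address back into Rs.
    addr = effective_address(regs, rs, imm5)
    mem[addr] = regs[rd]
    regs[rs] = addr
```

With R1 = 0x0100 and an immediate of -2, STU stores to address 0x00FE and leaves R1 = 0x00FE, so a following LD with immediate 0 reads the same word back.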

The syntax of the I-format 1 instructions is:

I-format 2 Instructions

 

I-format 2

5 bits    3 bits   8 bits
Op Code   Rs       Immediate

The Load Byte Immediate instruction loads Rs with a sign-extended 8 bit immediate value.

The Shift-and-Load-Byte-Immediate instruction shifts Rs 8 bits to the left, and replaces the lower 8 bits with the immediate value.
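LBI and SLBI are typically used together to build a full 16-bit constant. The following sketch models both; the function names are hypothetical, and the SLBI immediate is treated as zero-extended, as the summary table states.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def lbi(imm8):
    """LBI: Rs <- I(sign ext.), returned as a 16-bit register value."""
    return sign_extend(imm8, 8) & 0xFFFF

def slbi(rs, imm8):
    """SLBI: Rs <- (Rs << 8) | I(zero ext.)."""
    return ((rs << 8) | (imm8 & 0xFF)) & 0xFFFF

# Build the constant 0x12F4: load the high byte, then shift in the low byte.
value = slbi(lbi(0x12), 0xF4)
```

Because SLBI shifts out the upper byte, the sign extension performed by LBI does not affect the final constant.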

The format of these instructions is:

The Jump-Register instruction loads the PC with the value of register Rs + signed immediate.  The Jump-And-Link-Register instruction does the same and also saves the return address (i.e., the address of the JALR  instruction plus one) in the link register R7. The format of these instructions is

The branch instructions test a general purpose register for some condition. The available conditions are: equal to zero, not equal to zero, less than zero, and greater than or equal to zero. If the condition holds, the signed immediate is added to the address of the next sequential instruction and loaded into the PC. The format of the branch instructions is
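The branch behavior can be summarized in a few lines. In this sketch the names CONDITIONS and next_pc are hypothetical; the register value is interpreted as a 16-bit two's complement number before the test.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

# Each condition tests the signed value of Rs.
CONDITIONS = {
    "BEQZ": lambda v: v == 0,
    "BNEZ": lambda v: v != 0,
    "BLTZ": lambda v: v < 0,
    "BGEZ": lambda v: v >= 0,
}

def next_pc(op, rs_value, pc, imm8):
    """Return PC + 1 + I(sign ext.) if the branch is taken, else PC + 1."""
    taken = CONDITIONS[op](sign_extend(rs_value, 16))
    offset = sign_extend(imm8, 8) if taken else 0
    return (pc + 1 + offset) & 0xFFFF
```

For instance, a BEQZ at address 0x20 with immediate 0xFE (that is, -2) and Rs = 0 continues at 0x1F.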

 

R-format

R-format instructions use only registers for operands.
 

R-format

5 bits    3 bits   3 bits   3 bits   2 bits
Op Code   Rs       Rt       Rd       Op Code Extension

ALU and Shift Instructions

The ALU and shift R-format instructions are similar to I-format 1 instructions, but do not require an immediate value.  In each case, the value of Rt is used in place of the immediate.  No extension of its value is required.  In the case of shift instructions, all but the 4 least-significant bits of Rt are ignored.
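The rule that only the 4 least-significant bits of Rt are used as the shift amount can be illustrated with a short model of the four shift and rotate operations. The function names are hypothetical and this is a behavioral sketch, not a description of the required hardware.

```python
def to_signed16(x):
    """Interpret a 16-bit value as two's complement."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def rol(rs, rt):
    n = rt & 0xF                      # only the low 4 bits of Rt matter
    rs &= 0xFFFF
    return ((rs << n) | (rs >> (16 - n))) & 0xFFFF

def sll(rs, rt):
    return (rs << (rt & 0xF)) & 0xFFFF

def sra(rs, rt):
    # Python's >> on a negative int replicates the sign bit.
    return (to_signed16(rs) >> (rt & 0xF)) & 0xFFFF

def srl(rs, rt):
    return (rs & 0xFFFF) >> (rt & 0xF)
```

An Rt value of 0x14 therefore behaves exactly like a shift amount of 4.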

The ADD instruction performs signed addition. The SUB instruction subtracts Rs from Rt. (That is, Rt - Rs, not Rs - Rt.) The set instructions SEQ, SLT, and SLE compare the values in Rs and Rt and set the destination register Rd to 0x1 if the comparison is true, and 0x0 if the comparison is false. SEQ checks for Rs equal to Rt, SLT checks for Rs less than Rt, and SLE checks for Rs less than or equal to Rt.   (Rs and Rt are two's complement numbers.) The set instruction SCO will set Rd to 0x1 if Rs plus Rt would generate a carry-out from the most significant bit; otherwise it sets Rd to 0x0. The Bit-Reverse instruction, BTR, takes a single operand Rs and copies it to Rd with the bit order reversed; i.e. bit 0 goes to bit 15, bit 1 goes to bit 14, etc.
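The operand order of SUB, the signed comparisons, the carry-out test, and the bit reversal can be captured in a short reference model. This is a sketch with hypothetical function names; register values are assumed to be passed as unsigned 16-bit integers.

```python
def to_signed16(x):
    """Interpret a 16-bit value as two's complement."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def sub(rs, rt):
    return (rt - rs) & 0xFFFF         # Rt - Rs, not Rs - Rt

def seq(rs, rt):
    return int((rs & 0xFFFF) == (rt & 0xFFFF))

def slt(rs, rt):
    return int(to_signed16(rs) < to_signed16(rt))

def sle(rs, rt):
    return int(to_signed16(rs) <= to_signed16(rt))

def sco(rs, rt):
    # Carry out of bit 15 of the unsigned 16-bit sum.
    return int((rs & 0xFFFF) + (rt & 0xFFFF) > 0xFFFF)

def btr(rs):
    # Reverse the 16-character binary string: bit i moves to bit 15 - i.
    return int(format(rs & 0xFFFF, "016b")[::-1], 2)
```

Note that slt(0xFFFF, 0) is 1 because 0xFFFF represents -1, whereas an unsigned comparison would give 0.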

The syntax of the R-format ALU and shift instructions is:


Special Instructions

The HALT instruction halts the processor.  The HALT instruction and all older instructions execute normally, but the instruction after the halt will never execute. The PC is left pointing to the instruction directly after the halt.

The No-operation instruction occupies a position in the pipeline, but does nothing.

The syntax of these instructions is:

Instruction Counter and Interrupt

These instructions are used with the extra-credit interrupt mechanism. These instructions should remain equivalent to NOP until the rest of the design has been completed and thoroughly tested.

SIIC sets the Instruction Issue Counter to the value specified in Rs. The 16-bit Instruction Issue Counter will then start decrementing with each subsequent instruction issued until it has decremented to zero. If it is loaded with zero, it will remain zero and will not generate any interrupt. The timing of the load is such that, if the IIC is loaded with a one, then exactly one instruction after the SIIC will issue prior to the interrupt being generated. If loaded with a two, exactly two instructions will issue, and so forth. When the interrupt is generated, the EPC register will be loaded with the address of the next sequential instruction to be executed, and PC will be loaded with the constant "1".
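The counting rules above can be modeled as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, the SIIC instruction itself is assumed not to decrement the counter, and the caller supplies the address of the instruction that would execute next.

```python
class IssueCounter:
    """Behavioral model of the IIC and EPC registers."""

    def __init__(self):
        self.iic = 0
        self.epc = 0

    def siic(self, value):
        """SIIC: IIC <- Rs."""
        self.iic = value & 0xFFFF

    def issue(self, next_pc):
        """Call once per instruction issued after SIIC.

        Returns the PC to fetch from next: normally next_pc, or the
        constant 1 when the counter goes from one to zero.
        """
        if self.iic == 0:
            return next_pc            # counter stopped, no interrupt
        self.iic -= 1
        if self.iic == 0:
            self.epc = next_pc        # saved so that RTI can resume here
            return 1
        return next_pc
```

Loading the counter with 2 lets exactly two instructions issue; the second call to issue returns 1 and records the resume address in epc.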

RTI returns from an interrupt by loading the PC from the value in the EPC register.

The syntax of these instructions is:

WISC-SP06 Instruction Set Summary

 

Instruction Format   Syntax   Semantics
00000 xxxxxxxxxxx HALT Cease instruction issue
00001 xxxxxxxxxxx NOP None
     
01000 sss ddd iiiii ADDI  Rd, Rs, immediate Rd <- Rs + I(sign ext.) 
01001 sss ddd iiiii SUBI  Rd, Rs, immediate Rd <- I(sign ext.) - Rs 
01010 sss ddd iiiii XORI  Rd, Rs, immediate Rd <- Rs XOR I(zero ext.)
01011 sss ddd iiiii ANDNI Rd, Rs, immediate Rd <- Rs AND ~I(zero ext.)
10100 sss ddd iiiii ROLI  Rd, Rs, immediate Rd <- Rs <<(rotate) I(lowest 4 bits)
10101 sss ddd iiiii SLLI  Rd, Rs, immediate Rd <- Rs << I(lowest 4 bits)
10110 sss ddd iiiii SRAI  Rd, Rs, immediate Rd <- Rs >>(arithmetic) I(lowest 4 bits)
10111 sss ddd iiiii SRLI  Rd, Rs, immediate Rd <- Rs >> I(lowest 4 bits)
10000 sss ddd iiiii ST    Rd, Rs, immediate Mem[Rs + I(sign ext.)] <- Rd
10001 sss ddd iiiii LD    Rd, Rs, immediate Rd <- Mem[Rs + I(sign ext.)]
10011 sss ddd iiiii STU   Rd, Rs, immediate Mem[Rs + I(sign ext.)] <- Rd
  Rs <- Rs + I(sign ext.)
     
11001 sss xxx ddd xx BTR   Rd, Rs Rd[bit i] <- Rs[bit 15-i] for i=0..15
11011 sss ttt ddd 00 ADD   Rd, Rs, Rt Rd <- Rs + Rt   
11011 sss ttt ddd 01 SUB   Rd, Rs, Rt Rd <- Rt - Rs 
11011 sss ttt ddd 10 XOR   Rd, Rs, Rt Rd <- Rs XOR Rt
11011 sss ttt ddd 11 ANDN  Rd, Rs, Rt Rd <- Rs AND ~Rt
11010 sss ttt ddd 00 ROL   Rd, Rs, Rt Rd <- Rs <<(rotate) Rt (lowest 4 bits)
11010 sss ttt ddd 01 SLL   Rd, Rs, Rt Rd <- Rs << Rt (lowest 4 bits)
11010 sss ttt ddd 10 SRA   Rd, Rs, Rt Rd <- Rs >>(arithmetic) Rt (lowest 4 bits)
11010 sss ttt ddd 11 SRL   Rd, Rs, Rt Rd <- Rs >> Rt (lowest 4 bits)
11100 sss ttt ddd xx SEQ   Rd, Rs, Rt if (Rs == Rt) then Rd <- 1 else Rd <- 0
11101 sss ttt ddd xx SLT   Rd, Rs, Rt if (Rs < Rt) then Rd <- 1 else Rd <- 0
11110 sss ttt ddd xx SLE   Rd, Rs, Rt if (Rs <= Rt) then Rd <- 1 else Rd <- 0
11111 sss ttt ddd xx SCO   Rd, Rs, Rt if (Rs + Rt) generates carry out
  then Rd <- 1 else Rd <- 0
     
01100 sss iiiiiiii BEQZ  Rs, immediate if (Rs == 0) then
  PC <- PC + 1 + I(sign ext.)
01101 sss iiiiiiii BNEZ  Rs, immediate if (Rs != 0) then
  PC <- PC + 1 + I(sign ext.)
01110 sss iiiiiiii BLTZ  Rs, immediate if (Rs < 0) then
  PC <- PC + 1 + I(sign ext.)
01111 sss iiiiiiii BGEZ  Rs, immediate if (Rs >= 0) then
  PC <- PC + 1 + I(sign ext.)
11000 sss iiiiiiii LBI   Rs, immediate Rs <- I(sign ext.)
10010 sss iiiiiiii SLBI  Rs, immediate Rs <- (Rs << 8) | I(zero ext.)
     
00100 ddddddddddd J     displacement PC <- PC + 1 + D(sign ext.)
00101 sss iiiiiiii JR    Rs, immediate PC <- Rs + I(sign ext.)
00110 ddddddddddd JAL   displacement R7 <- PC + 1
  PC <- PC + 1 + D(sign ext.)
00111 sss iiiiiiii JALR  Rs, immediate R7 <- PC + 1
  PC <- Rs + I(sign ext.)
     
00010 sss xxxxxxxx NOP / SIIC Rs IIC <- Rs
00011 xxxxxxxxxxx NOP / RTI PC <- EPC
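The bit patterns in the summary can be split into fields with a few shifts and masks. The sketch below assumes the fields are packed from the most significant bit downward in the order the table lists them; the function name and dictionary keys are hypothetical. Because the formats overlap, the decoder extracts every candidate field and leaves it to the opcode to decide which ones are meaningful.

```python
def decode(word):
    """Split a 16-bit instruction word into all candidate fields."""
    return {
        "opcode": (word >> 11) & 0x1F,  # bits 15..11, all formats
        "rs":     (word >> 8) & 0x7,    # bits 10..8
        "rd_i":   (word >> 5) & 0x7,    # bits 7..5: Rd (I-format 1) or Rt (R-format)
        "rd_r":   (word >> 2) & 0x7,    # bits 4..2: Rd (R-format)
        "ext":    word & 0x3,           # bits 1..0: op code extension
        "imm5":   word & 0x1F,          # I-format 1 immediate
        "imm8":   word & 0xFF,          # I-format 2 immediate
        "disp11": word & 0x7FF,         # J-format displacement
    }

# ADDI R2, R1, 3 encodes as 01000 001 010 00011 = 0x4143.
fields = decode(0x4143)
```

Here fields["opcode"] is 0b01000, fields["rs"] is 1, fields["rd_i"] is 2, and fields["imm5"] is 3.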

 

Implementation

Non-pipelined Version

To start, you should do a single-cycle, non-pipelined implementation of the WISC-SP06 Architecture.  Figure 5.24 on page 314 of the third edition of the course text is a good place to start.   I suggest you start with the basic control scheme discussed on pages 303-306.

You should use the modules you designed in previous homeworks for this project.  If there were errors in your modules, you need to fix them.  Any error in your final project results that is caused by an error in an earlier module will be counted as an error in the project.  Do not rely on our having found all errors in earlier work.  In addition to correcting errors, you may need to make other modifications.

For the single-cycle design, use the single-cycle memory model that is supplied here. Since you will need to fetch an instruction as well as read or write data in the same cycle, use two memories: one for instructions and one for data.
For the demo, you will be asked to run the test programs here.

Pipelined Version

After you have completed the single-cycle implementation, you will next implement a pipelined version of the architecture. The pipeline will have five stages:

  1. Instruction fetch (IF)
  2. Instruction decode/register fetch (ID)
  3. Execute/address calculation (EX)
  4. Memory access (MEM)
  5. Write back (WB)

A good starting point for the pipelined version of your datapath is described in Figure 6.17 on page 395 of the text.

Be sure that the non-pipelined version is functional before you try the pipelined design.  While designing the non-pipelined version, make design choices that will allow for an easy conversion to the pipelined version.

At this step, you may continue to use the same one-cycle memories that you used in the non-pipelined design.

Stalling Version

At this step, replace the single-cycle memory with the stalling memory. This is a very similar module, but it has a "ready" output. At arbitrary times, it will de-assert "ready" to indicate that valid read data is not available, or write data has not been written. Your pipeline will need to be able to stall to handle these conditions.

You do not need to demo this version; this is just a stepping-stone to the next versions. However, if you do not succeed in making a later version work, you will want to be able to at least demonstrate that you got this version working.

Direct-mapped Cache Version

At this step, replace your memory modules with cache modules. This module has a "hit" output, which takes the place of the "ready" output of the stalling memory. Here, however, you will need to implement a state machine to handle cache misses. Upon a miss, the previous contents of the cache line will need to be written back to memory if dirty, and the new line will need to be loaded into cache. The main memory will take multiple cycles to perform each access. The memory module to use is here.
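One possible shape for the miss-handling state machine is sketched below as a next-state function. Everything here is an assumption made for illustration: the state names, the input names (request, hit, victim_dirty, mem_done), and the choice of three states are not taken from the supplied modules, and a real controller will need more detail (for example, one step per word of the line).

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()        # servicing hits
    WRITE_BACK = auto()  # writing the dirty victim line to memory
    FILL = auto()        # loading the requested line from memory

def next_state(state, request, hit, victim_dirty, mem_done):
    """Return the controller's next state.

    mem_done is a hypothetical signal that is true in the cycle the
    multi-cycle memory operation for the current state has finished.
    """
    if state is State.IDLE:
        if request and not hit:
            return State.WRITE_BACK if victim_dirty else State.FILL
        return State.IDLE
    if state is State.WRITE_BACK:
        return State.FILL if mem_done else State.WRITE_BACK
    # State.FILL: return to IDLE, where the retried access now hits.
    return State.IDLE if mem_done else State.FILL
```

In this sketch the pipeline would stall whenever the controller is not in IDLE, or is in IDLE with a request that misses.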

Two-way Set-associative Cache Version

Add a second cache module alongside each of your existing cache modules, and implement a two-way set-associative cache. You must use the pseudo-random-replacement policy specified. See this document again for more info.

Optimizations

Your design will be graded on functionality first, and performance second. Thus, you should get your pipelined processor working before trying to optimize it. For example, your initial design will stall on all branch and control hazards. After you get the basic pipeline design to work, then add optimizations. Your goal is to reduce the CPI, or cycles per instruction.  You will be graded in part on the number of cycles you take to execute the test programs. Increasing the clock rate is a secondary concern. However, you must adhere to the following rule: You may not have more than one of the following blocks in series in the same clock cycle:

For example, when you are doing a cache fill, you cannot have the data coming out of main memory and going into the cache memory in the same cycle. There will need to be a staging register in between the two.

The first optimization to implement is register forwarding. True data dependences are very common and compilers have only limited ability to schedule around them.  Your register file (from the homework) already implements forwarding within the Decode stage; the additional forwarding paths to add are from the beginning of the MEM stage and from the beginning of the WB stage into the beginning of the EX stage.
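The selection logic for one execute-stage operand can be sketched as a priority choice between the two forwarding paths and the register-file value. The function and signal names are hypothetical. Since R0 is an ordinary register in WISC-SP06, the sketch has no special case for register 0. It also ignores the load-use case, where the loaded value is not yet available at the beginning of the memory stage and a stall is still required.

```python
def forward_select(src_reg, mem_writes, mem_rd, wb_writes, wb_rd):
    """Choose where an execute-stage operand comes from.

    mem_writes / mem_rd: does the instruction now in the memory stage
    write a register, and which one. wb_writes / wb_rd: the same for
    the instruction now in the write-back stage.
    Returns "MEM", "WB", or "REG" (no forwarding needed).
    """
    # The memory-stage instruction is younger, so its result is newer
    # and must take priority when both stages write the same register.
    if mem_writes and mem_rd == src_reg:
        return "MEM"
    if wb_writes and wb_rd == src_reg:
        return "WB"
    return "REG"
```

The same function would be evaluated once for each source operand of the instruction in the execute stage.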

Another required optimization is to predict all branches to be "not taken". This essentially means that your pipeline should continue to execute sequentially until the branch resolves, and then "squash" instructions after the branch if the branch was actually taken.

Remember:  Making your design work correctly is the most important thing for your grade. Optimization is to be done afterward.