Final Project Instruction Set Specification

CS/ECE 552: Introduction to Computer Architecture
Spring 2006
Professor: David Wood
Teaching Assistant: Andy Phelps
Final Project
WISC-SP06 Architecture & Implementation Instruction Set Specification

Description

WISC-SP06 is a load/store architecture, similar to the MIPS R2000 architecture, but restricted to 16-bit words and a smaller instruction set. The architecture has 8 registers, R₀ through R₇. All instructions and registers are 16 bits wide and use two's complement arithmetic. R₀ is not always zero; it acts like any other register. Memory is word addressable. The program counter is separate from the general purpose registers. The register R₇ is treated as the link register. When a subroutine call is made using the JAL or JALR instructions, the address of the next instruction after the jump (i.e., PC+1) is saved in R₇.

Finally, for extra credit, a specification is provided for a mechanism to generate an interrupt after executing some count of instructions. The Instruction Issue Counter, IIC, is a 16-bit register which is decremented for each instruction issued and stops counting when it reaches zero. When the register is decremented from one to zero, an interrupt is generated, and the PC is saved in a register called EPC. To return from the interrupt, there is an RTI instruction which loads PC from EPC.

Instruction formats

WISC-SP06 supports instructions in four different formats: J-format, 2 I-formats, and the R-format. These are described below.

J-format

The J-format is used for jump instructions that need a large displacement.

**J-Format**
`5 bits`	`11 bits`
`Op Code`	`Displacement`

Jump Instructions

The Jump instruction loads the PC with the value found by adding the PC of the next instruction (PC+1, not PC+4 as in MIPS) to the sign-extended displacement.

The Jump-And-Link instruction loads the PC with the same value and also saves the address of the next sequential instruction (i.e., PC+1) in the link register R₇.

The syntax of the jump instructions is:

J displacement
JAL displacement
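
As an illustration of the jump semantics above, here is a minimal Python sketch of the target calculation. The function names `sign_extend` and `jump` and the use of a Python list for the register file are assumptions made for this example; the opcodes are taken from the instruction set summary.

```python
MASK16 = 0xFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def jump(pc, instr, regs):
    """Return the new PC for J (00100) or JAL (00110); JAL also writes R7."""
    opcode = (instr >> 11) & 0x1F
    disp = sign_extend(instr & 0x7FF, 11)   # 11-bit displacement field
    next_pc = (pc + 1) & MASK16             # PC+1, not PC+4 as in MIPS
    if opcode == 0b00110:                   # JAL saves the return address
        regs[7] = next_pc
    return (next_pc + disp) & MASK16
```

With a displacement of -1, a J instruction jumps back to itself, because the displacement is added to PC+1 rather than to PC.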

I-format

I-format instructions use either a destination register, a source register, and a 5-bit immediate value; or a destination register and an 8-bit immediate value. The two types of I-format instructions are described below.

I-format 1 Instructions

**I-format 1**
`5 bits`	`3 bits`	`3 bits`	`5 bits`
`Op Code`	`R_s`	`R_d`	`Immediate`

The I-format 1 instructions include XOR-Immediate, ANDN-Immediate, Add-Immediate, Subtract-Immediate, Rotate-Left-Immediate, Shift-Left-Logical-Immediate, Shift-Right-Arithmetic-Immediate, Shift-Right-Logical-Immediate, Load, Store, and Store with Update.

The ANDNI instruction loads register R_d with the value of the register R_s AND-ed with the one's complement of the zero-extended immediate value. (It may be thought of as a bit-clear instruction.) ADDI loads register R_d with the sum of the value of the register R_s plus the sign-extended immediate value. SUBI loads register R_d with the result of subtracting register R_s from the sign-extended immediate value. (That is, immed - R_s, not R_s - immed.) Similar instructions have similar semantics, i.e., the logical instructions have zero-extended values and the arithmetic instructions have sign-extended values.
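
The following Python sketch shows one way to model the four arithmetic and logical immediates described above, including the reversed operand order of SUBI. The name `exec_alu_imm` and the list-based register file are assumptions for illustration; field positions follow the I-format 1 layout and the opcodes come from the instruction set summary.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def exec_alu_imm(instr, regs):
    """Execute ADDI, SUBI, XORI, or ANDNI on a list of eight 16-bit registers."""
    opcode = (instr >> 11) & 0x1F
    rs = (instr >> 8) & 0x7
    rd = (instr >> 5) & 0x7
    imm5 = instr & 0x1F
    a = regs[rs]
    if opcode == 0b01000:            # ADDI: Rs + sign-extended immediate
        result = a + sign_extend(imm5, 5)
    elif opcode == 0b01001:          # SUBI: immediate - Rs (not Rs - immediate)
        result = sign_extend(imm5, 5) - a
    elif opcode == 0b01010:          # XORI: zero-extended immediate
        result = a ^ imm5
    elif opcode == 0b01011:          # ANDNI: clear the bits set in the immediate
        result = a & ~imm5
    else:
        raise ValueError("not an ALU-immediate opcode")
    regs[rd] = result & 0xFFFF       # keep results to 16 bits
```

For example, SUBI R2, R1, 3 with R1 = 5 leaves 0xFFFE (that is, -2) in R2.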

For Load and Store instructions, the effective address of the operand to be read or written is calculated by adding the sign-extended immediate value to the value in register R_s. The value is loaded to or stored from register R_d. The STU instruction, Store with Update, acts like Store but also writes R_s with the effective address.
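
A small Python sketch of the three memory instructions may help make the effective-address rule concrete. The name `exec_mem`, the dictionary used as word-addressable memory, and the choice to read uninitialized words as zero are all assumptions for this example.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def exec_mem(instr, regs, mem):
    """Execute LD, ST, or STU. `mem` maps word addresses to 16-bit values."""
    opcode = (instr >> 11) & 0x1F
    rs = (instr >> 8) & 0x7
    rd = (instr >> 5) & 0x7
    addr = (regs[rs] + sign_extend(instr & 0x1F, 5)) & 0xFFFF
    if opcode == 0b10001:        # LD: Rd <- Mem[addr]
        regs[rd] = mem.get(addr, 0)
    elif opcode == 0b10000:      # ST: Mem[addr] <- Rd
        mem[addr] = regs[rd]
    elif opcode == 0b10011:      # STU: store, then update Rs with the address
        mem[addr] = regs[rd]
        regs[rs] = addr
    else:
        raise ValueError("not a load/store opcode")
```

With R1 = 0x100, the instruction STU R2, R1, -2 stores R2 at address 0xFE and leaves R1 = 0xFE, which makes STU convenient for pushing onto a downward-growing stack.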

The syntax of the I-format 1 instructions is:

ADDI R_d, R_s, immediate
SUBI R_d, R_s, immediate
XORI R_d, R_s, immediate
ANDNI R_d, R_s, immediate
ROLI R_d, R_s, immediate
SLLI R_d, R_s, immediate
SRAI R_d, R_s, immediate
SRLI R_d, R_s, immediate
ST R_d, R_s, immediate
LD R_d, R_s, immediate
STU R_d, R_s, immediate

I-format 2 Instructions

**I-format 2**
`5 bits`	`3 bits`	`8 bits`
`Op Code`	`R_s`	`Immediate`

The Load-Byte-Immediate instruction loads R_s with a sign-extended 8-bit immediate value.

The Shift-and-Load-Byte-Immediate instruction shifts R_s 8 bits to the left, and replaces the lower 8 bits with the immediate value.

The format of these instructions is:

LBI R_s, signed immediate
SLBI R_s, unsigned immediate
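
LBI and SLBI are typically used as a pair to build an arbitrary 16-bit constant. The sketch below models both in Python; the helper `load_constant` is a hypothetical convenience for this example and is not part of the instruction set.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def lbi(regs, rs, imm8):
    """LBI: Rs <- sign-extended 8-bit immediate."""
    regs[rs] = sign_extend(imm8, 8) & 0xFFFF

def slbi(regs, rs, imm8):
    """SLBI: Rs <- (Rs << 8) | zero-extended 8-bit immediate."""
    regs[rs] = ((regs[rs] << 8) | (imm8 & 0xFF)) & 0xFFFF

def load_constant(regs, rs, value):
    """Build a full 16-bit constant with LBI (high byte) then SLBI (low byte)."""
    lbi(regs, rs, (value >> 8) & 0xFF)
    slbi(regs, rs, value & 0xFF)
```

The sign extension performed by LBI does not matter in this sequence, because SLBI shifts the upper byte out of the register.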

The Jump-Register instruction loads the PC with the value of register R_s plus the sign-extended immediate. The Jump-And-Link-Register instruction does the same and also saves the return address (i.e., the address of the JALR instruction plus one) in the link register R₇. The format of these instructions is:

JR Rs, immediate
JALR Rs, immediate

The branch instructions test a general purpose register for some condition. The available conditions are: equal to zero, not equal to zero, less than zero, and greater than or equal to zero. If the condition holds, the signed immediate is added to the address of the next sequential instruction and loaded into the PC. The format of the branch instructions is:

BEQZ Rs, signed immediate
BNEZ Rs, signed immediate
BLTZ Rs, signed immediate
BGEZ Rs, signed immediate
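
The branch rules above can be sketched in Python as follows. The function name `branch_next_pc` is an assumption for this example; note that the register value must be interpreted as a signed 16-bit number for BLTZ and BGEZ.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def branch_next_pc(pc, instr, regs):
    """Return the next PC after BEQZ, BNEZ, BLTZ, or BGEZ at address `pc`."""
    opcode = (instr >> 11) & 0x1F
    rs = (instr >> 8) & 0x7
    value = sign_extend(regs[rs], 16)       # signed view of the register
    offset = sign_extend(instr & 0xFF, 8)   # 8-bit signed immediate
    conditions = {
        0b01100: value == 0,   # BEQZ
        0b01101: value != 0,   # BNEZ
        0b01110: value < 0,    # BLTZ
        0b01111: value >= 0,   # BGEZ
    }
    if opcode not in conditions:
        raise ValueError("not a branch opcode")
    next_pc = (pc + 1) & 0xFFFF
    return (next_pc + offset) & 0xFFFF if conditions[opcode] else next_pc
```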

R-format

R-format instructions use only registers for operands.

**R-format**
`5 bits`	`3 bits`	`3 bits`	`3 bits`	`2 bits`
`Op Code`	`Rs`	`Rt`	`Rd`	`Op Code Extension`

ALU and Shift Instructions

The ALU and shift R-format instructions are similar to I-format 1 instructions, but do not require an immediate value. In each case, the value of R_t is used in place of the immediate. No extension of its value is required. In the case of shift instructions, all but the 4 least-significant bits of R_t are ignored.
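
To illustrate how the shift amount is limited to 4 bits, here is a Python sketch of the four shift and rotate operations on 16-bit values. The function `shift` and its string operation names are assumptions for this example; the same model applies to the immediate forms, which also use only the lowest 4 bits of the immediate.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def shift(op, value, amount):
    """Apply ROL, SLL, SRA, or SRL to a 16-bit value."""
    n = amount & 0xF            # only the 4 least-significant bits are used
    value &= 0xFFFF
    if op == "ROL":             # bits shifted out on the left re-enter on the right
        return ((value << n) | (value >> (16 - n))) & 0xFFFF
    if op == "SLL":             # zeros shifted in on the right
        return (value << n) & 0xFFFF
    if op == "SRL":             # zeros shifted in on the left
        return value >> n
    if op == "SRA":             # sign bit replicated on the left
        return (sign_extend(value, 16) >> n) & 0xFFFF
    raise ValueError("unknown shift operation")
```

A shift amount of 0x13 behaves like a shift by 3, since the upper bits of the amount are ignored.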

The ADD instruction performs signed addition. The SUB instruction subtracts R_s from R_t. (Not R_s - R_t.) The set instructions SEQ, SLT, and SLE compare the values in R_s and R_t and set the destination register R_d to 0x1 if the comparison is true, and 0x0 if the comparison is false. SEQ checks for R_s equal to R_t, SLT checks for R_s less than R_t, and SLE checks for R_s less than or equal to R_t. (R_s and R_t are two's complement numbers.) The set instruction SCO will set R_d to 0x1 if R_s plus R_t would generate a carry-out from the most significant bit; otherwise it sets R_d to 0x0. The Bit-Reverse instruction, BTR, takes a single operand R_s and copies it to R_d, but with a left-right reversal of each bit; i.e., bit 0 goes to bit 15, bit 1 goes to bit 14, etc.
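
The set instructions and BTR can be modeled in a few lines of Python. The function names below are assumptions for this sketch; each takes raw 16-bit register values and returns the value that would be written to R_d.

```python
MASK16 = 0xFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def seq(rs, rt):
    """SEQ: 1 if Rs equals Rt."""
    return int((rs & MASK16) == (rt & MASK16))

def slt(rs, rt):
    """SLT: 1 if Rs < Rt, comparing as two's complement numbers."""
    return int(sign_extend(rs, 16) < sign_extend(rt, 16))

def sle(rs, rt):
    """SLE: 1 if Rs <= Rt, comparing as two's complement numbers."""
    return int(sign_extend(rs, 16) <= sign_extend(rt, 16))

def sco(rs, rt):
    """SCO: 1 if the 16-bit sum Rs + Rt carries out of bit 15."""
    return int((rs & MASK16) + (rt & MASK16) > MASK16)

def btr(rs):
    """BTR: reverse the 16 bits, so bit 0 goes to bit 15 and so on."""
    return int(format(rs & MASK16, "016b")[::-1], 2)
```

Note that SLT treats 0xFFFF as -1, so `slt(0xFFFF, 1)` is true, while SCO works on the unsigned sum and reports a carry for the same pair of operands.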

The syntax of the R-format ALU and shift instructions is:

ADD R_d, R_s, R_t
SUB R_d, R_s, R_t
XOR R_d, R_s, R_t
ANDN R_d, R_s, R_t
ROL R_d, R_s, R_t
SLL R_d, R_s, R_t
SRA R_d, R_s, R_t
SRL R_d, R_s, R_t
SEQ R_d, R_s, R_t
SLT R_d, R_s, R_t
SLE R_d, R_s, R_t
SCO R_d, R_s, R_t
BTR R_d, R_s

Special Instructions

The HALT instruction halts the processor. The HALT instruction and all older instructions execute normally, but the instruction after the halt will never execute. The PC is left pointing to the instruction directly after the halt.

The No-operation instruction occupies a position in the pipeline, but does nothing.

The syntax of these instructions is:

HALT
NOP

Instruction Counter and Interrupt

These instructions are used with the extra-credit interrupt mechanism. These instructions should remain equivalent to NOP until the rest of the design has been completed and thoroughly tested.

SIIC sets the Instruction Issue Counter to the value specified in R_s. The 16-bit Instruction Issue Counter will then start decrementing with each subsequent instruction issued until it has decremented to zero. If it is loaded with zero, it will remain zero and will not generate any interrupt. The timing of the load is such that, if the IIC is loaded with a one, then exactly one instruction after the SIIC will issue prior to the interrupt being generated. If loaded with a two, exactly two instructions will issue, and so forth. When the interrupt is generated, the EPC register will be loaded with the address of the next sequential instruction to be executed, and PC will be loaded with the constant "1".

RTI returns from an interrupt by loading the PC from the value in the EPC register.

The syntax of these instructions is:

SIIC R_s
RTI
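
The timing of the Instruction Issue Counter can be sketched with a small Python model. The class `IssueCounter` and its method names are assumptions for this example, and the sketch assumes the caller passes in the address of the instruction that would otherwise execute next; it is a behavioral illustration, not a hardware design.

```python
class IssueCounter:
    """Behavioral model of the IIC and EPC registers (illustrative only)."""

    def __init__(self):
        self.iic = 0
        self.epc = 0

    def siic(self, value):
        """SIIC: load the counter; zero disables the interrupt."""
        self.iic = value & 0xFFFF

    def issue(self, next_pc):
        """Call once for each instruction issued after the SIIC.

        `next_pc` is the address that would normally be fetched next.
        Returns the address to fetch from: 1 if the interrupt fires.
        """
        if self.iic == 0:
            return next_pc          # counter stopped, no interrupt
        self.iic -= 1
        if self.iic == 0:           # decremented from one to zero
            self.epc = next_pc      # save the resume address
            return 1                # PC is loaded with the constant 1
        return next_pc

    def rti(self):
        """RTI: resume at the saved address."""
        return self.epc
```

Loading the counter with 2 lets exactly two instructions issue; the second one triggers the interrupt, and RTI later resumes at the address saved in EPC.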

WISC-SP06 Instruction Set Summary


`Instruction Format`	`Syntax`	`Semantics`
`00000 xxxxxxxxxxx`	`HALT`	`Cease instruction issue`
`00001 xxxxxxxxxxx`	`NOP`	`None`

`01000 sss ddd iiiii`	`ADDI Rd, Rs, immediate`	`Rd <- Rs + I(sign ext.)`
`01001 sss ddd iiiii`	`SUBI Rd, Rs, immediate`	`Rd <- I(sign ext.) - Rs`
`01010 sss ddd iiiii`	`XORI Rd, Rs, immediate`	`Rd <- Rs XOR I(zero ext.)`
`01011 sss ddd iiiii`	`ANDNI Rd, Rs, immediate`	`Rd <- Rs AND ~I(zero ext.)`
`10100 sss ddd iiiii`	`ROLI Rd, Rs, immediate`	`Rd <- Rs <<(rotate) I(lowest 4 bits)`
`10101 sss ddd iiiii`	`SLLI Rd, Rs, immediate`	`Rd <- Rs << I(lowest 4 bits)`
`10110 sss ddd iiiii`	`SRAI Rd, Rs, immediate`	`Rd <- Rs >>(arithmetic) I(lowest 4 bits)`
`10111 sss ddd iiiii`	`SRLI Rd, Rs, immediate`	`Rd <- Rs >> I(lowest 4 bits)`
`10000 sss ddd iiiii`	`ST Rd, Rs, immediate`	`Mem[Rs + I(sign ext.)] <- Rd`
`10001 sss ddd iiiii`	`LD Rd, Rs, immediate`	`Rd <- Mem[Rs + I(sign ext.)]`
`10011 sss ddd iiiii`	`STU Rd, Rs, immediate`	`Mem[Rs + I(sign ext.)] <- Rd` `Rs <- Rs + I(sign ext.)`

`11001 sss xxx ddd xx`	`BTR Rd, Rs`	`Rd[bit i] <- Rs[bit 15-i] for i=0..15`
`11011 sss ttt ddd 00`	`ADD Rd, Rs, Rt`	`Rd <- Rs + Rt`
`11011 sss ttt ddd 01`	`SUB Rd, Rs, Rt`	`Rd <- Rt - Rs`
`11011 sss ttt ddd 10`	`XOR Rd, Rs, Rt`	`Rd <- Rs XOR Rt`
`11011 sss ttt ddd 11`	`ANDN Rd, Rs, Rt`	`Rd <- Rs AND ~Rt`
`11010 sss ttt ddd 00`	`ROL Rd, Rs, Rt`	`Rd <- Rs <<(rotate) Rt (lowest 4 bits)`
`11010 sss ttt ddd 01`	`SLL Rd, Rs, Rt`	`Rd <- Rs << Rt (lowest 4 bits)`
`11010 sss ttt ddd 10`	`SRA Rd, Rs, Rt`	`Rd <- Rs >>(arithmetic) Rt (lowest 4 bits)`
`11010 sss ttt ddd 11`	`SRL Rd, Rs, Rt`	`Rd <- Rs >> Rt (lowest 4 bits)`
`11100 sss ttt ddd xx`	`SEQ Rd, Rs, Rt`	`if (Rs == Rt) then Rd <- 1 else Rd <- 0`
`11101 sss ttt ddd xx`	`SLT Rd, Rs, Rt`	`if (Rs < Rt) then Rd <- 1 else Rd <- 0`
`11110 sss ttt ddd xx`	`SLE Rd, Rs, Rt`	`if (Rs <= Rt) then Rd <- 1 else Rd <- 0`
`11111 sss ttt ddd xx`	`SCO Rd, Rs, Rt`	`if (Rs + Rt) generates carry out then Rd <- 1 else Rd <- 0`

`01100 sss iiiiiiii`	`BEQZ Rs, immediate`	`if (Rs == 0) then` `PC <- PC + 1 + I(sign ext.)`
`01101 sss iiiiiiii`	`BNEZ Rs, immediate`	`if (Rs != 0) then` `PC <- PC + 1 + I(sign ext.)`
`01110 sss iiiiiiii`	`BLTZ Rs, immediate`	`if (Rs < 0) then` `PC <- PC + 1 + I(sign ext.)`
`01111 sss iiiiiiii`	`BGEZ Rs, immediate`	`if (Rs >= 0) then` `PC <- PC + 1 + I(sign ext.)`
`11000 sss iiiiiiii`	`LBI Rs, immediate`	`Rs <- I(sign ext.)`
`10010 sss iiiiiiii`	`SLBI Rs, immediate`	`Rs <- (Rs << 8) | I(zero ext.)`

`00100 ddddddddddd`	`J displacement`	`PC <- PC + 1 + D(sign ext.)`
`00101 sss iiiiiiii`	`JR Rs, immediate`	`PC <- Rs + I(sign ext.)`
`00110 ddddddddddd`	`JAL displacement`	`R7 <- PC + 1` `PC <- PC + 1 + D(sign ext.)`
`00111 sss iiiiiiii`	`JALR Rs, immediate`	`R7 <- PC + 1` `PC <- Rs + I(sign ext.)`

`00010 sss xxxxxxxx`	`NOP / SIIC Rs`	`IIC <- Rs`
`00011 xxxxxxxxxxx`	`NOP / RTI`	`PC <- EPC`
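
For reference, the opcode assignments in the summary table can be captured as a lookup, which is handy for writing a disassembler or checking a decoder. The sketch below is in Python; the names `OPCODES`, `EXTENDED`, and `mnemonic` are assumptions for this example.

```python
# Opcodes that fully determine the instruction (bits 15..11).
OPCODES = {
    0b00000: "HALT", 0b00001: "NOP",  0b00010: "SIIC", 0b00011: "RTI",
    0b00100: "J",    0b00101: "JR",   0b00110: "JAL",  0b00111: "JALR",
    0b01000: "ADDI", 0b01001: "SUBI", 0b01010: "XORI", 0b01011: "ANDNI",
    0b01100: "BEQZ", 0b01101: "BNEZ", 0b01110: "BLTZ", 0b01111: "BGEZ",
    0b10000: "ST",   0b10001: "LD",   0b10010: "SLBI", 0b10011: "STU",
    0b10100: "ROLI", 0b10101: "SLLI", 0b10110: "SRAI", 0b10111: "SRLI",
    0b11000: "LBI",  0b11001: "BTR",
    0b11100: "SEQ",  0b11101: "SLT",  0b11110: "SLE",  0b11111: "SCO",
}

# Opcodes that use the 2-bit opcode extension (bits 1..0) to pick the operation.
EXTENDED = {
    0b11011: ("ADD", "SUB", "XOR", "ANDN"),
    0b11010: ("ROL", "SLL", "SRA", "SRL"),
}

def mnemonic(instr):
    """Return the mnemonic for a 16-bit instruction word."""
    opcode = (instr >> 11) & 0x1F
    if opcode in EXTENDED:
        return EXTENDED[opcode][instr & 0x3]
    return OPCODES[opcode]
```

Together the two tables cover all 32 opcode values, so `mnemonic` is defined for every 16-bit word.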

Implementation

Non-pipelined Version

To start, you should do a single-cycle, non-pipelined implementation of the WISC-SP06 Architecture. Figure 5.24 on page 314 of the third edition of the course text is a good place to start. I suggest you start with the basic control scheme discussed on pages 303-306.

You should use the modules you designed in previous homeworks for this project. If there were errors in your modules, you need to fix them. An error in the final project results that is caused by an error in an earlier module will be counted as an error in the project. Do not rely on our having found all errors in earlier work. In addition to the correction of errors, you may need to make other modifications.

For the single-cycle design, use the single-cycle memory model that is supplied here. Since you will need to fetch instructions as well as read or write data in the same cycle, use two memories: one for instruction memory and one for data.
For the demo, you will be asked to run the test programs here.

Pipelined Version

After you have completed the single-cycle implementation, you will next implement a pipelined version of the architecture. The pipeline will have five stages:

Instruction fetch (IF)
Instruction decode/register fetch (ID)
Execute/address calculation (EX)
Memory access (MEM)
Write back (WB)

A good starting point for the pipelined version of your datapath is described in figure 6.17 on page 395 of the text.

Be sure that the non-pipelined version is functional before you try the pipelined design. While designing the non-pipelined version, make considerations that will allow for an easy conversion to the pipelined version.

At this step, you may continue to use the same one-cycle memories that you used in the non-pipelined design.

Stalling Version

At this step, replace the single-cycle memory with the stalling memory. This is a very similar module, but it has a "ready" output. At arbitrary times, it will de-assert "ready" to indicate that valid read data is not available, or write data has not been written. Your pipeline will need to be able to stall to handle these conditions.

You do not need to demo this version; this is just a stepping-stone to the next versions. However, if you do not succeed in making a later version work, you will want to be able to at least demonstrate that you got this version working.

Direct-mapped Cache Version

At this step, replace your memory modules with cache modules. This module has a "hit" output, which takes the place of the "ready" output of the stalling memory. Here, however, you will need to implement a state machine to handle cache misses. Upon a miss, the previous contents of the cache line will need to be written back to memory if dirty, and the new line will need to be loaded into cache. The main memory will take multiple cycles to perform each access. The memory module to use is here.
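
The miss handling described above (write back the old line if it is dirty, then load the new line) can be sketched as a behavioral model in Python. Everything about the geometry here is an assumption for illustration: the class name `DirectMappedCache`, 8 lines of 4 words each, a write-allocate policy, and a dictionary standing in for main memory. The real design also needs the multi-cycle state machine, which this sketch does not model.

```python
class DirectMappedCache:
    """Write-back, write-allocate direct-mapped cache (behavioral sketch)."""

    def __init__(self, memory, num_lines=8, words_per_line=4):
        self.memory = memory                  # dict: word address -> value
        self.num_lines = num_lines            # assumed geometry
        self.wpl = words_per_line
        self.writebacks = 0
        self.lines = [
            {"valid": False, "dirty": False, "tag": 0, "data": [0] * words_per_line}
            for _ in range(num_lines)
        ]

    def _split(self, addr):
        offset = addr % self.wpl
        index = (addr // self.wpl) % self.num_lines
        tag = addr // (self.wpl * self.num_lines)
        return tag, index, offset

    def _ensure(self, addr):
        """Make sure the line holding addr is resident; handle a miss if not."""
        tag, index, offset = self._split(addr)
        line = self.lines[index]
        if not (line["valid"] and line["tag"] == tag):
            if line["valid"] and line["dirty"]:
                # Write the victim line back to memory before replacing it.
                old_base = (line["tag"] * self.num_lines + index) * self.wpl
                for i, word in enumerate(line["data"]):
                    self.memory[old_base + i] = word
                self.writebacks += 1
            # Fill the line from memory (uninitialized words read as 0).
            base = addr - offset
            line["data"] = [self.memory.get(base + i, 0) for i in range(self.wpl)]
            line["valid"], line["dirty"], line["tag"] = True, False, tag
        return line, offset

    def read(self, addr):
        line, offset = self._ensure(addr)
        return line["data"][offset]

    def write(self, addr, value):
        line, offset = self._ensure(addr)
        line["data"][offset] = value & 0xFFFF
        line["dirty"] = True
```

With this geometry, addresses 0 and 32 map to the same line, so writing address 0 and then reading address 32 forces a write-back of the dirty line before the fill.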

Two-way Set-associative Cache Version

Add a second cache module alongside each of your existing cache modules, and implement a two-way set-associative cache. You must use the pseudo-random replacement policy specified. See this document again for more info.

Optimizations

Your design will be graded on functionality first, and performance second. Thus, you should get your pipelined processor working before trying to optimize it. For example, your initial design will stall on all branch and control hazards. After you get the basic pipeline design to work, then add optimizations. Your goal is to reduce the CPI, or cycles per instruction. You will be graded in part on the number of cycles you take to execute the test programs. Increasing the clock rate is a secondary concern. However, you must adhere to the following rule: You may not have more than one of the following blocks in series in the same clock cycle:

register file
memory or cache
16-bit full adder
barrel shifter

For example, when you are doing a cache fill, you cannot have the data coming out of main memory and going into the cache memory in the same cycle. There will need to be a staging register in between the two.

The first optimization to implement is register forwarding. True data dependences are very common, and compilers have only limited ability to schedule around them. Your register file (from the homework) already implements forwarding within the decode (ID) stage; the additional forwarding to add is from the beginning of the MEM stage and from the beginning of the WB stage into the beginning of the EX stage.
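
The operand selection for forwarding into the execute stage can be sketched as follows, in Python for readability. The function name `forward_operand` and the dictionary fields `write`, `dest`, and `value` are hypothetical stand-ins for the pipeline-register signals at the start of the memory-access and write-back stages. The sketch assumes the value held for the memory-access stage is already valid, which is not true for a load, so a load followed immediately by a dependent instruction still needs a stall.

```python
def forward_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the newest value of `src_reg` for an instruction entering execute.

    ex_mem and mem_wb describe the instructions one and two stages ahead:
    {"write": bool, "dest": register number, "value": result}.
    """
    # The younger producer (start of memory access) takes priority.
    if ex_mem["write"] and ex_mem["dest"] == src_reg:
        return ex_mem["value"]
    # Otherwise use the older producer (start of write back).
    if mem_wb["write"] and mem_wb["dest"] == src_reg:
        return mem_wb["value"]
    # No hazard: use the value read from the register file in decode.
    return regfile_value
```

Unlike MIPS, R₀ is an ordinary register in this architecture, so the comparison must not special-case register 0.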

Another required optimization is to predict all branches to be "not taken". This essentially means that your pipeline should continue to execute sequentially until the branch resolves, and then "squash" instructions after the branch if the branch was actually taken.

Remember: Making your design work correctly is the most important thing for your grade. Optimization is to be done afterward.