(2.4.2) MIPS R10000

Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996, pp. 28-40.

Look at the paper for figures. This paper is the coolest: it explains pretty much everything that goes into an out-of-order processor in enough detail (hardware-wise). You should read this paper in full, 100 times.

Specs

Fetch and decode 4 instr per cycle.

speculates past branches using a four-entry branch stack.

dynamic out-of-order exec

in-order graduation (precise exceptions)

Exec units (all pipelined)

Nonblocking load/store unit.

dual 64-bit int ALUs

FPU

pipelined adder with 2-cycle latency

pipelined multiplier with 2-cycle latency

hierarchical, nonblocking mem subsystem

on-chip 2-way set-associative primary caches

external 2-way set-associative secondary cache

64-bit multiprocessor system interface with split-transaction protocol

Pipeline

Stage 1 : Fetches and Aligns (4 instr)

Stage 2 : Decode and rename (also computes target addresses for branches and jumps)

Stage 3 : Move to queue and check whether all operands are ready (instr waits in the queue until they are)

Stage 4-7 : Exec

Int and FP have independent register files.

Instruction Fetch

fetch and decode at higher bandwidth than exec,

to keep the queues full,

since many instrs are discarded when mispredictions happen.

16-word cache line => aligner picks 4 of the 16 instrs at any word alignment, then rotates them into 0,1,2,3 order (minimizes the dependency logic)
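The pick-and-rotate step can be sketched as below. This is only an illustration under assumptions the notes do not state: word i of a line sits in bank i mod 4, and a fetch never crosses the end of the 16-word line. The names `read_banks` and `align` are made up for the sketch.

```python
LINE_WORDS = 16   # words per instruction-cache line (from the notes)
FETCH_WIDTH = 4   # instructions fetched per cycle (from the notes)

def read_banks(line, start):
    """Assumed bank model: bank b returns the word in [start, start+4) whose index % 4 == b."""
    banks = [None] * FETCH_WIDTH
    for i in range(start, min(start + FETCH_WIDTH, LINE_WORDS)):
        banks[i % FETCH_WIDTH] = line[i]
    return banks

def align(banks, start):
    """Rotate the bank outputs left so slot 0 holds the oldest instruction."""
    r = start % FETCH_WIDTH
    return banks[r:] + banks[:r]
```

With a fetch starting at word 6, the banks return words 8, 9, 6, 7 and the rotate puts them back in program order 6, 7, 8, 9. A fetch starting at word 14 yields only two valid slots in this model.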

Branch Unit

addr[11:3] -> index into a 512-entry branch history table, 2-bit algo.

87% prediction accuracy on SPEC92.

Delay-slot instr is still executed: no advantage in an OoO machine, but kept for backward compatibility.
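A minimal sketch of the branch history table, assuming the "2-bit algo" is the usual 2-bit saturating counter. The initial counter value and the class and method names are assumptions for illustration.

```python
BHT_ENTRIES = 512  # from the notes

def bht_index(addr):
    """Bits 11:3 of the address select one of the 512 entries."""
    return (addr >> 3) & 0x1FF

class BranchHistoryTable:
    def __init__(self):
        # 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken.
        # Starting at 1 (weakly not taken) is an assumption.
        self.counters = [1] * BHT_ENTRIES

    def predict(self, addr):
        return self.counters[bht_index(addr)] >= 2

    def update(self, addr, taken):
        i = bht_index(addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Since only bits 11:3 are used, two branches 4 KB apart share a counter, and a single wrong outcome in a loop does not flip a strongly biased prediction.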

Branch Stack

Saves state in a 4-entry branch stack:

(i) Alternate branch address

(ii) complete! copies of integer and fp map tables.

(iii) misc control bits
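The checkpoint-and-restore idea can be sketched as follows. This is a software model, not the hardware: entries are kept oldest-first in a list, and the `checkpoint`, `verified` and `mispredicted` names and the integer branch ids are assumptions made for the sketch.

```python
class BranchStack:
    DEPTH = 4  # from the notes

    def __init__(self):
        self.entries = []   # oldest branch first
        self.next_id = 0

    def checkpoint(self, alt_addr, int_map, fp_map, control_bits=0):
        """Save state at a predicted branch; None means the stack is full (decode stalls)."""
        if len(self.entries) == self.DEPTH:
            return None
        bid = self.next_id
        self.next_id += 1
        # Complete copies of both map tables, as the notes say.
        self.entries.append((bid, alt_addr, list(int_map), list(fp_map), control_bits))
        return bid

    def _position(self, bid):
        return next(i for i, e in enumerate(self.entries) if e[0] == bid)

    def verified(self, bid):
        """Prediction was right: the saved state is no longer needed."""
        del self.entries[self._position(bid)]

    def mispredicted(self, bid):
        """Return the saved state and drop this branch plus all younger ones."""
        pos = self._position(bid)
        _, alt_addr, int_map, fp_map, control_bits = self.entries[pos]
        del self.entries[pos:]
        return alt_addr, int_map, fp_map, control_bits
```

Restoring is a single copy back from the saved entry, which is why the map tables are saved whole rather than unwound one instruction at a time.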

Branch Verif

Fetching along mispredicted paths -> unneeded cache refills! It's kind of okay, because the refilled lines might eventually be used.

A 4-bit branch mask accompanies each instr (in the queues and exec pipes): one bit per branch still pending verification. If a branch is mispredicted, every instr with that bit set is aborted.
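A small sketch of how the mask could be used, assuming bit b of the mask corresponds to branch-stack entry b and that in-flight instructions are plain dicts. The function names are hypothetical.

```python
BRANCH_STACK_DEPTH = 4  # one mask bit per branch-stack entry

def branch_verified(in_flight, bit):
    """Branch was predicted correctly: clear its bit in every in-flight instr."""
    for instr in in_flight:
        instr["mask"] &= ~(1 << bit)
    return in_flight

def branch_mispredicted(in_flight, bit):
    """Branch was mispredicted: keep only instrs that do not depend on it."""
    return [instr for instr in in_flight if not instr["mask"] & (1 << bit)]
```

An instruction with an all-zero mask no longer depends on any pending branch.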

Decoding Logic

Stops when queue/active list is full.

Hi and Lo for multiply and divide => 2 register writes => exception handling is difficult.

Instrs that modify control registers execute serially (mostly in the OS kernel).

dependencies : register operands, memory address, condition bits.

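Renaming maps logical register numbers into physical register numbers.

After renaming, dependencies show up as matching physical register numbers, so instruction order no longer has to be tracked.

Each physical register is written exactly once after assignment from the free list.

Until it's written, it's busy.

When the instr that created a new mapping graduates => the old physical register is freed.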

33 logical (r1-r31 plus Hi and Lo) and 64 physical integer registers.

Reads the mappings for up to 3 operands per instr.

Writes the current operand mappings and the new destination mapping into the instr queues.

The active list saves the previous destination mapping.
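The renaming steps above can be sketched for one register file as below. Assumptions made for the sketch: an identity mapping at reset, one instruction renamed per call, and the `Renamer` class with its field names.

```python
from collections import deque

class Renamer:
    def __init__(self, n_logical=33, n_physical=64):
        self.map_table = list(range(n_logical))            # logical -> physical
        self.free_list = deque(range(n_logical, n_physical))
        self.busy = [False] * n_physical
        self.active_list = []                               # (logical dest, old physical)

    def rename(self, dest, srcs):
        """Rename one instr; returns (new dest physical, source physicals) or None on stall."""
        if not self.free_list:
            return None                                     # decode stalls
        phys_srcs = [self.map_table[s] for s in srcs]       # read the current mappings first
        new_phys = self.free_list.popleft()
        old_phys = self.map_table[dest]
        self.map_table[dest] = new_phys                     # new mapping
        self.busy[new_phys] = True                          # busy until written
        self.active_list.append((dest, old_phys))           # kept for undo / freeing
        return new_phys, phys_srcs
```

Reading the sources before updating the destination matters when an instruction uses its own destination as an operand, e.g. `r3 = r3 + r3`.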

Free List

Unassigned physical registers.

4 parallel, 8-deep circular FIFOs (up to 4 instrs graduate at a time).

Active list

All instr currently active.

Up to 32 instrs can be active -> 4 parallel, 8-deep circular FIFOs.

Each instr carries a 5-bit tag that equals its address in the active list.

When exec completes => the tag is sent here and the entry is marked complete.

Also has : logical-destination register number and old physical register number.

An instruction's graduation commits its new mapping. -> Old physical register goes to free list.

Branch rollback -> from the Branch Stack (you know early that it might mispredict)

Exception rollback -> undo the mappings from the active list (because you can't checkpoint at every instr)
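Graduation and exception rollback can be sketched on plain data structures as below. Two assumptions are made for the sketch: the exception is taken when the faulting instruction is the oldest entry, so the whole active list is unwound, and the undone physical registers are simply appended back to the free list. The dict keys and function names are illustrative.

```python
from collections import deque

def graduate(active_list, free_list):
    """Commit the oldest instr if it is done; its old physical register becomes free."""
    if not active_list or not active_list[0]["done"]:
        return False
    entry = active_list.popleft()
    free_list.append(entry["old_phys"])
    return True

def exception_rollback(active_list, map_table, free_list):
    """Undo mappings youngest-first, restoring each previous destination mapping."""
    while active_list:
        entry = active_list.pop()                  # youngest entry
        new_phys = map_table[entry["dest"]]        # mapping this instr created
        map_table[entry["dest"]] = entry["old_phys"]
        free_list.append(new_phys)                 # never committed, so reusable
```

Unwinding youngest-first is what makes the map table's current entry equal to the mapping being undone at each step.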

Busy-bit tables

When a register leaves the free list, its busy bit is set; the bit is cleared when an exec unit writes the register.

used for dependency between integer and floating point pipes.
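A minimal sketch of one busy-bit table, with the 64 bits packed into a Python int. The class and method names are assumptions; the readiness check shows how a queue could use the bits.

```python
N_PHYSICAL = 64  # physical registers per register file (from the notes)

class BusyBitTable:
    def __init__(self):
        self.bits = 0  # bit p set means physical register p has no valid value yet

    def leave_free_list(self, preg):
        self.bits |= 1 << preg

    def written(self, preg):
        self.bits &= ~(1 << preg)

    def is_busy(self, preg):
        return bool(self.bits >> preg & 1)

    def operands_ready(self, pregs):
        """An instr can issue once none of its operand registers is busy."""
        return not any(self.is_busy(p) for p in pregs)
```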

Instruction Queues

Integer Queue

16 entries, no specific order

Address Queue

Preserves program order (others need not)

Basically the load/store queue.

Two matrices (16 bits by 16 bits) track dependencies between memory accesses.

1: stop unnecessary cache thrashing (the older instr obviously gets preference)

2: Load bypass : if a pending store has the value.
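The notes do not say how the matrices are encoded. One plausible reading, shown purely as an assumption, is that bit [i][j] means queue entry i must wait for older entry j. The class and method names are made up for the sketch.

```python
QUEUE_ENTRIES = 16  # address-queue size matching the 16 x 16 matrices

class DependencyMatrix:
    def __init__(self):
        self.rows = [0] * QUEUE_ENTRIES   # row i is a 16-bit mask of entries i waits on

    def add(self, i, older_entries):
        """Record that entry i depends on each entry in older_entries."""
        for j in older_entries:
            self.rows[i] |= 1 << j

    def complete(self, j):
        """Entry j finished: clear column j everywhere and recycle row j."""
        mask = ~(1 << j)
        self.rows = [row & mask for row in self.rows]
        self.rows[j] = 0

    def ready(self, i):
        return self.rows[i] == 0
```

Clearing a whole column at once is cheap in a bit matrix, which is the appeal of this layout for a small queue.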

Sequential memory consistency (accesses appear in order). But what about external accesses to memory? If an external agent writes to the location after the load has read it from the cache but before the register is updated, the processor takes a soft exception and restarts from the load instr.

Store instr : writes into the cache precisely when it graduates and its active list entry is cleared.

Atomic instr : Load-Link, Store-Conditional pair.
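The Load-Link / Store-Conditional pair can be illustrated with a toy single-threaded model. The link-address bookkeeping below is an assumption about the semantics, not the R10000 implementation, and `LinkedMemory` and `atomic_increment` are hypothetical names.

```python
class LinkedMemory:
    def __init__(self):
        self.mem = {}
        self.link_addr = None          # address watched since the last load-linked

    def load_linked(self, addr):
        self.link_addr = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        """Ordinary store (e.g. from another agent): breaks a link on the same address."""
        self.mem[addr] = value
        if addr == self.link_addr:
            self.link_addr = None

    def store_conditional(self, addr, value):
        """Succeeds only if nothing wrote addr since the matching load-linked."""
        if self.link_addr != addr:
            return False
        self.mem[addr] = value
        self.link_addr = None
        return True

def atomic_increment(m, addr):
    """Retry loop: repeat the LL/SC pair until the conditional store succeeds."""
    while True:
        value = m.load_linked(addr)
        if m.store_conditional(addr, value + 1):
            return value + 1
```

The read-modify-write is never locked; the pair only detects interference and retries.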

Floating Point Queue

Memory Hier

The usual: nonblocking, two-level, set-associative caches.

44-bit virtual address -> 40-bit physical address.

The iCache holds predecoded instructions (makes decoding easier later on): 4 extra bits per instr say which functional unit executes it, and the opcodes are modified slightly.

9-bit ECC and 1-bit parity.