Kenneth C. Yeager. The MIPS R10000 Superscalar Microprocessor, IEEE Micro, April 1996, pp. 28-40.
Look at the paper for figures. This paper is the coolest: it explains pretty much everything that goes into an out-of-order processor in enough detail (hardware-wise). You should read this paper fully 100 times.
Specs
Fetch and decode 4 instr per cycle.
speculative execution past branches, using a four-entry branch stack.
dynamic out-of-order exec
register renaming (map tables)
in-order graduation (precise exceptions)
Exec units (pipelined)
Nonblocking load/store unit.
dual 64-bit int ALU
FPU
pipelined adder with 2 cycle latency
pipelined multiplier with 2 cycle latency
hierarchical, nonblocking mem subsystem
on-chip 2-way set-associative primary caches
external 2-way set-associative secondary cache
64-bit multiprocessor system interface with split-transaction protocol
Pipeline
Stage 1 : Fetches and Aligns (4 instr)
Stage 2 : Decode and rename (also computes target addresses for branches and jumps)
Stage 3 : Write into the queues and check whether all operands are ready (instrs wait here until they are)
Stage 4-7 : Exec
Int and FP have independent register files.
Instruction Fetch
fetch and decode at higher bandwidth than exec, to keep the queues full.
the extra bandwidth also helps because many instrs are discarded when mispredictions happen.
16-word cache line => 4 instrs are fetched at any word alignment within the line. The aligner rotates them into 0,1,2,3 (program) order, which minimizes dependency logic.
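A rough sketch of the rotate step. The 4-bank organisation below is an assumption made for illustration, not something these notes take from the paper: bank b is assumed to hold the words whose index is b mod 4, so the raw bank outputs come out "wrapped" and need a rotate to land in program order. Fetch is also assumed to stop at the end of the line. The name `align_fetch` is hypothetical.

```python
LINE_WORDS = 16   # instructions per I-cache line (from the notes)
FETCH_WIDTH = 4   # instructions fetched per cycle (from the notes)

def align_fetch(line, start):
    """Return up to 4 instructions beginning at word `start`, in program order.

    Assumed model: 4 banks, bank b supplies words with index % 4 == b.
    Words past the end of the line are not fetched (assumption).
    """
    if len(line) != LINE_WORDS or not 0 <= start < LINE_WORDS:
        raise ValueError("need a 16-word line and a start index inside it")
    rot = start % FETCH_WIDTH
    row_base = start - rot
    raw = []
    for bank in range(FETCH_WIDTH):
        idx = row_base + bank
        if bank < rot:
            idx += FETCH_WIDTH        # banks "behind" start read the next row
        raw.append(line[idx] if idx < LINE_WORDS else None)
    ordered = raw[rot:] + raw[:rot]   # rotate left so slot 0 holds the first instr
    return [w for w in ordered if w is not None]
```

With a line holding the values 0..15, a fetch at word 5 reads the banks as 8,5,6,7 and the rotate restores 5,6,7,8.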
Branch Unit
addr[11:3] > indexes a 512-entry branch history table, 2-bit prediction algo.
87% prediction accuracy on SPEC92 integer programs.
delay slot instr is still executed. no advantage in OoO, but kept for backward compatibility.
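A minimal sketch of the table described above. The notes only say "2-bit algo", so the classic 2-bit saturating counter is assumed here, as is the initial counter value; the class and method names are hypothetical.

```python
class BranchHistoryTable:
    """512-entry table of 2-bit counters, indexed by addr[11:3]."""

    ENTRIES = 512

    def __init__(self):
        # Counter values 0..3; 0-1 predict not taken, 2-3 predict taken.
        # Starting at 1 (weakly not taken) is an assumption.
        self.counters = [1] * self.ENTRIES

    @staticmethod
    def index(addr):
        # Bits 11:3 of the address give 9 bits, i.e. 512 entries.
        return (addr >> 3) & 0x1FF

    def predict(self, addr):
        return self.counters[self.index(addr)] >= 2

    def update(self, addr, taken):
        i = self.index(addr)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because only addr[11:3] is used, two branches whose addresses agree in those bits share one counter.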
Branch Stack
On each predicted branch, state is saved in a 4-entry branch stack:
(i) Alternate branch address
(ii) complete! copies of integer and fp map tables.
(iii) misc control bits
Branch Verification
Fetching along mispredicted paths > unneeded cache refills! It's kind of okay because the refilled lines might eventually be used.
A 4-bit branch mask accompanies each instr (in the queues and exec pipes). Each bit marks a branch still pending verification; if that branch turns out mispredicted, the instr is aborted.
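The mask mechanism can be sketched as below. Instructions are modelled as plain dicts with a `mask` field, and both function names are hypothetical. Clearing the bit when a branch is verified correct is an assumption about how the mask gets reused; the notes only describe the abort case.

```python
def abort_mispredicted(instrs, branch_bit):
    """Keep only instructions that do not depend on the mispredicted branch.

    branch_bit is the branch-stack entry (0..3) of the mispredicted branch.
    """
    bit = 1 << branch_bit
    return [i for i in instrs if not (i["mask"] & bit)]

def resolve_correct(instrs, branch_bit):
    """Branch verified correct: clear its bit so the stack entry can be reused.

    Assumption: the bit is cleared everywhere at once.
    """
    bit = 1 << branch_bit
    for i in instrs:
        i["mask"] &= ~bit
```

An instruction fetched after two unresolved branches carries two set bits, so a misprediction of either one aborts it.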
Decoding Logic
Stops when queue/active list is full.
Hi and Lo for multiply and divide => 2 register writes => exception handling is difficult.
instrs that modify control registers execute serially. (mostly in OS kernel).
Register Mapping
dependencies : register operands, memory address, condition bits.
renaming maps logical register numbers into physical register numbers.
after renaming, dependencies are found by comparing physical register numbers; instruction order need not be considered.
Each physical register is written exactly once after assignment from the free list.
Until it's written, it's busy.
When a later instr that remaps the same logical register graduates => the old physical register is freed.
33 logical (r1-r31 plus Hi and Lo) and 64 physical integer registers.
Register Map table
Read the mapping for 3 operands
Writes the current operand mappings and the new destination mapping into the instr queues.
Active list saves the previous destination mapping.
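The rename step can be sketched roughly as follows. This is a simplified model under assumptions: logical registers are numbered 0..32, the initial identity mapping is invented, and one instruction is renamed at a time (the real table handles four per cycle). The `Renamer` class and its field names are hypothetical.

```python
from collections import deque

class Renamer:
    def __init__(self, n_logical=33, n_physical=64):
        # Assumption: logical i starts out mapped to physical i.
        self.map = list(range(n_logical))
        self.free = deque(range(n_logical, n_physical))  # unassigned registers
        self.busy = [False] * n_physical

    def rename(self, dest, srcs):
        """Rename one instruction; return what goes to the queue / active list."""
        if not self.free:
            raise RuntimeError("free list empty: decode would stall")
        # Read operand mappings BEFORE updating the destination mapping,
        # so an instr like r1 = r1 + r2 reads the old r1.
        phys_srcs = [self.map[s] for s in srcs]
        old = self.map[dest]
        new = self.free.popleft()
        self.busy[new] = True          # busy until the result is written
        self.map[dest] = new
        return {"srcs": phys_srcs, "dest": new,
                "logical_dest": dest, "old_dest": old}
```

The returned `old_dest` is the value the active list would keep, so the previous mapping can be freed at graduation or restored on an exception.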
Free List
Unassigned physical registers.
4 parallel (graduates 4 at a time). 8 deep circular FIFO.
Active list
All instr currently active.
Up to 32 instrs can be active > 4 parallel, 8 deep circular FIFO.
Each instr gets a 5-bit tag that equals its address in the active list.
exec complete => tags sent here and marked complete.
Also has : logical-destination register number and old physical register number.
An instruction's graduation commits its new mapping. > Old physical register goes to free list.
Branch rollback > restore from Branch Stack (you know early on that a branch might mispredict, so state is checkpointed)
Exception rollback > undo mappings from the active list (because you can't checkpoint at every instr)
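Graduation and exception rollback can be sketched like this. It is a simplified model: a deque stands in for the 4 x 8 circular FIFO, the map table and free list are passed in as plain Python containers, and freed registers are simply appended to the free list. The `ActiveList` class and its method names are hypothetical.

```python
from collections import deque

class ActiveList:
    CAPACITY = 32

    def __init__(self, map_table, free_list):
        self.map = map_table        # logical -> physical (list)
        self.free = free_list       # deque of free physical registers
        self.entries = deque()      # oldest instr on the left

    def append(self, logical_dest, old_phys, new_phys):
        if len(self.entries) >= self.CAPACITY:
            raise RuntimeError("active list full: decode would stall")
        entry = {"logical": logical_dest, "old": old_phys,
                 "new": new_phys, "done": False}
        self.entries.append(entry)
        return entry

    def graduate(self, limit=4):
        """Retire up to `limit` completed instrs in order; free old registers."""
        n = 0
        while n < limit and self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.free.append(entry["old"])   # mapping committed, old reg is free
            n += 1
        return n

    def rollback_from(self, pos):
        """Exception at entry `pos`: undo it and all younger instrs, youngest first."""
        while len(self.entries) > pos:
            entry = self.entries.pop()
            self.map[entry["logical"]] = entry["old"]  # restore previous mapping
            self.free.append(entry["new"])             # new reg was never committed
```

Undoing youngest-first matters when the same logical register was renamed twice: the older mapping has to be the last one restored.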
Busy-bit tables
when register exits the free list, set bit.
used for dependency between integer and floating point pipes.
Instruction Queues
Integer Queue
16 entries, no specific order
Address Queue
Preserves program order (others need not)
Basically Load/Store q.
Two matrices (16 bit by 16 bit) track dependencies between memory accesses.
1: stop unnecessary cache thrashing (when entries contend for the same cache set, the older instr obviously gets preference)
2: Load bypass : tracks loads that read bytes written by a pending store.
Sequential memory consistency is maintained even though loads and stores execute out of order. But what about external accesses to memory? If an external agent invalidates a cache line after a load has read it but before that load graduates: soft exception, flush, and restart from the load instr.
Store instr : the cache is written precisely when the store graduates from the active list.
Atomic instr : Load-Link, Store-Conditional pair.
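A rough sketch of the second (load/store) matrix. The overlap test on address and size is an assumption made for illustration, and the dict-based queue entries and function names are hypothetical. Row i is a 16-bit value whose bit j is set when load i reads bytes written by an older pending store j.

```python
QUEUE_SIZE = 16

def overlaps(a_addr, a_size, b_addr, b_size):
    """True when byte ranges [a, a+size) and [b, b+size) intersect."""
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size

def build_store_dependency_matrix(queue):
    """queue: entries in program order, oldest first.

    Each entry is a dict with 'kind' ('load' or 'store'), 'addr', 'size'.
    Returns one bit-row per entry; a load with a zero row has no pending
    older store to the same bytes.
    """
    if len(queue) > QUEUE_SIZE:
        raise ValueError("address queue holds at most 16 entries")
    rows = [0] * len(queue)
    for i, entry in enumerate(queue):
        if entry["kind"] != "load":
            continue
        for j in range(i):                      # only OLDER entries matter
            older = queue[j]
            if older["kind"] == "store" and overlaps(
                    entry["addr"], entry["size"], older["addr"], older["size"]):
                rows[i] |= 1 << j
    return rows
```

Since the address queue preserves program order, "older" is simply "earlier in the list", which is what makes this per-pair comparison possible.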
Floating Point Queue
Memory Hier
The usuals. Non blocking 2 level set associative cache.
44-bit virtual address > 40-bit physical address.
iCache has predecoded instructions (makes decoding easier later on): 4 extra bits per instr say which functional unit executes it, and opcodes are modified slightly.
Secondary cache data: 9-bit ECC and 1 parity bit per 128-bit quadword.