1. General Info
The project is completely revamped for this course. We will be building an out-of-order processor in Chisel, using the principles learned in class. As a starting point, you will be given a dual-issue in-order processor that implements the RISC-V ISA. I am going to break up the class into teams of four or five. Each team will be further broken into 5 or 6 sub-teams (with two people per sub-team; some people will be on more than one sub-team). Each sub-team will get a component of the processor. I expect to assign one grade to the entire team for the project. This is the first time I am doing this: it will all be great or we will crash and burn!
To expose you to the standard simulation tools that are the bread-and-butter of the architect, we will also spend 5 lectures in class with the TA leading a hands-on coding session covering the following simulators/tools:
- Performance counters
2. Project Guides:
3. Additional References:
- OoO Processor Notes - Additional notes on out of order designs and execution examples
- ROB Notes - Additional notes on reorder buffers and some execution examples
- R10K overview - additional resource on the R10K design
- R10K details - additional resource on the R10K design
4. Suggested Project Timeline and Stages:
This is a rough guide to give you an idea of what you will need to complete with your team and to keep you on a reasonable schedule. Individual teams should discuss their own specific plans for major milestones and divide the work around their members' actual schedules, accounting for any preexisting plans.
October 11 - Design Plan
By this point you should have a complete plan of the work to be done. This includes but is not limited to:
- Agreement on who is working on which area of the project
- A detailed team timeline
- A means of sharing your code
- A high-level block diagram of your core showing all (named) signal connections
- Complete interface definitions for every module. Someone without knowledge of the implementation should be able to get a general idea of how your modules would be used. At a minimum, include the following:
  - signal name
  - signal direction
  - signal width
  - a description of the purpose of the signal
  - any special interactions with other signals
- A verification strategy for each module you're creating. You will want to thoroughly test each module you create individually before attempting to integrate everything.
When planning, remember that although fetch and decode can be used mostly as is, much of the latter half of the pipeline will need to be redesigned.
November 1 - Module Level design complete
You should aim to have your modules designed and verified by this time, leaving you plenty of time to integrate them into the larger core and debug any issues.
Things to consider:
- Speculation: You will likely want to remove much of the speculation that is present in the existing core to allow for easier debug. Once your full core is working you can consider adding it back in.
  - Note: We are working on a potential tool that gives perfect branch predictions using a simulated trace, but you should also have an alternative plan.
- Load/Store Ordering: You should decide on how you will order loads and stores. It's recommended you stall loads on all pending stores for your initial implementation.
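The recommended initial policy (stall loads on all pending stores) can be sketched as a small software model. This is only an illustration in plain Scala, not Chisel hardware: the `Load`, `Store`, and `canIssueLoad` names and the program-order sequence number `seq` are hypothetical, and "pending" is assumed to mean an older store that has not yet written memory.

```scala
// Hypothetical model: each memory operation carries a program-order sequence number.
sealed trait MemOp { def seq: Int }
final case class Load(seq: Int, addr: Long) extends MemOp
final case class Store(seq: Int, addr: Long) extends MemOp

object ConservativeOrdering {
  // A load may issue only when no older store is still pending.
  // Addresses are deliberately ignored: this is the simplest safe policy,
  // and it never needs to compare addresses or forward store data.
  def canIssueLoad(load: Load, pendingStores: Seq[Store]): Boolean =
    !pendingStores.exists(_.seq < load.seq)
}
```

With this policy a load with sequence number 5 stalls behind a pending store with sequence number 3 even when the addresses differ. Comparing addresses is one of the improvements you could try once the simple version works.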
December 5 - Fully working
You should aim to have your fully integrated processor completed a week ahead of the actual project completion. Last-minute bugs and problems are virtually guaranteed, so plan for some slack in your schedule for when milestones are missed.
5. Module Suggestions
This is a rough suggestion of some of the modules you will need to create based on the R10k. Your team's actual implementation can differ.
|Fetch and Decode
||The fetch and decode stages should not need any significant adjustments to work with the OoO design but later stages will all need to be changed to a large extent.
|Register Map Table
||A table containing the mappings of logical to physical registers. Be sure to support enough read and write ports in your design that this does not become a bottleneck. Consider the maximum number of instructions that could try to access the table simultaneously.
||Tracks all instructions currently active within the processor. Be sure that your design is able to add and remove enough entries at a time so as to not be a bottleneck.
||Tracks the list(s) of currently unassigned registers. Be sure your design can handle the max number of concurrent instructions to prevent bottlenecks.
||You will need a module to hold decoded instructions as they wait for execution. The R10k paper uses separate instruction queues for memory, integer, and floating point operations but it is also possible to use a more unified design.
||A unit will be needed to ensure that loads and stores appear to execute in proper program order rather than reordered. You'll need to build an LSQ of some variety to properly order the operations, or at least make them appear that way. The actual ordering policy is up to you, but it is recommended that you start with a simple policy and attempt improvements once that is working.
||You will be able to use the existing ALU and FPU without any significant changes. You will however need to add more instances of the ALU and should aim to make the core at least 4-way superscalar. The existing FPU is outside of the main core unit and you will likely want to consider bringing it into the main pipeline for consistency in the new design. You will also need a load/store unit instead of the current memory stage. You should be able to use much of the existing memory stage but will likely need to build some additional control logic around it.
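To make the interaction between the map table and the list of unassigned registers concrete, here is a minimal software sketch in plain Scala (not Chisel). The class name `RenameModel`, its methods, and the sizes are assumptions made for illustration. It renames one destination at a time and ignores port counts, checkpointing, and the hardwired RISC-V zero register.

```scala
import scala.collection.mutable

// Illustrative model only; none of these names come from the provided core.
class RenameModel(numLogical: Int, numPhysical: Int) {
  require(numPhysical > numLogical, "need spare physical registers to rename into")

  // Map table: logical register i initially maps to physical register i.
  private val mapTable: Array[Int] = Array.tabulate(numLogical)(identity)

  // Free list: every physical register that is not currently mapped.
  private val freeList = mutable.Queue.empty[Int]
  freeList ++= (numLogical until numPhysical)

  def lookup(logical: Int): Int = mapTable(logical)
  def freeCount: Int = freeList.size

  // Rename a destination register.
  // Returns (newPhysical, oldPhysical), or None when rename must stall.
  def renameDest(logical: Int): Option[(Int, Int)] =
    if (freeList.isEmpty) None
    else {
      val oldPhys = mapTable(logical)
      val newPhys = freeList.dequeue()
      mapTable(logical) = newPhys
      Some((newPhys, oldPhys))
    }

  // When the renaming instruction commits, the previous mapping can no
  // longer be read by any in-flight instruction, so it becomes free again.
  def commit(oldPhys: Int): Unit = freeList.enqueue(oldPhys)
}
```

A 4-way design would perform up to four `renameDest` calls in a single cycle, and later instructions in the group must see the mappings created by earlier ones. That is why the read and write port counts mentioned above matter.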
6. Perfect Fetch Module
A module that supplies instructions and PC values in known program order is provided to help you get started. The module uses the known ordering of correct execution from the spike RISC-V ISA simulator in the riscv-tools package that's included in the rocket-core. Download the following files to get started with it:
- Description: This will run the spike simulator and parse the output log of the instructions that were run to create two files ('/tmp/pc.bin' and '/tmp/inst.bin' by default) with the values of the instructions and their associated PC values. You will always need to run this before using the instruction_rom module for the first time or simulating a different executable.
- Requirements: You will need to have the spike executable somewhere on your path for this to run properly. It will be in 'rocket-chip/riscv-tools/bin/' if you've gone through the rocket-chip build process.
- Example Usage: gen_instruction_trace.pl <riscv-binary>
- Description: This will provide instructions and PC values in program order from the files generated by spike and the gen_instruction_trace.pl script. It will provide 1-4 instructions each cycle depending on a design parameter you give. At the end of each cycle, if enable is set it will move to the next set of values; otherwise it will remain on the same set of values.
- Example Instantiation:
val fetch = Module(new instruction_rom(2))
||The number of instructions provided each cycle. Currently only supports values of 1-4
||If enabled at the end of a clock period, the next set of instruction data will be provided.
||The concatenation of the next n pc values.
||The concatenation of instruction data of the next n instructions.
Keep in mind that because this module supplies instructions in a perfectly idealized order, your processor may appear to execute correctly even when it is calculating everything wrong, unless you look closely.
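Since the PC and instruction outputs are each a concatenation of n values, the consumer has to split them back into individual slots. The plain-Scala sketch below shows the arithmetic. The `FetchBundle.unpack` helper is hypothetical, and it assumes 32-bit fields with slot 0 in the least-significant bits. The supplied module may order the slots differently or use a different width for PC values, so check it before relying on this.

```scala
object FetchBundle {
  // Assumption: each field is `width` bits wide and slot 0 occupies the
  // least-significant bits of the concatenated value.
  def unpack(concat: BigInt, n: Int, width: Int = 32): Seq[BigInt] = {
    val mask = (BigInt(1) << width) - 1
    (0 until n).map(i => (concat >> (i * width)) & mask)
  }
}
```

In Chisel the same split would be a set of fixed bit-range extractions on the output port rather than shifts and masks.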
7. Data Memory Module
A simple module that can be used to simulate a data memory system. It is not a fully implemented cache but instead randomly adds a delay to read operations to emulate cache-like behavior. Download the following file to get started with it:
- Description: A simple data memory module that emulates cache-like behavior with pseudo-random delays.
- Initial Values: Memory locations will not be initialized to any particular value. They should always be written before being read.
- Read Behavior: A read is started when 'en' is high, 'wr' is low, and the module is not busy. The read data is available either the next cycle or in four cycles. Valid will go high after a read when data is output. Inputs need not be held constant for the entire busy duration.
- Write Behavior: A write is started when 'en' is high, 'wr' is high, and the module is not busy. Writes will never lead to the module being busy.
- Example Instantiation:
val data_mem = Module(new dmem(32,32))
||The number of bits used for the address. The memory is sized to use the full range of possible addresses.
||The number of bits used for each data element
||Input data to be used in case of write operation.
||Address for which to read/write.
||Signifies a read or write operation is to be performed. Type of operation determined by "wr" port
||If true along with "en", memory operation will be a write.
||Data output after a read.
||Signifies output data is valid after a read operation.
||Signifies read operation is in progress and memory is busy. No read or write operations can be started while this is high and new inputs will be ignored.
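The read, write, and busy behavior described above can be captured in a small cycle-level reference model, which may be handy when writing testbenches for a load/store unit. This is a sketch in plain Scala under stated assumptions, not the supplied module: `DmemModel`, `DmemOut`, and the `slowRead` callback (standing in for the pseudo-random delay choice) are hypothetical names, address and data widths are not modeled, and busy is assumed to be high only on cycles where a read is in flight and its data has not yet been output.

```scala
import scala.collection.mutable

// Outputs visible during one cycle. `data` is None when nothing is output
// or when the location was never written (memory is not initialized).
final case class DmemOut(data: Option[BigInt], valid: Boolean, busy: Boolean)

// Hypothetical software model of the described behavior.
class DmemModel(slowRead: () => Boolean) {
  private val mem = mutable.Map.empty[BigInt, BigInt]
  // (address, cycles remaining until the data is output)
  private var pending: Option[(BigInt, Int)] = None

  // Apply one cycle of inputs and return the outputs visible in that cycle.
  def step(en: Boolean, wr: Boolean, addr: BigInt, din: BigInt = 0): DmemOut = {
    // First resolve a read that was started in an earlier cycle.
    val out = pending match {
      case Some((a, 1)) =>
        pending = None
        DmemOut(mem.get(a), valid = true, busy = false)
      case Some((a, n)) =>
        pending = Some((a, n - 1))
        DmemOut(None, valid = false, busy = true)
      case None =>
        DmemOut(None, valid = false, busy = false)
    }
    // New operations start only when the module is not busy;
    // inputs presented while busy are ignored.
    if (en && !out.busy) {
      if (wr) mem(addr) = din // writes never make the module busy
      else pending = Some((addr, if (slowRead()) 4 else 1))
    }
    out
  }
}
```

Passing `() => false` or `() => true` forces every read to take the fast or slow path, which makes both timing cases easy to exercise deterministically before switching to a random choice.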
8. Additional Benchmarks
An additional set of benchmarks, beyond those included with the rocket-core repo, is available to help you better judge your improvements. These specifically target OoO processors and various architectural features, whereas those included with the rocket-core are somewhat general. The microbenchmark suite can be downloaded from its git repo by running this command in the directory where you'd like to place it:
You can then change into the microbench directory and run make to build the project if you want to run it natively. You might need to change the path to the python executable on the first line of the rand_c_arr.py script (/usr/bin/python for CSL). You will need to make a few changes to compile the suite for the RISC-V architecture. First, change the CC variable in the make.config file to point to the RISC-V version of gcc (e.g. ~/rocket-chip/riscv-tools/bin/riscv64-unknown-elf-gcc). The benchmarks also use some x86 assembly macros that you will need to remove for what should be obvious reasons. The easiest way to do this is to comment out the pair of __asm__ lines in the common.h file (the two with the xchg operation). You should then be able to run make successfully.
To run these benchmarks, you will need to run them on top of the proxy kernel using either spike (the ISA simulator) or the emulator for your design. The spike simulator is the gold standard for correct execution. Running make in the rocket-chip/emulator directory will build the emulator for your design if you haven't already. You can run with the following commands (adjust the paths for your actual install location; the optional -l and +verbose options print instruction logs):
spike -l ../riscv-tools/riscv-pk/build/pk microbench/CCa/bench
./emulator-TestHarness-DefaultConfig +max-cycles=100000000 +verbose ../riscv-tools/riscv-pk/build/pk microbench/CCa/bench
9. RISC-V Compiler
The riscv-tools included in the rocket-chip repo include a version of gcc that can build RISC-V binaries. Assuming you've already followed the instructions to build rocket-chip, your 'rocket-chip/riscv-tools/bin' directory should have a RISC-V version of many common compilation tools. The 'riscv64-unknown-elf-gcc' compiler works essentially the same as the standard version of gcc, and all the standard compilation options you're used to should be the same. There are a few specific flags that you might find useful as you work, though.
|| Generate code for the RV32 subset of the RISC-V ISA. It's recommended you worry only about RV64.
|| Generate code for the RV64 subset of the RISC-V ISA. (default if neither option given)
|| Prevent the use of all hardware floating-point instructions. If you are not planning to support floating point instructions on your core this can help you be certain none are used but you should also be able to avoid using them in test programs.
Remember that the binaries executed will not be able to run natively and will need to be run with either spike or on the rocket-chip emulator.