Phase 2.3 Handin Instructions
tar czvf cache_demo.tgz cache
Here, cache is a directory that holds both the cache_assoc and cache_direct subdirectories (this mirrors the provided tar structure).
/u/s/i/sinclair/public/html/courses/cs552/spring2022/handouts/scripts/project/phase2.3/verify_submission_format.sh
In addition to checking for the correct files in the correct places, this script runs the randbench for both the direct-mapped and set-associative caches, as well as the vcheck and naming-convention checks.
The final processor you design for this course will use both instruction and data caches. For this stage of the project you will design and test a cache that will ultimately be used in your final design. You must first design and verify a direct-mapped cache before modifying it to create a two-way set-associative cache.
The cache's storage and the memory have already been designed for you. You will implement the memory system controller that manages the cache.
All needed files are included in the original project tar file.
The following files are included in the cache directories (cache_*/). Of these files, 'mem_system.v' should be the only file you need to edit.
File | Description |
cache.v | the cache data structures outlined below |
clkrst.v | standard clock reset module |
final_memory.v | memory banks used in four_bank_mem |
four_bank_mem.v | four banked memory |
loadfile_*.img | memory image files |
mem.addr | sample address trace used by perfbench |
memc.v | used by the cache to store data |
mem_system_hier.v | instantiates the mem_system and a clock for the testbench |
mem_system_perfbench.v | testbench that uses supplied memory access traces |
mem_system_randbench.v | testbench that uses random memory access patterns |
mem_system_ref.v | reference memory design used by testbench for comparison |
mem_system.v | The memory system. The cache and main memory are instantiated here, and this is where you will need to make your changes. |
memv.v | used by cache to store valid bits |
*.syn.v | synthesizable versions of memory |
Four-banked memory is a better representation of a modern memory system: it breaks the memory into multiple banks. The four-cycle, four-banked memory is split into two Verilog modules: the top-level four_bank_mem.v and the single-bank final_memory.v. All needed files were included in the project tar file.
final_memory.syn.v must be in the same directory as final_memory.v
                     +-------------------+
                     |                   |
   Addr[15:0] >------|   four_bank_mem   |
 DataIn[15:0] >------|                   |
           wr >------|       64KB        |-----> DataOut[15:0]
           rd >------|                   |-----> stall
                     |                   |-----> Busy[3:0]
          clk >------|                   |-----> err
          rst >------|                   |
   createdump >------|                   |
                     +-------------------+
Timing:

    |            |            |            |            |            |
    | addr       | addr etc   | read data  |            | new addr   |
    | data_in    | OK to any  | available  |            | etc. is    |
    | wr, rd     |*different* |            |            | OK to      |
    | enable     | bank       |            |            | *same*     |
    |            |            |            |            | bank       |
                 <----bank busy; any new request to--->
                      the *same* bank will stall

This figure shows the external interface to the module. Each signal is described in the table.
Signal | In/Out | Width | Description |
Addr | In | 16 | Provides the address to perform an operation on. |
DataIn | In | 16 | Data to be used on a write. |
wr | In | 1 | When wr="1", the data on DataIn will be written to Mem[Addr] four cycles after wr is asserted. |
rd | In | 1 | When rd="1", the DataOut will show the value of Mem[Addr] two cycles after rd is asserted. |
clk | In | 1 | Clock signal; rising edge active. |
rst | In | 1 | Reset signal. When "rst"=1, the memory will load the data from the file "loadfile". |
createdump | In | 1 | Write contents of memory to file. Each bank will be written to a different file, named dumpfile_[0-3]. Active on rising edge. |
DataOut | Out | 16 | Two cycles after rd="1", the data at Mem[Addr] will be shown here. |
stall | Out | 1 | Is set to high when the operation requested at the input cannot be completed because the required bank is busy. |
Busy | Out | 4 | Shows the current status of each bank. High means the bank cannot be accessed. |
err | Out | 1 | The error signal is raised on an unaligned access. |
This is a byte-addressable, word-aligned 16-bit wide 64K-byte memory.
Requests may be presented every cycle. Each request is directed to one of the four banks according to the low-order address bits. Because bit 0 must be 0 for aligned requests, bits 2:1 are the ones actually used to select the bank.
If a request is issued to a bank in cycle N, any further request to the same bank before cycle N+4 is not performed, and the "stall" output is asserted instead.
Busy output reflects the current status of each individual bank.
Concurrent read and write not allowed.
On reset, memory loads from file "loadfile_0.img", "loadfile_1.img", "loadfile_2.img", and "loadfile_3.img". Each file supplies every fourth word. (The latest version of the assembler generates these four files.)
Format of each file: @0 <hex data 0> <hex data 1> ...etc
If input "createdump" is true on a rising clock edge, the contents of memory will be dumped to files "dumpfile_0", "dumpfile_1", etc. Each file will be a dump from location 0 up through the highest location modified by a write in that bank.
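The bank selection and stall rules above can be sketched as a small Python reference model. This is a simplified illustration, not the behavior of four_bank_mem.v itself: the names `bank_of`, `is_aligned`, and `simulate` are hypothetical, and the model assumes a stalled request is simply dropped and that a bank becomes free again exactly four cycles after it accepts a request.

```python
# Simplified reference model of bank selection and stalls in the
# four-banked memory. Assumption: a request accepted in cycle N keeps
# its bank busy until cycle N+4, and a stalled request is dropped.

BANK_BUSY_CYCLES = 4


def is_aligned(addr: int) -> bool:
    """Word accesses must have address bit 0 clear."""
    return (addr & 1) == 0


def bank_of(addr: int) -> int:
    """Address bits 2:1 select one of the four banks."""
    return (addr >> 1) & 0b11


def simulate(requests):
    """requests: list of (cycle, addr) pairs in increasing cycle order.

    Returns one of "ok", "stall", or "err" for each request.
    """
    ready_at = [0, 0, 0, 0]  # first cycle at which each bank is free
    results = []
    for cycle, addr in requests:
        if not is_aligned(addr):
            results.append("err")  # unaligned access raises err
            continue
        bank = bank_of(addr)
        if cycle < ready_at[bank]:
            results.append("stall")  # bank still busy
        else:
            ready_at[bank] = cycle + BANK_BUSY_CYCLES
            results.append("ok")
    return results


if __name__ == "__main__":
    # 0x0000 and 0x0008 both map to bank 0; 0x0002 maps to bank 1.
    trace = [(0, 0x0000), (1, 0x0002), (2, 0x0008), (4, 0x0008), (5, 0x0003)]
    print(simulate(trace))
```

A model like this can help when planning the order in which a controller issues the four word accesses of a cache line, since consecutive words fall in different banks.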
This figure shows the external interface to the module. Each signal is described in the table below.
                     +-------------------+
                     |                   |
       enable >------|                   |
   index[7:0] >------|       cache       |
  offset[2:0] >------|                   |
         comp >------|     256 lines     |-----> hit
        write >------|    by 4 words     |-----> dirty
  tag_in[4:0] >------|                   |-----> tag_out[4:0]
data_in[15:0] >------|                   |-----> data_out[15:0]
     valid_in >------|                   |-----> valid
                     |                   |
          clk >------|                   |
          rst >------|                   |-----> err
   createdump >------|                   |
                     +-------------------+
Signal | In/Out | Width | Description |
enable | In | 1 | Enable cache. Active high. If low, "write" and "comp" have no effect, and all outputs are zero. |
index | In | 8 | The address bits used to index into the cache memory. |
offset | In | 3 | offset[2:1] selects which word to access in the cache line. The least significant bit should be 0 for word alignment. If the least significant bit is 1, it is an error condition. |
comp | In | 1 | Compare. When "comp"=1, the cache will compare tag_in to the tag of the selected line and indicate if a hit has occurred; the data portion of the cache is read or written but writes are suppressed if there is a miss. When "comp"=0, no compare is done and the Tag and Data portions of the cache will both be read or written. |
write | In | 1 | Write signal. If high at the rising edge of the clock, a write is performed to the data selected by "index" and "offset", and (if "comp"=0) to the tag selected by "index". |
tag_in | In | 5 | When "comp"=1, this field is compared against stored tags to see if a hit occurred; when "comp"=0 and "write"=1 this field is written into the tag portion of the array. |
data_in | In | 16 | On a write, the data that is to be written to the location specified by the "index" and "offset" inputs. |
valid_in | In | 1 | On a write when "comp"=0, the value to be written to the valid bit of the line specified by the "index" input. |
clk | In | 1 | Clock signal; rising edge active. |
rst | In | 1 | Reset signal. When "rst"=1 on the rising edge of the clock, all lines are marked invalid. (The rest of the cache state is not initialized and may contain X's.) |
createdump | In | 1 | Write contents of entire cache to memory file. Active on rising edge. |
hit | Out | 1 | Goes high during a compare if the tag at the location specified by the "index" lines matches the "tag_in" lines. |
dirty | Out | 1 | When this bit is read, it indicates whether this cache line has been written to. It is valid on a read cycle, and also on a compare-write cycle when hit is false. On a write with "comp"=1, the cache sets the dirty bit to 1. On a write with "comp"=0, the dirty bit is reset to 0. |
tag_out | Out | 5 | When "write"=0, the tag selected by "index" appears on this output. (This value is needed during a writeback.) |
data_out | Out | 16 | When "write"=0, the data selected by "index" and "offset" appears on this output. |
valid | Out | 1 | During a read, this output indicates the state of the valid bit in the selected cache line. |
The cache contains 256 lines. Each line contains one valid bit, one dirty bit, a 5-bit tag, and four 16-bit words:
                V   D    Tag        Word 0           Word 1           Word 2           Word 3
               ___________________________________________________________________________________
              |___|___|_______|________________|________________|________________|________________|
              |___|___|_______|________________|________________|________________|________________|
              |___|___|_______|________________|________________|________________|________________|
              |___|___|_______|________________|________________|________________|________________|
Index-------->|___|___|_______|________________|________________|________________|________________|
              |___|___|_______|________________|________________|________________|________________|
              |___|___|_______|________________|________________|________________|________________|
              |___|___|_______|________________|________________|________________|________________|
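The port widths (5-bit tag, 8-bit index, 3-bit offset) add up to the 16-bit address, which suggests the conventional split sketched below in Python. The exact bit positions are an assumption based on those widths, and the helper names `split_address` and `word_select` are hypothetical.

```python
# Sketch of how a 16-bit address could be split into the cache's
# tag, index, and offset inputs. Assumed layout (from the port widths):
#   tag = addr[15:11], index = addr[10:3], offset = addr[2:0]

def split_address(addr: int):
    """Return (tag, index, offset) for a 16-bit byte address."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    offset = addr & 0x7          # 3 bits: byte offset within the line
    index = (addr >> 3) & 0xFF   # 8 bits: selects one of 256 lines
    tag = (addr >> 11) & 0x1F    # 5 bits: compared against the stored tag
    return tag, index, offset


def word_select(offset: int) -> int:
    """offset[2:1] picks one of the four words; offset[0] must be 0."""
    if offset & 1:
        raise ValueError("unaligned access: offset bit 0 is set")
    return (offset >> 1) & 0b11


if __name__ == "__main__":
    tag, index, offset = split_address(0x015C)
    print(tag, index, offset, word_select(offset))
```

For example, address 0x015c falls in line 43 at word 2 with tag 0 under this layout.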
Important Notes:
The Done signal should be asserted for exactly one cycle. If the request can be satisfied in the same cycle that data should be presented, Done should be asserted in that same cycle.
You will need to determine how your cache is arranged and how it functions before starting implementation. Draw out the state machine for your cache controller, as this will be required. You may implement either a Mealy or a Moore machine, though a Moore machine is recommended as it will likely be easier. Be forewarned that the resulting state machine will be relatively large, so it is best to start early.
The state machine diagram is due on Canvas the week before the cache demo (as part of HW5). If we have concerns about your design, we will ask you to set up an appointment to talk about your FSM design before the due date.
You will initially need to implement your cache as a direct-mapped cache. Make your changes for this problem in the "cache_direct" directory.
Although the cache has a lot of signals, its operation is fairly simple. When "enable" is high, the two main control lines are "comp" and "write". Here are the four cases for the behavior of the direct-mapped cache:
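The four combinations of "comp" and "write" can be read off the signal table above. The Python sketch below summarizes what each combination does according to that table; the case labels ("compare read" and so on) and the `CacheOp` and `decode` names are informal and specific to this sketch, not taken from cache.v.

```python
# Summary of the four comp/write combinations, derived from the
# signal descriptions in the cache interface table.
from typing import NamedTuple, Optional


class CacheOp(NamedTuple):
    name: str                    # informal label for the case
    compares_tag: bool           # tag_in is compared and hit is meaningful
    writes_data: bool            # data_in may be written to the data array
    writes_tag_and_valid: bool   # tag_in and valid_in are written
    dirty_after: Optional[int]   # dirty bit after a write, None if unchanged


def decode(comp: int, write: int) -> CacheOp:
    if comp and not write:
        # Tag is compared; the data selected by index/offset is read.
        return CacheOp("compare read", True, False, False, None)
    if comp and write:
        # Data is written only on a hit; a successful write sets dirty.
        return CacheOp("compare write", True, True, False, 1)
    if not comp and not write:
        # No compare; tag, data, valid, and dirty are read out.
        return CacheOp("access read", False, False, False, None)
    # No compare; tag, data, and valid are written and dirty is cleared.
    return CacheOp("access write", False, True, True, 0)


if __name__ == "__main__":
    for comp in (1, 0):
        for write in (0, 1):
            print(comp, write, decode(comp, write))
```

Note that the table describes "hit" as a tag match, so a controller built on these cases would typically still need to check "valid" before treating an access as a real hit.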
To begin testing you will use address traces that you will create to target the different possible aspects of cache behavior. Once you have that fully working you can use a fully random test set.
The perfbench testbench uses address trace files that describe a sequence of reads and writes. You will need to write at least five address traces to test your cache. Design the traces so that together they exercise the different cases your cache can encounter, so you can confirm that each one works. For simplicity, the verification script assumes these tests are named mem1.addr, mem2.addr, mem3.addr, mem4.addr, and mem5.addr.
An example address trace file (mem.addr) is provided. The format of the file is the following:
Once you have created your address traces this testbench can be run using:
wsrun.pl -addr mem.addr mem_system_perfbench *.v
If it correctly runs you will get output that looks like the following:
# Using trace file mem.addr
# LOG: ReQNum 1 Cycle 12 ReqCycle 3 Wr Addr 0x015c Value 0x0018 ValueRef 0x0018 HIT 0
#
# LOG: ReqNum 2 Cycle 14 ReqCycle 12 Rd Addr 0x015c Value 0x0018 ValueRef 0x0018 HIT 1
#
# LOG: Done all Requests: 2 Replies: 2 Cycles: 14 Hits: 1
# Test status: SUCCESS
# Break at mem_system_perfbench.v line 200
# Stopped at mem_system_perfbench.v line 200
WARNING: a SUCCESS message does not guarantee that your cache is working correctly. It merely states that your design ran to completion successfully (i.e., it says nothing about whether you got the right number of hits or misses). You should use the cache simulator to verify that the correct behavior is happening. The cache simulator can be run as follows:
cachesim <associativity> <size_bytes> <block_size_bytes> <trace_file>
So for this problem you would use:
cachesim 1 2048 8 mem.addr
This will generate output like the following:
Store Miss for Address 348
Load Hit for Address 348
You should then compare this to the perfbench output to make sure they both exhibit the same behavior.
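The comparison can also be scripted. The Python sketch below parses the two sample output formats shown above and reports the first request where they disagree. It assumes the log lines always look like the samples (including cachesim printing addresses in decimal, as 348 = 0x015c suggests); the function names are hypothetical and the real tools may print additional lines that this sketch ignores.

```python
# Compare perfbench LOG lines with cachesim output, request by request.
# Assumes both outputs follow the sample formats shown in this handout.
import re

PERF_RE = re.compile(
    r"LOG:\s+ReqNum\s+\d+.*?\b(Wr|Rd)\s+Addr\s+0x([0-9a-f]+).*?\bHIT\s+([01])",
    re.IGNORECASE,  # the sample output contains both "ReQNum" and "ReqNum"
)
SIM_RE = re.compile(r"(Store|Load)\s+(Hit|Miss)\s+for\s+Address\s+(\d+)")


def parse_perfbench(text: str):
    """Return a list of (kind, address, hit) tuples from perfbench output."""
    requests = []
    for m in PERF_RE.finditer(text):
        kind = "store" if m.group(1).lower() == "wr" else "load"
        requests.append((kind, int(m.group(2), 16), m.group(3) == "1"))
    return requests


def parse_cachesim(text: str):
    """Return a list of (kind, address, hit) tuples from cachesim output."""
    return [
        (m.group(1).lower(), int(m.group(3)), m.group(2) == "Hit")
        for m in SIM_RE.finditer(text)
    ]


def first_mismatch(perf, sim):
    """Index of the first differing request, or None if they agree."""
    for i, (p, s) in enumerate(zip(perf, sim)):
        if p != s:
            return i
    if len(perf) != len(sim):
        return min(len(perf), len(sim))
    return None


if __name__ == "__main__":
    perf_log = (
        "# LOG: ReQNum 1 Cycle 12 ReqCycle 3 Wr Addr 0x015c Value 0x0018 ValueRef 0x0018 HIT 0\n"
        "# LOG: ReqNum 2 Cycle 14 ReqCycle 12 Rd Addr 0x015c Value 0x0018 ValueRef 0x0018 HIT 1\n"
    )
    sim_log = "Store Miss for Address 348\nLoad Hit for Address 348\n"
    print(first_mismatch(parse_perfbench(perf_log), parse_cachesim(sim_log)))
```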
The address traces you created should be put in the 'cache_direct/verification' directory and have the '.addr' extension.
Once you are confident that your design is working you should test it using the random testbench. The random bench does the following:
At the end of each section you will see a message showing the performance like the following:
LOG: Done two_sets_addr Requests: 4001, Cycles: 79688 Hits: 562
You can run the random testbench like this:
wsrun.pl mem_system_randbench *.v
This will ultimately print a message saying either:
# Test status: SUCCESS
or
# Test status: FAIL
Keep in mind that the test counts as a success if the correct data is returned every time, but that does not mean your cache is necessarily working. If you see no hits, or very few, something is still wrong. If you see failures, try to isolate the failing case and create a trace that reproduces it to make debugging easier.
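To spot a suspiciously low hit count quickly, the per-section summary lines can be turned into hit rates. The Python sketch below assumes the summary lines follow the sample format shown above; the `summarize` helper is hypothetical, and no particular hit-rate threshold is implied.

```python
# Extract hit rates from "LOG: Done ..." summary lines.
# Assumes the summary format matches the sample line in this handout.
import re

DONE_RE = re.compile(
    r"LOG:\s+Done\s+(\S+)\s+Requests:\s+(\d+),?.*?Cycles:\s+(\d+)\s+Hits:\s+(\d+)"
)


def summarize(text: str):
    """Return a list of (section, requests, cycles, hits, hit_rate)."""
    rows = []
    for m in DONE_RE.finditer(text):
        requests, cycles, hits = int(m.group(2)), int(m.group(3)), int(m.group(4))
        hit_rate = hits / requests if requests else 0.0
        rows.append((m.group(1), requests, cycles, hits, hit_rate))
    return rows


if __name__ == "__main__":
    log = "LOG: Done two_sets_addr Requests: 4001, Cycles: 79688 Hits: 562"
    for section, requests, cycles, hits, rate in summarize(log):
        print(f"{section}: {hits}/{requests} hits ({rate:.1%}), "
              f"{cycles / requests:.1f} cycles per request")
```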
If you want to validate exactly how many hits the randbench should produce for the direct-mapped cache, a trace for the default seed that the randbench uses is available here:
/u/s/i/sinclair/public/html/courses/cs552/spring2022/handouts/cachesim/randbench.trace
This trace can be passed into the cachesim program as follows to validate how many hits you should get (the command line below assumes you are testing the direct-mapped cache with pseudoRandom replacement):
/u/s/i/sinclair/public/html/courses/cs552/spring2022/handouts/bins/cachesim 1 2048 8 /u/s/i/sinclair/public/html/courses/cs552/spring2022/handouts/cachesim/randbench.trace pseudoRandom | grep "Hit" | wc -l
Note: if you want to test the randbench with a different seed (initial random number), you will need to create a corresponding trace to pass into cachesim; the provided trace only works for the default random seed.
You should not start on this until you have implemented and fully verified your direct-mapped cache.
Remember to change to the cache_assoc directory before you start modifying your design, since you will need to submit both designs. Before you copy your mem_system file over the provided one, be aware that the second cache module is instantiated slightly differently there.
After you have a working design using a direct-mapped cache, you will add a second cache module to make your design two-way set-associative. Here are the four cases again:
In order to make the designs more deterministic and easier to grade, all set-associative caches must implement the following pseudo-random replacement algorithm:
Example, using two sets:
start with victimway = 0
load 0x1000   victimway=1; install 0x1000 in way 0 because both free
load 0x1010   victimway=0; install 0x1010 in way 0 because both free
load 0x1000   victimway=1; hit
load 0x2010   victimway=0; install 0x2010 in way 1 because it's free
load 0x2000   victimway=1; install 0x2000 in way 1 because it's free
load 0x3000   victimway=0; install 0x3000 in way 0 (=victimway)
load 0x3010   victimway=1; install 0x3010 in way 1 (=victimway)
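The example trace can be replayed with a small Python model. The rules in this sketch are inferred from the example itself rather than quoted from a specification: victimway is inverted on every access, a miss installs into a free way if one exists (way 0 first), and otherwise the already-inverted victimway picks the victim. The address split (tag = addr[15:11], index = addr[10:3]) and the `TwoWayModel` class are likewise assumptions of this sketch.

```python
# Model of the pseudo-random replacement policy, with rules inferred
# from the worked example: toggle victimway on every access, prefer a
# free way (way 0 first), otherwise evict the way named by victimway.

class TwoWayModel:
    def __init__(self):
        self.victimway = 0
        self.sets = {}  # index -> [tag in way 0, tag in way 1], None if free

    def access(self, addr: int):
        """Return ("hit", way) or ("install", way) for one access."""
        self.victimway ^= 1  # inverted on every access, hit or miss
        tag = (addr >> 11) & 0x1F    # assumed tag bits
        index = (addr >> 3) & 0xFF   # assumed index bits
        ways = self.sets.setdefault(index, [None, None])
        if tag in ways:
            return ("hit", ways.index(tag))
        if ways[0] is None:
            way = 0
        elif ways[1] is None:
            way = 1
        else:
            way = self.victimway  # both ways valid: evict victimway
        ways[way] = tag
        return ("install", way)


if __name__ == "__main__":
    model = TwoWayModel()
    for addr in (0x1000, 0x1010, 0x1000, 0x2010, 0x2000, 0x3000, 0x3010):
        action, way = model.access(addr)
        print(f"load 0x{addr:04x}  victimway={model.victimway}; {action} way {way}")
```

Running it reproduces the way choices listed in the example, which makes it a convenient cross-check when writing two-way address traces.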
Your testing for the set-associative cache should be done in much the same way. You can either create more address traces or update your previous ones to reflect the differences in behavior the new design would have. Remember to get your perfbench tests working before attempting to debug the randbench.
The cache simulator would now be run with slightly different arguments to reflect your changes:
cachesim 2 4096 8 mem.addr pseudoRandom
If you do not specify the pseudoRandom argument it will use an LRU replacement policy instead of the pseudo-random policy you have implemented.
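The cachesim size arguments follow from the cache geometry described earlier: each cache module holds 256 lines of four 16-bit words. The Python sketch below derives the arguments from those numbers; the `cachesim_command` helper is hypothetical and only builds the command string, it does not run the simulator.

```python
# Derive the cachesim arguments from the cache geometry in this handout:
# 256 lines per way, 4 words per line, 2 bytes per word.

LINES_PER_WAY = 256
WORDS_PER_LINE = 4
BYTES_PER_WORD = 2


def cachesim_command(ways: int, trace: str = "mem.addr") -> str:
    """Build a cachesim command line for a cache with the given associativity."""
    block_bytes = WORDS_PER_LINE * BYTES_PER_WORD      # 8 bytes per line
    size_bytes = ways * LINES_PER_WAY * block_bytes    # total capacity
    parts = ["cachesim", str(ways), str(size_bytes), str(block_bytes), trace]
    if ways > 1:
        parts.append("pseudoRandom")  # otherwise cachesim defaults to LRU
    return " ".join(parts)


if __name__ == "__main__":
    print(cachesim_command(1))
    print(cachesim_command(2))
```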
The address traces you used should be put in the 'cache_assoc/verification' directory and have the '.addr' extension. For simplicity, the verification script assumes these files are called mem1-tw.addr, mem2-tw.addr, mem3-tw.addr, mem4-tw.addr, and mem5-tw.addr.
If you want to validate exactly how many hits the randbench should produce for the 2-way set-associative cache, a trace for the default seed that the randbench uses is available here:
/u/s/i/sinclair/public/html/courses/cs552/spring2022/handouts/cachesim/randbench.trace
This trace can be passed into the cachesim program as follows to validate how many hits you should get (the command line below assumes you are testing the 2-way set-associative cache with pseudoRandom replacement):
/u/s/i/sinclair/public/html/courses/cs552/spring2022/handouts/bins/cachesim 2 4096 8 /u/s/i/sinclair/public/html/courses/cs552/spring2022/handouts/cachesim/randbench.trace pseudoRandom | grep "Hit" | wc -l
Note: if you want to test the randbench with a different seed (initial random number), you will need to create a corresponding trace to pass into cachesim; the provided trace only works for the default random seed.
When instantiating the module, there is a parameter which is set for each instance. When you dump the contents of the cache to a set of files (e.g. for debugging), this parameter allows each instance to go to a unique set of filenames.
Parameter Value   File Names
---------------   ----------
       0          Icache_0_data_0, Icache_0_data_1, Icache_0_tags, ...
       1          Dcache_0_data_0, Dcache_0_data_1, Dcache_0_tags, ...
       2          Icache_1_data_0, Icache_1_data_1, Icache_1_tags, ...
       3          Dcache_1_data_0, Dcache_1_data_1, Dcache_1_tags, ...
Here is an example of instantiating two modules with a parameter value of 0 and 1:
cache #(0) cache0 (enable, index, ...
cache #(1) cache1 (enable, index, ...