### U. Wisconsin CS/ECE 752 Advanced Computer Architecture I

Prof. David A. Wood

Unit 8: Storage Hierarchy I: Caches

Slides developed by Amir Roth of University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

Slides enhanced by Milo Martin, Mark Hill, and David Wood with sources that included Profs. Asanovic, Falsafi, Hoe, Lipasti, Shen, Smith, Sohi, Vijaykumar, and Wood

CS/ECE 752 (Wood): Caches



### Motivation

- Processor can compute only as fast as memory
  - A 3GHz processor can execute an "add" operation in 0.33ns
  - Today's "main memory" latency is 50-100ns
  - Naïve implementation: loads/stores can be 300x slower than other operations
- Unobtainable goal:
  - Memory that operates at processor speeds
  - Memory as large as needed for all running programs
  - Memory that is cost effective
- Can't achieve all of these goals at once


### Types of Memory

- Static RAM (SRAM)
  - 6-10 transistors per bit
  - Optimized for speed (first) and density (second)
  - Fast (sub-nanosecond latencies for small SRAM)
    - Latency grows with array area
  - Mixes well with standard processor logic
- Dynamic RAM (DRAM)
  - 1 transistor + 1 capacitor per bit
  - Optimized for density (in terms of cost per bit)
  - Slow (>20ns internal access, >40ns pin-to-pin)
  - Different fabrication steps (does not mix well with logic)
- Nonvolatile storage: magnetic disk, Flash, STT MRAM, etc.


### Storage Technology

- Cost: what can \$300 buy today?
  - SRAM: 500MB (volume chips)
  - DRAM: 32GB, 64x cheaper than SRAM
  - Flash: 512GB, 16x cheaper than DRAM
  - Disk: 8500GB, 16x cheaper than Flash
- Latency
  - SRAM: <1 to 5ns (on chip)
  - DRAM: ~40ns, 100x or more slower
  - Disk: 10,000,000ns or 10ms, 100,000x slower (mechanical)
- Bandwidth
  - SRAM: 10-100GB/sec
  - DRAM: ~6GB/sec per channel
  - Disk: 100MB/sec (0.1GB/sec), sequential access only





### Locality to the Rescue

- Locality of memory references
  - Property of real programs, few exceptions
  - Books and library analogy
- Temporal locality
  - Recently referenced data is likely to be referenced again soon
  - Reactive: cache recently used data in small, fast memory
- Spatial locality
  - More likely to reference data near recently referenced data
  - Proactive: fetch data in large chunks to include nearby data
- Holds for data and instructions

























### Hill's 3C Miss Rate Classification

- Compulsory
  - Miss caused by initial access
- Capacity
  - Miss caused by finite capacity
  - I.e., would not miss in infinite cache
- Conflict
  - Miss caused by finite associativity
  - I.e., would not miss in a fully-associative cache
- Coherence (4<sup>th</sup> C, added by Jouppi)
  - Miss caused by invalidation to enforce coherence









### Block Size and Tag Overhead

- Tag overhead of 32KB cache with 1024 32B frames
  - 32B frames → 5-bit offset
  - 1024 frames → 10-bit index
  - 32-bit address − 5-bit offset − 10-bit index = 17-bit tag
  - (17-bit tag + 1-bit valid) \* 1024 frames = 18Kb tags = 2.25KB tags
  - ~7% overhead
- Tag overhead of 32KB cache with 512 64B frames
  - 64B frames → 6-bit offset
  - 512 frames → 9-bit index
  - 32-bit address − 6-bit offset − 9-bit index = 17-bit tag
  - (17-bit tag + 1-bit valid) \* 512 frames = 9Kb tags = 1.1KB tags
  - + ~3.5% overhead
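
The tag-overhead arithmetic above can be checked mechanically. The following is a minimal sketch, assuming a direct-mapped cache, power-of-two sizes, and one valid bit per frame; the function names are hypothetical and not part of the original slides.

```c
/* Number of bits needed to address n items; n is assumed to be a power of two. */
static unsigned log2_pow2(unsigned n) {
    unsigned bits = 0;
    while (n > 1) { n >>= 1; bits++; }
    return bits;
}

/* Tag width: address bits minus index bits minus offset bits. */
unsigned tag_bits(unsigned addr_bits, unsigned frames, unsigned block_bytes) {
    return addr_bits - log2_pow2(frames) - log2_pow2(block_bytes);
}

/* Total tag storage in bits, counting one valid bit per frame. */
unsigned tag_storage_bits(unsigned addr_bits, unsigned frames, unsigned block_bytes) {
    return (tag_bits(addr_bits, frames, block_bytes) + 1) * frames;
}

/* Tag storage as a percentage of the data capacity. */
double tag_overhead_percent(unsigned addr_bits, unsigned frames, unsigned block_bytes) {
    double data_bits = 8.0 * frames * block_bytes;
    return 100.0 * tag_storage_bits(addr_bits, frames, block_bytes) / data_bits;
}
```

Calling `tag_overhead_percent(32, 1024, 32)` and `tag_overhead_percent(32, 512, 64)` reproduces the two cases on this slide: doubling the block size halves the number of frames and therefore roughly halves the tag storage.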


### Block Size and Performance

- Parameters: 8-bit addresses, 32B cache, 8B blocks
  - Initial contents: 0000(0010), 0020(0030), 0100(0110), 0120(0130)

| Cache contents (prior to access)               | Address | Outcome                |
|------------------------------------------------|---------|------------------------|
| 0000(0010), 0020(0030), 0100(0110), 0120(0130) | 3020    | Miss                   |
| 0000(0010), 3020(3030), 0100(0110), 0120(0130) | 3030    | Hit (spatial locality) |
| 0000(0010), 3020(3030), 0100(0110), 0120(0130) | 2100    | Miss                   |
| 0000(0010), 3020(3030), 2100(2110), 0120(0130) | 0012    | Hit                    |
| 0000(0010), 3020(3030), 2100(2110), 0120(0130) | 0020    | Miss                   |
| 0000(0010), 0020(0030), 2100(2110), 0120(0130) | 0030    | Hit (spatial locality) |
| 0000(0010), 0020(0030), 2100(2110), 0120(0130) | 0110    | Miss (conflict)        |
| 0000(0010), 0020(0030), 0100(0110), 0120(0130) | 0100    | Hit (spatial locality) |

### Large Blocks and Superblocks

- Large cache blocks can take a long time to refill
  - Refill cache line critical word first
  - Restart cache access before complete refill
- Large cache blocks can waste bus bandwidth if block size is larger than spatial locality
  - Group blocks into SuperBlocks (aka Sectors)
  - Associate separate valid (coherence) bits with each block
- Sparse access patterns can use 1/S of the cache
  - S is blocks per superblock
  - Internal fragmentation!

SuperBlock: One tag, multiple blocks, one-to-one mapping

tag vvvv block block block block

### Decoupled SuperBlocks

- Level of indirection between tags and blocks
- Forward pointers
  - Tag points to multiple disjoint blocks
- Backward pointers
  - Block points to matching tag (and block #)
- Separates notions of data and tag associativity

Figure labels: backward pointer (4 1-bit pointers); forward pointer (8 2-bit pointers); tags (tttt), four blocks, v, block#

### Conflicts

- What about pairs like 3030/0030, 0100/2100?
  - These will conflict in any sized cache (regardless of block size)
  - Will keep generating misses
- Can we allow pairs like these to simultaneously reside?
  - Yes, reorganize cache to do so

Address format: tag (3 bits), index (3 bits), offset (2 bits)

| Cache contents (prior to access)               | Address | Outcome |
|------------------------------------------------|---------|---------|
| 0000, 0010, 0020, 0030, 0100, 0110, 0120, 0130 | 3020    | Miss    |
| 0000, 0010, 3020, 0030, 0100, 0110, 0120, 0130 | 3030    | Miss    |
| 0000, 0010, 3020, 3030, 0100, 0110, 0120, 0130 | 2100    | Miss    |
| 0000, 0010, 3020, 3030, 2100, 0110, 0120, 0130 | 0012    | Hit     |
| 0000, 0010, 3020, 3030, 2100, 0110, 0120, 0130 | 0020    | Miss    |
| 0000, 0010, 0020, 3030, 2100, 0110, 0120, 0130 | 0030    | Miss    |
| 0000, 0010, 0020, 0030, 2100, 0110, 0120, 0130 | 0110    | Hit     |









### Replacement Policies

- Set-associative caches present a new design choice
  - On cache miss, which block in set to replace (kick out)?
- Some options
  - Random
  - FIFO (first-in first-out)
  - LRU (least recently used)
    - Fits with temporal locality, LRU = least likely to be used in future
  - nMRU (not most recently used)
    - Is LRU for 2-way set-associative caches
  - pLRU (pseudo-LRU)
    - Tree-based generalization of nMRU, uses A-1 bits (A = associativity)
  - Belady's: replace block that will be used furthest in future
    - Unachievable optimum
- Which policy is simulated in previous example?







### Partition LRU List into *Filter* and *Reuse* Lists

- On insert, block goes into *filter* list
- On reuse (hit), block promoted into *reuse* list
- Provides scan & some thrash resistance
  - Blocks without reuse get evicted quickly
  - Blocks with reuse are protected from scan/thrash blocks
- No storage overhead
  - But LRU update slightly more complicated



### Set Dueling: Dynamic Set Sampling

- Best policy depends upon workload
- Dynamically decide which is best
  - Divide sets into N groups
  - Group 1 uses policy A
  - Group 2 uses policy B
  - Other N-2 groups use policy that performed the best in last interval











### Classifying Misses: 3(4)C Model

- Divide cache misses into three categories
  - Compulsory (cold): never seen this address before
    - Would miss even in infinite cache
    - Identify? Easy
  - Capacity: miss caused because cache is too small
    - Would miss even in fully associative cache
    - Identify? Consecutive accesses to block separated by accesses to at least N other distinct blocks (N is number of frames in cache)
  - Conflict: miss caused because cache associativity is too low
    - Identify? All other misses
  - (Coherence): miss due to external invalidations
    - Only in shared memory multiprocessors
- Who cares? Different techniques for attacking different misses


### Cache Performance Simulation

- Parameters: 8-bit addresses, 32B cache, 4B blocks
  - Initial contents: 0000, 0010, 0020, 0030, 0100, 0110, 0120, 0130
  - Initial blocks accessed in increasing order

| Cache contents                                         | Address | Outcome           |
|--------------------------------------------------------|---------|-------------------|
| 0000, 0010, 0020, 0030, 0100, 0110, 0120, 0130         | 3020    | Miss (compulsory) |
| 0000, 0010, <b>3020</b> , 0030, 0100, 0110, 0120, 0130 | 3030    | Miss (compulsory) |
| 0000, 0010, 3020, 3030, 0100, 0110, 0120, 0130         | 2100    | Miss (compulsory) |
| 0000, 0010, 3020, 3030, <b>2100</b> , 0110, 0120, 0130 | 0012    | Hit               |
| 0000, 0010, 3020, 3030, 2100, 0110, 0120, 0130         | 0020    | Miss (capacity)   |
| 0000, 0010, 0020, 3030, 2100, 0110, 0120, 0130         | 0030    | Miss (capacity)   |
| 0000, 0010, 0020, 0030, 2100, 0110, 0120, 0130         | 0110    | Hit               |
| 0000, 0010, 0020, 0030, 2100, 0110, 0120, 0130         | 0100    | Miss (capacity)   |
| 0000, 0010, 0020, 0030, <b>0100</b> , 0110, 0120, 0130 | 2100    | Miss (conflict)   |
| 0000, 0010, 0020, 0030, <b>2100</b> , 0110, 0120, 0130 | 3020    | Miss (conflict)   |
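
The classification in this table can be reproduced with a small simulator. The sketch below is one possible approach, not the slides' own tool: it runs the direct-mapped cache alongside a fully-associative LRU "shadow" cache of the same size and a seen-before bitmap. It assumes the 4-digit addresses are written in base 4 (so four digits cover 8 bits); all names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define FRAMES 8        /* 32B cache / 4B blocks */
#define BLOCK_BYTES 4
#define NUM_BLOCKS 64   /* 8-bit address space / 4B blocks */

typedef enum { HIT, COMPULSORY, CAPACITY, CONFLICT } outcome_t;

typedef struct {
    int dm[FRAMES];         /* direct-mapped cache: block number per frame, -1 = empty */
    int fa[FRAMES];         /* fully-associative LRU shadow: fa[0] is MRU, -1 = empty */
    bool seen[NUM_BLOCKS];  /* blocks referenced at least once */
} sim_t;

/* Convert a base-4 digit string such as "3020" to an address (assumed notation). */
unsigned from_base4(const char *s) {
    unsigned v = 0;
    for (; *s; s++) v = v * 4 + (unsigned)(*s - '0');
    return v;
}

void sim_init(sim_t *s) {
    for (int i = 0; i < FRAMES; i++) { s->dm[i] = -1; s->fa[i] = -1; }
    memset(s->seen, 0, sizeof s->seen);
}

/* Touch a block in the LRU shadow; returns true if it was resident. */
static bool fa_access(sim_t *s, int block) {
    int pos = FRAMES - 1;   /* on a miss, the LRU entry is dropped */
    bool hit = false;
    for (int i = 0; i < FRAMES; i++)
        if (s->fa[i] == block) { pos = i; hit = true; break; }
    for (int i = pos; i > 0; i--) s->fa[i] = s->fa[i - 1];
    s->fa[0] = block;       /* accessed block becomes MRU */
    return hit;
}

/* Access one address and classify the outcome in the direct-mapped cache. */
outcome_t sim_access(sim_t *s, unsigned addr) {
    int block = (int)(addr / BLOCK_BYTES);
    int frame = block % FRAMES;
    bool dm_hit = (s->dm[frame] == block);
    bool fa_hit = fa_access(s, block);
    bool first = !s->seen[block];
    s->seen[block] = true;
    s->dm[frame] = block;
    if (dm_hit) return HIT;
    if (first) return COMPULSORY;        /* never referenced before */
    return fa_hit ? CONFLICT : CAPACITY; /* FA cache would have hit: conflict */
}
```

To mirror the slide, first access the eight initial blocks in increasing order, then feed the ten addresses from the table to `sim_access`. Note that the shadow cache is updated on every access, including hits in the direct-mapped cache.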






### Software Restructuring: Data

- Capacity misses: poor spatial or temporal locality
  - Several code restructuring techniques to improve both
  - Compiler must know that restructuring preserves semantics
- Loop interchange: spatial locality
  - Example: row-major matrix: X[i][j] followed by X[i][j+1]
  - Poor code: X[i][j] followed by X[i+1][j]

        for (j = 0; j < NCOLS; j++)
            for (i = 0; i < NROWS; i++)
                sum += X[i][j];   // non-contiguous accesses

  - Better code

        for (i = 0; i < NROWS; i++)
            for (j = 0; j < NCOLS; j++)
                sum += X[i][j];   // contiguous accesses







### Miss Cost: Critical Word First/Early Restart

- Observation: latency<sub>miss</sub> = latency<sub>access</sub> + latency<sub>transfer</sub>
  - latency<sub>access</sub>: time to get first word
  - latency<sub>transfer</sub>: time to get rest of block
  - Implies whole block is loaded before data returns to CPU
- Optimization
  - Critical word first: return requested word first
    - Must arrange for this to happen (bus, memory must cooperate)
  - Early restart: send requested word to CPU immediately
    - Load rest of block into cache in parallel
  - latency<sub>miss</sub> = latency<sub>access</sub>

### Miss Cost: Lockup Free Cache

- Lockup free: allows other accesses while miss is pending
  - Consider: Load [r1] -> r2; Load [r3] -> r4; Add r2, r4 -> r5
  - Only makes sense for...
    - Data cache
    - Processors that can go ahead despite D\$ miss (out-of-order)
  - Implementation: miss status holding register (MSHR)
    - Remember: miss address, chosen frame, requesting instruction
    - When miss returns know where to put block, who to inform
  - Simplest scenario: "hit under miss"
    - Handle hits while miss is pending
    - Easy for OoO cores
  - More common: "miss under miss"
    - A little trickier, but common anyway
    - Requires split-transaction bus/interconnect
    - Requires multiple MSHRs: search to avoid frame conflicts
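
As a rough illustration of the MSHR bookkeeping described above, here is a minimal sketch. It assumes a small fixed-size MSHR file with one requester per entry and treats a second miss to a frame that is already awaiting a fill as a stall; the structure layout, sizes, and function names are all hypothetical.

```c
#include <stdbool.h>

#define NUM_MSHRS 4   /* hypothetical number of outstanding misses supported */

typedef struct {
    bool valid;
    unsigned block_addr;  /* miss address (block-aligned) */
    int frame;            /* frame chosen to receive the fill */
    int requester;        /* id of the requesting instruction */
} mshr_t;

/* Return the index of the MSHR tracking block_addr, or -1 if there is none. */
int mshr_find(const mshr_t m[], unsigned block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (m[i].valid && m[i].block_addr == block_addr) return i;
    return -1;
}

/* Record a new outstanding miss. Returns the MSHR index, or -1 when the miss
   must stall: all MSHRs are busy, or the frame is already awaiting a fill. */
int mshr_allocate(mshr_t m[], unsigned block_addr, int frame, int requester) {
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (m[i].valid && m[i].frame == frame) return -1;  /* frame conflict */
        if (!m[i].valid && free_slot < 0) free_slot = i;
    }
    if (free_slot < 0) return -1;
    m[free_slot] = (mshr_t){ true, block_addr, frame, requester };
    return free_slot;
}

/* A fill returned: report where the block goes and whom to inform,
   then free the entry. Returns false if no MSHR matches. */
bool mshr_complete(mshr_t m[], unsigned block_addr, int *frame, int *requester) {
    int i = mshr_find(m, block_addr);
    if (i < 0) return false;
    *frame = m[i].frame;
    *requester = m[i].requester;
    m[i].valid = false;
    return true;
}
```

With a single MSHR this supports only "hit under miss"; "miss under miss" needs `NUM_MSHRS > 1`, and the search in `mshr_allocate` is the frame-conflict check the slide mentions. Real designs usually also merge several requesters onto one entry, which this sketch omits.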


### Prefetching

- Prefetching: put blocks in cache proactively/speculatively
  - Key: anticipate upcoming miss addresses accurately
    - Can do in software or hardware
  - Simple example: next block prefetching
    - Miss on address X → anticipate miss on X+block-size
    - + Works for insns: sequential execution
    - + Works for data: arrays
- Timeliness: initiate prefetches sufficiently in advance
- Coverage: prefetch for as many misses as possible
- Accuracy: don't pollute with unnecessary data
  - It evicts useful data

### **Prefetching Characterization**

- Useless prefetch
  - Prefetch brings in data that is never used
- Harmful prefetch
  - Prefetch brings in data that displaces a block that WOULD have been reused
  - So-called (negative) interference
    - Positive interference is possible in multicores
- Performance vs. energy
  - Original characterization deals with performance
  - Useless prefetches unnecessarily use energy, thus are harmful


### Software Prefetching


- Software prefetching: two kinds
  - Binding: prefetch into register (e.g., software pipelining)
    - + No ISA support needed, use normal loads (non-blocking cache)
    - − Need more registers, and what about faults?
  - Non-binding: prefetch into cache (or other buffer) only
    - − Need ISA support: non-binding, non-faulting loads
    - + Simpler semantics
- Example

      for (i = 0; i < NROWS; i++)
          for (j = 0; j < NCOLS; j += BLOCK_SIZE) {
              prefetch(&X[i][j] + BLOCK_SIZE);   // non-binding prefetch of the next block
              for (jj = j; jj < j + BLOCK_SIZE; jj++)
                  sum += X[i][jj];
          }
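
A compilable variant of the non-binding example is sketched below. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` builtin emits a prefetch hint where the target supports one; the array dimensions and `BLOCK_SIZE` (elements per cache block) are illustrative values, not taken from the slides.

```c
#include <stddef.h>

#define NROWS 4
#define NCOLS 64
#define BLOCK_SIZE 8   /* elements per cache block; an assumed value */

/* Sum a row-major matrix, prefetching the next block of the row
   while the current block is being summed. */
long sum_with_prefetch(int X[NROWS][NCOLS]) {
    long sum = 0;
    for (size_t i = 0; i < NROWS; i++) {
        for (size_t j = 0; j < NCOLS; j += BLOCK_SIZE) {
            /* Non-binding hint: only issued while the next block is in the row. */
            if (j + BLOCK_SIZE < NCOLS)
                __builtin_prefetch(&X[i][j + BLOCK_SIZE]);
            for (size_t jj = j; jj < j + BLOCK_SIZE; jj++)
                sum += X[i][jj];
        }
    }
    return sum;
}
```

Because the prefetch is only a hint, removing it cannot change the result, which is what makes the non-binding form semantically simpler than prefetching into a register. Whether it helps depends on timeliness: one block ahead may be too little to hide main-memory latency.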


### Hardware Prefetching

- What to prefetch?
  - One block ahead
    - How much latency do we need to hide (Little's Law)?
    - Can also do N blocks ahead to hide more latency
    - + Simple, works for sequential things: insns, array data
  - Address prediction
    - Needed for non-sequential data: lists, trees, etc.
- When to prefetch?
  - On every reference?
  - On every miss?
    - + Works better than doubling the block size
  - Ideally: when resident block becomes dead (avoid useful evictions)
    - How to know when that is? ["Dead-Block Prediction", ISCA'01]

### Address Prediction for Prefetching

- "Next-block" prefetching is easy, what about other options?
- Correlating predictor
  - Large table stores (miss-addr → next-miss-addr) pairs
  - On miss, access table to find out what will miss next
    - It's OK for this table to be large and slow
- Content-directed or dependence-based prefetching
  - Greedily chases pointers from fetched blocks
- Jump pointers
  - Augment data structure with prefetch pointers
  - Can do in hardware too
- An active area of research
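
To make the correlating predictor concrete, here is a minimal sketch of a table of (miss-addr → next-miss-addr) pairs. It assumes a direct-mapped table indexed by the miss address modulo the table size and a single successor per entry; the table size and all names are hypothetical, and real proposals typically store several successors per entry.

```c
#define TABLE_SIZE 256   /* hypothetical number of table entries */

typedef struct {
    unsigned miss_addr[TABLE_SIZE];  /* tag: the miss address that owns the entry */
    unsigned next_miss[TABLE_SIZE];  /* the miss that followed it last time */
    int valid[TABLE_SIZE];
    unsigned last_miss;              /* most recent miss, used for training */
    int have_last;
} corr_pred_t;

static unsigned slot(unsigned addr) { return addr % TABLE_SIZE; }

/* Record a miss and look up the miss predicted to follow it.
   Returns 1 and sets *prefetch_addr if a prediction exists, else 0. */
int corr_on_miss(corr_pred_t *p, unsigned miss_addr, unsigned *prefetch_addr) {
    /* Train: the previous miss was followed by this one. */
    if (p->have_last) {
        unsigned s = slot(p->last_miss);
        p->miss_addr[s] = p->last_miss;
        p->next_miss[s] = miss_addr;
        p->valid[s] = 1;
    }
    p->last_miss = miss_addr;
    p->have_last = 1;

    /* Predict: what followed this miss the last time it occurred? */
    unsigned s = slot(miss_addr);
    if (p->valid[s] && p->miss_addr[s] == miss_addr) {
        *prefetch_addr = p->next_miss[s];
        return 1;
    }
    return 0;
}
```

A zero-initialized `corr_pred_t` starts empty. The predictor only helps once a miss sequence repeats, which is why it suits pointer-chasing code that revisits the same structures.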


### Write Issues

- So far we have looked at reading from cache (loads)
- What about writing into cache (stores)?
- Several new issues
  - Tag/data access
  - Write-through vs. write-back
  - Write-allocate vs. write-not-allocate
- Buffers
  - Store buffers (queues)
  - Write buffers
  - Writeback buffers








### Write-Through vs. Write-Back

- When to propagate new value to (lower level) memory?
  - Write-through: immediately
    - + Conceptually simpler
    - + Uniform latency on misses
    - − Requires additional bus bandwidth
  - Write-back: when block is replaced
    - Requires additional "dirty" bit per block
    - + Lower bus bandwidth for large caches
      - Only writeback dirty blocks
    - − Non-uniform miss latency
      - Clean miss: one transaction with lower level (fill)
      - Dirty miss: two transactions (writeback + fill)
      - Writeback buffer: fill, then writeback (later)
- Common design: write-through L1, write-back L2/L3









### Increasing Cache Bandwidth

- What if we want to access the cache twice per cycle?
- Option #1: multi-ported cache
  - Same number of six-transistor cells
  - Double the decoder logic, bitlines, wordlines
  - Area becomes "wire dominated" → slow
  - OR, time multiplex the wires
- Option #2: banked cache
  - Split cache into two smaller "banks"
  - Can do two parallel accesses to different parts of the cache
  - Bank conflict occurs when two requests access the same bank
- Option #3: replication
  - Make two copies (2x area overhead)
  - Writes update both replicas (does not improve write bandwidth)
  - Independent reads
  - No bank conflicts, but lots of area
  - Split instruction/data caches is a special case of this approach














### Methods: Execution-Driven Simulation

- Simulate the program execution
  - Simulates each instruction's execution on the computer
  - Model processor, memory hierarchy, peripherals, etc.
  - ✓ Reports execution time
    - ✓ Accounts for all system interactions
  - ✓ No need to generate/store trace
  - ✗ Much more complicated simulation model
  - ✗ Time-consuming, but good programming can help
  - ✗ Multi-threaded programs exhibit variability
- Very common in academia and industry today
- Watch out for repeatability in multithreaded workloads


### **Low-Power Caches**

- Caches consume significant power
  - 15% in Pentium4
  - 45% in StrongARM
- Three techniques
  - Way prediction (already talked about)
  - Dynamic resizing
  - Drowsy caches


### Low-Power Access: Dynamic Resizing

- Dynamic cache resizing
  - Observation I: data, tag arrays implemented as many small arrays
  - Observation II: many programs don't fully utilize caches
  - Idea: dynamically turn off unused arrays
    - Turn off means disconnect power (V<sub>DD</sub>) plane
    - + Helps with both dynamic and static power
  - There are always tradeoffs
    - − Flush dirty lines before powering down → costs power↑
    - − Cache-size↓ → %<sub>miss</sub>↑ → power↑, execution time↑


### Dynamic Resizing: When to Resize

- Use %<sub>miss</sub> feedback
  - %<sub>miss</sub> near zero? Make cache smaller (if possible)
  - %<sub>miss</sub> above some threshold? Make cache bigger (if possible)
- Aside: how to track miss-rate in hardware?
  - Hard, easier to track miss-rate vs. some threshold
  - Example: is %<sub>miss</sub> higher than 5%?
    - N-bit counter (N = 8, say)
    - Hit? counter -= 1
    - Miss? counter += 19
    - Counter positive? More than 1 miss per 19 hits (%<sub>miss</sub> > 5%)
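
The threshold counter can be sketched in a few lines. This version assumes a signed 8-bit saturating counter starting at zero; the slide does not specify saturation or the initial value, so both are assumptions, and the names are hypothetical.

```c
#define COUNTER_MAX 127     /* signed 8-bit range (N = 8); saturation is an assumption */
#define COUNTER_MIN (-128)
#define MISS_WEIGHT 19      /* 1 miss per 19 hits = 1 miss in 20 accesses = 5% */

typedef struct { int value; } miss_counter_t;

/* Hit: counter -= 1. Miss: counter += 19. Clamp to the 8-bit range. */
void counter_update(miss_counter_t *c, int is_miss) {
    c->value += is_miss ? MISS_WEIGHT : -1;
    if (c->value > COUNTER_MAX) c->value = COUNTER_MAX;
    if (c->value < COUNTER_MIN) c->value = COUNTER_MIN;
}

/* Positive counter: more than 1 miss per 19 hits, i.e. miss rate above 5%. */
int miss_rate_above_threshold(const miss_counter_t *c) {
    return c->value > 0;
}
```

The weights encode the threshold: at exactly 5% the increments and decrements cancel, so the counter's sign indicates which side of the threshold the recent miss rate falls on without any division.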


### Dynamic Resizing: How to Resize?

- Reduce ways
  - ["Selective Cache Ways", Albonesi, ISCA-98]
  - + Resizing doesn't change mapping of blocks to sets → simple
  - − Lose associativity
- Reduce sets
  - ["Resizable Cache Design", Yang+, HPCA-02]
  - − Resizing changes mapping of blocks to sets → tricky
    - When cache made bigger, need to relocate some blocks
    - Actually, just flush them
  - Why would anyone choose this way?
    - + More flexibility: number of ways typically small
    - + Lower %<sub>miss</sub>: for fixed capacity, higher associativity better










### Performance Calculation I

- Parameters
  - Reference stream: all loads
  - D\$: t<sub>hit</sub> = 1ns, %<sub>miss</sub> = 5%
  - L2: t<sub>hit</sub> = 10ns, %<sub>miss</sub> = 20%
  - Main memory: t<sub>hit</sub> = 50ns
- What is t<sub>avgD\$</sub> without an L2?
  - $t_{missD\$} = t_{hitM}$
  - $t_{avgD\$} = t_{hitD\$} + \%_{missD\$} * t_{hitM} = 1ns + (0.05 * 50ns) = 3.5ns$
- What is t<sub>avgD\$</sub> with an L2?
  - $t_{missD\$} = t_{avgL2}$
  - $t_{avgL2} = t_{hitL2} + \%_{missL2} * t_{hitM} = 10ns + (0.2 * 50ns) = 20ns$
  - $t_{avgD\$} = t_{hitD\$} + \%_{missD\$} * t_{avgL2} = 1ns + (0.05 * 20ns) = 2ns$
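
The same calculation composes naturally as a function applied from the bottom of the hierarchy upward. A minimal sketch follows; the function name is hypothetical.

```c
/* Average access time of one level: t_avg = t_hit + miss_rate * t_miss,
   where t_miss is the average access time of the next level down. */
double t_avg(double t_hit, double miss_rate, double t_miss) {
    return t_hit + miss_rate * t_miss;
}
```

Without an L2 the D\$ average is `t_avg(1.0, 0.05, 50.0)`. With an L2, the L2 average `t_avg(10.0, 0.20, 50.0)` becomes the D\$ miss time: `t_avg(1.0, 0.05, t_avg(10.0, 0.20, 50.0))`.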


### Performance Calculation II

- In a pipelined processor, I\$/D\$ t<sub>hit</sub> is "built in" (effectively
- Parameters
  - Base pipeline CPI = 1
  - Instruction mix: 30% loads/stores
  - I\$:  $\%_{\text{miss}}$  = 2%,  $t_{\text{miss}}$  = 10 cycles
  - D\$: %<sub>miss</sub> = 10%, t<sub>miss</sub> = 10 cycles
- What is new CPI?
  - $CPI_{I\$} = \%_{missI\$} * t_{missI\$} = 0.02 * 10 \text{ cycles} = 0.2 \text{ cycle}$
  - $CPI_{D\$} = \%_{memory} * \%_{missD\$} * t_{missD\$} = 0.30 * 0.10 * 10 \text{ cycles} = 0.3 \text{ cycle}$
  - $CPI_{new} = CPI + CPI_{I\$} + CPI_{D\$} = 1 + 0.2 + 0.3 = 1.5$
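
The CPI calculation can be written as one function. This is a sketch under the slide's assumptions (hit time folded into the base CPI, every instruction fetched through the I\$, only loads/stores touching the D\$); the function and parameter names are hypothetical.

```c
/* CPI including cache miss stalls. */
double cpi_with_misses(double base_cpi, double mem_frac,
                       double imiss_rate, double imiss_cycles,
                       double dmiss_rate, double dmiss_cycles) {
    double cpi_i = imiss_rate * imiss_cycles;             /* every instruction is fetched */
    double cpi_d = mem_frac * dmiss_rate * dmiss_cycles;  /* only loads/stores access the D$ */
    return base_cpi + cpi_i + cpi_d;
}
```

The slide's parameters correspond to `cpi_with_misses(1.0, 0.30, 0.02, 10.0, 0.10, 10.0)`.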


### An Energy Calculation

- Parameters
  - 2-way SA D\$
  - 10% miss rate
  - $5\mu W/access$  tag way,  $10\mu W/access$  data way
- What is power/access of parallel tag/data design?
  - Parallel: each access reads both tag ways, both data ways
    - Misses write additional tag way, data way (for fill)
  - $[2*5\mu W + 2*10\mu W] + [0.1*(5\mu W + 10\mu W)] = 31.5 \mu W/access$
- What is power/access of serial tag/data design?
  - Serial: each access reads both tag ways, one data way
    - Misses write additional tag way (actually...)
  - $[2 * 5\mu W + 0.9 * 10\mu W] + [0.1 * (5\mu W + 10\mu W)] = 20.5 \mu W/access$
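
Both formulas generalize to any associativity. The sketch below follows the slide's accounting (each access reads all tag ways; a miss writes one tag way and one data way for the fill; the serial design reads one data way only on a hit); the function names are hypothetical, and costs are in whatever per-access unit the inputs use.

```c
/* Parallel tag/data: every access reads all tag ways and all data ways. */
double parallel_cost(int ways, double tag, double data, double miss_rate) {
    return ways * tag + ways * data + miss_rate * (tag + data);
}

/* Serial tag/data: every access reads all tag ways, then one data way on a hit. */
double serial_cost(int ways, double tag, double data, double miss_rate) {
    return ways * tag + (1.0 - miss_rate) * data + miss_rate * (tag + data);
}
```

The slide's numbers correspond to `parallel_cost(2, 5.0, 10.0, 0.1)` and `serial_cost(2, 5.0, 10.0, 0.1)`. The gap between the two widens with associativity, since only the parallel design's data reads scale with the number of ways.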


### Summary

- Average access time of a memory component
  - latency<sub>avg</sub> = latency<sub>hit</sub> + %<sub>miss</sub> \* latency<sub>miss</sub>
  - Hard to get low latency<sub>hit</sub> and %<sub>miss</sub> in one structure → hierarchy
- Memory hierarchy
  - Cache (SRAM) → memory (DRAM) → swap (Disk)
  - Smaller, faster, more expensive → bigger, slower, cheaper
- Cache ABCs (capacity, associativity, block size)
  - 3C miss model: compulsory, capacity, conflict
- Performance optimizations
  - %<sub>miss</sub>: victim buffer, prefetching
  - latency<sub>miss</sub>: critical-word-first/early-restart, lockup-free design
- Power optimizations: way prediction, dynamic resizing
- Write issues
  - Write-back vs. write-through/write-allocate vs. write-no-allocate


### Backups

### SRAM Technology

- SRAM: static RAM
  - Static: bits directly connected to power/ground
  - Naturally/continuously "refreshed", never decay
  - Designed for speed
  - Implements all storage arrays in real processors
    - Register file, caches, branch predictor, etc.
    - Everything except pipeline latches
- Latches vs. SRAM
  - Latches: singleton word, always read/write same one
  - SRAM: array of words, always read/write different one
    - Address indicates which one

































### Multi-Porting an SRAM

- Why multi-porting?
  - Multiple accesses per cycle
- True multi-porting (physically adding a port) not good
  - + Any combination of accesses will work
  - − Increases access latency, energy $\propto$ P, area $\propto$ P<sup>2</sup>
- Another option: pipelining
  - Timeshare single port on clock edges (wave pipelining: no latches)
  - + Negligible area, latency, energy increase
  - − Not scalable beyond 2 ports
- Yet another option: replication
  - Don't laugh: used for register files, even caches (Alpha 21164)
  - Smaller and faster than true multi-porting: 2\*P<sup>2</sup> < (2\*P)<sup>2</sup>
  - + Adds read bandwidth, any combination of reads will work
  - − Doesn't add write bandwidth, not really scalable beyond 2 ports

### Still Yet Another Option: Banking (Interleaving)

- Divide SRAM into banks
  - Allow parallel access to different banks
  - Two accesses to same bank? Bank conflict, one waits
  - Low area, latency overhead for routing requests to banks
  - Few bank conflicts given sufficient number of banks
    - Rule of thumb: N simultaneous accesses → 2N banks
- How to divide words among banks?
  - Round robin: using address LSB (least significant bits)
    - Example: 16 word RAM divided into 4 banks
    - b0: 0,4,8,12; b1: 1,5,9,13; b2: 2,6,10,14; b3: 3,7,11,15
  - Why? Spatial locality
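
Round-robin interleaving on the address LSBs reduces to a mask and a shift. A minimal sketch, assuming word addresses and a power-of-two bank count matching the 4-bank example; the names are hypothetical.

```c
#define NUM_BANKS 4   /* power of two, as in the 16-word example */

/* Bank select: the low-order bits of the word address. */
unsigned bank_of(unsigned word_addr) {
    return word_addr & (NUM_BANKS - 1);
}

/* Row within the selected bank: the remaining high-order bits. */
unsigned row_in_bank(unsigned word_addr) {
    return word_addr / NUM_BANKS;
}

/* Two simultaneous accesses conflict when they map to the same bank. */
int bank_conflict(unsigned a, unsigned b) {
    return bank_of(a) == bank_of(b);
}
```

Consecutive words land in different banks, so accesses with spatial locality (for example words 4 and 5) proceed in parallel, while a stride equal to the bank count (words 1 and 9) always conflicts.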


























## **CAM Upshot**

- CAMs: effective but expensive
  - Matchlines are very expensive (for nasty EE reasons)
  - Used, but only for 16 or 32 way (max) associativity
  - Not for 1024-way associativity
    - No good way of doing something like that
    - No real need for it, either
