A Case for Redundant Arrays of Inexpensive Disks (RAID)

D. A. Patterson, G. Gibson, and R. H. Katz @ UC-Berkeley

Proceedings of the ACM SIGMOD International Conference on Management of Data, 1988, pages 109-116

Motivation and Problem

Increasing performance gap between processor/memory and disk
Buffer caches  or a cache in disk can't help everything since some applications, like transaction processing and seismic processing, have access patterns not well suited to caches
Amdahl's law: Effective speed-up = 1 / [ f / k  + ( 1 - f ) ]
k = speed up
f = fraction of work done by the components whose speeds are boosted up

A solution: stripping

Array of inexpensive disks connected to SCSI bus

Each disk has its own controller


Interleaving for bulk data access, or

parallel read/write for small independent accesses

Reliability issue

simple stripping won't work well because reliability is compromised by the large number of components

MTBF (array) = MTBF (individual disk) / # of disks in the array

100 disks -> every 2 weeks

A Better Solution: RAID

Use extra disks for redundant information, e.g., checksum or parity, and repair faults of disks from those extra disks

Partition disks into groups and assign some extra check disks to each group

MTTR (Mean Time To Repair)

MTBF (group) = [MTBF (individual disk) / # of disks in the group] * [1/prob(another failure in the group before repair)]

prob(another failure) = MTTR / MTBF(remaining disks in the group)

MTBF(raid) = MTBF(group) / # of group = MTBF(individual disk)2/[(D+C*n) * (G+C-1) * MTTR]

D = total # of disks

G = # of data disks in a group

C = # of check disks in a group

n = # of groups

Some metrics

Reliability overhead = check-disks / data-disks

Usable storage capacity percentage = just reverse of reliability overhead

Performance: small independent reads Vs big reads

Effective performance = performance boost Vs check disk overhead

First Level RAID: Mirrored Disks

Every data disk has an identical copy (a mirror)
Write to both disks simultaneously
Read from one of mirrors -- read can be interleaved or done independently by doubling controllers
Simple & high reliability
Write bandwidth is halved, storage space is halved
failure of a disk in a group leads to bottleneck (chained declustering can help with this)

Second Level RAID: Hamming Code for ECC

Bit interleaving
Use Hamming codes to detect and correct errors
Small reads/writes issue
To check and/or update Hamming code, every disk in a group should be accessed
If disk I/O unit is 512 bytes and group size = 10 with C = 4, then at least 14 * 512 bytes should be accessed per read or write operation
Lower cost (40% overhead usually) than level 1 (100%)
Performance for large I/O is as good as or better than level 1
Performance is bad for small reads/writes: 6 ~ 9 % of level 1

Third Level RAID: Single Check Disk Per Group

Level 2 uses Hamming code to determine which disk failed
Most disk controllers are good enough to detect (hard) error or even to correct (soft) error from checksum per sector
Parity per group is enough to correct error if we know which disk has failed!
Bit interleaving
Use parity to correct the bit of the faulty disk
Lower cost (4 ~ 10% overhead)
Less disks per group means less synchronization cost
Performance is still bad for small reads/writes

Fourth Level RAID: Independent Reads/Writes

Bit level interleaving
Pros: use all disks simultaneously; load balancing
only 1 I/O at a time per group
if disks are not synchronized, biggest rotation delay is taken account
Sector (or disk I/O unit) interleaving
Use parity to correct the bit of the faulty disk
Read - just read the data disk having the sector
Read the data disk having the sector
Read the parity disk
Write to the data disk
Write to the parity disk
new parity = (old-data XOR new-data) XOR old-parity
Same cost as level 2 & 3
Bulk I/O is as fast as level 2 & 3
Small I/O is a lot faster than level 2 & 3
Small modify is still slow. But this type of access is rare!
Parity disk should be accessed every time and thus is bottleneck
Only one write is allowed at a time per group
LFS can help with this since it writes out to a log instead of updating in place

Fifth Level RAID: No Single Check Disk

Removes the distinction between data and parity disks by distributing all data and parity over all disks in a group
small writes do not bottleneck on the check disk
the system may be able to continue operation during a single failure since parity information is on other disks (but the system is vulnerable to data loss until reconstruction completes.)
small writes still need to update parity information (a read and write unless parity information is in the host's memory), but the bottlenecking on a single check disk is less likely


For small I/O workload, use level 1
For bulk I/O workload or you can't afford the capacity overload of mirroring, use level 5


see Remzi's note