RAID: Redundant Array of Inexpensive/Independent Disks
======================================================

Why talk about RAID?
--------------------

- RAID: Redundant array of independent/inexpensive disks
- Introduced in 1988
- Idea that governs storage solutions
    Almost all important systems protect data with RAID
    The lab, the data center (stores bank accounts), Google
    Google for: storage server, Dell blade server
- Won "Test of Time" award

Why use multiple disks?
-----------------------

- Capacity: need more disks to store more data
    . Each disk has a limit
- Performance: parallel access
    . Remember that disk I/O is still the major bottleneck today
- Reliability: can handle a disk failure
    . Even the most expensive disk can fail

RAID and File system
--------------------

- Redundant array of independent disks:
    but remember, a file system only works with an array of
    logical block numbers (LBNs),
    so how would we manage more than one disk?
    Add another level of indirection!
    A linear array of LBNs is a powerful abstraction of a disk.

- Hardware RAID: a box you attach to a computer; it looks like a
  single disk to the computer, but internally has lots of disks

- Software RAID: runs in the OS as a layer below the file system, and
  makes a bunch of disks attached to the PC look like one big disk

RAID Management
---------------

- Many levels of RAID. Given a number of disks, you can arrange
  your data in many different ways.

- How to manage RAID? Need to know your workload!!
    Always: know your workload before designing a system!

RAID-Naive
----------

FS: 0 ................................ 399

Software/Hardware RAID: Map LBN to disk# and block offset

   Disk0   Disk1   Disk2   Disk3
   0       100     200     300
   1       101     201     ..
   2       102     202     ..
   ..      ..      ..      ..
   99      199     299     399

- Disk         = BlockAddress / (Disk Capacity)   (integer division)
- Block Offset = BlockAddress % (Disk Capacity)

User:
   write LBN 2   == Disk #0, offset #2
   write LBN 102 == Disk #1, offset #2

Evaluate along three axes: Performance, Reliability, Capacity

- RAID-Naive Capacity: Great (given D disks, can use all for user data)
- RAID-Naive Reliability: Poor (one disk failure leads to entire volume corruption)
- RAID-Naive Performance: Very Poor

- What's the bad thing?
  - What's a common workload?
    . Random writes? 4 users, .. fine
    . How about sequential writes? (e.g. cp /flash/video1.wmv video2.wmv)
        A user writes to blocks 0 to 50 .. bad! Only 1 disk is busy
    . How about sequential reads? (e.g. cp video2.wmv /flash/video1.wmv)
        Same problem: only 1 disk is busy
    . The workload is not load-balanced across the disks

Simplest RAID Level: RAID-0 (Striping)
---------------------------------------

- Mapping: Lay out blocks across drives

FS: 0 ................................ 399

Software/Hardware RAID: Map LBN to disk# and block offset

   Disk0   Disk1   Disk2   Disk3
   0       1       2       3
   4       5       6       7
   8       9       10      11
   ...

- How to calculate where a particular block is?
    Disk         = BlockAddress % #Disks
    Block Offset = BlockAddress / #Disks   (integer division)

  e.g., Read block 2
    Disk   = 2 % 4 = 2   (Disk #2)
    Offset = 2 / 4 = 0   (block offset #0)

  e.g., Read block 7
    Disk   = 7 % 4 = 3   (Disk #3)
    Offset = 7 / 4 = 1   (block offset #1)

Evaluate along three axes: Performance, Reliability, Capacity

- RAID 0 Capacity: Great (given D disks, can use all for user data)
- RAID 0 Performance: Great (can use all disks in parallel on reads/writes)
- RAID 0 Reliability: Poor (one disk failure leads to entire volume corruption)

Can we do better? RAID-1 (Mirroring)
------------------------------------

- Let's improve reliability!
- Make sure each block has a copy on another disk
- Example layout:

   Disk0   Disk1   |   Disk2   Disk3
   0       1       |   0       1
   2       3       |   2       3
   ...     7       |   ...     7
   ...     199     |   ...     199

Mapping:
    Disk#        = BlockAddress % (#Disks/2)
    Block Offset = BlockAddress / (#Disks/2)   (integer division)
    In addition, replicate to (disk# + 2), i.e., disk# + #Disks/2

  Write 2
    Disk         = 2 % 2 = 0
    Block offset = 2 / 2 = 1
    Write to disk #0, block offset #1
    Write to disk #2, block offset #1

  Write 7
    Disk         = 7 % 2 = 1
    Block offset = 7 / 2 = 3
    Write to disk #1, block offset #3
    Write to disk #3, block offset #3

Evaluate along three axes: Performance, Reliability, Capacity

- RAID 1 Capacity: D/2 (only half of the D disks is usable by the user), bad!

- RAID 1 Reliability: Can tolerate any one disk failure (for sure);
    the scheme above can tolerate D/2 failures (if lucky, i.e., if no
    two failed disks hold copies of the same blocks)

- RAID 1 Performance:
    Read: Good - Reads can go to all disks
      Ex: two users, A accesses blocks 0 to 50, B accesses blocks 100 to 150
        Use disk0 and disk1 for user A
        Use disk2 and disk3 for user B
        (load balancing, go to less busy disks)
        In RAID-0, cannot do this. Lots of seeks!

    Write: Writes get half the bandwidth
      What is bandwidth? #bytes/sec

      Let's say for each disk, bw: 1 block / sec
        Write 100 blocks = 100 seconds
        100 blocks / 100 seconds = 1 block / sec

      How about if we have 4 disks, and perform RAID-0?
        25 blocks go to each disk = 25 seconds
        100 blocks / 25 seconds = 4 blocks / sec

      How about if we have 4 disks, and perform RAID-1?
        50 blocks go to each disk = 50 seconds
        100 blocks / 50 seconds = 2 blocks / sec
        Because each logical write to RAID-1 leads to
        2 physical writes (one to each copy)

RAID-4 (Parity)
---------------

- Let's improve Capacity! And hopefully maintain reliability!
- Use parity to encode info about each block

What is parity? XOR.

Imagine we have 4 bit containers, and we use all 4 containers
to store our bits. If we lose one, we lose our file!
So let's use 1 bit container for a parity bit,
and the other 3 containers to store our bits.

   X   Y   XOR
   0   0   0
   0   1   1
   1   0   1
   1   1   0

parity bit: 1 if the number of 1's in the data is odd
          : 0 if the number of 1's in the data is even

(In other words, the parity bit makes the total number of 1's in the
row, data plus parity, even. How? XOR. E.g., if you have the data
(0 1 0 0), the parity bit is 1, so that the number of 1's in the row
is even. If you have the data (0 1 1 0), the parity bit is 0, again
to make sure the number of 1's in the row is even.)

   parity(c0,c1,c2) = c0 XOR c1 XOR c2

   c0  c1  c2  |  parity
   1   0   1   |  0
   0   1   0   |  1
   0   0   1   |  1
   1   1   0   |  0
   1   0   0   |  1
   0   0   0   |  0

A disk is just a bunch of bits; hence, each column can be seen as a disk.
What happens if we lose a disk? Bring in a new disk!
Fill the new disk with the XOR of the 3 remaining disks.

- Example layout:

   Disk0   Disk1   Disk2   Disk3
   0       1       2       Parity(0,1,2)
   3       4       5       Parity(3,4,5)
   ...

- Parity checks the blocks in the row

Evaluate along three axes: Performance, Reliability, Capacity

- RAID 4 Capacity: (D-1 of D) disks used for data
    4 disks, space overhead is 1 disk / 4 disks = 25 %
    8 disks, space overhead is 1 disk / 8 disks = 12.5 %

- RAID 4 Reliability: Can tolerate any one disk failure

- RAID 4 Performance: Good for reads (can go to all but one disk),
    less good for writes (requires 2 reads and 2 writes)

- What happens if I want to modify a block?
    - Read the block (normal)
    - Write the block: I need to update the parity too.

      B1  B2  B3  |  P
      ---------------
      1   0   1   |  0     (before)
      0   0   1   |  1     (after changing B1 to 0)

    new P = new B1 XOR B2 XOR B3
    (read 3 blocks, write 2 blocks --> bad!)
    How about if there are 8 disks in the system?
    (read 7 blocks, write 2 blocks --> bad!)

- Better: new P = old P XOR old data XOR new data

      B1  B2  B3  |  P
      ---------------
      1   0   1   |  0

    Change B1 to 0
    new P = old P XOR old B1 XOR new B1
          = 0 XOR 1 XOR 0
          = 1

      B1  B2  B3  |  P
      ---------------
      0   0   1   |  1

    (read 2 blocks, write 2 blocks)

- Why so many reads and writes on a single write to RAID-4?
    Must read the old data and the old parity in order to compare the
    new block with the old block and hence generate the new parity.

    Hence, a random write in RAID-4 (or 5) is bad, because it needs to
    update the parity (which implies 2 reads and 2 writes)

- Major problem with RAID-4: Parity-disk bottleneck!!!!
    All parity reads/writes must go to the single parity disk, so
    write bandwidth is reduced to 1 block / sec.
    Say disks 0, 1, 2 are updated at different rows. The data writes
    can proceed in parallel, but the parity writes can't.

Solution? RAID-5 (Rotated Parity)
---------------------------------

Spreads the parity across all disks
(but still requires 4 I/Os per write)

- Example layout:

   Disk0             Disk1           Disk2           Disk3
   0                 1               2               Parity(0,1,2)
   3                 4               Parity(3,4,5)   5
   6                 Parity(6,7,8)   7               8
   Parity(9,10,11)   9               10              11

The Hardware/Software RAID needs to route each write to the right disk.

Conclusions:
- RAID turns multiple disks into one bigger, faster, more reliable disk
- RAID-0 is good when performance really matters but reliability doesn't
- RAID-1 is good when capacity and cost don't matter but write
  performance does (each write costs 2 physical writes, versus
  2 reads and 2 writes in RAID-4/5)
- RAID-5 is good for capacity and when cost matters,
  or if the workload is read-mostly

Solution? RAID-6 (Double parity)
--------------------------------

Let's get even better reliability!
Survives 2 disk failures.
The math is complex.
Can extend this to 3 disk failures, etc.

In real life:
-------------

- Another level of indirection
    - RAID 01
    - RAID 10
    - RAID 15
    - RAID 51

- Recap:
    - Design, know your workload, reiterate the design process
      NFS, AFS, Google
      Google: multiple levels of protection ..
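
The address-mapping formulas and the parity rules from the sections above can be tried out in a few lines of Python. This is only a minimal sketch under stated assumptions, not a real RAID implementation: the function names (naive_map, raid0_map, raid1_map, raid5_map, parity, updated_parity, reconstruct) are made up for illustration, blocks are modeled as plain integers, and the RAID-5 rotation simply follows the example layout in these notes (real arrays may rotate parity differently).

```python
from functools import reduce


def naive_map(lbn, disk_capacity):
    """RAID-Naive: fill one disk completely, then move on to the next."""
    return lbn // disk_capacity, lbn % disk_capacity


def raid0_map(lbn, num_disks):
    """RAID-0 striping: consecutive blocks go to consecutive disks."""
    return lbn % num_disks, lbn // num_disks


def raid1_map(lbn, num_disks):
    """RAID-1 as in the notes: stripe over the first half of the disks
    and mirror each block onto (disk + num_disks // 2)."""
    half = num_disks // 2
    disk, offset = lbn % half, lbn // half
    return [(disk, offset), (disk + half, offset)]


def raid5_map(lbn, num_disks):
    """RAID-5 with the rotation of the example layout (an assumption):
    parity sits on the last disk in row 0 and moves one disk left per row."""
    row, index = divmod(lbn, num_disks - 1)
    parity_disk = (num_disks - 1 - row) % num_disks
    # Data blocks fill the non-parity disks from left to right.
    disk = index if index < parity_disk else index + 1
    return disk, row


def parity(blocks):
    """XOR of all blocks in a row."""
    return reduce(lambda a, b: a ^ b, blocks, 0)


def updated_parity(old_parity, old_data, new_data):
    """Small-write rule: new P = old P XOR old data XOR new data."""
    return old_parity ^ old_data ^ new_data


def reconstruct(surviving_blocks, row_parity):
    """Rebuild the single lost block from the survivors and the parity."""
    return parity(surviving_blocks) ^ row_parity


if __name__ == "__main__":
    print(naive_map(102, 100))      # (1, 2): disk #1, offset #2
    print(raid0_map(7, 4))          # (3, 1): disk #3, offset #1
    print(raid1_map(7, 4))          # [(1, 3), (3, 3)]
    print(raid5_map(5, 4))          # (3, 1): block 5 sits right of the parity
    print(updated_parity(0, 1, 0))  # 1, as in the B1 example
    print(reconstruct([1, 1], 0))   # 0: lost c1 of the row (1 0 1 | 0)
```

The small-write rule touches only the changed data block and the parity block, which is why it needs 2 reads and 2 writes no matter how many disks are in the array.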