**Some questions**
==================

* When a client writes, how does it perceive latency?
  - fast, if there is space in NVRAM: just write to NVRAM and return
  - slow, if NVRAM is full: need to kick something out, which may incur RAID-1 writes
* What problems does the NVRAM help solve?
  - the consistent update problem
  - acts as a cache for the client
* How does AutoRAID solve the small-write problem of RAID-5?
  - like a log structure
  - try to append only, i.e. write to blank blocks
  - keep the contents of the old parity in NVRAM
  - hence avoid 2 reads: 1 of the old block, 1 of the old parity
* Why is there no concept of an old block in RAID-5?
  - updates to RAID-1 are in place
  - updates to RAID-5 are never in place; these writes mostly result from migration from RAID-1 to RAID-5
* Background activity (cleaning, hole plugging, migration) seems expensive. Does it affect client performance?
  - No, not really, because of NVRAM
* Can you come up with a workload that performs worst in AutoRAID?
  - One is thrashing, when the working set is bigger than the mirrored storage
* Overall, what is cool about this paper?
  - The behavior is dynamic: that makes it robust across workloads, across technology changes, and across the addition of new disks. It greatly simplifies management of the disk system
* What are the key points?
  - RAIDs are difficult to use well
  - Mix mirroring and RAID 5 automatically
  - Hide the magic behind a simple SCSI interface

**Basics about RAID**
=====================

See Remzi's notes.

Evaluating a RAID:
- Performance
- Capacity
- Reliability

Some asides:

Chunk size
----------
+ small chunk size:
  ~ increases parallelism of reads and writes to a single file, because many files get striped across many disks
  ~ but increases positioning time to access blocks across multiple disks
+ big chunk size:
  ~ reduces intra-file parallelism (note: only intra-file)
  ~ but decreases positioning time

"a small chunk size implies that many files will get striped across many disks, thus increasing the parallelism of reads and writes to a single file; however, the positioning time to access blocks across multiple disks increases, because the positioning time for the entire request is determined by the maximum of the positioning times of the requests across all drives."

"A big chunk size, on the other hand, reduces such intra-file parallelism, and thus relies on multiple concurrent requests to achieve high throughput. However, large chunk sizes reduce positioning time; if, for example, a single file fits within a chunk and thus is placed on a single disk, the positioning time incurred while accessing it will just be the positioning time of a single disk."

The Mapping problem
-------------------
+ map from a logical block address to a physical disk and offset

The Consistent update problem
-----------------------------
Occurs on a write to any RAID that has to update multiple disks during a single logical operation.
E.g.: an update to a logical block in RAID-1. RAID-1 has to issue two writes to separate disks. Suppose the update to the first disk finishes and then power is lost. As a result, the two copies of the block are now *inconsistent*.

Solution: write-ahead logging. Each RAID controller has non-volatile memory of some kind; replay the log on recovery.

NOTE: the parity disk in RAID is not for corruption detection; its main goal is to recover data after a disk failure. This process is called RECONSTRUCTION.
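To make reconstruction concrete, here is a minimal Python sketch (not from the paper or Remzi's notes; the block contents and the 3-data + 1-parity stripe are made up) showing that a lost block equals the XOR of the surviving blocks and the parity:

```python
# Minimal illustration of XOR parity and reconstruction,
# assuming fixed-size blocks represented as bytes.

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR byte-strings of equal length."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# A stripe of 3 data blocks plus 1 parity block (RAID-4/5 style).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*data)

# Disk 1 fails: rebuild its block from the survivors and the parity.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
reconstructed = xor_blocks(parity, *survivors)
assert reconstructed == data[lost]
```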
Comparison
----------
Notation:
  N: # disks in the array
  S: sequential throughput of a single disk
  R: random throughput of a single disk
  D: time that a request to a single disk would take

                 RAID-0     RAID-1            RAID-4     RAID-5
  --------------------------------------------------------------
  Capacity       N          N/2               N-1        N-1
  --------------------------------------------------------------
  Reliability    0          1 (for sure)      1          1
                            N/2 (if lucky)
  --------------------------------------------------------------
  Throughput
  - Seq. read    N.S        (N/2).S           (N-1).S    (N-1).S
  - Seq. write   N.S        (N/2).S           (N-1).S    (N-1).S
  - Ran. read    N.R        N.R               (N-1).R    N.R
  - Ran. write   N.R        (N/2).R           R/2        (1/4).N.R
  --------------------------------------------------------------
  Latency
  - read         D          D                 D          D
  - write        D          D (max)           2D         2D

**THE HP AutoRAID Hierarchical Storage System**
===============================================

# 0. ABSTRACT
- two-level storage hierarchy (RAID 1 & RAID 5) inside a single disk controller
- upper level for active data: RAID 1: full redundancy, excellent performance
- lower level for inactive data: RAID 5: excellent storage cost, lower performance
- migrates data blocks between the two levels transparently and automatically

# 1. INTRODUCTION
- RAID provides:
  + performance
  + reliability
  + capacity
- But RAID is also *difficult to use*:
  + each RAID level has
    > different performance characteristics
    > only a small range of workloads it suits
  + lots of configuration parameters
    ~ data layout, parity layout, stripe depth, stripe width, cache sizes, etc.
    ~ require workload knowledge to set correctly
    ~ changing the layout or adding capacity requires
      > reformatting
      > reloading
      ==> potential operator error
- Solution: AutoRAID
  + combine RAID 1 and RAID 5 in a single disk controller
    ~ may lose knowledge about files. How?
      > well, part of a file can be in RAID 1, another part may be in RAID 5
    ~ but automatic, adaptable, and easily deployable
  + RAID 1 has great performance, used to store active data
  + RAID 5 has cost-capacity benefits, used to store inactive/read-only data
- Question: why implement this storage hierarchy in the disk controller? Alternatives:
  + manually: error prone, cannot adapt quickly to workload changes
  + in the file system, per file: deployment problem (since there are many file systems)
- Major features:
  + mapping is done transparently
  + mirroring of write-active data: for best performance
  + RAID 5 for write-inactive data: for good read performance, best cost-capacity
  + adaptation to workload changes: promote/demote as the workload changes
  + log-structured RAID 5 writes:
    ~ solve the small-write problem of RAID 5
    ~ leverage NVRAM to store the old parity contents
    ~ write only to empty areas of disk, in an append fashion, hence no need to re-read old parity and old data

So clearly, AutoRAID needs:
- some mapping data structure
- some cleaning/migrating mechanism

# 2. THE TECHNOLOGY

# 2.1 HP AutoRAID Hardware
See Fig. 2 for more:
- similar to a regular RAID array
- set of disks
- intelligent controller, which contains:
  + microprocessor
  + parity logic circuit
  + caches
  + non-volatile memory
- From the client/host computer's point of view, AutoRAID provides Logical Unit (LUN) interfaces, each of which is an array of logical blocks

# 2.2 DATA LAYOUT (!important)

Physical Data Layout (Fig. 3 and 4)
-----------------------------------
- PEX (Physical EXtent): the unit of physical space allocation on disk
- PEG (Physical Extent Group):
  + a group of PEXes
  + can be assigned to either RAID 1 or RAID 5
- Segments:
  + contiguous space on disk, often 128KB
  + each PEX is divided into a set of 128KB segments
  + the stripe unit for RAID 5
  + or half of a mirrored unit
- Stripe: one row of data and parity in RAID 5
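A toy Python sketch of how these layout objects nest. The 128KB segment size comes from the notes; the class names and the PEX size used below are assumptions made for illustration, not the paper's definitions:

```python
# Toy sketch (not the paper's code) of the PEX / PEG / segment nesting
# described above. The 128 KB segment size follows the notes; the class
# names and the 1 MB PEX size are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

SEGMENT_SIZE = 128 * 1024          # 128 KB, per the notes
PEX_SIZE = 1024 * 1024             # assumed PEX size for this example

@dataclass
class PEX:
    disk: int                      # which physical drive this extent lives on
    offset: int                    # byte offset of the extent on that drive
    def segments(self) -> List[int]:
        # Each PEX is carved into fixed-size segments; return their offsets.
        return [self.offset + i * SEGMENT_SIZE
                for i in range(PEX_SIZE // SEGMENT_SIZE)]

@dataclass
class PEG:
    level: str                     # "RAID1" or "RAID5": the class this PEG serves
    pexes: List[PEX]               # PEXes on different disks backing this PEG

# A RAID-5 PEG striped across three disks.
peg = PEG("RAID5", [PEX(disk=d, offset=0) for d in range(3)])
# One stripe is one segment from each PEX (data + parity).
stripe0: List[Tuple[int, int]] = [(pex.disk, pex.segments()[0]) for pex in peg.pexes]
print(stripe0)
```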
Logical Data Layout
-------------------
- AutoRAID provides an interface of LUNs, each of which is an array of RBs
- RB = Relocation Block
  + typically 64KB
  + the basic unit of migration in the system
- How should we decide the RB size?
  + small RB size:
    ~ more mapping information
    ~ increases time spent on disk seeks and rotational delays
  + larger RB size:
    ~ increases migration cost if only a small amount of data in the RB is updated
    ~ why? because when the data is written --> the whole RB is promoted to RAID 1

Mapping structure
-----------------
- Virtual device tables:
  + one per LUN
  + list of RBs and pointers to the PEGs in which they reside
  + summary info
- PEG tables:
  + one per PEG
  + hold the list of RBs in the PEG and the list of PEXes used to store them
  + stats: access times, free space, some history (for cleaning and GC)
- These data structures need to be persisted somewhere...

# 2.3 NORMAL OPERATION (Important!)

What happens on a client read?
------------------------------
1. if the data is in the caches, return it, and update some stats
2. else allocate space in the front-end buffer cache
3. dispatch the read to back-end storage

What happens on a client write?
-------------------------------
1. allocate space in the NVRAM buffer
2. if there is no space, flush dirty data first --> back-end write or promotion
3. once space is available, write the data to NVRAM and return

What happens when dirty data is flushed?
----------------------------------------
IF the data is in RAID-1 THEN back-end write
ELSE promote the RB from RAID-5 to RAID-1

Hence, promotion happens on a write to data that belongs to the RAID-5 class:
- allocate space in the RAID-1 class
- if there is no space, demote some RAID-1 data to RAID-5
- (now that there is space) write the data to the RAID-1 class
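A rough Python sketch of this front-end write/flush/promote flow. The class, method names, and capacities are invented; only the control flow (buffer in NVRAM, flush when full, promote on a write to RAID-5 data, demote when RAID-1 is full) follows the notes:

```python
# Rough sketch of the write path above (names and sizes are invented; only the
# control flow -- buffer in NVRAM, flush, promote/demote -- follows the notes).

class AutoRaidFrontEnd:
    def __init__(self, nvram_slots: int):
        self.nvram = {}                    # RB id -> data buffered in NVRAM
        self.nvram_slots = nvram_slots
        self.raid1 = set()                 # RBs currently in the mirrored class
        self.raid1_capacity = 4            # toy limit on mirrored RB slots

    def client_write(self, rb: int, data: bytes) -> None:
        # Fast path: land the write in NVRAM and acknowledge the client.
        while len(self.nvram) >= self.nvram_slots:
            self.flush_one()               # slow path: make room first
        self.nvram[rb] = data

    def flush_one(self) -> None:
        rb, data = self.nvram.popitem()
        if rb not in self.raid1:
            self.promote(rb)               # write to RAID-5 data => promote it
        self.backend_raid1_write(rb, data)

    def promote(self, rb: int) -> None:
        if len(self.raid1) >= self.raid1_capacity:
            victim = self.raid1.pop()      # placeholder policy; see migration notes
            self.demote_to_raid5(victim)
        self.raid1.add(rb)

    def backend_raid1_write(self, rb, data): pass   # two mirrored disk writes
    def demote_to_raid5(self, rb): pass              # appended to the RAID-5 log

# Usage: with only 2 NVRAM slots, later writes force flushes and promotions.
fe = AutoRaidFrontEnd(nvram_slots=2)
for block in range(5):
    fe.client_write(block, b"data")
```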
What happens on a RAID-1 back-end read and write?
-------------------------------------------------
- Read:
  + pick one of the copies and issue the request to the associated disk
  + which disk to choose?
    ~ randomly
    ~ strictly alternating
    ~ shortest disk queue
    ~ shortest seek (requires knowledge about disk geometry and head positions)
    ~ inner/outer
- Write:
  + causes 2 writes, one to each of the two disks
  + returns only when both copies have been updated
  + Question: consistent update problem? Solution: NVRAM

What happens on a RAID-5 back-end read and write?
-------------------------------------------------
- Read:
  + normal case: a read is issued to the disk that holds the data
  + recovery case: the data may have to be reconstructed from the other blocks in the same stripe
- Write: do something different from normal RAID-5 writes to mitigate the small-write problem
  + lay out storage as a log
  + freshly demoted RBs are appended to the end of the log,
    i.e. always overwrite virgin/blank storage, hence no need to read the old data and the old parity
  + 3 modes:
    ~ per-RB writes: 2 IOs only
      + write the data block
      + write the parity block
      + keep the contents of the old parity block in NVRAM, to:
        > avoid a read of the old parity block
        > deal with the consistent update problem (in case of power loss)
        How:
        1) if the data block did not make it to disk ==> ok, do nothing
        2) if the data block made it to disk but the parity block did not, the data on disk and the parity in NVRAM match; just copy the parity from NVRAM to disk
    ~ batched writes:
      + the parity is written only after all data RBs in a stripe have been written, or at the end of the batch
      + N+1 IOs: N to write the RBs, 1 for the parity block
      + to deal with failures, keep in NVRAM:
        > the prior contents of the parity blocks
        > the index of the highest-numbered RB that contains valid data
    ~ normal small writes in RAID 5 (read-modify-write):
      + 4 IOs
      + used when there is no space for the logging style
      + used when we don't want to promote the RB to RAID-1 (e.g. during background migration)

# 2.4 BACKGROUND OPERATIONS
So there are at least a few background operations:
- garbage collection (cleaning, hole plugging)
- Migration: moving RBs between levels
- Balancing

Cleaning and Hole-Plugging for RAID-1
-------------------------------------
- Why are there holes in RAID-1?
  + RBs are demoted to RAID-5 (to get space for more active data)
- These holes are used to:
  + hold promoted RBs
    ~ when a RAID-5 RB is written to, it is promoted to RAID-1
  + hold newly created RBs
- Cleaning: when a new PEG is needed for RAID-5 but there are not enough free PEXes
  + pick one RAID-1 PEG and migrate its data out to fill holes in other RAID-1 PEGs
  + the PEG can then be reclaimed and reallocated to RAID-5

Cleaning and Hole-Plugging for RAID-5
-------------------------------------
- Why are there holes in RAID-5?
  + RBs are promoted to the RAID-1 class (when RBs get updated)
  + holes cannot be used directly because of the logging style ==> garbage
- Garbage collection:
  + hole-plugging:
    ~ when the RAID-5 PEG containing the holes is almost full
    ~ copy RBs from another RAID-5 PEG with a small number of RBs to fill the holes of the almost-full one
    ~ minimizes data movement
    ~ when to do it: only when idle. Why? It requires reading the rest of the stripe and recalculating the parity
  + cleaning (similar to LFS):
    ~ if the PEG containing the holes is almost empty, and there are no other holes to plug
    ~ take all remaining RBs in the PEG and append them to the end of the current RAID-5 log
    ~ reclaim the complete PEG as a unit

Background Migration
--------------------
- migrate RBs from RAID-1 to RAID-5
- to keep enough empty RB slots in RAID-1 for write bursts
- which RBs to migrate: LEAST RECENTLY WRITTEN
- when to migrate: the number of empty RAID-1 RB slots drops under a threshold (see the sketch at the end of these notes)

Balancing: adjusting data layout across drives
----------------------------------------------
- when a new drive is added to the array
- migrate PEXes between disks to equalize the amount of data stored on each disk
- done in the background, when the system is idle
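As referenced in the migration notes above, a small Python sketch of the background migration policy: demote the least-recently-written RBs when the number of free RAID-1 RB slots drops below a threshold. The function name, slot counts, and threshold are illustrative assumptions, not the paper's values:

```python
# Small sketch of the background migration policy: when free RAID-1 RB slots
# drop below a threshold, demote the least recently written RBs to RAID-5.
# The names and numbers are illustrative, not the paper's.
import heapq

RAID1_SLOTS = 100
FREE_THRESHOLD = 10

def background_migration(raid1_rbs: dict) -> list:
    """raid1_rbs maps RB id -> last-write timestamp; returns RBs to demote."""
    demoted = []
    free_slots = RAID1_SLOTS - len(raid1_rbs)
    if free_slots >= FREE_THRESHOLD:
        return demoted                      # enough headroom for write bursts
    need = FREE_THRESHOLD - free_slots
    # Pick the least recently written RBs (coldest write activity).
    victims = heapq.nsmallest(need, raid1_rbs, key=raid1_rbs.get)
    for rb in victims:
        del raid1_rbs[rb]                   # demote: append RB to the RAID-5 log
        demoted.append(rb)
    return demoted

# Example: 95 RBs in RAID-1 => only 5 free slots, so 5 cold RBs get demoted.
rbs = {rb: ts for ts, rb in enumerate(range(95))}
print(background_migration(rbs))            # -> [0, 1, 2, 3, 4]
```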