RAID

Sections:
1. Background
2. RAID through duplication
   2.1. RAID 0: striping
   2.2. RAID 1: mirroring
   2.3. RAID 10: striping with mirroring
3. RAID through parity
   3.1. RAID 4: parity
   3.2. RAID 5: rotated parity
   3.3. RAID 6: double parity

1. Background

A key part of the success of a database system is simply the reliability of the system. Our goal is 100% availability, that is, for the server to always be available for executing SQL queries. This is not possible in practice, but 99% availability is.

To achieve high availability, we need to avoid problems such as power outages, software updates leading to system reboots, and disk failures.

Power outages are relatively straightforward: We use a UPS (uninterruptible power supply), which is just a bank of batteries that can sustain the system during a short power outage. A good UPS will also flatten out fluctuations that commonly arise in a power line, so that the server sees a much cleaner current than what is supplied by the power company. To deal with extended power outages, we also use a generator, but the UPS will still be necessary while the generator is starting. Naturally, we'll want each piece of equipment plugged into two different UPSs so that if one goes bad, we can replace it without needing to shut down the system.

System reboots are most easily handled by having multiple servers, though this brings up problems of its own that can be addressed either by using a distributed database or by a complex system called a storage area network, in which multiple servers have access to the same disks.

Dealing with disk failures is typically accomplished using a RAID (redundant array of independent disks), which allows several physical disks to be connected together to act as a single large virtual disk. A high-quality RAID allows hot-swapping, which permits any of the physical disks to be replaced while the system continues to do its work.

The main focus of our study is the several approaches to designing a RAID system.

2. RAID through duplication

The simpler forms of RAID involve achieving redundancy through saving information in multiple spots.

2.1. RAID 0: striping

Actually, this form involves no duplication — it simply distributes the data across disks, which allows for better capacity and performance than any single disk.

disk 0   disk 1   disk 2   disk 3
B0       B1       B2       B3
B4       B5       B6       B7
B8       B9       B10      B11
:        :        :        :

Capacity: We are getting as much capacity as possible from the disks.

Performance: Multiple processes can conceivably access the virtual disk simultaneously, provided the blocks they request happen to map to different disks.

Reliability: Very poor: The failure of any of the disks means that we have lost data.
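The block-to-disk mapping in the table above can be sketched in a few lines of Python. (`NUM_DISKS` and `locate` are hypothetical names chosen for illustration, not part of any real RAID implementation.)

```python
# Sketch of RAID 0 striping: virtual block b lives on disk b mod n,
# at offset b div n within that disk.
NUM_DISKS = 4

def locate(virtual_block):
    """Map a virtual block number to (physical disk, offset on that disk)."""
    return virtual_block % NUM_DISKS, virtual_block // NUM_DISKS

# Matching the table: B6 is in the second stripe, on disk 2.
print(locate(6))   # (2, 1)
```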

2.2. RAID 1: mirroring

With RAID 1, we use all disks to hold identical information.

disk 0   disk 1   disk 2   disk 3
B0       B0       B0       B0
B1       B1       B1       B1
B2       B2       B2       B2
:        :        :        :

Capacity: Poor: no matter how many disks are in the array, the usable capacity is only that of a single disk.

Performance: Multiple processes can conceivably read the virtual disk simultaneously, even if they are requesting the same block. However, a write requires changing all disks.

Reliability: Excellent: We can afford to lose all but one disk without losing any data.
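These semantics can be sketched with a toy model in Python (the dict-based disk model and the function names are made up for illustration):

```python
# Toy model of RAID 1: each disk is a dict from block number to data.
disks = [{} for _ in range(4)]

def write(block, data):
    for d in disks:            # a write must update every mirror
        d[block] = data

def read(block, disk=0):
    return disks[disk][block]  # a read can be served by any one disk

write(0, "B0")
print(read(0, disk=3))   # "B0" -- every mirror holds the same data
```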

2.3. RAID 10: striping with mirroring

Combining RAID 1 with RAID 0, we arrive at what is sometimes written as RAID 1+0, sometimes as RAID 10. Here, we do striping but then mirror each of the disks.

Following is a 2 + 2 RAID 10 configuration (striped across two disks, with each disk mirrored). We could instead do a 4 + 4 configuration (striped across four disks, with each disk mirrored) or a 3 + 3 + 3 configuration (striped across three disks, with three copies of each disk).

disk 0   disk 1   disk 2   disk 3
B0       B0       B1       B1
B2       B2       B3       B3
B4       B4       B5       B5
:        :        :        :

Capacity: We provide more capacity than any single disk, but just half of the overall capacity across all disks. (Or, if we have three copies of each disk, we provide just a third of the overall capacity.)

Performance: Multiple processes can read the virtual disk simultaneously, even if they are requesting the same block. A write requires changing several disks, but it still allows concurrent reads and concurrent writes.

Reliability: Very good: We can afford to lose any one disk, and we can often afford to lose more if they happen to be mirrored differently.
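For the 2 + 2 layout shown above, the mapping can be sketched as follows (again with hypothetical names; a 4 + 4 layout would just use a larger number of pairs):

```python
# Sketch of a 2 + 2 RAID 10 layout: blocks are striped across two
# mirror pairs; pair p occupies physical disks 2p and 2p+1.
NUM_PAIRS = 2

def locate(virtual_block):
    """Return ((disk, mirror disk), stripe offset) for a virtual block."""
    pair = virtual_block % NUM_PAIRS
    offset = virtual_block // NUM_PAIRS
    return (2 * pair, 2 * pair + 1), offset

# Matching the table: B3 is mirrored on disks 2 and 3, in the second stripe.
print(locate(3))   # ((2, 3), 1)
```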

3. RAID through parity

3.1. RAID 4: parity

To recover from a failed disk while still providing more capacity, we can use parity: For each set of corresponding bits, we store a parity bit indicating whether an odd number of the bits are 1. That is, the first bit on disk 3 indicates whether an odd number of the first bits of the other disks are 1; the second bit on disk 3 indicates whether an odd number of the second bits of the other disks are 1; and so forth.

Determining whether the number of bits is odd can be computed using the exclusive-or function (represented using ^ in many languages). This binary operator yields 1 if exactly one of its arguments is 1, but it yields 0 if zero or both of its arguments are 1 (that is, if there are an even number of 1's).
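A quick illustration in Python, using three made-up 4-bit values: each bit of the XOR result is 1 exactly when an odd number of the corresponding data bits are 1.

```python
# Parity via XOR: bit i of the parity is 1 exactly when an odd
# number of the three bits a_i, b_i, c_i are 1.
a, b, c = 0b1010, 0b0110, 0b0011
parity = a ^ b ^ c
print(format(parity, '04b'))   # 1111
```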

disk 0   disk 1   disk 2   disk 3
B0       B1       B2       B0 ^ B1 ^ B2
B3       B4       B5       B3 ^ B4 ^ B5
B6       B7       B8       B6 ^ B7 ^ B8
:        :        :        :

Capacity: The capacity provided is that of all disks except for the parity disk.

Performance: For reading, this works as well as striping, though the parity disk is never useful for that. For writing, we can first read the current value of the disk holding the data and the parity disk for its stripe; and then we can write the new value to the destination disk and the parity disk. We can compute the new parity by taking the exclusive-or of the old parity, the old value, and the new value. For example, if the old values of the blocks in that stripe are a, b, c, and we're writing a new value B to the second of these, then we can compute the new parity as follows:

  (a ^ b ^ c) ^ b ^ B    XOR of old parity, old value, and new value
= a ^ (b ^ b ^ B) ^ c    since XOR is associative and commutative
= a ^ (0 ^ B) ^ c        since anything XOR'd with itself is 0
= a ^ B ^ c              since 0 XOR X = X for any X

However, this performance is worse than a single disk since it requires reading the old values, and since the parity will become a single bottleneck when several processes are trying to write simultaneously.
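The update rule derived above can be sketched in a few lines of Python (the helper name and the values are made up for illustration):

```python
def new_parity(old_parity, old_value, new_value):
    # The other data disks never need to be read.
    return old_parity ^ old_value ^ new_value

a, b, c = 0b1010, 0b0110, 0b0011
old_p = a ^ b ^ c          # parity before the write
B = 0b1101                 # new value for the second block
# Same result as recomputing the parity from scratch:
assert new_parity(old_p, b, B) == a ^ B ^ c
```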

Reliability: OK: We can recover from losing any one disk. If we lose the parity disk, the data is intact, and we can recover simply by recomputing the parity. If we lose a data disk, we can recover it by taking the XOR of the remaining disks, as demonstrated by the following proof that c is computed by taking the XOR of a, b, and the parity.

  a ^ b ^ (a ^ b ^ c)      XOR of two remaining disks and the parity disk
= (a ^ a) ^ (b ^ b) ^ c    since XOR is associative and commutative
= 0 ^ 0 ^ c                since anything XOR'd with itself is 0
= c                        since 0 XOR X = X for any X
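The recovery rule can be sketched the same way (`recover` is a hypothetical helper, and the values are made up):

```python
from functools import reduce
from operator import xor

def recover(surviving_blocks):
    # The lost block is the XOR of everything that remains:
    # the surviving data blocks plus the parity block.
    return reduce(xor, surviving_blocks)

a, b, c = 0b1010, 0b0110, 0b0011
parity = a ^ b ^ c
# Suppose the disk holding c has failed:
assert recover([a, b, parity]) == c
```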

3.2. RAID 5: rotated parity

To improve performance, we can instead change which disk acts as the parity disk for each stripe. Since there is little downside to this, you rarely find RAID 4 being used in practice.

disk 0   disk 1         disk 2         disk 3
B0       B1             B2             B0 ^ B1 ^ B2
B3       B4             B3 ^ B4 ^ B5   B5
B6       B6 ^ B7 ^ B8   B7             B8
:        :              :              :

Capacity: Identical to RAID 4

Performance: Reads are now distributed across all disks. A write requires reading and writing two disks, but the disks could be different for different writes.

Reliability: Identical to RAID 4
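The rotation in the table above can be captured by a small formula. (`parity_disk` is a hypothetical name, and real RAID 5 implementations vary in the exact rotation pattern they use; this sketch just matches the table.)

```python
def parity_disk(stripe, num_disks=4):
    # Parity starts on the last disk and moves left one disk per stripe.
    return (num_disks - 1 - stripe) % num_disks

# Matching the table: stripes 0, 1, 2 keep parity on disks 3, 2, 1.
print([parity_disk(s) for s in range(3)])   # [3, 2, 1]
```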

3.3. RAID 6: double parity

Two simultaneous disk failures are still reasonably likely, and RAID 4 and RAID 5 don't guard against this. We can handle two disk failures if we add a second parity-like function. The additional function required is somewhat complex, so I'll just avoid describing it, instead writing Q for the function.

disk 0    disk 1      disk 2      disk 3
B0        B1          B0 ^ B1     Q(B0, B1)
B2        B2 ^ B3     Q(B2, B3)   B3
B4 ^ B5   Q(B4, B5)   B4          B5
:         :           :           :

Capacity: Provides the total capacity of all but two of the disks.

Performance: Reads are distributed across all disks. A write requires reading and writing three physical disks.

Reliability: Can tolerate up to two simultaneous disk failures.