Session 33: Filesystem research

RAID (not in text)
  motivation
  varieties of RAID
log-structured filesystem (Section 5.3.6)
  motivation
  filesystem structure
  analysis

Today I want to talk about two recent innovations in filesystems, to give you an idea of what OS research is typically like. Neither is all that recent - RAID came about in the mid-1980's, and log-structured filesystems in the late 1980's to early 1990's. But they were just glimmers in researchers' eyes then; by now, they've entered the mainstream. RAID has certainly arrived; log-structured filesystems still have some ways to go, but they're slowly coming along too.

RAID

Motivation

Much good OS research tends to extrapolate current trends, observe what possibilities or problems arise, propose an alternative to current thinking, and finally implement the proposal to demonstrate its viability. RAID followed this pattern.

By the mid-1980's, researchers understood that processors were basically doubling in speed every 18 months (a consequence of Moore's Law), apparently without bound. But disk speeds were improving far more slowly, leading people to think that disk access would soon be the overriding bottleneck in most computing systems.

The one happy thing about disks is that they've gotten cheaper at an exponential rate (if you measure the cost in terms of pennies per byte, at any rate).

Putting these facts together, it makes sense to put several disks into the same computer and split accesses between them, reducing the wait time before a disk request can be satisfied. Hence RAID, which stands for Redundant Array of Inexpensive Disks. (Manufacturers will tell you the I stands for Independent, a decision made for advertising reasons. That makes some sense: a RAID array in the marketplace is much more expensive than a single regular disk, though of course you get higher performance for the cost.)

One important point about RAID is that it can be implemented in either of two ways. It can be done in software, in which case the disks are off-the-shelf, regular disks, and the OS implements everything about RAID. Or it can be done in the controller, in which case the OS needs to know next to nothing about managing the RAID: as far as the CPU and software are concerned, the RAID array looks just like a regular disk. (The OS still has to change a little, since to get much benefit from RAID it must sometimes issue several requests to the controller simultaneously. But that's minor relative to implementing the RAID layout in the OS.) The latter is what a RAID array purchased from a manufacturer does, and it provides higher performance. The former is cheaper; it's supported by Windows NT, though limitations in the PC bus architecture limit how many disks you can attach.

Varieties of RAID

The inventors of RAID proposed six different types of RAID, named Level 0 through Level 5. The term level is a bit confusing, since there isn't really a notion of one level being above another. But that's what they call it, so that's what we're sticking with.

Level 0

In level 0, you simply place blocks cyclically among the disks.

+----------+ +----------+ +----------+ +----------+ 
| Sector 12| | Sector 13| | Sector 14| | Sector 15|
+----------+ +----------+ +----------+ +----------+ 
| Sector 8 | | Sector 9 | | Sector 10| | Sector 11|
+----------+ +----------+ +----------+ +----------+ 
| Sector 4 | | Sector 5 | | Sector 6 | | Sector 7 |
+----------+ +----------+ +----------+ +----------+ 
| Sector 0 | | Sector 1 | | Sector 2 | | Sector 3 |
+----------+ +----------+ +----------+ +----------+ 
   Disk 0       Disk 1       Disk 2       Disk 3
(In practice, sectors are rotated in groups called strips, instead of putting each successive sector on the next disk. Multi-sector files tend to be placed close together, and it's preferable not to hit all the disks for a single file.)

The advantage of this layout is that, if there are a multitude of requests, they will likely be split more or less evenly between the disks, and each disk arm can work separately trying to serve its share. Thus, we've potentially cut the wait time for getting the disk arm by a factor of 4 (or however many disks there are in the RAID).
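To make the layout concrete, here is a small sketch (in Python, purely illustrative - the constants and function name are my own, not from any real driver) of the address translation a RAID Level 0 implementation performs:

```python
NUM_DISKS = 4
STRIP_SECTORS = 1   # 1 matches the diagram; real arrays use larger strips

def locate(sector):
    """Map a logical sector to (disk, physical sector) under RAID 0
    striping. With STRIP_SECTORS = 1, successive sectors rotate across
    the disks exactly as in the diagram above."""
    strip, offset = divmod(sector, STRIP_SECTORS)
    disk = strip % NUM_DISKS
    physical = (strip // NUM_DISKS) * STRIP_SECTORS + offset
    return disk, physical

print(locate(14))   # → (2, 3): Sector 14 sits on Disk 2, in its 4th slot
```

Raising STRIP_SECTORS gives the strip-based layout mentioned above, where runs of consecutive sectors stay on one disk before rotating to the next.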

However, Level 0 RAID has a major problem: it hurts reliability. Disks fail. A single disk may have a lifespan of 4 years, but with 8 disks in the RAID, chances are good that one of them will fail within a year. As soon as that disk fails, we've lost a more or less random 1/8th of our data.
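A rough back-of-the-envelope calculation shows why this is worrying. Assuming disks fail independently, each with some annual failure probability p (the 20% figure below is an illustrative stand-in for a roughly 4-year lifespan, not a measured number):

```python
def array_failure_prob(p, n):
    """A RAID 0 array survives the year only if every disk does, so
    it fails with probability 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n

# 8 disks, each with a 20% chance of failing this year:
print(round(array_failure_prob(0.20, 8), 3))   # → 0.832
```

So an 8-disk Level 0 array is far more likely to lose data in a year than any one of its disks.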

Level 1

Level 1 RAID addresses this reliability problem by taking advantage of the R in RAID: Redundant. In this scheme, we keep two copies of every disk. After all, disks are cheap. Why not just buy twice as many?

+----------+ +----------+ +----------+ +----------+ 
| Sector 6 | | Sector 7 | | Sector 6 | | Sector 7 |
+----------+ +----------+ +----------+ +----------+ 
| Sector 4 | | Sector 5 | | Sector 4 | | Sector 5 |
+----------+ +----------+ +----------+ +----------+ 
| Sector 2 | | Sector 3 | | Sector 2 | | Sector 3 |
+----------+ +----------+ +----------+ +----------+ 
| Sector 0 | | Sector 1 | | Sector 0 | | Sector 1 |
+----------+ +----------+ +----------+ +----------+ 
   Disk 0       Disk 1       Disk 2       Disk 3
This way, if a disk goes bad, there's another copy waiting. So we can simply turn the computer off, install a replacement disk, copy from the remaining copy onto the new disk, and continue where we left off.

(Many fancy RAID arrays have what's called hot-swapping, which allows the system administrator to replace the disk while the computer is still running. This addresses the fact that in major server systems (like a reservations system or a Web site), each hour of server downtime can cost the business tens of thousands of dollars or more. Even when a disk goes bad, you need the server to keep running as normal.)

This technique doubles the cost per byte of disk space relative to Level 0 - but, as I said, the disks aren't all that expensive. It can also improve read performance, since two reads of the same data can be split between the two mirrors. (This is the problem we addressed in Assignment 4.) Writes aren't improved much, since any write has to be saved on both disks.
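A minimal sketch of that read-splitting idea, with illustrative names (no real driver works exactly this way): each read goes to whichever mirror currently has the shorter request queue.

```python
import collections

# Two mirrored copies of the same data, each with its own request queue:
queues = {"mirror_a": collections.deque(), "mirror_b": collections.deque()}

def dispatch_read(sector):
    """Send the read to whichever mirror is currently less busy."""
    target = min(queues, key=lambda m: len(queues[m]))
    queues[target].append(sector)
    return target

print(dispatch_read(100), dispatch_read(101))   # → mirror_a mirror_b
```

With two pending reads, each mirror serves one, so the two seeks proceed in parallel.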

And of course Level 1 increases the reliability of disks dramatically. Businesses still do backups of their RAIDs, but that's to protect against user mistakes and catastrophic damage (like a fire burning the building holding the server). Disk crashes become almost no concern, except for the small cost involved in replacing a crashed disk.

Level 2

Level 2 is totally different - it works at a totally different level of granularity. Here, we rotate bits across the disks. (Or perhaps bytes, or words; at any rate, we're working at a much finer granularity.)

+------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+ 
|Bit3.0| |Bit3.1| |Bit3.2| |Bit3.3| |Bit3.4| |Bit3.5| |Bit3.6| |Bit3.7|
+------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+ 
|Bit2.0| |Bit2.1| |Bit2.2| |Bit2.3| |Bit2.4| |Bit2.5| |Bit2.6| |Bit2.7|
+------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+ 
|Bit1.0| |Bit1.1| |Bit1.2| |Bit1.3| |Bit1.4| |Bit1.5| |Bit1.6| |Bit1.7|
+------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+ 
|Bit0.0| |Bit0.1| |Bit0.2| |Bit0.3| |Bit0.4| |Bit0.5| |Bit0.6| |Bit0.7|
+------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+ 
 Disk 0   Disk 1   Disk 2   Disk 3   Disk 4   Disk 5   Disk 6   Disk 7

Level 2 doesn't really improve seek time much - we can't read an entire byte until every disk's arm has gotten to where it belongs. What Level 2 does is potentially increase the throughput of the array - how many bytes it can read per unit time. Disks only rotate so quickly (10,000rpm, maybe). With Level 2 RAID, the time to read a byte is cut to an eighth, so the bytes come off the disks much faster once the arms have reached where they belong.

RAID level 2 isn't used much in practice. But it makes some sense in applications where throughput is more important than seek time. (The log-structured filesystem is an example where perhaps this is applicable.)

Level 3

Level 3 is the same as Level 2, except that we add one more disk to hold parity information. For example, Bit 0 of Disk 8 would hold the XOR of Bit 0 on each of Disks 0 through 7. Taking the XOR essentially records whether the number of 1 bits stored in the corresponding position on the data disks is even or odd.

The reason for storing the parity is to protect against disk crashes: if one of the disks crashes, we can infer what was on it. For each bit, if the parity bit was 1 (indicating an odd number of 1's across the data disks) and we find an even number of 1's on the surviving disks, then there must have been a 1 on the crashed disk; if we find an odd number of 1's, there must have been a 0. The reasoning is reversed if the parity bit was 0.
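This recovery rule is easy to demonstrate in code. A small sketch (using four data disks rather than eight, for brevity; the names are illustrative):

```python
from functools import reduce

def parity(blocks):
    """XOR the corresponding bytes of each block together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Four data disks plus one parity disk:
data = [b"\x0f", b"\xf0", b"\xaa", b"\x55"]
par = parity(data)                       # stored on the parity disk

# Disk 2 crashes; XOR-ing everything that's left recovers its contents,
# because each surviving bit cancels itself out of the parity:
survivors = [data[0], data[1], data[3], par]
assert parity(survivors) == data[2]
```

The same XOR operation both computes the parity and reconstructs a lost disk, which is what makes the scheme so cheap.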

By keeping the parity, we've achieved nearly the reliability of Level 1, without having to buy twice as many disks.

Level 4

Level 4 takes this parity concept to the sector level.

+----------+ +----------+ +----------+ +----------+ 
| Sector 9 | | Sector 10| | Sector 11| |Parity9-11|
+----------+ +----------+ +----------+ +----------+ 
| Sector 6 | | Sector 7 | | Sector 8 | |Parity6-8 |
+----------+ +----------+ +----------+ +----------+ 
| Sector 3 | | Sector 4 | | Sector 5 | |Parity3-5 |
+----------+ +----------+ +----------+ +----------+ 
| Sector 0 | | Sector 1 | | Sector 2 | |Parity0-2 |
+----------+ +----------+ +----------+ +----------+ 
   Disk 0       Disk 1       Disk 2       Disk 3
We get the reliability of RAID Level 1, without the additional cost (we only need one extra disk to make this work).

Level 4 gives you increased performance on reads, but writes cause a problem: every time you write something, you have to change both the disk that contains the sector and the parity disk. The result is that every write involves the parity disk, so writes get no increase in performance (except insofar as they don't have to wait for pending reads of sectors on other disks).

You may balk a bit at that last paragraph: a naive implementation would have to read all the disks on every write, in order to recompute the parity sector to be saved on the parity disk. But in fact, since X XOR X = 0, we can take the existing parity sector, XOR it with the old data in the changed sector, and then XOR that with the new data for the changed sector. The result is the new parity to be saved. Hence, altering a sector can be implemented so that the write touches only the disk containing the sector and the parity disk.
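That update rule can be sketched in a few lines (illustrative Python, not from any real RAID implementation):

```python
def updated_parity(old_parity, old_data, new_data):
    """new parity = old parity XOR old data XOR new data: XOR-ing the
    old data back in cancels it out, and the new data takes its place."""
    return bytes(p ^ o ^ n
                 for p, o, n in zip(old_parity, old_data, new_data))

# Sanity check against a full recompute over three data sectors:
a, b, c = bytes([0x0f]), bytes([0xf0]), bytes([0xaa])
par = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))   # full parity
new_b = bytes([0x33])
assert updated_parity(par, b, new_b) == bytes([0x0f ^ 0x33 ^ 0xaa])
```

Only the changed sector's old contents and the old parity are needed, so a write touches two disks rather than all of them.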

Of course, when a drive crashes, inferring the missing data involves reading all the other drives. Thus, when a drive crashes, the system's performance suddenly degrades to roughly that of a single disk. (With Level 1, performance degrades much more gracefully.)

Level 5

Level 5 is a simple variation on Level 4: if the parity disk is the bottleneck, why not distribute the parity sectors across the disks?

+----------+ +----------+ +----------+ +----------+ 
|Parity9-11| | Sector 9 | | Sector 10| | Sector 11|
+----------+ +----------+ +----------+ +----------+ 
| Sector 6 | |Parity6-8 | | Sector 7 | | Sector 8 |
+----------+ +----------+ +----------+ +----------+ 
| Sector 3 | | Sector 4 | |Parity3-5 | | Sector 5 |
+----------+ +----------+ +----------+ +----------+ 
| Sector 0 | | Sector 1 | | Sector 2 | |Parity0-2 |
+----------+ +----------+ +----------+ +----------+ 
   Disk 0       Disk 1       Disk 2       Disk 3
This way, if there are lots of writes, they'll tend to change different sectors on different disks, and so the writes can occur in parallel. For example, in the above layout, we could write Sector 0 and Sector 4 simultaneously, since changing Sector 0 involves only Disks 0 and 3, while changing Sector 4 involves only Disks 1 and 2. (With only four disks such combinations are fairly rare; the case is more convincing with more disks in the array.)

Additionally, reads get to take advantage of that extra disk now, so a sequence of reads here is improved over a single disk by a factor of 4, whereas Level 4 would only improve by a factor of 3 (since the fourth disk was just totally ignored during reads).

Level 5 makes so much sense that basically nobody uses Level 4. The only real advantage of Level 4 is that it's a bit simpler, since it's easy to determine where the parity sector is. But the additional computation is trivial, and it's completely outweighed by the improved efficiency of Level 5.
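To see how trivial that computation is, here is a sketch of parity placement matching the rotating layout pictured above (four disks; the function names are my own):

```python
NUM_DISKS = 4

def parity_disk(stripe):
    """Parity rotates right-to-left across stripes:
    stripe 0 -> Disk 3, stripe 1 -> Disk 2, and so on, wrapping around."""
    return (NUM_DISKS - 1 - stripe) % NUM_DISKS

def data_disk(sector):
    """Each stripe holds NUM_DISKS - 1 data sectors; lay them out in
    order on the disks, skipping whichever disk holds this stripe's parity."""
    stripe, idx = divmod(sector, NUM_DISKS - 1)
    pd = parity_disk(stripe)
    return [d for d in range(NUM_DISKS) if d != pd][idx]

print(parity_disk(1), data_disk(4))   # → 2 1: Parity3-5 on Disk 2, Sector 4 on Disk 1
```

A couple of modulo operations and a skip - hardly enough extra work to justify Level 4's dedicated parity disk.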