Final Review A: Questions

Rfa.1.

Filesystems can represent long files by listing either blocks or extents. What is the difference, and what is the advantage of each?

Rfa.2.

With flash memory, an erase operation takes much more time than a write operation (milliseconds versus microseconds). What distinguishes these two operations?

Rfa.3.

Identify two differences between designing a filesystem for flash memory versus designing for a hard disk.

Rfa.4.

Explain the shortcomings of traditional simple filesystems (like FAT) that lead to the introduction of journaling filesystems, and describe how journaling filesystems overcome these shortcomings.

Rfa.5.

In a COW filesystem, blocks are never updated: Instead, given a request to modify block B, the filesystem creates a new block B'. Naturally, any block referencing B must be changed to reference B' instead — but in updating those referencing blocks, we must copy them as well.

That works well until we get to the root block; the operating system must be able to locate the root block at bootup time, so it can't simply be copied to anywhere else on disk. How can a COW filesystem allow for copy-on-write changes to the root block so that the root block can be identified efficiently on bootup?

Rfa.6.

If we are using two disks in a RAID 1 array, and each disk has an expected lifetime of 10 years, then you might expect it to take 15 years until both disks fail. Why is the expected lifetime of the RAID 1 array far, far longer than 15 years?

Rfa.7.

In a RAID 5 array with four disks, we get the space for three disks and can recover from any single one of the disks failing. How is data stored across the four disks, and how does the recovery work when one disk fails?

Rfa.8.

How is that RAID 5 can provide faster performance compared to the simpler approach of using just one large disk?

Rfa.9.

Suppose we have a RAID 4 array of three disks: A, B, and a third containing the exclusive OR of these two disks (A ^ B). If disk B fails, how can we recover its contents from the remaining two disks? Why does this work?

Rfa.10.

Given eight disks, you're tasked with recommending either RAID 5 (a 7-plus-1 configuration) or RAID 10 (a 2-plus-2-plus-2-plus-2 configuration). Choose one to recommend and justify your recommendation, with specific reference to how RAID 5 and 10 work.

Final Review A: Solutions

Rfa.1.

An extent describes a sequence of adjacent blocks, consisting of the identifier of the first block in the sequence and the number of blocks following it. A file that is reasonably defragmented can be represented by just a handful of extents, leading to much less space being used to represent the blocks in the file. For a highly fragmented filesystem, though, one might as well store the block identifiers separately rather than list extents of just one or two blocks each.

In addition to differences in space utilization, there is also the issue of how quickly the filesystem can identify a block given a request to skip directly to the middle of a file. The block-based approach has the potential to be much better, as one can easily find the middle of the list of block identifiers; but with extents, the algorithm must take into account the sizes of all previous extents. However, when the number of extents is small (as is typical for reasonably defragmented systems), there is no particular advantage here — and in fact the extent-based system could be faster since the list of extents might be loaded in a single block, whereas the list of blocks might require accessing several blocks in the list.

Rfa.2.

Setting a block of bits to 1 is an erase, while resetting a selected bit to 0 is called a write. Note that setting to 1 can only be done in large blocks, whereas existing 1's can be changed to 0's individually.

Rfa.3.
  • In designing for a hard disk, data placement is a major consideration: Files are much more efficient when they are at adjacent locations to avoid having to move the disk head as the file is read.

  • Flash memory hardware needs to be notified when a block becomes available, so that it does not waste time copying the block elsewhere as part of its wear-leveling.

  • A flash-based filesystem can perform better if writes are batched together.

  • Because flash supports a limited number of writes to each block, the filesystem should avoid frequently changing data. For example, the filesystem may not want to include in a file's metadata the time when a file was most recently read.

Rfa.4.

One major problem with simple filesystems is how to ensure that the filesystem remains in good shape even if the system goes down in the middle of a sequence of related writes to disk (perhaps due to a power outage). To deal with recovering from such system crashes, the traditional filesystem performs a check at bootup, but this check can take inordinately long for modern disk sizes, and even if it finds a problem it may not be able to reconstruct what the data should be.

Journaling filesystems avoid these problems by including a separate log of actions that are being performed. For any set of changes, it first writes a description of the upcoming changes into the log, concluding with a log record saying that the set of changes is “committed.” The filesystem ensures that this log data is actually on disk before actually changing anything on disk. On bootup, the filesystem goes through the log to find the “committed” changes, ensuring that those changes are reflected on disk. This is much less work than checking the entire filesystem.

Rfa.5.

A COW filesystem can have a fixed-size set of potential root block locations, each with a timestamp reflecting when that block was written. When the root block is to be changed, it gets copied over the oldest available block in this fixed-size set. On bootup, the system can iterate through each location and see which block has the newest timestamp.

Rfa.6.

When the first disk fails, we would expect the operator to replace it promptly with a third disk, and all data from the still-working disk could be copied over immediately to get this RAID into the same state as the original RAID. When one of these disks fails, we would replace it with a fourth disk, and it could too recover. The only way the RAID 1 array would fail would be if the working disk happens to fail in between the time that its opposite fails and its new opposite gets a copy of its data. Ideally, that interval would be fairly short (within a day, say), and so the chance that the working disk fails during that interval should be fairly small.

Rfa.7.

We group blocks into groups of three, and each group is stored across parallel disks, with the fourth containing the bitwise XOR of these three blocks. The disk containing the XOR block is rotated among the four disks, so that different groups' XOR blocks are likely on different disks: This is to distribute the load on disks.

When one disk fails, the missing data is recovered by taking the XOR of the remaining three disks. For each group where the failing disk contained the XOR block, this obviously gets the correct information. For groups where the failing disk contained one of the three “real” blocks, it also works since XOR is associative and commmutative, using the following derivation: A ^ B ^ (A ^ B ^ C) = (A ^ A) ^ (B ^ B) ^ C = 0 ^ 0 ^ CC.

Rfa.8.

In RAID 5, blocks are “striped” among disks. With a sequence of read requests to several blocks, the blocks will be on different disks, and so multiple requests can be services simultaneously. As a result, there's less waiting time on average.

Rfa.9.

We can recover its contents by taking the exclusive OR of A with the parity disk A ^ B. This works because exclusive OR is associative, because X ^ X is 0, and because 0 ^ X is X: A ^ (A ^ B) = (A ^ A) ^ B = 0 ^ BB.

Rfa.10.

Recommending RAID 10:

  • A RAID 10 array can often tolerate multiple disk failures (if they happen not to be a pair of mirrored drives), but a RAID 5 array can tolerate only one disk failure.
  • In writing to a RAID 5 array, the new parity block must be computed, which involves performing at least two reads before the writing takes place. With a RAID 10 array, we can write directly to the disk and its mirror.
  • For a RAID 5 array receiving two read requests simultaneously, there is a chance that the two reads will map to the same disk and so one will have to be delayed. These could always be served simultaneously on a RAID 10 array, since if they map to the same disk they could read from different mirrors.

Recommending RAID 5:

  • If D represents the capacity of one of these eight disks, a RAID 10 array's capacity will be only 4 D , whereas a RAID 5 array's capacity will be 7 D.