Filesystems can represent long files by listing either
blocks or extents. What is the difference, and
what is the advantage of each?
An extent describes a sequence of adjacent blocks, consisting
of the identifier of the first block in the sequence and the
number of blocks following it. A file that is
reasonably defragmented can be represented by just a handful of
extents, leading to much less space being used to represent the blocks
in the file. For a highly fragmented filesystem, though, one might
as well store the block identifiers separately rather than list extents
of just one or two blocks each.
In addition to differences in space utilization, there is
also the issue of how quickly the filesystem can identify
a block given a request to skip directly to the middle of a file.
The block-based approach has the potential to be much better, as one can
easily find the middle of the list of block identifiers; but with extents,
the algorithm must take into account the sizes of all previous
extents. However, when the number of extents is small (as is
typical for reasonably defragmented systems), there is no
particular advantage here — and in fact the extent-based
system could be faster since the list of extents might be
loaded in a single block, whereas the list of blocks might
require accessing several blocks in the list.
With flash memory, an erase operation
takes much more time than a write operation
(milliseconds versus microseconds).
What distinguishes these two operations?
Setting a block of bits to 1 is an erase, while resetting
a selected bit to 0 is called a write. Note that
setting to 1 can only be done in large blocks, whereas existing
1's can be changed to 0's individually.
Identify two differences between designing a filesystem for
flash memory versus designing for a hard disk.
In designing for a hard disk, data placement
is a major consideration: Files are much more efficient when
they are at adjacent locations
to avoid having to move the disk head as the file is
Flash memory hardware needs to be notified when a block
becomes available, so that it does not waste time copying the
block elsewhere as part of its wear-leveling.
A flash-based filesystem can perform better if writes are
Because flash supports a limited number of writes to each
block, the filesystem should avoid frequently changing data.
For example, the filesystem may not want to include in a file's
metadata the time when a file was most recently read.
Explain the shortcomings of traditional simple filesystems
(like FAT) that lead to the introduction of journaling
filesystems, and describe how journaling filesystems overcome
One major problem with simple filesystems is how to ensure
that the filesystem remains in good shape even if the
system goes down in the middle of a sequence of related writes to disk
(perhaps due to a power outage).
To deal with recovering from such system crashes,
the traditional filesystem performs a check at bootup,
but this check can take inordinately long for modern disk sizes,
and even if it finds a problem it may not be able to reconstruct
what the data should be.
Journaling filesystems avoid these problems by including a
separate log of actions that are being performed.
For any set of changes, it first writes a
description of the upcoming changes into the log, concluding
with a log record saying that the set of changes is
“committed.” The filesystem ensures that this log
data is actually on disk before actually changing anything on
disk. On bootup, the filesystem goes through the log to
find the “committed” changes, ensuring that those
changes are reflected on disk. This is much less work than
checking the entire filesystem.
In a COW filesystem, blocks are never updated: Instead, given
a request to modify block B, the filesystem
creates a new block B'. Naturally, any block referencing
B must be changed to reference B' instead —
but in updating those referencing blocks, we must copy them as
That works well until we get to the root block; the
operating system must be able to locate the root block
at bootup time, so it can't simply be copied to anywhere else on
disk. How can a COW filesystem allow for copy-on-write changes
to the root block so that the root block can be identified
efficiently on bootup?
A COW filesystem can have a fixed-size set of potential root
block locations, each with a timestamp reflecting when that block was
written. When the root block is to be changed, it gets
copied over the oldest available block in this fixed-size set.
On bootup, the system can iterate through each location and see
which block has the newest timestamp.
If we are using two disks in a RAID 1 array, and each disk
has an expected lifetime of 10 years, then you might expect it
to take 15 years until both disks fail. Why is the expected
lifetime of the RAID 1 array far, far longer than 15 years?
When the first disk fails, we would expect the operator to
replace it promptly with a third disk, and all data from the
still-working disk could be copied over immediately to get this
RAID into the same state as the original RAID. When one of
these disks fails, we would replace it with a fourth disk, and
it could too recover. The only way the RAID 1 array would
fail would be if the working disk happens to fail in between the
time that its opposite fails and its new opposite gets a copy of
its data. Ideally, that interval would be fairly short (within a
day, say), and so the
chance that the working disk fails during that interval
should be fairly small.
In a RAID 5 array with four disks, we get the space for three
disks and can recover from any single one of the disks failing.
How is data stored across the four disks, and how does the
recovery work when one disk fails?
We group blocks into groups of three, and each group is
stored across parallel disks, with the fourth containing the
bitwise XOR of these three blocks. The disk containing the XOR
block is rotated among the four disks, so that different groups'
XOR blocks are likely on different disks: This is to distribute
the load on disks.
When one disk fails, the missing data is recovered by
taking the XOR of the remaining three disks.
For each group where the failing disk contained the XOR block,
this obviously gets the correct information.
For groups where the failing disk contained one of the three
“real” blocks, it also works since
XOR is associative and commmutative, using the following
A ^ B ^ (A ^ B ^ C) = (A ^ A) ^ (B ^ B) ^ C = 0 ^ 0 ^ C = C.
How is that RAID 5 can provide faster performance compared to
the simpler approach of using just one large disk?
In RAID 5, blocks are “striped” among disks. With a
sequence of read requests to several blocks, the blocks will be
on different disks, and so multiple requests can be services
simultaneously. As a result, there's less waiting time on
Suppose we have a RAID 4 array of three disks:
A, B, and a third containing the exclusive OR of
these two disks (A ^ B). If disk B fails, how
can we recover its contents from the remaining two disks? Why does
We can recover its contents by taking the exclusive OR
of A with the parity disk A ^ B.
This works because exclusive OR is associative,
because X ^ X is 0,
and because 0 ^ X is X:
A ^ (A ^ B) = (A ^ A) ^ B = 0 ^ B = B.
Given eight disks, you're tasked with recommending either RAID 5
(a 7-plus-1 configuration) or RAID 10 (a 2-plus-2-plus-2-plus-2
Choose one to recommend and justify your recommendation,
with specific reference to how RAID 5 and 10 work.
Recommending RAID 10:
A RAID 10 array can often tolerate multiple disk failures
(if they happen not to be a pair of mirrored drives), but a
RAID 5 array can tolerate only one disk failure.
In writing to a RAID 5 array, the new parity block must be
computed, which involves performing at least two reads before
the writing takes place. With a RAID 10 array, we can write
directly to the disk and its mirror.
For a RAID 5 array receiving two read requests simultaneously,
there is a chance that the two reads will map to the same disk
and so one will have to be delayed. These could always be served
simultaneously on a RAID 10 array, since if they map to the same
disk they could read from different mirrors.
Recommending RAID 5:
If D represents the capacity of one of these eight
disks, a RAID 10 array's capacity will be only 4 D ,
whereas a RAID 5 array's capacity will be 7 D.