Filesystem performance & reliability

Efficiency

To improve disk performance, we commonly have a cache of recently used disk blocks, called the buffer cache. Overall, RAM contains the virtual page frames and the buffer cache; in fact, many operating systems manage these together using LRU, so that the number of pages used for the buffer cache grows or shrinks depending on how heavily the disk is used relative to virtual memory.

One common optimization is pre-fetching: When the operating system reads one block, the OS tries to predict which blocks will be needed next and reads them, too. An easy example: If a program reads the beginning of the file, the OS can easily predict that the program is likely to read past the beginning so it would go ahead and fetch the first several blocks.

Another optimization is write-back caching: One can have a write-through buffer cache, where each change requested to a block immediately goes to disk. But with a write-back buffer cache, the change is not sent to disk until the OS decides it wants to use the buffer for some other purpose.

While the write-back approach leads to fewer writes to the disk, it leads to real problems with consistency: A pure implementation means that a user might execute a program to modify the disk, but the computer may crash before any of the buffers end up getting written to disk. Even worse: The computer may crash after a random subset of the modified buffers are written to disk! A typical approach is to ensure that buffers are written to disk in the following situations:

the buffer is needed for another block.
the file containing the buffer is closed.
the program executes a flush to force all current buffers in the file to go to disk.
the block remains modified in memory for 30 seconds.

Obviously, this doesn't fix the problem of consistency entirely, though it does reduce it dramatically.

In practice, to avoid inconsistency, when programs are saving modified versions of files, they often create a temporary file, write the modified text into the temporary file, close the file (forcing it to disk), and then rename the temporary file to replace the old version.

Reliability

Regular file operations often involve changing several disk blocks. For example, in creating a file in Unix, we must write to the inode with the file information, add the file entry into its directory, and possibly create the first data block of the disk. In moving a file from one directory to another, we must delete the entry from the first directory and insert the entry into the second. But what if the computer crashes between two of these operations? Then the filesystem could enter an inconsistent state, confusing the OS to the point that the filesystem can't be used at all.

Ad hoc reliability

The traditional way of avoiding inconsistency is to simply be careful about the order in which we do things. Roughly, we work from the bottom up: We'd first create the data block, then create the inode that uses the data block, and only after the new file is entirely ready would be insert it into its directory. Or in moving a file, we'd add the entry into its destination directory before deleting the entry from its source directory, so that at least the file is never entirely lost.

Even using this ad hoc techniques, problems can still crop up. For this reason, on startup the operating system will perform a filesystem check (called fsck on Unix) to ensure that the filesystem is in good shape. This, however, begins to break down with larger disks: Checking all of a large disk can easily take 15 minutes, which is far longer than we want to wait for the computer to start up.

Logging/journaling filesystems

Most modern filesystems (including NTFS (Windows), HFS+ (Macintosh), and ext3/ext4 (Linux) use a log, which is an additional file that essentially exists separate from the filesystem. Each operation on the filesystem is called a transaction. Before modifying the normal filesystem at all, we first write a description of the full transaction into the log and ensure that it is entirely written to disk, and only then do we allow the regular filesystem to be modified.

On bootup, rather than checking the filesystem, the OS just reads through the log to ensure that all logged transactions are indeed completed (re-performing the operations when necessary). Obviously, the filesystem will have a procedure for removing information from the log once they are wholly committed to disk, so that bootup does not need to check through a large log.

Most filesystems like this log modifications to file metadata only. We can call these journaling filesystems, to distinguish them from logging filesystems, where changes to the data within files is also logged. All the above-listed filesystems journaling, except that ext3/ext4 are journaling by default but can be configured to do full logging.

Copy-On-Write filesystems

A more recent alternative are filesystems that never change existing blocks: When we ask to modify a block, we instead write a new version of the block elsewhere on the disk. This idea is called copy on write, or COW.

One problem is that there are pointers between blocks: For example, a directory consists of inode identifiers (which include block references), and each inode includes references to the blocks containing the file data (directly or indirectly). So when we create a copy of one block, we have to change any blocks referring to that block; and changing each of those blocks means we have to change any blocks referring to those blocks, and so on until we reach the root node of the system.

However, we can't just create a copy of the root node anywhere, since the OS needs to find the root node when the OS restarts. The basic solution is to include a fixed set of root nodes, each including a timestamp saying when it was created and a checksum. When booting, the OS goes through the set and finds which has the newest timestamp but has a checksum that works out. (The checksum is to deal with the possibility of a crash happening in the middle of writing a new version of the root node.)

Another problem that crops up with COW systems is that they do involve allocating lots of blocks. This is avoided in part by delaying writes for several seconds, with the hope of coalescing multiple writes to the same block into just one write. We still keep a log so that we can recover should the computer crash in the midst of waiting for more writes.

A very nice advantage to COW filesystems is that the system can easily support snapshots — that is, one can view the filesystem from one hour ago, or one day ago, or even one month ago, by just keeping a different root node around.

The idea of COW filesystems started with ZFS in 2001, and it is still entering the mainstream. Under Windows, ReFS is a COW filesystem that is intended to be the successor to NTFS. The btrfs project aims to provide a COW filesystem to Linux; it is widely seen as the likely successor to ext4. Some additional btrfs features:

Snapshots
Incremental backups: One can ask to save only the blocks that have changed since the last backup.
Extent-based (ext4 and ZFS are block-based), where adjacent blocks are represented by the initial block number and how many blocks follow it. This allows for more compact representation of files.
Compression of files as they are stored
Deduplication: When files are found to contain duplicate blocks, only one block needs to be stored
SSD awareness: avoid optimizations for avoiding movement of the (nonexistent) disk head, batch writes together