Fsck in bcachefs:

This document is an overview of fsck in bcachefs, as well as what will need to change for snapshots.

In bcachefs, the fsck code is only responsible for checking higher level filesystem structure. Btree topology and allocation information are checked by btree_gc.c, which doesn't need to be aware of snapshots. This reduces the scope of fsck considerably.

What it does have to check is:

  • For each extent, dirent and xattr, that a corresponding inode exists and is of the appropriate type.

  • Dirents and xattrs indexed by hash, with linear probing. The fsck code is responsible for checking that they're stored at the correct index (in case there was a hash collision, every index between the hash value and the actual index must be occupied by keys or hash whiteouts) and that there are no duplicates.

  • Dirents: check that target inode exists and matches dirent's d_type, and check the target inode's backpointer fields.

  • Extents:

  • Check that extents do not overlap
  • Check that extents do not exist past inode's i_size
  • Count extents for each inode and check inode's i_sectors.

  • Check that root directory exists

  • Check that lost+found exists

  • Check directory stucture: currently, this is done with a depth-first traversal from the root directory. We check for directories with multiple links, and then scan the inodes btree for directories that weren't found by the DFS.

  • Check each inode's link count: At this point we know that every dirent exists in a directory (has a corresponding inode of the correct type), has a valid target, and is reachable (because we've checked that all directories are reachable). All that's left is to scan the dirents btree, counting the number of dirents that point to each inode number, and verifying that against each inode's link count.

Changes for snapshots:

Subvolumes vs. snapshots:

Filesystem code has to know what subvolume it is in, in order to fetch the current snapshot ID at the start of every btree transaction. Fsck code does not know or care what subvolume a given key belongs to - a key may belong to multiple subvolumes, if it's in a snapshot that's an interior node in the snapshots tree.

Instead, when one key references another (e.g. a dirent that points to an inode), it does the lookup for the inode by searching for an inode with a snapshot ID equal to or an ancestor of the dirent key's snapshot ID.

IOW, snaphsot IDs represent (branchable) "points in time", and fsck wants to check that every view of the filesystem, at every point in time, is consistent.

Checking the filesystem for consistency in this fashion is conceptually straightforward - every key has a snapshot ID, and we use that when looking up anything that it references. However, we'll have to be careful about how we repair inconsistencies when the snapshot ID is an interior node:

If we repair an inconsistency by deleting a key, we need to ensure that key is not referenced by child snapshots. For example, if we find an extent that does not have an S_ISREG inode, currently we delete it. But we might find an extent that doesn't have a corresponding inode in with a snapshot ID equal to or ancestor to the extents snapshot ID, but does have an inode in a child snapshot. If that were to occur, we should probably copy that inode to the extent's snapshot ID.

We would like it to be as hard a rule as possible that fsck never deletes or modifies keys that belong to interior node snapshot IDs, because when people are using snapshots they'll generally want the data in that snapshot to always stay intact and it would be rather bad form if data was torched due to a bug in fsck.

It may not be possible to adhere to this rule while fixing all filesystem inconsistencies - e.g. if extents exist above an inode's i_size, then generally speaking the right thing to do is to delete that extent. But if we find inconsistencies in at interior node snapshot IDs, then perhaps the best thing to do would be to just leave it, or only apply changes in leaf node snapshot IDs.

Algorithmic considerations:

We want bcachefs to scale well into the millions of snapshots. This means that there can't generally be operations that have to run for each subvolume or snapshot - the fsck algorithms have to work one key at a time, processing them in natural btree order.

This means that the extents, dirents and xattrs passes need to use BTREE_ITER_ALL_SNAPSHOTS, meaning that as they walk keys for a given inode, they'll be seeing keys from every snapshot version mixed together - and there will be multiple versions of the inode, in different snapshots, that they need to be checked against.

There's a helper in the fsck code that these passes use, walk_inode(), that fetches the inode for the current inode number when it changes. This will probably change to load every copy of that inode, in each snapshot ID. Also, where we have to maintain state we'll have to duplicate that state for every snapshot ID that exists at that inode number.

We'll also need to know, for every key that we process, which snapshot IDs it is visible in: so that e.g. the extents pass can correctly count i_sectors.

Note that "millions of snapshots" will not typically mean "millions of snaphot IDs at the same time" - when we allocate new inodes we pick inode numbers that aren't in use in any other snapshot or subvolume, and most files in a filesytem are created, written to once and then later atomically replaced with a rename call. When we're dealing with a file that keys with many different snapshot IDs we need to make sure we don't have any hidden O(n2) algorithms, but if fsck requires in the hundreds of megabytes of state to check these files with worst case numbers of overwrites, that's not the end of the world.

Checking directory structure:

The current approach of doing a DFS from the root directory still works with snapshots, but algorithmically is too slow. If we have many sparse snapshots, the DFS would recurse into each of those snapshots, walking directory structure it has already seen in the original subvolume - but those snapshots could have directory structure changes in one or two places, so it would have to. We need an algorithm that iterators over keys, not filesystem structure.

Fortunately, since backpointers from inodes to dirents were recently added we can take a different approach. We walk the inodes btree, and for every directory inode we find we chase backpointers until we get to the root directory. Note that since previous passes have already checked backpointers, we're really just check for loops here. This will still check the same structure repeatedly, like the DFS approach, but here we can solve it by adding every inum:snapshot tuple to a hash table after we've verified it has a path to the root.

Checking inode link counts:

This one will be tedious, but algorithmically doesn't change much. The existing code walks the dirents btree in natural key order, and then for each dirent increments a link count in a radix tree. After walking the dirents, we walk the inodes btree and compare link counts we calculated to what's in the inode.

As before, we'll walk the dirents btree in natural key order. However, we can't accumulate link counts without also looking at what actually exists in the inodes btree: a dirent may point to multiple inodes with different snapshot IDs, and which of those we increment also depends on which snapshots that dirent is visible in (we'll already have a helper for that part, as discussed previously).

This means that instead of using a radix tree, we'll have to start by walking the inodes btree, and for every inode we find creating a new in memory data structure (eytzinger layout search tree, perhaps) indexed by inode number:snapshot id and containing just the link count we'll be calculating.

Fortunately hardlinks are usually rare in actual practice. We can make use of the fact that we've already checked inode backpointers and only include inodes that actually have hardlinks in this table.