Fsck in bcachefs:
This document is an overview of fsck in bcachefs, as well as what will need to change for snapshots.
In bcachefs, the fsck code is only responsible for checking higher level filesystem structure. Btree topology and allocation information are checked by btree_gc.c, which doesn't need to be aware of snapshots. This reduces the scope of fsck considerably.
What it does have to check is:
For each extent, dirent and xattr, that a corresponding inode exists and is of the appropriate type.
Dirents and xattrs are indexed by hash, with linear probing. The fsck code is responsible for checking that they're stored at the correct index (in case there was a hash collision, every index between the hash value and the actual index must be occupied by keys or hash whiteouts) and that there are no duplicates.
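As an illustration of that check, here is a toy model of a linearly-probed table (simplified structures, not the bcachefs on-disk format or API): fsck must verify that the probe chain from a key's hash value to its actual position is unbroken, and that no duplicate of the key appears earlier in that chain.

```c
/* Toy model of the hash index check -- not bcachefs code. */

#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TABLE_SIZE 8

enum slot_state { SLOT_EMPTY, SLOT_KEY, SLOT_WHITEOUT };

struct slot {
	enum slot_state	state;
	unsigned	hash;	/* hash of the key's name, if state == SLOT_KEY */
	const char	*name;
};

/*
 * Check that the key at index @idx is stored at a valid position: every
 * slot between its hash value and @idx must be occupied by a key or a
 * hash whiteout, i.e. the probe chain must be unbroken.
 */
static bool key_at_valid_pos(const struct slot *table, unsigned idx)
{
	unsigned i = table[idx].hash % TABLE_SIZE;

	while (i != idx) {
		if (table[i].state == SLOT_EMPTY)
			return false;	/* broken chain: lookups would miss this key */
		i = (i + 1) % TABLE_SIZE;
	}
	return true;
}

/*
 * Check for duplicates: no other key with the same name may appear in
 * the probe chain between this key's hash value and its position.
 */
static bool key_is_duplicate(const struct slot *table, unsigned idx)
{
	unsigned i = table[idx].hash % TABLE_SIZE;

	for (; i != idx; i = (i + 1) % TABLE_SIZE)
		if (table[i].state == SLOT_KEY &&
		    !strcmp(table[i].name, table[idx].name))
			return true;
	return false;
}
```

Note how a whiteout keeps the probe chain intact: deleting a key outright would make later keys in the same chain unreachable, which is exactly the corruption this pass detects.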
Dirents: check that the target inode exists and matches the dirent's d_type, and check the target inode's backpointer fields.
Extents:
- Check that extents do not overlap
- Check that extents do not exist past the inode's i_size
- Count extents for each inode and check the inode's i_sectors
Check that the root directory exists
Check that lost+found exists
Check directory structure: currently, this is done with a depth-first traversal from the root directory. We check for directories with multiple links, and then scan the inodes btree for directories that weren't found by the DFS.
Check each inode's link count: At this point we know that every dirent exists in a directory (has a corresponding inode of the correct type), has a valid target, and is reachable (because we've checked that all directories are reachable). All that's left is to scan the dirents btree, counting the number of dirents that point to each inode number, and verifying that against each inode's link count.
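The link count pass can be sketched as follows (simplified structures, not bcachefs code): tally dirent references per inode number, then compare the tally against each inode's stored link count.

```c
/* Sketch of the link count check -- toy structures, not bcachefs types. */

#include <assert.h>
#include <stddef.h>

struct dirent_key { unsigned long target; };		/* inode number pointed at */
struct inode_key  { unsigned long inum; unsigned nlink; };

/*
 * Scan the dirents, tallying references per inode number into @counts
 * (indexed directly by inode number in this toy model), then verify each
 * inode's nlink field. Returns the number of inodes with a bad count.
 */
static unsigned check_nlinks(const struct dirent_key *dirents, size_t nr_dirents,
			     const struct inode_key *inodes, size_t nr_inodes,
			     unsigned *counts)
{
	unsigned errors = 0;

	for (size_t i = 0; i < nr_dirents; i++)
		counts[dirents[i].target]++;

	for (size_t i = 0; i < nr_inodes; i++)
		if (counts[inodes[i].inum] != inodes[i].nlink)
			errors++;	/* fsck would repair by rewriting the inode */

	return errors;
}
```

The real pass works in btree order over the dirents btree, but the shape of the accounting is the same.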
Changes for snapshots:
References between keys:
Filesystem code operates within a subvolume: at the start of a btree transaction it looks up the subvolume's snapshot ID, and passes that to the btree iterator code so that lookups always return the newest key visible to that snapshot ID.
The fsck code does not operate within a specific subvolume or snapshot ID - we have to iterate over and process keys in natural btree order, for performance. From fsck's point of view, any key that points to another key may now point to multiple versions of that key in different snapshots.
For example, given a dirent that points to an inode, we search for the first inode with a snapshot ID equal to or an ancestor of the snapshot ID of the dirent - a normal BTREE_ITER_FILTER_WHITEOUTS lookup, but using the snapshot ID of the dirent, not the subvolume. This key should exist if the dirent points to a valid inode. Additionally, the dirent also references inodes at that position in snapshot IDs that are descendants of the dirent's snapshot ID - provided a newer dirent in a snapshot ID equal to or an ancestor of the inode's snapshot ID doesn't exist.
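The "equal to or an ancestor of" relation is a query against the snapshot tree. A minimal sketch, assuming a simple parent-pointer representation of that tree (the real implementation may use a different encoding):

```c
/* Sketch of the snapshot ancestry query -- not the bcachefs implementation. */

#include <stdbool.h>

#define NR_SNAPSHOTS 8

/* Toy snapshot tree: parent[id] is the parent of snapshot @id, 0 = none.
 *        1
 *       / \
 *      2   3
 *      |
 *      4
 */
static const unsigned parent[NR_SNAPSHOTS] = {
	[2] = 1,
	[3] = 1,
	[4] = 2,
};

/* True if @ancestor is equal to, or an ancestor of, snapshot @id */
static bool snapshot_is_ancestor(unsigned id, unsigned ancestor)
{
	while (id && id != ancestor)
		id = parent[id];
	return id == ancestor;
}
```

Walking parent pointers is O(depth); a production implementation would want this query to be fast, since fsck issues it for essentially every key it visits.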
Repair:
If we repair an inconsistency by deleting a key, we need to ensure that key is not referenced by child snapshots. For example, if we find an extent that does not have an S_ISREG inode, currently we delete it. But we might find an extent that doesn't have a corresponding inode with a snapshot ID equal to or an ancestor of the extent's snapshot ID, but does have an inode in a child snapshot. If that were to occur, we should probably copy that inode to the extent's snapshot ID.
We would like it to be as hard a rule as possible that fsck never deletes or modifies keys that belong to interior node snapshot IDs: when people are using snapshots, they'll generally want the data in those snapshots to stay intact, and it would be rather bad form if data was torched due to a bug in fsck.
It may not be possible to adhere to this rule while fixing all filesystem inconsistencies - e.g. if extents exist past an inode's i_size, then generally speaking the right thing to do is to delete those extents. But if we find inconsistencies at interior node snapshot IDs, then perhaps the best thing to do would be to just leave them, or only apply changes in leaf node snapshot IDs.
Algorithmic considerations:
We want bcachefs to scale well into the millions of snapshots. This means that there can't generally be operations that have to run for each subvolume or snapshot - the fsck algorithms have to work one key at a time, processing them in natural btree order.
This means that the extents, dirents and xattrs passes need to use BTREE_ITER_ALL_SNAPSHOTS, meaning that as they walk the keys for a given inode, they'll be seeing keys from every snapshot version mixed together - and there will be multiple versions of the inode, in different snapshots, that they need to check against.
There's a helper in the fsck code that these passes use, walk_inode(), that fetches the inode for the current inode number when it changes. This will probably change to load every copy of that inode, in each snapshot ID. Also, where we have to maintain state, we'll have to duplicate that state for every snapshot ID that exists at that inode number.
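A sketch of what the snapshot-aware version of that helper might do (hypothetical name and simplified structures, not the actual walk_inode() signature): instead of fetching one inode, collect every version at the current inode number, one per snapshot ID, so per-snapshot fsck state can be keyed off the resulting list.

```c
/* Hypothetical sketch -- not the real walk_inode() from fsck.c. */

#include <assert.h>
#include <stddef.h>

struct inode_key { unsigned long inum; unsigned snapshot; };

/*
 * Copy every version of inode @inum from @keys (the inodes btree in this
 * toy model, in sorted order) into @out; returns how many versions were
 * found. With snapshots, fsck state must be tracked per entry of @out.
 */
static size_t walk_inode_all_snapshots(const struct inode_key *keys, size_t nr,
				       unsigned long inum,
				       struct inode_key *out)
{
	size_t found = 0;

	for (size_t i = 0; i < nr; i++)
		if (keys[i].inum == inum)
			out[found++] = keys[i];

	return found;
}
```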
We'll also need to know, for every key that we process, which snapshot IDs it is visible in: so that e.g. the extents pass can correctly count i_sectors.
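One way to express that visibility rule, as a sketch with assumed toy structures rather than the actual bcachefs types: a key is visible in snapshot s if its snapshot ID is equal to or an ancestor of s, and no key at the same position overwrites it in a snapshot between the two.

```c
/* Sketch of per-snapshot key visibility -- not bcachefs code. */

#include <stdbool.h>
#include <stddef.h>

#define NR_SNAPSHOTS 8

/* Toy snapshot tree: parent[id] is the parent of snapshot @id, 0 = none */
static const unsigned parent[NR_SNAPSHOTS] = { [2] = 1, [3] = 1 };

static bool snapshot_is_ancestor(unsigned id, unsigned ancestor)
{
	while (id && id != ancestor)
		id = parent[id];
	return id == ancestor;
}

struct key { unsigned long pos; unsigned snapshot; };

/*
 * A key is visible in snapshot @s if its snapshot is equal to or an
 * ancestor of @s, and no other key at the same position overwrites it -
 * i.e. no key whose snapshot is both an ancestor of @s and a descendant
 * of this key's snapshot.
 */
static bool key_visible_in_snapshot(const struct key *keys, size_t nr,
				    size_t idx, unsigned s)
{
	if (!snapshot_is_ancestor(s, keys[idx].snapshot))
		return false;

	for (size_t i = 0; i < nr; i++)
		if (i != idx &&
		    keys[i].pos == keys[idx].pos &&
		    keys[i].snapshot != keys[idx].snapshot &&
		    snapshot_is_ancestor(s, keys[i].snapshot) &&
		    snapshot_is_ancestor(keys[i].snapshot, keys[idx].snapshot))
			return false;	/* overwritten in a newer snapshot */

	return true;
}
```

In this model an extent would contribute to i_sectors for exactly the snapshots it is visible in; the real pass computes this incrementally from keys seen in natural btree order rather than by rescanning.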
Note that "millions of snapshots" will not typically mean "millions of snapshot IDs at the same time" - when we allocate new inodes we pick inode numbers that aren't in use in any other snapshot or subvolume, and most files in a filesystem are created, written to once and then later atomically replaced with a rename call. When we're dealing with a file that has keys with many different snapshot IDs, we need to make sure we don't have any hidden O(n^2) algorithms; but if fsck requires hundreds of megabytes of state to check these files with worst case numbers of overwrites, that's not the end of the world.