bcachefs status, roadmap:

Performance:

Core btree:

The core btree code is heavily optimized, and I don't expect much in the way of significant gains without novel data structures. Btree performance is particularly important in bcachefs because we not shard btrees on a per inode basis like most filesystems. Instead have one global extents btree, one global dirents btree, et cetera. Historically, this came about because bcachefs started as a block cache where we needed a single index that would be fast enough to index an entire SSD worth of blocks, and for persistent data structures btrees have a number of advantages over hash tables (can do extents, and when access patterns have locality they preserve that locality). Using singleton btrees is a major simplification when writing a filesystem, but it does mean that everything is leaving rather heavily on the performance and scalability of the btree implementation.

Bcachefs b-trees are a hybrid compacting data structure - they share some aspect with COW btrees, but btree nodes internally are log structured. This means that btree nodes are big (default 256k), and have a very high fanout. We keep multiple sorted sets of keys within a btree node (up to 3), compacting/sorting when the last set, the one that updates go to, becomes too large.

Lookups within a btree node use eytzinger layout search trees with summary keys that fit in 4 bytes. Eytzinger layout means descendents of any given node are adjacent in memory, meaning they can be effectively prefetched, and 4 byte nodes mean 16 of them fit on a cacheline, which means we can prefetch 4 levels ahead - this means we can pipeline effectively enough to get good lookup performance even when our memery accesses are not cached.

On my Ryzen 3900X single threaded lookups across 16M keys go at 1.3M second, and scale almost linearly - with 12 threads it's about 12M/second. We can do 16M random inserts at 670k per second single threaded; with 12 threads we achieve 2.4M random inserts per second.

Btree locking is very well optimized. We have external btree iterators which make it easy to aggressively drop and retake locks, giving the filesystem as a whole very good latency.

Minor scalability todo item: btree node splits/merges still take intent locks all the way up to the root, they should be checking how far up they need to go. This hasn't shown up yet because bcachefs btree nodes are big and splits/merges are infrequent.

IO path:

IO paths: the read path looks to be in excellent shape - I see around 350k iops doing async 4k random reads, single threaded, on a somewhat pokey Xeon server. We're nowhere near where we should be on 4k random writes - I've been getting in the neighborhood of 70k iops lately, single threaded, and it's unclear where the bottleneck is - we don't seem to be cpu bound. Considering the btree performance this shouldn't be a core architectural issue - work needs to be done to identify where we're losing performance.

Filesystem metadata operation performance (e.g. fsmark): On most benchmarks, we seem to be approaching or sometimes mathing XFS on performance. There are exceptions: multithreaded rm -rf of empty files is currently roughly 4x slower than XFS - we bottleneck on journal reclaim, which is flushing inodes from the btree key cache back to the inodes btree.

Scalability:

There are still some places where we rely on walking/scanning metadata too much.

Allocator:

For allocation purposes, we divide up the device into buckets (default 512k, currently support up to 8M). There is an in-memory array that represents these buckets, and the allocator periodically scans this array to fill up freelists - this is currently a pain point, and eliminating this scanning is a high priority item. Secondarily, we want to get rid of this in memory array - buckets are primarily represented in the alloc btree. Almost all of the work to get rid of this has been done, fixing the allocator to not rescan for empty buckets is the last major item.

Copygc:

Because allocation is bucket based, we're subject to internal fragmentation that we need copygc to deal with, and when copygc needs to run it has to scan the bucket arrays to determine which buckets to evacuate, and then it has to scan the extents and reflink btrees for extents to be moved.

This is something we'd like to improve, but is not a major pain point issue - being a filesystem, we're better positioned to avoid mixing unrelated writes which tends to be the main cause of write amplification in SSDs. Improving this will require adding a backpointer index, which will necessarily add overhead to the write path - thus, I intend to defer this until after we've diagnosed our current lack of performance in the write path and worked to make is fast as we think it can go. We may also want backpointers to be an optional feature.

Rebalance:

Various filesystem features may want data to be moved around or rewritten in the background: e.g. using a device as a writeback cache, or using the background_compression option - excluding copygc, these are all handled by the rebalance thread. Rebalance currently has to scan the entire extents and reflink btrees to find work items - and this is more of a pain point than copygc, as this is a normal filesystem feature. We should add a secondary index for extents that rebalance will need to move or rewrite - this will be a "pay if you're using it" feature so the performance impact should be less of a concern than with copygc.

Some thought and attention should be given to what else we might want to use rebalance for in the future.

There are also current bug reports of rebalance reading data and not doing any work; while work is being done on this subsystem we should think about improving visibility as to what it's doing.

Disk space accounting:

Currently, disk space accounting is primarily persisted in the journal (on clean shutdown, it's also stored in the superblock) - and each disk space counter is added to every journal entry. As the number of devices in a filesystem increases this overhead grows, but where this will really become an issue is adding per-snapshot disk space accounting. We really need these counters to be stored in btree keys, but this will be a nontrivial change as we use a complicated system of percpu counters for disk space accounting, with multiple sets of counters for each open journal entry - rolling them up just prior to each journal write. My plan is to extend the btree key cache so that it can support a similar mechanism.

Also, we want disk space accounting to be broken by compression type, for compressed data, and compressed/uncompressed totals to both be kept - right now there's no good way for users to find out the compression ratio they're getting.

Number of devices in a filesystem:

Currently, we're limited to a maximum of 64 devices in a filesystem. It would be desirable and not much work to lift that limit - it's not encoded in any on disk data structures, just in memory bitmasks. After that we'll support 256 devices, and to go past that we'll need another, larger bch_extent_ptr type and a new superblock section for larger device index fields.

bch_extent_ptr uses 44 bits for the offset field (in units of 512 byte sectors), giving us a current maximum device size of 8 petabytes. We already have code for different types of checksum fields within an extent for different size checksums, so adding support for a new, larger extent ptr type will be fairly trivial.

Device management:

We need a concept of "failure domains", and perhaps something more, for managing large numbers of devices with bcachefs to make sense. Currently, disks can be assigned labels that are essentially paths, delimited by periods, e.g.

controller1.ssd.ssd1 controller1.ssd.ssd2 controller1.hdd.hdd1 controller2.hdd.hdd1 controller2.hdd.hdd2

This lets disks be organized into a heirarchy, and referring to any part of the heirarchy will include all the disks underneath that point: e.g. controller1 would refer to the first three disks in the example.

This is insufficient, though. If a user has a filesystem with 60 drives, all alike, we need a way to specify that some disks are on the same controller and that multiple replicas should not be stored on the same controller, and we also need a way to constrain how replicated writes are laid out. Today, in a 60 devices filesystem with 3x replication, each allocation will be unconstrained which means that the failure of any three devices would lead to loss of data.

A feature that addresses these issues still needs to be designed.

Fsck

Fsck performance is good, but for supporting huge filesystems we are going to need online fsck.

Fsck in bcachefs is divided up into two main components: there is btree_gc, which walks the btree and regenerates allocation information (and also checks btree topology), and there is fsck proper which checks higher level filesystem invariants (e.g. extents/dirents/xattrs must belong to an inode, and filesystem connectivity).

The btree_gc subsystem is so named because originally (when bcachefs was bcache) not only did not have persistent allocation information, it also didn't keep bucket sector counts up to date when adding/removing extents from the btree - it relied entirely on periodically rescanning the btree.

This capability was kept for a very long time, partly as an aid to testing an debugging when uptodate allocation information was being developed, then persistent, then transactionally persistent allocation information was developed. The capability to run btree_gc at runtime has been disabled for roughly the past year or so - it was a bit too much to deal with while getting everything else stabilized - but the code has been retained and there haven't been any real architectural changes that would preclude making it a supported feature again - this is something that we'd like to do, after bcachefs is upstreamed and when more manpower is available.

Additionally, fsck itself is part of the kernel bcachefs codebase - the userspace fsck tool uses a shim layer to run the kernel bcachefs codebase in userspace, and fsck uses the normal btree iterators interface that the rest of the filesystem uses. The in-kernel fsck code can be run at mount time by mounting wiht the -o fsck option - this is essentially what the userspace fsck tool does, just running the same code in userspace.

This means that there's nothing preventing us from running the current fsck implementation at runtime, while the filesystem is in use, and there's actually not much work that needs to be done for this to work properly and reliably: we need to add locking for the parts of fsck are checking nonlocal invariants and can't rely solely on btree iterators for locking.

Since backpointers from inodes to dirents were recently added, the checking of directory connectivity is about to become drastically simpler and shouldn't need any external data structures or locking with the rest of the filesystem. The only fsck passes that appear to require new locking are:

  • Counting of extents/dirents to validate i_sectors and subdirectory counts.

  • Checking of i_nlink for inodes with hardlinks - but since inode backpointers have now been added, this pass now only needs to consider inodes that actually have hardlinks.

So that's cool.

Snapshots:

I expect bcachefs to scale well into the millions of snapshots. There will need to be some additional work to make this happen, but it shouldn't be much.

  • We have a function, bch2_snapshot_is_ancestor(), that needs to be fast - it's called by the btree iterator code (in BTREE_ITER_FILTER_SNAPSHOTS mode) for every key it looks at. The current implementation is linear in the depth of the snapshot tree. We'll probably need to improve this - perhaps by caching a bit vector, for each snapshot ID, of ancestor IDs.

  • Too many overwrites in different snapshots at the same part of the keyspace will slow down lookups - we haven't seen this come up yet, but it certainly will at some point.

    When this happens, it won't be for most files in the filesystem, in anything like typical usage - most files are created, written to once, and then overwritten with a rename; it will be for a subset of files that are long lived and continuously being overwritten - i.e. databases.

    To deal with this we'll need to implement per-inode counters so that we can detect this situation - and then when we detect too many overwrites, we can allocate a new inode number internally, and move all keys that aren't visible in the main subvolume to that inode number.