"The COW filesystem for Linux that won't eat your data".
Bcachefs is an advanced new filesystem for Linux, with an emphasis on reliability and robustness. It has a long list of features, completed or in progress:
- Copy on write (COW) - like zfs or btrfs
- Full data and metadata checksumming
- Multiple devices, including replication and other types of RAID
- Scalable - has been tested to 50+ TB, will eventually scale far higher
- Already working and stable, with a small community of users
We prioritize robustness and reliability over features and hype: we make every effort to ensure you won't lose data. It's building on top of a codebase with a pedigree - bcache already has a reasonably good track record for reliability (particularly considering how young upstream bcache is, in terms of engineer man/years). Starting from there, bcachefs development has prioritized incremental development, and keeping things stable, and aggressively fixing design issues as they are found; the bcachefs codebase is considerably more robust and mature than upstream bcache.
Developing a filesystem is also not cheap or quick or easy; we need funding! Please chip in on Patreon - the Patreon page also has more information on the motivation for bcachefs and the state of Linux filesystems, as well as some bcachefs status updates and information on development.
If you don't want to use Patreon, I'm also happy to take donations via paypal: firstname.lastname@example.org.
Join us in the bcache IRC channel, we have a small group of bcachefs users and testers there: #bcache on OFTC (irc.oftc.net).
Bcachefs is not yet upstream - you'll have to build a kernel to use it.
First, check out the bcache kernel and tools repositories:
git clone https://evilpiepirate.org/git/bcachefs.git git clone https://evilpiepirate.org/git/bcachefs-tools.git
Build and install as usual - make sure you enable
CONFIG_BCACHE_FS. Then, to
format and mount a single device with the default options, run:
bcachefs format /dev/sda1 mount -t bcachefs /dev/sda1 /mnt
For a multi device filesystem, with sda1 caching sdb1:
bcachefs format /dev/sda1 --tier=1 /dev/sdb1 mount -t bcachefs /dev/sda1:/dev/sdb1 /mnt
bcachefs format --help for more options.
End user documentation is currently fairly minimal; this would be a very helpful area for anyone who wishes to contribute - I would like the bcache man page in the bcachefs-tools repository to be rewritten and expanded.
Bcachefs can currently be considered beta quality. It has a small pool of outside users and has been stable for quite some time now; there's no reason to expect issues as long as you stick to the currently supported feature set. It's been passing all xfstests for well over a year, and serious bugs are rare at this point. However, given that it's still under active development backups are a good idea.
Performance is generally quite good - generally faster than btrfs, and not far behind xfs/ext4. On metadata intensive benchmarks, it's often considerably faster than xfs/ext4/btrfs.
Normal posix filesystem functionality is all finished - if you're using bcachefs as a replacement for ext4 on a desktop, you shouldn't find anything missing. For servers, NFS export support is still missing (but coming soon) and we don't yet support quotas (probably further off).
Until bcachefs goes upstream I reserve the right to change the on disk format if necessary, but I'm not expecting any more incompatible disk format changes.
Full data checksumming
Fully supported and enabled by default. We do need to implement scrubbing, once we've got replication and can take advantage of it.
Not quite finished - it's safe to enable, but there's some work left related to copy GC before we can enable free space accounting based on compressed size: right now, enabling compression won't actually let you store any more data in your filesystem than if the data was uncompressed
Bcachefs allows you to assign devices to different tiers - the faster tier will effectively be used as a writeback cache for the slower tier, and metadata will be pinned in the faster tier.
Basic tiering functionality works, but it's not (yet) as configurable as bcache's caching (e.g. you can't specify writethrough caching).
All the core functionality is complete, and it's getting close to usable: you can create a multi device filesystem with replication, and then while the filesystem is in use take one device offline without any loss of availability.
Whole filesystem AEAD style encryption (with ChaCha20 and Poly1305) is done and merged. I would suggest not relying on it for anything critical until the code has seen more outside review, though.
Snapshot implementation has been started, but snapshots are by far the most complex of the remaining features to implement - it's going to be quite awhile before I can dedicate enough time to finishing them, but I'm very much looking forward to showing off what it'll be able to do.
We currently walk all metadata at mount time (multiple times, in fact) - on flash this shouldn't even be noticeable unless your filesystem is very large, but on rotating disk expect mount times to be slow.
This will be addressed in the future - mount times will likely be the next big push after the next big batch of on disk format changes.
Compression is almost done: it's quite thoroughly tested, the only remaining issue is a problem with copygc fragmenting existing compressed extents that only breaks accounting.
NFS export support is almost done: implementing i_generation correctly required some new transaction machinery, but that's mostly done. What's left is implementing a new kind of reservation of journal space for the new, long running transactions.
Other wishlist items:
When we're using compression, we end up wasting a fair amount of space on internal fragmentation because compressed extents get rounded up to the filesystem block size when they're written - usually 4k. It'd be really nice if we could pack them in more efficiently - probably 512 byte sector granularity.
On the read side this is no big deal to support - we have to bounce compressed extents anyways. The write side is the annoying part. The options are:
- Buffer up writes when we don't have full blocks to write? Highly problematic, not going to do this.
- Read modify write? Not an option for raw flash, would prefer it to not be our only option
- Do data journalling when we don't have a full block to write? Possible solution, we want data journalling anyways
Inline extents - good for space efficiency for both small files, and compression when extents happen to compress particularly well.
Full data journalling - we're definitely going to want this for when the journal is on an NVRAM device (also need to implement external journalling (easy), and direct journal on NVRAM support (what's involved here?)).
Would be good to get a simple implementation done and tested so we know what the on disk format is going to be.