"The COW filesystem for Linux that won't eat your data".

Bcachefs is an advanced new filesystem for Linux, with an emphasis on reliability and robustness and the complete set of features one would expect from a modern filesystem.

Copy on write (COW) - like zfs
Full data and metadata checksumming, for full data integrity: the filesystem should always detect (and where possible, recover from) damage; it should never return incorrect data.
Multiple devices
Replication
Erasure coding (incomplete) High performance: doesn't fragment your writes (like ZFS), no RAID hole
Caching, data placement
Compression
Encryption
Snapshots
Nocow mode
Reflink
Extended attributes, ACLs, quotas
Petabyte scalability
Full online fsck, check and repair (in progress)
Robustness and rock solid repair. Damage and breakage are a fact of life, it's not a matter of if, but when. It doesn't matter what happened to the filesystem: bad hardware, lightning strikes, an errant dd, you can expect that bcachefs will repair the damage and keep going, usually with no user intervention required.

It's the job of the filesystem to never lose your data: anything that can be repaired, will be.

Documentation

Distribution packages, installation instructions, DKMS:

As of 6.18, bcachefs is no longer being distributed with the kernel, for reasons too complicated to go into here.

(But it might have something to do with QA).

So we're shipping as a DKMS module now. (Like ZFS!). The DKMS package supports Linux 6.16 and later.

Nixos, Arch: your distribution bcachefs-tools package has everything you need. You're good to go!

Debian, Ubuntu: https://apt.bcachefs.org/

Fedora: https://copr.fedorainfracloud.org/coprs/ngompa/bcachefs/builds/

OpenSUSE: https://build.opensuse.org/package/show/filesystems:bcachefs:release/bcachefs

Repositories

https://evilpiepirate.org/git/bcachefs.git
https://evilpiepirate.org/git/bcachefs-tools.git

Release tarballs

https://evilpiepirate.org/bcachefs-tools/

Debugging tools

bcachefs has extensive debugging tools and facilities for inspecting the state of the system while running.

Development tools

bcachefs development is done with ktest, which is used for both interactive and automated testing (including code coverage analysis), with a large test suite: dashboard.

About

There are two main themes or ideas that make up bcachefs: copy on write, and filesystem-as-database. These are the two primary tools for achieving reliability: a filesystem that won't corrupt or lose your data, and that can always be repaired even when things go horribly wrong.

History - ZFS

The history of modern filesystems starts with ZFS. Their primary goal was a filesystem that absolutely guarantees data integrity, and COW is primary mechanism by which ZFS achieves that goal: it enables full data and metadata checksumming, with a chain of trust up to a single root, the superblock.

This came out of real world experience with failure modes of block storage that conventional filesystems weren't able to protect against: bit corruption, lost writes, misdirected writes, and others. COW, coupled with full data and metadata checksumming, makes it possible for the filesystem to definitively guard againts all of them: a properly implemented COW filesystem should never return incorrect data.

The push towards copy on write goes back even further in academic research. It had long been realized that it was fundamentally a safer technique than update-in-place, but ZFS was the first to realize COW in a filesystem and demonstrate the benefits to data integrity.

Filesystem as a database

The other half of bcachefs is "filesystem as a database".

The original Unix filesystem (and other filesystems of the era are broadly similar in concept, though the ideas were different) was implented with various special purpose (but simple - originally) on disk data structures: inodes (file metadata) of fixed size that live in fixed locations on disk, file data stored in blocks (not extents) indexed by radix trees hanging off the inode, and simple bitmaps for free space allocation.

This worked well when filesystems were 1k loc, or 10k loc. But every additional feature needed new on disk data structures and pushed the limits of this design, with transactional correctness (filesystems that don't need a fsck after a crash) being a particularly big jump in complexity. In this model, every additional feature has needed new on disk data structures - each unique and special purpose. High performance directory indexing, xattrs, reflink, and a great many features not visible to users but necessary for scalability to modern filesystem sizes.

Notably, ZFS has more in common with the original Unix filesystem in the design of its data structures than other modern filesystems, for what were good and pragmatic reasons at the time. Their primary goal was robustness, and that has been the strength of filesystems in that lineage (as with ext2/3/4, also based off the original Unix filesystem): those simple, often fixed-location on disk data structures make for reliable repair.

But as research into b-trees and databases advanced, a great many filesystems looked at that and said: "Wouldn't life be simpler if all metadata was just keys in a key/value store?", with that ideally being a single core b-tree/database implementation, with scalability and performance usually being primary goals.

XFS was the earliest and most successful of the "everything is a b-tree" approach, but largely predating the push towards COW filesystems, and didn't push the "unified data model" very far in comparison to later filesystems. Reiserfs and btrfs are also notable for pushing this approach.

Historically, the primary disadvantages of this approach has been:

Robustness. "Everything is a b-tree" introduces new single points of failure (b-tree roots), and high performance, production quality b-trees are notoriously complicated beasts.
Performance. The original Unix filesystem design had a major scalability advantage, with a lot of built in sharding at the inode level. Fully realizing the benefits of the "filesystem as a database" approach requires tossing that out: instead of many b-trees, one per inode, having internals that work like a normal-ish database requires the filesystem to be implemented with a few massive b-trees: then we can have something that starts to look like a relational database, with tables corresponding to data types, and simple unified transactional interface.

These have been the achille's heels, and the reason why other more modern COW filesystems have not lived up to ZFS's robustness.

History of bcachefs, and how these were solved

bcache:

Enter bcache, the prototype for bcachefs. When SSDs were first being introduced, they were from the beginning orders of magnitude faster than rotating disk - but expensive, so block layer caching was an obvious approach for using them effectively.

Whereas other approaches towards SSD caching at the time were based on hash tables (which have major disadvantages as persistent data structures, and don't support range lookups - making indexing extents instead of blocks extremely problematic), bcache went with a b-tree approach, with several novel innovations to make the b-tree performant enough to work for indexing every IO lookup, without sharding. The primary innovation was log structured btree nodes: this enabled eytzinger search trees, which eliminates binary search. This was huge: binary search is O(log n), but from a CPU cache perspective is the worst lookup algorithm possible.

This made bcache's b-tree the fastest ordered, persistent, production quality key value store in existence; perhaps bcachefs's biggest claim to fame, but more importantly, making the rest of the problems in "modern COW filesystem with real database underpinnings" tractable.

Robustness and reliability: making "restore from backup" a thing of the past

Unlike the rest, there is no silver bullet to producing truly reliable code. Reliability comes from making it a priority and sticking to it, at every stage of development; learning from every accident and mistake, discovering all the failure modes that can happen in the wild, and continually building on those lessons and experience.

Some approaches pioneered in other filesystems:

btree node scan, for disaster recovery scenarios - originally done in reiserfs. The reiserfs implementation had issues; notably, it would pick up btree nodes from unrelated filesystems on the same device. bcachefs corrects all of these with improved per-btree-node metadata, and makes it practical for the huge filesystems of today by segregating data and metadata (and noting which regions are which in the superblock).
Extensive runtime verification of metadata, any time metadata is read - or written; XFS also makes heavy use of this approach. This drastically reduces the impact of bugs: bugs that would otherwise result in filesystem corruption typically result in emergency shutdown - restart and keep going, and bugs are caught much more quickly (and are therefore much easier to debug).
Full filesystem rollback, as in log structured filesystems (like nilfs2), and ZFS in limited fashion. This falls out of the "filesystem as a database" approach: with all metadata as keys in b-trees, with all updates (and overwrites) logged in a single unified journal, anything can be rolled back.

Other areas of emphasis:

Comprehensive and thorough logging, introspection, and debugging tools.

You can't debug what you can't see. Any time something goes wrong, the system should tell you clearly what went wrong, and tell you clearly and directly everything you, as an end user or engineer, need to understand what went wrong and debug the issue.
Comprehensive and robust fsck

Notably, filesystem repair is considerably simplified with the "filesystem as a database" approach, but it's still a major topic, worth its own chapter in the user manual.

The main point is: repair is not, and cannot be, an afterthought. It doesn't matter what the cause of the damage was, it's the filesystem's job to repair it, without causing additional damage or loss.

If you're running bcachefs: you won't lose data. If you find a way to break it: first, I'll be impressed, and secondly - get in contact. We've got the debugging tools to make short work of anything that may come up, and real world testing (the crazier, the better!) is how we make this filesystem more robust and rock solid for everyone.

A few other things worth mentioning

Unified codebase

The entire bcachefs codebase can be built and used either inside the kernel, or in userspace - notably, fsck is not a from-scratch implementation, it's just a small module in the larger bcachefs codebase.

Rust

We've got some initial work done on transitioning to Rust, with plans for much more: here's an example of walking the btree, from Rust: cmd_list

Contact and support

Developing a filesystem is also not cheap, quick, or easy; we need funding! Please chip in on Patreon

We're also now offering contracts for support and feature development - email for more info. Check the roadmap for ideas on things you might like to support.

Join us in the bcache IRC channel, we have a small group of bcachefs users and testers there: #bcache on OFTC (irc.oftc.net).

Mailing list: https://lore.kernel.org/linux-bcachefs/, or linux-bcachefs@vger.kernel.org.

Bug trackers: bcachefs, bcachefs-tools

News

Blog:

RSS Atom

Members v2, configurationless tiering

A feature request we've had is configurationless tiering, smart tiering of member devices in a filesystem based on performance. This feature will allow easy and simple tiering of devices within a filesystem based on the performance of the device. The effect of this is that it will allow data that is commonly accessed, hot data, stored on the faster drives while data that is not used as often, cold data, will be stored on slower drives. This will increase filesystem performance greatly.

Background: extensible metadata

Extensible metadata means having to ability to add new fields to metadata while still being compatible with older versions of the filesystem. To achieve this structs cannot be of a fixed size and need to be able to add new fields using a bounds check or filling in wth zeroes when reading data from a previous version.

Non extensible members

Early in the process of implementing the tiering feature we ran into an interesting problem. Members within the superblock were not large enough to properly store this data. While we could have just resized the member, this would have caused further issues regarding compatability. Instead we opted to implement resizable members, members v2 if you will. The effect of adding resizable members allows us to add new fields to the members while still ensuring backwards compatability.

Members v2

The superblock of a filesystem is the start of that filesystem and requires extensible fields which contain important data such as a list of member devices. A suberblock needs extensible fields in cases such as a new device being added to the filesystem, in which case the members field needs to be extended.

In the case of members v1, the members array itself was extensible, the members themselves were a fixed size. Due to their fixed size it was quite easy to index and retrieve members from the list. However, when members can be dynamically resizable it is not that easy. The location of each member can not be known before runtime and therefore has to be found and accessed manually within the array of members. This was at times a complicated process for me to implement but will make future expansions of the members much simpler.

Configurationless tiering

Configurationless tiering is a feature that has been commonly requested. Instead the user specifying foreground and background targets, foreground allocations will go to the fastest device(s) and cold data will be moved to the slower device(s) in the background. To implement this the filesystem will require some idea of device performance which needs to be stored in the superblock.

Storing device performance

The devices within the filesytem will now store IOPS measurements for randread, randwrite, seq-read, and seq-write. In the future the new IOPS field can also be useful in other features such as monitoring device health.

Addendum: Cap'n proto

Some of the ideas in bcachefs about how to handle metadata were inspired by Cap'n Proto, which is highly recommended reading - it's a library that does everything we have to do by hand in C, exactly the way we want it.

Posted Wed Sep 27 18:58:27 2023

Archive: Archive

Other useful links

Developer documentation