"The COW filesystem for Linux that won't eat your data".

Bcachefs is an advanced new filesystem for Linux, with an emphasis on reliability and robustness and the complete set of features one would expect from a modern filesystem.

  • Copy on write (COW) - like zfs or btrfs
  • Full data and metadata checksumming
  • Multiple devices
  • Replication
  • Erasure coding (not stable)
  • Caching, data placement
  • Compression
  • Encryption
  • Snapshots
  • Nocow mode
  • Reflink
  • Extended attributes, ACLs, quotas
  • Scalable - has been tested to 100+ TB, expected to scale far higher (testers wanted!)
  • High performance, low tail latency
  • Already working and stable, with a small community of users

Documentation

Repositories

https://evilpiepirate.org/git/bcachefs.git
https://evilpiepirate.org/git/bcachefs-tools.git

Release tarballs

https://evilpiepirate.org/bcachefs-tools/

Debugging tools

bcachefs has extensive debugging tools and facilities for inspecting the state of the system while running.

Development tools

bcachefs development is done with ktest, which is used for both interactive and automated testing (including code coverage analysis), with a large test suite: dashboard.

Philosophy, vision

We prioritize robustness and reliability over features: we make every effort to ensure you won't lose data. Bcachefs builds on a codebase with a pedigree - bcache already has a reasonably good track record for reliability. Starting from there, bcachefs development has prioritized incremental development, keeping things stable, and aggressively fixing design issues as they are found; the bcachefs codebase is considerably more robust and mature than upstream bcache.

The long term goal of bcachefs is to produce a truly general purpose filesystem:

  • scalable and reliable for the high end
  • simple and easy to use
  • an extensible and modular platform for new feature development, based on a core that is a general purpose database, including potentially distributed storage

Some technical high points

Filesystems have conventionally been implemented with a great deal of special purpose, ad-hoc data structures; a filesystem-as-a-database is a rarer beast:

btree: high performance, low latency

The core of bcachefs is a high performance, low latency b+ tree. Wikipedia covers some of the highlights - large, log structured btree nodes, which enable some novel performance optimizations. As a result, the bcachefs b+ tree is one of the fastest production ordered key value stores around: benchmarks.

Tail latency has also historically been a difficult area for filesystems, due largely to locking and metadata write ordering dependencies. The database approach allows bcachefs to shine here as well: it gives us a unified way to handle locking for all on disk state, and lets us introduce patterns and techniques for avoiding the aforementioned dependencies - we can easily avoid holding btree locks while doing blocking operations, and as a result benchmarks show write performance to be more consistent than even XFS.

Sophisticated transaction model

The main interface between the database layer and the filesystem layer provides

  • Transactions: updates are queued up, and are visible to code running within the transaction, but not the rest of the system until a successful transaction commit
  • Deadlock avoidance: High level filesystem code need not concern itself with lock ordering
  • Sophisticated iterators
  • Memoized btree lookups, for efficient transaction restart handling; this also greatly simplifies high level filesystem code, which need not pass iterators around to avoid unnecessary lookups.
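
As a rough illustration of the first point (queued updates that are visible inside the transaction but not to the rest of the system until commit), here is a toy model in Python. It is only a sketch: a plain dict stands in for the btree, and the names `Store`, `Transaction`, `put`, `get` and `commit` are hypothetical, not the bcachefs interface. Locking, restarts and iterators are left out entirely.

```python
_DELETED = object()  # sentinel marking a queued deletion


class Store:
    """Toy stand-in for the btree: committed state only (hypothetical)."""

    def __init__(self):
        self.committed = {}


class Transaction:
    """Queues updates privately and applies them all at commit."""

    def __init__(self, store):
        self.store = store
        self.pending = {}  # key -> new value, or _DELETED

    def get(self, key):
        # Code running inside the transaction sees its own queued updates first.
        if key in self.pending:
            value = self.pending[key]
            return None if value is _DELETED else value
        return self.store.committed.get(key)

    def put(self, key, value):
        self.pending[key] = value

    def delete(self, key):
        self.pending[key] = _DELETED

    def commit(self):
        # Only now do the queued updates become visible to everyone else.
        for key, value in self.pending.items():
            if value is _DELETED:
                self.store.committed.pop(key, None)
            else:
                self.store.committed[key] = value
        self.pending.clear()
```

Until `commit()` is called, a reader that looks at `store.committed` directly sees none of the queued updates, while `trans.get()` already does.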

Triggers

Like other database systems, bcachefs-the-database provides triggers: hooks run when keys enter or leave the btree - this is used for e.g. disk space accounting.

Coupled with the btree write buffer code, this gets us highly efficient backpointers (for copygc), and in the future an efficient way to maintain an index-by-hash for data deduplication.
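
The trigger idea can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not bcachefs code: `TriggeredStore` and `SpaceAccounting` are hypothetical names, a dict stands in for the btree, and each value is simply the size of an extent in sectors.

```python
class TriggeredStore:
    """Toy key-value store that runs hooks when keys enter or leave it."""

    def __init__(self):
        self.keys = {}
        self.triggers = []

    def update(self, key, new):
        """Insert or overwrite `key`; pass new=None to delete it."""
        old = self.keys.get(key)
        # Every trigger sees both the key being replaced and its replacement.
        for trigger in self.triggers:
            trigger(key, old, new)
        if new is None:
            self.keys.pop(key, None)
        else:
            self.keys[key] = new


class SpaceAccounting:
    """Trigger that keeps a running total of sectors in use."""

    def __init__(self):
        self.sectors_used = 0

    def __call__(self, key, old, new):
        if old is not None:
            self.sectors_used -= old  # key leaving the store
        if new is not None:
            self.sectors_used += new  # key entering the store
```

Because the accounting lives in the hook rather than in each caller, every path that modifies the store keeps the total consistent without knowing it exists.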

Unified codebase

The entire bcachefs codebase can be built and used either inside the kernel, or in userspace - notably, fsck is not a from-scratch implementation, it's just a small module in the larger bcachefs codebase.

Rust

We've got some initial work done on transitioning to Rust, with plans for much more: here's an example of walking the btree, from Rust: cmd_list

Contact and support

Developing a filesystem is also not cheap, quick, or easy; we need funding! Please chip in on Patreon

We're also now offering contracts for support and feature development - email for more info. Check the roadmap for ideas on things you might like to support.

Join us in the bcache IRC channel, where we have a small group of bcachefs users and testers: #bcache on OFTC (irc.oftc.net).

Mailing list: https://lore.kernel.org/linux-bcachefs/, or linux-bcachefs@vger.kernel.org.

Bug trackers: bcachefs, bcachefs-tools

News

Blog:

Members v2, configurationless tiering

A feature request we've had is configurationless tiering: smart tiering of member devices in a filesystem based on their performance. The effect is that commonly accessed (hot) data will be stored on the faster drives, while data that is not used as often (cold) will be stored on slower drives. This will increase filesystem performance greatly.

Background: extensible metadata

Extensible metadata means having the ability to add new fields to metadata while still being compatible with older versions of the filesystem. To achieve this, structs cannot be of a fixed size: new fields are added at the end, and code reading data written by a previous version either bounds-checks before touching them or fills them in with zeroes.
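
The zero-filling approach can be sketched with Python's `struct` module. The record layout here (three little-endian 64-bit fields) and the name `read_record` are made up for illustration; they are not the bcachefs on-disk format.

```python
import struct

# Layout understood by this (hypothetical) version of the code:
# three little-endian u64 fields, the last one added in a later version.
CURRENT = struct.Struct("<QQQ")


def read_record(raw: bytes):
    """Decode a record written by an older, same, or newer version.

    Older writers produce a shorter record: the missing trailing fields
    are filled in with zeroes. Newer writers produce a longer record:
    the trailing fields this version does not know about are ignored.
    """
    padded = raw[:CURRENT.size].ljust(CURRENT.size, b"\0")
    return CURRENT.unpack(padded)
```

This only works if new fields are appended at the end and zero is a sensible "not set" value for each of them.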

Non extensible members

Early in the process of implementing the tiering feature we ran into an interesting problem: members within the superblock were not large enough to properly store this data. While we could have just resized the member, this would have caused further compatibility issues. Instead we opted to implement resizable members, members v2 if you will. Resizable members allow us to add new fields to the members while still ensuring backwards compatibility.

Members v2

The superblock of a filesystem is the start of that filesystem and requires extensible fields which contain important data such as a list of member devices. A superblock needs extensible fields in cases such as a new device being added to the filesystem, in which case the members field needs to be extended.

In members v1, the members array itself was extensible, but the members themselves were a fixed size. Due to their fixed size it was quite easy to index and retrieve members from the list. When members can be dynamically resized, however, it is not that easy: the location of each member cannot be known before runtime, and therefore has to be found and accessed manually within the array of members. This was at times a complicated process for me to implement but will make future expansions of the members much simpler.
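
One way to locate a member whose size is only known at runtime is to record the per-member size in the section header and compute each offset from it. The sketch below assumes exactly that; the header layout, the field names and `get_member` are hypothetical and chosen for illustration, not taken from the bcachefs superblock format.

```python
import struct

# Hypothetical section header: u16 member_bytes followed by 6 padding bytes.
HEADER = struct.Struct("<H6x")
# Fields this version of the reader understands: two little-endian u64s.
MEMBER = struct.Struct("<QQ")


def get_member(section: bytes, idx: int):
    """Return member `idx` from a section whose member size is stored on disk."""
    (member_bytes,) = HEADER.unpack_from(section, 0)
    # The stride comes from the header, not from the reader's struct size.
    start = HEADER.size + idx * member_bytes
    end = start + member_bytes
    if idx < 0 or end > len(section):
        raise IndexError("member index out of range")
    raw = section[start:end]
    # Shorter on-disk members are zero-filled, longer ones are truncated.
    return MEMBER.unpack(raw[:MEMBER.size].ljust(MEMBER.size, b"\0"))
```

With a fixed-size member the offset would be a compile-time constant; here a reader built against one member size can still walk a section written with another.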

Configurationless tiering

Configurationless tiering is a feature that has been commonly requested. Instead of the user specifying foreground and background targets, foreground allocations will go to the fastest device(s) and cold data will be moved to the slower device(s) in the background. To implement this, the filesystem needs some idea of device performance, which has to be stored in the superblock.

Storing device performance

The devices within the filesystem will now store IOPS measurements for randread, randwrite, seq-read, and seq-write. In the future the new IOPS field can also be useful in other features such as monitoring device health.
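
To make the idea concrete, here is a small sketch of how stored IOPS figures could be used to pick foreground and background devices automatically. The selection rule (rank by random-write IOPS, and treat anything within a factor of two of the fastest device as foreground) is purely an assumption for illustration, as are the names `Device` and `split_tiers`; the post does not specify how bcachefs makes this decision.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Per-device IOPS measurements, as they might be stored in the superblock."""
    name: str
    randread: int
    randwrite: int
    seq_read: int
    seq_write: int


def split_tiers(devices):
    """Split devices into (foreground, background) lists of names.

    Assumed rule: a device is a foreground target if its random-write
    IOPS is at least half that of the fastest device.
    """
    if not devices:
        return [], []
    fastest = max(d.randwrite for d in devices)
    foreground = [d.name for d in devices if d.randwrite * 2 >= fastest]
    background = [d.name for d in devices if d.randwrite * 2 < fastest]
    return foreground, background
```

With one NVMe drive, one SATA SSD and one hard disk, a rule like this would send foreground allocations to the NVMe drive and leave the other two as background targets, with no configuration from the user.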

Addendum: Cap'n proto

Some of the ideas in bcachefs about how to handle metadata were inspired by Cap'n Proto, which is highly recommended reading - it's a library that does everything we have to do by hand in C, exactly the way we want it.

Posted Wed Sep 27 18:58:27 2023

Archive: Archive