Bcachefs btree iterators:

Btree iterators in bcachefs implement more or less the usual operations for iterating over an ordered key/value store.

Iterators may be unlocked arbitrarily, due to e.g. transaction restarts (due to lock ordering, failure to get a journal reservation, btree node splits, etc.). This means they must remember what they were pointing at, so that subsequent peek/next etc. operations behave correctly. Also, a given btree transaction may have multiple iterators in use at the same time; updates to via one iterator must not invalidate other iterators (but this is a relatively easy requirement compared to getting restartable iterators correct).

Iterators may be used to iterate over keys or slots; for slots we iterate over every possible position, returning KEY_TYPE_deleted keys for empty slots. They can iterate forwards or backwards, and they can be used to iterate over extents or non-extents. Extents always have a nonzero size and have start and end positions; iterating over extents in slots mode (where we synthesize ranges to represent holes between keys in the btree) is used frequently by the filesystem IO code in bcachefs.

Primary iterator state:

  • iter.pos: the current position of the iterator.

    The iterator position may be changed directly with bch2_btree_iter_set_pos(), and the various peek()/next()/prev() operations all update the iterator position to match what they are returning.

  • iter.l[BTREE_MAX_DEPTH] (struct btree_iter_level)

    These contain, one for each level of the btree, a pointer to a btree node, an iterator for that btree node (struct btree_node_iter, see bset.h), and a lock sequence number. Thus, we have a full path from the root of the btree down to a specific key in a specific btree leaf node.

    Btree node iterators are much simpler: they can be treated as if they point to a single key within a sorted array of keys. Because btree nodes have multiple sorted arrays of code btree node iterators are internally maintaining multiple pointers and sorting the results, but this is an internal implementation detail.

The most important invariant for btree iterators is that iter.pos must always be consistent with the btree node iterators: that is, for every btree node iterator, the key the node iterator points to must compare >= iter.pos, and the previous key must compare < iter.pos.

That is - the node iterator must point to the correct position for inserting a new key at iter.pos.

For the various peek()/next()/prev() operations, iter.pos will usually be updated to point to the start of the key we're returning, and the leaf btree node iterator will usually point to the key we're returning, unless we're at a hole - but there are exceptions for extents.

Extents:

When returning an extent, iter.pos will typically be updated to point to the start of the extent we're returning (which is not the key the extent is indexed by; extents are indexed by where they end, not where they start).

This may mean that when we return an extent, the leaf btree node iterator won't necessarily point to the key we're returning - due to our #1 invariant, and the possible existence of whiteouts (that we filter out when iterating and don't return to user code).

Additionally, iter.pos won't be updated to point to the start of the extent we're returning when that would move iter.pos backwards (unless we're actually iterating backwards); this is because user code wants to see a monotonically increasing iterator position (as user code often uses iter.pos to track how far we've gotten/how much work we have to do).