Bcachefs btree iterators:
Btree iterators in bcachefs implement more or less the usual operations for iterating over an ordered key/value store.
Iterators may be unlocked arbitrarily, due to e.g. transaction restarts (due to lock ordering, failure to get a journal reservation, btree node splits, etc.). This means they must remember what they were pointing at, so that subsequent peek/next etc. operations behave correctly. Also, a given btree transaction may have multiple iterators in use at the same time; updates to via one iterator must not invalidate other iterators (but this is a relatively easy requirement compared to getting restartable iterators correct).
Iterators may be used to iterate over keys or slots; for slots we iterate over
every possible position, returning KEY_TYPE_deleted
keys for empty slots. They
can iterate forwards or backwards, and they can be used to iterate over extents
or non-extents. Extents always have a nonzero size and have start and end
positions; iterating over extents in slots mode (where we synthesize ranges to
represent holes between keys in the btree) is used frequently by the filesystem
IO code in bcachefs.
Primary iterator state:
iter.pos
: the current position of the iterator.The iterator position may be changed directly with
bch2_btree_iter_set_pos()
, and the various peek()/next()/prev() operations all update the iterator position to match what they are returning.iter.l[BTREE_MAX_DEPTH]
(structbtree_iter_level
)These contain, one for each level of the btree, a pointer to a btree node, an iterator for that btree node (struct
btree_node_iter
, see bset.h), and a lock sequence number. Thus, we have a full path from the root of the btree down to a specific key in a specific btree leaf node.Btree node iterators are much simpler: they can be treated as if they point to a single key within a sorted array of keys. Because btree nodes have multiple sorted arrays of code btree node iterators are internally maintaining multiple pointers and sorting the results, but this is an internal implementation detail.
The most important invariant for btree iterators is that iter.pos
must always
be consistent with the btree node iterators: that is, for every btree node
iterator, the key the node iterator points to must compare >= iter.pos
, and
the previous key must compare < iter.pos
.
That is - the node iterator must point to the correct position for inserting a
new key at iter.pos
.
For the various peek()/next()/prev() operations, iter.pos
will usually be
updated to point to the start of the key we're returning, and the leaf btree
node iterator will usually point to the key we're returning, unless we're at a
hole - but there are exceptions for extents.
A discussion on iterator position
bch2_btree_iter_peek_slot()
and bch2_btree_iter_peek_slot_extents()
never
change iter->pos
. bch2_btree_iter_peek()
will change iter->pos
to match
the key just returned.
iter->pos
should always match the key we're currently at; this is so that
on transaction restart we can get back to where we were at, and also so that
when updating the key at the current position the iterator is at the correct
position for that update.
Also, iter->pos
is used by code that uses btree iterators to keep track of
where it's at in the work it's doing. Importantly, when iterating over extents
by slots (i.e. in bch2_read()
), as we call next_slot()
iter->pos
will be
advanced to where the previously returned extent ended.
If we're iterating forwards, iter->pos
will be monotonically increasing, and
if we're iterating forwards by slots we guarantee that we'll return keys/extents
that exactly cover the keyspace.
For non extent iterators, we return the first key >= iter->pos
; for extent
iterators, we return the first key strictly greater than iter->pos
. Right now
this is handled by btree_iter_search_key()
. This is because when we're
iterating over extents we want the first extent that covers the range starting
at iter->pos
.
For iterating backwards, we have peek_prev()
that returns the first key <=
iter->pos
. Note: when using peek_prev()
for extents, it's the same search
condition as for non extents - first key <= iter->pos
.
Also: we need iter->pos
to correspond to and be consistent what the rest of
the iterator physically points to.
So we want two different notions of iterator position:
- One
Extents:
When returning an extent, iter.pos
will typically be updated to point to the
start of the extent we're returning (which is not the key the extent is indexed
by; extents are indexed by where they end, not where they start).
This may mean that when we return an extent, the leaf btree node iterator won't necessarily point to the key we're returning - due to our #1 invariant, and the possible existence of whiteouts (that we filter out when iterating and don't return to user code).
Additionally, iter.pos won't be updated to point to the start of the extent
we're returning when that would move iter.pos backwards (unless we're actually
iterating backwards); this is because user code wants to see a monotonically
increasing iterator position (as user code often uses iter.pos
to track how
far we've gotten/how much work we have to do).