It's past time for the allocator to be redesigned, or at least heavily reconsidered. Nothing should be off the table at this point - we could conceivably even move away from bucket based allocation.

Here's a list of considerations:

  • The bucket based allocator does have advantages. It's fast - most of the work in the allocator is per bucket, which are typically 512k to 2M, even up to 8M. Metadata is one key per bucket, and the alloc btree uses the btree key cache, which is very good for performance - it means extents btree updates don't have to do a second (relatively expensive) btree update to update alloc info, they're only updating the key cache.

  • The bucket based allocator also lets us store and cheaply discard cached data, by incrementing the generation number for that bucket.

  • Disadvantage: the bucket allocator requires us to have copygc, with everything that entails.

  • It would be really nice to have backpointers at some point - copygc would greatly benefit from having backpointers (currently we have to scan the entire extents and reflink btrees to find data that needs to be moved), and erasure coding could also make good use of backpointers.

    If we added another btree for backpointers, it wouldn't make sense for it to use the btree key cache, and thus extents btree updates would get more expensive. Perhaps alloc keys could include a list of backpointers? They'd be 32 bytes each, and bkeys can be a little short of 2k - that's only ~50 backpointers per bkey, which we'd easily overflow with small writes and large buckets. If a secondary backpointers btree was just used as a fallback, it would be getting used right when the performance impact matters most - small writes.

    We should investigate the actual performance overhead of adding another btree update for backpointers - it might not be all that bad. Right now we're looking at ~450k random updates per second on a single core, and that includes the full bch2_trans_commit() path, and the most expensive things in that are per transaction commit, not per btree update (e.g. getting the journal reservation). Also, we'll hopefully be getting gap buffers soon, which will improve btree update performance.

The most pressing consideration though is that we need to eliminate, as much as possible, the periodic scanning the allocator thread(s) have to do to find available buckets. It's too easy to end up with bugs where this scanning is happening repeatedly (we have one now...), and it's a scalability issue and we shouldn't be doing it at all.

Also, we would really like to get rid of the in memory array of buckets, this is another scalability issue. The in memory struct bucket now mostly just mirrors what's in the btree, in KEY_TYPE_alloc keys. The one exception is journal_seq, which is the journal sequence number of the last btree update that touched that bucket. It's needed for knowing whether we need to flush the journal before allocating and writing to that bucket - perhaps it should just be added to struct bch_alloc and stored in the keys in the btree.