Killian De Volder (Qantourisc) and I ended up having a long discussion about TCQ vs. write caching, stable pages, and other random stuff about the Linux IO stack - none of it bcachefs specific, but we thought it might be useful and/or interesting enough to be worth dumping into the wiki and possibly cleaning up later.

The IRC conversation is reproduced below:

drkshadow: sidenote: I think I will write a custom tool to test this, that is less complicated to run

as one can use math to verify the consistency of a file
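
(A sketch of the kind of math such a tool could use: every block written carries its own block number, a generation counter, and a checksum, so after a power cut you can detect torn, stale, or misplaced writes by pure arithmetic. The struct layout is invented for illustration, and crc32() is a stand-in for any 32-bit checksum, assumed to be provided elsewhere.)

    /* Hypothetical self-verifying block format: after a crash, re-read
     * each block and check that the stored CRC matches the contents and
     * that the stored block number matches where the block actually
     * lives on disk. */
    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096

    struct vblock {
            uint64_t block_nr;    /* where this block was meant to land */
            uint64_t generation;  /* bumped on each overwrite; detects stale data */
            uint32_t crc;         /* CRC over the rest of the block */
            uint8_t  data[BLOCK_SIZE - 20];
    };

    /* assumed to be provided elsewhere; any 32-bit checksum will do */
    extern uint32_t crc32(uint32_t seed, const void *buf, size_t len);

    static int vblock_check(const struct vblock *b, uint64_t expect_nr)
    {
            uint32_t crc = crc32(0, b, offsetof(struct vblock, crc));

            crc = crc32(crc, b->data, sizeof(b->data));
            return b->crc == crc && b->block_nr == expect_nr;
    }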


that stuff is a pain

py1hon: what stuff ?

making sure syncs are passed down correctly

py1hon: writing the tool, or dealing with that shitstorm of cache-lies ?

i spotted a bug (hope it's fixed now...) in the DIO code - with O_DIRECT|O_SYNC aio the sync just wasn't happening
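
(For context, a minimal sketch of what O_DIRECT|O_SYNC is supposed to mean: bypass the page cache, and don't complete the write until it's persistent. The bug described was the aio path dropping the O_SYNC half. The path and sizes below are made up.)

    #define _GNU_SOURCE        /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            void *buf;
            int fd = open("/tmp/testfile",
                          O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);

            if (fd < 0)
                    return 1;
            /* O_DIRECT wants buffer, offset, and length aligned,
             * typically to the logical block size: */
            if (posix_memalign(&buf, 4096, 4096))
                    return 1;
            memset(buf, 0xaa, 4096);

            /* with O_SYNC this must not return until the data is on
             * stable storage - this is what wasn't happening in the
             * buggy aio path */
            if (pwrite(fd, buf, 4096, 0) != 4096)
                    return 1;

            free(buf);
            return close(fd);
    }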

py1hon: i'd like to know what setups / disks work ... I run everything with write-cache off, it's painful

i mean finding/testing every possible way code can fuck up

you read enough code you see scary shit

py1hon: ... still ?

i haven't checked

reminds me of 2006 and lvm

really don't want to :p

"ow now we handle syncs"

i have enough bugs i'm responsible for thank you very much

i was like ... whaaaat ?



so I run all my servers in direct IO :/

i think that guy's TCQ thing is a red herring though

TCQ is just another form of cache handling

the trouble with the TCQ/barrier model is that it's utterly insane trying to preserve ordering all the way up and down the io stack


py1hon: i'm just VERY confused why handling sync in the kernel is so hard

software ends up being coded to just use explicit flushes
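
(The "explicit flushes" pattern, sketched from userspace: instead of relying on the stack to preserve ordering between two dependent writes, you put a flush between them. This is roughly how journaling code ends up looking; the function and offsets are invented.)

    #include <sys/types.h>
    #include <unistd.h>

    /* write a journal entry, then the commit record that makes it
     * "count" - with an explicit flush between the two, because
     * nothing else guarantees the ordering on disk */
    int journal_commit(int fd,
                       const void *entry, size_t entry_len, off_t entry_off,
                       const void *commit, size_t commit_len, off_t commit_off)
    {
            if (pwrite(fd, entry, entry_len, entry_off) != (ssize_t)entry_len)
                    return -1;
            if (fsync(fd))          /* entry durable before commit is written */
                    return -1;
            if (pwrite(fd, commit, commit_len, commit_off) != (ssize_t)commit_len)
                    return -1;
            return fsync(fd);       /* commit record durable */
    }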

Qantourisc: i think mostly it's because a lot of storage developers rode in on the short bus

"short bus" ?


what's that ?

the bus you ride to school if you have "special needs"


imo, each bio request should return AFTER the write is DONE


return a promise

in an ideal world yea

the trouble is

write caching is really useful

how anything else is allowed, I do not understand, me eyeballs linus

the reason is

until the write completes

that's what the promises are for

that memory buffer - you can't touch it

and when you request a sync, you wait until they are all completed (imo)

but yes it's some work

hold on i'm explaining :P

so this comes up most notably with the page cache

also "stable pages" was still a clusterfuck last i checked

but it'd be even worse without write caching


if you're a filesystem that cares about data checksums and crap

you checksum the data you're writing, and then the data you're writing better goddamn not change until the write completes - say userspace is changing it, because that's what they do

or if it does, your checksum is invalid

which means:

to write some data that's cached in the page cache, and mapped into potentially multiple userspace processes

Qantourisc makes a note: if userspace writes, just update the checksum - or are they writing to the buffer you are about to write ?

you have to mark those page table entries RO, flush tlbs etc. etc. so that userspace can't modify that crap

if userspace modifies the buffer while the write is in flight

you don't know if the old or the new version was the one that got written

it's a race

you're fucked
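
(The race can be demonstrated entirely in userspace, with a thread standing in for the application and a memcpy standing in for the device's DMA. All names here are invented for the demo; build with -pthread, run it a few times, and the "on-disk" checksum will sometimes not match the "on-disk" data.)

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static unsigned char page[4096];   /* shared buffer, like a mapped page */
    static unsigned char disk[4096];   /* pretend device memory */

    static uint32_t csum_of(const unsigned char *p, size_t n)
    {
            uint32_t s = 0;

            while (n--)
                    s = s * 31 + *p++;
            return s;
    }

    static void *scribbler(void *arg)
    {
            memset(page, 'x', sizeof(page));   /* userspace writing mid-flight */
            return NULL;
    }

    int main(void)
    {
            pthread_t t;
            uint32_t csum;

            memset(page, 'a', sizeof(page));
            csum = csum_of(page, sizeof(page));   /* checksum the buffer... */
            pthread_create(&t, NULL, scribbler, NULL);
            memcpy(disk, page, sizeof(disk));     /* ...then "DMA" it out */
            pthread_join(t, NULL);

            if (csum_of(disk, sizeof(disk)) != csum)
                    printf("stored checksum doesn't match stored data\n");
            return 0;
    }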

... imo once you submit to the kernel what you will write: it's hands off

guess they wanted to prevent an extra page copy

ignore direct io for now (that has its own issues though)

yes, that extra copy is bad

py1hon: so is screwing up your data :)

not that bad, that's actually what bcachefs does

but it's a shitty situation

also having to allocate memory to copy data just so you can do io, that's also not a great situation

screwing up data < extra memory copy

there are lots of good reasons why you don't want to bounce data you're writing if you don't have to

and mostly, you don't have to


Also, what's wrong with saying "fu" to userspace ?

don't do it wrong ?

or is writing to it "allowed" ?

like i said bcachefs is just copying the data

which i'm sure someone is going to complain about eventually


so getting back to the tcq vs write caching thing

fundamental issue is, while a write is outstanding the device owns that buffer, you can't touch it or reuse it for anything else

py1hon: then you reply: "I am sorry, the current design of what is allowed in the kernel API is too liberal, I can't write out data you can constantly modify while I'm writing. Complain to Linus to fix this braindead design."

if doing a write is fast, because writes are cached, this isn't a huge deal

you can just wait until the write completes, and (potentially) avoid having to bounce

py1hon: quick break: "bounce" ?

bounce = allocate a bounce buffer, copy data into bounce buffer, write using bounce buffer instead of original buffer

ah ok

"copy page" in a sence

if doing a write is slow, because it's always waiting for the device to physically make it persistent

then, you're probably going to end up bouncing all the writes on the host - introducing an extra copy


this is stupid

because the writes are gonna get copied to the device's buffer regardless, so it can do the actual io

so if you have to copy the data to the device's buffer anyways

sounds logical: everyone who promises to write something out later should copy the data

just do that INSTEAD of bouncing

boom, done

except not really, there's shitty tradeoffs all around

why are they favoring speed over safety ?

anyways, there's really no good answers to the bouncing and tcq vs write buffering stuff

well they're not these days

excepting bugs of course

so they bounce more, then ?

no you don't want to know about the bouncing situation

pretend i didn't bring that up because then i'd have to talk about stable pages and like i said that's another bag of shit


userspace has no concept of tcq

or when a write hits the device

or what

all userspace has is fsync and O_SYNC

and that's fine

those interfaces are completely adequate in practice

the kernel just has to make sure it actually honors them, which (again, excluding bugs), it does these days

so who could write to the page being written, then (in the earlier example with the checksum) ?

whether it honors fsync and O_SYNC by using flushes or by using TCQ doesn't matter one damn to userspace
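
(Which is to say, the whole durability interface userspace gets fits in a few lines; whether the kernel implements it with cache flushes, FUA, or TCQ is invisible from here. A sketch:)

    #include <fcntl.h>
    #include <unistd.h>

    /* option 1: write, then explicitly make it durable */
    int durable_write(int fd, const void *buf, size_t len, off_t off)
    {
            if (pwrite(fd, buf, len, off) != (ssize_t)len)
                    return -1;
            return fsync(fd);
    }

    /* option 2: open with O_SYNC - every write is durable on return */
    int open_sync(const char *path)
    {
            return open(path, O_WRONLY | O_SYNC);
    }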

ok, so you mmap a file

agreed, f/Osync is enough

how's that work?

or say multiple processes mmap the same file


py1hon: the only way I see this working: cow


that would be MAP_PRIVATE

say they're all writing to the file

so all their changes, via the mmap() mapping, have to be written out to disk (and also seen by each other blah blah)

sec first: mmap works by pages, correct ? and those pages are backed by the FS ?


and yes

so that file is cached in the page cache

hold on constructing mental model

it's cached just once


every time a process calls mmap()

py1hon: does the kernel know when one has written to a page ?

the kernel maps those same physical pages to locations in that process's address space by setting up page table entries and mappings and all that crap


with help from the CPU

a nice way :)

page starts out clean, right?


so when the kernel maps that clean page into the userspace's address space, it maps it read only

REGARDLESS of whether userspace wants read write access or not

then, when userspace writes to it, it traps

SIGBUS, except not, because the kernel sees what was going on

kernel switches it to read write, notes the page is now dirty, and continues on like nothing ever happened

but how does it detect a second write ?

doesn't need to

all it needs to know is that the page is now dirty

and, at some point in the future, has to be written

oh right, if you want to write it, lock it again first ?

make it read only, write it, mark it clean

userspace writes again, cycle repeats

clean == RO map ?


or a bitmap somewhere ?

bit in struct page

however: that was all a lie

that's how it works conceptually

ok who has cut corners, and why :(

but, dirtying pages and crap like that is important enough that CPUs actually track this stuff for you without you having to map pages read only and having userspace go to the trouble of faulting

the end result is, the kernel can just check the page table entries that it sets up for the CPU to see if userspace has written to them

ok we can detect this differently, sounds nice, and it blows up in our face how ?

(this actually isn't 100% my area, i would be reading docs to check the details on this stuff if i cared)
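
(What "the CPU tracks this for you" boils down to: on x86 the MMU sets a Dirty bit - bit 6 - in the page table entry on the first write, no fault taken, and the kernel just reads it back later. A simplified model; the real kernel goes through arch helpers like pte_dirty() rather than raw bit tests.)

    #include <stdbool.h>
    #include <stdint.h>

    #define X86_PTE_DIRTY (1ULL << 6)   /* set by the CPU on write */

    /* has userspace written to the page since we last cleaned it? */
    static bool pte_was_written(uint64_t pte)
    {
            return pte & X86_PTE_DIRTY;
    }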

ok so, nice thing about this is

no more traps :)

pages are never read only! userspace can always write to them!

yes, no more traps!

(aka, minor faults)

annoying side effect:

if pages are never read only...



if we're not bouncing

and we don't want to bounce if we don't have to, so usually we don't

.... so why are they not marked RO ?

the buffer we're writing out is literally the same buffer that is mapped into userspace

because if they were marked RO, userspace would trap and things would be slow


py1hon: i'd argue it would NOT be slow

oh i'll get to that

either things are going bad: we are writing with a race condition

this is where it starts to get hilarious

no not yet

and slow is "ok"

this is how it worked for many years

and it worked fine

reason is

or things are not racing and it should be fine

if there's a write in flight

and userspace is scribbling over that buffer with new data

who cares? we're going to overwrite that write with the new version later

it got marked dirty again

there is of course 1 aspect of mmap'd files: write order is not guaranteed

if userspace cares about which specific version gets written, they need to stop scribbling over stuff and call fsync()

if one looks at this api

no really, trust me, this actually does work completely fine

no data integrity is broken

py1hon: with fsync it works too yes

but if the app refuses to wait: not a clue what version you will get

yes but that's fine!

if the app isn't doing an fsync, it CANNOT care
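
(What "stop scribbling and call fsync()" looks like for an mmap'd file - msync(MS_SYNC) being the mmap-flavoured spelling. The file name and sizes are made up, and the file is assumed to already be at least 4096 bytes long.)

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/tmp/shared.dat", O_RDWR);
            char *p;

            if (fd < 0)
                    return 1;
            p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return 1;

            memcpy(p, "version 2", 9);    /* scribble... */
            if (msync(p, 4096, MS_SYNC))  /* ...then pin down this version */
                    return 1;

            munmap(p, 4096);
            return close(fd);
    }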

PS: mind if I publish this conversation ? :)

go for it

very informative

Might rewrite it later as a doc :p

i want to emphasize that this approach REALLY IS COMPLETELY OK

that is

here's the hilarious part


Qantourisc ponders

that is literally the ONLY THING throwing a giant monkey wrench into this approach

and it's fucking stupid, but it's where we are

py1hon: ... there is a fix I think

remember: if the filesystem is checksumming the data, then the FILESYSTEM damn well cares about which version gets written because it has to push down the correct checksum

but I don't know if it's allowed

but if userspace is scribbling over the data underneath everyone... oh fuck

don't write until you get a fsync

no that doesn't work, for a variety of reasons

performance will however be ... meh

there really isn't a good solution


bounce is also a way

yes, but

if your solution is bouncing, then you have to preemptively bounce everything

which is stupid

py1hon: just the dirty pages, no ?

i mean every single write you do from the page cache you have to bounce

well you need a copy you can trust


end of story :/

there's an alternative

I missed one ?

we talked about it earlier

Qantourisc feels inferior

you flip the pages to RO in userspace's mapping

just until the write completes

but ?

(other than extra work)


should be fine, right? i mean, we're not writing that many pages at once at any given time

writes are fast because most devices cache writes

what could go wrong?

write order, disks lie

power loss


we're only talking about flipping pages to RO in userspace for the duration of the write to avoid having to bounce them

nothing else changes

if app does fsync, we still issue a cache flush
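
(A toy model of the mechanism as described: during writeback the page is write-protected in every userspace mapping, and a write fault taken while the I/O is in flight has to wait for it. All names invented; in the kernel this lives in the writeback and fault paths, and the wait sleeps rather than spins.)

    #include <stdbool.h>

    struct spage {
            volatile bool under_writeback;
    };

    static void start_writeback(struct spage *p)
    {
            /* write-protect all userspace mappings, then submit the
             * I/O; the contents can no longer change under the
             * checksum that was computed */
            p->under_writeback = true;
    }

    /* fault path: userspace wrote to a write-protected page */
    static void handle_write_fault(struct spage *p)
    {
            /* a pure memory write now blocks on device I/O - this
             * wait is where the slowdown discussed below comes from */
            while (p->under_writeback)
                    ;
            /* then: map RW, mark dirty, resume userspace as usual */
    }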

looks fine, did I miss anything ?

devices USUALLY don't lie about cache flushes in the past 10-20 years because people get very angry if they do

py1hon: PRO users get angry

desktop users ... maybe

i know but enterprise storage people are fucking morons

trust me

you don't even want to know

py1hon: i've seen my own morons

i know

they had 2 SANs

and they're morons too

but dear god fuck enterprise storage

they were syncing 2 RAIDs between 2 SANs

1 was a RAID5 and the other RAID10

so, stable pages:

and I was like ... guys ... this stack probably doesn't return writes as complete until all RAIDs have hit the platter

... downloaded the docs for the SAN, and it was true

year or so back, they did that for stable pages, flipping them RO in the userspace mappings

SAN admin ...; ooow ...euu ... dang

btrfs needs stable pages for checksums

other stuff in the io stack would like stable pages

it was regarded as generally a good idea

stable page == page you can trust userspace will not modify (without you knowing)

so this was pushed out


trouble is, after it was released

someone came up with a benchmark that got like 200x slower

py1hon: writing to the same page ?

yeah, just because of userspace having to wait if they tried to write to a page that was being written out

The correct reply would then be "We are sorry, we cannot guarantee this without messing up your data, or we bounce"

and if you think about it

userspace having to block on IO

when all it's trying to do is change something in memory

that is kinda stupid

and adding latency like that actually is a shitty thing to do because then someone is going to have to debug a horrific latency bug years later and be really pissed when they figure out what it was

Well, there are 2 options here

wait no option


today, it's pretty much just bouncing

now, what we COULD do, in THEORY

why is bouncing sooo bad ?

it's not that bouncing is bad, exactly

it's that bouncing EVERY SINGLE FUCKING WRITE when 99% of them won't be modified is retarded

py1hon: you could RO-lock, and when requesting an unlock, bounce

i was actually about to bring that up

that is what you'd like to be able to do
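
(That idea, sketched: keep pages write-protected during writeback, but instead of blocking a faulting writer, hand it a fresh copy of the page and let the original finish its I/O undisturbed - so only pages actually modified mid-writeback pay for a copy. Invented names, and not how the kernel works today.)

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* fault path: userspace wrote to a page that's under writeback */
    static void *bounce_on_write_fault(void *old_page)
    {
            void *new_page = malloc(PAGE_SIZE);

            if (!new_page)
                    return NULL;   /* real code would fall back to waiting */
            memcpy(new_page, old_page, PAGE_SIZE);
            /* remap userspace at new_page (RW, dirty); old_page stays
             * owned by the in-flight write until it completes */
            return new_page;
    }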


if you tell the VM developers that this is what you want to do

the guys who work on the page cache code and all that crap

Qantourisc says fuck's sake, another but ?

i'm pretty sure they just run away screaming

py1hon: why ?

apparently (and this part of the code I know fuck all about) swapping out the page that's mapped into userspace like this would be a giant pain in the ass

because it's not easy ?


like i said, i don't know that code

i would actually imagine with all the stuff that's been going on for page migration it ought to be doable these days


i am not a VM developer

py1hon: btw how much room do you need to bounce ?

as in MB's

it's not the amount of memory you need, you only need it for the writes that are in flight

it's the overhead of all the memcpys and the additional cache pressure that sucks


that or you disable checksums :D


maybe this should be an ioctl option one day ?

preferably yesterday :D


my priority is getting all the bugs fixed

so if the program doesn't care about checksums, and wants its 200x speed back at 0 bounce cost

he can have it

py1hon: this would be a general kernel feature :)

which ... right, you would need to add :p

and realistically it's not THAT big of a performance impact

py1hon: btw i'm still kinda set on writing code to stress-test the lot :p

I really don't trust IO stacks

many disks in the past have lied through their teeth

and so has the kernel

xfstests actually does have quite a few good tests for torture testing fsync


but i'm talking while yanking power :D

and nothing fundamental has changed w.r.t. fsync since early days of bcache

so that stuff has all been tested for a looong time

and bcache ain't perfect if you really hammer on it, but i know about those bugs and they're fixed in bcachefs :p

And it's not just about bcache kernel code

it's also about disks

yeah i don't want to know about whatever you find there :P