Killian De Volder (Qantourisc) and I ended up having a long discussion about TCQ vs. write caching, stable pages, and other random stuff about the Linux IO stack - none of it bcachefs specific, but we thought it might be useful and/or interesting enough to be worth dumping into the wiki and possibly cleaning up later.
IRC conversation is reproduced below:
http://brad.livejournal.com/2116715.html
drkshadow: sidenote: I think I will write a custom tool to test this that is less complicated to run
as one can use math to verify the consistency of a file
heh
that stuff is a pain
py1hon: what stuff ?
making sure syncs are passed down correctly
py1hon: writing the tool, or dealing with that shitstorm of cache lies?
i spotted a bug (hope it's fixed now...) in the DIO code - with O_DIRECT|O_SYNC aio the sync just wasn't happening
py1hon: i'd like to know what setups / disks work ... I run everything with write-cache off, it's painful
i mean finding/testing every possible way code can fuck up
you read enough code you see scary shit
py1hon: ... still ?
i haven't checked
reminds me of 2006 and lvm
really don't want to :p
"ow now we handle syncs"
i have enough bugs i'm responsible for thank you very much
i was like ... whaaaat ?
hah
yeaaaa
so I run all my servers in direct IO :/
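(For reference, a minimal sketch of the kind of O_DIRECT|O_SYNC aio write being discussed, using libaio. The filename and sizes are made up and this is not the actual reproducer; the point is that the completion coming back is supposed to imply durability, and the bug described above meant it didn't.)

    /* build: gcc -o aiosync aiosync.c -laio */
    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <err.h>

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;

        /* O_DIRECT requires aligned buffers, lengths and offsets */
        if (posix_memalign(&buf, 4096, 4096))
            err(1, "posix_memalign");
        memset(buf, 0x5a, 4096);

        int fd = open("testfile", O_WRONLY|O_CREAT|O_DIRECT|O_SYNC, 0644);
        if (fd < 0)
            err(1, "open");

        if (io_setup(1, &ctx) < 0)
            errx(1, "io_setup failed");

        io_prep_pwrite(&cb, fd, buf, 4096, 0);
        if (io_submit(ctx, 1, cbs) != 1)
            errx(1, "io_submit failed");

        /* with O_SYNC, this completion must not arrive before the
         * data is actually persistent on the device */
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
            errx(1, "io_getevents failed");
        return 0;
    }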
i think that guy's TCQ thing is a red herring though
TCQ is just another form of cache handling
the trouble with the TCQ/barrier model is that it's utterly insane trying to preserve ordering all the way up and down the io stack
thus
py1hon: i'm just VERY confused why handling sync in the kernel is so hard
software ends up being coded to just use explicit flushes
Qantourisc: i think mostly it's because a lot of storage developers rode in on the short bus
"short bus" ?
yes
what's that ?
the bus you ride to school if you have "special needs"
ugh
imo, each bio request should return AFTER the write is DONE
OR
return a promise
in an ideal world yea
the trouble is
write caching is really useful
how anything else is allowed, I do not understand, me eyeballs linus
the reason is
until the write completes
that's what the promises are for
that memory buffer - you can't touch it
and when you request a sync, you wait until they are all completed (imo)
but yes it's some work
hold on i'm explaining :P
so this comes up most notably with the page cache
also "stable pages" was still a clusterfuck last i checked
but it'd be even worse without write caching
so
if you're a filesystem that cares about data and checksums and crap
you checksum the data you're writing, and then the data better goddamn not change until the write completes - but say userspace is changing it, because that's what they do
or if it does, your checksum is invalid
which means:
to write some data that's cached in the page cache, and mapped into potentially multiple userspace processes Qantourisc makes a note: if userspace writes, just update the checksum, or are they writing to the buffer you are about to write?
you have to mark those page table entries RO, flush tlbs etc. etc. so that userspace can't modify that crap
if userspace modifies the buffer while the write is in flight
you don't know if the old or the new version was the one that got written
it's a race
you're fucked
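(A runnable userspace analogue of that race, assuming nothing beyond POSIX: one thread plays the filesystem, checksumming a buffer and then writing it out, while another plays userspace, scribbling on the same buffer. Whether the checksummed version, the scribbled version, or a mix of both lands in the file is a coin toss. Build with -pthread.)

    #include <pthread.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    static char buf[4096];

    /* toy checksum, standing in for whatever the filesystem uses */
    static uint32_t toy_csum(const char *p, size_t n)
    {
        uint32_t c = 0;
        while (n--)
            c = c * 31 + (unsigned char)*p++;
        return c;
    }

    /* "userspace": keeps scribbling, because that's what they do */
    static void *scribbler(void *arg)
    {
        (void)arg;
        for (;;)
            buf[0]++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        if (pthread_create(&t, NULL, scribbler, NULL))
            errx(1, "pthread_create failed");

        int fd = open("out", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        if (fd < 0)
            err(1, "open");

        /* "filesystem": checksum the buffer, then write it out - the
         * scribbler may change it mid-copy, so old, new, or a mix of
         * both can end up in the file, and the checksum is a lie */
        uint32_t csum = toy_csum(buf, sizeof(buf));
        if (pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf))
            err(1, "pwrite");
        printf("checksummed %u, but who knows what got written\n", csum);
        return 0;
    }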
... imo once you submit to the kernel what you will write: it's hands off
guess they wanted to prevent an extra page copy
ignore direct io for now (that has its own issues though)
yes, that extra copy is bad
py1hon: so is screwing up your data :)
not that bad, that's actually what bcachefs does
but it's a shitty situation
also having to allocate memory to copy data just so you can do io, that's also not a great situation
screwing up data < extra memory copy
there are lots of good reasons why you don't want to bounce data you're writing if you don't have to
and mostly, you don't have to
anyways
Also what's wrong with doing "fu" to userspace?
don't do it wrong ?
or is writing to it "allowed"?
like i said bcachefs is just copying the data
which i'm sure someone is going to complain about eventually
sigh
so getting back to the tcq vs write caching thing
fundamental issue is, while a write is outstanding the device owns that buffer, you can't touch it or reuse it for anything else
py1hon: then you reply: "I am sorry, the current design of what is allowed in the kernel API is too liberal, I can't write out data you can constantly modify while i'm writing. Complain to Linus to fix this braindead design."
if doing a write is fast, because writes are cached, this isn't a huge deal
you can just wait until the write completes, and (potentially) avoid having to bounce
py1hon: quick break: "bounce"?
bounce = allocate a bounce buffer, copy data into bounce buffer, write using bounce buffer instead of original buffer
ah ok
"copy page" in a sense
if doing a write is slow, because it's always waiting for the device to physically make it persistent
then, you're probably going to end up bouncing all the writes on the host - introducing an extra copy
but
this is stupid
because the writes are gonna get copied to the device's buffer regardless, so it can do the actual io
so if you have to copy the data to the device's buffer anyways
sounds logical: everyone who promises to write something out later should copy the data
just do that INSTEAD of bouncing
boom, done
except not really, there's shitty tradeoffs all around
why are they favoring speed over safety ?
anyways, there's really no good answers to the bouncing and tcq vs write buffering stuff
well they're not these days
excepting bugs of course
so they bounce more, then?
no you don't want to know about the bouncing situation
pretend i didn't bring that up because then i'd have to talk about stable pages and like i said that's another bag of shit
anyways
userspace has no concept of tcq
or when a write hits the device
or what
all userspace has is fsync and O_SYNC
and that's fine
those interfaces are completely adequate in practice
the kernel just has to make sure it actually honors them, which (again, excluding bugs), it does these days
so who could write to the page being written, then (in the earlier example with the checksum)?
whether it honors fsync and O_SYNC by using flushes or by using TCQ doesn't matter one damn to userspace
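(The entire userspace durability contract fits in a few lines - write, fsync, and check errors on both. How the kernel implements the fsync underneath, cache flush or TCQ or otherwise, is invisible here:)

    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    int main(void)
    {
        int fd = open("important.dat", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        if (fd < 0)
            err(1, "open");
        if (write(fd, "payload", 7) != 7)
            err(1, "write");
        if (fsync(fd) < 0)   /* data is only durable once this returns */
            err(1, "fsync");
        if (close(fd) < 0)
            err(1, "close");
        return 0;
    }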
ok, so you mmap a file
agreed, fsync/O_SYNC is enough
how's that work?
or say multiple processes mmap the same file
MAP_SHARED
py1hon: the only way I see this working: cow
no
that would be MAP_PRIVATE
say they're all writing to the file
so all their changes, via the mmap() mapping, have to be written out to disk (and also seen by each other blah blah)
sec, first: mmap works by pages, correct? and those pages are backed by the FS?
yes
and yes
so that file is cached in the page cache
hold on constructing mental model
it's cached just once
then
every time a process calls mmap()
py1hon: does the kernel know when one has written to a page ?
the kernel maps those same physical pages to locations in that process's address space by setting up page table entries and mappings and all that crap
yes
with help from the CPU
ah, nice way :)
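(Concretely, the setup being described - every process that does this gets page table entries pointing at the same physical page cache pages. Filename made up:)

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    int main(void)
    {
        int fd = open("shared.dat", O_RDWR|O_CREAT, 0644);
        if (fd < 0)
            err(1, "open");
        if (ftruncate(fd, 4096) < 0)   /* make sure page 0 exists */
            err(1, "ftruncate");

        /* every process mapping this file MAP_SHARED gets the same
         * physical page cache page at offset 0 */
        char *p = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            err(1, "mmap");

        p[0] = 'x';   /* dirties the shared page; no write() involved */
        return 0;
    }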
page starts out clean, right?
yap
so when the kernel maps that clean page into the userspace's address space, it maps it read only
REGARDLESS of whether userspace wants read write access or not
then, when userspace writes to it, it traps
SIGBUS, except not, because the kernel sees what was going on
kernel switches it to read write, notes the page is now dirty, and continues on like nothing ever happened
but how does it detect a second write ?
doesn't need to
all it needs to know is that the page is now dirty
and, at some point in the future, has to be written
oh right, if you want to write it, lock it again first?
make it read only, write it, mark it clean
userspace writes again, cycle repeats
clean == RO map ?
yea
or a bitmap somewhere ?
bit in struct page
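(The same trick is easy to demo from userspace, with mprotect() and a SIGSEGV handler standing in for the kernel's fault path: the page starts read-only, the first store traps, the handler notes it dirty and flips it read-write, and execution continues like nothing happened. Toy code - calling mprotect() from a signal handler isn't strictly portable, but works on Linux:)

    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static char *page;
    static int dirty;

    static void on_fault(int sig, siginfo_t *si, void *uc)
    {
        (void)sig; (void)si; (void)uc;
        dirty = 1;                                  /* note: page is dirty */
        mprotect(page, 4096, PROT_READ|PROT_WRITE); /* let the store retry */
    }

    int main(void)
    {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* page starts out clean, so it's mapped read-only */
        page = mmap(NULL, 4096, PROT_READ,
                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

        page[0] = 'x';   /* traps once; handler marks dirty, flips to RW */
        page[1] = 'y';   /* no trap: a second write needn't be detected  */

        printf("dirty = %d\n", dirty);
        return 0;
    }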
however: that was all a lie
that's how it works conceptually
ok who has cut corners, and why :(
but, dirtying pages and crap like that is important enough that CPUs actually track this stuff for you without you having to map pages read only and having userspace go to the trouble of faulting
the end result is, the kernel can just check the page table entries that it sets up for the CPU to see if userspace has written to them
ok, we can detect this differently, sounds nice, and it blows up in our face how?
(this actually isn't 100% my area, i would be reading docs to check the details on this stuff if i cared)
ok so, nice thing about this is
no more traps :)
pages are never read only! userspace can always write to them!
yes, no more traps!
(aka, minor faults)
annoying side effect:
if pages are never read only...
userspace can KEEP WRITING TO THEM WHILE THEY'RE BEING WRITTEN
remember
if we're not bouncing
and we don't want to bounce if we don't have to, so usually we don't
.... so why are they not marked RO ?
the buffer we're writing out is literally the same buffer that is mapped into userspace
because if they were marked RO, userspace would trap and things would be slow
now
py1hon: i'd argue it would NOT be slow
oh i'll get to that
either things are going bad: we are racing on the write
this is where it starts to get hilarious
no not yet
and slow is "ok"
this is how it worked for many years
and it worked fine
reason is
or things are not racing and it should be fine
if there's a write in flight
and userspace is scribbling over that buffer with new data
who cares? we're going to overwrite that write with the new version later
it got marked dirty again
there is of course 1 aspect of mmap'd files: write order not guaranteed
if userspace cares about which specific version gets written, they need to stop scribbling over stuff and call fsync()
if one looks at this api
no really, trust me, this actually does work completely fine
no data integrity is broken
py1hon: with fsync it works too yes
but if the app refuses to wait: not a clue what version you will get
yes but that's fine!
if the app isn't doing an fsync, it CANNOT care
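(That contract in code, with a made-up filename: scribble through the mapping as much as you like, and when you finally care which version is on disk, stop and sync:)

    #include <sys/mman.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    int main(void)
    {
        int fd = open("versioned.dat", O_RDWR|O_CREAT, 0644);
        if (fd < 0)
            err(1, "open");
        if (ftruncate(fd, 4096) < 0)
            err(1, "ftruncate");

        char *p = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            err(1, "mmap");

        memcpy(p, "version 1", 9);   /* either version might hit disk   */
        memcpy(p, "version 2", 9);   /* in the meantime - and that's ok */

        /* stop scribbling and pin a version: once this returns,
         * "version 2" is the version on disk */
        if (msync(p, 4096, MS_SYNC) < 0)
            err(1, "msync");
        return 0;
    }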
PS: mind if I publish this conversation ? :)
go for it
very informative
Might rewrite it later as a doc :p
i want to emphasize that this approach REALLY IS COMPLETELY OK
that is
here's the hilarious part
UNTIL THE FILESYSTEM STARTS CHECKSUMMING DATA Qantourisc ponders
that is literally the ONLY THING throwing a giant monkey wrench into this approach
and it's fucking stupid, but it's where we are
py1hon: ... there is a fix I think
remember: if the filesystem is checksumming the data, then the FILESYSTEM damn well cares about which version gets written because it has to push down the correct checksum
but I don't know if it's allowed
but if userspace is scribbling over the data underneath everyone... oh fuck
don't write until you get an fsync
no that doesn't work, for a variety of reasons
performance will however be ... meh
there really isn't a good solution
so
bounce is also a way
yes, but
if your solution is bouncing, then you have to preemptively bounce everything
which is stupid
py1hon: just the dirty pages, no?
i mean every single write you do from the page cache you have to bounce
well you need a copy you can trust
well
end of story :/
there's an alternative
I missed one ?
we talked about it earlier Qantourisc feels inferior
you flip the pages to RO in userspace's mapping
just until the write completes
but ?
(other than extra work)
yeah
should be fine, right? i mean, we're not writing that many pages at once at any given time
writes are fast because most devices cache writes
what could go wrong?
write order, disks lie
power loss
no
we're only talking about flipping pages to RO in userspace for the duration of the write to avoid having to bounce them
nothing else changes
if app does fsync, we still issue a cache flush
looks fine, did I miss anything?
devices USUALLY don't lie about cache flushes in the past 10-20 years because people get very angry if they do
py1hon: PRO users get angry
desktop users ... maybe
i know but enterprise storage people are fucking morons
trust me
you don't even want to know
py1hon: i've seen my own morons
i know
they had 2 SANs
and they're morons too
but dear god fuck enterprise storage
they were syncing 2 raids between 2 SANs
1 was a RAID5 and the other a RAID10
so, stable pages:
and I was like ... guys ... this stack probably doesn't return writes as complete until all raids have hit the platter
... downloaded the docs of the SAN, and it was true
year or so back, they did that for stable pages, flipping them RO in the userspace mappings
SAN admin: ooow ... euu ... dang
btrfs needs stable pages for checksums
other stuff in the io stack would like stable pages
it was regarded as generally a good idea
stable page == page you can trust userspace will not modify (without you knowing)
so this was pushed out
yes
trouble is, after it was released
someone came up with a benchmark that got like 200x slower
py1hon: writing to the same page ?
yeah, just because of userspace having to wait if they tried to write to a page that was being written out
The correct reply would then be "We are sorry, we cannot guarantee this without messing up your data, or we bounce"
and if you think about it
userspace having to block on IO
when all it's trying to do is change something in memory
that is kinda stupid
and adding latency like that actually is a shitty thing to do because then someone is going to have to debug a horrific latency bug years later and be really pissed when they figure out what it was
Well, there are 2 options here
wait, no, one option
bouncing
today, it's pretty much just bouncing
now, what we COULD do, in THEORY
why is bouncing sooo bad ?
it's not that bouncing is bad, exactly
it's that bouncing EVERY SINGLE FUCKING WRITE when 99% of them won't be modified is retarded
py1hon: you could RO lock, and when an unlock is requested, bounce
i was actually about to bring that up
that is what you'd like to be able to do
however
if you tell the VM developers that this is what you want to do
the guys who work on the page cache code and all that crap Qantourisc says fuck's sake, another but?
i'm pretty sure they just run away screaming
py1hon: why ?
apparently (and this part of the code I know fuck all about) swapping out the page that's mapped into userspace like this would be a giant pain in the ass
because it's not easy ?
yeah
like i said, i don't know that code
i would actually imagine with all the stuff that's been going on for page migration it ought to be doable these days
but
i am not a VM developer
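(For the record, the hypothetical being discussed, in made-up kernel-ish pseudocode - none of these helpers exist, and the middle line of the fault path is exactly the part the VM people would run away screaming from:)

    struct page;

    /* all hypothetical: */
    extern struct page *alloc_page(void);
    extern void write_protect_mappings(struct page *);
    extern void submit_io(struct page *);
    extern void copy_page_contents(struct page *dst, struct page *src);
    extern void replace_in_page_cache(struct page *old, struct page *new);
    extern void remap_userspace_writable(struct page *);

    void begin_writeback(struct page *page)
    {
        write_protect_mappings(page);  /* flip userspace PTEs to RO    */
        submit_io(page);               /* device reads the page itself */
    }

    /* only runs if userspace actually faults mid-writeback - the 99%
     * of pages nobody touches are never copied */
    void wp_fault_during_writeback(struct page *old)
    {
        struct page *new = alloc_page();

        copy_page_contents(new, old);     /* userspace sees the same data  */
        replace_in_page_cache(old, new);  /* the giant-pain-in-the-ass bit */
        remap_userspace_writable(new);    /* userspace resumes immediately */
        /* "old" stays frozen until the io completes, so the checksum
         * that went down with it stays valid */
    }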
py1hon: btw how much room do you need to bounce?
as in MBs
it's not the amount of memory you need, you only need it for the writes that are in flight
it's the overhead of all the memcpys and the additional cache pressure that sucks
yea
that or you disable checksums :D
yep
maybe this should be an ioctl option one day?
preferably yesterday :D
eventually
my priority is getting all the bugs fixed
so if the program doesn't care about checksums, and wants its 200x speed back at 0 bounce cost
he can have it
py1hon: this would be a general kernel feature :)
which ... right, you would need to add :p
and realistically it's not THAT big of a performance impact
py1hon: btw i'm still kinda set on writing code to stress-test the lot :p
I really don't trust IO stacks
many disks in the past have lied through their teeth
and so has the kernel
xfstests actually does have quite a few good tests for torture testing fsync
sweet
but i'm talking while yanking power :D
and nothing fundamental has changed w.r.t. fsync since early days of bcache
so that stuff has all been tested for a looong time
and bcache ain't perfect if you really hammer on it, but i know about those bugs and they're fixed in bcachefs :p
And it's not just about bcache kernel code
it's also about disks
yeah i don't want to know about whatever you find there :P
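(Finally, a sketch of the verification tool idea from the top of the conversation - every block stamps its own offset, a generation number, and a checksum over those fields, gets fsynced, and a later "check" pass can tell mathematically which writes actually survived a power yank. Entirely made up, not an existing tool:)

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    #define BLKSZ 4096
    #define NBLKS 256

    struct blk {
        uint64_t offset;          /* where this block claims to live */
        uint64_t gen;             /* bump this on every rewrite      */
        uint32_t csum;            /* checksum over the fields above  */
        char     pad[BLKSZ - 20];
    };

    static uint32_t csum(const struct blk *b)
    {
        return (uint32_t)(b->offset * 2654435761u ^ b->gen * 40503u);
    }

    int main(int argc, char **argv)
    {
        int check = argc > 1 && !strcmp(argv[1], "check");
        int fd = open("testfile", O_RDWR|O_CREAT, 0644);
        struct blk b;

        if (fd < 0)
            err(1, "open");

        for (uint64_t i = 0; i < NBLKS; i++) {
            if (check) {
                if (pread(fd, &b, sizeof(b), i * BLKSZ) != sizeof(b))
                    err(1, "pread");
                if (b.offset != i * BLKSZ || b.csum != csum(&b))
                    printf("block %llu: missing or torn\n",
                           (unsigned long long)i);
            } else {
                memset(&b, 0, sizeof(b));
                b.offset = i * BLKSZ;
                b.gen    = 1;
                b.csum   = csum(&b);
                if (pwrite(fd, &b, sizeof(b), i * BLKSZ) != sizeof(b))
                    err(1, "pwrite");
                /* if this fsync lies, a "check" run after the power
                 * cut will catch it */
                if (fsync(fd) < 0)
                    err(1, "fsync");
            }
        }
        return 0;
    }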