
Re: Could RAM possibly be just 3-4 times faster than bare hdd writes and reads? or, is the Linux kernel doing its 'magic' in the bg? or, ...



On Thu, Jun 18, 2020 at 05:28:11PM +0300, Reco wrote:
	Hi.

On Thu, Jun 18, 2020 at 08:57:48AM -0400, Michael Stone wrote:
On Thu, Jun 18, 2020 at 08:50:49AM +0300, Reco wrote:
> On Wed, Jun 17, 2020 at 05:54:51PM -0400, Michael Stone wrote:
> > On Wed, Jun 17, 2020 at 11:45:53PM +0300, Reco wrote:
> > > Long story short, if you need a primitive I/O benchmark, you're better
> > > with both dsync and nocache.
> >
> > Not unless that's your actual workload, IMO. Almost nothing does sync i/o;
>
> Almost everything does (see my previous e-mails). Not everything does it
> with O_DSYNC, that's true.

You're not using the words like most people use them, which certainly confuses the conversation.

Earlier in this thread someone posted a link to a Wikipedia article on the
matter. Whatever terminology I'm using is consistent with it.
That qualifies as "common terminology" IMO.

It would really be better to drop any kind of metaphysical argument about what to call things and just focus on command lines and other concrete examples. Again, you seem fixated on certain APIs, and then you make leaps in other contexts where the distinctions you're trying to draw don't apply.

writing one block at a time is *really* *really* bad for performance.

True. But it's also good for the integrity of the written data, which is
presumably why sqlite upstream did it.

strace -e open,openat,fsync,fdatasync sqlite3 test.sqlite3
[snip]
SQLite version 3.32.2 2020-06-04 12:58:43
Enter ".help" for usage hints.
[snip]
sqlite> create table test (test varchar); openat(AT_FDCWD, "test.sqlite3", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/test.sqlite3", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 5
openat(AT_FDCWD, "/tmp/test.sqlite3-journal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
openat(AT_FDCWD, "/dev/urandom", O_RDONLY|O_CLOEXEC) = 7
fdatasync(6)                            = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7)                            = 0
fdatasync(6)                            = 0
fdatasync(5)                            = 0
sqlite> insert into test VALUES ('foo');
openat(AT_FDCWD, "/tmp/test.sqlite3-journal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
fdatasync(6)                            = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7)                            = 0
fdatasync(6)                            = 0
fdatasync(5)                            = 0
sqlite> update test set test = 'bar';
openat(AT_FDCWD, "/tmp/test.sqlite3-journal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
fdatasync(6)                            = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7)                            = 0
fdatasync(6)                            = 0
fdatasync(5)                            = 0

No O_DSYNC to be seen, but quite a few fdatasync calls! You don't seem to be checking that what you're saying matches actual practice and behavior.
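
For the record, the pattern that trace shows boils down to something like the following C sketch. (This is an illustration, not SQLite's actual code: file names are made up and error handling is omitted.)

/* Rough sketch of the ordering the strace output shows: buffered
 * writes, then fdatasync() at the points where data must be durable.
 * No O_DSYNC anywhere. */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* journal first: it must hit the disk before the main file */
    int jfd = open("/tmp/test.sqlite3-journal",
                   O_RDWR | O_CREAT | O_CLOEXEC, 0644);
    write(jfd, "journal record", 14);
    fdatasync(jfd);                 /* journal content is durable */

    /* sync the directory so the journal's name is durable too */
    int dfd = open("/tmp", O_RDONLY | O_CLOEXEC);
    fdatasync(dfd);

    /* only now touch the database file itself */
    int fd = open("/tmp/test.sqlite3",
                  O_RDWR | O_CREAT | O_CLOEXEC, 0644);
    write(fd, "page update", 11);
    fdatasync(fd);

    close(jfd); close(dfd); close(fd);
    return 0;
}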

Most applications for which I/O performance is important allow writes to buffer, then
flush the buffers as needed for data integrity.

No objections here. Most applications write their files as a whole, and it
makes total sense to do it this way. But there are exceptions to this
rule, and if an application modifies its files piecewise, it probably uses
O_DSYNC to be sure.

See above.
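
To make the contrast concrete, here's a rough C sketch of the two approaches, with arbitrary sizes and file names, and no error or short-write handling:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

#define CHUNK 4096
#define CHUNKS 1024

static char buf[CHUNK];

/* every write() waits for the device: per-write durability, slow */
static void sync_every_write(void)
{
    int fd = open("dsync.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    for (int i = 0; i < CHUNKS; i++)
        write(fd, buf, CHUNK);
    close(fd);
}

/* writes land in the page cache; one flush establishes durability */
static void sync_at_the_end(void)
{
    int fd = open("buffered.dat", O_WRONLY | O_CREAT, 0644);
    for (int i = 0; i < CHUNKS; i++)
        write(fd, buf, CHUNK);
    fdatasync(fd);
    close(fd);
}

int main(void)
{
    sync_every_write();
    sync_at_the_end();
    return 0;
}

The first version is bounded by the device on every single write(); the second only waits once at the end, which is exactly why the two kinds of "benchmark" measure different things.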

> > simply using conv=fdatasync to make sure that the cache is flushed before exiting
> > is going to be more representative.
>
> If you're answering the question "how fast is my programs are going to
> write there" - sure. If you're answering the question "how fast my
> drive(s) actually is(are)" - nope, you need O_DSYNC.

While OF COURSE the question people want answered is "how fast is my programs are going to write there"

But the most important hidden question here is: which programs?
The ones that write their files in one big chunk (which is common), or
the ones that do it one piece at a time (any RDBMS, for instance)?

See above. An RDBMS usually tries really hard to coalesce write operations rather than writing lots of tiny pieces, even at the cost of writing the data twice: once to a sequential journal, and again as part of combined random writes.

Real programs that write large amounts of data have to handle the possibility of partial writes *even if* they are using O_DSYNC. In non-trivial cases, if you're already doing the work to handle the problems that come with a partial write, you can just as easily write larger amounts of data unsynchronized to get better performance, then establish a synchronization point with f(data)sync. There are cases where O_DSYNC might be the best option, mostly around appending in relatively small chunks. Otherwise, as above, you're probably using some kind of journal, and there's no reason to slow down every operation when you only need things to hit the disk in a certain relative order to get the same level of integrity.
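
To sketch that last point in C (hypothetical names, no error handling): write the journal record, sync it, do the in-place writes unsynchronized, and sync once more at the commit point. Only the two ordering points wait on the disk:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Commit one change: journal record first, in-place writes second.
 * Two sync points total, and neither write path uses O_DSYNC. */
static void commit(int jfd, int dfd)
{
    /* 1. the journal record must be durable before anything else */
    write(jfd, "undo/redo record", 16);
    fdatasync(jfd);                  /* ordering point #1 */

    /* 2. the in-place writes can stay in the page cache for now... */
    pwrite(dfd, "new page contents", 17, 0);

    /* 3. ...but must be durable before the journal is discarded */
    fdatasync(dfd);                  /* ordering point #2 */
    ftruncate(jfd, 0);               /* commit: journal obsolete */
}

int main(void)
{
    int jfd = open("data.journal", O_RDWR | O_CREAT, 0644);
    int dfd = open("data.file", O_RDWR | O_CREAT, 0644);
    commit(jfd, dfd);
    close(jfd); close(dfd);
    return 0;
}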

This has also all evolved over time and across systems. Programs may behave one way on one system and another way on another because they have or lack certain guarantees, or because of dramatic performance differences. E.g., postgresql's write-ahead log is an obvious candidate for O_DSYNC, but even there it defaults to fdatasync on linux because of historic cases where O_DSYNC behaved dramatically worse. (The two should be close to identical if you fdatasync after every single write, with slightly higher overhead from making two separate system calls, but in some cases that wasn't happening. I think it no longer does, but there's no strong incentive to change the default: when things work properly the difference isn't large, and when they're broken the difference is huge.)
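
That parenthetical in code form, as a sketch with error handling omitted: the two loops below should give the same durability per chunk, the second just pays for an extra system call each time around:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

enum { CHUNK = 4096, N = 256 };
static char buf[CHUNK];

int main(void)
{
    /* O_DSYNC: every write() is synchronous by itself */
    int a = open("a.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    for (int i = 0; i < N; i++)
        write(a, buf, CHUNK);

    /* plain open() plus fdatasync() after every single write:
     * the same durability, one extra system call per chunk */
    int b = open("b.dat", O_WRONLY | O_CREAT, 0644);
    for (int i = 0; i < N; i++) {
        write(b, buf, CHUNK);
        fdatasync(b);
    }

    close(a); close(b);
    return 0;
}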

