
Re: Could RAM possibly be just 3-4 times faster than bare hdd writes and reads? or, is the Linux kernel doing its 'magic' in the bg? or, ...



On Thu, Jun 18, 2020 at 05:28:11PM +0300, Reco wrote:
	Hi.

On Thu, Jun 18, 2020 at 08:57:48AM -0400, Michael Stone wrote:
On Thu, Jun 18, 2020 at 08:50:49AM +0300, Reco wrote:
> On Wed, Jun 17, 2020 at 05:54:51PM -0400, Michael Stone wrote:
> > On Wed, Jun 17, 2020 at 11:45:53PM +0300, Reco wrote:
> > > Long story short, if you need a primitive I/O benchmark, you're better
> > > with both dsync and nocache.
> >
> > Not unless that's your actual workload, IMO. Almost nothing does sync i/o;
>
> Almost everything does (see my previous e-mails). Not everything does it
> with O_DSYNC, that's true.

You're not using the words like most people use them, which certainly confuses the conversation.

Earlier in this thread someone posted a link to a Wikipedia article on the
matter. Whatever terminology I'm using is consistent with it.
That qualifies as "common terminology" IMO.

It would really be better to drop any kind of metaphysical argument about what to call things and just focus on command lines and other concrete examples. Again, you seem fixated on certain APIs, and then you make leaps in other contexts where the distinctions you're trying to draw don't apply.

writing one block at a time is *really* *really* bad for performance.

True. But it's also good for the integrity of the written data, which is
presumably why sqlite upstream did it.

strace -e open,openat,fsync,fdatasync sqlite3 test.sqlite3
[snip]
SQLite version 3.32.2 2020-06-04 12:58:43
Enter ".help" for usage hints.
[snip]
sqlite> create table test (test varchar); openat(AT_FDCWD, "test.sqlite3", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/tmp/test.sqlite3", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 5
openat(AT_FDCWD, "/tmp/test.sqlite3-journal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
openat(AT_FDCWD, "/dev/urandom", O_RDONLY|O_CLOEXEC) = 7
fdatasync(6)                            = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7)                            = 0
fdatasync(6)                            = 0
fdatasync(5)                            = 0
sqlite> insert into test VALUES ('foo');
openat(AT_FDCWD, "/tmp/test.sqlite3-journal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
fdatasync(6)                            = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7)                            = 0
fdatasync(6)                            = 0
fdatasync(5)                            = 0
sqlite> update test set test = 'bar';
openat(AT_FDCWD, "/tmp/test.sqlite3-journal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 6
fdatasync(6)                            = 0
openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 7
fdatasync(7)                            = 0
fdatasync(6)                            = 0
fdatasync(5)                            = 0

No O_DSYNC to be seen, but quite a few fdatasync calls! You don't seem to be checking that what you're saying matches actual practice and behavior.
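
For the record, the pattern that trace shows boils down to something like the following C sketch. (This is an illustration, not SQLite's actual code: file names are made up and error handling is omitted.)

/* Rough sketch of the ordering the strace output shows: buffered
 * writes, then fdatasync() at the points where data must be durable.
 * No O_DSYNC anywhere. */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* journal first: it must hit the disk before the main file */
    int jfd = open("/tmp/test.sqlite3-journal",
                   O_RDWR | O_CREAT | O_CLOEXEC, 0644);
    write(jfd, "journal record", 14);
    fdatasync(jfd);                 /* journal content is durable */

    /* sync the directory so the journal's name is durable too */
    int dfd = open("/tmp", O_RDONLY | O_CLOEXEC);
    fdatasync(dfd);

    /* only now touch the database file itself */
    int fd = open("/tmp/test.sqlite3",
                  O_RDWR | O_CREAT | O_CLOEXEC, 0644);
    write(fd, "page update", 11);
    fdatasync(fd);

    close(jfd); close(dfd); close(fd);
    return 0;
}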

Most applications for which I/O performance is important allow writes to buffer, then
flush the buffers as needed for data integrity.

No objections here. Most applications write their files as a whole, and it
makes total sense to do it this way. But there are exceptions to this
rule, and if an application modifies its files piecewise, it probably uses
O_DSYNC to be sure.

See above.
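
To make the contrast concrete, here's a rough C sketch of the two approaches, with arbitrary sizes and file names, and no error or short-write handling:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

#define CHUNK 4096
#define CHUNKS 1024

static char buf[CHUNK];

/* every write() waits for the device: per-write durability, slow */
static void sync_every_write(void)
{
    int fd = open("dsync.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    for (int i = 0; i < CHUNKS; i++)
        write(fd, buf, CHUNK);
    close(fd);
}

/* writes land in the page cache; one flush establishes durability */
static void sync_at_the_end(void)
{
    int fd = open("buffered.dat", O_WRONLY | O_CREAT, 0644);
    for (int i = 0; i < CHUNKS; i++)
        write(fd, buf, CHUNK);
    fdatasync(fd);
    close(fd);
}

int main(void)
{
    sync_every_write();
    sync_at_the_end();
    return 0;
}

The first version is bounded by the device on every single write(); the second only waits once at the end, which is exactly why the two kinds of "benchmark" measure different things.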

> > simply using conv=fdatasync to make sure that the cache is flushed before exiting
> > is going to be more representative.
>
> If you're answering the question "how fast is my programs are going to
> write there" - sure. If you're answering the question "how fast my
> drive(s) actually is(are)" - nope, you need O_DSYNC.

While OF COURSE the question people want answered is "how fast is my programs are going to write there"

But the most important hidden question here is: which programs?
The ones that write their files in one big chunk (which is common), or
the ones that do it one piece at a time (any RDBMS, for instance)?

See above. An RDBMS usually tries really hard to coalesce write operations rather than writing lots of tiny pieces, even at the cost of writing the data twice: once to a sequential journal, and again as part of combined random writes.

Real programs that write large amounts of data have to handle the possibility of partial writes *even if* they are using O_DSYNC. In non-trivial cases, if you're already doing the work to handle the problems that come with a partial write, you can just as easily write larger amounts of data unsynchronized to get better performance, then establish a synchronization point with f(data)sync. There are cases where O_DSYNC might be the best option, mostly around appending in relatively small chunks. Otherwise, as above, you're probably using some kind of journal, and there's no reason to slow down every operation when you only need things to hit the disk in a certain relative order to get the same level of integrity.
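
To sketch that last point in C (hypothetical names, no error handling): write the journal record, sync it, do the in-place writes unsynchronized, and sync once more at the commit point. Only the two ordering points wait on the disk:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Commit one change: journal record first, in-place writes second.
 * Two sync points total, and neither write path uses O_DSYNC. */
static void commit(int jfd, int dfd)
{
    /* 1. the journal record must be durable before anything else */
    write(jfd, "undo/redo record", 16);
    fdatasync(jfd);                  /* ordering point #1 */

    /* 2. the in-place writes can stay in the page cache for now... */
    pwrite(dfd, "new page contents", 17, 0);

    /* 3. ...but must be durable before the journal is discarded */
    fdatasync(dfd);                  /* ordering point #2 */
    ftruncate(jfd, 0);               /* commit: journal obsolete */
}

int main(void)
{
    int jfd = open("data.journal", O_RDWR | O_CREAT, 0644);
    int dfd = open("data.file", O_RDWR | O_CREAT, 0644);
    commit(jfd, dfd);
    close(jfd); close(dfd);
    return 0;
}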

This has also all evolved over time and across systems. Programs may behave one way on one system and another way on another because they have or lack certain guarantees, or because of dramatic performance differences. E.g., postgresql's write-ahead log is an obvious candidate for O_DSYNC, but even there it defaults to fdatasync on linux because of historic cases where O_DSYNC behaved dramatically worse. (The two should be close to identical if you fdatasync after every single write, with slightly higher overhead from making two separate system calls, but in some cases that wasn't happening. I think it no longer does, but there's no strong incentive to change the default: when things work properly the difference isn't large, and when they're broken the difference is huge.)
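
That parenthetical in code form, as a sketch with error handling omitted: the two loops below should give the same durability per chunk, the second just pays for an extra system call each time around:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

enum { CHUNK = 4096, N = 256 };
static char buf[CHUNK];

int main(void)
{
    /* O_DSYNC: every write() is synchronous by itself */
    int a = open("a.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    for (int i = 0; i < N; i++)
        write(a, buf, CHUNK);

    /* plain open() plus fdatasync() after every single write:
     * the same durability, one extra system call per chunk */
    int b = open("b.dat", O_WRONLY | O_CREAT, 0644);
    for (int i = 0; i < N; i++) {
        write(b, buf, CHUNK);
        fdatasync(b);
    }

    close(a); close(b);
    return 0;
}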

