
Re: reading an empty directory after reboot is very slow



On 2015-04-22 23:06:47 -0500, David Wright wrote:
> Quoting Vincent Lefevre (vincent@vinc17.net):
> > On 2015-04-21 11:05:58 -0500, David Wright wrote:
> > > Quoting Vincent Lefevre (vincent@vinc17.net):
> > > > On 2015-04-20 13:04:41 -0500, David Wright wrote:
> > > > > Quoting Vincent Lefevre (vincent@vinc17.net):
> > > > > > But with the current solution (no automatic moving of an
> > > > > > entry), you can't miss an entry that hasn't been removed.
> > > [...]
> > > > > ...so if you happen to be reading the entry for file5 at the
> > > > > time I typed mv, you'll get the entry for file4 twice, under
> > > > > two different names. (Or the opposite.)
> > > > 
> > > > OK, so, if the rename(2) system call can reorder the entries
> > > > (this is not quite clear because one doesn't see the empty
> > > > entries here),
> > > 
> > > No, and you wouldn't *normally* see them with readdir, I'd suppose.
> > 
> > But this matters for the implementation in the kernel.
> 
> What's "this"? And what does it matter? You make some system calls,
> and you get replies. They come out of a black box.

I mean that one needs to know what is done internally in order to
prove that there is a problem. Otherwise you need to provide a test
based on readdir, not on the "ls -lU ..." output, which contains
incomplete information (because the kernel has more information,
which could be used by the implementation of readdir).

[...]
> So you don't believe the problem when it's demonstrated,

What I'm saying is that you haven't demonstrated anything: you showed
some "ls -lU ..." output, but did not say what was done internally by
the kernel. So, the readdir test as you have now done below was
necessary.

> but you do believe some hypotheticals you just made up.

I do not believe that. I've just said that this was a possibility.

> Ask yourself why an efficient filesystem would move a load of
> directory entries just because someone renamed a file.

First I wonder why such an efficient filesystem moves a directory
entry when this is not needed. With your example:

drwxr-x--- 2 david david 4096 Apr 21 10:58 file1
drwxr-x--- 2 david david 4096 Apr 21 10:58 file4
drwxr-x--- 2 david david 4096 Apr 21 10:58 file5
drwxr-x--- 2 david david 4096 Apr 21 10:58 file6
drwxr-x--- 2 david david 4096 Apr 21 10:58 file2
drwxr-x--- 2 david david 4096 Apr 21 10:58 file3

becomes

drwxr-x--- 2 david david 4096 Apr 21 10:58 file1
drwxr-x--- 2 david david 4096 Apr 21 10:58 file4
drwxr-x--- 2 david david 4096 Apr 21 10:58 file5
drwxr-x--- 2 david david 4096 Apr 21 10:58 file3file3file3file3file3file3file3file3file3file3file3file3file3
drwxr-x--- 2 david david 4096 Apr 21 10:58 file6
drwxr-x--- 2 david david 4096 Apr 21 10:58 file2

after the rename, while I would have expected:

drwxr-x--- 2 david david 4096 Apr 21 10:58 file1
drwxr-x--- 2 david david 4096 Apr 21 10:58 file4
drwxr-x--- 2 david david 4096 Apr 21 10:58 file5
drwxr-x--- 2 david david 4096 Apr 21 10:58 file6
drwxr-x--- 2 david david 4096 Apr 21 10:58 file2
drwxr-x--- 2 david david 4096 Apr 21 10:58 file3file3file3file3file3file3file3file3file3file3file3file3file3

So, the kernel is doing non-trivial things.

> > What actually needs to be done is a real test
> > using readdir.
> 
> OK. Here's a demonstration of a file going AWOL by moving *up* the
> directory listing. Because of read-ahead, readdir still sees the old
> name and the stat() fails. Again, because of read-ahead, I can't
> demonstrate the opposite effect in the same program because
> you'd have to have a directory bigger than the read-ahead buffer
> in order to see any effect. But, as was said already, its occurrence
> can be discovered by checking the inode numbers for duplicate returns.
> 
> I scan the directory with readdir, then stat the file to obtain its
> inode number. E is stat's return code, I is inode number.
> When the latter matches 497051, I sleep for 5 seconds so that
> another process can rename a file.
> 
> ~ $ for j in 1 2 3 4 5 6 ; do mkdir /tmp/testdir/file$j ; done
> 
> ~ $ /tmp/a.out /tmp/testdir/ ← before doing anything
> 1 E: 0 I: 496992 file1
> 2 E: 0 I: 497007 file4
> 3 E: 0 I: 497039 file5
> 4 E: 0 I: 488682 .
> 5 E: 0 I: 497051 file6
> sleeping ← I give myself 5 seconds to do something
> 6 E: 0 I: 488641 ..
> 7 E: 0 I: 497003 file2
> 8 E: 0 I: 497006 file3
> 
> ~ $ /tmp/a.out /tmp/testdir/ ← during the alteration
> 1 E: 0 I: 496992 file1
> 2 E: 0 I: 497007 file4
> 3 E: 0 I: 497039 file5
> 4 E: 0 I: 488682 .
> 5 E: 0 I: 497051 file6
> sleeping                ← here I renamed file2 (in another xterm)
> 6 E: 0 I: 488641 ..
> 7 E: -1 I: 488641 file2 ← oops, file2 stat() fails (so the inode number is untouched from the previous call)
> 8 E: 0 I: 497006 file3
> 
> ~ $ /tmp/a.out /tmp/testdir/ ← after the alteration
> 1 E: 0 I: 496992 file1
> 2 E: 0 I: 497007 file4
> 3 E: 0 I: 497039 file5
> 4 E: 0 I: 488682 .
> 5 E: 0 I: 497003 file2file2file2file2file2file2file2file2file2file2file2file2file2file2 ← here it is
> 6 E: 0 I: 497051 file6
> sleeping
> 7 E: 0 I: 488641 ..
> 8 E: 0 I: 497006 file3
> ~ $ 
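
The source of a.out is not shown in this thread. A minimal sketch of
the procedure described above (scan with readdir, stat each name, print
the return code and inode, pause at a chosen inode) might look like the
following. It is written in Python rather than C, and the names scan,
pause_inode and pause are hypothetical. Note that os.scandir wraps
opendir/readdir but does not return "." and "..", so the numbering
would differ from the output quoted above.

```python
import os
import time


def scan(path, pause_inode=None, pause=5.0):
    """Yield (n, err, inode, name) per entry, in the order the OS returns them."""
    inode = 0  # kept across iterations, so a failed stat shows the previous value
    with os.scandir(path) as entries:  # lazy iteration over readdir results
        for n, entry in enumerate(entries, start=1):
            try:
                inode = os.stat(os.path.join(path, entry.name)).st_ino
                err = 0
            except OSError:
                err = -1  # the name vanished between readdir and stat
            yield n, err, inode, entry.name
            if err == 0 and inode == pause_inode:
                # Window in which another process can rename an entry.
                time.sleep(pause)
```

Each tuple can be printed as f"{n} E: {err} I: {inode} {name}" to get
lines in the same shape as the quoted output.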

Thanks for this test. I have two remarks:

1. I'm wondering why the kernel moves a renamed directory entry instead
of just modifying it. For easier recovery in case of a serious problem
(hardware failure, kernel crash...)?

2. By checking errors and the ctime, modifications can be detected.
Indeed, either one gets an error as you did above, or a new "file2"
was added in the meantime, in which case one doesn't get an error
but the ctime is recent enough, meaning that the directory will have
to be re-read to be sure (impractical because of a possible endless
loop, but safe).

But if, after an rm, the kernel moved the last entry up to the freed
entry, then the missed object could not be detected in the same way
(as a workaround, perhaps check the mtime of the directory inode?).
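
A minimal sketch of that workaround, in Python and under stated
assumptions: the directory's mtime is compared before and after the
scan, any stat failure also invalidates the scan, and the number of
re-reads is bounded to avoid the endless loop mentioned above. The
names read_stable and max_tries are hypothetical. A change that lands
within the filesystem's timestamp granularity could still go unnoticed,
so this is a heuristic, not a guarantee.

```python
import os


def read_stable(path, max_tries=3):
    """Return {name: inode} from a scan during which the directory looked unchanged."""
    for _ in range(max_tries):
        before = os.stat(path).st_mtime_ns  # directory mtime before the scan
        seen = {}
        clean = True
        with os.scandir(path) as entries:
            for entry in entries:
                try:
                    st = os.lstat(os.path.join(path, entry.name))
                except OSError:
                    clean = False  # entry renamed or removed during the scan
                    break
                seen[entry.name] = st.st_ino
        # Creating, removing or renaming an entry updates the directory mtime.
        if clean and os.stat(path).st_mtime_ns == before:
            return seen
    raise RuntimeError(f"{path} kept changing during {max_tries} scans")
```

Raising after max_tries trades completeness for termination: on a
directory that is modified continuously, the caller is told so instead
of looping forever.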

> > Any idea of the algorithm to choose the directory entries? The fact
> > that the files are not ordered initially is unintuitive.
> 
> A hashing function, so I guess one reads that as "random".
> Oh, oh, I better be careful what I say. "Pseudorandom", as it's
> deterministic. I get the same sequence every time I make those
> files.

Do you know why it is doing this? To retrieve a file faster via the
hash, by starting at the "right" place instead of the beginning of
the directory? I suppose that this would work well only when there
is only one block (small directories), otherwise things get messed
up after that.

[...]
> This subthread started at https://lists.debian.org/debian-user/2015/04/msg01157.html
> with your statement "But with the current solution (no automatic
> moving of an entry), you can't miss an entry that hasn't been removed."
> 
> I disagreed, giving evidence. Take it or leave it. Why should I care?

I now agree with you, as you have just shown.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

