
Re: reading an empty directory after reboot is very slow



On tridi 3 Floréal, year CCXXIII, David Wright wrote:
> OK. Here's a demonstration of a file going AWOL by moving *up* the
> directory listing. Because of read-ahead, readdir still sees the old
> name and the stat() fails.

What are you trying to prove with that test?

You would get the same failure if you put your delay between readdir() and
stat(). And on a preemptive multitasking OS (or even worse: with
multiprocessing), that "delay" could be just the normal run time of the
program. That is called a race condition; I am sure you know that.
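
To make the point concrete, here is a minimal sketch in Python (my choice of
language; the thread has no code of its own). The function name and the
`between` hook are hypothetical: the hook stands in for whatever another
process happens to do in the window between the listing and the stat() calls,
so that the race can be reproduced deterministically.

```python
import os
import tempfile


def list_then_stat(directory, between=None):
    """List a directory, then stat() each name; report names that vanished.

    `between` is a hypothetical hook run after the listing and before the
    stat() calls, standing in for a concurrent process.
    """
    names = sorted(os.listdir(directory))  # comparable to a full readdir() loop
    if between is not None:
        between()
    found, vanished = [], []
    for name in names:
        try:
            os.stat(os.path.join(directory, name))
            found.append(name)
        except FileNotFoundError:  # ENOENT: the name went away after the listing
            vanished.append(name)
    return found, vanished


def demo():
    with tempfile.TemporaryDirectory() as d:
        for name in ("a", "b", "c"):
            open(os.path.join(d, name), "w").close()

        def rename():
            # Rename "b" to "z" in the window between listing and stat().
            os.rename(os.path.join(d, "b"), os.path.join(d, "z"))

        return list_then_stat(d, between=rename)


if __name__ == "__main__":
    print(demo())
```

No read-ahead is involved here: the name "b" is reported as vanished simply
because the rename landed between the two steps.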

The Unix filesystem API has race conditions all over the place, everybody
knows it. To eliminate them would require an explicit transactional API, and
these cause a whole lot of problems of their own (deadlocks).

I do not see any merit in singling out this particular race condition above
all the others.

>			     Again, because of read-ahead, I can't
> demonstrate the opposite effect in the same program because
> you'd have to have a directory bigger than the read-ahead buffer
> in order to see any effect.

Please do. Creating thousands of files takes only a few seconds; strace
can show you the calls to getdents() that lie underneath readdir() and tell
you how many entries are read at once.
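
A rough sketch of such a test, assuming Python and a hypothetical helper name.
It only creates the files and lists them once; the interesting part is running
it under strace. On current Linux systems the underlying call is usually named
getdents64, so something like `strace -e trace=getdents64 python3 script.py`
should show how many bytes each call returns, but that syscall name is an
assumption about your platform.

```python
import os
import tempfile


def make_and_scan(count):
    """Create `count` empty files in a fresh temporary directory and list them.

    Returns the number of entries seen in one full scan.
    """
    with tempfile.TemporaryDirectory() as d:
        for i in range(count):
            open(os.path.join(d, "file%04d" % i), "w").close()
        # os.scandir() walks the directory once, like an opendir/readdir loop.
        with os.scandir(d) as it:
            return sum(1 for _ in it)


if __name__ == "__main__":
    print(make_and_scan(5000))
```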

>			      But, as was said already, it's occurrence
> can be discovered by checking the inode numbers for duplicate returns.

I am not convinced that occurrence happens.

I believe that readdir() should offer the following guarantee over the
course of a single "opendir + full readdir loop":

  All entries that were present in the directory during the whole run are
  returned exactly once, under any of the names they had during the run.

So far, I have not seen any indication that this property is violated, i.e.
that an entry is returned twice or not at all.
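
For anyone who wants to look for a violation, the duplicate-inode check
mentioned earlier can be sketched as follows (Python, hypothetical function
names). One caveat that the check itself cannot resolve: two hard links to the
same file in the same directory legitimately share an inode number, so a
duplicate is only a hint, not proof that readdir() returned an entry twice.

```python
import os
import tempfile
from collections import defaultdict


def duplicate_inodes(directory):
    """Return {inode: [names]} for inodes returned more than once in one scan."""
    seen = defaultdict(list)
    with os.scandir(directory) as it:
        for entry in it:
            seen[entry.inode()].append(entry.name)
    return {ino: sorted(names) for ino, names in seen.items() if len(names) > 1}


def demo():
    with tempfile.TemporaryDirectory() as d:
        for name in ("a", "b"):
            open(os.path.join(d, name), "w").close()
        quiet = duplicate_inodes(d)  # nothing changes during the scan
        # A hard link is a legitimate reason for a repeated inode number.
        os.link(os.path.join(d, "a"), os.path.join(d, "a_link"))
        linked = duplicate_inodes(d)
        return quiet, linked


if __name__ == "__main__":
    print(demo())
```

To test the guarantee itself, the scan would have to run while another process
renames files in the same directory; this sketch only provides the detection
half.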

(There may be a more subtle issue: what happens if file9999 is renamed to
file0042 while readdir() is scanning around file5000? Would "file0042"
be returned twice, but with different inode values?)

I remember someone asking what happens with backup programs. I do not see it
as an issue, for two reasons:

First, a carefully written backup program could just make a consistency
check at the end: if stat()ing any file failed with ENOENT, assume something
has moved and run again. But this is useless, because:

Second, the issue is much broader than that. Imagine you move the
"billion_dollars_project" directory from ~/experimental to ~/finished while
the backup program is running. If the backup program proceeds in this order:
~/finished, ~/music, ~/experimental, and the move happens while it is
scanning ~/music, then it never sees billion_dollars_project at all, and
never sees an error for it.
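
The scenario can be simulated deterministically. This is only a sketch in
Python under stated assumptions: `backup_scan` and its `before_dir` hook are
hypothetical names, the "backup" merely records the paths it manages to
lstat(), and the hook performs the move just as the scan of music begins,
which stands in for "while it is scanning ~/music".

```python
import os
import tempfile


def backup_scan(root, order, before_dir=None):
    """Scan the subdirectories of `root` in `order`; return (seen, errors).

    `before_dir` is a hypothetical hook called with each directory name just
    before that directory is scanned, to simulate a concurrent move.
    """
    seen, errors = [], []
    for top in order:
        if before_dir is not None:
            before_dir(top)
        for dirpath, dirnames, filenames in os.walk(os.path.join(root, top)):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                try:
                    os.lstat(path)
                    seen.append(os.path.relpath(path, root))
                except FileNotFoundError:  # the ENOENT consistency check
                    errors.append(os.path.relpath(path, root))
    return seen, errors


def demo():
    with tempfile.TemporaryDirectory() as root:
        for d in ("finished", "music", "experimental"):
            os.mkdir(os.path.join(root, d))
        project = os.path.join(root, "experimental", "billion_dollars_project")
        os.mkdir(project)
        open(os.path.join(project, "plan.txt"), "w").close()

        def move(top):
            # "finished" has already been scanned when this fires.
            if top == "music":
                target = os.path.join(root, "finished", "billion_dollars_project")
                os.rename(project, target)

        return backup_scan(root, ["finished", "music", "experimental"], move)


if __name__ == "__main__":
    print(demo())
```

In this run the project never appears in the list of seen paths and the list
of errors stays empty, so an ENOENT-based consistency check has nothing to
react to.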

To make reliable backups, you need a way of getting the state of the full
tree atomically. Nowadays, that is done with filesystem snapshots. Unless
you use that, you have to assume that any file that was moved in any way
during the backup was moved the stupid way, i.e. first delete the source
then re-create the target.

Regards,

-- 
  Nicolas George
