
Re: reading an empty directory after reboot is very slow



Quoting Nicolas George (george@nsup.org):
> Le tridi 3 floréal, an CCXXIII, David Wright a écrit :
> > OK. Here's a demonstration of a file going AWOL by moving *up* the
> > directory listing. Because of read-ahead, readdir still sees the old
> > name and the stat() fails.
> 
> What are you trying to prove with that test?

With respect, if you read the thread, you'd know. In fact, just read
the post ("P") you've commented on. It's in the penultimate paragraph.

> You would get the same failure if you put your delay between readdir() and
> stat(). And on a preemptive multitasking OS (or even worse: with
> multiprocessing), that "delay" could be just the normal run time of the
> program. That is called a race condition, I am sure you know it.

Yes, I know it. I'm trying to demonstrate it.

> The Unix filesystem API has race conditions all over the place, everybody
> knows it. To eliminate them would require an explicit transactional API, and
> these cause a whole lot of problems of their own (deadlocks).
> 
> I do not see any merit in singling out this particular race condition above
> all the others.

Because refuting an argument only requires one counterexample. I'm not
trying to write a guide to systems programming, you know.

> >			     Again, because of read-ahead, I can't
> > demonstrate the opposite effect in the same program because
> > you'd have to have a directory bigger than the read-ahead buffer
> > in order to see any effect.
> 
> Please do. Creating thousands of files takes only a few seconds; strace
> can show you the calls to getdents() that lie underneath readdir() and tell
> you how many entries are read at once.

As the last sentence of "P" said: "Why should I care?".

getdents reads many entries at a time. AFAICT it requires a buffer to
put them in. AFAICT you don't know a priori how long to make that
buffer to get all the entries in one call (a 32KB buffer takes 2
calls to read my /usr/share/doc/). So don't you just push the problem
one level down? I don't know. You tell me.

> > But, as was said already, its occurrence
> > can be discovered by checking the inode numbers for duplicate returns.
> 
> I am not convinced that occurrence happens.
> 
> I believe that the readdir() should offer the following guarantee over the
> course of a single "opendir + full readdir loop":
> 
>   All entries that were present in the directory during the whole run are
>   returned exactly once, under any of the names they had during the run.

Is that a quotation? Where from?

> And for now, I have not seen any indication that this property was
> violated, i.e. the same entry shown twice or none at all.
> 
> (There may be a more subtle issue: what happens if file9999 is renamed into
> file0042 while readdir() is scanning around file5000? Would "file0042"
> be returned twice, but with different inode values?)

You really haven't read the thread, have you.

> I remember someone asking what happens with backup programs.

Yes, I quoted it in "P".

> I do not see it
> as an issue, for two reasons:
> 
> First, a carefully written backup program could just make a consistency
> check at the end: if stat()ing any file failed with ENOENT, assume something
> has moved and run again.

Yes, that's in "P". Vincent brought up the looping problem.

> But this is useless, because:
> 
> Second, the issue is much broader than that. Imagine you move the
> "billion_dollars_project" directory from ~/experimental to ~/finished while
> the backup program is running. If the backup program proceeds in that order:
> ~/finished, ~/music, ~/experimental, and the move happens while it is
> scanning ~/music, then it never sees billion_dollars_project at all, and
> never sees an error for it.

Yes, but I was showing that that can happen even without moving a directory.

> To make reliable backups, you need a way of getting the state of the full
> tree atomically. Nowadays, that is done with filesystem snapshots. Unless
> you use that, you have to assume that any file that was moved in any way
> during the backup was moved the stupid way, i.e. first delete the source
> then re-create the target.

Yes, but I know nothing about doing all that. As I said in the last
sentence of the antepenultimate paragraph of "P", "But don't expect me
to come up with a bullet-proof scheme."

Cheers,
David.

