
Re: the correct way to read a big directory? Mutt?

Quoting Vincent Lefevre (vincent@vinc17.net):

> [...]
> One can see an obvious difference: grep and my script both read the
> files in the directory order (I know that this is the case with my
> script, and grep's behavior is identical), which can be regarded as
> random due to the use of a hash (see the other thread). Mutt uses a
> different order, and after a look at its mh.c source file, I can see
> that it sorts the files by inode number (see maildir_delayed_parsing
> function). IMHO, this is a good choice because, especially in big
> directories, doing that may lead to contiguous files on the disk,
> and I think that it is the reason why Mutt is much faster.

Well that's a relief. I was getting worried about there being some
"magic" involved when you said you didn't use caching. So, looking
back at https://lists.debian.org/debian-user/2015/04/msg01265.html ,

 "In which case, if you want to know how come mutt is so fast, take a
 look at the source. Just to mention one optimisation I would consider:
 slurp the directory and sort the entries by inode. Open the files in
 inode order.
 And another: it's probably faster to slurp bigger chunks of each file
 (with an intelligent guess of the best buffer size) and use a fast
 search for \nMessage-ID rather than reading and checking line by line."

perhaps my second suggestion would also contribute to a speed-up. Here,
it does come down to "black magic": I can't follow the methods that
regular-expression engines and the like use to search strings so quickly.

(Note: obviously these suggestions were not original.)
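
A minimal sketch of that second suggestion, not taken from Mutt's source:
read one large chunk per file and search it with a compiled pattern
instead of looping over lines. The function name find_message_id and the
64 KiB chunk size are assumptions for illustration, and the sketch
assumes the whole header block fits in that first chunk and that the
Message-ID header is not folded across lines.

```python
import re

# Compiled once; MULTILINE lets ^ and $ match at line boundaries
# inside the chunk, so no per-line loop is needed.
_MSGID = re.compile(rb"^Message-ID:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)


def find_message_id(path, chunk_size=65536):
    """Return the Message-ID header value of the mail file, or None.

    Hypothetical helper: chunk_size is a guess, not a tuned value.
    """
    with open(path, "rb") as f:
        chunk = f.read(chunk_size)  # one read instead of many small ones

    # Headers end at the first blank line; restrict the search to them
    # so a "Message-ID:" line in the body is not picked up.
    end = len(chunk)
    for sep in (b"\n\n", b"\r\n\r\n"):
        pos = chunk.find(sep)
        if pos != -1:
            end = min(end, pos)

    m = _MSGID.search(chunk, 0, end)
    if m is None:
        return None
    return m.group(1).strip().decode("ascii", "replace")
```

Whether this beats line-by-line reading in practice depends on the file
sizes and on the buffering the line reader already does, so it is worth
measuring rather than assuming.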

> Now I wonder whether the use of the hash by ext3 is a good idea...

I don't see why it wouldn't be. Directory hashing only slows down the
process of obtaining the inode numbers from the directory. With a simple
linear directory, you might get that list of inode numbers more quickly,
and it might even be closer to being sorted.

But that's all fairly localised on the disk, and sorting is quick.
The major speed-up that you've demonstrated is made by accessing the
file contents from a sorted list of inode numbers (correlating with
the position of the files on the disk).
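
The approach can be sketched roughly as follows. This is not Mutt's
code (that lives in mh.c), just an illustration in Python; the function
name files_in_inode_order is an assumption. It relies on os.scandir(),
which on Linux normally gets the inode number from the directory entry
itself, so no per-file stat is needed to build the list.

```python
import os


def files_in_inode_order(directory):
    """Return paths of the regular files in `directory`, sorted by inode.

    Hypothetical helper: slurp the whole directory first, sort, and only
    then let the caller open the files.
    """
    with os.scandir(directory) as it:
        entries = [
            (entry.inode(), entry.path)
            for entry in it
            if entry.is_file(follow_symlinks=False)
        ]
    entries.sort()  # sorts on the inode number first
    return [path for _, path in entries]
```

The caller then opens the returned paths in that order. How closely
inode order tracks the position of the file data on disk depends on the
filesystem and on how the directory was populated.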

So in the absence of sorting (i.e. with general-purpose tools like grep),
doing away with hashing will speed up only the special case of reading
all the files in (a) just one entire directory, (b) which hasn't had its
entries jumbled by insertion/deletion/renaming of files, and (c) which
is specified using the directory's name (like grep -r <directory-name>).
And should you read the whole directory by specifying <directory-name>/*,
the shell sorts the expanded names alphabetically, so you lose the
benefit and thrash the disk again.

I have useful little bash functions that return the alphabetically
first or last, or the most recently modified file among the filenames
supplied. Perhaps I'll write one to take a list of filenames and return
them all, but sorted into inode order. (Maybe it already exists.)
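
A minimal sketch of such a helper, written in Python rather than bash.
The name sort_by_inode is hypothetical. Unlike reading a directory, a
bare list of filenames carries no inode numbers, so this needs one
lstat() per file; the assumption is that those lookups are cheap
compared with reading the file contents in a scattered order.

```python
import os


def sort_by_inode(filenames):
    """Return the given filenames sorted into inode order.

    Hypothetical helper. The device number comes first in the key so
    that files from different filesystems are grouped together.
    """
    def key(name):
        st = os.lstat(name)  # lstat: do not follow symlinks
        return (st.st_dev, st.st_ino)

    return sorted(filenames, key=key)
```

It could be fed the result of a shell glob, which arrives in
alphabetical order, before the files are opened.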

