
Re: the correct way to read a big directory? Mutt?

On 2015-04-24 16:39:51 -0500, David Wright wrote:
>  And another: it's probably faster to slurp bigger chunks of each file
>  (with an intelligent guess of the best buffer size) and use a fast
>  search for \nMessage-ID rather than reading and checking line by line."

This is not that simple. I want my script to be very reliable.
In particular, if a message has no Message-ID header but contains
"\nMessage-ID" in the body, I want to detect it. This kind of
thing really happens in practice (though it is rare), e.g. due to
some buggy mail software that breaks the headers and puts part of
them in the body. I also want to check the format of the headers
and detect duplicate Message-IDs. What my script really does is:

    while (<FILE>) {
        /^[\t ]/ and next;    # skip header continuation lines
        /^\S+:/ || (!$from++ && /^From /)
          or die "$proc: bad message format ($file)";
        /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i or next;
        defined $files{$1}
          and die "$proc: duplicate message-id $1 ($files{$1} and $file)\n";
        $files{$1} = $file;
    }

> And should you read the whole directory by specifying <directory-name>/*,
> you lose the benefit and thrash the disk again.

With zsh, I often do things like: grep ... <directory-name>/**/*.c

One can choose to sort the matches, but zsh doesn't support sorting
glob results by inode number. I've sent a feature request.
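Outside zsh, a rough approximation of inode-ordered reading (to cut down
on seeks when processing a big directory) can be pieced together from
standard tools: ls -i prints each entry's inode number, and sort -n then
orders them numerically. A sketch, not the zsh feature requested above
(the directory path is just an example):

```shell
# Print directory entries in inode order; reading files in this order
# tends to reduce disk seeks on traditional filesystems.
# ls -1i: one entry per line, prefixed with its inode number.
# (Filenames containing whitespace would need more careful handling.)
ls -1i /some/dir | sort -n | awk '{ print $2 }'
```

The names can then be fed to the actual processing loop, e.g. via
xargs or a while-read loop.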

Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
