
Re: the correct way to read a big directory? Mutt?

On 2015-04-25 11:28:15 +0200, Nicolas George wrote:
> On sextidi 6 Floréal, year CCXXIII, Vincent Lefevre wrote:
> > This is not that simple. I want my script to be very reliable.
> Well, your script is in Perl, so implicitly you consider that CPU cost is
> negligible. If you manage to optimize everything else (or make the
> processing more complex) so that it becomes CPU-bound, then you will have to
> consider reimplementing in C.

The CPU time is OK. If I really want an improvement (a shorter delay in
real time), I should probably use multithreading.

> Until then, I believe you are right to trust Perl's IO buffering.
> > In particular, if there is a message without a Message-ID and
> > with "\nMessage-ID" in the body, I want to detect it. This kind
> > of thing really happens in practice (though this is rare), e.g.
> > due to some buggy mail software that breaks the headers and puts
> > a part of them in the body. I also want to check the format of
> > the headers and possible duplicate Message-ID. What my script
> > really does is:
> IMHO, if you really want to validate the format of the headers, I advise you to
> read the whole header into a string and work from it. Something like:
>   my $header = "";
>   while (<$file>) {
>     last if $_ eq "\n"; # or /^\r?\n\z/ if you do not trust line ends
>     $header .= $_;
>   }
>   my @header = split /\n(?!\s)/, $header;

I don't understand the point. Accumulating in strings (which involves
copies and possible reallocations) and doing a split is much slower
than reading lines one by one and treating them separately.
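
To illustrate what I mean by treating the lines one by one, here is a
minimal sketch (not my actual script). It stops at the first empty line
and collects the Message-ID values seen in the header, without building
an intermediate string. The sub name scan_header and the sample message
are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from my script): scan the header of one
# message line by line and return the Message-ID values found there.
sub scan_header {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        last if $line =~ /^\r?\n\z/;    # empty line: end of the header
        # Same regexp as in my script; non-matching lines are skipped.
        $line =~ /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i or next;
        push @ids, $1;
    }
    return @ids;
}

# Made-up sample: the body contains a "Message-ID" line, which must
# not be counted as a header.
my $msg = "From: a\@example.org\nMessage-ID: <1\@example.org>\n\n"
        . "Body text\nMessage-ID: <2\@example.org>\n";
open my $fh, '<', \$msg or die "open: $!";
my @ids = scan_header($fh);
close $fh;
print scalar(@ids), " Message-ID in header: @ids\n";
```

Only the header occurrence is reported here; an empty list would mean
that the message has no Message-ID in its header, whatever the body
contains.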

> >     while (<FILE>)
> Out of curiosity, do you have a particular reason not to use a real
> variable for your file handles?

This is a small loop. Code like that is compact and more readable
to me. Personal taste.

> >         /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i or next;
> I have never seen this "added by" in my mails, but assuming it is
> necessary for you,

Yes, obviously. This came from some MTAs when the MUA didn't generate
a Message-ID. This lasted at least until 2005.

> note that it may be written like this:
> "Message-ID: <foo@bar> (added\n\tby someone)\n"

I don't think so: AFAIK, these MTAs never wrapped this header.
Anyway, my regexp is sufficient for my mailbox. If there is a need
(e.g. because new mail software does something else with the
Message-ID), I can modify my script.
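
To make the limitation concrete, here is a small sketch showing what the
regexp accepts when it is applied to each physical line. The sample
lines are made up; the last one is the first line of the folded form
given above.

```perl
use strict;
use warnings;

# The regexp from my script, precompiled.
my $re = qr/^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i;

# Made-up sample lines, as a line-by-line reader would see them.
my @lines = (
    "Message-ID: <foo\@bar>\n",
    "Message-ID: <foo\@bar> (added by someone)\n",
    "Message-ID: <foo\@bar> (added\n",    # first line of a folded header
);

for my $line (@lines) {
    my $status = $line =~ $re ? "match, id=$1" : "no match";
    (my $shown = $line) =~ s/\n\z//;      # drop the newline for display
    print "$shown: $status\n";
}
```

The first two lines match and the folded one does not, so if wrapped
"added by" comments ever show up, the header would have to be unfolded
before the test.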

Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
