Re: the correct way to read a big directory? Mutt?

To: debian-user@lists.debian.org
Subject: Re: the correct way to read a big directory? Mutt?
From: Vincent Lefevre <vincent@vinc17.net>
Date: Mon, 27 Apr 2015 10:05:05 +0200
Message-id: <20150427080505.GB3277@ypig.lip.ens-lyon.fr>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <20150425092815.GA3655186@phare.normalesup.org>
References: <20150424135238.GA12147@ypig.lip.ens-lyon.fr> <20150424213951.GA13410@alum> <20150425003907.GB12262@xvii.vinc17.org> <20150425092815.GA3655186@phare.normalesup.org>

On 2015-04-25 11:28:15 +0200, Nicolas George wrote:
> On sextidi 6 Floréal, year CCXXIII, Vincent Lefevre wrote:
> > This is not that simple. I want my script to be very reliable.
> 
> Well, your script is in Perl, so implicitly you consider that CPU cost is
> negligible. If you manage to optimize everything else (or make the
> processing more complex) so that it becomes CPU-bound, then you will have to
> consider reimplementing in C.

The CPU time is OK. If I really wanted an improvement (a shorter delay
in real time), I should probably use multithreading.

> Until then, I believe you are right to trust Perl's IO buffering.
> 
> > In particular, if there is a message without a Message-ID and
> > with "\nMessage-ID" in the body, I want to detect it. This kind
> > of thing really happens in practice (though this is rare), e.g.
> > due to some buggy mail software that breaks the headers and puts
> > a part of them in the body. I also want to check the format of
> > the headers and possible duplicate Message-IDs. What my script
> > really does is:
> 
> IMHO, if you really want to validate the format of the headers, I advise
> reading the whole header into a string and working from it. Something like:
> 
>   my $header = "";
>   while (<$file>) {
>     last if $_ eq "\n"; # or /^\r?\n\z/ if you do not trust line ends
>     $header .= $_;
>   }
>   my @header = split /\n(?!\s)/, $header;

I don't understand the point. Accumulating in strings (which involves
copies and possible reallocations) and doing a split is much slower
than reading lines one by one and treating them separately.
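
For illustration, here is a minimal sketch of the line-by-line approach,
assuming one message per file and that only the header block (everything
before the first blank line) should be searched. The helper name
header_message_ids and the sample message are hypothetical, not taken from
the actual script; only the regexp is the one quoted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): scan only the header
# block of one message, line by line, and return the Message-IDs found.
sub header_message_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        last if $line =~ /^\r?\n\z/;    # a blank line ends the header block
        # Pattern quoted in the thread; the optional group covers the
        # " (added by ...)" comment appended by some MTAs.
        $line =~ /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i or next;
        push @ids, $1;
    }
    return @ids;
}

# Usage on an in-memory message whose body contains a decoy Message-ID line.
my $msg = "From: a\@example.org\n"
        . "Message-ID: <1\@example.org>\n"
        . "\n"
        . "Message-ID: <decoy\@example.org>\n";
open my $fh, '<', \$msg or die "cannot open in-memory file: $!";
my @ids = header_message_ids($fh);
close $fh;
print scalar(@ids), " Message-ID(s) in header: @ids\n";
```

An empty result would flag a message without a Message-ID in its headers,
and more than one element would flag a duplicate. No header text is copied
into an accumulator string.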

> >     while (<FILE>)
> 
> Out of curiosity, do you have a particular reason not to use a real
> variable for your file handles?

This is a small loop. Written like that, the code is compact and more
readable to me. Personal taste.

> >         /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i or next;
> 
> I have never seen this "added by" in my mails, but assuming it is
> necessary for you,

Yes, obviously. This came from some MTAs when the MUA didn't generate
a Message-ID. This lasted at least until 2005.

> note that it may be written like this:
> "Message-ID: <foo@bar> (added\n\tby someone)\n"

I don't think so: AFAIK, these MTAs never wrapped this header.
Anyway my regexp is sufficient in my mailbox. If there is a need
(e.g. because new mail software does something else with the
Message-ID), I can modify my script.
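
To make the difference concrete, here is a small sketch that applies the
regexp above to three sample header lines: a plain Message-ID, one with the
"added by" comment on a single line, and the folded form from the quoted
example. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# The pattern quoted in the thread, applied to one physical line at a time.
my $re = qr/^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i;

my $plain  = "Message-ID: <foo\@bar>\n";
my $added  = "Message-ID: <foo\@bar> (added by someone)\n";
my $folded = "Message-ID: <foo\@bar> (added\n\tby someone)\n";

# A line-by-line reader only sees the first physical line of a folded header.
my ($first_folded_line) = split /^/, $folded;

my %matches = (
    plain  => ($plain             =~ $re ? 1 : 0),
    added  => ($added             =~ $re ? 1 : 0),
    folded => ($first_folded_line =~ $re ? 1 : 0),
);
print "$_: $matches{$_}\n" for sort keys %matches;
```

The folded form does not match, so if such headers ever showed up, the
script would have to join continuation lines before applying the regexp.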

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
