Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)

To: debian-user@lists.debian.org
Subject: Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)
From: Vincent Lefevre <vincent@vinc17.net>
Date: Mon, 25 May 2015 03:22:36 +0200
Message-id: <20150525012235.GA29810@xvii.vinc17.org>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <20150523020101.GA8528@alum>
References: <20150424135238.GA12147@ypig.lip.ens-lyon.fr> <20150424213951.GA13410@alum> <20150425003907.GB12262@xvii.vinc17.org> <20150425092815.GA3655186@phare.normalesup.org> <20150427080505.GB3277@ypig.lip.ens-lyon.fr> <20150428082739.GA112663@phare.normalesup.org> <20150518153809.GA2057@ypig.lip.ens-lyon.fr> <20150523020101.GA8528@alum>

On 2015-05-22 21:01:01 -0500, David Wright wrote:
> However, in https://lists.debian.org/debian-user/2015/04/msg01265.html
> I was perhaps less ambiguous (point 2):
> 
> "In which case, if you want to know how come mutt is so fast, take a
>  look at the source. Just to mention one optimisation I would consider:
>  slurp the directory and sort the entries by inode. Open the files in
>  inode order.
>  And another: it's probably faster to slurp bigger chunks of each file
>  (with an intelligent guess of the best buffer size) and use a fast
>  search for \nMessage-ID rather than reading and checking line by line.
> "

This may be worthwhile with mmap. Otherwise, one may end up making
unnecessary copies of the data.
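
As a rough illustration of the chunked approach, here is a minimal sketch.
It assumes that Perl's ":mmap" PerlIO layer is available (with a fallback
to a plain open otherwise); the helper name first_message_id and the
64 KiB limit are made up for the example. Note that even with ":mmap",
read still copies the bytes into a Perl scalar, so at best this avoids
the intermediate buffer copy.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first Message-ID value found in the
# header of the file at $path, or undef if there is none.
sub first_message_id {
    my ($path, $max) = @_;
    $max //= 65536;    # assumed upper bound on the header size

    my $fh;
    {
        no warnings 'layer';    # stay quiet if the :mmap layer is unknown
        open($fh, '<:mmap', $path)
            or open($fh, '<', $path)
            or die "Cannot open $path: $!";
    }
    defined(read($fh, my $buf, $max)) or die "Cannot read $path: $!";
    close $fh;

    # Keep only the header: everything before the first empty line.
    my ($header) = split /\n\n/, $buf, 2;
    return undef unless defined $header;

    # One multi-line, case-insensitive search instead of a per-line loop.
    return $header =~ /^Message-ID:\s*(\S+)/im ? $1 : undef;
}
```

Whether this beats a line-by-line loop would have to be measured on the
actual mailboxes.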

> > Then I don't think that in the particular case of header validation,
> > there is much gain applying regexp's on the full header at once; the
> > reason is that my regexp's use the end of line as a separator (things
> > like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
> > line by line, I already do a part of the job of regexp matching.
> 
> But I would assume that regexp in languages like Perl/Python has code
> far more optimised than reading files line by line.

This is not clear. All my regexps are anchored on a newline, so
reading the file line by line factors out that common part of the
matching.
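
To make the factoring concrete, here is a minimal sketch of a line-by-line
validation. The sub name check_header_lines and the error messages are
hypothetical, and it assumes that the Message-ID field itself is not
folded. Since each line already starts at a line boundary, every pattern
can simply be anchored with ^.

```perl
use strict;
use warnings;

# Hypothetical helper: validate a header given as a list of lines
# (without trailing newlines). Returns the Message-ID or dies.
sub check_header_lines {
    my @lines = @_;
    my ($msgid, $seen_id);
    my $first = 1;

    for my $line (@lines) {
        if ($first) {
            # First line: a header field or an mbox "From " line.
            $line =~ /^\S+:/ || $line =~ /^From / or die "bad first line\n";
            $first = 0;
        }
        else {
            # Later lines: a colon-less word followed by whitespace
            # is malformed (folded lines start with whitespace).
            $line =~ /^[^:\s]+\s/ and die "malformed line\n";
        }

        if ($line =~ /^Message-ID:/i) {
            $seen_id++ and die "duplicate Message-ID\n";
            $line =~ /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i
                or die "invalid Message-ID\n";
            $msgid = $1;
        }
    }

    defined $msgid or die "missing Message-ID\n";
    return $msgid;
}
```

Each line is scanned by a few short anchored patterns, and the loop can
stop at the first failure.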

> So you would search for \nmessage-id:.*?\n (where .*? is
> non-greedy).

One can do better. The code I used in the second test was:

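    # The first line must be a header field or an mbox "From " line.
    $header =~ /^\S+:/ || $header =~ /^From / or die;
    # No later line may start with a colon-less word followed by whitespace.
    $header =~ /\n[^:\s]+\s/ and die;
    # There must not be two Message-ID fields.
    $header =~ /^Message-ID:.*^Message-ID:/ims and die;
    # The Message-ID must have the form <...>, optionally "(added by ...)".
    $header =~ /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/im or die;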

where $header is the full header.
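
For illustration, the four tests can be wrapped in a sub and tried on a
small header. The sub name check_header, the error messages, and the
sample header are assumptions made for this sketch, not part of the
original script.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the four tests; returns the Message-ID
# on success and dies with a short reason otherwise.
sub check_header {
    my ($header) = @_;
    $header =~ /^\S+:/ || $header =~ /^From / or die "bad first line\n";
    $header =~ /\n[^:\s]+\s/ and die "malformed line\n";
    $header =~ /^Message-ID:.*^Message-ID:/ims
        and die "duplicate Message-ID\n";
    $header =~ /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/im
        or die "missing or invalid Message-ID\n";
    return $1;
}

# Made-up sample header with one folded line.
my $sample = "From: a\@example.org\n"
           . "Message-ID: <1\@example.org>\n"
           . "Subject: test\n"
           . " folded continuation\n";
print check_header($sample), "\n";
```

Each test scans the header from its start, which is the repeated reading
mentioned earlier in the thread.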

> > And finally, for each test, the header has to be read several times.
> 
> I'm not sure why, without knowing the tests to apply (or did I miss
> seeing them?).

See above.

> > In my case, I don't need to deal with folded headers, except validating
> > the format, which is very easy with a line-by-line parsing.
> 
> You did mention validating message-id and other headers and checking
> for missing ones, but do your scripts throw all this work away and,
> if so, why? For example, if you add your own distinctive Message-ID
> header to any file that doesn't have one, then that's one test you
> never have to repeat.

I don't understand.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
