Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)

To: debian-user@lists.debian.org
Subject: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)
From: Vincent Lefevre <vincent@vinc17.net>
Date: Mon, 18 May 2015 17:38:10 +0200
Message-id: <20150518153809.GA2057@ypig.lip.ens-lyon.fr>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <20150428082739.GA112663@phare.normalesup.org>
References: <20150424135238.GA12147@ypig.lip.ens-lyon.fr> <20150424213951.GA13410@alum> <20150425003907.GB12262@xvii.vinc17.org> <20150425092815.GA3655186@phare.normalesup.org> <20150427080505.GB3277@ypig.lip.ens-lyon.fr> <20150428082739.GA112663@phare.normalesup.org>

[The context: *very basic* header validation of e-mail messages]

On 2015-04-28 10:27:40 +0200, Nicolas George wrote:
> On octidi 8 Floréal, year CCXXIII, Vincent Lefevre wrote:
> > I don't understand the point. Accumulating in strings (which involves
> > copies and possible reallocations) and doing a split is much slower
> > than reading lines one by one and treating them separately.
> 
> First: not necessarily, because once the header is loaded in a string, you
> can apply regexps to the whole header at once instead of using a loop. This
> may prove faster.

I've finally tried this solution (i.e. accumulating, then applying
regexps to the full string) and it takes about 60% more time when
the data are in the disk cache. This is not surprising, IMHO, for
the following reasons:

First, as I've said, accumulating lines in a string may involve copies
and reallocations because the string grows (I don't know whether there
is a way to solve that without obfuscating the code).

Then I don't think that, in the particular case of header validation,
there is much to gain from applying regexps to the full header at once;
the reason is that my regexps use the end of line as a separator (things
like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
line by line, I already do a part of the job of regexp matching.
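To make the comparison concrete, here is a minimal sketch of the two
approaches applied to the same check. It is not the actual script: the
sample header, the sub names and the exact regexps (adapted from the
ones quoted above so that both variants test the same thing) are
assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample message (not taken from the real data).
my $sample = join '',
    "Message-ID: <abc\@example.org>\n",
    "Subject: a folded\n",
    "\tsubject line\n",
    "Broken line without colon\n",
    "\n",
    "Body text that must not be checked\n";

# Variant 1: line by line; each line is tested as soon as it is read.
sub check_by_line {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my ($bad, $has_msgid) = (0, 0);
    while (my $line = <$fh>) {
        last if $line eq "\n";                # blank line ends the header
        $bad++ if $line =~ /^[^:\s]+\s/;      # first word not followed by ':'
        $has_msgid = 1 if $line =~ /^Message-ID:/i;
    }
    close $fh;
    return ($bad, $has_msgid);
}

# Variant 2: accumulate the header in one string, then apply the
# regexps to the whole string (with /m so that ^ matches at each line).
sub check_accumulated {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my $header = '';
    while (my $line = <$fh>) {
        last if $line eq "\n";
        $header .= $line;                     # the string grows on each line
    }
    close $fh;
    my $bad = () = $header =~ /^[^:\s]+\s/mg; # count the malformed lines
    my $has_msgid = $header =~ /^Message-ID:/im ? 1 : 0;
    return ($bad, $has_msgid);
}

my @by_line     = check_by_line($sample);
my @accumulated = check_accumulated($sample);
print "line by line: @by_line\n";
print "accumulated:  @accumulated\n";
```

Both subs report the same result on this sample. To compare their speed
on real messages, the core Benchmark module (e.g. its cmpthese function)
can be used; no timing is claimed for this sketch.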

And finally, the header has to be read several times, once for each
test. However, I'm not sure that this is a problem here, because this
could be seen as merely reordering the reads from the L1 cache[*] and
the tests. So it is not clear which is best.

[*] Each header should fit in it.

> The gist of it is the usual saying: "profile, don't speculate". You had a
> particular issue that made your program immensely slower. Now that this
> problem is resolved and your program run-time is acceptable, you may want to
> trade a bit of CPU consumption for simplicity: having the whole header in a
> string makes a lot of things easier and/or more robust, especially
> everything that has to do with folded headers. And remember you already
> traded A LOT of CPU for simplicity: you are using Perl, not assembly.

In my case, I don't need to deal with folded headers, except validating
the format, which is very easy with line-by-line parsing.
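As an illustration, validating the folding format line by line could
look like the following sketch. The sub name, the error messages and
the sample are hypothetical, and the rules are deliberately simplistic
(a continuation line must follow a field line; any other line must
start with a field name followed by a colon).

```perl
use strict;
use warnings;

# Return a list of error messages for the header read from $fh.
sub folding_errors {
    my ($fh) = @_;
    my @errors;
    my $in_field = 0;   # true while the previous line belongs to a field
    my $n = 0;          # physical line number
    while (my $line = <$fh>) {
        $n++;
        last if $line eq "\n";                # blank line ends the header
        if ($line =~ /^[ \t]/) {
            # Continuation line: valid only after a field line.
            push @errors, "line $n: continuation without a field"
                unless $in_field;
        } elsif ($line =~ /^[^:\s]+:/) {
            $in_field = 1;                    # start of a new field
        } else {
            push @errors, "line $n: not a header field";
            $in_field = 0;
        }
    }
    return @errors;
}

# Hypothetical sample with two format errors (lines 1 and 4).
my $sample = "\tstray continuation\n"
           . "Subject: ok\n"
           . "\tfolded part\n"
           . "bogus line\n"
           . "\n"
           . "body\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @errors = folding_errors($fh);
close $fh;
print "$_\n" for @errors;
```

Only one flag has to be kept between lines; the header itself is never
stored.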

I may have other scripts that need to deal with them, but in that case,
I accumulate physical lines into a single logical one. AFAIK, this is
what mail processors do (postfix header filtering, procmail...). But
there is no need to accumulate the full header in a single string.
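A minimal sketch of this accumulation, assuming a callback-based
interface (the sub name each_logical_line and the sample are
hypothetical, and this is not how postfix or procmail are implemented):
only the current logical line is kept in memory, and it is handed to
the callback as soon as the next field starts.

```perl
use strict;
use warnings;

# Call $callback once per logical (unfolded) header line read from $fh.
sub each_logical_line {
    my ($fh, $callback) = @_;
    my $current;                              # logical line being built
    while (my $line = <$fh>) {
        last if $line eq "\n";                # blank line ends the header
        chomp $line;
        if (defined $current && $line =~ /^[ \t]/) {
            $current .= $line;                # continuation: append it
            next;
        }
        $callback->($current) if defined $current;
        $current = $line;                     # start a new logical line
    }
    $callback->($current) if defined $current;
    return;
}

# Hypothetical sample; the lines are collected here only for the demo.
my $sample = "Subject: a folded\n"
           . "\tsubject line\n"
           . "From: someone\@example.org\n"
           . "\n"
           . "body\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @seen;
each_logical_line($fh, sub { push @seen, $_[0] });
close $fh;
print "$_\n" for @seen;
```

A real filter would test each logical line in the callback instead of
storing it.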

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
