Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)



[The context: *very basic* header validation of e-mail messages]

On 2015-04-28 10:27:40 +0200, Nicolas George wrote:
> On Octidi 8 Floréal, year CCXXIII, Vincent Lefevre wrote:
> > I don't understand the point. Accumulating in strings (which involves
> > copies and possible reallocations) and doing a split is much slower
> > than reading lines one by one and treating them separately.
> 
> First: not necessarily, because once the header is loaded in a string, you
> can apply regexps to the whole header at once instead of using a loop. This
> may prove faster.

I've finally tried this solution (i.e. accumulating, then applying the
regexps to the full string) and it takes about 60% more time when
the data are in the disk cache. This is not surprising, IMHO, for
the following reasons:

First, as I've said, accumulating lines in a string may involve copies
and reallocations because the string grows (I don't know whether there
is a way to solve that without obfuscating the code).
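
One possibility, given only as a sketch: Perl's paragraph mode ($/ set
to the empty string) makes a single readline call return everything up
to the first empty line, so no explicit ".=" loop is needed. The helper
name read_header is hypothetical, and whether this actually avoids the
internal copies and reallocations has not been measured here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the header (everything up to the first
# empty line) using a single readline call.
sub read_header {
    my ($fh) = @_;
    local $/ = "";              # paragraph mode: a record ends at an empty line
    my $header = <$fh>;
    return unless defined $header;
    $header =~ s/\n+\z/\n/;     # drop the empty line(s) kept at the end of the record
    return $header;
}
```

The "local" keeps the change to $/ confined to the helper, so the rest
of the script still reads line by line.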

Second, in the particular case of header validation, I don't think
there is much to gain from applying regexps to the full header at once;
the reason is that my regexps use the end of line as a separator (things
like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
line by line, I already do part of the job of regexp matching.
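
To illustrate the point, here is a minimal sketch of the same check
written both ways. The helper names are hypothetical, and the "\A"
alternative in the full-string version is an addition so that the first
line is covered too; the rest of the pattern is the one quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: line-by-line check. readline has already split
# the input at "\n", so the pattern only needs to be anchored with ^.
sub header_ok_by_line {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        last if $line eq "\n";              # empty line: end of header
        return 0 if $line =~ /^[^:\s]+\s/;  # whitespace found before any colon
    }
    return 1;
}

# Hypothetical helper: same check on a header accumulated in one string.
# Here the regexp engine has to locate the line starts itself.
sub header_ok_whole {
    my ($header) = @_;
    return $header =~ /(?:\A|\n)[^:\s]+\s/ ? 0 : 1;
}
```

In both versions a folded (continuation) line passes, since it starts
with whitespace and the pattern requires at least one non-space
character first.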

And finally, the header has to be read several times, once for each
test. However, I'm not sure that this is a problem here, because this
could be seen as merely reordering the reads from the L1 cache[*] and
the tests. So it is not clear which is best.

[*] Each header should fit in it.
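
Since this cannot be settled by reasoning alone, a comparison can be
sketched with the core Benchmark module. Everything below is
illustrative: the sample header is made up, the data are read from an
in-memory string rather than from files, and the iteration count is
arbitrary, so the figures it prints say nothing about the 60% measured
on real mail.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Made-up sample header: 30 well-formed fields.
my $sample = join '', map { "X-Field-$_: value $_\n" } 1 .. 30;

# Line by line: one small regexp test per physical line.
sub by_line {
    open my $fh, '<', \$sample or die "open: $!";
    my $bad = 0;
    while (my $line = <$fh>) {
        $bad++ if $line =~ /^[^:\s]+\s/;
    }
    return $bad;
}

# Accumulate the lines, then run one regexp over the whole string.
sub whole {
    open my $fh, '<', \$sample or die "open: $!";
    my $header = '';
    while (my $line = <$fh>) {
        $header .= $line;
    }
    my $bad = () = $header =~ /(?:\A|\n)[^:\s]+\s/g;   # count the matches
    return $bad;
}

# Run each variant 2000 times without per-test output, then print a
# comparison table.
my $results = timethese(2000, { by_line => \&by_line, whole => \&whole }, 'none');
cmpthese($results);
```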

> The gist of it is the usual saying: "profile, don't speculate". You had a
> particular issue that made your program immensely slower. Now that this
> problem is resolved and your program run-time is acceptable, you may want to
> trade a bit of CPU consumption for simplicity: having the whole header in a
> string makes a lot of things easier and/or more robust, especially
> everything that has to do with folded headers. And remember you already
> traded A LOT of CPU for simplicity: you are using Perl, not assembly.

In my case, I don't need to deal with folded headers, except for
validating the format, which is very easy with line-by-line parsing.

I may have other scripts that need to deal with them, but in this case,
I accumulate physical lines into a single logical one. AFAIK, this is
what mail processors do (postfix header filtering, procmail...). But
there is no need to accumulate the full header in a single string.
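
A minimal sketch of that approach, assuming a hypothetical helper
each_logical_line and a caller-supplied callback: continuation lines
(starting with a space or a tab) are appended to the current field, and
the field is handed over as soon as the next one begins, so only one
logical line is held at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per logical (unfolded) header
# line, stopping at the empty line that ends the header.
sub each_logical_line {
    my ($fh, $callback) = @_;
    my $logical;                              # logical line being built
    while (my $line = <$fh>) {
        last if $line eq "\n";                # end of header
        if (defined $logical && $line =~ /^[ \t]/) {
            $logical .= $line;                # continuation of the current field
            next;
        }
        $callback->($logical) if defined $logical;
        $logical = $line;                     # start of a new field
    }
    $callback->($logical) if defined $logical;    # flush the last field
    return;
}
```

For instance, each_logical_line($fh, sub { print "field: $_[0]" })
prints each field with its continuation lines still attached.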

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

