Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)
Quoting Vincent Lefevre (vincent@vinc17.net):
> [The context: *very basic* header validation of e-mail messages]
>
> On 2015-04-28 10:27:40 +0200, Nicolas George wrote:
> > On Octidi 8 Floréal, year CCXXIII, Vincent Lefevre wrote:
> > > I don't understand the point. Accumulating in strings (which involves
> > > copies and possible reallocations) and doing a split is much slower
> > > than reading lines one by one and treating them separately.
> >
> > First: not necessarily, because once the header is loaded in a string, you
> > can apply regexps to the whole header at once instead of using a loop. This
> > may prove faster.
>
> I've finally tried this solution (i.e. accumulating, then apply
> regexp on the full strings) and it takes about 60% more time when
> the data are in the disk cache.
I can't quite understand Nicolas's sentence because I'm not sure
whether by "the header" and "the whole header" he means the several
lines of headers taken together.
However, in https://lists.debian.org/debian-user/2015/04/msg01265.html
I was perhaps less ambiguous (point 2):
"In which case, if you want to know how come mutt is so fast, take a
look at the source. Just to mention one optimisation I would consider:
slurp the directory and sort the entries by inode. Open the files in
inode order.
And another: it's probably faster to slurp bigger chunks of each file
(with an intelligent guess of the best buffer size) and use a fast
search for \nMessage-ID rather than reading and checking line by line.
"
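A minimal sketch of the inode-ordering idea, in Python since that is the language I can speak for. The function name files_in_inode_order is made up for illustration, and whether inode order really reduces seeking depends on the filesystem; this is not a claim about what mutt does.

```python
import os

def files_in_inode_order(directory):
    """Return paths of the regular files in `directory`, sorted by inode.

    Hypothetical helper: the directory is slurped in one pass, then the
    entries are sorted so the files can be opened in inode order.
    """
    with os.scandir(directory) as entries:
        pairs = [
            (entry.inode(), entry.path)
            for entry in entries
            if entry.is_file(follow_symlinks=False)
        ]
    pairs.sort()  # tuples sort by inode number first
    return [path for _, path in pairs]
```

The caller would then loop over the returned paths and open each file in turn.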
> This is not surprising, IMHO, for
> the following reasons:
>
> First, as I've said, accumulating lines in a string may involve copies
> and reallocations because the string grows (I don't know whether there
> is a way to solve that without obfuscating the code).
By slurp, I meant for you to try reading the top of the file as a
single chunk using a read-a-load-of-bytes method rather than a
repetitive readline method.
In Python's terms (because I don't know the Perl ones), a call of read():

    class io.RawIOBase
        read(size=-1)
            Read up to size bytes from the object and return them. As a
            convenience, if size is unspecified or -1, readall() is
            called. Otherwise, only one system call is ever made...

rather than readline():

    class io.IOBase
        readline(size=-1)
            Read and return one line from the stream. If size is specified, at
            most size bytes will be read.
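To make the distinction concrete, here is a rough Python sketch of reading the top of a file as one chunk. The helper name read_head and the 8192-byte default are assumptions for illustration, not a measured "best" buffer size.

```python
def read_head(path, size=8192):
    """Return up to `size` bytes from the start of the file at `path`.

    buffering=0 makes open() return a raw (unbuffered) binary file
    object, so this read() call is not split into line-sized pieces.
    """
    with open(path, "rb", buffering=0) as f:
        return f.read(size)
```

If the headers turn out to be longer than the chunk, the caller would have to read more; this sketch does not handle that case.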
> Then I don't think that in the particular case of header validation,
> there is much gain applying regexp's on the full header at once; the
> reason is that my regexp's use the end of line as a separator (things
> like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
> line by line, I already do a part of the job of regexp matching.
But I would assume that the regexp engines in languages like Perl/Python
are far more optimised than code that reads files line by line. So you
would search for \nmessage-id:.*?\n (where .*? is non-greedy).
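As a sketch only, that search could look like this in Python on a chunk of bytes. The names MESSAGE_ID_RE and find_message_id are hypothetical, and I am assuming the match should be case-insensitive and that the lines end in \n.

```python
import re

# Pattern from the text above, compiled once and matched case-insensitively.
MESSAGE_ID_RE = re.compile(rb"\nmessage-id:(.*?)\n", re.IGNORECASE)

def find_message_id(head):
    """Return the first Message-ID value found in `head` (bytes), or None.

    A newline is prepended so that a Message-ID on the very first line
    also matches; this copies the buffer, which a tuned version would avoid.
    """
    match = MESSAGE_ID_RE.search(b"\n" + head)
    return match.group(1).strip() if match else None
```

For example, find_message_id(b"From: a\nMessage-ID: <x@y>\n\nbody") returns b"<x@y>". Note that this searches the whole chunk, so a matching line in the message body would be found as well.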
> And finally, for each test, the header has to be read several times.
I'm not sure why, without knowing the tests to apply (or did I miss
seeing them?).
> In my case, I don't need to deal with folded headers, except validating
> the format, which is very easy with a line-by-line parsing.
You did mention validating message-id and other headers and checking
for missing ones, but do your scripts throw all this work away and,
if so, why? For example, if you add your own distinctive Message-ID
header to any file that doesn't have one, then that's one test you
never have to repeat.
> I may have other scripts that need to deal with them, but in this case,
> I accumulate physical lines into a single logical one. AFAIK, this is
> what mail processors do (postfix header filtering, procmail...). But
> there is no need to accumulate the full header in a single string.
Why not think of it this way: the "full header" (i.e. all the header
lines of a message) *is* a single string: it's the beginning of the
file, terminated by \n\n. I wonder how much speed-up you could achieve
with a C function using strstr to find the end of the headers and
returning them as a single string.
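I have not written the C version, but the same idea can be sketched in Python, with bytes.find standing in for strstr. The name header_block is hypothetical, and the sketch assumes plain \n line endings (a \r\n file would need b"\r\n\r\n" instead).

```python
def header_block(head):
    """Return the header lines at the start of `head` (bytes) as one string.

    The headers end at the first blank line, i.e. the first b"\n\n".
    Returns None when the chunk contains no blank line, which would mean
    the headers are longer than the chunk that was read.
    """
    end = head.find(b"\n\n")
    if end == -1:
        return None
    return head[:end + 1]  # keep the newline that ends the last header line
```

For example, header_block(b"A: 1\nB: 2\n\nbody") returns b"A: 1\nB: 2\n".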
Cheers,
David.