Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)

To: debian-user@lists.debian.org
Subject: Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)
From: David Wright <deblis@lionunicorn.co.uk>
Date: Fri, 22 May 2015 21:01:01 -0500
Message-id: <20150523020101.GA8528@alum>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <20150518153809.GA2057@ypig.lip.ens-lyon.fr>
References: <20150424135238.GA12147@ypig.lip.ens-lyon.fr> <20150424213951.GA13410@alum> <20150425003907.GB12262@xvii.vinc17.org> <20150425092815.GA3655186@phare.normalesup.org> <20150427080505.GB3277@ypig.lip.ens-lyon.fr> <20150428082739.GA112663@phare.normalesup.org> <20150518153809.GA2057@ypig.lip.ens-lyon.fr>

Quoting Vincent Lefevre (vincent@vinc17.net):
> [The context: *very basic* header validation of e-mail messages]
> 
> On 2015-04-28 10:27:40 +0200, Nicolas George wrote:
> > On Octidi 8 Floréal, year CCXXIII, Vincent Lefevre wrote:
> > > I don't understand the point. Accumulating in strings (which involves
> > > copies and possible reallocations) and doing a split is much slower
> > > than reading lines one by one and treating them separately.
> > 
> > First: not necessarily, because once the header is loaded in a string, you
> > can apply regexps to the whole header at once instead of using a loop. This
> > may prove faster.
> 
> I've finally tried this solution (i.e. accumulating, then apply
> regexp on the full strings) and it takes about 60% more time when
> the data are in the disk cache.

I can't quite understand Nicolas's sentence because I'm not sure
whether by "the header" and "the whole header" he means the several
lines of headers taken together.

However, in https://lists.debian.org/debian-user/2015/04/msg01265.html
I was perhaps less ambiguous (point 2):

"In which case, if you want to know how come mutt is so fast, take a
 look at the source. Just to mention one optimisation I would consider:
 slurp the directory and sort the entries by inode. Open the files in
 inode order.
 And another: it's probably faster to slurp bigger chunks of each file
 (with an intelligent guess of the best buffer size) and use a fast
 search for \nMessage-ID rather than reading and checking line by line.
"
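
A minimal Python sketch of the inode-ordering idea quoted above. The
function name is hypothetical, and whether opening files in this order
actually helps depends on the filesystem; this only shows the mechanics.

```python
import os

def files_in_inode_order(directory):
    """Return paths of regular files in `directory`, sorted by inode number."""
    with os.scandir(directory) as entries:
        # entry.inode() comes from the directory listing itself where the
        # platform provides it, so no file has to be opened just to sort.
        pairs = [(entry.inode(), entry.path)
                 for entry in entries
                 if entry.is_file(follow_symlinks=False)]
    pairs.sort()
    return [path for _, path in pairs]
```

The caller would then open each returned path in turn, for example
`for path in files_in_inode_order("/path/to/folder/cur"): ...`, where
the directory name is only a placeholder.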

> This is not surprising, IMHO, for
> the following reasons:
> 
> First, as I've said, accumulating lines in a string may involve copies
> and reallocations because the string grows (I don't know whether there
> is a way to solve that without obfuscating the code).

By slurp, I meant for you to try reading the top of the file as a
single chunk using a read-a-load-of-bytes method rather than a
repetitive readline method.

In Python terms (because I don't know the Perl ones), that means a call to read()

  class io.RawIOBase

  read(size=-1)

    Read up to size bytes from the object and return them. As a
    convenience, if size is unspecified or -1, readall() is
    called. Otherwise, only one system call is ever made...

rather than readline()

  class io.IOBase

  readline(size=-1)

    Read and return one line from the stream. If size is specified, at
    most size bytes will be read.
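
To make the contrast concrete, here is a rough Python sketch of the two
approaches. The 8192-byte chunk size and both function names are
assumptions for illustration, not measured or recommended values.

```python
CHUNK_SIZE = 8192  # assumed guess at a size that usually covers the headers

def read_head(path, size=CHUNK_SIZE):
    """Return up to `size` bytes from the start of `path` in a single read()."""
    # buffering=0 yields a raw (unbuffered) file object, so read(size)
    # corresponds to the io.RawIOBase.read() described above.
    with open(path, "rb", buffering=0) as f:
        return f.read(size)

def read_head_by_lines(path):
    """Line-by-line variant: collect lines until the blank line after the headers."""
    lines = []
    with open(path, "rb") as f:
        for line in f:
            if line in (b"\n", b"\r\n"):
                break
            lines.append(line)
    return b"".join(lines)
```

The chunk returned by read_head() may run past the headers into the
body, or stop short of their end if the headers are longer than the
chunk, so a real script would have to handle both cases.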

> Then I don't think that in the particular case of header validation,
> there is much gain applying regexp's on the full header at once; the
> reason is that my regexp's use the end of line as a separator (things
> like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
> line by line, I already do a part of the job of regexp matching.

But I would assume that the regexp engine in languages like Perl/Python
is far more optimised than a loop that reads files line by line. So you
would search, case-insensitively, for \nmessage-id:.*?\n (where .*? is
non-greedy).
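
A sketch of that search in Python, applied to a chunk of bytes already
read from the file. The pattern and function name are illustrative
assumptions: the (?:^|\n) alternative is added so that a Message-ID on
the very first line is also found, and a folded header whose value
starts on the next line would come back empty.

```python
import re

# Case-insensitive search for a Message-ID header line in a bytes chunk.
# Without re.MULTILINE, ^ matches only at the very start of the chunk.
MESSAGE_ID_RE = re.compile(rb"(?:^|\n)message-id:[ \t]*(.*?)\r?\n", re.IGNORECASE)

def find_message_id(chunk):
    """Return the raw Message-ID value found in `chunk`, or None if absent."""
    match = MESSAGE_ID_RE.search(chunk)
    return match.group(1) if match else None
```

If the chunk extends into the message body, the search could match a
line there as well, so it is safer to cut the chunk at the end of the
headers first.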

> And finally, for each test, the header has to be read several times.

I'm not sure why, without knowing the tests to apply (or did I miss
seeing them?).

> In my case, I don't need to deal with folded headers, except validating
> the format, which is very easy with a line-by-line parsing.

You did mention validating message-id and other headers and checking
for missing ones, but do your scripts throw all this work away and,
if so, why? For example, if you add your own distinctive Message-ID
header to any file that doesn't have one, then that's one test you
never have to repeat.

> I may have other scripts that need to deal with them, but in this case,
> I accumulate physical lines into a single logical one. AFAIK, this is
> what mail processors do (postfix header filtering, procmail...). But
> there is no need to accumulate the full header in a single string.

Why not think of it this way: the "full header" (i.e. all the header
lines of a message) *is* a single string: it's the beginning of the
file, terminated by \n\n. I wonder how much speed-up you could achieve
with a C function using strstr to find the end of the headers and
returning them as a single string.
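
The same idea can be sketched in Python, with bytes.find() standing in
for strstr. This is not the C function itself and says nothing about
the speed-up; the function name and the handling of CRLF line endings
are assumptions.

```python
def header_block(chunk):
    """Return the headers at the start of `chunk` as one bytes string.

    Returns None when the blank line ending the headers is not in `chunk`.
    """
    candidates = []
    for sep in (b"\n\n", b"\r\n\r\n"):
        pos = chunk.find(sep)
        if pos != -1:
            # Keep the line ending that terminates the last header line,
            # drop the empty line that follows it.
            candidates.append(pos + len(sep) // 2)
    if not candidates:
        return None  # the caller would need to read a larger chunk
    return chunk[:min(candidates)]
```

For example, header_block(b"A: 1\nB: 2\n\nbody\n") returns
b"A: 1\nB: 2\n", which can then be searched as a single string.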

Cheers,
David.
