
Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)



On 2015-05-22 21:01:01 -0500, David Wright wrote:
> However, in https://lists.debian.org/debian-user/2015/04/msg01265.html
> I was perhaps less ambiguous (point 2):
> 
> "In which case, if you want to know how come mutt is so fast, take a
>  look at the source. Just to mention one optimisation I would consider:
>  slurp the directory and sort the entries by inode. Open the files in
>  inode order.
>  And another: it's probably faster to slurp bigger chunks of each file
>  (with an intelligent guess of the best buffer size) and use a fast
>  search for \nMessage-ID rather than reading and checking line by line.
> "

This may be worthwhile with mmap. Otherwise, reading bigger chunks
into a buffer may just mean unnecessary copies.
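
For illustration, a minimal sketch of the mmap variant, written in Python rather than Perl. The function name `find_message_id`, the case-insensitive pattern, and the choice to stop at the first blank line are assumptions made for this example; nothing here has been benchmarked against the line-by-line approach.

```python
import mmap
import re

# Assumed pattern: a Message-ID field at the start of a line, any case.
MSGID_RE = re.compile(rb"^Message-ID:[ \t]*(\S+)", re.I | re.M)

def find_message_id(path):
    """Return the first Message-ID value in the header of the file, or None."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file (mmap raises ValueError if it is empty).
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # The header ends at the first blank line, if there is one.
            end = m.find(b"\n\n")
            if end == -1:
                end = len(m)
            # The regexp engine scans the mapping directly, so the data is
            # not first copied into a separate read buffer.
            match = MSGID_RE.search(m, 0, end)
            if match is None:
                return None
            return match.group(1).decode("ascii", "replace")
```

Whether this beats buffered reads depends on the file sizes and on the system, so it would have to be measured.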

> > Then I don't think that in the particular case of header validation,
> > there is much gain applying regexp's on the full header at once; the
> > reason is that my regexp's use the end of line as a separator (things
> > like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
> > line by line, I already do a part of the job of regexp matching.
> 
> But I would assume that regexp in languages like Perl/Python has code
> far more optimised than reading files line by line.

This is not clear. All my regexps are anchored on a newline, so
reading the file line by line already does the part of the matching
that they share (finding the line boundaries).
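
A minimal sketch of what this factoring can look like, in Python rather than Perl: the lines are split once, and each test then only has to look at the start of a line. The names `check_header_lines`, `FIRST_RE`, `BAD_RE` and `MSGID_RE` are assumptions made for this example, and a Message-ID whose value is folded onto the next line is rejected here, which a whole-header regexp using `\s+` would accept.

```python
import re

FIRST_RE = re.compile(r"\S+:|From ")       # header field or mbox "From " line
BAD_RE = re.compile(r"[^:\s]+(?:\s|$)")    # line start that is not "Name:"
MSGID_RE = re.compile(r"Message-ID:\s+(<\S+>)( \(added by .*\))?$", re.I)

def check_header_lines(lines):
    """Validate header lines one by one; return the Message-ID or raise."""
    msgid = None
    for n, line in enumerate(lines):
        line = line.rstrip("\n")
        if not line:
            break                          # a blank line ends the header
        if n == 0:
            if not FIRST_RE.match(line):
                raise ValueError("bad first line")
        elif BAD_RE.match(line):
            raise ValueError(f"bad header line {n + 1}")
        if line[:11].lower() == "message-id:":
            if msgid is not None:
                raise ValueError("duplicate Message-ID")
            m = MSGID_RE.match(line)
            if not m:
                raise ValueError("malformed Message-ID")
            msgid = m.group(1)
    if msgid is None:
        raise ValueError("missing Message-ID")
    return msgid
```

Any iterable of lines works as the argument, including an open text file, in which case reading stops at the end of the header.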

> So you would search for \nmessage-id:.*?\n (where .*? is
> non-greedy).

One can do better. The code I used in the second test was:

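    # The first line must be a header field or an mbox "From " line.
    $header =~ /^\S+:/ || $header =~ /^From / or die;
    # No later line may start with text that is not a field name.
    $header =~ /\n[^:\s]+\s/ and die;
    # There must be at most one Message-ID field.
    $header =~ /^Message-ID:.*^Message-ID:/ims and die;
    # A well-formed Message-ID field must be present.
    $header =~ /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/im or die;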

where $header is the full header.
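
For comparison, a minimal sketch of the same four tests in Python, together with one possible way of accumulating the header before testing it. The names `read_header` and `check_header` are assumptions made for this example, and the Perl modifiers `/i`, `/m` and `/s` are mapped to `re.I`, `re.M` and `re.S`.

```python
import re

def read_header(f):
    """Accumulate lines from a text file object up to the first blank line."""
    lines = []
    for line in f:
        if line == "\n":
            break
        lines.append(line)
    return "".join(lines)

def check_header(header):
    """Apply the four tests to the whole header; return the Message-ID."""
    # re.match anchors at the start of the string, like ^ without /m.
    if not re.match(r"\S+:|From ", header):
        raise ValueError("first line is not a field or a From line")
    if re.search(r"\n[^:\s]+\s", header):
        raise ValueError("line is neither a field nor a continuation")
    if re.search(r"^Message-ID:.*^Message-ID:", header, re.I | re.M | re.S):
        raise ValueError("duplicate Message-ID")
    m = re.search(r"^Message-ID:\s+(<\S+>)( \(added by .*\))?$",
                  header, re.I | re.M)
    if not m:
        raise ValueError("missing or malformed Message-ID")
    return m.group(1)
```

With an open file `f`, `check_header(read_header(f))` returns the Message-ID or raises `ValueError`. Each test scans the header from the start, so the header is read several times.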

> > And finally, for each test, the header has to be read several times.
> 
> I'm not sure why, without knowing the tests to apply (or did I miss
> seeing them?).

See above.

> > In my case, I don't need to deal with folded headers, except validating
> > the format, which is very easy with a line-by-line parsing.
> 
> You did mention validating message-id and other headers and checking
> for missing ones, but do your scripts throw all this work away and,
> if so, why? For example, if you add your own distinctive Message-ID
> header to any file that doesn't have one, then that's one test you
> never have to repeat.

I don't understand.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

