Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)

To: debian-user@lists.debian.org
Subject: Re: Perl scripts: line by line parsing vs accumulating (was: the correct way to read a big directory? Mutt?)
From: David Wright <deblis@lionunicorn.co.uk>
Date: Mon, 25 May 2015 20:14:50 -0500
Message-id: <20150526011450.GA14799@alum>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <20150525012235.GA29810@xvii.vinc17.org>
References: <20150424135238.GA12147@ypig.lip.ens-lyon.fr> <20150424213951.GA13410@alum> <20150425003907.GB12262@xvii.vinc17.org> <20150425092815.GA3655186@phare.normalesup.org> <20150427080505.GB3277@ypig.lip.ens-lyon.fr> <20150428082739.GA112663@phare.normalesup.org> <20150518153809.GA2057@ypig.lip.ens-lyon.fr> <20150523020101.GA8528@alum> <20150525012235.GA29810@xvii.vinc17.org>

Quoting Vincent Lefevre (vincent@vinc17.net):
> On 2015-05-22 21:01:01 -0500, David Wright wrote:
> > However, in https://lists.debian.org/debian-user/2015/04/msg01265.html
> > I was perhaps less ambiguous (point 2):
> > 
> > "In which case, if you want to know how come mutt is so fast, take a
> >  look at the source. Just to mention one optimisation I would consider:
> >  slurp the directory and sort the entries by inode. Open the files in
> >  inode order.
> >  And another: it's probably faster to slurp bigger chunks of each file
> >  (with an intelligent guess of the best buffer size) and use a fast
> >  search for \nMessage-ID rather than reading and checking line by line.
> > "
> 
> This may be interesting with mmap. Otherwise, one may do unnecessary
> copies.
> 
> > > Then I don't think that in the particular case of header validation,
> > > there is much gain applying regexp's on the full header at once; the
> > > reason is that my regexp's use the end of line as a separator (things
> > > like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
> > > line by line, I already do a part of the job of regexp matching.
> > 
> > But I would assume that regexp in languages like Perl/Python has code
> > far more optimised than reading files line by line.
> 
> This is not clear. All my regexp's are anchored on a newline.
> Reading files line by line allows one to do some factoring.
> 
> > So you would search for \nmessage-id:.*?\n (where .*? is
> > non-greedy).
> 
> One can do better. The code I used in the second test was:
> 
>     $header =~ /^\S+:/ || $header =~ /^From / or die;
>     $header =~ /\n[^:\s]+\s/ and die;
>     $header =~ /^Message-ID:.*^Message-ID:/ims and die;
>     $header =~ /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/im or die;
> 
> where $header is the full header.
> 
> > > And finally, for each test, the header has to be read several times.
> > 
> > I'm not sure why, without knowing the tests to apply (or did I miss
> > seeing them?).
> 
> See above.
> 
> > > In my case, I don't need to deal with folded headers, except validating
> > > the format, which is very easy with a line-by-line parsing.
> > 
> > You did mention validating message-id and other headers and checking
> > for missing ones, but do your scripts throw all this work away and,
> > if so, why? For example, if you add your own distinctive Message-ID
> > header to any file that doesn't have one, then that's one test you
> > never have to repeat.
> 
> I don't understand.

Well, the discussion in these threads has ranged widely over trying to
speed up the reading of directories and large numbers of files. Every
so often, I think about what you're doing with that huge directory of
emails, all 145k of them.

AIUI, and correct me if I'm wrong, you have to be able to read them
with a mail client (mutt). You have to check that (all) the header
lines are correctly formed and that each email has a single unique
message-id.

Every so often (quite frequently) you run Perl scripts (like those
posted) over them and modify the header lines (or flags) of some of
them, then restart mutt so it picks up the modifications.

Not being conversant with the maildir format, I took a look at
http://wiki2.dovecot.org/MailboxFormat/Maildir to see how filenames
are used, and how flags are implemented. I see one also might have to
be careful about preserving timestamps.

Anyway, the questions that pop into my head are things like:

If an email doesn't have a message-id, why not give it one with an
X-header that you recognise as your own? (You could process duplicates
similarly.)
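
To make that concrete, here is a minimal sketch of how such an ID might
be chosen. The sub name local_id, the md5-based fallback and the
"@local.invalid" suffix are all my own assumptions for illustration, not
anything taken from your scripts:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Decide which ID to record for a message.
# $header is the full header as one string, $body the rest of the file.
sub local_id {
    my ($header, $body) = @_;

    # Normal case: copy the first Message-ID the sender supplied.
    return $1 if $header =~ /^Message-ID:\s*(<\S+>)/im;

    # No usable Message-ID: synthesise a stable one from the content,
    # so that rerunning the script yields the same value.
    return '<' . md5_hex($header . $body) . '@local.invalid>';
}
```

Duplicates could be given a synthesised ID in the same way, perhaps with
the filename mixed into the digest so that the two copies differ.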

Why not put your X-header as the first line in the file? (In most
cases, it would be a copy of the original message-id.) Then you only
have to read one line to get at your X-header/message-id on every
subsequent occasion that you process the files.
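
Reading it back would then be something like the following sketch. The
header name X-Local-Message-ID and the sub name first_line_id are
hypothetical, and it assumes the X-header really is the very first line
of the file:

```perl
use strict;
use warnings;

my $XHDR = 'X-Local-Message-ID';   # hypothetical name, for illustration

# Return the ID stored in the first line of $path, or undef if the
# file has not been tagged yet (or cannot be opened).
sub first_line_id {
    my ($path) = @_;
    open(my $fh, '<', $path) or return;
    my $line = <$fh>;              # one line is all we ever read
    close $fh;
    return unless defined $line;
    return $1 if $line =~ /^\Q$XHDR\E:\s*(<\S+>)\s*$/i;
    return;
}
```

Any file for which this returns undef is one that still needs the full
treatment.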

If a header line is malformed, why not fix it up straight away as best
you can (rather than die), perhaps flagging the fact?
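
One possible way of doing that, purely as a sketch: keep the offending
text but turn it into a field of your own, so nothing is lost and the
header parses cleanly from then on. The name X-Malformed-Header and the
sub fix_header are made up for the example, and "malformed" here only
means "neither a Name: field nor a continuation line", which is looser
than your tests:

```perl
use strict;
use warnings;

my $BAD = 'X-Malformed-Header';    # hypothetical marker field

# Take the full header as one string; return (fixed header, count).
# A line is accepted if it is a "Name:" field, a folded continuation
# (leading whitespace, not on the first line), or an mbox-style
# "From " line at the very top.
sub fix_header {
    my ($header) = @_;
    return ('', 0) if $header eq '';
    my @lines = split /\n/, $header;
    my $fixed = 0;
    for my $i (0 .. $#lines) {
        next if $lines[$i] =~ /^[^:\s]+:/;            # well-formed field
        next if $i > 0  && $lines[$i] =~ /^\s/;       # continuation line
        next if $i == 0 && $lines[$i] =~ /^From /;    # mbox separator
        $lines[$i] = "${BAD}: $lines[$i]";            # keep text, flag it
        $fixed++;
    }
    return (join("\n", @lines) . "\n", $fixed);
}
```

A non-zero count tells you which files were touched, in case you want to
look them over by hand later.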

Why not do all these things just the once? Process all the existing
messages, however long it takes. Do it when you're not running mutt,
not renaming files etc, so that the directory is static. Then keep
track of an mtime "tidemark" so that you can recognise new messages,
which need their X-header to be added and to be checked over.
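
The tidemark could be as simple as the sketch below. The sub name
newer_than is my own invention, and where the tidemark is stored between
runs (a dotfile, say) is left open:

```perl
use strict;
use warnings;

# Return the newest mtime seen (to be saved as the next tidemark),
# followed by the names in $dir whose mtime is later than $tidemark.
# $tidemark is in seconds since the epoch.
sub newer_than {
    my ($dir, $tidemark) = @_;
    opendir(my $dh, $dir) or die "opendir $dir: $!";
    my @names = grep { !/^\./ } readdir $dh;   # slurp the whole listing
    closedir $dh;

    my ($newest, @new) = ($tidemark);
    for my $name (@names) {
        my $mtime = (stat "$dir/$name")[9];
        next unless defined $mtime;            # vanished since readdir
        next unless $mtime > $tidemark;
        push @new, $name;
        $newest = $mtime if $mtime > $newest;
    }
    return ($newest, sort @new);
}
```

Two caveats: a message arriving in the same second as the saved tidemark
would be missed by the strict comparison, and rewriting a file to add
the X-header changes its mtime, so you would want to restore that with
utime afterwards.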

Now when you do all your message filtering/flagging, you don't have to
faff around with variable numbers of header lines yet again.

BTW I couldn't help being amused by this paragraph in the dovecot wiki:
"Issues with the specification

 Locking

 Although maildir was designed to be lockless, Dovecot locks the
 maildir while doing modifications to it or while looking for new
 messages in it. This is required because otherwise Dovecot might
 temporarily see mails incorrectly deleted, which would cause
 trouble. Basically the problem is that if one process modifies the
 maildir (eg. a rename() to change a message's flag), another process
 in the middle of listing files at the same time could skip a file. The
 skipping happens because readdir() system call doesn't guarantee that
 all the files are returned if the directory is modified between the
 calls to it. This problem exists with all the commonly used
 filesystems. 
"

Cheers,
David.
