
Re: reading an empty directory after reboot is very slow



On 2015-04-20 15:59:22 -0500, David Wright wrote:
> Quoting Vincent Lefevre (vincent@vinc17.net):
> > Possibly, but individual modifications would take much more time than
> > with Maildir (such modifications, which consist of retagging, occur
> > from time to time).
> 
> I take it these are real emails being read etc with an email client
> (like, say, mutt) at various times, rather than a dead archive of old
> emails that you just happen to keep processing from time to time.

This mailbox is constantly open in a Mutt running in screen (in
read-only mode). I often read it, and I modify it from time to time,
either by adding new messages in the usual way, or by modifying some
header of existing messages with some tool of mine (in which case, I
restart Mutt to take the changes into account).

BTW, the best way would be to have this header in a different file,
but Mutt has no way to support that. Alternatively, I could modify
my tool to cache the Message-Id -> filename mapping, since this is
what I actually need.
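
A minimal sketch of such a cache, assuming the mapping is kept as a
Perl hash and stored with the core Storable module. The helper names
(load_cache, save_cache, lookup) and the cache file location are
hypothetical, not part of the existing tool:

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Hypothetical helper: load the Message-Id -> filename map,
# or return an empty map if no cache file exists yet.
sub load_cache {
    my ($cache_file) = @_;
    return {} unless -e $cache_file;
    return retrieve($cache_file);
}

# Hypothetical helper: write the map to a temporary file, then
# rename it, so that a crash cannot leave a truncated cache.
sub save_cache {
    my ($cache_file, $map) = @_;
    my $tmp = "$cache_file.tmp";
    nstore($map, $tmp);
    rename $tmp, $cache_file
        or die "can't rename $tmp to $cache_file: $!\n";
}

# Hypothetical helper: return the cached filename only if the file
# still exists; undef means "unknown, fall back to a full scan".
sub lookup {
    my ($map, $msgid) = @_;
    my $file = $map->{$msgid};
    return (defined $file && -e $file) ? $file : undef;
}
```

With this approach only the requested file is checked on lookup; a
miss (for instance after Mutt has renamed a file to change its flags)
would still require rescanning the directory and saving a fresh map.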

When I wrote my tool, I thought that such a cached mapping would be
useless because the mailbox would have to be read by Mutt anyway.
So, there's still something I don't understand: after dropping the
caches, why is Mutt fast to read the mailbox (about 1 minute), but
not my tool (about 30 minutes)?

Note: my tool stops reading the headers of a message once the
Message-Id has been found (and all the messages have a Message-Id
header). More precisely:

[...]

print "Reading $dir ...\n";
my %files;  # Message-Id => path of the file containing that message
opendir DIR, $dir or die "$!\n$proc: can't open directory $dir\n";
foreach (readdir DIR)
  {
    $_ ne '.' && $_ ne '..' or next;
    /^\./ and die "$proc: hidden filename $_\n";
    my $file = "$dir/$_";
    my $from;  # counts lines tested, so "From " is accepted on the first only
    open FILE, '<', $file or die "$!\n$proc: can't open file $file\n";
    while (<FILE>)
      {
        # Skip header continuation lines.
        /^[\t ]/ and next;
        # Accept header fields, plus a "From " line if it comes first.
        /^\S+:/ || (!$from++ && /^From /)
          or die "$proc: bad message format ($file)";
        /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/i or next;
        defined $files{$1}
          and die "$proc: duplicate message-id $1 ($files{$1} and $file)\n";
        $files{$1} = $file;
        # Stop reading this file as soon as the Message-Id is found.
        last;
      }
    close FILE or die "$!\n$proc: can't close file $file\n";
  }
closedir DIR or die "$!\n$proc: can't close directory $dir\n";

[...]

> For a start, if you put them in a single mbox, then all 145k messages
> are locked whenever any one of them is. Maildir doesn't have that
> problem.
> 
> When you process them, do you avoid updating the atime on the files?
> That can involve a lot of writing to the directory.

The file system is mounted with relatime, so the atime of each file
will be updated at most once (until the file is modified again or
24 hours have passed).

> > > It all depends upon the distribution of data size of the body of the
> > > messages since then it would need to read and skip the message
> > > bodies.
> > 
> > With an uncompressed mbox file, using the Content-Length, it could be
> > faster, but there's still the problem with individual changes.
> 
> That could be mitigated if you use that mbox extension which puts a
> fixed-length header on each email for metadata (as long as that's the
> kind of metadata you're operating on).

I meant that with the maildir format, an individual change just
modifies the message file: this is very fast. With the mbox format,
the whole file containing all the messages needs to be copied...
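
As an illustration of the maildir case, here is a minimal sketch of
rewriting one header in a single message file. The helper name
set_header and its parameters are hypothetical (this is not the code
of my tool), and it assumes that the caller provides a temporary path
on the same file system, e.g. in the maildir's tmp directory:

```perl
use strict;
use warnings;

# Hypothetical helper: set header $name to $value in one message file.
# The new content is written to $tmp, which is then renamed over $file.
sub set_header {
    my ($file, $tmp, $name, $value) = @_;
    open my $in,  '<', $file or die "can't open $file: $!\n";
    open my $out, '>', $tmp  or die "can't create $tmp: $!\n";
    my ($in_headers, $skip, $done) = (1, 0, 0);
    while (my $line = <$in>) {
        if ($in_headers) {
            if ($line =~ /^\r?$/) {
                # A blank line ends the headers: add the header if absent.
                print $out "$name: $value\n" unless $done;
                $in_headers = 0;
            }
            elsif ($skip && $line =~ /^[\t ]/) {
                next;  # continuation line of the replaced header
            }
            elsif ($line =~ /^\Q$name\E:/i) {
                # Replace the first occurrence, drop any later one.
                print $out "$name: $value\n" unless $done++;
                $skip = 1;
                next;
            }
            else {
                $skip = 0;
            }
        }
        print $out $line;
    }
    # Message without a body: the header may still need to be added.
    print $out "$name: $value\n" if $in_headers && !$done;
    close $in  or die "can't close $file: $!\n";
    close $out or die "can't close $tmp: $!\n";
    rename $tmp, $file or die "can't rename $tmp to $file: $!\n";
}
```

Only the one message file is rewritten, whatever the number of
messages in the mailbox.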

> > I wonder whether there exists some specific FS that would make maildir
> > access very fast and whether using it on a disk image that could be
> > loop-mounted would be interesting.
> 
> Have you considered running a local IMAP server to handle this (and
> any other) maildir?

There would be other problems. All the tools would have to talk
with this server... and for instance, mairix doesn't support IMAP.

> Handling those volumes of email must be bread and butter to hosting
> services. I assume such servers build persistent caches of the
> emails rather than just depending on the filesystem's.

Then this wouldn't solve the problem since the slowness I observe
occurs only when the caches are empty (typically after a reboot).
But actually, as I've said above, only with my tool, not with Mutt.

-- 
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

