Re: Debbugs: The Next Generation

On Wed, Aug 08, 2001 at 09:02:36PM -0400, Matt Zimmerman wrote:
> I think that reiserfs has at least as many corruption issues as postgresql, so
> that's probably a double-edged sword.  Using subdirectories will speed lookup
> of individual bugs, but doesn't do anything for broader queries (it could even
> slow them down).

That's not true: on ext2 the problem is individual lookups and directory
listings, which are O(N) and O(N^2) in the number of files rather than the
O(1)/O(N) you'd like. Using reiserfs or hashed subdirectories would reduce
those to roughly O(lg(N)) and O(N lg(N)).
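(For the subdirectory case, the usual trick is to hash on the bug number so
that no single directory gets huge. A sketch only -- the "db/" layout here
is invented for illustration, not how debbugs actually lays out its spool:)

```shell
# Bucket each bug's files by the last two digits of its number, so each
# directory holds roughly 1% of the spool.  (Layout is hypothetical.)
bug=102345
subdir=$(printf '%s' "$bug" | tail -c 2)   # last two digits: 45
echo "db/$subdir/$bug.log"                 # db/45/102345.log
```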

> Once you switch to a btree filesystem, 

btrees aren't necessary: files don't often get created or deleted; there are
just a lot of them.

> hashed subdirectories, and do live
> updates of indices, you are essentially re-implementing the core of a "real"
> database, but less scalability and without any advanced query functionality.
> You would still have to read through all of the files to do a complex report.

It depends. You'd have to do it for a query like "work out the average
submission time of every email containing four or more commas", but
that's not really a big deal.

You certainly need to optimise and build indices for the common case;
but making things that aren't particularly useful easy and fast is just
going to be a nuisance in future (whether in trying to maintain debbugs,
in trying to use it for things that are useful, or through reliability
issues due to the extra complexity).

> I didn't realize that incoming reports were batched; has anyone tried
> processing them more frequently?  How long does a typical run take?  
> Turnaround from submission to CGI output is one of my major complaints.

You've got three batched jobs there: mail to the BTS is batched and
processed every 15 minutes (at 3, 18, 33 and 48 minutes past the hour);
responses from the BTS are put in exim's queue, which is run every half
hour (the default setting); and while the bugreport.cgi or
http://bugs.d.o/nnnnn URLs are updated as soon as the mails are processed,
the by-package/submitter/maintainer pages are based on indices that are
only regenerated every four hours.

The first batching could probably be changed to immediate processing
relatively easily. I'm not sure what the exim thing is. The final batching
is somewhat more difficult to fix, but only because it involves trying
to understand how the debbugs code works.
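(As a crontab, the three schedules above would look something like the
following. This is just a sketch matching the intervals described -- it is
not the actual crontab on master, and the script paths are invented:)

```
# Hypothetical crontab matching the intervals above (paths invented):
3,18,33,48 * * * *  /org/bugs.debian.org/scripts/processall       # incoming BTS mail
0,30 * * * *        /usr/sbin/exim -q                             # exim queue run (default)
0 */4 * * *         /org/bugs.debian.org/scripts/rebuild-indices  # by-package etc. pages
```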

> > 	grep -E '(open|forwarded) \[.*\] (critical|grave|serious)' index.db |
> > 	cut -d\  -f1 | sort -u
> For simple queries against the index, they probably aren't all that different.
> For increasingly complex queries, the REs will get progressively more messy,
> then you'll need information that isn't in index.db and you're back to
> traversing the entire database.  Contrast:
> SELECT id FROM message WHERE age(time) < interval '30 days'
> 	AND message LIKE '%debbugs%'

I can't say I've ever wanted to do anything like this.

I also don't think letting random people from the internet run queries
like this on master is a particularly great idea. (Assuming "message"
is the full text of an email to the BTS)

And for reference:

ajt@master:/org/bugs.debian.org/spool/archive$ time find -type f -name '*.log' -newer ~/twomonthsago | xargs grep -l debbugs | wc
      9       9     139

real    0m8.358s
user    0m0.450s
sys     0m0.740s

(There were 5634 bug reports closed more recently than two months ago,
for reference)

> This naive query didn't take nearly as long as I feared on my test database
> (around 20 seconds).  The corresponding search through a debbugs spool
> directory would be considerably more expensive.  To support such a feature, we
> would have to create another index, for message vs. date, and keep it updated,
> then extract the right message from whichever .log or .report in resides in.

You don't need an index to support a feature; you only need one if the
feature is used often enough that you want it to be really fast.
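And if such a query ever did become common, the extra index needn't imply a
database: a flat file appended to at processing time would do. A sketch
(the file name and line format here are invented, not anything debbugs has):

```shell
# Hypothetical flat-file date index: append one "bugnum<TAB>epoch" line
# whenever a report is processed...
bug=102345
printf '%s\t%s\n' "$bug" "$(date +%s)" >> date.idx

# ...then query it for bugs filed in the last 30 days with awk:
cutoff=$(( $(date +%s) - 30*24*3600 ))
awk -v c="$cutoff" '$2 >= c { print $1 }' date.idx
```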

> > What're you planning on doing with archived bugs?
> I had planned to treat them exactly the same way, though they could be indexed
> and for exclusion from queries with very little performance cost.

If you're putting the archived bugs into psql too, you've got 3GB of
raw data, which you're going to be lucky to compress to much less than
1GB. It'll also just get bigger.

> > Even assuming it is, though, spending half an hour a day dumping 3GB
> > of stuff from the database, and however much longer compressing it to
> > around 1GB, seems a bit of a loss.
> The figures I gave were uncompressed size; my 60MB dump compressed down to
> under 15MB.  Uncompressed (or lightly compressed) dumps wouldn't be infeasible
> for the projected size of the database.

> > > In theory, the package pool database should be reproducible from the
> > > contents of the archive, yes?  Do tools exist to do this should it become
> > > necessary?
> > It was originally constructed from the archive itself, so it's somewhat
> > reconstructable, yes. I'm not sure if there'd be any dataloss or not. It'd
> > probably be very time consuming though.
> So while a failure of the pool database is probably much less likely than a
> failure of a hypothetical debbugs database, the downtime would also be more
> damaging (inability to fix bugs, rather than inability to look them up).  Both
> could be restored, given some time and trouble.

The pool database is backed up quite regularly (twice a day), and every backup
since the pool was created is readily available on auric. Even if all those
fail, that doesn't stop people using Debian (the Packages files are still
there), and it doesn't stop ftpmaster from recovering the database (based on
the Packages files).

It doesn't seem particularly unfair to think restoring from a 16MB dump
[0] should be at least an order of magnitude easier to handle than
restoring from a 3GB dump, either.

Further, I'd have to say that the BTS going down for a few days would
be worse than ftp-master going down: being unable to look up bugs means
people can't work on them, while being unable to upload to ftp-master
just means uploads will have to wait on one of the incoming queues until
ftp-master comes back up.

> > > > But in the context of Jason's message, it just doesn't get corrupted.
> > > I don't know what to say about the reliability issue.  I was under the
> > > impression that postgresql was more stable.  Backups and redundancy, as needed.
> > I'm not seeing how there'd be any redundancy?
> > (I'd suspect trying to pipe 3GB of db dump into psql would be pretty painful
> > too)
> Backups and redundancy is the usual prescription for reliability problems.

Yes, I know what redundancy *is*, I'm just not seeing how you're planning
on achieving it in this case.

> The dump speed depends on the format of the dump.  

Dumping 10MB of data (the size of the BTS indices used by the CGI
scripts atm) seems likely to be a lot faster than dumping 3GB of data
(uncompressed size of the BTS, archive and current), independently of
the format of the dump.

Don't take this the wrong way: it might be good to put all the emails into
postgresql, but I'm not seeing it.


[0] -rw-r--r-- 1 troup debadmin 16538622 Aug 13 16:02 dump_2001.08.13-16:02:09

Anthony Towns <aj@humbug.org.au> <http://azure.humbug.org.au/~aj/>
I don't speak for anyone save myself. GPG signed mail preferred.

``_Any_ increase in interface difficulty, in exchange for a benefit you
  do not understand, cannot perceive, or don't care about, is too much.''
                      -- John S. Novak, III (The Humblest Man on the Net)
