
Re: Debbugs: The Next Generation



On Tue, Aug 14, 2001 at 04:24:33PM +1000, Anthony Towns wrote:

> On Wed, Aug 08, 2001 at 09:02:36PM -0400, Matt Zimmerman wrote:
> > I think that reiserfs has at least as many corruption issues as postgresql, so
> > that's probably a double-edged sword.  Using subdirectories will speed lookup
> > of individual bugs, but doesn't do anything for broader queries (it could even
> > slow them down).
> 
> That's not true: ext2 queries and directory lists are the problem here,
> in that they're O(N) and O(N^2) rather than O(1)/O(N). Using reiser or
> subdirectories would reduce those to O(lg(N)) and O(N lg(N)).

Why on earth are directory lists O(N^2)?  I don't know much about ext2
internals, but reading a directory should be an O(N) operation in any sane
filesystem.  (Perhaps the O(N^2) comes from stat()ing each of the N entries,
since each name lookup is itself O(N) in ext2's linear directories.)  If that
is truly the right big-O, then reducing the number of files in each directory
should speed up queries that scan the entire database as well.

> > hashed subdirectories, and do live updates of indices, you are essentially
> > re-implementing the core of a "real" database, but with less scalability and
> > without any advanced query functionality.  You would still have to read
> > through all of the files to do a complex report.
> 
> It depends. You'd have to do it for a query like "work out the average
> submission time of every email containing four or more commas", but that's
> not really a big deal.

Or a query like "Has anyone reported a bug containing this error message in the
past 7 days?", which I have wanted to perform many times.  The more abstract
question, "Has this bug been reported already?", is difficult to answer any
other way, especially for bugs whose source is not obvious.
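
For instance (a rough sketch only: I'm assuming a message table holding the
full text of each mail along with its bug number and arrival time, and the
column names here are my invention):

    -- Hypothetical schema: "bug" links a mail to its bug number,
    -- "message" holds the full text, "time" is the arrival time.
    SELECT DISTINCT bug FROM message
        WHERE age(time) < interval '7 days'
        AND message LIKE '%some error string%';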

> You need to optimise and do indices for the common case, certainly; but
> making things that aren't particularly useful easy and fast is just going to
> be a nuisance in future (either trying to maintain debbugs, or trying to use
> it for things that are useful, or other reliability issues due to the extra
> complexity).

I think it makes sense to put everything into a database, upon which any kind
of future reporting application could be built, rather than creating
specialized indices.  Mixing reporting with data storage is (presumably) what
got us the current debbugs .log format, and parsing it for any query other
than the one it was designed for is a huge PITA.
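
To make that concrete, I have in mind something like the following (a sketch
only; the real schema needs more thought, and every name here is
illustrative):

    -- Minimal sketch; table and column names are illustrative.
    CREATE TABLE bug (
        id       integer PRIMARY KEY,   -- bug number
        status   text,                  -- open, forwarded, done, ...
        severity text,
        package  text
    );

    CREATE TABLE message (
        id      serial PRIMARY KEY,
        bug     integer REFERENCES bug(id),
        msgid   text,                   -- RFC 822 Message-ID
        time    timestamp,              -- arrival time
        message text                    -- full text of the mail
    );

Reporting tools would then query these tables rather than parse .log files.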

> > > 	grep -E '(open|forwarded) \[.*\] (critical|grave|serious)' index.db |
> > > 	cut -d\  -f1 | sort -u
> > For simple queries against the index, they probably aren't all that different.
> > For increasingly complex queries, the REs will get progressively more messy,
> > then you'll need information that isn't in index.db and you're back to
> > traversing the entire database.  Contrast:
> > SELECT id FROM message WHERE age(time) < interval '30 days'
> > 	AND message LIKE '%debbugs%'
> 
> I can't say I've ever wanted to do anything like this.
>
> I also don't think letting random people from the internet run queries
> like this on master is a particularly great idea. (Assuming "message"
> is the full text of an email to the BTS)

With this information in a database, queries like this could presumably be
optimized to the point where they wouldn't be a big deal, any more than
http://search.debian.org/ or most other websites' search functions are.
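
For instance, against the hypothetical message table sketched above, an
ordinary index on the arrival time, together with writing the date test in a
form the planner can use, already keeps such a query from scanning the whole
archive:

    -- Sketch: index the arrival time and test the column directly.
    -- age(time) < ... wraps the column in a function, so it cannot use
    -- the index; time > now() - ... can.
    CREATE INDEX message_time_idx ON message (time);

    SELECT id FROM message
        WHERE time > now() - interval '30 days'
        AND message LIKE '%debbugs%';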

Yes, you could use glimpse (or a workalike) to index the messages, then query a
database (or another index) for the bug summary info, and combine them into a
report, but I'd rather get everything from one place.

> You don't need an index to support a feature; you just need it if it's
> common enough that you want to make it really fast.

This is true, but which queries are common, how fast they need to be, and how
fast they run without an optimized index will all change over time.

> > > What're you planning on doing with archived bugs?
> > I had planned to treat them exactly the same way, though they could be
> > indexed for exclusion from queries with very little performance cost.
> 
> If you're putting the archived bugs into psql too, you've got 3GB of raw
> data, which you're going to be lucky to compress to much less than 1GB. It'll
> also just get bigger.

Unless there is a good reason not to, I'll try pulling down all of the bugs
from master and importing them into my database to see what the live and
backup requirements are.  I have a feeling it won't be as bad as all that.

> It doesn't seem particularly unfair to think restoring from a 16MB dump [0]
> should be at least an order of magnitude easier to handle than restoring from
> a 3GB dump, either.

Faster, yes.  I don't see why it should be any more or less difficult.

> Further, I'd have to say that the BTS going down for a few days would be
> worse than ftp-master going down: being unable to look up bugs means people
> can't work on them, being unable to upload them to ftp-master just means
> they'll have to be queued on one of the incoming queues until ftp-master
> comes back up.

This is a valid point; the problems are the same for updates, but read access
suffers more in the debbugs case (in the pool case, only tools like madison
would break).

> > > I'm not seeing how there'd be any redundancy?  (I'd suspect trying to
> > > pipe 3GB of db dump into psql would be pretty painful too)
> > Backups and redundancy is the usual prescription for reliability problems.
> 
> Yes, I know what redundancy *is*, I'm just not seeing how you're planning on
> achieving it in this case.

If we were to truly consider BTS queries to be a mission-critical service that
could not go down, it wouldn't be that difficult to maintain a read-only mirror
of the database, on the same server or a different one.  It could even be
maintained at some network-distant location, as BTS updates seem to be
relatively low-bandwidth.

New bug submissions would have to be queued until recovery was complete.

> Don't take this the wrong way: it might be good to put all the emails into
> postgresql, but I'm not seeing it.

I'm not offended.  It's certainly not the only way to get it done, but it
seemed natural to me when I started working on it.  There's no question that it
would require additional maintenance and precautions, so there is a tradeoff
involved.

At the very least, it seems prudent to track the messages in the database, if
not store their contents, in order to eliminate duplicates and provide fast
lookup of the message that triggered a bug status change.
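
Extending the earlier sketch, that much is cheap (again, the names are
illustrative):

    -- Sketch: a UNIQUE constraint on the Message-ID turns duplicate
    -- detection into a simple failed insert, and indexes msgid lookups.
    ALTER TABLE message ADD CONSTRAINT message_msgid_key UNIQUE (msgid);

    -- Sketch: record which mail triggered each status change.
    CREATE TABLE status_change (
        bug        integer REFERENCES bug(id),
        msg        integer REFERENCES message(id),  -- the triggering mail
        old_status text,
        new_status text,
        time       timestamp
    );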

-- 
 - mdz
