
Re: Debbugs: The Next Generation



I've put up some rough CGI demos at:

http://alcor.ddts.net/cgi-bin/debian/pkgreport.cgi/<packagename>
and
http://alcor.ddts.net/cgi-bin/debian/bugreport.cgi/<bugnumber>

which look mostly like the current bugs.debian.org.  6085 non-archived bugs are
in the database, from 100000 through 106956.  The output is currently missing
a number of things that are in the database (such as timestamps on the history
entries), but everything is still retrieved by the query, so performance
should be a good indication of the final product.  Also, merges with bugs that
are not in the database are not reported, since those references are discarded
by the referential integrity constraints in the database.

On Wed, Aug 08, 2001 at 07:41:19PM +1000, Anthony Towns wrote:

> On Wed, Aug 08, 2001 at 04:10:56AM -0400, Matt Zimmerman wrote:
> 
> > The SQL backend would be more complex to work with, but also much more
> > rewarding in flexibility and performance. 
> 
> It's not as beneficial as you may be thinking: most of the speed issues with
> debbugs are design decisions that've been outgrown. eg, the biggest speed
> issue with looking up individual bugs is that they're stored in a single ext2
> directory; changing it to a reiserfs directory, or using subdirectories would
> basically remove that speed issue entirely, but hasn't been done since no one
> is entirely confident with the perl code. Likewise, the delays in processing
> reports are because they're cronned instead of processed immediately, probably
> for historical reasons (ie, when master was much slower); and the delays in
> getting updated package reports are because the indices are cronned, rather
> than updated live, again because no one's confident enough with the perl to
> work out what should be changed.

I think that reiserfs has at least as many corruption issues as postgresql, so
that's probably a double-edged sword.  Using subdirectories will speed lookup
of individual bugs, but doesn't do anything for broader queries (it could even
slow them down).

Once you switch to a btree filesystem, add hashed subdirectories, and start
updating the indices live, you are essentially re-implementing the core of a
"real" database, but with less scalability and without any advanced query
functionality.
You would still have to read through all of the files to do a complex report.
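
To make that concrete: however the spool is hashed into subdirectories, any
report that needs information beyond what index.db carries still ends up doing
something like the following (the layout and file names here are purely
illustrative, not the real debbugs ones):

	# illustrative sketch only -- not the actual debbugs layout or file format;
	# e.g. searching message bodies for a string, something index.db can't
	# answer, means opening every log no matter how the tree is hashed
	find db -name '*.log' -print | xargs grep -il 'segfault'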

I didn't realize that incoming reports were batched; has anyone tried
processing them more frequently?  How long does a typical run take?  Turnaround
from submission to CGI output is one of my major complaints.

> I'm not really sure there's much more flexibility either, but I'm a bit of a
> shell script fetishist. 
> 
> I don't find
> 
> 	SELECT DISTINCT package FROM bugs WHERE (severity = 'serious' OR
> 	severity = 'grave' OR severity = 'critical') AND (status = 'open' OR
> 	status = 'forwarded') ORDER BY package;
>
> much better than
> 
> 	grep -E '(open|forwarded) \[.*\] (critical|grave|serious)' index.db |
> 	cut -d\  -f1 | sort -u
> 
> even when you contrast the postgresql setup and the index.db setup.

For simple queries against the index, they probably aren't all that different.
For increasingly complex queries, the REs will get progressively messier, and
eventually you'll need information that isn't in index.db, at which point
you're back to traversing the entire database.  Contrast:

	SELECT id FROM message WHERE age(time) < interval '30 days'
		AND message LIKE '%debbugs%';

This naive query didn't take nearly as long as I feared on my test database
(around 20 seconds).  The corresponding search through a debbugs spool
directory would be considerably more expensive.  To support such a feature, we
would have to create another index mapping messages to dates, keep it updated,
and then extract the right message from whichever .log or .report file it
resides in.
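
For comparison, the equivalent search against the flat spool would look
something like this (the path is an assumption on my part, and -mtime only
approximates the message date, which is exactly the index we'd be missing):

	# assumed spool path; matches on file mtime rather than the date of the
	# message, and still opens and greps every recently-touched log
	find /var/lib/debbugs/db -name '*.log' -mtime -30 | xargs grep -l debbugs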

> > > You're also missing any way to do really quick hacks about working out
> > > bug stuff. The CGI scripts have a /var/lib/debbugs/index.db which I find
> > > pretty handy, eg.
> > At first glance, this looks pretty much like the same information that's in
> > the "bug" table in my schema, except that it includes the tags as well.  It
> > would be quite simple to write a tool to generate exactly that file from
> > the database, and it would be fast enough to run once per minute if
> > desired.
> 
> Sure. It's a lot easier to process a plain text file than to talk to a
> database from the shell though, IME.

I agree (though I've often thought it could be made much less painful if
command-line SQL tools made it easier to tweak their output format).  But it's
relatively easy
to convert between text and relational databases, so both kinds of queries
could be supported without much trouble, and the database could be used
programmatically and for complex queries.
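
As a sketch of what I mean (the table and column names here should be treated
as assumptions about my schema rather than gospel), regenerating an
index.db-style flat file is a one-liner with psql's unaligned output:

	# -A: unaligned output, -t: tuples only, -F' ': space-separated fields;
	# table/column names are assumptions, not necessarily the real schema
	psql -A -t -F' ' -c \
	  "SELECT package, bugnumber, status, severity FROM bug ORDER BY package" \
	  debbugs > index.db.new && mv index.db.new index.db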

> > > Doing regular database dumps seems likely to be pretty expensive. The
> > > debbugs archive is currently 2GB, while the active database is just under
> > > 1GB. By contrast, the twice-daily backups of the package pool database on
> > > ftp-master are 16MB of SQL. debbugs gets a lot more activity than the pool
> > > database too, so probably has a lot more chances to trip over any bugs in
> > > postgresql.
> > A dump of my test database is about 60MB and takes 30-40 seconds.  Assuming
> > linear scale up to the size of the full debbugs database, a full dump would
> > probably be around 1GB.  This is all text, and seems to compress by about 4:1
> > with gzip -9, so I don't think that the backups would be too unmanageable.
> 
> What're you planning on doing with archived bugs?

I had planned to treat them exactly the same way, though an index would allow
them to be excluded from queries with very little performance cost.

> I'd suspect your 60MB dumps are probably getting pretty heavily buffered,
> which may or may not be able to be relied upon on master (which now
> has over 1GB of RAM, so could conceivably cache much of the active
> debbugs db).

That's probably true; I have a RAM surplus here at the moment.  It's hard to
estimate what a full database dump would require, but I'll wager that it's I/O
bound.  Postgresql goes to some trouble to ensure that the dump is consistent,
so it probably increases query/update overhead somewhat as well.

> Even assuming it is, though, spending half an hour a day dumping 3GB
> of stuff from the database, and however much longer compressing it to
> around 1GB, seems a bit of a loss.

The figures I gave were uncompressed sizes; my 60MB dump compressed down to
under 15MB.  Even uncompressed (or lightly compressed) dumps would be feasible
at the projected size of the database.
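
For illustration, the sort of nightly backup I have in mind is nothing more
elaborate than this (the output path is just a placeholder):

	# dump the whole database and compress it; at the ~4:1 ratio above, a
	# ~1GB text dump would land somewhere around 250MB on disk
	pg_dump debbugs | gzip -9 > /var/backups/debbugs-$(date +%Y%m%d).sql.gz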

> > In theory, the package pool database should be reproducible from the
> > contents of the archive, yes?  Do tools exist to do this should it become
> > necessary?
> 
> It was originally constructed from the archive itself, so it's somewhat
> reconstructable, yes. I'm not sure if there'd be any data loss or not. It'd
> probably be very time consuming though.

So while a failure of the pool database is probably much less likely than a
failure of a hypothetical debbugs database, the downtime would also be more
damaging (inability to fix bugs, rather than inability to look them up).  Both
could be restored, given some time and trouble.

> > > > What do we do if the current debbugs data gets corrupted (if we notice)?
> > > It doesn't get corrupted.
> > > Well, not by the underlying storage technology, anyway. Corruption due to
> > > bugs in debbugs itself is detected by verifying the db independently. Mostly,
> > > it just means we need to reprocess a message in
> > > /var/lib/debbugs/spool/incoming; otherwise we just expect people to refile
> > > the bug.
> > > But in the context of Jason's message, it just doesn't get corrupted.
> > I don't know what to say about the reliability issue.  I was under the
> > impression that postgresql was more stable.  Backups and redundancy, as needed.
> 
> I'm not seeing how there'd be any redundancy?
>
> (I'd suspect trying to pipe 3GB of db dump into psql would be pretty painful
> too)

Backups and redundancy are the usual prescription for reliability problems.
Redundancy would probably be overkill for our situation, but backups are
still good medicine.

The dump speed depends on the format of the dump.  pg_dump emits COPY
statements by default, which is relatively efficient, and it also supports a
(postgresql-specific, I assume) binary format that might be smaller and faster
(I haven't tried it).
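
Concretely, the two variants look something like this (the database names are
placeholders, and as I said I haven't benchmarked the binary/custom format):

	# plain SQL dump (COPY statements by default); restoring means piping
	# the whole thing back through psql
	pg_dump debbugs > debbugs.sql
	psql debbugs_new < debbugs.sql

	# the custom-format dump mentioned above, read back with pg_restore
	pg_dump -Fc debbugs > debbugs.dump
	pg_restore -d debbugs_new debbugs.dump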

> You could probably end up pretty happy by having an on disk structure
> like:
> 
> 	db/10/23/_/102345.log
> 	db/10/23/_/10231.log.gz [0]
> 
> with each .log containing all the information to replay a bug's status
> from day 0, and being gzipped when a bug's been archived. By limiting
> your SQL db to only having the .status information, you'd be able to use
> your db for everything except things that need to access the *.log files,
> ie when a bug is modified or specifically viewed (which is relatively
> rarely), and you'd be able to keep your DB quite small (and controllable).
> 
> That'd let you have both backups and redundancy, without taking much of a
> space or performance hit, or having too heavy a dependency on psql, I think.

I admit that I like the idea of having an authoritative text database, but I
think my feature goals would be more difficult to meet with such a system, and
maintaining both in parallel seems like it would require excessive software
complexity and administration effort.
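
For what it's worth, the path hashing described above is simple enough to
sketch; my reading of the example is a split on two-digit prefixes (this is
bash, and only a guess at the intended scheme):

	# map a bug number to its spool path, e.g. 102345 -> db/10/23/_/102345.log
	bug=102345
	path="db/${bug:0:2}/${bug:2:2}/_/${bug}.log"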

I'll think on it some more.  Thanks for your ideas.

-- 
 - mdz
