On Wed, Aug 08, 2001 at 04:10:56AM -0400, Matt Zimmerman wrote:
> > In particular, efficiently getting a list of bugs by some particular
> > selection mechanism "all RC bugs", "all bugs related to source package <foo>"
> > or whatever, pretty much requires you to hack around the C++ or SQL stuff.
> This is true, and I don't see any good way around it.  The current mechanism
> suffers from this, too, in that you have to dig around the backend to answer
> sufficiently complex queries.
Sure. The point is, you can't just say "Oh, it doesn't matter what that's
written in; there'll be wrappers for it". 
> The SQL backend would be more complex to work
> with, but also much more rewarding in flexibility and performance. 
It's not as beneficial as you may be thinking: most of the speed
issues with debbugs are design decisions that've been outgrown. eg,
the biggest speed issue with looking up individual bugs is that they're
stored in a signle ext2 directory; changing it to a reiserfs directory,
or using subdirectories would basically remove that speed issue entirely,
but hasn't been done since no one is entirely confident with the perl
code. Likewise, the delays in processing reports is because they're
cronned instead of processed immediately, probably for historical reasons
(ie, when master was much slower); and the delays in getting updated
package reports is because the indices are cronned, rather than updated
live, again because no one's confident enough with the perl to work out
what should be changed.
Beyond those changes, I don't think there are really any efficiency
problems. Possibly the CGI scripts are a bit awkward at generating HTML
when you have truly large numbers of bugs (which is why you can't get
lists of all open bugs atm).
I'm not really sure there's much more flexibility either, but I'm a bit
of a shell script fetishist. 
I don't find
	SELECT DISTINCT package FROM bugs 
        WHERE (severity = 'serious' OR severity = 'grave' 
               OR severity = 'critical') 
          AND (status = 'open' OR status = 'forwarded') ORDER BY package;
much better than
	grep -E '(open|forwarded) \[.*\] (critical|grave|serious)' index.db |
	  cut -d\  -f1 | sort -u
even when you contrast the postgresql setup and the index.db setup.
> > You're also missing any way to do really quick hacks about working out
> > bug stuff. The CGI scripts have a /var/lib/debbugs/index.db which I find
> > pretty handy, eg.
> At first glance, this looks pretty much like the same information that's in the
> "bug" table in my schema, except that it includes the tags as well.  It would
> be quite simple to write a tool to generate exactly that file from the
> database, and it would be fast enough to run once per minute if desired.
Sure. It's a lot easier to process a plain text file than to talk to a
database from the shell though, IME.
>> Doing regular database dumps seems likely to be pretty expensive. The debbugs
>> archive is currently 2GB, while the active database is just under 1GB. By
>> contrast, the twice-daily backups of the package pool database on ftp-master
>> is 16MB of SQL. debbugs gets a lot more activity than the pool database too,
>> so probably has a lot more chances to trip over any bugs in postgresql.
> A dump of my test database is about 60MB and takes 30-40 seconds.  Assuming
> linear scale up to the size of the full debbugs database, a full dump would
> probably be around 1GB.  This is all text, and seems to compress by about 4:1
> with gzip -9, so I don't think that the backups would be too unmanageable.
What're you planning on doing with archived bugs?
I'd suspect your 60MB dumps are probably getting pretty heavily buffered,
which may or may not be able to be relied upon on master (which now
has over 1GB of RAM, so could conceivably cache much of the active
debbugs db).
Even assuming it is, though, spending half an hour a day dumping 3GB
of stuff from the database, and however much longer compressing it to
around 1GB, seems a bit of a loss.
> In theory, the package pool database should be reproducible from the contents
> of the archive, yes?  Do tools exist to do this should it become necessary?
It was originally constructed from the archive itself, so it's somewhat
reconstructable, yes. I'm not sure if there'd be any dataloss or not. It'd
probably be very time consuming though.
> > > What do we do if the current debbugs data gets corrupted (if we notice)?
> > It doesn't get corrupted.
> > Well, not by the underlying storage technology, anyway. Corruption due to
> > bugs in debbugs itself is detected by verifying the db independently. Mostly,
> > it just means we need to reprocess a messsage in
> > /var/lib/debbugs/spool/incoming; otherwise we just expect people to refile
> > the bug.
> > But in the context of Jason's message, it just doesn't get corrupted.
> I don't know what to say about the reliability issue.  I was under the
> impression that postgresql was more stable.  Backups and redundancy, as needed.
I'm not seeing how there'd be any redundancy?
(I'd suspect trying to pipe 3GB of db dump into psql would be pretty painful
too)
You could probably end up pretty happy by having an on disk structure
like:
	db/10/23/_/102345.log
	db/10/23/_/10231.log.gz [0]
with each .log containing all the information to replay a bug's status
from day 0, and being gzipped when a bug's been archived. By limiting
your SQL db to only having the .status information, you'd be able to use
your db for everything except things that need to access the *.log files,
ie when a bug is modified or specifically viewed (which is relatively
rarely), and you'd be able to keep your DB quite small (and controllable).
That'd let you have both backups and redundancy, without taking much
of a space or perfomance hit, or having too heavy a dependency on psql,
I think.
Cheers,
aj
[0] xxyyz and xxyyzz both goin in db/xx/yy/_/, so there's at most
    110 files in the _ directories, and there'll be at most 101 files
    in any xx/ directory, which gives you O(log(n)) access time (and
    O(n*log(n)) listing time) to any log file, on any reasonable fs,
    rather than the O(n) access time (and O(n^2) access time) the BTS
    currently gets on ext2.
-- 
Anthony Towns <aj@humbug.org.au> <http://azure.humbug.org.au/~aj/>
I don't speak for anyone save myself. GPG signed mail preferred.
``_Any_ increase in interface difficulty, in exchange for a benefit you
  do not understand, cannot perceive, or don't care about, is too much.''
                      -- John S. Novak, III (The Humblest Man on the Net)
Attachment:
pgpxOeaGKOY_p.pgp
Description: PGP signature