
Re: Debbugs: The Next Generation



On Wed, 8 Aug 2001, Anthony Towns wrote:

> It's not as beneficial as you may be thinking: most of the speed
> issues with debbugs are design decisions that've been outgrown. eg,
> the biggest speed issue with looking up individual bugs is that they're
> stored in a single ext2 directory; changing it to a reiserfs directory,
> or using subdirectories would basically remove that speed issue entirely,
> but hasn't been done since no one is entirely confident with the perl
> code. Likewise, the delays in processing reports are because they're
> cronned instead of processed immediately, probably for historical reasons
> (ie, when master was much slower); and the delays in getting updated
> package reports are because the indices are cronned, rather than updated
> live, again because no one's confident enough with the perl to work out
> what should be changed.

Well, I have hashed db/ support half implemented, but, as always, I got busy
with other things.

Also, now that processall has a lock file that keeps several instances from
stepping on each other, it may be possible to have it run as soon as an email
comes in.  The only thing I'd want to change first is the error message that
gets sent when two processalls attempt to run and one hits the lock (read
owner@bugs when dinstall floods the bug system).
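
For illustration, a minimal shell sketch of that kind of lock-then-run logic,
assuming a mkdir-based lock and made-up paths; the real processall does its
locking in perl, so this is only the shape of the idea:

#!/bin/sh
# Hypothetical sketch: the lock path and the run-queued-reports step are
# made up; the real processall implements its own locking in perl.
LOCKDIR=/var/lib/debbugs/spool/processall.lock

if mkdir "$LOCKDIR" 2>/dev/null; then
	# we hold the lock; make sure it is released however we exit
	trap 'rmdir "$LOCKDIR"' EXIT
	run-queued-reports	# stand-in for the actual processing step
else
	# another processall is already running; exit quietly instead of
	# mailing the lock error back (cf. owner@bugs during a dinstall flood)
	exit 0
fi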

Also, with debbugs.trace, we now have the first steps toward updating the
indices in real time.


>
> Beyond those changes, I don't think there are really any efficiency
> problems. Possibly the CGI scripts are a bit awkward at generating HTML
> when you have truly large numbers of bugs (which is why you can't get
> lists of all open bugs atm).
>
> I'm not really sure there's much more flexibility either, but I'm a bit
> of a shell script fetishist.
>
> I don't find
>
> 	SELECT DISTINCT package FROM bugs
>         WHERE (severity = 'serious' OR severity = 'grave'
>                OR severity = 'critical')
>           AND (status = 'open' OR status = 'forwarded') ORDER BY package;


As a single query:

SELECT DISTINCT a.package FROM packages a, bugs b, severities c, status d
	WHERE a.id = b.package AND c.id = b.severity AND d.id = b.status
		AND c.severity IN ( 'serious', 'grave', 'critical' )
		AND d.status IN ( 'open', 'forwarded' )
	ORDER BY a.package;

Or, the way bugs-query does it, with local caching of certain values, for
speed:

CREATE TEMP TABLE tempTable_50 AS
	SELECT DISTINCT bugs.id, bugs.merge, bugs.date, bugs.subject,
		bugs.severity, bugs.status, bugs.package
	FROM bugs
	WHERE ( bugs.severity IN ( '0', '1', '2' ) )
	  AND ( bugs.status IN ( '2', '3' ) );

SELECT DISTINCT bugs.id, bugs.merge, bugs.date, bugs.subject, bugs.severity,
	bugs.status, bugs.package, status.value, severities.value
	FROM status, severities, tempTable_50 AS bugs
	WHERE ( status.id = bugs.status )
	  AND ( severities.id = bugs.severity )
	ORDER BY bugs.package ASC LIMIT 20;

> Sure. It's a lot easier to process a plain text file than to talk to a
> database from the shell though, IME.

echo "SELECT foo FROM bar" | psql -U user -H database | processit

> What're you planning on doing with archived bugs?
>
> I'd suspect your 60MB dumps are probably getting pretty heavily buffered,
> which may or may not be able to be relied upon on master (which now
> has over 1GB of RAM, so could conceivably cache much of the active
> debbugs db).
>
> Even assuming it is, though, spending half an hour a day dumping 3GB
> of stuff from the database, and however much longer compressing it to
> around 1GB, seems a bit of a loss.

On master, the postgres db for debbugs takes up 66 megs.  A pg_dump is 13 megs
uncompressed, and takes just under 5 seconds to run.  Of course, this does NOT
include the .log.

My plan was to have a separate history table, keyed on the msgid of each email
sent to the system, along with certain other metadata stored for each email.
However, the actual emails themselves would be stored as an mbox for each
bug (and yes, I know that duplicates data).

> > In theory, the package pool database should be reproducible from the contents
> > of the archive, yes?  Do tools exist to do this should it become necessary?
>
> It was originally constructed from the archive itself, so it's somewhat
> reconstructable, yes. I'm not sure if there'd be any dataloss or not. It'd
> probably be very time consuming though.

Which is why I like my way of doing it.  The text files are still authoritative,
and postgres is only used to speed up read-only queries.


> You could probably end up pretty happy by having an on disk structure
> like:
>
> 	db/10/23/_/102345.log
> 	db/10/23/_/10231.log.gz [0]

db/5/4/3/102345.log
db/1/3/10231.log

That is how I was doing it when I started to add hashed support to debbugs.  I
think that gives a more even distribution over time.
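
For what it's worth, a shell sketch of one way to compute such a path, assuming
a fixed three-level hash built from the last three digits of the bug number in
reverse order (the 10231 example above only uses two levels, so the real scheme
may vary the depth; the function name is mine):

#!/bin/sh
# Hypothetical sketch: map a bug number to a hashed log path by reversing its
# last three digits, one digit per directory level.  The fixed depth of three
# and the function name are assumptions.
bug_to_logpath () {
	bug="$1"
	hash=$(printf '%s' "$bug" | tail -c 3 | rev | sed 's|.|&/|g')
	printf 'db/%s%s.log\n' "$hash" "$bug"
}

bug_to_logpath 102345	# -> db/5/4/3/102345.log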


> [0] xxyyz and xxyyzz both go in db/xx/yy/_/, so there's at most
>     110 files in the _ directories, and there'll be at most 101 files
>     in any xx/ directory, which gives you O(log(n)) access time (and
>     O(n*log(n)) listing time) to any log file, on any reasonable fs,
>     rather than the O(n) access time (and O(n^2) listing time) the BTS
>     currently gets on ext2.

Hmm.  I understand what the _ is for.  In my scheme, however, it's only needed
on the short bug numbers.

Also, you don't discuss how many files end up in the db/xx/yy/ dirs.  From your
explanation, there only appears to be a single dir in each, the '_' you mention.


