Re: RFC: Creation of a PostgreSQL database schema for wanna-build data
Hello, Roger, thanks for pushing this forward!
> We would be grateful for any comments you might have, particularly
> if you have used the existing wanna-build and can see any defects,
> omissions, or any suggestions for anything we missed or could
> implement in a more optimal fashion.
>  http://git.debian.org/?p=users/rleigh/sbuild.git;a=tree;f=db;h=26f19b8aef033bc07fdba1c33fa32c80b78ce468;hb=38dd87a21b925eca2c936adb9530530b1854cd07
Okay, I have a number of comments, suggestions, wishes, and patches.
They're quite a few, so please take your time. Hopefully more people can
join, letting us know what they think of them, and verifying I'm not
Also, I'm happy to provide patches for all of these issues once we have
reached an agreement on the list, so please don't take all this as a
"fix it fix it fix it" mail, and feel free to just comment on these
issues. OTOH, if you prefer to prepare the patches yourself, I'm very
much okay with that too, though I'd appreciate if you posted them here
first (with git-send-email or similar), so there's a chance for review
So, in no particular order, let's go over them:
1. Permission handling. I see nothing about this in the design, but I
thought we'd be needing something in the DB? (In particular, I don't
have a clear picture how updates to the DB are going to be handled,
eg. vorlon@raff giving back a package under his UID.)
Some thoughts for this: I would like for the builder table to have
"arch" and "admin" columns, so that buildd ownership can become the
primary source of information about who can do what. And then we need
a place to store permissions for users that don't admin buildds, or
extra permissions for certain admins.
"admin" in builder should be a foreign key to a developers or similar
table, which should have at least an "address" NOT NULL column (since
I don't want to duplicate the information of who should be behind
each of the <arch>@buildd.d.o aliases).
2. [trivial] I'd like for the builder table to have an additional
"address" NOT NULL column, as to not duplicate the list of builders
in the DB and the aliases file.
3. I'm doubtful about the "all" and "any" rows in the architectures
table. architectures is being referenced from tables with Packages
or Sources information, but also from stuff like build_jobs (and, as
per above, possibly builder too). And "all" or "any" don't really
make sense for this latter case.
Maybe we should create a view "pkg_architectures" that grabs
architectures and unions "any" and "all" on top of it, and have
sources and binaries reference the view? (Can a foreign key point to
A case could be made that "all" really belong in architectures, since
there could be an autobuilder for arch:all packages, but we can
easily change that later on. (And I don't think an arch:all
autobuilder would be handled that way anyway.)
4. Would it be useful to have the priority sortable from within the
database? Then it doesn't have to be done in software. (This is just
an idea, and I'm not completely sure it's a good one. Note: if we do
this, ensure auto-insertion works well with it.)
5. [nitpick] Could we do without listing all package sections? They'll
be auto-inserted anyway.
6. Maintainer and Uploaders handling. I would like for "uploaders" to be
a proper one-to-many relationship, because the new web frontend will
be querying by uploader.
Also, I don't think we need per-version maintainer/uploaders
information. Instead, the tool that populates the DB should just use
the information from the most recent version across suites.
(Oh, and we should just keep the maintainter/uploaders *addresses*.)
7. [distant future, no action needed now, but see the last paragraph in
this item] I think we should consider killing the P-a-s file as
input, and put that information in the DB. (P-a-s or similar would
still be created as output for third parties.)
Then, source_architectures would be a view of (eg.)
source_arch_from_control and source_arch_override (P-a-s).
(Or, well: without the above, where is P-a-s information merged into
the DB? Directly into source_arch? Maybe we can retain P-a-s as
input, but use already the two-table scheme I mentioned in the
paragraph above, writing P-a-s into source_arch_override.)
8. Is binary package information needed for anything else than
dep-waits? </ignorance> Does that table need section and priority?
9. History. In your introductory mail, you mention that the database is
meant to keep full history; I infer this mainly refers to the
"build_jobs" table. I have reservations about this design (not about
the idea of keeping full history itself), which I'll explain now.
I don't like very much the idea of keeping the state change history
(which is what build_jobs seems to do) together with the information
about the current (most recent) state. One concern would be
performance: how will fetching the current state for one package
perform in 5 years time? (This concern may just be due to me being
ignorant about PostgreSQL real capabilities. Also, I see you
mentioned pruning the table on IRC; see below about this.)
Another concern would be clumsiness of queries. For example, how
would one obtain the current state in unstable for packages foo, bar,
and baz? (I know current state is easy to obtain for a single
package, with eg. "ORDER BY ctime DESC LIMIT 1", but I can't figure
out how to get it for multiple packages with one query. And the
front-end *will* want to query for multiple packages. This, again,
may be a lack of SQL knowledge on my part, apologies if that's the
What I had imagine we would have, prior to reading this design, was a
build_state or similar table, that just records the current state for
each source/distribution/arch. And a separate build_state_history
table that would record every state change.
On IRC, there was talk about periodically pruning the sources table,
and hence pruning the history on build_jobs as well. I think sources
should always be prunned as soon as a version is no longer present in
any of the distributions, but I also think build_state_history
shouldn't be prunned at all, and it should just lose referential
integrity after prunning sources. (This is of course my opinion, and
the matter is open for discussions.)
Marc said he wouldn't know what to do with that state change history,
but I'd say "why not have it?". There's always time for pruning if
nobody ever finds a use for it, or it takes up too much space.
10. Build logs. I'd really like for the DB to store a table with all the
available logs: source, version, arch, date, state, and path.
11. I see from your 38dd87a commit message that the "log" table is meant
for "storing changelog". I've searched the IRC logs of your
conversations about this schema, but I couldn't find a reference to
it. Storing changelog of what, precisely?
12. binNMU handling. Marc mentioned en-passé that the binNMU
version/level should be a column in build_job/build_state, and I
agree. I'm just not very sure if we want history of these, and how
they'd fit with/into build_state_history.
13. The _name suffix everywhere annoys me a bit, since I like having
tables that can be natural join'ed, or using(foo) join'ed. Maybe
this is just me, and the _name suffix is excelent SQL practice, but
I thought I'd mention to see what other think.
14. I'm attaching 3 trivial patches that hopefully nobody will find
15. Could we name the database "wannab"?
Adeodato Simó dato at net.com.org.es
Debian Developer adeodato at debian.org
Testing can show the presence of bugs, but not their absence.