Re: RFC: Creation of a PostgreSQL database schema for wanna-build data

To: Roger Leigh <rleigh@codelibre.net>
Cc: debian-dak@lists.debian.org, Debian buildd-tools Developers <buildd-tools-devel@lists.alioth.debian.org>, Marc 'HE' Brockschmidt <he@debian.org>
Subject: Re: RFC: Creation of a PostgreSQL database schema for wanna-build data
From: Adeodato Simó <dato@net.com.org.es>
Date: Mon, 26 Jan 2009 18:39:26 +0100
Message-id: <[🔎] 20090126173926.GA13853@chistera.yi.org>
Mail-followup-to: Roger Leigh <rleigh@codelibre.net>, debian-dak@lists.debian.org, Debian buildd-tools Developers <buildd-tools-devel@lists.alioth.debian.org>, Marc 'HE' Brockschmidt <he@debian.org>
In-reply-to: <[🔎] 20090125190557.GG11576@codelibre.net>
References: <[🔎] 20090117162702.GA13984@codelibre.net> <[🔎] 20090124165846.GA29019@chistera.yi.org> <[🔎] 20090125190557.GG11576@codelibre.net>

* Roger Leigh [Sun, 25 Jan 2009 19:05:58 +0000]:

Hey again,

> As I mentioned on IRC, this tree contains three implementations of the
> debversion type for Postgres, implemented in PL/Perl, PL/PerlU and C.
> Once we have decided on the appropriate speed/security/code
> duplication tradeoff, we can pick one of these to go with.

This was discussed on IRC, but I'll mention here for the benefit of
whoever is reading this thread and not the channel.

The bottom line is that, the C wrapper around libapt-pkg is two orders
of magnitude faster than PL/Perl. I didn't check the PL/PerlU version,
but I imagine it'll be very similar to PL/Perl, if not slower.

I think the speed benefits are enough as to go with the version in C. It
is true that we will normally not do lookups based on version against
huge amount of data (since in the common case you only specify a version
when you're specifying a package as well), but the Perl version is so
slow, that you'll want to die the day you need to do a version lookup
against lots of rows, for fun or profit.

The version in C has a couple downsides. The most important is that you
need to be superuser in order to create functions in C. This shouldn't
be a problem, because either wbadm will be superuser in the wannab
database or, alternatively, DSA can run version.sql for us as a one time
operation. (Additionally, I think both the PL/Perl and the PL/PerlU
versions need superuser as well, for the functions that use the
"internal" language.)

> No problem.  One thing following from your patches and this thread is
> that there is a lot of similarity between the dak projectb schema and
> this one.  As a result, I think it makes sense to do things like dak
> does where possible, to make things as compatible as possible.  Quite
> a few of your suggestions take us in this direction anyway.

As I said on IRC, I don't see a need to be schema/table compatible with
projectb. (And projectb is presumably going away in the future.) I do
think consistency in names is useful, that's what I sent a patch that
renamed section to component, and why I'm attaching a patch that renames
distribution to suite as discussed on IRC.

> I've attached a copy of some changes I make to do this, but this is
> just for a couple of tables so far.  Comments?

I agree with what Marc said back on December, about preferring tables
without serial ids if possible. The other part (names of section and
priority columns) is better addressed, I think, by my patch to item 13,
at the end of this mail. Sorry for the wasted effort, I thought I had
mentioned this to you on IRC early enough.

> > So, in no particular order, let's go over them:

I've skipped items 9 (history), 11 ("log" table) and 12 (binNMUs), which
I'll address in a separate mail later on.

(NB: I'm following-up with a number of patches to this mail. Though they
are sent with git-send-email, that does not imply that they are to be
final. If there are concerns, I'll submit "v2" of the affected ones.)

> > 1. Permission handling. I see nothing about this in the design, but I
> >    thought we'd be needing something in the DB? (In particular, I don't
> >    have a clear picture how updates to the DB are going to be handled,
> >    eg. vorlon@raff giving back a package under his UID.)

> >    Some thoughts for this: I would like for the builder table to have
> >    "arch" and "admin" columns, so that buildd ownership can become the
> >    primary source of information about who can do what. And then we need
> >    a place to store permissions for users that don't admin buildds, or
> >    extra permissions for certain admins.

> >    "admin" in builder should be a foreign key to a developers or similar
> >    table, which should have at least an "address" NOT NULL column (since
> >    I don't want to duplicate the information of who should be behind
> >    each of the <arch>@buildd.d.o aliases).

> I hadn't considered this.  Your suggestion for the builder table sounds
> fine.  Would it be appropriate for non-admin developers to have the
> ability to do some tasks (such as sheduling a binNMU on all buildds)?

Yes, as mentioned, some non-admins have permissions, and some admins
have extra permissions (ie., not only for their arches). I'm attaching a
patch with the initial work on this, just keeping record of each buildd
admin(s). 

We need to figure out how to store extra access. In principle, a simple
table "extra_access" with (login, arch) rows should suffice. However, I
would like there's a way to say "all arches", and neither of the two
ways I thought of doing that pleased me much.

(Way #1: have a boolean in "extra_access"; then the "arch" column
becomes meaningless for that row. Way #2: allow an "all" value for arch
in extra_access; then, you can't reference architectures and have to
reference package_architectures instead.)

> > 2. [trivial] I'd like for the builder table to have an additional
> >    "address" NOT NULL column, as to not duplicate the list of builders
> >    in the DB and the aliases file.

> Sure.  Where is the aliases file currently located?  Is this just a
> standard /etc/aliases file?

It's a standard /etc/aliases file and it's in /org/buildd.debian.org/mail.
I expect there'll be a small program generating it out of the DB.

> Does this want to be a one-to-many mapping to cope with multiple
> admins for a system?

The "address" column in builder is a one-to-one mapping, because it's
the address of the buildd software (eg., buildd@whatever.debian.org).
OTOH, you raise a very valid point about multiple admins for a system;
but that means the "admin" column mentioned in #1 should be a
one-to-many, and not "address".

I'm attaching a patch adding an "address" column.

> > 3. I'm doubtful about the "all" and "any" rows in the architectures
> >    table. architectures is being referenced from tables with Packages
> >    or Sources information, but also from stuff like build_jobs (and, as
> >    per above, possibly builder too). And "all" or "any" don't really
> >    make sense for this latter case.

> >    Maybe we should create a view "pkg_architectures" that grabs
> >    architectures and unions "any" and "all" on top of it, and have
> >    sources and binaries reference the view? (Can a foreign key point to
> >    a view?)

> I'm not sure, I'll have to check.  If it's not possible, then we could
> have two separate tables for the two separate purposes.  Or alternatively,
> add an extra column to architectures to determine where it may be used
> (and add a constraint on that).

I checked, and it's indeed not possible to have a foreign key against a
view. With that, I prefer two tables rather than a single table with an
extra column.

I was going to suggest a trigger propagating insertion from
"architectures" to "architecture_field". However, "architecture_field"
should have a trigger anyway to auto-insert architectures from Sources
files, since there you can find stuff, not only like kfreebsd-i386, but
a plethora of others: netbsd-alpha, hurd-sparc, sh4eb, m32r... Hence the
trigger architectures -> architecture_field seems redundant.

I'm attaching a draft patch. Thoughts?

> >    A case could be made that "all" really belong in architectures, since
> >    there could be an autobuilder for arch:all packages, but we can
> >    easily change that later on. (And I don't think an arch:all
> >    autobuilder would be handled that way anyway.)

> Agreed, you would probably just configure one architecture to build arch-all
> packages in addition to arch-any.

Ack.

> I'll check up on the use of views (do you have an example of how to
> union data in a view?).  My current thought is that the primary key
> would be a bit screwy, but maybe it would be OK.

As mentioned on IRC:

=> CREATE VIEW pkg_arches AS
      SELECT name FROM architectures UNION SELECT 'any' UNION SELECT 'all';

Or, for PostgreSQL 8.3 or higher:

=> CREATE VIEW pkg_arches AS
      SELECT name FROM architectures UNION VALUES ('any'), ('all');

(They seem very similar, but the latter is very nice when you have quite
a few values to union with.)

> > 4. Would it be useful to have the priority sortable from within the
> >    database? Then it doesn't have to be done in software. (This is just
> >    an idea, and I'm not completely sure it's a good one. Note: if we do
> >    this, ensure auto-insertion works well with it.)

> We can easily add a numerical priority and sort on that.  I guess we could
> just default the priority to allow easy insertion at a low priority.

Ah, yes, a column with a low default should serve as well. Patch
attached, please check.

> > 5. [nitpick] Could we do without listing all package sections? They'll
> >    be auto-inserted anyway.

> Yes, I already added the auto insertion, which adds package sections and
> priorities if missing:

> ============================================================================
[snip]
> ============================================================================

> I'll remove the redundant list.

Thanks, though I included this on my patch for item 13 below, rather
than changing all those lines to later remove them.

> > 6. Maintainer and Uploaders handling. I would like for "uploaders" to be
> >    a proper one-to-many relationship, because the new web frontend will
> >    be querying by uploader.

> Agreed.  This is already how dak does it, and it's the best way to do it.
> We should just copy what dak does here.

Great, see patch.

> >    Also, I don't think we need per-version maintainer/uploaders
> >    information. Instead, the tool that populates the DB should just use
> >    the information from the most recent version across suites.

> This might make insertion of data quite slow.  The current schema would
> allow one to write everything from Pacages/Sources into binaries/sources
> and then update the dist_binaries/dist_sources to update the current
> packages in a given component/section.

Well, I thought over this again, and I see no reason not to keep the
Uploaders per-version, so let's drop this idea.

> On the other hand, if we were to write the entire Packages/Sources into
> a temporary table, it could be fairly simple to do the version checks.

Well, this wouldn't matter much with the above point being moot.
However, in case you're interested, I'll mention that I was writing some
importers of Sources -> DB to do some testing, and actually I couldn't
find a more efficient and simple way to do that than a temporary table.

Once eg. all of Sources is imported into tmp_sources, I just did:

        => INSERT INTO sources
               SELECT * FROM tmp_sources
               EXCEPT SELECT * FROM tmp_sources

(Later I realized this assumes that tuples in sources are static. Which
is sort of true, except that AFAIK ftpmaster can adjust the priority and
section for a same version. Btw this seems as good as any other place to
mention that we have stanzas in Sources without a Priority field.)

> >    (Oh, and we should just keep the maintainter/uploaders *addresses*.)

> What is the reasoning for this?

Nothing, that I thought we'd need it so that queries for a particular
address would be reasonably fast, but it seems I was wrong and LIKE
'%<address@example.com>' performs very well.

> > 7. [distant future, no action needed now, but see the last paragraph in
> >    this item] I think we should consider killing the P-a-s file as
> >    input, and put that information in the DB. (P-a-s or similar would
> >    still be created as output for third parties.)

> >    Then, source_architectures would be a view of (eg.)
> >    source_arch_from_control and source_arch_override (P-a-s).

> >    (Or, well: without the above, where is P-a-s information merged into
> >    the DB? Directly into source_arch? Maybe we can retain P-a-s as
> >    input, but use already the two-table scheme I mentioned in the
> >    paragraph above, writing P-a-s into source_arch_override.)

> I think having a general mechanism to dump/restore P-a-s would be useful.
> I think having a separate list of arches (control and override) is
> sensible.  I also like the idea of having a view on this, but depending on
> what we want to do, we might again need to check about foreign keys and
> views.

I chatted with Philipp Kern a bit about this. We agreed that having the
Sources Architecture field available in order not to unnecessarily send
packages to the buildds is a good thing. However, he mentioned that many
times the Architecture field in Sources (the first paragraph in
debian/control) is set to any, and then the architecture field of the
rest of stanzas in debian/control is restricted. At the moment, the DB
has no place to reflect this (because that information is not in
Packages field, of course).

And then we have the fact that P-a-s is not only source based, but also
deals with binary packages.

I think all this will need quite a rework, and I suggest (despite my
initial enthusiasm) we let source_architectures be like you initially
suggested, and come back to it some time after having Postgres deployed.

> > 10. Build logs. I'd really like for the DB to store a table with all the
> >     available logs: source, version, arch, date, state, and path.

> This could be stored as a separate table, or in build_job_properties.
> A separate table would probably be cleaner.

Great. I also think a separate table is better. I'm attaching a patch.
For obvious reasons, there is no referential integrity in this table.
Also, I think having the path be the primary key is sensible, please let
me know what you think.

> I'd also like to store the build time and space statistics (currently
> stored by sbuild in a GDBM database) stored here along with the logs.

Ok, I added those as well, but without a NOT NULL restriction, otherwise
it'll be impossible to import old logs. If somebody cares enough, once
we have this DB up and the old logs imported we can run a daily job to
slowly fill in that information, if it's easily retrievable from the log
itself (which I think it is).

Also, please check that build_time = interval, and used_space => integer
are appropriate names and types. (In particular, I'm not sure if "space
statistics" is just a number, or more.)

> > 13. The _name suffix everywhere annoys me a bit, since I like having
> >     tables that can be natural join'ed, or using(foo) join'ed. Maybe
> >     this is just me, and the _name suffix is excelent SQL practice, but
> >     I thought I'd mention to see what other think.

> If it makes queries simpler, and more understandable, then I'm all for it.

Okay, I went ahead and did this. I think it results in a sensibly more
pleasant query experience. Compare:

  => SELECT * FROM uploaders u (source, version, maintainer)
         NATURAL JOIN sources
         NATURAL JOIN suite_sources
         WHERE suite = 'unstable' ORDER BY 3;

to the old:

  => SELECT * FROM uploaders u JOIN sources s
         ON (s.name = u.source_name AND
             s.version = u.source_version AND
             s.maintainer = u.uploader)
         NATURAL JOIN suite_sources
         WHERE suite_name = 'unstable' ORDER BY 3;

or the traditional and horrid:

  => SELECT * FROM sources s, suite_sources ss, uploaders u
         WHERE s.name = u.source_name AND
               s.version = u.source_version AND
               s.maintainer = u.uploader AND
               ss.source_name = s.name AND
               ss.source_version = s.version AND
               ss.suite_name = 'unstable' ORDER BY 3;

> > 14. I'm attaching 3 trivial patches that hopefully nobody will find
> >     objectionable.

> These are now applied.

Thanks!

> > 15. Could we name the database "wannab"?

> I'm not too bothered what it's called.  Any other suggestions?
> On the wanna-build side, this will be made configurable in any case,
> so you can name it whatever you like in the connect string.

Ok. I just find sbuild-packages a bit long and not really appropriate,
so I'd like for the DB in raff to be called "wannab" if nobody objects.

Cheers,

-- 
Adeodato Simó                                     dato at net.com.org.es
Debian Developer                                  adeodato at debian.org
 
Que no te vendan amor sin espinas
                -- Joaquín Sabina, Noches de boda

Reply to:

Follow-Ups:
- [PATCH 02/12] [db] Create language "plpgsql", needed by the package_checkrel() function
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 04/12] [db] Ensure names for all PRIMARY KEY constraints end with _pkey
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 01/12] [db] Source debversion.sql and not version.sql from db.sql
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 07/12] [db] Distinguish between arches with buildds, and the Architecture field
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 05/12] [db] Add an "address" column to the builders table
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 06/12] [db] Add a "value" column to package_priorities so that they are sortable
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 09/12] [db] Add a build_logs table to store a list of all available build logs
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 12/12] [db] Add table to list the admin(s) of each buildd
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 11/12] [db] Add an "arch" column to the builder table
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 08/12] [db] Make "uploaders" be a one-to-many mapping, rather than a single value
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 03/12] [db] Rename "distribution" to "suite" (to be consistent with dak)
  - From: Adeodato Simó <dato@net.com.org.es>
- [PATCH 10/12] [db] Prefer column names derived from table name, and drop _name suffix
  - From: Adeodato Simó <dato@net.com.org.es>
- Re: RFC: Creation of a PostgreSQL database schema for wanna-build data
  - From: Marc Brockschmidt <marc@marcbrockschmidt.de>

References:
- RFC: Creation of a PostgreSQL database schema for wanna-build data
  - From: Roger Leigh <rleigh@codelibre.net>
- Re: RFC: Creation of a PostgreSQL database schema for wanna-build data
  - From: Adeodato Simó <dato@net.com.org.es>
- Re: RFC: Creation of a PostgreSQL database schema for wanna-build data
  - From: Roger Leigh <rleigh@codelibre.net>

Prev by Date: [dak/master] [various]
Next by Date: [PATCH 02/12] [db] Create language "plpgsql", needed by the package_checkrel() function
Previous by thread: Re: RFC: Creation of a PostgreSQL database schema for wanna-build data
Next by thread: [PATCH 02/12] [db] Create language "plpgsql", needed by the package_checkrel() function
Index(es):
- Date
- Thread