Adding more data to UDD

To: debian-qa@lists.debian.org, lucas@debian.org
Subject: Adding more data to UDD
From: Helmut Grohne <helmut@subdivi.de>
Date: Wed, 6 Aug 2025 19:17:57 +0200
Message-id: <[🔎] 20250806171757.GA2988685@subdivi.de>
Mail-followup-to: Helmut Grohne <helmut@subdivi.de>, debian-qa@lists.debian.org, lucas@debian.org

Hi Lucas et al,

in Brest we talked about getting more data into UDD in several contexts.
Let me give an overview of what data I have and we may look into which
of it poses most beneficial to UDD.

crossqa.debian.net is running on a sqlite3 database whose schema is to
be found at
https://git.subdivi.de/~helmut/crossqa.git/tree/schema.sql .
The contents are of varying interest:
 * depstate
   * This mostly duplicates the data available from
     https://qa.debian.org/dose/debcheck/cross_unstable_main_amd64/
   * It presently has 164 rows (one for each source package times
     architectures) and the data is highly fluctuating. For instance,
     a glibc or gcc upload may render all packages unsatisfiable. From
     a user's point of view what would be more interesting here is an
     aggregate over time, but that's not something I have. The dose page
     has a "Summary by duration".
   * The most similar thing in UDD is the wannabuild table.
 * depcheck internally manages when to rerun dose. Boring.
 * builds
   * We've crossed one million builds now and each of them has a row
     here. This is an append-only table, but I occasionally delete rows
     referencing corrupted logs.
   * Interpreting the data is difficult. A package may or may not have
     recent builds (e.g. old successful builds may indicate regressed
     B-Ds). A package may have successful or unsuccessful builds and
     this may even be per architecture. Telling whether a package has a
     problem from this data is a science and guesswork. Still, this is
     what has been requested in the cross build bof to be fed into DDPO
     or the dashboard.
 * buildrequests and building is about scheduling and boring from a UDD
   pov.
 * bugs is mirroring data from the BTS. UDD already has this.
The complete database file is 150MB. Size likely is not a problem here.

Then there is the dedup.debian.net/multiarch-hints/dumat/conflict
analyzer family of archive analyzers. They run related code bases with
different analyzers and different target package sets. What they have in
common is that they continuously download binary packages from the
archive, perform an extraction step specific to the use case and store
the result in a sqlite3 database for offline analysis. These databases
range from 2GB (dedup.d.n) to 20GB (experiment with conflict analysis
involving multiarch).

Features:
 * Checksums of files.
   * dedup.d.n only does it for amd64, but also has gunzipped checksums
     and checksums of decompressed image files to compare differently
     compressed PNGs against each other.
   * The multiarch hinter only has checksums for Arch:any packages.
 * File metadata (owner, group, filetype, link target):
   * This is not obtainable via the Contents indices.
 * DEBIAN/control metdata (relations, Multi-Arch)
   * Depends are relevant in several situations to judge whether
     installing one package requires installing another or whether they
     are mutually exclusive.
   * The multiarch hinter of course looks at Multi-Arch, but so does
     dumat.
   * Quite a few of these fields are already present in the packages
     table, but it stuffs the entire Depends field into one column.
 * trigger interest is extracted by dumat.
 * diverted files are extracted by dumat as this is necessary for
   rigorous conflict analysis.

Schemas:
 * dedup.d.n https://git.subdivi.de/~helmut/debian-dedup.git/tree/schema.sql
 * multiarch https://git.subdivi.de/~helmut/debian-dedup.git/tree/schema.sql?h=multiarchhints
 * dumat https://salsa.debian.org/helmutg/dumat/-/blob/master/schema.sql
 * When you look into these schemas, ignore bugs as UDD has these
   already.

I observe that multiarch hints already are imported into the
multiarch_hints table. Is that a new thing?

I'm not sure any of these tools should feed into UDD, but if we can come
up with features that UDD is interested in, it would be relatively easy
to write a custom extractor for the features of interest. Running a full
import typically completes within two days even if you import all
release architectures for several releases assuming good connectivity
and a fast machine. There is no benefit in backing up these tables as
recreating them from scratch is less error prone.

>From my pov, the most beneficial feature in UDD would be basic checksums
for installed files to correlate and search for them optionally with
file metadata. The second feature I'd be interested in is package
implication and coinstallability, but this is very difficult to model in
SQL so we may be best off leaving such questions to apt.

I propose that we start with looking into UDD schema extensions and
figure out what we want to capture there. From there, I can look into
providing suitable exporters (crossqa.d.n) or extractors (dedup family).

Possible tables:

CREATE TABLE cross_build_history (
    source TEXT NOT NULL,
    version DEBVERSION NOT NULL,
    build_architecture TEXT NOT NULL,
    host_architecture TEXT NOT NULL,
    started TIMESTAMP NOT NULL,
    successful BOOLEAN NOT NULL
    -- let's not link to build logs
);

For package contents, it would be useful to have an integer id column on
the packages table. A package_contents table could then reference that
id (with ON DELETE CASCADE). Without an id column, those 5e7 rows would
each have to list package, version and architecture.

CREATE TABLE package_contents (
    package_id INTEGER NOT NULL REFERENCES packages(id),
    filename TEXT NOT NULL,
    filetype INTEGER NOT NULL, -- use S_IF* numeric values?
    owner TEXT, -- use NULL for root?
    group TEXT, -- use NULL for root?
    mode INTEGER, -- use NULL for symlinks?
    link_target TEXT, -- use NULL for non-links?
    sha256 BIT(256) -- use NULL for non-files?
);

If we want to support multiple hashes or replacing the hash function,
we could separate hashes to another table (see dedup.d.n schema).

Now this mail got quite long. Hence, I suggest ignoring much of it and
focusing on the aspect of most interest.

Thanks for considering

Helmut

Reply to:

Prev by Date: Bug#1110429: qa.debian.org: 404 Not found due to trailing slash breaks codesearch.d.o
Next by Date: Bug#1110515: "versions" shows outdated testing information
Previous by thread: Bug#1110429: marked as done (qa.debian.org: 404 Not found due to trailing slash breaks codesearch.d.o)
Next by thread: Bug#1110515: "versions" shows outdated testing information
Index(es):
- Date
- Thread