
Debian derivatives census: patch generation runs daily

Hi all,

Now that the debdiff security issues have been fixed, I have enabled daily
generation of patches between Debian and our derivatives. If you are
interested in helping fix some of the issues with this process, please
take a look at the FIXMEs in the script. If you are a Debian member you
can look at the raw data on stabile; if not, you can look at the daily
rsynced output of patches smaller than 15MB on alioth.


This is enabled by the existence of snapshot.debian.org, which uses a
PostgreSQL database for metadata and a hash-based (SHA-1) filesystem
structure to store all source and binary packages uploaded to Debian as
well as all the apt metadata.
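
To illustrate what a hash-based filesystem structure means in practice, here is a minimal Python sketch. The two-level fan-out layout and the names sha1_of_file and farm_path are assumptions made for illustration; they are not taken from the snapshot.debian.org code or from the census script.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def farm_path(root, sha1_hex):
    """Map a SHA-1 digest to a path under root.

    Assumed layout: two directory levels taken from the first four hex
    digits, so that no single directory grows too large.
    """
    return Path(root) / sha1_hex[:2] / sha1_hex[2:4] / sha1_hex
```

With a layout like this, a file is located from its hash alone, which is why derivative metadata that lacks SHA-1 hashes needs an extra mapping step.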

The patch generation works like this:

Download the Sources files using apt-get run on the sources.list
snippets on the census wiki pages of all derivatives.

For each source package in each derivative:

Check if the dsc has ever been in Debian; if not, check if the other
parts have, and from that decide whether the package is unmodified.
Unmodified source packages are skipped; they include those with the exact
same dsc file and those where all the non-dsc parts are identical.
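
The unmodified check above could be sketched roughly as follows. This is only an illustration: known_in_debian is a hypothetical set of SHA-1 hashes standing in for a lookup against the snapshot database, and the function name is made up here.

```python
def classify_source_package(dsc_sha1, part_sha1s, known_in_debian):
    """Classify a derivative source package as 'unmodified' or 'modified'.

    dsc_sha1        -- SHA-1 of the .dsc file
    part_sha1s      -- SHA-1s of the non-dsc parts (orig tarball, diff, ...)
    known_in_debian -- hypothetical set of SHA-1s ever seen in Debian
    """
    # Exact same dsc as one that was in Debian: nothing changed.
    if dsc_sha1 in known_in_debian:
        return "unmodified"
    # The dsc differs, but every other part has been in Debian.
    if part_sha1s and all(h in known_in_debian for h in part_sha1s):
        return "unmodified"
    return "modified"
```

Only packages classified as modified go on to the later steps.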

Try some heuristics (name, version, changelog entries) to find out if
the package could be based on some package that is or was in Debian.

If it was not then skip to the next one and make a note, since Debian
might want to know about source packages that are missing from Debian.

If it was, then use debdiff to create a diff and filterdiff to create a
diff of the debian/ dir. Use the lsdiff cache to decide if the patch
should be displayed (e.g. on the PTS) or not. I think I will drop this
lsdiff bit and move it to a future to-be-worked on interface to the
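
To show what extracting the debian/ part of a patch involves, here is a pure-Python stand-in for the filterdiff step. It is not how the script does it (the script uses filterdiff itself); it assumes the input is a `diff -Nru` style patch with one leading "package-version/" path component, and the function names are made up for this sketch.

```python
def target_path(section):
    """Return the path from the '+++ ' header of one per-file section."""
    for line in section:
        if line.startswith("@@"):
            break  # past the headers, stop before hunk content
        if line.startswith("+++ "):
            return line[4:].split("\t")[0].strip()
    return None


def keep_debian_dir(diff_text):
    """Keep only the per-file sections of a patch that touch debian/."""
    sections, current = [], []
    for line in diff_text.splitlines(keepends=True):
        # A 'diff ' line at column 0 starts a new per-file section.
        if line.startswith("diff ") and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)

    kept = []
    for section in sections:
        path = target_path(section)
        if path is None:
            continue
        # Drop the leading "package-version/" component (assumed present).
        relative = path.partition("/")[2]
        if relative.startswith("debian/"):
            kept.extend(section)
    return "".join(kept)
```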

Here are some stats about the last run:

Ubuntu took 3 hours; all the rest finished in less than 1 hour, mainly
due to the extensive caching done by the script:

3.0M symlinks mapping between MD5/SHA-256 hashes and SHA-1 hashes for
those files where the apt metadata for derivatives does not have any SHA-1
hashes. If you are responsible for the archives of any derivatives that
are missing SHA-1 hashes in your apt metadata, we would greatly
appreciate it if you could fix your metadata.
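
A symlink cache of this kind might look like the sketch below. The directory layout (one subdirectory per hash algorithm, with the SHA-1 stored as the symlink target) and both function names are assumptions for illustration; the script's actual layout may differ.

```python
import os
from pathlib import Path


def link_hash_to_sha1(cache_dir, algorithm, digest, sha1_hex):
    """Record digest -> SHA-1 as a symlink named after the digest.

    Assumed layout: <cache_dir>/<algorithm>/<digest>, with the SHA-1
    string as the link target. The target does not need to exist.
    """
    link = Path(cache_dir) / algorithm / digest
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.is_symlink():
        os.symlink(sha1_hex, link)
    return link


def lookup_sha1(cache_dir, algorithm, digest):
    """Return the cached SHA-1 for a digest, or None if it is not cached."""
    link = Path(cache_dir) / algorithm / digest
    return os.readlink(link) if link.is_symlink() else None
```

Once a mapping is cached, later runs can skip downloading and hashing the file again, which is where most of the time saving comes from.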

27M symlinks mapping between human-readable patch names and the patches
directory, which uses SHA-1 hashes for file/dir names. The human
readable names look like this:


145M cache of changelog source package names and version numbers for the
modified packages from derivatives (JSON format).

1.1G lsdiff output for all the patches.

57G of files that were never in Debian (according to the snapshots
database), including orig.tar.gz/diff.gz etc.

164G of patches. Most of this is 204 patches larger than 100MB each, which
are created due to deficiencies in the script (see the FIXMEs) and also,
in some cases, unnecessary divergence or changes.


