
Re: [RFC] Proposal for new source format



Russell Stuart <russell-debian@stuart.id.au> writes:

> Has it been done?  Given this point has been raised several times before
> if it hasn't been done by now I think it's reasonable to assume it's
> difficult, and thinking that it's so is not excessively pessimistic.

Oh, it's news to me that anyone has raised this before.  I was assuming no
one had bothered to try yet because it wasn't relevant.

Intuitively it feels like a much easier problem than reproducible binaries
given the nature of a Git repository.  The hardest part is probably the
same as with tar: how to keep the output reproducible over time as Git
changes.  I haven't tried it, though.

> I personally wonder how the mirrors are expected to handle .git
> repositories.  That would increase the number of files they have to
> handle by a couple of orders of magnitude.  What are the plans for that?
> Maybe you think they can handle it?  Maybe you plan to abandon the
> mirror network in favour of something else like the CDN?  Maybe you plan
> to remove the source from the mirrors?

I was implicitly assuming the actual source format would be some archive
of the Git repository rather than the raw Git repository.  I agree that
distributing raw Git repositories and thus tons of separate files per
source package doesn't sound like a good idea.  Although that does mean
that one can't just point a Git client at the archive, which would have
been neat.  More on that below.

Agreed that it's worth saying that explicitly, and it might be worth some
thought on what the best archive format would be, since tar has proven
troublesome for reproducibility.
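
To make the tar problem concrete: the usual sources of variation are
member ordering, timestamps, and ownership metadata.  Here is a minimal
Python sketch, assuming the content to archive is already available as an
in-memory mapping of paths to bytes.  The name deterministic_tar is
hypothetical, and this is not a description of how git archive or
dpkg-source work internally.

```python
import io
import tarfile


def deterministic_tar(files, mtime=0):
    """Build an uncompressed tar from a {path: bytes} mapping.

    Members are written in sorted order with fixed metadata, so the
    output depends only on the paths and contents passed in.
    """
    buf = io.BytesIO()
    # USTAR avoids pax extended headers, which can carry extra metadata.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime          # fixed timestamp, not the current time
            info.mode = 0o644           # fixed permissions
            info.uid = info.gid = 0     # no builder-specific ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

This only normalizes the inputs under the caller's control.  It does not
address the harder part mentioned above, which is keeping the byte-level
output stable as the tool that writes the archive changes over time.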

I gave this some more thought over dinner and realized that my previous
message wasn't very constructive.  Let me try to make up for that by
describing what my goals are.  In writing this up, I realized that these
goals may not need to be met by the archive.  It feels awkward and less
than ideal to me to have multiple distribution points for source packages
in different formats, but it could be less awkward than the alternatives,
I suppose.

My goals (some of which are already met by dgit) are:

1. Every package in Debian has a canonical representation of its source
   history in Git, with a branch structure that reflects the divergence
   between different archive suites.  This history has at least one commit
   per upload, although ideally has the package maintainer's full revision
   history and upstream's full revision history.

2. This representation is readily available in some straightforward way
   (git clone would be ideal, some equivalently simple tool would be
   fine).

3. Every uploaded package clearly and unambiguously maps to a signed tag
   in the Git repository in the appropriate place in the revision
   history.

4. It's possible to upload a new version of a package to Debian (if one
   has the relevant permissions) by adding a signed tag and pushing to
   some Git remote.  If that upload is successful (which at least involves
   permission and sanity checks and ideally involves a test suite), that
   new upload appears in the canonical Git repository.  This should not
   require rewriting the branch or tag relative to the maintainer's local
   repository; in other words, it should match the Git tree that the
   maintainer tagged.

All of these together allow us to interact with the archive the way one
now commonly interacts with other large Git projects, following any of
the standard Git workflows and using Git as the native tool for expressing
changes and tagging releases.  (At this point, I think it's safe to say
that Git has sufficiently won the VCS wars that any future wildly popular
VCS will have some mechanism to bidirectionally interact with Git
repositories.)

I believe dgit already does 1-3.  tag2upload would achieve 4.

In looking this over, none of this precludes the source format 4.0 that
Bastian proposed, provided that there is some way to export that source
format easily and simply from point 4.  Maybe it doesn't matter what's
published in the source repository if everyone who wants this workflow
uses some other service to interact with the Git repositories instead.  If
this were available, I personally would stop using Debian source packages
entirely and forget that they even exist, and would only use the above
workflow.  Source packages then become an internal implementation detail
of the archive that no one needs to care about unless they want to, or
unless they're maintaining the dgit import service.

It feels inelegant to me to have multiple publication mechanisms and
multiple canonical formats and the ongoing cost of conversion from one to
the other, but maybe that's already a sunk cost and it's worth paying it
to avoid having tedious arguments?

That said, Bastian's point about what we should do if we find that the Git
repository contains something that isn't distributable is valid and needs
to be dealt with regardless.  I think one of our points of disagreement is
that I don't see how this is a concern specific to the archive; we already
have this problem because Salsa is an official project service, so we need
to solve this problem for arbitrary Git repositories already.

I realize there are technical reasons why, given the current software
implementations, rewriting a Salsa repository is far easier than redacting
source packages, and since there's more "stuff" in a Git repository, there
are more opportunities for things to go poorly.  However, I think it's
excessively optimistic to believe that no one will ever accidentally add
undistributable work to a maintainer upload of a package in a change that
didn't need to go through NEW, at which point we will have this problem
with source packages anyway.

> Finally, there are more consumers of the source format than the Debian
> packagers.  For example, I regularly download Debian source packages
> just to figure out why the hell something isn't working as I expect.  When
> I do that, there are two things that are important to me:

> 1.  The download is as small as possible, and doesn't require a
>     specialised tool.  (Github and gitlab go to the trouble of
>     providing just such a thing, which I think is evidence it's
>     needed.)  The current format is pretty good in this area.  At
>     a pinch you can get away without using deb-source to unpack it. 

I agree this is desirable but disagree that the current format is good at
it.  Unpacking the current format in all of its generality requires
either rather arcane steps or a specialized tool.

I think it's a matter of opinion whether the current 3.0 (quilt) format
with all of its complexity is better or worse on this point than a
(possibly shallow) Git repository in a tarball.  I personally think it's
worse, but I can see arguments either way.

You're on somewhat stronger ground with 3.0 (native), which I think meets
point 1 quite well, and 1.0, which isn't great but which is somewhat
better than 3.0 (quilt) on this specific metric.

> 2.  The point that has been raised here - reproducible builds of the
>     source package.  By that I mean a reproducible build should be
>     a pure function that is given the upstream source package and some
>     data in the form of patches or whatever, and ends up with the
>     source and build instructions.  Being a pure function it always
>     produces the same outputs given the same inputs.

I don't agree with this definition of reproducibility.  You're defining
reproducibility from inputs that I consider build artifacts, which to me
is rather weird.

The canonical source representation of all of my Debian packages is a
packaging Git repository plus, for non-native packages, one or more
upstream release artifacts.  I define reproducibility as generating the
same Debian source package from a signed Git tag of my packaging
repository plus, for non-native packages, whatever release artifacts
upstream considers canonical (which may be a signed tarball or may be a
Git tag or may be something else entirely).  All of this business with
patches and whatnot is an implementation detail.
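
As a rough illustration of that definition, the check treats the source
package build as a function of exactly two inputs: the tree at the signed
tag and the upstream release artifacts.  The sketch below is Python with
hypothetical names throughout (is_reproducible, toy_build); a real check
would call the actual packaging tools rather than a toy function, and
would verify the tag signature first.

```python
import hashlib
from typing import Callable, Mapping

Files = Mapping[str, bytes]  # path -> content


def sha256_hex(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()


def is_reproducible(
    build: Callable[[Files, Files], bytes],
    tag_tree: Files,
    upstream: Files,
    runs: int = 2,
) -> bool:
    """Run the build several times on the same inputs and compare digests."""
    digests = {sha256_hex(build(tag_tree, upstream)) for _ in range(runs)}
    return len(digests) == 1


def toy_build(tag_tree: Files, upstream: Files) -> bytes:
    """Stand-in for a source package build: output depends only on inputs."""
    parts = []
    for label, files in (("packaging", tag_tree), ("upstream", upstream)):
        for name in sorted(files):
            parts.append(f"{label}/{name}\n".encode() + files[name] + b"\n")
    return b"".join(parts)
```

Note that patches do not appear as an input anywhere.  Under this
definition they are something the build may produce along the way, not
something reproducibility is measured from.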

-- 
Russ Allbery (rra@debian.org)              <https://www.eyrie.org/~eagle/>

