[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Validating tarballs against git repositories



On Sat, 30 Mar 2024 at 14:16:21 +0100, Guillem Jover wrote:
> in my mind this incident reinforces my view that precisely storing
> more upstream stuff in git is the opposite of what we'd want, and
> makes reviewing even harder, given that in our context we are on a
> permanent fork against upstream, and if you include merge commits and
> similar, there's lots of places to hide stuff. In contrast storing
> only the packaging bits (debian/ dir alone) like pretty much every
> other downstream is doing with their packaging bits, makes for an
> obviously more manageable thing to review and not get drown into,
> more so if we have to consider that next time perhaps the long-game
> gets played within Debian.

I'd like to push back against this, because I'm not convinced by this
reasoning, and I'd like to provide another point of view to consider.

I find that having the upstream source code in git (in the same form that
we use for the .orig.tar.*, so including Autotools noise, etc. if present,
but excluding any files that we exclude by repacking) is an extremely
useful tool, because it lets me trace the history of all of the files
that we are treating as source - whether hand-written or autogenerated -
if I want to do that. If we are concerned about defending against actively
malicious upstreams like the recent xz releases, then that's already a
difficult task and one where it's probably unrealistic to expect a high
success rate, but I think we are certainly not going to be able to achieve
it if we reject tools like git that could make it easier.

Am I correct to say that you are assuming here that we have a way to
verify the upstream source code out-of-band (therefore catching the xz
backdoor is out-of-scope here), and what you are aiming to detect here
is malicious changes that exist inside the Debian delta, more precisely
the dpkg-source 1.0 .diff.gz or 3.0 (quilt) .debian.tar.*? If that's your
threat model, then I don't think any of the modes that dgit can cope with
are actually noticeably more difficult than a debian/-only git repo.

As my example of a project that applies patches, I'm going to use
bubblewrap, which is a small project and has a long-standing patch that
changes an error message in bubblewrap.c to point to Debian-specific
documentation; this makes it convenient to tell at a glance whether
bubblewrap.c is the upstream version or the Debian version.

There are basically three dgit-compatible workflows, with some minor
adjustments around handling of .gitignore files:

- "patches applied" (git-debrebase, etc.):
  This is the workflow that proponents of dgit sometimes recommend,
  and dgit uses it as its canonicalized internal representation of
  the package.
  The git tree is the same as `dpkg-source -x`, with upstream source code
  included, debian/ also included, and any Debian delta to the upstream
  source pre-applied to those source files.
  In the case of bubblewrap, if we used this workflow, after you clone
  the project, bubblewrap.c would already have the Debian-specific error
  message.
  (dgit --split-view=never or dgit --quilt=dpm)

- "patches unapplied" (gbp pq, quilt, etc.):
  This is the workflow that many of the big teams use (at least Perl,
  Python, GNOME and systemd), and is the one that bubblewrap really uses.
  The git tree is the same as `dpkg-source -x --skip-patches`, with
  upstream source code included, and debian/ also included.
  Any Debian delta to the upstream source is represented in debian/patches
  but is *not* pre-applied to the source files: for example, in the case
  of bubblewrap, after you clone
  https://salsa.debian.org/debian/bubblewrap.git and view bubblewrap.c,
  it still has the upstream error message, not the Debian-specific one.
  (dgit --quilt=gbp or dgit --quilt=unapplied; I use the latter)

- debian/ only:
  This is what you're advocating above.
  The git tree contians only debian/. If there is Debian delta to the
  upstream source, it is in debian/patches/ as usual.
  (dgit --quilt=baredebian* family)

In the "patches applied" workflow, the Debian delta is something like
`git diff upstream/VERSION..debian/latest`, where upstream/VERSION must
match the .orig.tar.* and debian/latest is the packaging you are reviewing.
Not every tree is a valid one, because if you are using 3.0 (quilt),
then there is redundancy between the upstream source code and what's in
debian/patches: it is an error if the result of reverting all the patches
does not match the upstream source in the .orig.tar.*, modulo possibly
some accommodation for changes to **/.gitignore being accepted and ignored.
To detect malicious Debian changes in 3.0 (quilt) format, you would want
to either check for that error, or review both the direct diff and the
patches.

Checking for that error is something that can be (and is) automated:
I don't use this workflow myself, but as far as I'm aware, dgit will
check that invariant, and it will fail to build your source package
if the invariant doesn't hold. dpkg-source in 3.0 (quilt) format will
also make your source package fail to build if the desired invariant
isn't true, except where intentionally ignored (ignoring deleted files,
ignoring chmod +x, etc.).

In the "patches unapplied" workflow, the Debian delta is exactly the
contents of debian/, including debian/patches. There is redundancy between
the upstream source code and the Debianized branch: it is an error for
`git diff upstream/VERSION..debian/latest` to show any changes outside
debian/, except for possibly **/.gitignore.
To detect malicious Debian changes, you would want to check for
that error.

Again, checking for that error is something that can be (and is)
automated: I use this workflow myself (e.g. in bubblewrap), so I know from
experience that dgit *does* check for that error, and will fail to build
the source package if the invariant does not hold. Again, dpkg-source
in 3.0 (quilt) format will also make your source package fail to build
if that error exists, except in the cases that it intentionally ignores.

In the "debian/ only" workflow, the Debian delta is exactly the contents
of debian/. There is no redundancy, so every tree is in some sense a
valid one (although of course sometimes patches will fail to apply, or
whatever).

In summary, my claim is that having the upstream source code in git is a
good thing because it makes detecting malicious upstreams less difficult
than in a debian/-only workflow, while not making detecting malicious
packagers noticeably harder: the only extra thing that is necessary is to
carry out a check that can be done automatically, to assert that the tree
is internally-consistent for whichever workflow is in use.

    smcv


Reply to: