
Re: Validating tarballs against git repositories



On 2024-03-29 22:41, Guillem Jover wrote:
> Hi!
> 
> On Fri, 2024-03-29 at 18:21:27 -0600, Antonio Russo wrote:
>> This is a vector I've been somewhat paranoid about myself, and I
>> typically check the difference between git archive $TAG and the downloaded
>> tar, whenever I package things.  Obviously a backdoor could have been
>> inserted into the git repository directly, but there is a culture
>> surrounding good hygiene in commits: they ought to be small, focused,
>> and well described.
> 
> But the backdoor was in fact included in a git commit (it's hidden
> inside a test compressed file).
> 
> The part that was only present in the tarball was the code to extract
> and hook the inclusion of the backdoor via the build system.

Yes. The "test compressed file" needs to be massaged via:

  >  tr "	 \-_" " 	_\-" | xz -d

That code comes out of the m4 file, which is not present in the git
source tree.  I'm unaware at this point of any direct evidence that the
git source alone is in any way dangerous (aside from the fact that we
cannot trust the developer at all!).
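For reference, that pipeline simply swaps tab with space and '-' with '_'
in the byte stream before handing it to xz.  A harmless demonstration of
the same swap on plain text:

```shell
# The tr invocation exchanges tab <-> space and '-' <-> '_' byte-wise;
# here applied to ordinary text instead of the obfuscated test file.
printf 'a-b_c' | tr "\t \-_" " \t_\-"
```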

>> People are comfortable discussing and challenging
>> a commit that looks fishy, even if that commit is by the main developer
>> of a package.  I have been assuming tooling existed in package
>> maintainers' toolkits to verify the faithful reproduction of the
>> published git tag in the downloaded source tarball, beyond a signature
>> check by the upstream developer.  Apparently, this is not universal.
>>
>> Had tooling existed in Debian to automatically validate this faithful
>> reproduction, we might not have been exposed to this issue.
> 
> Given that the autogenerated stuff is not present in the git tree,
> a diff between tarball and git would always generate tons of delta,
> so this would not have helped.

I may not have been clear, but I'm suggesting scrubbing all the
autogenerated stuff, and comparing that against a similarly scrubbed
git tag contents.  (But you explain that this is problematic.)

>> Having done this myself, it has been my experience that many partial
>> build artifacts are captured in source tarballs that are not otherwise
>> maintained in the git repository.  For instance, in zfs (which I have
>> contributed to in the past), many automake files are regenerated.
>> (I do not believe that specific package is vulnerable to an attack
>> on the autoconf/automake files, since the debian package calls the
>> upstream tooling to regenerate those files.)

(Hopefully the above clears up that I at least have some superficial
awareness of the build artifacts showing up in the release tarball!)

>> We already have a policy of not shipping upstream-built artifacts, so
>> I am making a proposal that I believe simply takes that one step further:
>>
>> 1. Move towards allowing, and then favoring, git-tags over source tarballs
> 
> I assume you mean git archives out of git tags? Otherwise how do you
> go from git-tag to a source package in your mind?

I'm not wed to any specific mechanism, but I'd be content with that.  I'd
be happiest with DD-signed tags certified to be DFSG-compliant, policy
compliant (i.e., lacking build artifacts), and equivalent to the scrubbed
upstream source (more on that later, building on what you say).

Many repositories today already do things close to this with pristine-tar,
so this seems to me a direction where the tooling already exists.

I'll add that, if we drop the desire for a signed archive, and instead
require a signed git tag (from which we can generate a source tar on
demand, as you suggest), we can drop the pristine-tar requirement.  If we
are less ambitious and merely move to exclusively Debian-regenerated
.tar files, we can probably still avoid many of the frustrating edge
cases that pristine-tar struggles with.
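As a sketch of that direction (function and variable names here are
placeholders; in real use the tag's signature would be verified first):

```shell
# make_orig_tarball: generate the .orig tarball directly from a git tag,
# so the source package matches the tag by construction.  Hypothetical
# helper, not existing Debian tooling.
make_orig_tarball() {
    tag=$1 pkg=$2
    # In real use, check the signature before trusting the tag:
    #   git verify-tag "$tag"
    git archive --format=tar --prefix="${pkg}/" "$tag" |
        gzip -n > "${pkg}.orig.tar.gz"    # -n: no timestamp, reproducible
}
```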

>> 2. Require upstream-built artifacts be removed (instead, generate these
>>    ab-initio during build)
> 
> The problem here is that the .m4 file to hook into the build system was
> named like one shipped by gnulib (so less suspicious), but xz-utils does
> not use gnulib, and thus the autotools machinery does not know anything
> about it, so even the «autoreconf -f -i» done by debhelper via
> dh-autoreconf, would not regenerate it.

The way I see it, there are two options in handling a buildable package:

1. That file would have been considered a build artifact, consequently
removed and then regenerated.  No backdoor.

2. The file would not have been scrubbed, and a difference between the
git version and the released tar version would have been noticed.
Backdoor found.

Either of these is, in my mind, dramatically better than what happened.

One automatic approach would be to run dh-autoreconf and identify the
changed files, remove those files from both the distributed tarball and
the git tag, and check whether what remains differs.  (You also suggest
something very similar to this, repacking the archive with those
Debian-generated build artifacts.)
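A minimal sketch of that approach, assuming a POSIX shell and taking a
caller-supplied regeneration command in place of dh-autoreconf (all
names here are illustrative):

```shell
# scrub_and_compare: regenerate build artifacts in a copy of the
# unpacked tarball, treat every file the regeneration step changed as an
# artifact, remove those from both trees, then diff what remains.
scrub_and_compare() {
    tar_src=$1 git_src=$2 regen=$3
    work=$(mktemp -d)
    cp -a "$tar_src/." "$work"
    (cd "$work" && sh -c "$regen")
    # Every file the regeneration step changed is treated as an artifact.
    diff -rq "$tar_src" "$work" |
        sed -n "s|^Files $tar_src/\(.*\) and .* differ|\1|p" |
        while IFS= read -r f; do
            rm -f "$tar_src/$f" "$git_src/$f"
        done
    # After scrubbing, tarball and git tag should be identical.
    diff -r "$tar_src" "$git_src"
}
```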

I may be missing something here, though!

> Removing these might be cumbersome after the fact if upstream includes
> for example their own maintained .m4 files. See dpkg's m4 dir for an
> example of this (although there it's easy as all are namespaced but…).

I am not an m4 expert (in fact, I have specifically tried to avoid
learning any more about auto(make/reconf) than absolutely necessary).

My point is just: either those files are needed, or not.  If they're
needed, they need to not differ.  And if they're not, they should
be scrubbed.

I think you are saying that doing this automatically is going to be
hard/impossible.  Is that fair?

> Not using an upstream provided tarball, might also mean we stop being
> able to use upstream signatures, which seems worse. The alternative
> might be promoting for upstreams to just do the equivalent of
> «git archive», but that might defeat the portability and dependency
> reduction properties that were designed into the autotools build
> system, or increase the bootstrap set (see for example the
> pkg.dpkg.author-release build profile used by dpkg).

Ok, so am I understanding you correctly in that you are saying: we do
actually want *some* build artifacts in the source archives?

If that's the case, could we make those files at packaging time,
analogous to the DFSG-exclusion stripping process?

> (For dpkg at least I'm pondering whether to play with switching to
> doing something equivalent to «git archive» though, but see above, or
> maybe generate two tarballs, a plain «git archive» and a portable one.)

That also makes sense; I don't know enough about bootstrapping or
storage requirements to argue for one vs. the other.

>> 3. Have tooling that automatically checks the sanitized sources against
>>    the development RCSs.
> 
> Perhaps we could have a declarative way to state all the autogenerated
> artifacts included in a tarball that need to be cleaned up
> automatically after unpack, in a similar way as how we have a way to
> automatically exclude stuff when repackaging tarballs via uscan?

Yes, and I still hold out hope that there might be automatic ways to
identify these.  This is a significant task, though, and I think this
step is one of the major things that will require human time.
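Something in the spirit of uscan's Files-Excluded could work here.
Purely as a hypothetical illustration (this field does not exist in any
current tooling, and the name is made up):

```
# Hypothetical declarative field, in the style of Files-Excluded in
# debian/copyright: artifacts to scrub automatically after unpack.
Files-Autogenerated:
 configure
 Makefile.in
 aclocal.m4
 m4/libtool.m4
```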
 
> (.gitignore, if upstream properly maintains those might be a good
> starting point, but that will tend to include more than necessary.)
> 
>> 4. Look unfavorably on upstreams without RCS.
> 
> Some upstreams have a VCS, but still do massive code drops, or include
> autogenerated stuff in the VCS, or do not do atomic commits, or in
> addition their commit message are of the style "fix stuff", "." or
> alike. So while this is something we should encourage, it's not
> sufficient. I think part of this might already be present in our
> Upstream Guidelines in the wiki.

I somehow doubt that wagging our fingers is going to do much. (So yeah,
I definitely think we wholeheartedly agree here!)

>> In the present case, the triggering modification was in a modified .m4 file
>> that injected a snippet into the configure script.  That modification
>> could have been flagged using this kind of process.
> 
> I don't think this modification would have been spotted, because it
> was not modifying a file it would usually get autogenerated by its
> build system.

If we look at what ./autogen.sh would have changed, and scrub those
files from the release archive, wouldn't that mean that the malicious
m4 file would have been spotted, since it would NOT have been autogenerated?
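Concretely, the check I have in mind might look like this rough sketch
(assuming upstream ships a working ./autogen.sh; the function name and
layout are illustrative):

```shell
# find_unexplained_files: flag files in the release tarball that are
# neither tracked in the git tag nor produced by running the project's
# own generator -- a maliciously added .m4 file would show up here.
find_unexplained_files() {
    git_src=$1 tar_src=$2
    gen=$(mktemp -d)
    cp -a "$git_src/." "$gen"
    # Regenerate everything the project's own tooling can produce.
    (cd "$gen" && sh ./autogen.sh >/dev/null 2>&1) || true
    # Anything in the tarball with no counterpart in the regenerated
    # tree has no explanation -- print it.
    (cd "$tar_src" && find . -type f) | while IFS= read -r f; do
        [ -e "$gen/$f" ] || echo "${f#./}"
    done
}
```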

>> While this would be a lot of work, I believe doing so would require a
>> much larger amount of additional complexity in orchestrating attacks
>> against Debian in the future.
> 
> It would certainly make it a bit harder, but I'm afraid that if you
> cannot trust upstream and they are playing a long game, then IMO they
> can still sneak nasty stuff even in plain sight with just code commits,
> unless you are paying extreme close attention. :/
> 
> See for example <https://en.wikipedia.org/wiki/Underhanded_C_Contest>.

I take a look at these every year or so to keep me terrified of C!
If it's a single upstream developer, I absolutely agree; but if there's
an upstream community reviewing the git commits, I really do believe
there is hope of them identifying bad(tm) things.

I just want to make sure that what we actually pull in is what the
community is actually reviewing.  I feel like anything less gets dangerous.
(Given few enough eyeballs, all bugs are deep!)

But, I will definitely concede that, had I seen a commit that changed
that line in the m4, there's a good chance my eyes would have glazed over
it.

> 
> Thanks,
> Guillem
> 

Thank you,
Antonio



