
Re: Include git commit id and git tree id in *.changes files when uploading?



Hi all,

On Mon, Dec 15, 2025 at 08:26:51PM -0800, Otto Kekäläinen wrote:
> To be better able to audit the software supply-chain I have been
> thinking that we should have more git info in the changes file, namely
> the git commit id it was generated from, and just in case also the git
> tree id as well.

I love the idea, Otto! Finally an incrementally implementable way to get
more insight into where we're at with divergence between git and the
archive. Awesome!


Santiago,

On Tue, Dec 16, 2025 at 09:39:31AM +0100, Santiago Vila wrote:
> > [...] have more git info in the changes file [...]
> 
> Your proposal would prevent lost git histories to be reconstructed
> and wrong git histories to be fixed.

I don't quite see how this would prevent anything. Perhaps we should think
of this proposal more as a divergence monitoring system?

I don't 100% understand your use case yet, can you maybe show more clearly
what you mean? The linked repo doesn't have any commits by you that I saw.

Keep in mind that by including the git tree-id, at least pushing a 1:1 copy
of the uploaded dsc back into git should (hopefully) result in the same git
tree-id. Tree-ids are just a deterministic hash over all the
files+directories, unlike the commit-id, which will obviously change due to
the included time+date.

This should IMO cover most of what we need when it comes to git history
fixups, no?
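
To illustrate the tree-id vs. commit-id point, here is a minimal sketch.
The paths, file contents and committer identity are made up for the demo;
only the git commands themselves are real. It commits the same content in
two fresh repositories at different dates and compares the resulting ids:

```shell
#!/bin/sh
# Sketch: identical content committed at different times gives the same
# tree id but different commit ids. Names and dates are illustrative only.
set -e
work=$(mktemp -d)

make_repo() {
    # $1 = repository directory, $2 = author/committer date
    git init -q "$1"
    printf 'hello\n' > "$1/file"
    git -C "$1" add file
    GIT_AUTHOR_DATE="$2" GIT_COMMITTER_DATE="$2" \
        git -C "$1" -c user.name=Example -c user.email=example@example.org \
        commit -q -m import
}

make_repo "$work/a" '2025-01-01 00:00:00 +0000'
make_repo "$work/b" '2025-06-01 00:00:00 +0000'

# HEAD^{tree} peels the commit to the tree object it points at.
tree_a=$(git -C "$work/a" rev-parse 'HEAD^{tree}')
tree_b=$(git -C "$work/b" rev-parse 'HEAD^{tree}')
commit_a=$(git -C "$work/a" rev-parse HEAD)
commit_b=$(git -C "$work/b" rev-parse HEAD)

echo "tree:   $tree_a / $tree_b"
echo "commit: $commit_a / $commit_b"
```

The two tree ids should match while the commit ids differ, which is why a
re-import of an uploaded source can be compared by tree-id alone.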

> So, before implementing your idea or even thinking about it, I'd like
> to see a greater effort in keeping the archive and the git histories
> in sync project-wide.

Right now we have no idea what the Debian wide git<>archive diffs even look
like in detail or how big they are.

Otto's proposal would allow us to measure this, build an automated
monitoring system around it and then work on driving the size of the diffs
down over time, e.g. by introducing a new "smell":
https://trends.debian.net/#smells.

This is how we can collectively make the effort you want to see actually
happen!

Adrian,

On Thu, Dec 18, 2025 at 10:56:42AM +0200, Adrian Bunk wrote:
> If you want to actually be able to use that for audit purposes, you
> might not want to work with the maintainer-specific mess that Salsa is.
> 
> Only debian/ or complete sources?
> debian/patches/ or patches applied?
> One git repository per package, or 1k packages in one git repository?
> The contents of a git tag/commit does sometimes not match the
> contents of the package in the archive with the matching version.

While that does sound dire at first glance, there are only so many
workflows. We might just be able to work through a large portion of them
and teach the monitoring system to recognize such divergences and (if they
are innocent) reflect this in smaller and smaller diffs as support becomes
better.

> And a git repository might disappear, or the commit might disappear,
> or the commit was never pushed anywhere.

We should be able to mitigate this by having infra that pulls and archives
the git commit right after it shows up in the archive (à la
snapshot.debian.org).
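
As a rough sketch of what such archiving could look like (no such service
exists in this thread; the directory layout and the refs/archived/example/
namespace are assumptions for the demo), one could fetch all refs of the
maintainer repository into a bare archive and then check that the recorded
commit is present:

```shell
#!/bin/sh
# Sketch: copy a repository's refs into a bare archive repository and
# verify that a given commit survives there. All names are hypothetical.
set -e
work=$(mktemp -d)

# Stand-in for a maintainer repository.
git init -q "$work/maintainer"
printf 'data\n' > "$work/maintainer/file"
git -C "$work/maintainer" add file
git -C "$work/maintainer" -c user.name=Example \
    -c user.email=example@example.org commit -q -m upload
commit=$(git -C "$work/maintainer" rev-parse HEAD)

# Archive side: keep every ref under a per-package namespace.
git init -q --bare "$work/archive.git"
git --git-dir="$work/archive.git" fetch -q "$work/maintainer" \
    '+refs/*:refs/archived/example/*'

# Even if the original repository disappears, the commit stays available.
rm -rf "$work/maintainer"
if git --git-dir="$work/archive.git" cat-file -e "$commit^{commit}"; then
    archived=yes
else
    archived=no
fi
echo "commit $commit archived: $archived"
```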

I could also imagine the repo/commit being accessible in this way (when
Otto's metadata fields are present) becoming an acceptance criterion for
FTP uploads in the future to keep things consistent.

> The proper solution would be if we had the git trees in the archive,
> in a modern setup where the buildds are integrated in the git hosting
> runner infrastructure so that the git CI tests the actual packages.

Agreed. That's the best long-term solution. Personally I'm hoping the
recent changes in FTP team structure will actually allow this to happen
now. Working on Source format 3.0 (git) has been on my TODO list for quite
a while now, but was always blocked by perceived FTP team disinterest.

On Thu, Dec 18, 2025 at 09:05:43PM +0200, Adrian Bunk wrote:
> the "To be better able to audit the software supply-chain" is the part
> I disagreed with [...]

Given what I've written above do you still think so?


Gunnar,

On Thu, Dec 18, 2025 at 10:26:48AM -0600, Gunnar Wolf wrote:
> The points you mention are all valid. However, I support Otto's idea here —
> Git repoistories might disappear, or their history might be rewritten. It
> _most often_, however, does not happen — sharing the specific commit from
> which a given tree was built costs us _very_ little, and can provide
> important information for many use cases.

Exactly! Except we can even mitigate disappearing repos, see above.

> Right, and I completely also support tag2upload as one of the most
> important steps forward in Debian usability and modernization!

Absolutely! Thanks again to Sean, Ian and all who contributed to
implementing it. Best thing to happen for Debian's developer
approachability in a good while <3.


Guillem,

On Fri, Dec 19, 2025 at 12:30:21PM +0100, Guillem Jover wrote:
> On Mon, 2025-12-15 at 20:26:51 -0800, Otto Kekäläinen wrote:
> > Has somebody else already been thinking about the same? Do others see
> > value in this?
> 
> [...] let me try to do a shallow pass over it (which means I might miss
> stuff!), to see how this could look like.

Thanks for having a look at this.

> If this was to be added, I think .dsc would be the more appropriate
> file, because .changes is a file that gets processed during uploads
> (including binary-only ones) and its information then gets set aside.
> Also the file that contains the Vcs-* fields is .dsc not .changes.
> 
> If dpkg-source were to add that kind of information, it should be
> reliable and usable. But my hunch is that this tool cannot easily
> guarantee that. Things that come to mind (some of which have already
> been mentioned in the thread):
> 
>   - If you keep your home under git, doing a «dpkg-source -x» under it
>     and then a «git rev-parse» will print an ID for a repo that has
>     nothing to do with the source package. I think this also means
>     that monorepos cannot be supported, because trying to find their
>     root, and not confuse it with something else it is going to be
>     tricky. And anything that is not going to end up as part of .dsc
>     (or its referenced files), cannot be validated.

Agreed. Do you think a new dpkg-source option to pass the monorepo root
would be a viable solution here? We don't have *that* many monorepos in
Debian so I expect it would be reasonably easy to ask all of them to plumb
this down to dpkg-source.

>     (I guess the equivalent of --git-dir=srcpkg-root/ and/or
>     --git-dir=srcpkg-root/debian/ should be used.)

Right, that should do the right thing, and it also turns off git's
up-traversal repo discovery logic. Except you probably need /.git at the
end:

       --git-dir=<path>
           [...]
           Specifying the location of the ".git" directory using this option
           (or GIT_DIR environment variable) turns off the repository discovery

Possibly needs to be combined with --work-tree? Maybe not since we're not
committing anything? Either way something like --git-dir=srcpkg-root/.git
--work-tree=srcpkg-root/ should be fully specified.

Quick test:

    $ mkdir -p /tmp/top/bot
    $ git init /tmp/top/bot
    $ git init /tmp/top/
    $ cd /tmp/top/; touch top; git add top; git commit -m top
    $ cd bot/;      touch bot; git add bot; git commit -m bot

    $ git --git-dir=/tmp/top/ log --oneline
    fatal: not a git repository: '/tmp/top/'

    $ git --git-dir=/tmp/top/.git log --oneline
    5d36f8b (HEAD -> master) top
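
The "home under git" case can be sketched the same way. In this demo
(directory names are made up) an unpacked source directory without its own
.git sits inside an unrelated repository: plain discovery walks up and
reports the outer repository's commit, while an explicit --git-dir refuses
instead of guessing:

```shell
#!/bin/sh
# Sketch: repository discovery vs. an explicit --git-dir.
# "home" and "srcpkg" are illustrative names only.
set -e
work=$(mktemp -d)

git init -q "$work/home"
mkdir "$work/home/srcpkg"          # unpacked source, not a repository
touch "$work/home/dotfile"
git -C "$work/home" add dotfile
git -C "$work/home" -c user.name=Example \
    -c user.email=example@example.org commit -q -m home

# Discovery walks up from srcpkg/ and finds the unrelated outer repository.
discovered=$(git -C "$work/home/srcpkg" rev-parse HEAD)

# With --git-dir there is no discovery: a missing .git is an error.
if git --git-dir="$work/home/srcpkg/.git" rev-parse HEAD >/dev/null 2>&1
then
    explicit=found
else
    explicit=none
fi
echo "discovered=$discovered explicit=$explicit"
```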

>   - If you do variants/equivalent of «apt source --download-only»,
>     «dpkg-source --skip-patches -x», «git init», «git add -A»,
>     «git commit -m Import», to avoid the mess that is dealing with
>     random git workflows. Then you'd get information for a local
>     throwaway repo.

Why would anyone even be doing this in the first place? I don't quite
understand the motivation.

>     (I guess the code should check whether there's a remote that
>     matches the Vcs-Git field, and whether the upstream branches
>     match the local one.)

Right. Should be easy enough to detect by such repos not having any
remotes. Alternatively infra can flag/reject such uploads later.

What do you think of using the Vcs-Git<>git-remote cross-check as a policy
tool in the future? Start with a warning and tighten the screws once we see
the project is moving in this direction.
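
A minimal sketch of such a cross-check, assuming the Vcs-Git field is
spelled exactly like that in debian/control and that an exact string match
on the URL is good enough (a real check would likely need to normalise
URLs). The function names and the example URL are hypothetical:

```shell
#!/bin/sh
# Sketch: does any git remote match the Vcs-Git URL in debian/control?
# Helper names and the example URL are made up for illustration.
set -e

vcs_git_url() {
    # Print the URL of the first Vcs-Git field, dropping a "-b branch" suffix.
    sed -n 's/^Vcs-Git:[[:space:]]*\([^[:space:]]*\).*/\1/p' \
        "$1/debian/control" | head -n 1
}

has_matching_remote() {
    want=$(vcs_git_url "$1")
    if [ -z "$want" ]; then return 1; fi
    for name in $(git --git-dir="$1/.git" remote); do
        url=$(git --git-dir="$1/.git" remote get-url "$name")
        if [ "$url" = "$want" ]; then return 0; fi
    done
    return 1
}

# Demo: a throwaway repository has no remotes, so the check fails until
# a remote with the advertised URL is added.
work=$(mktemp -d)
git init -q "$work/pkg"
mkdir "$work/pkg/debian"
printf 'Source: example\nVcs-Git: https://salsa.debian.org/example/example.git -b debian/latest\n' \
    > "$work/pkg/debian/control"

if has_matching_remote "$work/pkg"; then before=match; else before=nomatch; fi
git -C "$work/pkg" remote add origin https://salsa.debian.org/example/example.git
if has_matching_remote "$work/pkg"; then after=match; else after=nomatch; fi
echo "$before -> $after"
```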

>   - The code would need to check that the repo is clean, and that's
>     going to be annoying to do with a mix of patches applied/unapplied
>     git workflows, and dpkg-source only being called to build the
>     source (but obviously not to extract it).

I don't think this is necessary. We can monitor and enforce this at the
project level; it doesn't (necessarily) have to be enforced by dpkg-source
locally. Do you see a problem with that?

>     (I guess repos with patches applied could be declared
>     unsupported, and then dpkg-source could check for cleanliness
>     before preparing the source tree and record that somewhere.)

We should be able to reproduce the patches-applied repo (at least the
tree-id) on the monitoring infra side, no? Even if not exactly we can
measure and look at the diffs we end up with and get to work from there.

Since all of this is off the security critical path to the archive, the
monitoring system could do all sorts of shenanigans to arrive at the right
sequence of workflow steps to get the hashes to reproduce. Hell, we could
even randomly try different things and just record what was needed if it
comes to that.
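
The simplest variant of that monitoring step could look roughly like this.
It is only a sketch: tree_id_of_dir is a hypothetical helper, a copied
directory stands in for the output of «dpkg-source -x», and the recorded
tree-id is taken straight from a demo repository rather than from any real
metadata field:

```shell
#!/bin/sh
# Sketch: recompute a git tree id from a plain directory and compare it
# with a recorded one. Helper and path names are illustrative only.
set -e

tree_id_of_dir() {
    # Hash directory $1 as a git tree using a scratch repository, without
    # creating a .git inside $1. -f also adds files matched by .gitignore.
    # Note: git does not track empty directories.
    scratch=$(mktemp -d)
    git init -q "$scratch/repo"
    (
        cd "$1"
        git --git-dir="$scratch/repo/.git" --work-tree="$PWD" add -A -f .
        git --git-dir="$scratch/repo/.git" write-tree
    )
    rm -rf "$scratch"
}

# Demo repository playing the role of the maintainer's git tree.
work=$(mktemp -d)
git init -q "$work/repo"
mkdir "$work/repo/debian"
printf 'example (1.0-1) unstable; urgency=medium\n' \
    > "$work/repo/debian/changelog"
printf 'int main(void) { return 0; }\n' > "$work/repo/main.c"
git -C "$work/repo" add -A
git -C "$work/repo" -c user.name=Example \
    -c user.email=example@example.org commit -q -m release
recorded=$(git -C "$work/repo" rev-parse 'HEAD^{tree}')

# Stand-in for the unpacked upload: same files, no git metadata.
mkdir "$work/unpacked"
cp -R "$work/repo/debian" "$work/repo/main.c" "$work/unpacked/"
rebuilt=$(tree_id_of_dir "$work/unpacked")

if [ "$recorded" = "$rebuilt" ]; then echo "match"; else echo "diverged"; fi
```

Where the ids do not match, the same scratch repository could be used to
produce the diff that the monitoring system records.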

On Fri, Dec 19, 2025 at 01:01:35PM +0100, Guillem Jover wrote:
> Hmm, I think the code would also need to track the status/hashes of
> debian/patches/ and then check that these have not been modified
> between --before-build and --build, which might be a bit annoying.
> 
> As an interface I think this also has the potential for being an
> unreliable generator, because I don't think it should ever fail if it
> finds any unsupported state where it could not add the data. And that
> would mean that starting from, say, an unclean state (patches applied)
> then no git commit data gets recorded.

Right, good catch. I disagree that this should never fail. I see this as
another useful policy lever. We can start with a warning and then enforce
it if we find no tools/workflows that need this - or none remain in use
after we start this work ;-).

Even if it's unreliable until this is enforced that's still useful to get
us to a point where we *can* start enforcing.

On Fri, Dec 19, 2025 at 12:30:21PM +0100, Guillem Jover wrote:
> So, barring other problems I might have missed (and happy to hear them
> if someone can come up with new ones), I guess it might not be too
> onerous after all to add this kind of information for a specific set of
> git workflows, but certainly not in a universal way. I think it would
> also need to be added in a new field, because the way the tag2upload
> ones are specified they do not allow other such generators.

ACK. Happy to hear you're on board with the general idea. Thanks again
Guillem for diving into the dpkg side of this :-).

--Daniel
