Re: limits for package name and version (MBF alert: ... .deb filenames)

To: Uoti Urpala <uoti.urpala@pp1.inet.fi>
Cc: debian-devel@lists.debian.org
Subject: Re: limits for package name and version (MBF alert: ... .deb filenames)
From: Henrique de Moraes Holschuh <hmh@debian.org>
Date: Wed, 27 Apr 2011 15:11:14 -0300
Message-id: <20110427181114.GB7102@khazad-dum.debian.net>
In-reply-to: <loom.20110426T233140-621@post.gmane.org>
References: <1303567600.4264.5.camel@localhost> <20110424003151.GB7323@angband.pl> <1303617286.3032.83.camel@localhost> <20110425202520.GA15812@angband.pl> <1303766993.3032.301.camel@localhost> <20110425222012.GA20724@angband.pl> <20110426022453.GA6256@khazad-dum.debian.net> <loom.20110426T062535-615@post.gmane.org> <20110426110254.GA31356@khazad-dum.debian.net> <loom.20110426T233140-621@post.gmane.org>

On Tue, 26 Apr 2011, Uoti Urpala wrote:
> This branch of the thread was NOT about packages that use date ONLY. Maybe
> that's what you were confused about above? The version would still need the
> last release name too, as in 15.3.2~rc3+svn20050101120000.

Both possibilities showed up in the thread: base version + commit
reference, and just a commit reference (upstream "never releases" or
something to that effect).

> > > So you'll have the latest upstream version tag, followed by a long
> > > timestamp. That's no shorter than typical 'git describe' output, just a
> > > lot less functional.
> > 
> > It is *bounded*, and it can be a LOT shorter.
> 
> Typically it is not a "LOT" shorter. And as I explained in the part you
> snipped, a timestamp with one-second precision may not be enough to
> adequately identify a version in some not-particularly-rare use cases.

Well, at which point you [are supposed to] have the full information in the
changelog.  My point is that the version string needs to be short, AND it
does *not* have to uniquely identify an upstream commit: that is NOT its
primary function, although we _always_ try to do that when feasible.
And it usually _is_ feasible, at least when upstream does named releases.

But at 30 characters, you do not have space to spare to add much of a hash.
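
To make the space budget concrete, here is a rough Python sketch.  It uses
the 30-character limit discussed in this thread and the example version
quoted at the top of this mail.  The helper name `spare_chars` and the "+g"
separator are only illustrative assumptions, not an established convention,
and the Debian revision is not counted at all.

```python
# Sketch only: how much room is left for a hash under the discussed limit.
MAX_VERSION_LEN = 30  # the limit under discussion in this thread


def spare_chars(version: str, separator: str = "") -> int:
    """Characters left for a hash suffix after `version` and `separator`."""
    return MAX_VERSION_LEN - len(version) - len(separator)


example = "15.3.2~rc3+svn20050101120000"  # version quoted earlier in this mail

print(len(example))                # 28: base version plus timestamp
print(spare_chars(example))        # 2: room left with no separator at all
print(spare_chars(example, "+g"))  # 0: a hypothetical "+g" separator uses it up
```

So a base version plus a one-second timestamp already leaves at most two
characters, before any Debian revision is appended.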

> > > Your above "tell upstream when you checked out his tree and he can locate
> > > the commit by date/time" would only work properly for timestamps of type
> > > 1). But that's not an at all realistic alternative.
> > 
> > You have the full commit info in the changelog, where you can specify
> > branch, etc. when best practice is being followed.  Use it.
> 
> If you have recorded the exact hash that will work (of course!). But what you
> were saying about timestamps would not work.

I suppose you're right, you did come up with scenarios where it would not be
enough (though I did mention you would also tell upstream the *branch* you
released from.  Still, not even that might be enough).

> some of the issues with timestamps (though I think the explanations in my
> previous mail should already have addressed that). The commit date is when

Not really, but this one did.  Yes, there can be commit time overlaps
easily, and lots of situations where upstream would not know where you
pulled from even if you tell them the branch.

> each branch changed at a quite different time. If the top commit of the
> master branch has a commit timestamp from a month ago that means the branch
> could have been modified a minute ago.

And if upstream likes to shuffle things too much, he may not be able to
locate that commit anymore.  I get it.

> > > What you wrote about identifying branches in your other mail ("and you
> > > already know which branch of which tree because that information must be
> > > available and up-to-date in debian/copyright") is also wrong or at least
> > > meaningless. Maybe you'll know that the code was available on the project's
> > > public repository under the branchname "fixes-for-debian" at the time it
> > > was downloaded. But what good will that information do for you later, if
> > > the contents of that branch were merged to another and the obsolete branch
> > > name then deleted two days after being created? In the typical case branch
> > > names are not persistent information.
> > 
> > It is at least as future-proof as hashes.  If your upstream is messy and
> > likes to rebase and lose past history, only the full commit info
> 
> You're mixing up completely different things. Nothing in my example involved
> rebasing or losing history. That's the point: branch names are not a part of
> stored history, and can disappear/change even if there is no "messiness".

Indeed. I was illustrating that even the full hash is not enough to identify
a commit (but then I was talking about a commit outside of its parent
history, which is a major misuse of the git term -- I should have called
it a "patch" equivalent to a certain commit -- i.e. what you get when you
cherry-pick or rebase).

> > Full hash colisions are impossible, because, well, the basic constraints

[inside a DVCS repository, to identify an object in that repository. You
lost the context when quoting]

> > the VCS depends upon *BREAKS* when that happen.  That commit never gets
> > accepted into the repository because the VCS aborts/abends.  You try
> > again, get a different commit date/time and thus a different hash, the
> > colision condition is gone if you're lucky enough not to get a new one,
> > and life continues.  I really should not have to explain *THIS*.
> 
> You really should not try to explain something you clearly have no clue about.
> You get a 160-bit hash match, say "damn, bad luck there", change things a bit
> and move on? I hope some readers can at least appreciate your explanation for
> the comedy value :)

I suppose I might also want to take a screenshot for bragging rights :)
I don't think anyone managed to observe a real collision yet.

I mean you never have to care about full hash collisions when trying to
identify a commit you got from upstream, because colliding objects just
cannot exist in a repository, and therefore the full hash really always maps
to exactly one object upstream.

I certainly did not mean that hashes cannot collide.  They can, but in that
case the collision happens when an object is being created/imported, and it
should be detected and the operation aborted immediately by the DVCS before
the colliding object arrives at the repository (otherwise it corrupts
something).

Partial hashes *can* collide in a repo, however.

There is a thinko in this argument; see the end of this email.
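
To put a rough number on the partial-hash remark above, here is a small
Python sketch using the usual birthday-bound approximation.  The repository
size of one million objects and the function name are assumptions picked
only for illustration, and the estimate treats hashes as uniformly random.

```python
import math


def abbrev_collision_probability(n_objects: int, hex_digits: int) -> float:
    """Birthday-bound estimate that any two objects share an abbreviation."""
    space = 16 ** hex_digits                  # number of possible abbreviations
    pairs = n_objects * (n_objects - 1) / 2   # number of object pairs
    return -math.expm1(-pairs / space)        # 1 - exp(-pairs / space)


# Hypothetical repository with one million objects.
for digits in (7, 10, 40):
    print(digits, abbrev_collision_probability(1_000_000, digits))
```

Under these assumptions a 7-digit abbreviation is all but certain to be
ambiguous somewhere in the repository, a 10-digit one still collides with a
probability of roughly one in three, and the full 40 digits give a
probability that is negligible in practice.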

As for the comedic value of retrying a git commit to avoid a collision,
well, it might even work.  The boundary conditions are: the collision is
caused by the commit object, and not by one of the tree objects or one of
the blob objects, and you wait a bit before retrying: commit time
information is part of the information hashed in commit objects, so the hash
of the commit object changes and might not collide with anything anymore.
Certainly useless if you're trying a git fetch, though.
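
A minimal Python sketch of why waiting and retrying changes the commit
hash: git names an object by the SHA-1 of "<type> <size>", a NUL byte, and
the object body, and the commit body includes the author and committer
timestamps.  The identity, message and timestamps below are made up for
illustration.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    # SHA-1 over "<type> <size>" + NUL + body, the way git names objects.
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, when: int, message: str) -> bytes:
    # Made-up identity; `when` is seconds since the epoch.
    ident = f"A Developer <dev@example.org> {when} +0000"
    return (
        f"tree {tree}\n"
        f"author {ident}\n"
        f"committer {ident}\n"
        f"\n"
        f"{message}\n"
    ).encode()


EMPTY_TREE = git_object_id("tree", b"")  # id of a tree with no entries

first = git_object_id("commit", commit_body(EMPTY_TREE, 1303927874, "retry me"))
retry = git_object_id("commit", commit_body(EMPTY_TREE, 1303927875, "retry me"))

print(first)
print(retry)
print(first != retry)  # same tree and message, committed one second later
```

The tree and the message are identical in both commits; only the timestamp
differs, and that alone yields a different commit id.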

> > 0. This is about package versioning;
> > 1. You do not have space for the full hash in the version string;
> > 2. such hash alone is useless for the packaging system in the first place,
> >    it does not work as a package version by itself at all;
> > 3. the shortened hash is of limited value for "upstream identification"
> >    purposes when things get difficult, and wastes precious space;
> 
> It's of high value for "upstream identification" purposes when things are
> NOT difficult. And it's also of high value in the difficult cases as it'll
> normally make it obvious that there ARE difficulties such as changed
> upstream history; with only a timestamp you could easily make a dangerous
> mistake without realizing there's anything special to watch out for.

I'd say you had better know beforehand that your upstream is playing history
rewriting games if you're going to release from their repo, but still...
yes, it is a use case I had not thought of.

It is still not a good reason to waste part of a draconian 30 chars of space
with hash information.

> > 4. you're supposed to put lots of meta information about the top commit in
> >    the changelog to actually have something that is guaranteed to work well
> >    for "upstream identification" purposes.  That includes the full hash and
> >    more;
> > 5. using unbounded methods of identifying the upstream release is never
> >    going to be a best practice because you have to manually check it every
> >    time to not have exceeded the maximum length and when it does, you
> >    will have to fudge it and break the pattern anyway.
> 
> There's no bounded method that's guaranteed to adequately identify the
> upstream revision. If you want to restrict length to a particular limit,
> checking that would be easy to automate (you would not need to "manually
> check it every time"). On the other hand, checking whether a timestamp
> meaningfully identifies a revision is much harder.

Well, let's change my objection, then:  I do not object to using
git-describe-like paths and hashes in the version string, provided that:

1. it usually doesn't require manual handling in the first place;

2. the pattern used for manual handling is guaranteed to be functional
   enough that sane ordering is ensured for any number of either
   automatic or manual versions existing before and after this one;

3. the procedure/manual pattern is documented either project-wide or
   in debian/README.source.

I.e. someone who knows best, please document the safe correction pattern for
git-describe-like version strings when they exceed the maximum allowed
length.
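
For the automatic part, a rough Python sketch of flagging versions that
would need manual handling.  It is NOT the safe correction pattern asked
for above, and it says nothing about ordering.  The `<tag>+<count>.g<hash>`
mapping, the function names and the sample inputs are assumptions for
illustration only; the one thing taken from git itself is the
`<tag>-<count>-g<hash>` shape of `git describe` output.

```python
import re

MAX_VERSION_LEN = 30  # the limit under discussion in this thread

# `git describe` prints <tag>-<commits since tag>-g<abbreviated hash>
DESCRIBE_RE = re.compile(r"^v?(?P<tag>.+)-(?P<count>\d+)-g(?P<abbrev>[0-9a-f]+)$")


def debian_upstream_version(describe: str) -> str:
    """Map git-describe output to a hypothetical <tag>+<count>.g<hash> form."""
    m = DESCRIBE_RE.match(describe)
    if m is None:
        raise ValueError(f"not git-describe output: {describe!r}")
    return f"{m['tag']}+{m['count']}.g{m['abbrev']}"


def needs_manual_handling(version: str) -> bool:
    """True when the version exceeds the limit and must be fixed up by hand."""
    return len(version) > MAX_VERSION_LEN


for described in ("v1.0.4-14-g2414721", "v15.3.2rc3-1042-g2414721b194453f0"):
    version = debian_upstream_version(described)
    print(version, len(version), needs_manual_handling(version))
```

The first sample fits comfortably; the second, with a longer tag and a
16-digit abbreviation, goes over the limit and would be flagged.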

> I think the main difficulty is that you lack understanding and/or experience
> about the practical issues and use cases that can come up in DVCS development.
> You clearly lack the needed mathematical understanding to assess hash
> uniqueness properties too. Hopefully the people who end up setting Debian
> practices will be better informed.

I do think you misunderstood my point in the hash issue.  My point is not
that a full hash will not collide.  The point is that the full hash as seen
in a tree received from the upstream DVCS should not see collisions, because
the collision would have happened before the colliding object was visible to
anyone retrieving that tree (and would have aborted the operation that was
trying to add the colliding object/corrupt the repository/whatever).

There is no mathematical misunderstanding in that AFAIK (please explain if
there is one.  By private mail, if necessary).

There are two bad assumptions I made:

1. that the object with that hash is NOT dropped upstream at some point in
   the future and replaced by a different object with the same hash while
   you still keep the original, in which case you get a collision during
   fetch.

2. that you're not doing local merges and releasing from THAT, and have a
   local object that is clashing with some object from upstream (good luck,
   this is likely to be painful to work around).

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
