Re: Preferred form of modification for binary data used in unit testing?

To: debian-devel@lists.debian.org
Subject: Re: Preferred form of modification for binary data used in unit testing?
From: smcv@debian.org
Date: Fri, 17 Jul 2020 22:50:18 +0100
Message-id: <20200717215018.GA577767@espresso.pseudorandom.co.uk>
In-reply-to: <20200717144424.fsrnromddsfyh5nr@basil.wdw>
References: <a39eaca0-ab6d-f35c-b5b9-b5d01b280f92@pmhahn.de> <87mu3zihzz.fsf@iris.silentflame.com> <20200716174209.d7zu7uhv6mwzor4q@shell.thinkmo.de> <87zh7zgfh8.fsf@iris.silentflame.com> <87r1tbgf43.fsf@iris.silentflame.com> <20200717144424.fsrnromddsfyh5nr@basil.wdw>

On Fri, 17 Jul 2020 at 10:44:24 -0400, Marvin Renich wrote:
> I think, instead of pedantically applying the wording of the DFSG, we
> should be pedantically applying the intended purpose of the DFSG.

I think this is a good way to frame questions about the DFSG, and
particularly the requirement for source code. The DFSG is a set of
guidelines, not a deterministic algorithm for mapping inputs to their
freedom status, and the reasons why we want source code are important.

Also note that "preferred form for modification" does not appear
anywhere in the DFSG: that wording is specific to the *GPL family of
licenses. However, we often find it a useful tool for interpreting and
applying the DFSG, because the DFSG and the *GPL licenses are trying to
achieve the same or similar goals, so what's good for one is often good
for the other.

(We do need to be a bit more careful with preferred forms for modification
when we are assessing whether a work under a *GPL license is compliant
or non-compliant with that license, because that's about whether we are
behaving in a way that is legally allowed, not just about whether we
are following our own self-imposed guidelines.)

> The intended purpose is to ensure that the recipient has every
> reasonable opportunity to modify the software in any reasonable way the
> recipient desires.  The sole purpose of the requirement for source is to
> protect this freedom, and the requirement should not be applied
> independently from this purpose.

I mostly agree, and I do agree with the resulting conclusion, but I
don't think this is *quite* the whole story. What you said here maps
to the FSF's "Freedom 3" and half of "Freedom 1", and also matches the
justification given for the source code requirement in the annotated
Open Source Definition.

As with the *GPL licenses, the FSF's four freedoms and Free Software
definition and the OSI's Open Source Definition are not part of the DFSG,
but they can be useful tools for interpreting and applying the DFSG,
because we're trying to achieve the same or similar goals, so what's
desirable for them is probably also desirable for us.

In addition to freedom to modify, I think we also want to make sure a
sufficiently knowledgeable recipient can inspect the unmodified software;
that's the other half of the FSF's "Freedom 1" (freedom to study).
However, I don't think considering freedom-to-study actually changes
the conclusion in this case.

For a generated or hand-crafted binary blob that is used to reproduce
a specific bug or test a particular error-recovery path, inspecting
it would tend to consist of noting that it resembles a keepassx vault
(or whatever the binary blob is in this case); that, as intended, it has
one of the required patterns that reproduces that bug or triggers that
error-recovery; and that it doesn't have lots of unexplained content
that is not required for its purpose. Confirming that this is the case
might require a specialized program (keepassx or whatever), a hex-editor,
or even single-stepping in a debugger; I don't see that as a problem,
and I certainly wouldn't expect maintainers to do that work proactively
(other than checking that it isn't excessively large and isn't obviously
non-Free).
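
As a rough illustration of the lightweight check described above (size,
plus a quick look at the leading bytes), here is a minimal Python sketch.
The function name, the 64 KiB threshold and the returned fields are
assumptions made up for this example; none of them come from Lintian,
keepassxc or any Debian tooling.

```python
def summarize_blob(data, max_size=64 * 1024, dump_len=32):
    """Summarize a binary test fixture for a quick manual review.

    The default size limit is an arbitrary illustrative value, not a
    Debian policy number.
    """
    return {
        "size": len(data),
        # Flag fixtures that are larger than the reviewer expects.
        "too_large": len(data) > max_size,
        # Very few distinct byte values suggests padding or filler.
        "distinct_bytes": len(set(data)),
        # Space-separated hex of the first bytes (needs Python >= 3.8).
        "head_hex": data[:dump_len].hex(" "),
    }
```

This only gives a reviewer a starting point; deciding whether the content
is explained by the test's purpose still needs a human, and possibly the
specialized tools mentioned above.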

Note that I'm not saying that it would be OK for test data to contain
copyrightable works that are not freely licensed or have undergone a
lossy transformation from a source form. For example, test data for a
tar implementation shouldn't be a tar file containing object code that
was compiled from C source, without that source also being included;
it would usually be better to use a tar file containing some zeroes,
or some random numbers, or something that meets whatever other
requirements the test has (for example size or level of compressibility)
while being Freely licensed and obviously its own preferred form for
modification.
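
A tar fixture of the kind suggested above could be produced along these
lines. This is only a sketch using Python's standard tarfile module; the
member names, sizes and seed are hypothetical, and a real test would pick
whatever size or compressibility it actually needs.

```python
import io
import random
import tarfile


def build_fixture(zero_len=4096, random_len=4096, seed=0):
    """Return the bytes of an uncompressed tar archive with two members.

    Member names, sizes and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed, so the content is repeatable
    noise = bytes(rng.getrandbits(8) for _ in range(random_len))
    members = {
        "zeroes.bin": bytes(zero_len),  # highly compressible
        "random.bin": noise,            # poorly compressible
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            info.mtime = 0  # fixed timestamp keeps the output stable
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

Because the content is either zeroes or seeded pseudo-random bytes, the
resulting archive carries no copyrightable work of its own, and
regenerating or changing it is a matter of editing a few parameters.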

More generally, it's best if test data is either so trivial that
questions of copyright and preferred forms are somewhat irrelevant,
or is clearly Free.

As an example of trivial test data, the pre-generated valid and invalid
D-Bus messages in the GLib test suite consist of just enough of a message
to make them suitable for the test in question, with the parts that are
not fixed by the test's requirements taking short non-meaningful values
like /foo.

As an example of non-trivial Free test data, the rgain3 source package
needs non-trivial sound files with known/fixed content in a supported
format for its autopkgtest, so I included some short sound clips taken
from sound-theme-freedesktop (which are compressed, but would be easy to
modify by decompressing, editing and re-compressing, and do not appear
to have a separate lossless source form available).

On Wed, 15 Jul 2020 at 09:45:18 +0200, Philipp Hahn wrote:
> PS: This question is motivated while working on a private build of
> > E: keepassxc source: source-is-missing tests/data/keepassxc.opvault/default

Lintian cannot judge context or intent, and most Lintian checks are
imperfect heuristics. It's a tool to improve our software, not something
that should be obeyed unquestioningly.

I think it would be appropriate to override this with a comment that
documents that this is effectively its own source. For example:

# Test data for recovery from upstream bug 1234, manually generated
# using an older version of keepassxc where bug 1234 was not fixed
keepassxc source: source-is-missing tests/data/keepassxc.opvault/default

> * Should I include a copy of the *broken code* to generate that data?
[or]
> * Include instructions on how to re-build the broken version and give
> instructions on how to maybe rebuild a similar broken file.

I don't think these are required, and I don't think they should be
required. Our source packages aren't required to contain everything
that could conceivably be useful when modifying that piece of software,
and in fact they are usually required *not* to.

The complete upstream and/or downstream revision history of the
software would certainly be useful for making modifications; but if
it were included, the package would be excessively large, and it
would take a prohibitively large amount of effort to review it all for
DFSG-compliance. My understanding is that this is a significant reason why
"3.0 (git)"-format source packages aren't accepted in Debian: reviewing
the legality and DFSG-compliance of every commit in a non-trivial
package's history would put an unreasonable burden on the maintainer
and the ftp team.

Similarly, the contents of the package's upstream and downstream bug
tracking and patch review systems (bug reports, comments, old versions
of patches undergoing review, etc.) are often useful when understanding
why its code is the way it is and making correct modifications; but
these systems are typically too large to be convenient to include in
the package, and the bug reports and comments are rarely submitted under
Free Software licenses (or even under a license that would be considered
sufficiently clearly distributable to be allowed in the non-free archive
area), so they cannot be included as part of Debian even if we wanted to.

    smcv
