
Re: Automated copyright reviews using REUSE/SPDX as alternative to DEP-5



On Tue, Feb 8, 2022 at 8:45 PM Russ Allbery <rra@debian.org> wrote:
>
> I recommend thinking about how to generate an existing debian/copyright
> file and putting the SPDX-formatted one in a different location.  You're
> going to want to decouple the new work from any transition that requires
> other people change tools.  There are a lot of tools that make assumptions
> about debian/copyright, and trying to track them all down will be
> counterproductive for trying to make forward progress on exposing
> information in a more interoperable format.
>
> The way I see this, there are three different things that have been
> discussed on this thread:
>
> 1. Consuming upstream data in SPDX or REUSE format and automating the
>    generation of required Debian files using it.
>
> 2. Exposing SPDX data for our packages for downstream consumption.
>
> 3. Aligning the standard way to present Debian copyright information with
>    other standards.
>
> I can tell you from prior experience with DEP-5 that 3 is wildly
> controversial and will produce the most pushback.  There are a lot of
> packagers who flatly refuse to even use the DEP-5 format, and (pace
> Jonas's aspirations) I don't expect that to change any time soon.
>
> I think that's fine for what you're trying to accomplish, because I think
> 1 and 2 are where the biggest improvements can be found.  As long as your
> system for managing copyright and license information can spit out a DEP-5
> debian/copyright file (in its current or a minorly modified format, not
> with new files elsewhere that would have to be extracted from the
> package), you are then backward-compatible with the existing system and
> that takes 3 largely off the table (which is what you want).  Then you can
> demonstrate the advantages of the new system and people can choose to be
> convinced by those advantages or not, similar to DEP-5, and we can reach
> an organic consensus without anyone feeling like they're forced to change
> how they do things.

Thanks for this input, Russ! I think you're right: it will be easier
to output the DEP5 format in addition to SPDX at the beginning, and
see from there how it works. I would then install the source SPDX
document to /usr/share/doc/PACKAGE/copyright_spdx in addition to the
DEP5 file in the usual place.

I will write an SPDX -> DEP5 tool for this, which should be "fairly
trivial". Regarding concerns about the different file formats SPDX can
come in: for us only the tag:value format makes sense, so I don't want
to support other formats.
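
To give a rough idea of what such a converter could look like, here is
a minimal sketch, not the actual tool. It assumes single-line values
only (multi-line <text> blocks are not handled), uses the SPDX tags
FileName, LicenseConcluded and FileCopyrightText, and does no mapping
between SPDX license identifiers and DEP5 short names. The function
names parse_spdx_files and to_dep5 are made up for illustration.

```python
import re
from collections import defaultdict


def parse_spdx_files(text):
    """Collect per-file license and copyright data from SPDX tag:value text.

    Assumption: every value fits on one line; multi-line <text> blocks
    are not supported in this sketch.
    """
    files = []
    current = None
    for line in text.splitlines():
        tag, sep, value = line.partition(":")
        if not sep:
            continue
        tag, value = tag.strip(), value.strip()
        if tag == "FileName":
            name = value[2:] if value.startswith("./") else value
            current = {"name": name, "license": None, "copyright": []}
            files.append(current)
        elif current is None:
            continue  # document or package level tags before the first file
        elif tag == "LicenseConcluded":
            current["license"] = value
        elif tag == "FileCopyrightText":
            current["copyright"].append(re.sub(r"</?text>", "", value).strip())
    return files


def to_dep5(files):
    """Group files with identical license and copyright into DEP5 stanzas."""
    groups = defaultdict(list)
    for entry in files:
        key = (entry["license"], tuple(entry["copyright"]))
        groups[key].append(entry["name"])
    stanzas = []
    for (license_id, holders), names in groups.items():
        stanzas.append("\n".join([
            "Files: " + "\n ".join(sorted(names)),
            # "UNKNOWN" is a placeholder chosen for this sketch only.
            "Copyright: " + ("\n ".join(holders) or "UNKNOWN"),
            "License: " + (license_id or "UNKNOWN"),
        ]))
    return "\n\n".join(stanzas) + "\n"
```

Files that share the same license and copyright lines end up in one
Files stanza, which keeps the generated DEP5 output short. The header
stanza and the standalone License stanzas are left out here.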

On Tue, Feb 8, 2022 at 8:36 PM Jonas Smedegaard <jonas@jones.dk> wrote:
>
> For starters, the format adds one SHA1 hash per source file, right?

Yes, one checksum per file.

> Sure I can "just" ignore all FileChecksum: lines, but anyone working
> with XML will know that plaintext does not equal human-readable.

This comparison is a bit off: XML is only one possible representation.
The SPDX format I want to use is tag:value, just like DEP5, so in this
regard it is equally "human-readable". There is more cruft, but it
takes less than 5 minutes to understand where the per-file copyright
and license information is.

> > However, I also think the human-readable aspect is less important here
> > because it is an output format. What I mean with this is that the
> > information is already there in a human readable way: either via REUSE
> > or in the file headers directly. While it is theoretically possible to
> > write SPDX documents by hand, I would not treat them with the same
> > trust as one created by REUSE.
>
> Here you seem to assume that humans need not be involved in authoring
> the contents or at least that human-facing interfaces for smart tools
> exist and are expressive enough to cover all that is needed.
>
> That is quite an assumption, I dare say.

I think this is a misconception: I don't want people to write SPDX
documents by hand at all. IMHO for this scenario, DEP5 is still
superior (that's e.g. why REUSE can also use DEP5).

> Writing the debian/copyright file for ghostscript took quite some time.
> Singularity is imminent, I know, and I wouldn't mind machines taking
> over the task of classifying rights statements, when they are up to the
> task - but until then I will want to proof-read and intervene as needed.
> My own experience is that they are not yet there - you seem to claim
> they have already surpassed humans for this task...
>
> Can you show me (off list if too long for an attachment) what your new
> not-really-needing-manual-editing file for ghostscript looks like, so I
> can compare with my lesser trusted human-laboured product?

No, because if ghostscript doesn't have the information needed to
automatically generate an SPDX document, you shouldn't write one by
hand: use DEP5 instead.

What you can do is to put your DEP5 in .reuse/dep5 in the top-level
dir and run "reuse spdx" if you want to see how it looks.

> > Regarding reviews: I plan to write a SPDX-to-DEP5 converter anyway to
> > get a better feel for the spec. I will probably also write a copyright
> > review tool that will show you the copyright header of each file based
> > on DEP5 or SPDX information for validation / manual review. This will
> > make proof-reading copyright information much easier.
>
> Seems you are now talking not about a format, but a detection mechanism.

Exactly! This entire thing is not about format really, but detection
mechanisms. And the standard format (outside of Debian) for "detected"
upstream copyright information is SPDX, that's why I want to use it.

Regarding the review tool: being able to have the checksums from the
previous version makes it easy to only review the files where the
checksum changed. Cool, right?
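
As a rough sketch of that idea (not the planned tool), the following
compares the FileChecksum entries of two SPDX tag:value documents and
lists the files that are new or whose SHA1 changed. The function names
file_checksums and files_to_review are hypothetical, and only SHA1
checksums are considered.

```python
def file_checksums(spdx_text):
    """Map each FileName to its SHA1 digest from SPDX tag:value text."""
    checksums = {}
    name = None
    for line in spdx_text.splitlines():
        tag, sep, value = line.partition(":")
        if not sep:
            continue
        tag, value = tag.strip(), value.strip()
        if tag == "FileName":
            name = value
        elif tag == "FileChecksum" and name is not None:
            algorithm, _, digest = value.partition(":")
            if algorithm.strip() == "SHA1":
                checksums[name] = digest.strip().lower()
    return checksums


def files_to_review(old_spdx, new_spdx):
    """Return files that are new or whose checksum differs from the old version."""
    old = file_checksums(old_spdx)
    new = file_checksums(new_spdx)
    return sorted(name for name, digest in new.items() if old.get(name) != digest)
```

Files that disappeared between the two versions are not reported,
since there is nothing left to review for them.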

> So new format is at best "equally good" as current format, except that
> it outperforms current format by adding file hashes.
>
> That is probably a simplification. Ok, let's then use it as an example:
>
> You can add file hashes to debian/copyright files *today* - the standard
> permits unofficial fields, and we could then elevate certain fields to
> make them official in a later revision of the current format.
>
> Adding hashes would clutter the files, making them less readable, but in
> your argument that's a feature with no real drawback, so let's play
> along for now.

Yes, I agree that adding hashes to DEP5 makes it unreadable and utterly
annoying to maintain; that's why I don't want to add them. DEP5 is
designed to be written by humans, SPDX is not. That's why SPDX can add
hashes without any drawbacks.

> > I don't see the problem with machine parsers. We already use a lot of
> > different tools for our processes (git, dput, dpkg, debhelper,
> > lintian, uscan, a mail program, a text editor, ...), adding one more
> > shouldn't be a big deal. It needs to be provided of course, but I plan
> > to do that.
>
> Only 2 of those you list are mandatory: dpkg and RFC822 email - the rest
> are optional, some quite popular but even then routinely bypassed.

I mean if you want you can write SPDX files by hand, it's not a binary
format. Same as you can write a Debian package without debhelper.

> How do you know that SPDX already cover all the features we want?

What do we need? File-based copyright and license information. SPDX
offers that, and so does DEP5. In this regard, both specs have all the
things we _need_.

What would be nice to have? Something that allows us to do more
automation. SPDX includes file hashes, so it is very easy to check
whether an SPDX copyright document is valid for a given source tree
(ignoring mistakes in copyright and license assertions). For DEP5,
this is impossible without cluttering the spec with hashes.
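
A minimal sketch of such a check, under the assumption that the
checksums have already been extracted into a mapping from relative
path to SHA1 hex digest. The names sha1_of and check_tree are invented
for this example.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """Return the SHA1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_tree(checksums, source_root):
    """Compare recorded SHA1 digests against the files under source_root.

    checksums maps a relative path (with or without a leading "./") to
    the expected SHA1 hex digest. Returns a list of (path, problem)
    tuples; an empty list means the document matches the tree.
    """
    root = Path(source_root)
    expected = {Path(name).as_posix(): digest.lower()
                for name, digest in checksums.items()}
    problems = []
    for name, digest in sorted(expected.items()):
        path = root / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif sha1_of(path) != digest:
            problems.append((name, "changed"))
    # Files present in the tree but absent from the document.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            if relative not in expected:
                problems.append((relative, "unlisted"))
    return problems
```

This only tells you that the document is out of date with respect to
the files; whether the recorded copyright and license statements are
correct still needs a human or a separate detection step.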

> And if it does, then how is SPDX not a simple superset of current
> format, and therefore a simple matter of identifying and adding missing
> pieces?

Again, we could add them, but it would make DEP5 nearly impossible to
write by hand, something I don't want. I still have packages that I
wouldn't convert to SPDX (by hand) anytime soon, because upstream
doesn't provide the data needed to generate the document
automatically.

DEP5 could be used via reuse as an intermediate representation for
developers if no REUSE information is available, but let's ignore this
for now.

> I would be quite happy if our work on evolving debian/copyright would
> result in a future revision being identical to REUSE format.
>
> What I dislike is requiring all developers to master 3 formats instead
> of currently only two: freeform-human-only and (also-)machine-readable.

No, you don't have to master SPDX! That's the point: you don't
interact with it at all. It's created by tools, and shipped to satisfy
the legal obligation to provide copyright information. Users don't
care how the copyright information is shipped. As a developer, you
just have one less thing to care about, namely writing
debian/copyright by hand.

> Current format was designed to a) cover the existing needs of Debian,
> and b) not discourage too many developers from using it - to raise the
> likelihood of a future possibility that we fully embrace it as the one
> single format for us all to use.

I don't want to force people to use this new format if they don't want
to. I really don't care if others want to put a lot of work into
debian/copyright, but I want to use tools so that I (and others that
feel the same) don't have to handle it anymore. DEP5 is just not
designed for this automated use case, and that's totally fine. It's
good at what it does now, but extending it to an automated use case
would make it bad at what it was good at: being simple (and all the
points you mentioned).


Regards,
Stephan

