
Re: Automated copyright reviews using REUSE/SPDX as alternative to DEP-5



Hi Stephan,

Quoting Stephan Lachnit (2022-02-08 18:53:22)
> On Tue, Feb 8, 2022 at 4:39 PM Jonas Smedegaard <jonas@jones.dk> 
> wrote:
> >
> > I am sceptical towards this proposal.
> >
> > An important feature to me with current machine-readable format is 
> > that really it is machine-and-human-readable.
> 
> Thank you for your input! I'm aware of this concern, however I think
> it is not something that can't be solved.
> 
> For one, while not as trivial to understand as the current machine-readable 
> copyright, it's still "human-readable" (i.e. a tag:value style text 
> file). I would do the following comparison: if you only know Python 
> (DEP-5), C++ (SPDX) might look a bit weird, but you can get the gist 
> of it.

For starters, the format adds one SHA1 hash per source file, right?

Sure I can "just" ignore all FileChecksum: lines, but anyone working 
with XML will know that plaintext does not equal human-readable.


> However, I also think the human-readable aspect is less important here 
> because it is an output format. What I mean with this is that the 
> information is already there in a human readable way: either via REUSE 
> or in the file headers directly. While it is theoretically possible to 
> write SPDX documents by hand, I would not treat them with the same 
> trust as one created by REUSE.

Here you seem to assume that humans need not be involved in authoring 
the contents or at least that human-facing interfaces for smart tools 
exist and are expressive enough to cover all that is needed.

That is quite an assumption, I dare say.

Writing the debian/copyright file for ghostscript took quite some time.  
Singularity is imminent, I know, and I wouldn't mind machines taking 
over the task of classifying rights statements, when they are up to the 
task - but until then I will want to proof-read and intervene as needed.  
My own experience is that they are not yet there - you seem to claim 
they have already surpassed humans for this task...

Can you show me (off list if too long for an attachment) what your new 
not-really-needing-manual-editing file for ghostscript looks like, so I 
can compare with my lesser trusted human-laboured product?


> > Another important feature to me is that there is only one format (in 
> > addition to unformatted content, which hopefully we can put past us 
> > at some point).
> >
> > Today, I can as DD help proof-read and change *any* package in 
> > Debian.
> 
> Regarding reviews: I plan to write a SPDX-to-DEP5 converter anyway to 
> get a better feel for the spec. I will probably also write a copyright 
> review tool that will show you the copyright header of each file based 
> on DEP5 or SPDX information for validation / manual review. This will 
> make proof-reading copyright information much easier.

Seems you are now talking not about a format, but a detection mechanism.


> But to stress this again: the goal is to *replace* the manual 
> copyright reviews by something much better: automatic copyright 
> reviews.

Great.  But that is orthogonal to switching format: detection tools can 
serialize their findings in the current machine-readable format, either 
by themselves or, for tools that can only output REUSE format *AND* if 
that output fully covers Debian needs, by having other tools reformat it 
to the current format.

My point here is not that there is no benefit in using REUSE.  My point 
is that detecting rights information is orthogonal to serializing it.
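
To illustrate, here is a minimal Python sketch.  The names Finding, 
to_dep5 and to_spdx are made up for this mail, the sample data is 
invented, and both outputs are simplified fragments rather than complete 
valid documents in either format.  The same detected findings feed both 
serializers:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One detected rights statement (hypothetical structure)."""
    path: str
    copyright: str
    license: str  # short name assumed usable in both formats, e.g. Apache-2.0


def to_dep5(findings):
    """Serialize findings as simplified Files paragraphs of the current format."""
    return "\n\n".join(
        f"Files: {f.path}\nCopyright: {f.copyright}\nLicense: {f.license}"
        for f in findings
    )


def to_spdx(findings):
    """Serialize the same findings as simplified SPDX tag:value file entries."""
    return "\n\n".join(
        f"FileName: ./{f.path}\n"
        f"LicenseInfoInFile: {f.license}\n"
        f"FileCopyrightText: <text>{f.copyright}</text>"
        for f in findings
    )


# Invented sample data, standing in for the output of any detection tool.
findings = [Finding("src/foo.c", "2020 Jane Doe", "Apache-2.0")]
print(to_dep5(findings))
print()
print(to_spdx(findings))
```

The detection step (however it is done) only has to produce the list of 
findings; which serializer runs afterwards is a separate choice.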


> There are three areas of interest for copyright information:
> a) for developers writing it b) for the user receiving it and c) the
> legal side.
> 
> Regarding a: By hand DEP5 is better, but for automation SPDX is equally good.
> Regarding b: I think they don't care anyway. Like which user reads the
> debian/copyright really? If at all, you are interested in the
> copyright of a certain library you wish to use, but this doesn't
> require the extensive file-by-file information of DEP5. Most likely
> the documentation provides much clearer information.
> Regarding c: SPDX is as good as DEP5 if not even better due to file hashes.

So new format is at best "equally good" as current format, except that 
it outperforms current format by adding file hashes.

That is probably a simplification. Ok, let's then use it as an example:

You can add file hashes to debian/copyright files *today* - the standard 
permits unofficial fields, and we could then elevate certain fields to 
make them official in a later revision of the current format.

Adding hashes would clutter the files, making them less readable, but in 
your argument that's a feature with no real drawback, so let's play 
along for now.
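
As a rough sketch of what that could look like, assuming a made-up 
unofficial field name "Checksums-Sha1" (not part of the current format) 
and invented sample data, a small Python helper could emit a Files 
paragraph with one SHA1 line per file:

```python
import hashlib


def files_paragraph(files, copyright, license):
    """Build a Files paragraph plus a hypothetical checksum field.

    files maps each path to its content as bytes, so the sketch runs
    without touching the file system.
    """
    paths = sorted(files)
    lines = [
        "Files: " + " ".join(paths),
        f"Copyright: {copyright}",
        f"License: {license}",
        "Checksums-Sha1:",  # hypothetical unofficial field, one line per file
    ]
    for path in paths:
        lines.append(f" {hashlib.sha1(files[path]).hexdigest()} {path}")
    return "\n".join(lines)


# Invented sample: a single empty file.
demo = {"src/empty.c": b""}
print(files_paragraph(demo, "2020 Jane Doe", "Apache-2.0"))
```

Even this tiny example shows the clutter: every covered file adds a 
40-character hash line to the paragraph.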

Any feature improvements that cannot be an evolution of current format?


> > If we permit a debian/copyright format that is not human-readable, 
> > it means that I cannot confidently proof-read and change the 
> > contents of the debian subdir without the help of machine-parsers, 
> > and I would need to know two formats with different goals.
> 
> I don't see the problem with machine parsers. We already use a lot of
> different tools for our processes (git, dput, dpkg, debhelper,
> lintian, uscan, a mail program, a text editor, ...), adding one more
> shouldn't be a big deal. It needs to be provided of course, but I plan
> to do that.

Only 2 of those you list are mandatory: dpkg and RFC822 email - the rest 
are optional, some quite popular but even then routinely bypassed.


> > I would like to instead welcome the REUSE developers in helping 
> > Debian evolve next version of the existing machine-readable format 
> > to better align with SPDX.
> 
> While this would be nice, I think this is just unrealistic. While I 
> may implement DEP5 output to REUSE, I still want to use SPDX because 
> it is already an existing industry standard having all the "features" 
> we want. Adding things like file hashes and referencing / merging 
> other DEP5 documents is certainly possible, it would make the format 
> less readable and in the end just SPDX looking differently.

How do you know that SPDX already covers all the features we want?

And if it does, then how is SPDX not a simple superset of current 
format, and therefore a simple matter of identifying and adding missing 
pieces?

I would be quite happy if our work on evolving debian/copyright would 
result in a future revision being identical to REUSE format.

What I dislike is requiring all developers to master 3 formats instead 
of currently only two: freeform-human-only and (also-)machine-readable.

Current format was designed to a) cover the existing needs of Debian, 
and b) not discourage too many developers from using it - to raise the 
likelihood of a future possibility that we fully embrace it as the one 
single format for us all to use.

As for a) it might turn out that Debian is not special - i.e. that all 
our needs are fully covered by the industry standard that sprung up 
inspired by our earlier work.  That would be great.  Let's explore if 
that is a fact.  I suggest exploring that by taking our existing 
format and morphing it step by step, checking at each step if we lose 
something and if so if that is acceptable.

As for b) I highly doubt that those insisting on writing their copyright 
files by hand would embrace a lesser-human-readable format instead. 
Please note that those that would happily embrace *any* format which 
would relieve them of doing work themselves do not count here - they 
already use "licensecheck --deb-machine *" or one of the wrappers for 
that command.


 - Jonas

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private
