Re: Idea: frontend tool for more efficient license reviewing based on tree-structured IR



Quoting Mo Zhou (2019-12-27 02:56:07)
> I created an amount of NEW packages as a DD, and reviewed an amount of 
> NEW packages in the NEW queue as FTP trainee.

Great.  Also because your experience as FTP trainee sheds some light on 
what may actually aid ftpmaster processing (rather than guessing blindly 
from the outside).


> Existing tools, workflows; And limitations
> ----------------------------------------------------------
> 
> ## Tools
> 
> https://wiki.debian.org/CopyrightReviewTools
> 
> I'm unfamiliar with most of them. I'm only describing the two I'm 
> familiar with.  Both licensecheck (Jonas) and debmake (Osamu) do 
> template/regex matching.

Beware that debmake's pattern matching and debian/copyright file 
serialization are far inferior to those of licensecheck.

The long description of debmake claims that it "does more than what 
licensecheck(1) offers" but I am puzzled what that sentence means - a 
more polished experience (even if less accurate), perhaps?


>   I personally do `licensecheck -r --deb-machine . > debian/copyright`
>   and manually tweak the content.

Beware that licensecheck by default omits some files, and for the files 
it does check by default it inspects only the top of the file (and the 
bottom, but only if nothing was found at the top).

Maybe that is what you want as uploader.  Mentioning it in case you (or 
others following along here) want checking that is as complete as 
possible.


> ftp-master: possibly manually reviewing with MC + custom plugin
> 
>   I didn't follow the recommended way.

What is "the recommended way"?


> * Tree structure is always missing (and actually not possible to 
>   present) in debian/copyright.

Not sure what you mean above.  The format supports wildcards but that's 
optional - if you want you can write the path for each file explicitly, 
and it should be simple to write a tool that converts a copyright file 
with wildcards to one without wildcards.
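
A minimal sketch of such an expansion, in Python.  It assumes the Files 
paragraphs have already been parsed into an ordered list of (patterns, 
label) pairs and that the list of paths in the source tree is known; 
the function names are made up for illustration.  It relies on two 
rules of the copyright format: "*" and "?" also match slashes, and the 
last paragraph matching a file applies to it.

```python
import re

def pattern_to_regex(pattern):
    """Translate a Files pattern into a compiled regex.

    "*" matches any run of characters and "?" a single character,
    both including "/"; a backslash escapes the next character.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            i += 1
            out.append(re.escape(pattern[i]))
        elif ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out), re.DOTALL)

def expand_wildcards(paragraphs, paths):
    """Map each path to the label of the last paragraph matching it."""
    resolved = {}
    for patterns, label in paragraphs:  # later paragraphs override earlier
        regexes = [pattern_to_regex(p) for p in patterns]
        for path in paths:
            if any(rx.fullmatch(path) for rx in regexes):
                resolved[path] = label
    return resolved

# Hypothetical example data, not taken from any real package.
paragraphs = [
    (["*"], "GPL-2+"),
    (["debian/*"], "GPL-3+"),
    (["src/vendor/?lib/*.c"], "BSD-3-clause"),
]
paths = ["README", "debian/rules", "src/main.c", "src/vendor/zlib/inflate.c"]
for path, label in sorted(expand_wildcards(paragraphs, paths).items()):
    print(f"{path}: {label}")
```

Paths matched by no paragraph are simply absent from the result, which 
a real tool would probably want to report.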


> * Tree structure is always missing. after importing a new upstream 
>   release with significant directory layout change, it will be 
>   inconvenient to locate the parts of debian/copyright should be 
>   updated. Things will become more complex when new 
>   licenses/copyrights emerged.

The copyright file format permits additional fields.  If you want to 
track files moving around, I suggest adding (e.g. in the same script 
that expands wildcards, as mentioned above) the field FileChecksum as 
defined by SPDX: 
https://spdx.github.io/spdx-spec/4-file-information/#44-file-checksum
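
A rough sketch of how such checksums could help spot files that moved 
between two upstream releases, in Python.  The helper names and the 
approach of comparing two checksum maps are assumptions made for 
illustration, not something licensecheck provides; only the 
"FileChecksum: SHA1: ..." line follows the SPDX tag-value form.

```python
import hashlib
from pathlib import Path

def sha1_of(path):
    """Return the hex SHA1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_map(root):
    """Map each regular file below root (relative path) to its SHA1."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha1_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def find_moved(old, new):
    """Yield (old_path, new_path) for unchanged content at a new path."""
    by_sum = {}
    for path, checksum in new.items():
        by_sum.setdefault(checksum, []).append(path)
    for path, checksum in old.items():
        if path not in new:
            for candidate in by_sum.get(checksum, []):
                yield path, candidate

def checksum_field(checksum):
    """Format the extra field in SPDX tag-value style."""
    return f"FileChecksum: SHA1: {checksum}"
```

Calling find_moved(checksum_map(old_tree), checksum_map(new_tree)) on 
two unpacked releases lists files whose content is identical but whose 
path changed.  Files that were both moved and modified are not 
detected by this approach.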

> * licensecheck dumps garbage when it encounters a binary file, e.g. 
>   PNG image. This is not a BUG, as ftp-masters indeed checks the 
>   possible metadata in a binary file to make sure whether there is 
>   extra copyright/license info. But this is something needs to be 
>   improved...

See "data-miner" added to https://wiki.debian.org/CopyrightReviewTools 
earlier today.



> The core of my idea is a tree-structured intermediate representation 
> (IR) for the "license reviewing tree". The IR is basically a directory 
> tree with annotations on the file nodes. The IR can be stored as a, 
> say, JSON file.

It should be easy to extend licensecheck to support a JSON output of 
each file.

I'd be happy to do that, as soon as there is some rough consensus on the 
format of such output.


> To build such an tree-shaped IR, we need a couple of "backend" tools for
> checking the copyright & license info for a SINGLE file. Such "backend" includes
> but not limited to:
> 
>  * `licensecheck`. Given a file FILE, `licensecheck FILE` produces the license
>     name.

I disagree with "the license name" above - it is not that simple: One 
file can be covered by multiple licenses - OR'ed or AND'ed or 
uncertain-how-they-relate or uncertain-what-they-are or 
uncertain-if-none-found-means-none-there or certainly-none-there.

And speaking of uncertainty, several steps in the parsing of 
human-written comments involve different kinds of uncertainty, and for 
some use-cases of licensecheck it is sensible to err on one side or the 
other.

Example: For a project containing a mixture of BSD-2 and BSD-3 files 
which has historically contained BSD-4 files, it is important to err on 
the side of BSD-4 whenever there is doubt, whereas for a project written 
from scratch in recent times it might make sense to err on the side of 
"they probably meant BSD-3" unless certain that it is BSD-4.

...or some would argue that it is never sensible to err on either side, 
but instead whenever a fuzzy parser is uncertain it must flag the 
uncertainty and hand it over to human inspection.

Another example: A copyright statement for the year 198 might in one 
context "obviously" be a typo for 1998, in another context most likely 
be not-a-copyright-statement-at-all, and in a third context be better 
handed over to human inspection.

To summarize, I'd say that as a minimum it needs to provide a license 
_expression_ and a certainty expression.

For license expression, see 
https://spdx.github.io/spdx-spec/4-file-information/#45-concluded-license 
and 
https://spdx.github.io/spdx-spec/appendix-IV-SPDX-license-expressions/

For certainty expression I guess we need to evolve some semantics which 
covers the tools we use.
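
One possible shape for such a per-file record, sketched in Python.  The 
certainty levels and the field names are placeholders invented for 
illustration, since no semantics have been agreed on; only the license 
expression syntax and the NOASSERTION value come from SPDX.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum

class Certainty(str, Enum):
    # Hypothetical levels; the real semantics still need to be agreed on.
    CONFIRMED = "confirmed"    # verified by a human
    MATCHED = "matched"        # a license grant or text was matched
    GUESSED = "guessed"        # fuzzy or keyword-only match
    NONE_FOUND = "none-found"  # nothing found; may or may not mean none

@dataclass
class FileFinding:
    path: str
    license_expression: str    # SPDX-style license expression
    certainty: Certainty
    generator: str             # tool that produced the finding

def needs_human(finding):
    """Flag anything that is not a confident match for manual review."""
    return finding.certainty in (Certainty.GUESSED, Certainty.NONE_FOUND)

# Hypothetical findings, not real tool output.
findings = [
    FileFinding("src/a.c", "GPL-2.0-or-later OR MIT",
                Certainty.MATCHED, "licensecheck"),
    FileFinding("src/b.c", "BSD-3-Clause",
                Certainty.GUESSED, "licensecheck"),
    FileFinding("img/logo.png", "NOASSERTION",
                Certainty.NONE_FOUND, "licensecheck"),
]
print(json.dumps([asdict(f) for f in findings], indent=2))
print([f.path for f in findings if needs_human(f)])
```

Keeping the certainty separate from the license expression means a 
project-specific policy (err towards BSD-4, err towards BSD-3, or 
always ask a human) can be applied afterwards without re-parsing.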


>  * `grep` or `ripgrep`. For example, `rg -i copyright FILE` always 
>    works well.

Matching purely by keywords should be tracked as an uncertainty.


>  * "neighbor". For example, given a source file "F/I/L/E" without any 
>     copyright & license info, looking for F/I/L/LICENSE, F/I/LICENSE, 
>     ..., etc like git does for the ".git" directory will help.

Looking outwards will not help to find _file_ license, only to find 
_package_ license (by considering some subset of the Debian package a 
virtual "package" in the SPDX terms).  Unless you also parse that 
LICENSE file to identify which exact files it covers.

Also, expanding outwards should be tracked as an uncertainty.
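
A small Python sketch of such an outward search that also records how 
far it had to expand, so that the distance can be tracked as part of 
the uncertainty.  The list of candidate file names and the function 
name are assumptions made for illustration, and the result is only a 
hint about a surrounding license, not a per-file finding.

```python
from pathlib import Path

LICENSE_NAMES = ("LICENSE", "LICENCE", "COPYING")  # assumed candidates

def nearest_license(file_path, root):
    """Walk outwards from file_path towards root looking for a license file.

    Returns (license_path, levels_up), or None if nothing was found.
    levels_up is 0 for a license file in the same directory as file_path.
    """
    root = Path(root).resolve()
    directory = Path(file_path).resolve().parent
    levels_up = 0
    while True:
        for name in LICENSE_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate, levels_up
        # Stop at the source root, or at the filesystem root as a guard.
        if directory == root or directory == directory.parent:
            return None
        directory = directory.parent
        levels_up += 1
```

For a file F/I/L/E with the only license file at F/LICENSE, this 
returns that path together with a distance of 2.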


> The formated+filtered output of any combination of these backends can 
> be attached to the corresponding IR.
> 
> In contrast, a "frontend" tool is also needed for dealing with such IR 
> in a higher level. My imagined "frontend" tool is a `ranger`-like file 
> browser with specific designs.
> 
>  * the user can choose what backend(s) to use. If none is chosen, the 
>    frontend tool falls back into a general file browser with a preview 
>    panel.

Such flexibility makes good sense to me, and echoes what we agreed among 
license tool developers would be nice to have when we met briefly at 
Debconf in Montreal.

I suggest introducing a new field "Generator" to indicate which tool 
was used to resolve each Files section in the copyright file - where 
omitting that field implies that the section was done (either from 
scratch or verified) by a human.
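
As an illustration, a Python sketch that renders a Files section with 
such a field.  The function name and the example values are made up; 
"Generator" is the unofficial extra field suggested above and is left 
out for sections handled by a human.

```python
def files_paragraph(patterns, copyright_lines, license_name, generator=None):
    """Render one Files paragraph; omit Generator for human-reviewed ones."""
    lines = ["Files: " + patterns[0]]
    lines += [" " + p for p in patterns[1:]]        # continuation lines
    lines.append("Copyright: " + copyright_lines[0])
    lines += [" " + c for c in copyright_lines[1:]]
    lines.append("License: " + license_name)
    if generator is not None:
        lines.append("Generator: " + generator)     # unofficial extra field
    return "\n".join(lines) + "\n"

print(files_paragraph(["src/*", "lib/*"], ["2019 Jane Doe"], "Expat",
                      generator="licensecheck"))
```

A frontend could drop the field once the user has accepted or 
overridden the suggestion for that section.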


>  * the frontend invokes various backend to generate a template IR, and
>    store it to debian/copyright.json. No wildcard or regex in file path
>    is allowed in this file.

I dislike introducing yet another file in the source package¹ - please 
let's store such notices in the debian/copyright file itself, as 
unofficial extra fields.

¹ I also dislike my own current practice adding debian/copyright_hints 
and do *not* recommend that as a general approach.


>  * when viewing files, the suggestions from various backends are 
>    shown. the user could choose to accept of override the suggestion. 
>    These choices will also be recorded in the json file. Of course, 
>    when various backends do not agree with each other, the user has to 
>    override the suggestion, and manually annotate the node.
>  * when finished reviewing/annotating the whole directory tree, the 
>    frontend will translate the IR (d/copyright.json) into 
>    machine-readable format. (d/copyright)

Again, I would prefer that even intermediary notes are stored in 
debian/copyright file, as additional fields, rather than a separate file 
in yet another format.


> How to proceed
> --------------
> 
> * a group of interested contributors.

For making a UI frontend, I guess it is more about finding people 
interested in UI design and skilled in Python, and less about interest 
in license checking.  I suggest making a wiki page about this draft 
idea, and advertising it to the various Python packaging teams.


> * GSoC / Outreachy sounds good.

Perhaps some existing project matches closely enough to instead expand: 
https://wiki.spdx.org/view/GSOC/GSOC_ProjectIdeas


> Several months ago I've already started a python script based on this 
> idea. I'm struggling with UI programming (I'm really not good in this 
> area). Specifically, when I found myself stuck at adding custom 
> keybinding under the urwid framework, I postponed the idea 
> indefinitely.

Sorry, I cannot help write a UI frontend in Python.

I might be intrigued to try to put together a competing frontend in 
Perl, but I have too much on my plate already, so likely wouldn't make 
enough time for that.


 - Jonas

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private
