Re: Idea: frontend tool for more efficient license reviewing based on tree-structured IR
On Fri, Dec 27, 2019 at 04:54:32PM +0100, Jonas Smedegaard wrote:
> Long description of debmake claims it "does more than what 
> licensecheck(1) offers" but I am puzzled what that sentence means - more 
> polished experience (even if less accurate), perhaps?
 
IIRC it appends the license texts to the generated d/copyright, such as
License: Expat
 <license content>
 
> >   I personally do `licensecheck -r --deb-machine . > debian/copyright`
> >   and manually tweak the content.
> 
> Beware that licensecheck by default omits some files, and for the files 
> it does check by default it inspects only the top of the file (and the 
> bottom, but only if nothing was found at the top).
I'm aware of that. Basically every template generated in this way will
be significantly reorganized before uploading.
 
> > ftp-master: possibly manually reviewing with MC + custom plugin
> > 
> >   I didn't follow the recommended way.
> 
> What is "the recommended way"?
 
IIRC using midnight-commander (MC) + custom plugin, according to an
introductory document written by the ftp team.
 
> Not sure what you mean above.  The format supports wildcards but that's 
> optional - if you want you can write path for each file explicitly, and 
> it should be simple to write a tool that converts a copyright file with 
> wildcards to one without wildcards.
 
I'm preparing some illustrative examples and more design details.
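
A minimal sketch of the wildcard-expansion tool suggested in the quoted
text, not part of any existing tool. It assumes the matching rules of the
machine-readable copyright format: `*` matches any run of characters
including `/`, `?` matches one character, and the last matching Files
paragraph wins. Backslash escapes are ignored here, and the names
`pattern_to_regex` and `expand_stanzas` are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a Files pattern into a compiled regex.

    '*' matches any run of characters (including '/'),
    '?' matches exactly one character; everything else is literal.
    Backslash escapes are NOT handled in this sketch.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.DOTALL)


def expand_stanzas(stanzas, files):
    """Resolve every path to the license of the last stanza matching it.

    stanzas: list of (patterns, license) tuples in file order.
    files:   explicit list of paths relative to the source root.
    """
    resolved = {}
    for patterns, license_name in stanzas:
        regexes = [pattern_to_regex(p) for p in patterns]
        for path in files:
            if any(r.match(path) for r in regexes):
                resolved[path] = license_name  # later stanzas override
    return resolved


# Illustrative data only.
stanzas = [
    (["*"], "GPL-2+"),
    (["src/vendor/*"], "Expat"),
    (["debian/*"], "GPL-3+"),
]
files = ["README", "src/main.c", "src/vendor/lib/x.c", "debian/rules"]

for path, lic in sorted(expand_stanzas(stanzas, files).items()):
    print(f"Files: {path}\nLicense: {lic}\n")
```

Files matched by no paragraph are simply absent from the result, so a
caller can list them for manual review.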
 
> copyright file format permits additional fields.  If you want to track 
> files moving around, I suggest adding (e.g. in same script expanding 
> wildcards as mentioned above) field FileChecksum as defined by SPDX: 
> https://spdx.github.io/spdx-spec/4-file-information/#44-file-checksum
Does not sound like what I wanted.
 
> >  * `licensecheck`. Given a file FILE, `licensecheck FILE` produces the license
> >     name.
> 
> I disagree with "the license name" above - it is not that simple: One 
> file can be covered by multiple licenses - OR'ed or AND'ed or 
> uncertain-how-they-relate or uncertain-what-they-are or 
> uncertain-if-none-found-means-none-there or certainly-none-there.
This will be fixed in the refined design details.
You may have noticed that my intent in the original post was to introduce
some core (and coarse) concepts and ideas. It contained some inaccurate
details and even discrepancies, but I did not want to distract readers
by dwelling on such minor details.
 
> And speaking of uncertainty, several steps in the parsing of 
> human-written comments can contain different kinds of uncertainty, which 
> for some use-cases of licensecheck is sensible to err on either side.
>
> Example: A project containing a mixture of BSD-2 and BSD-3 files which 
> has historically contained BSD-4 files is important to err on the side 
> of BSD-4 whenever there is doubt, whereas a project written from scratch 
> in recent times might make sense to err on the side of "they probably 
> meant BSD-3" unless certain that it is BSD-4.
> 
> ...or some would argue that it is never sensible to err on either side, 
> but instead whenever a fuzzy parser is uncertain it must flag the 
> uncertainty and hand it over to human inspection.
> 
> Another example: Copyright statement for the year 198 might in one 
> context "obviously" be a typo for 1998, in another context most likely be 
> not-a-copyright-statement-at-all, and in a third context better handed 
> over to human inspection.
> 
> To summarize, I'd say that as a minimum it needs to provide a license 
> _expression_ and a certainty expression.
> 
> For license expression, see 
> https://spdx.github.io/spdx-spec/4-file-information/#45-concluded-license 
> and 
> https://spdx.github.io/spdx-spec/appendix-IV-SPDX-license-expressions/
Undoubtedly this will be adopted.
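
A minimal sketch of what a result carrying both a license expression and
a certainty could look like. The `Certainty` levels, the `Finding` fields
and the backend names are assumptions made for illustration; they come
neither from licensecheck nor from the SPDX specification.

```python
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    """Hypothetical certainty levels; not defined by SPDX or licensecheck."""
    CONFIRMED = "confirmed"  # verified by a human
    DETECTED = "detected"    # matched by a license parser
    GUESSED = "guessed"      # keyword match, or inherited from a neighbor
    UNKNOWN = "unknown"      # nothing found; does not imply "no license"


@dataclass
class Finding:
    path: str
    expression: str      # SPDX-style license expression
    certainty: Certainty
    source: str          # which backend produced it (hypothetical names)


def needs_review(finding):
    """Everything not confirmed by a human goes to manual inspection."""
    return finding.certainty is not Certainty.CONFIRMED


findings = [
    Finding("src/a.c", "BSD-2-Clause OR BSD-3-Clause",
            Certainty.DETECTED, "licensecheck"),
    Finding("src/b.c", "MIT", Certainty.CONFIRMED, "human"),
    Finding("src/c.c", "NOASSERTION", Certainty.UNKNOWN, "grep"),
]

for f in findings:
    flag = "REVIEW" if needs_review(f) else "ok"
    print(f"{flag:6} {f.path}: {f.expression} ({f.certainty.value})")
```

Keeping the expression as an opaque string leaves the AND/OR handling to
whatever SPDX expression parser is eventually chosen.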
 
> >  * `grep` or `ripgrep`. For example, `rg -i copyright FILE` always 
> >    works well.
> 
> Matching purely by keywords should be tracked as an uncertainty.
 
The way I intend to use `grep` output is not what you assumed.
Please look forward to my update.
 
> >  * "neighbor". For example, given a source file "F/I/L/E" without any 
> >     copyright & license info, looking for F/I/L/LICENSE, F/I/LICENSE, 
> >     ..., etc like git does for the ".git" directory will help.
> 
> Looking outwards will not help to find _file_ license, only to find 
> _package_ license (by considering some subset of the Debian package a 
> virtual "package" in the SPDX terms).  Unless you also parse that 
> LICENSE file to identify which exact files it covers.
> 
> Also, expanding outwards should be tracked as an uncertainty.
 
Ditto.
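
A minimal sketch of the "neighbor" lookup discussed above, assuming a
fixed list of candidate file names (`LICENSE_NAMES` is an assumption, as
is the function name). It returns how many levels it had to climb, so
that the caller can record the result as an uncertain, package-level hint
and not as the file's own license.

```python
from pathlib import Path

# Assumed candidate names; a real tool would likely make this configurable.
LICENSE_NAMES = ("LICENSE", "LICENSE.txt", "COPYING")


def find_neighbor_license(source_file, root):
    """Walk from the file's directory up to `root` looking for a license file.

    Returns (path_to_license_file, levels_climbed), or None if nothing
    is found at or below `root`.
    """
    root = Path(root).resolve()
    directory = Path(source_file).resolve().parent
    levels = 0
    while True:
        for name in LICENSE_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate, levels
        # Stop at the source root (or at the filesystem root as a safeguard).
        if directory == root or directory == directory.parent:
            return None
        directory = directory.parent
        levels += 1
```

For a file "F/I/L/E" this checks F/I/L, then F/I, then F, then the root
itself. A hit only says that a license file sits nearby; which files it
covers still has to be decided by a human or by parsing that file.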
 
> > The formatted+filtered output of any combination of these backends can 
> > be attached to the corresponding IR.
> > 
> > In contrast, a "frontend" tool is also needed for dealing with such IR 
> > in a higher level. My imagined "frontend" tool is a `ranger`-like file 
> > browser with specific designs.
> > 
> >  * the user can choose what backend(s) to use. If none is chosen, the 
> >    frontend tool falls back into a general file browser with a preview 
> >    panel.
> 
> Such flexibility makes good sense to me, and echoes what we agreed among 
> license tool developers would be nice to have when we met briefly at 
> Debconf in Montreal.
Great :)
 
> I suggest to introduce new field "Generator" to indicate which tool was 
> used to resolve each Files section in copyright file - where omitting 
> that field implies that section was done (either from scratch or 
> verified) by a human.
I'll leave this to fellow developers. Any change to the d/copyright
specification looks good to me as long as the others agree.
 
> >  * the frontend invokes various backends to generate a template IR, and
> >    stores it in debian/copyright.json. No wildcard or regex in file path
> >    is allowed in this file.
> 
> I dislike introducing yet another file in source package¹ - please let's 
> store such notices in debian/copyright file itself, as unofficial extra 
> fields.
TBH this is a minor issue at the current stage. Let's mainly focus on how to
help volunteers save time in the license reviewing process...
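
Purely as an illustration of a tree-structured IR with explicit paths
only (no wildcards), here is a minimal sketch that folds a flat mapping
of paths to license expressions into a nested structure and serializes
it as JSON. The layout (`children`, `license`) is an assumption, not the
design under preparation.

```python
import json


def build_tree(findings):
    """Fold {explicit path: license expression} into a nested dict.

    Directories are nodes with a 'children' mapping; files are leaves
    carrying a 'license' entry. The key names are hypothetical.
    """
    root = {"children": {}}
    for path, license_expr in findings.items():
        node = root
        parts = path.split("/")
        for part in parts[:-1]:
            node = node["children"].setdefault(part, {"children": {}})
        node["children"][parts[-1]] = {"license": license_expr}
    return root


# Illustrative data only.
findings = {
    "README": "GPL-2.0-or-later",
    "src/main.c": "GPL-2.0-or-later",
    "src/vendor/lib/x.c": "MIT",
}

print(json.dumps(build_tree(findings), indent=2, sort_keys=True))
```

Whether such data lives in a separate JSON file or in extra fields of
debian/copyright does not change the structure itself.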

> > * a group of interested contributors.
> 
> For making a UI frontend, I guess it is more about finding people 
> interested in UI design and skilled in Python, and less about interest 
> in license checking.  I suggest to make a wiki page about this draft 
> idea, and advertise it in the multiple Python packaging teams.
 
In terms of programming, there will be two components in the design:
the UI (simultaneously browses the directory tree and the IR),
and the IR organizer. This idea is not purely about UI programming.
I'm preparing more design details about the IR organizer.
 
> I might be intrigued to try put together a competing frontend in Perl,
> but I have too much on my plate already, so likely wouldn't make enough 
> time for that.
That reminds me to write the IR organizer in a more modular
manner, as it may have to work with different UI implementations.
BTW, I feel that JSON output might become an important feature I need
from licensecheck. I will get back to you once I have refined the design
in detail.