Idea: frontend tool for more efficient license reviewing based on tree-structured IR
Hi fellow devs,
I have created a number of NEW packages as a DD, and reviewed a number of NEW
packages in the NEW queue as an FTP trainee. Both kinds of work involve an
important, and sometimes annoying, part: license checking. People keep
complaining about it, and recently there were some related discussions[1][2]
on -project about possible ways to improve it, some on the workflow side and
others on the tooling side. In this mail I have an idea about tooling.
This is a long mail. I've already organized it in a structured format so
that you can skim it quickly.
The problem we are trying to solve
----------------------------------
Given an arbitrary source tree, we shall examine the copyright & license
information for each file node, make sure each node complies with the DFSG, and
make an overall assessment of the whole tree: ACCEPT/REJECT. Subsequently, the
tree is flattened (the tree structure is removed) and written into
debian/copyright in machine-readable format.
Note that automatically parsing a machine-UNreadable debian/copyright requires
a delicate recurrent neural network. That machine-UNreadable case is too
complex, so let's ignore it for now.
Existing tools, workflows, and limitations
------------------------------------------
## Tools
https://wiki.debian.org/CopyrightReviewTools
I'm unfamiliar with most of them. I'm only describing the two I'm familiar
with. Both licensecheck (Jonas) and debmake (Osamu) do template/regex
matching.
## Workflows
uploader: ??? There doesn't seem to be a standard process that all uploaders
follow to generate debian/copyright.
I personally do `licensecheck -r --deb-machine . > debian/copyright`
and manually tweak the content.
ftp-master: possibly manually reviewing with MC + custom plugin
I didn't follow the recommended way. I use `ranger` (vim keybinding,
fluent file browsing with preview panel) for reviewing packages on
ftp-master.d.o.
## Limitations
* Tree structure is always missing from (and actually cannot be represented
in) debian/copyright. When reviewing someone else's NEW package as a trainee,
I find it painful to locate the license information for a single file in
debian/copyright.
* Because the tree structure is missing, after importing a new upstream release
with significant directory layout changes, it is inconvenient to locate the
parts of debian/copyright that should be updated. Things become more
complex when new licenses/copyrights appear.
* licensecheck dumps garbage when it encounters a binary file, e.g. a PNG image.
This is not a bug, as ftp-masters do check the metadata in binary files for
possible extra copyright/license info. But this is something that needs to be
improved...
* Generic file browsers are not designed for our special purpose, and neither
are the commercial tools.
* etc.
My idea
-------
## Motivations
License reviewing is certainly inevitable. Even a tiny improvement in the
efficiency of this process adds up to a large gain for the community, because
so many of us repeat this specific task.
I have a couple of other motivations, but the above one is already strong enough.
## Core
The core of my idea is a tree-structured intermediate representation (IR) for
the "license reviewing tree". The IR is basically a directory tree with
annotations on the file nodes. The IR can be stored as a, say, JSON file.
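
To make the shape of such an IR a bit more concrete, here is a minimal Python
sketch. The field names ("name", "type", "children", "suggestions",
"annotation") and the function name build_ir are purely my assumptions for
illustration, not a fixed schema.

```python
import json
import os


def build_ir(root):
    """Return a nested dict that mirrors the directory tree rooted at `root`.

    Hypothetical schema: directory nodes carry "children"; file nodes carry
    empty slots for backend suggestions and the human annotation.
    """
    node = {
        "name": os.path.basename(os.path.abspath(root)),
        "type": "dir",
        "children": [],
    }
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            node["children"].append(build_ir(path))
        else:
            node["children"].append({
                "name": entry,
                "type": "file",
                "suggestions": {},   # backend name -> raw suggestion
                "annotation": None,  # filled in by the human reviewer
            })
    return node


def dump_ir(ir, filename):
    """Write the IR to `filename` as indented JSON."""
    with open(filename, "w", encoding="utf-8") as fh:
        json.dump(ir, fh, indent=2)
```

Something like `dump_ir(build_ir("."), "debian/copyright.json")` would then
produce an empty template for the reviewer to fill in.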
To build such a tree-shaped IR, we need a couple of "backend" tools for
checking the copyright & license info of a SINGLE file. Such backends include,
but are not limited to:
* `licensecheck`. Given a file FILE, `licensecheck FILE` produces the license
name.
* `grep` or `ripgrep`. For example, `rg -i copyright FILE` always works well.
* "neighbor". For example, given a source file "F/I/L/E" without any copyright
& license info, looking for F/I/L/LICENSE, F/I/LICENSE, etc., the way git
searches parent directories for ".git", will help.
The formatted and filtered output of any combination of these backends can be
attached to the corresponding IR node.
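
As a rough illustration of the "neighbor" backend, the upward search could
look like the Python sketch below. The list of candidate file names and the
function name find_neighbor_license are my assumptions; a real backend would
probably need a longer list.

```python
import os

# Assumed candidate names; hypothetical and certainly incomplete.
LICENSE_NAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING")


def find_neighbor_license(path, root):
    """Search upward from the directory of `path` until `root` is reached.

    Returns the nearest license file found, or None if there is none.
    """
    root = os.path.abspath(root)
    current = os.path.dirname(os.path.abspath(path))
    while True:
        for name in LICENSE_NAMES:
            candidate = os.path.join(current, name)
            if os.path.isfile(candidate):
                return candidate
        if current == root:
            return None
        parent = os.path.dirname(current)
        if parent == current:
            # Hit the filesystem root: `path` was not inside `root`.
            return None
        current = parent
```

The search stops at the top of the source tree on purpose, so that a LICENSE
file lying around outside the package is never picked up.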
In contrast, a "frontend" tool is also needed for dealing with such an IR
at a higher level. The "frontend" I imagine is a `ranger`-like file
browser with specific designs:
* The user can choose which backend(s) to use. If none is chosen, the frontend
falls back to a general file browser with a preview panel.
* The frontend invokes the chosen backends to generate a template IR and
stores it in debian/copyright.json. No wildcard or regex in file paths
is allowed in this file.
* When viewing files, the suggestions from the various backends are shown.
The user can choose to accept or override a suggestion. These choices
are also recorded in the JSON file. Of course, when the backends
do not agree with each other, the user has to override the suggestions
and manually annotate the node.
* When the whole directory tree has been reviewed/annotated, the frontend
translates the IR (d/copyright.json) into the machine-readable format
(d/copyright).
* ...
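
The final translation step could be sketched in Python roughly as follows. I
assume here a hypothetical IR layout: nested dicts where directory nodes have
"children" and every file node has an "annotation" dict with "copyright" and
"license" keys. Standalone License paragraphs with the full license text are
left out of this sketch.

```python
from collections import defaultdict

FORMAT_URL = "https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/"


def iter_files(node, prefix=""):
    """Yield (relative_path, annotation) for every file node below `node`."""
    for child in node.get("children", []):
        path = prefix + child["name"]
        if "children" in child:
            yield from iter_files(child, path + "/")
        else:
            yield path, child.get("annotation")


def flatten(ir):
    """Group files by (copyright, license) and emit one Files stanza per group."""
    groups = defaultdict(list)
    for path, ann in iter_files(ir):
        if ann is None:
            raise ValueError("file not reviewed yet: " + path)
        groups[(ann["copyright"], ann["license"])].append(path)
    stanzas = ["Format: " + FORMAT_URL]
    for (holder, license_name), paths in sorted(groups.items()):
        # Continuation lines of a field start with a single space.
        files = "Files: " + "\n ".join(sorted(paths))
        stanzas.append(f"{files}\nCopyright: {holder}\nLicense: {license_name}")
    return "\n\n".join(stanzas) + "\n"
```

Since the IR forbids wildcards, this sketch lists every file explicitly;
collapsing a fully uniform directory into a pattern such as `src/*` could be a
later optimization.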
From the ftp-master's perspective:
* They can review the uploader's IR with the frontend. The good thing is that
the ftp-master can collect all the information for one file at a glance:
file path, file preview (the header part), backend suggestions, human
annotation, and override or manual annotation history.
* They don't have to suffer from "locating a file in d/copyright".
From the uploader's perspective:
* In the past, the IR was only built in our minds: we transformed the raw
directory and file data directly into the final flattened d/copyright. That
means we have to rebuild the IR every time we want to change/review
d/copyright. Explicitly writing that IR down may make the process more
efficient.
This frontend-backend design somewhat resembles our apt+dpkg, where apt deals
with the dependency tree, while dpkg deals with the nodes.
How to proceed
--------------
* a group of interested contributors.
* GSoC / Outreachy sounds good.
Several months ago I started a Python script based on this idea.
I'm struggling with UI programming (I'm really not good at it).
Specifically, when I got stuck adding custom keybindings under the
urwid framework, I postponed the idea indefinitely.
[1] -project: "Do we still value contributions?"
[2] -project: "possibly exhausted ftp-masters"