
Idea: frontend tool for more efficient license reviewing based on tree-structured IR



Hi fellow devs,

I have created a number of NEW packages as a DD, and reviewed a number of
packages in the NEW queue as an FTP trainee. Both kinds of work involve an
important, and sometimes annoying, part: license checking. People keep
complaining about it, and recently there were some related
discussions[1][2] on -project about possible ways to improve it, some
on the workflow side and others on the tooling side. In this mail I present
an idea about tooling.

This is a long mail. I've already organized it in a structured format so
that you can skim it quickly.

The problem we are trying to solve
----------------------------------

Given an arbitrary source tree, we shall examine the copyright & license
information for each file node, make sure each node complies with the DFSG,
and make an overall assessment of the whole tree: ACCEPT/REJECT. Subsequently,
the tree will be flattened (the tree structure being removed) and written into
debian/copyright in machine-readable format.

Note that automatically parsing a machine-UNreadable debian/copyright requires
a delicate recurrent neural network. That machine-UNreadable case is too
complex, so let's ignore it for now.

Existing tools, workflows, and limitations
------------------------------------------

## Tools

https://wiki.debian.org/CopyrightReviewTools

I'm unfamiliar with most of them, so I only describe the two I know.
Both licensecheck (Jonas) and debmake (Osamu) do template/regex
matching.

## Workflows

uploader: ??? There doesn't seem to be a standard process that all uploaders
follow to generate debian/copyright.

  I personally do `licensecheck -r --deb-machine . > debian/copyright`
  and manually tweak the content.

ftp-master: possibly manually reviewing with MC + custom plugin

  I don't follow the recommended way. I use `ranger` (vim keybindings,
  smooth file browsing with a preview panel) for reviewing packages on
  ftp-master.d.o.

## Limitations

* Tree structure is always missing (and actually not possible to present)
  in debian/copyright. When reviewing someone else's NEW package as a trainee,
  I find it torturous to locate the license information for a single file in
  debian/copyright.
* Because the tree structure is missing, after importing a new upstream
  release with significant directory layout changes, it is inconvenient to
  locate the parts of debian/copyright that should be updated. Things become
  more complex when new licenses/copyrights emerge.
* licensecheck dumps garbage when it encounters a binary file, e.g. a PNG
  image. This is not a bug, as ftp-masters do check the metadata in a binary
  file to see whether there is extra copyright/license info.
  But this is something that needs to be improved...
* Generic file browsers are not designed for our special purpose, and neither
  are the commercial tools.
* etc.

My idea
-------

## Motivations

License reviewing is certainly inevitable. Even if we can improve the
efficiency of this process only a tiny bit, the gain is multiplied across
every uploader and reviewer, so it will greatly improve the efficiency of the
community on this specific task.

I have a couple of other motivations, but the above one is already strong
enough.

## Core

The core of my idea is a tree-structured intermediate representation (IR) for
the "license reviewing tree". The IR is basically a directory tree with
annotations on the file nodes. The IR can be stored as a, say, JSON file.
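
To make the shape of such an IR concrete, here is a minimal Python sketch. The
field names ("type", "children", "suggestions", "annotation") and the function
name `build_ir` are my own assumptions for illustration, not a fixed schema.

```python
import json
import os


def build_ir(root):
    """Return a nested dict mirroring the directory tree under `root`.

    Hypothetical schema: directories carry "children"; files carry an empty
    "suggestions" dict (backend name -> suggestion) and an "annotation" slot
    for the human's final decision.
    """
    node = {
        "name": os.path.basename(os.path.abspath(root)),
        "type": "dir",
        "children": [],
    }
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            node["children"].append(build_ir(path))
        else:
            node["children"].append({
                "name": entry,
                "type": "file",
                "suggestions": {},
                "annotation": None,
            })
    return node


def save_ir(ir, filename):
    """Write the IR to `filename` as indented JSON."""
    with open(filename, "w", encoding="utf-8") as fh:
        json.dump(ir, fh, indent=2)
```

Every file gets exactly one node, so no path needs a wildcard, and the nesting
of the source tree is kept as-is.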

To build such a tree-shaped IR, we need a couple of "backend" tools for
checking the copyright & license info of a SINGLE file. Such backends include,
but are not limited to:

 * `licensecheck`. Given a file FILE, `licensecheck FILE` produces the license
    name.
 * `grep` or `ripgrep`. For example, `rg -i copyright FILE` always works well.
 * "neighbor". For example, given a source file "F/I/L/E" without any copyright
    & license info, looking for F/I/L/LICENSE, F/I/LICENSE, ..., etc., the way
    git searches parent directories for ".git", will help.

The formatted and filtered output of any combination of these backends can be
attached to the corresponding node in the IR.
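
As an illustration of the "neighbor" backend, here is a small sketch that walks
upward from a file towards the top of the source tree. The list of license
file names and the function name `find_neighbor_license` are assumptions of
mine; a real backend would probably want a longer list.

```python
import os

# Assumed set of file names that count as a "neighbor" license file.
LICENSE_NAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING")


def find_neighbor_license(path, root):
    """Search upward from the directory of `path` for a license file.

    The search stops at `root` (the top of the source tree) and returns the
    first match, or None if no directory on the way up contains one.
    """
    root = os.path.abspath(root)
    current = os.path.dirname(os.path.abspath(path))
    while True:
        for name in LICENSE_NAMES:
            candidate = os.path.join(current, name)
            if os.path.isfile(candidate):
                return candidate
        parent = os.path.dirname(current)
        # Stop at the top of the source tree or at the filesystem root.
        if current == root or parent == current:
            return None
        current = parent
```

The result is only a suggestion: the nearest LICENSE file does not necessarily
cover the file in question, so a human still has to confirm it.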

In contrast, a "frontend" tool is also needed for dealing with the IR
at a higher level. The frontend I imagine is a `ranger`-like file
browser with some purpose-specific design choices:

 * The user can choose which backend(s) to use. If none is chosen, the
   frontend falls back to being a general file browser with a preview panel.
 * The frontend invokes the chosen backends to generate a template IR, and
   stores it in debian/copyright.json. No wildcard or regex in file paths
   is allowed in this file.
 * When viewing files, the suggestions from the various backends are shown.
   The user can choose to accept or override a suggestion. These choices
   are also recorded in the json file. Of course, when the backends
   do not agree with each other, the user has to override the suggestion
   and manually annotate the node.
 * When the whole directory tree has been reviewed/annotated, the frontend
   translates the IR (d/copyright.json) into machine-readable format
   (d/copyright).
 * ...
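
The last step of the list above, translating the IR into d/copyright, could
look roughly like the sketch below. It assumes a hypothetical IR in which
directory nodes have "children" and every reviewed file node has an
"annotation" dict with "copyright" and "license" keys. It only emits short
license names and does not collapse paths into wildcards, both of which a real
tool would have to handle.

```python
from collections import defaultdict

FORMAT_HEADER = (
    "Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/"
)


def iter_files(node, prefix=""):
    """Yield (relative_path, annotation) for every file node below `node`."""
    for child in node.get("children", []):
        path = prefix + child["name"]
        if child["type"] == "dir":
            yield from iter_files(child, path + "/")
        else:
            yield path, child.get("annotation")


def flatten(ir):
    """Group files by (copyright, license) and render one stanza per group."""
    groups = defaultdict(list)
    for path, annotation in iter_files(ir):
        if annotation is None:
            raise ValueError(f"{path} has not been reviewed yet")
        key = (annotation["copyright"], annotation["license"])
        groups[key].append(path)

    stanzas = [FORMAT_HEADER]
    for (copyright_, license_), paths in sorted(groups.items()):
        # Continuation lines of a field start with whitespace.
        files = "\n       ".join(sorted(paths))
        stanzas.append(
            f"Files: {files}\nCopyright: {copyright_}\nLicense: {license_}"
        )
    return "\n\n".join(stanzas) + "\n"
```

Refusing to flatten a tree that still has unreviewed nodes is a design choice
in this sketch: it keeps an incomplete review from silently reaching
d/copyright.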

From the ftp-master's perspective:

 * They can review the uploader's IR with the frontend. The good thing is that
   the ftp-master can collect all the information for one file at a glance:
   file path, file preview (the header part), backend suggestions, human
   annotation, and the history of overrides or manual annotations.
 * They no longer have to suffer from "locating a file in d/copyright".

From the uploader's perspective:

 * In the past the IR was built only in our minds: we transformed the raw
   directory and file data directly into the final flattened d/copyright. That
   means we have to rebuild the IR every time we want to change/review
   d/copyright. Explicitly writing that IR down may make the process more
   efficient.

This frontend-backend design somewhat resembles our apt+dpkg, where apt deals
with the dependency tree, while dpkg deals with the nodes.

How to proceed
--------------

* a group of interested contributors.

* GSoC / Outreachy sounds good.

Several months ago I started a Python script based on this idea.
I struggle with UI programming (I'm really not good at it).
Specifically, when I found myself stuck adding custom keybindings under the
urwid framework, I postponed the idea indefinitely.

[1] -project: "Do we still value contributions?"
[2] -project: "possibly exhausted ftp-masters"

