GSoC status: classification, output format and more

To: debian-lint-maint@lists.debian.org
Subject: GSoC status: classification, output format and more
From: Jordà Polo <jorda@ettin.org>
Date: Fri, 18 Jul 2008 15:30:24 +0200
Message-id: <20080718133023.GA28332@hinnom>
I have not explained much about the Lintian GSoC project so far. In the
following paragraphs I'll try to summarize how it is coming along, as
well as the current issues and future directions.

Tag classification
------------------

So far, more than 50% of all the tags in checks/*.desc have been
classified using Severity and Certainty headers. Even though I have been
following the descriptions[1] and have checked the documentation
(policy, devref, etc.), this initial classification isn't necessarily
perfect. In some cases it isn't clear what the most appropriate value
is, which has led me to be influenced by other factors (such as the
value of Type, the wording of the description, or the number of
tagged/overridden packages). Also, I have tried to be consistent between
tags of the same check, but there may be some inconsistencies between
tags of different checks.

In order to have an idea of how Severity and Certainty values are
distributed, there is a script (`private/transtats') that displays some
stats. See the simplified output below or here[2], or a more detailed
output here[3].
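
To make the counting concrete, here is a minimal sketch of that kind of
tally. It is not the actual script; it is written in Python purely for
illustration, and it assumes the tag descriptions have already been
parsed into dictionaries keyed by header name. The function name and the
unclassified sample tag are hypothetical.

```python
from collections import Counter

# Sample records for illustration; real data would come from checks/*.desc.
# "some-unclassified-tag" is a made-up name standing in for a tag that has
# no Severity/Certainty headers yet.
tags = [
    {"Tag": "debian-changelog-file-missing",
     "Severity": "serious", "Certainty": "certain"},
    {"Tag": "no-upstream-changelog",
     "Severity": "minor", "Certainty": "wild-guess"},
    {"Tag": "some-unclassified-tag"},
]


def classification_stats(tags):
    """Count classified tags and how their values are distributed."""
    classified = [t for t in tags if "Severity" in t and "Certainty" in t]
    return {
        "total": len(tags),
        "classified": len(classified),
        "severity": Counter(t["Severity"] for t in classified),
        "certainty": Counter(t["Certainty"] for t in classified),
        "pairs": Counter(
            f'{t["Severity"]}/{t["Certainty"]}' for t in classified
        ),
    }


stats = classification_stats(tags)
print(f'{stats["classified"]}/{stats["total"]} classified')
```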

New output format
-----------------

Displaying the new information provided by Severity/Certainty headers
requires a new output format. Implementing new formats isn't that
difficult, as they're almost "pluggable", but deciding what to show (and
how) is more problematic and requires some consensus.

The current format uses a one-letter code to differentiate kinds of
tags, so a possible solution that has already been suggested[4] is using
a two-letter code. This solution could work to generate long lintian
reports, but it isn't very "human readable". Example output:

  MW: package: no-upstream-changelog
  IP: package: binary-with-bad-dynamic-table extra-info
  NP: package: changelog-news-debian-mismatch extra-info
  SC: package: debian-changelog-file-missing
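
As a rough sketch (Python for illustration only, not lintian code), the
two-letter code is just the initial of the severity followed by the
initial of the certainty. The function name and its parameters are
hypothetical.

```python
def two_letter_line(severity, certainty, package, tag, extra=""):
    """Format one tag line with a two-letter severity/certainty code."""
    # First letter of each value, e.g. minor + wild-guess -> "MW".
    code = severity[0].upper() + certainty[0].upper()
    line = f"{code}: {package}: {tag}"
    return f"{line} {extra}" if extra else line


print(two_letter_line("minor", "wild-guess", "package",
                      "no-upstream-changelog"))
```

Note that wishlist and wild-guess share the same initial, so a
wishlist/wild-guess tag would be shown as "WW".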

Another experimental format that was already available in lintian[5] is
using qualifiers for significance (!/ /?/??) combined with a one-letter
code for severity (E/W/I). This could be easily mapped to the new
classification:

  --------------------
  |   | C  | P  | W  |
  |---|----|----|----|
  | S | S! | S  | S? |
  | I | I! | I  | I? |
  | N | N! | N  | N? |
  | M | M! | M  | M? |
  | W | W! | W  | W? |
  --------------------

  M?: package: no-upstream-changelog
  I : package: binary-with-bad-dynamic-table extra-info
  N : package: changelog-news-debian-mismatch extra-info
  S!: package: debian-changelog-file-missing
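
The direct mapping in the table above could be sketched like this
(Python for illustration only; the names are hypothetical). The
qualifier depends on the certainty alone: "!" for certain, a blank for
possible and "?" for wild-guess.

```python
# Qualifier per certainty, as in the table above.
QUALIFIER = {"certain": "!", "possible": " ", "wild-guess": "?"}


def qualified_line(severity, certainty, package, tag, extra=""):
    """Format one tag line as severity letter plus certainty qualifier."""
    code = severity[0].upper() + QUALIFIER[certainty]
    line = f"{code}: {package}: {tag}"
    return f"{line} {extra}" if extra else line


print(qualified_line("serious", "certain", "package",
                     "debian-changelog-file-missing"))
```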

It is probably an improvement over the previous format in terms of
readability, but this "direct" mapping can be misleading if it leads
people to focus on the qualifier only. Another option would be to use
less accurate mappings, but based on overall relevance, where the
qualifier is a function of both certainty and severity:

  --------------------   --------------------
  |   | C  | P  | W  |   |   | C  | P  | W  |
  |---|----|----|----|   |---|----|----|----|
  | S | S! | S! | S  |   | S | S! | S! | S  |
  | I | I! | I  | I  |   | I | I! | I! | I  |
  | N | N! | N  | N? |   | N | N! | N  | N? |
  | M | M  | M  | M? |   | M | M  | M? | M? |
  | W | W  | W? | W? |   | W | W  | W? | W? |
  --------------------   --------------------
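
Since the qualifier now depends on both values, the simplest way to
express these two alternatives is a lookup table per variant. The sketch
below (Python for illustration only; LEFT, RIGHT and relevance_code are
hypothetical names) encodes both tables exactly as drawn above.

```python
CERTAINTIES = ("certain", "possible", "wild-guess")

# Qualifier per severity row, in the column order C, P, W.
LEFT = {
    "serious":   ("!", "!", " "),
    "important": ("!", " ", " "),
    "normal":    ("!", " ", "?"),
    "minor":     (" ", " ", "?"),
    "wishlist":  (" ", "?", "?"),
}
RIGHT = {
    "serious":   ("!", "!", " "),
    "important": ("!", "!", " "),
    "normal":    ("!", " ", "?"),
    "minor":     (" ", "?", "?"),
    "wishlist":  (" ", "?", "?"),
}


def relevance_code(table, severity, certainty):
    """Return the two-character code for a severity/certainty pair."""
    qualifier = table[severity][CERTAINTIES.index(certainty)]
    return severity[0].upper() + qualifier


# The two variants only differ for important/possible and minor/possible.
for severity in ("important", "minor"):
    print(repr(relevance_code(LEFT, severity, "possible")),
          repr(relevance_code(RIGHT, severity, "possible")))
```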

A format based only on symbols should also be possible, but it would
make lines slightly longer and not necessarily more understandable:

     ·· package: no-upstream-changelog
   ---- package: binary-with-bad-dynamic-table extra-info
    --- package: changelog-news-debian-mismatch extra-info
  +++++ package: debian-changelog-file-missing

There aren't many more options if keeping exactly one line per tag is
important. But perhaps it would be interesting to have some kind of
--human-readable output for maintainers who only want to check a small
number of packages. There is more than one possible variant; each tag
below uses a different one:

  minor, wild-guess: no-upstream-changelog
    * package
  binary-with-bad-dynamic-table [important, possible]
    * package: extra-info
  normal: changelog-news-debian-mismatch [possible]
    * package: extra-info

It should also be possible to reorganize the output, but that would
require displaying everything at the end instead of printing tags as
they're found. For instance, packages could be grouped under each tag:

  serious: debian-changelog-file-missing [certain]
    * package1: dpatch
    * package2: dpatch
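
A minimal sketch of that kind of reorganized output (Python for
illustration only; the function name and the shape of the input records
are assumptions): all findings are collected first, grouped by tag, and
printed once at the end.

```python
from collections import defaultdict


def grouped_report(findings):
    """Group findings by tag and list the affected packages under each."""
    groups = defaultdict(list)
    for f in findings:
        key = (f["severity"], f["certainty"], f["tag"])
        groups[key].append((f["package"], f.get("extra", "")))

    lines = []
    # Dictionaries keep insertion order, so tags appear as first found.
    for (severity, certainty, tag), hits in groups.items():
        lines.append(f"{severity}: {tag} [{certainty}]")
        for package, extra in hits:
            lines.append(f"  * {package}: {extra}" if extra
                         else f"  * {package}")
    return "\n".join(lines)


findings = [
    {"package": "package1", "tag": "debian-changelog-file-missing",
     "severity": "serious", "certainty": "certain", "extra": "dpatch"},
    {"package": "package2", "tag": "debian-changelog-file-missing",
     "severity": "serious", "certainty": "certain", "extra": "dpatch"},
]
print(grouped_report(findings))
```

The cost is the one mentioned above: nothing can be printed until every
package has been processed.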

Backwards compatibility
-----------------------

With only ~54% of tags classified, it is still too early to define a
mapping to the old E/W/I classification. But I have been a bit
conservative with the new classification, and a simple mapping like the
one in the table below would classify most tags correctly (the boundary
between error and warning wouldn't be that meaningful, but the tags
displayed with and without -I would be almost the same).

  -----------------
  |   | C | P | W |
  |---|---|---|---|
  | S | E | E | E |
  | I | E | E | W |
  | N | W | W | W |
  | M | W | I | I |
  | W | I | I | I |
  -----------------

As soon as more tags are classified and reviewed I'll try to provide
stats to see how much the Type header and this kind of mapping diverge.
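
In the meantime, a rough check is possible with the per-Type
Severity/Certainty counts from the transtats output at the end of this
message. The sketch below (Python for illustration only; all names are
hypothetical) applies the mapping table above to those counts and
reports how many tags would keep their current Type.

```python
CERTAINTIES = ("certain", "possible", "wild-guess")

# The mapping table above: one E/W/I letter per column C, P, W.
LEGACY = {
    "serious":   dict(zip(CERTAINTIES, "EEE")),
    "important": dict(zip(CERTAINTIES, "EEW")),
    "normal":    dict(zip(CERTAINTIES, "WWW")),
    "minor":     dict(zip(CERTAINTIES, "WII")),
    "wishlist":  dict(zip(CERTAINTIES, "III")),
}
TYPE_LETTER = {"error": "E", "warning": "W", "info": "I"}

# (Type, Severity, Certainty) -> number of tags, copied from the
# "Type ... Severity/Certainty" sections of the transtats output.
COUNTS = {
    ("error", "serious", "certain"): 33,
    ("error", "serious", "possible"): 1,
    ("error", "important", "certain"): 119,
    ("error", "important", "possible"): 16,
    ("error", "normal", "certain"): 2,
    ("warning", "important", "certain"): 2,
    ("warning", "important", "possible"): 4,
    ("warning", "important", "wild-guess"): 1,
    ("warning", "normal", "certain"): 101,
    ("warning", "normal", "possible"): 78,
    ("warning", "minor", "certain"): 5,
    ("warning", "minor", "wild-guess"): 1,
    ("info", "minor", "certain"): 8,
    ("info", "minor", "possible"): 5,
    ("info", "wishlist", "certain"): 13,
}


def agreement(counts):
    """Return (tags whose mapped letter equals their Type, total tags)."""
    agree = sum(
        n for (type_, severity, certainty), n in counts.items()
        if LEGACY[severity][certainty] == TYPE_LETTER[type_]
    )
    return agree, sum(counts.values())


agree, total = agreement(COUNTS)
print(f"{agree}/{total} tags keep their current Type")
```

With these counts the mapping agrees with the existing Type for 372 of
the 389 tags in the breakdown (about 96%). The 17 that diverge are the
2 normal errors, the 6 important/certain and important/possible
warnings, the minor/wild-guess warning and the 8 minor/certain info
tags.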

Other ideas (not necessarily related to the GSoC project)
---------------------------------------------------------

One of the things I'm not happy about is that all checks are executed
and non-requested tags are simply hidden. This made sense before since
there were only a small number of optional tags (Type: info), but with
the new classification it will be possible to request only a small
number of tags (e.g. serious & certain). I would like to take a look at
this issue before the end of the GSoC project to see how much time is
spent running collection/checks scripts.
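
To illustrate the idea, here is a minimal sketch (Python for
illustration only, not how lintian is structured today). It assumes each
check knows the classification of the tags it can emit, and runs a check
only if at least one of its tags has been requested. The check names,
tag names and function name are all hypothetical, and dependencies
between collection scripts are not modelled.

```python
# Hypothetical checks and tag classifications, for illustration only.
CHECKS = {
    "check-a": {"tag-a1": ("serious", "certain"),
                "tag-a2": ("normal", "possible")},
    "check-b": {"tag-b1": ("minor", "wild-guess")},
    "check-c": {"tag-c1": ("wishlist", "certain"),
                "tag-c2": ("serious", "certain")},
}


def checks_to_run(checks, severities, certainties):
    """Return the checks that can emit at least one requested tag."""
    return [
        name for name, tags in checks.items()
        if any(s in severities and c in certainties
               for s, c in tags.values())
    ]


# Requesting only serious & certain tags skips check-b entirely.
print(checks_to_run(CHECKS, {"serious"}, {"certain"}))
```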

After the redesign of lintian.d.o, people have suggested some
interesting improvements for the website. For instance, adding line
charts to display the evolution of the number of tags (for each
maintainer and tag), and even the kinds of tags (for maintainers).
Another idea to improve lintian.d.o would be to provide a more stable
and parseable output (e.g. YAML) so that other services such as qa.d.o
and packages.qa.d.o could easily use the data. Do you think it is worth
it?

Finally, the changes to the code are available at git.d.o[7][8], but
note that they are not yet ready to be merged and I may rearrange
commits if needed (so don't be surprised if you see weird things in the
log after pulling new stuff; a fresh clone will probably fix that).

Comments, suggestions, thoughts? I would like to know what you think,
especially about the new output format.

Thanks!


 1. http://lists.debian.org/debian-lint-maint/2008/06/msg00275.html
 2. http://ettin.org/tmp/lintian/transtats.out
 3. http://ettin.org/tmp/lintian/transtats-vvv.out
 4. http://wiki.debian.org/SummerOfCode2008/lintian
 5. http://git.debian.org/?p=lintian/lintian.git;a=commit;h=6f8da79744dfca6a
 6. http://wiki.debian.org/Teams/Lintian
 7. http://git.debian.org/?p=users/jorda-guest/lintian.git
 8. git://git.debian.org/git/users/jorda-guest/lintian.git

--

Output of `private/transtats':

Number of classified tags
  390/713 (54.70%)

Severity
  serious: 34
  important: 142
  normal: 181
  minor: 19
  wishlist: 13

Certainty
  certain: 284
  possible: 104
  wild-guess: 2

Severity/Certainty
  serious/certain: 33
  serious/possible: 1
  important/certain: 121
  important/possible: 20
  important/wild-guess: 1
  normal/certain: 103
  normal/possible: 78
  minor/certain: 13
  minor/possible: 5
  minor/wild-guess: 1
  wishlist/certain: 13

Type error Severity
  serious: 34
  important: 135
  normal: 2

Type warning Severity
  important: 7
  normal: 179
  minor: 6

Type info Severity
  minor: 13
  wishlist: 13

Type error Severity/Certainty
  serious/certain: 33
  serious/possible: 1
  important/certain: 119
  important/possible: 16
  normal/certain: 2

Type warning Severity/Certainty
  important/certain: 2
  important/possible: 4
  important/wild-guess: 1
  normal/certain: 101
  normal/possible: 78
  minor/certain: 5
  minor/wild-guess: 1

Type info Severity/Certainty
  minor/certain: 8
  minor/possible: 5
  wishlist/certain: 13