Re: GSoC status: classification, output format and more

To: debian-lint-maint@lists.debian.org
Subject: Re: GSoC status: classification, output format and more
From: Russ Allbery <rra@debian.org>
Date: Sun, 20 Jul 2008 21:13:25 -0700
Message-id: <87iquzg9i2.fsf@windlord.stanford.edu>
In-reply-to: <20080718133023.GA28332@hinnom> ("Jordà Polo"'s message of "Fri, 18 Jul 2008 15:30:24 +0200")
References: <20080718133023.GA28332@hinnom>

Jordà Polo <jorda@ettin.org> writes:

> I have not been explaining much about the Lintian GSoC project. In the
> following paragraphs I'll try to summarize how it is coming along, as
> well as what are the current issues and future directions.

Thank you for the update!

> So far, more than 50% of all the tags in checks/*.desc have been
> classified using Severity and Certainty headers. Even though I have been
> following the descriptions[1] and have checked the documentation
> (policy, devref, etc.), this initial classification isn't necessarily
> perfect. In some cases it isn't that clear what is the most appropriate
> value, which has led me to be influenced by other factors (such as the
> value of Type, the wording of the description, or the number of
> tagged/overridden packages). Also, I have tried to be consistent between
> tags of the same check, but there may be some inconsistencies between
> tags of different checks.

I would expect that.  In a lot of cases, the current severities are just a
guess, or haven't really been reviewed.  I'm happy to take any consistent
review, double-check it, and then go from there and tweak it based on
later reports.

> There aren't a lot more options if keeping exactly one line per tag is
> important. But perhaps it would be interesting to have some kind of
> --human-readable output for maintainers that only want to check a small
> number of packages. This includes more than one possible variant:
>
>   minor, wild-guess: no-upstream-changelog
>     * package
>   binary-with-bad-dynamic-table [important, possible]
>     * package: extra-info
>   normal: changelog-news-debian-mismatch [possible]
>     * package: extra-info
>
> It should also be possible to reorganize the output, but that would
> require displaying everything at the end instead of printing tags as
> they're found.
>
>   serious: debian-changelog-file-missing [certain]
>     * package1: dpatch
>     * package2: dpatch

I'm not sure that this is a good idea, but the thought I had in the back
of my mind was to keep the current E/W/I code on the one-line output
format and only show the derivation of that status if -i was used.  With
-i, we would display the severity and certainty along with the source.

So a tag like:

W: openafs-fileserver: binary-without-manpage usr/sbin/fssync-debug

would stay the same by default, but become:

W: openafs-fileserver: binary-without-manpage usr/sbin/fssync-debug
N:
N:   Each binary in /usr/bin, /usr/sbin, /bin, /sbin or /usr/games should
N:   have a manual page
N:   [...]
N:   Refer to Policy Manual, section 12.1 for details.
N:
N:   Severity: normal, Certainty: probable, Source: policy

when displayed with -i.  So you'd have to use -i to see the full
classification or to know how to tune Lintian to show only things like
this.

The plus is that the basic format uses the same terms that people are
already familiar with, even though we also have support for tuning the
output for things like ftp-master.  The drawback is that we're not pushing
people towards the new, granular way of thinking about tag severity.  But
I'm not sure that's necessary.
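
As a rough illustration of the -i idea above, here is a minimal Python
sketch. It is not Lintian's actual code (Lintian is written in Perl), and
the function name format_tag and the layout of the details dictionary are
assumptions made up for this example. The one-line form is always emitted;
the description and the classification line appear only when details are
passed in, which stands in for running with -i.

```python
def format_tag(code, package, tag, extra="", details=None):
    """Render one tag in the one-line format, plus N: lines if details given.

    `details` is a hypothetical dict with the keys "description" (a list
    of text lines), "severity", "certainty" and "source".
    """
    line = f"{code}: {package}: {tag}"
    if extra:
        line += f" {extra}"
    lines = [line]
    if details is not None:
        # Mirrors the layout of the example: blank N: line, description,
        # blank N: line, then the classification.
        lines.append("N:")
        for text in details["description"]:
            lines.append(f"N:   {text}")
        lines.append("N:")
        lines.append(
            f"N:   Severity: {details['severity']}, "
            f"Certainty: {details['certainty']}, "
            f"Source: {details['source']}"
        )
    return "\n".join(lines)


EXAMPLE_DETAILS = {
    "description": [
        "Each binary in /usr/bin, /usr/sbin, /bin, /sbin or /usr/games should",
        "have a manual page",
    ],
    "severity": "normal",
    "certainty": "probable",
    "source": "policy",
}

print(format_tag("W", "openafs-fileserver", "binary-without-manpage",
                 "usr/sbin/fssync-debug", EXAMPLE_DETAILS))
```

Calling format_tag without details yields only the familiar W: line, so
existing consumers of the one-line format would see no change.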

> With only ~54% of tags classified, it is still soon to define a mapping
> to the old E/W/I classification. But I have been a bit conservative with
> the new classification, and a simple mapping like the one in the table
> below would classify most tags correctly (the frontier between error and
> warning wouldn't be that meaningful, but the tags displayed with and
> without -I would be almost the same).
>
>   -----------------
>   |   | C | P | W |
>   |---|---|---|---|
>   | S | E | E | E |
>   | I | E | E | W |
>   | N | W | W | W |
>   | M | W | I | I |
>   | W | I | I | I |
>   -----------------

The only thing that I might change there is to make N/W (normal,
wild-guess) an I instead of a W.
Otherwise, that looks great to me.
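
For reference, the quoted table and the one suggested change can be
written down as a small lookup. This is only a Python sketch, not Lintian
code. It assumes the row letters S/I/N/M/W stand for serious, important,
normal, minor and wishlist, and the column letters C/P/W for certain,
possible and wild-guess; the names PROPOSED, TWEAKED and legacy_code are
invented for the example.

```python
# Column order of the quoted table (assumed expansion of C, P, W).
CERTAINTIES = ("certain", "possible", "wild-guess")

# One string per severity row; each character is the old E/W/I code for
# the certainty in the same column position.
PROPOSED = {
    "serious":   "EEE",
    "important": "EEW",
    "normal":    "WWW",
    "minor":     "WII",
    "wishlist":  "III",
}

# Same table with the suggested change: normal/wild-guess becomes I.
TWEAKED = dict(PROPOSED, normal="WWI")


def legacy_code(severity, certainty, table=PROPOSED):
    """Map a (severity, certainty) pair to the old E/W/I code."""
    return table[severity][CERTAINTIES.index(certainty)]


print(legacy_code("normal", "wild-guess"))           # proposed table
print(legacy_code("normal", "wild-guess", TWEAKED))  # with the change
```

Keeping the mapping as data rather than as nested conditionals makes it
easy to adjust individual cells as more tags get classified.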

> One of the things I'm not happy about is that all checks are executed
> and non-requested tags are simply hidden. This made sense before since
> there were only a small number of optional tags (Type: info), but with
> the new classification it will be possible to request only a small
> number of tags (e.g. serious & certain). I would like to take a look at
> this issue before the end of the GSoC project to see how much time is
> spent running collection/checks scripts.

In practice, there are a few checks that take up a *lot* of time (man page
processing, for example), and most checks are fairly fast once you have
all the data collected anyway.  Currently, the split between checks/*
scripts is a bit arbitrary and is mostly based on convenience and what
someone was thinking when they wrote the checks.

I'm hoping that the new Lintian::Data interface will provide a much better
way of getting the same data in all the checks/* scripts and make it
easier to put things into different checks/* scripts based on where they
logically belong, or on other concerns, instead of based on what scripts
have the right information available (although there are still going to be
issues around unpack levels and the like).

I think moving the most time-consuming checks into separate checks scripts
is potentially a very good idea, since then we can selectively exclude
particular check scripts, or even classify check scripts based on how
heavy they are and provide options to only run quick checks.
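
A minimal Python sketch of that selection logic, purely for illustration:
the CheckScript class, the "quick"/"heavy" cost labels, the select_checks
function and the example script names are all hypothetical and do not
describe anything Lintian currently provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckScript:
    name: str
    cost: str  # hypothetical label: "quick" or "heavy"


def select_checks(scripts, quick_only=False, exclude=()):
    """Return the names of the check scripts that should run.

    Scripts named in `exclude` are always skipped; with quick_only set,
    anything not labelled "quick" is skipped as well.
    """
    selected = []
    for script in scripts:
        if script.name in exclude:
            continue
        if quick_only and script.cost != "quick":
            continue
        selected.append(script.name)
    return selected


# Illustrative data only; the cost labels are made up for the example.
EXAMPLE = [
    CheckScript("manpages", "heavy"),
    CheckScript("fields", "quick"),
    CheckScript("changelog-file", "quick"),
]

print(select_checks(EXAMPLE, quick_only=True))
```

The same function covers both ideas: excluding particular check scripts by
name, and running only the ones classified as quick.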

> After the redesign of lintian.d.o, people have suggested some
> interesting improvements for the website. For instance, adding line
> charts to display the evolution of the number of tags (for each
> maintainer and tag), and even the kinds of tags (for maintainers).

That would be very neat.  We'd need more driver support so that, when
upgrading or testing new web page generation, we didn't mess up the
database or add a bunch of pointless data points, but I think there's a
ton of potential here.

> Another idea to improve lintian.d.o would be to provide a more stable
> and parseable output (e.g. YAML) so that other services such as qa.d.o
> and packages.qa.d.o could easily use the data. Do you think it is worth
> it?

Yes; we already produce a separate output file for QA purposes (although
it's not used right now, so far as I know), and more things along those
lines would be very useful for doing things like showing the Lintian
status of a package on its QA page or PTS page for the maintainer.
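
To give a feel for what such machine-readable output might look like, here
is a small Python sketch that groups tag records by package and serializes
them. It uses JSON from the standard library rather than YAML, purely to
keep the example self-contained; the record layout, the field names and
the function tags_to_json are assumptions for illustration, not an
existing Lintian format.

```python
import json


def tags_to_json(records):
    """Group (package, code, tag, extra) tuples by package and dump them.

    The field names "code", "tag" and "extra" are hypothetical.
    """
    by_package = {}
    for package, code, tag, extra in records:
        by_package.setdefault(package, []).append(
            {"code": code, "tag": tag, "extra": extra}
        )
    # sort_keys keeps the output stable between runs, which matters for
    # consumers that diff or cache the file.
    return json.dumps(by_package, indent=2, sort_keys=True)


RECORDS = [
    ("openafs-fileserver", "W", "binary-without-manpage",
     "usr/sbin/fssync-debug"),
]

print(tags_to_json(RECORDS))
```

A consumer such as a QA page generator could then load the file with any
JSON parser and look up a package by name.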

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>
