Re: Restructuring check scripts

To: debian-lint-maint@lists.debian.org
Subject: Re: Restructuring check scripts
From: Russ Allbery <rra@debian.org>
Date: Sat, 26 Dec 2009 21:57:52 -0800
Message-id: <87pr61ot5r.fsf@windlord.stanford.edu>
In-reply-to: <guoc00$b2o$1@ger.gmane.org> (Raphael Geissert's message of "Sun, 17 May 2009 01:49:26 -0500")
References: <gunt25$ie2$1@ger.gmane.org> <87hbzkmm5v.fsf@windlord.stanford.edu> <guoc00$b2o$1@ger.gmane.org>

Raphael Geissert <atomo64+debian@gmail.com> writes:
> Russ Allbery wrote:
>> Raphael Geissert writes:

>>> On another different but not too distant topic, I'd like to propose
>>> adding per-tag needs-info. Of course a global needs-info would still
>>> be allowed to declare collection scripts needed by most/all the tags.

>> The concern I have with this is that it seems really difficult to
>> maintain and test.  We already have a problem with people forgetting to
>> add needs-info to whole check scripts, leading to hidden bugs later.
>> This multiplies that problem by a lot, and testing it would require
>> some fairly massive expansion of our test suite (and would be really
>> slow).

> I wouldn't consider it much of a problem as long as the test and changes I
> made to include that information in every Lintian::Collect::* method are
> included and the usage of Lintian::Collect becomes the canonical data
> accessor.

Oh, good point.  That would make it way easier to deal with this, and I
think that's the right approach to take anyway.  In that case, we should
push forward with making Lintian::Collect the canonical interface to the
lab data, since I think that's the right move regardless.  Once we're
there, it will be easier to evaluate whether this would be useful.

>>> The idea is to later introduce an easy-to-use method to Tags that
>>> would allow a check script to know whether a given tag would ever be
>>> printed. If it is never going to be printed, why care about processing
>>> some data? why care to collect unused information?

>> I think the amount of time we're saving here isn't worth the
>> complexity.

> I'm hesitant about this. frontend/lintian already does something similar
> regarding running complete check scripts if no tag will be printed,
> which is a good idea in general, but a performance killer on ordinary
> runs.

Yeah, I was noticing that code; I think it could be made a lot more
efficient.  I bet we almost never exclude a check script.  That option (to
run only specific checks) is one of those things that Lintian's supported
forever but which I doubt people use.

There's now a method that will tell you whether or not a tag would be
displayed (Lintian::Tags->displayed).  It's not yet easily available to
the check scripts, since they don't have a Lintian::Tags object.  But
that's obviously fixable.

Note the difference between displayed() and suppressed(), though:
currently, tags have to be run in some cases even if they won't be
displayed because they affect the exit status and tag statistics for
things like overrides.  That's something we probably should reconsider,
since it would simplify things if displayed() and suppressed() became
synonymous.  I just kept it that way so as not to change existing
behavior.
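
A toy model of that distinction, sketched in Python rather than Lintian's
Perl.  Every name here (Settings, suppressed, displayed, run) is
hypothetical, and the rules are assumptions made for illustration, not the
actual logic of Lintian::Tags: an overridden tag is hidden from the output
but still counted, while a suppressed tag is dropped before it can affect
anything.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    show_info: bool = False       # assumed stand-in for -I
    show_overrides: bool = False  # assumed stand-in for showing overridden tags


def suppressed(level, settings):
    """True if the tag can be skipped entirely (assumed rule: info without -I)."""
    return level == "info" and not settings.show_info


def displayed(level, overridden, settings):
    """True if the tag would actually be printed."""
    if suppressed(level, settings):
        return False
    return settings.show_overrides or not overridden


def run(results, settings):
    """Return (printed tag names, statistics) for (name, level, overridden) triples."""
    shown = []
    stats = {"overridden": 0}
    for name, level, overridden in results:
        if suppressed(level, settings):
            continue  # safe to skip: affects neither output nor statistics
        if overridden:
            stats["overridden"] += 1  # counted even when it is not printed
        if displayed(level, overridden, settings):
            shown.append(name)
    return shown, stats


# Made-up results, not real Lintian tags.
RESULTS = [
    ("plain-tag", "error", False),
    ("overridden-tag", "error", True),
    ("info-tag", "info", False),
]
```

In this model, only the suppressed tag could have been skipped without
changing anything; the overridden one is not printed but still shows up in
the statistics, which is the gap between the two predicates.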

It will be a lot easier to do all of this as more code is moved out of the
hairy frontend/lintian script into documented modules with better-defined
APIs.

>>> A perfect example for this is spelling-error-in-binary, which needs -I
>>> and -E to be displayed. If the tag would never be displayed, and it is
>>> the only one requiring the 'strings' collection script (oops, it ain't
>>> the best example after all, since we now have embedded-zlib) then that
>>> collection script is not run and therefore the check script doesn't
>>> spend time on it (which is the only benefit it would gain in this
>>> case.)

>> Right, remember 95% of our optimization work should be on making
>> running the full set of checks faster, since that's almost the only way
>> that lintian is called in practice.  If we can make other things fast
>> in the process, that's nice, but not particularly important.

> What I'm suggesting is more likely to happen, since not everyone runs
> lintian with -I, nor -E, and far fewer with --pedantic.

I think it's an interesting question how often not using those flags can
change what collect scripts we run.  Right now, I'm not sure we have a
good way of figuring out that information, so we're somewhat guessing.  It
may be that we will be able to drop some collect scripts in the normal
case, and it's certainly possible in the new -F case, which checks only
ftp-master tags.  On the other hand, we may find that each of the
time-consuming collect scripts is required by some serious tag, and we
end up not really saving much.
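
One way to stop guessing would be to compute, from per-tag requirements,
which collect scripts are still needed under a given set of display flags.
The sketch below is in Python rather than Lintian's Perl and is only an
illustration: the Tag class, the level names, the tag names, and every
collection name except 'strings' are made up, and per-tag requirement data
does not exist in Lintian as described in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    name: str
    level: str                               # "error", "info" or "pedantic" (assumed levels)
    needs: set = field(default_factory=set)  # collect scripts this tag relies on


def displayed(tag, show_info=False, pedantic=False):
    """Assumed display rule: info tags need -I, pedantic tags need --pedantic."""
    if tag.level == "info":
        return show_info
    if tag.level == "pedantic":
        return pedantic
    return True


def needed_collections(tags, **flags):
    """Union of the collect scripts required by tags that would be displayed."""
    needed = set()
    for tag in tags:
        if displayed(tag, **flags):
            needed |= tag.needs
    return needed


# Hypothetical tags; only the 'strings' collection name comes from the thread.
TAGS = [
    Tag("serious-tag", "error", {"strings"}),
    Tag("info-tag-a", "info", {"strings"}),
    Tag("info-tag-b", "info", {"slow-collection"}),
    Tag("pedantic-tag", "pedantic", {"other-collection"}),
]
```

With this data, a default run still needs 'strings' because a serious tag
depends on it, so only 'slow-collection' and 'other-collection' could be
skipped.  Running the same computation over the real tag set would show
which of the two outcomes described above actually holds.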

We do have one check script that we can skip entirely unless -I was given,
and which can consume some time (huge-usr-share).

> This is an idea I've been playing around in my mind for a while, and
> always ended up with the same dilemma: how to determine what is less
> expensive between running some code, or determining whether that code
> should be run.  I always thought some sort of Weight complementary field
> could help, but again, evaluating the weight of a tag could be more
> expensive than running the code that would produce the tag.

At least in theory, Perl's profiling support should help, but I still
haven't had time to investigate it in any detail.

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>
