Re: RFC: further parallelisation (dependency-based collection and check scripts)

To: debian-lint-maint@lists.debian.org
Subject: Re: RFC: further parallelisation (dependency-based collection and check scripts)
From: Raphael Geissert <geissert@debian.org>
Date: Fri, 01 Jan 2010 18:29:07 -0600
Message-id: <[🔎] hhm40o$qt6$1@ger.gmane.org>
References: <h9er7b$nse$1@ger.gmane.org> <878wcoonc6.fsf@windlord.stanford.edu> <hh76be$pp8$1@ger.gmane.org> <878wcono3c.fsf@windlord.stanford.edu> <hhhi8u$df$1@ger.gmane.org> <874on64t4h.fsf@windlord.stanford.edu>
Russ Allbery wrote:
> Raphael Geissert writes:
> 
> Some definitions:
> 
> * A tag is a specific diagnosed issue with a Lintian package that Lintian
>   wants to output.

We need to find a better word other than "package." It doesn't fit for all
of its uses.

> 
> To display a tag, we need, in addition to all the information above:
> 
> * From the override processing:
>   - The override affecting this tag, if any (used in some output formats)

We should also finally add:
    - The override comment, if there is one

> * From the scheduling code:
>   - The file name (for changes tags), version, and architecture of the
>     package or changes file currently being processed.

Why not use the Source name on changes tags? it makes more sense than the
current format.
E.g. checking our changed-by-localhost.changes test would instead output:

E: changed-by-malformed changes: changed-by-address-is-on-localhost Someone
<someone@localhost.localdomain>

And now that we are also considering changes files as another type of
examination objects, I think it makes sense to treat dsc files as such too.

> 
>> * A tags container and manager (Lintian::Tags):
>> + It is a singleton.
>> + It would be responsible of creating tags (create_tag()) and setting
>> their properties.
>> - Overrides parsing should be moved somewhere else.
>> Lintian::Tag::Overrides
> 
> Agreed on moving override parsing.  I think we would instantiate a new
> Lintian::Tag::Overrides object for each new file that we process and load
> into it whatever overrides are found (possibly with the optimization of
> not creating an object if no override files are present).

I'd prefer if the object is always created, to avoid adding extra code that
skips all the calls to that object.

> The scheduling 
> code or whatever code handles running the collection scripts, would be
> responsible for creating this object and loading overrides into it where
> required.

I think we need some sort of process inter-communication here. Otherwise we
need to special case the override-file collection script.

We probably need to:
* Always run the override-file collection script (less special-casing)
* Add an overrides check which would query Lintian::Tag::Overrides to report
any failure while parsing _a given_ override file (the one(s) shipped by
the package, not the one provided by a user, if we ever allow that).
* Find a way to let the collection script tell Lintian::Tag::Overrides it
should load the files it finds (here's where the process
inter-communication parts comes into play) to finally remove any
special-casing.

> 
> I went back and forth on whether or not Lintian::Tags should just be
> rolled into Lintian::Output, but it does seem cleaner to have a separate
> interface for setting output options from setting options that affect
> whether tags are issued.

Agreed.

> 
>> + It knows about two contexts, the container context (i.e. the check
>> script) and the tags context (i.e. $package-$type).
>> + Calling new() with a tags context destroys its Lintian::Tag::Override
>> reference.
>> + The tags context is shared with Lintian::Output when it interacts with
>> it.
> 
> new() in Lintian::Tags?  Since it's a singleton, wouldn't we call that
> only once?  We don't want to reload the configuration information.

Right, new() isn't the best name (I was thinking of taking advantage of
Perl's style of objects which require bless()).

What I was thinking was letting new() take the new context, drop the
Lintian::Tag::Overrides ref and still return a reference to the
already-created object.
As the settings would not be passed on the second to nth call to new() they
would not be overridden. The frontend would create a new() Lintian::Tags
object with an empty context *but* with the settings.

Sounds like we actually want two separate methods (one being new(), the
other being something that allows the context to be changed).

> 
> I think the easiest way of handling the tags context would be to create a
> separate object that represents that information, Lintian::Tag::Package
> perhaps (although there will also be *.changes files, so it's not,
> strictly speaking, only a package).  That object would be created by the
> scheduler and would contain a Lintian::Tag::Override object if appropriate

Sounds like a good idea, but I think it should be an object of the Lintian::
namespace and not Lintian::Tag since it doesn't contain anything related to
tags (yes, Lintian::Tag* would use it, but not the way around).

> (or we could just merge the two objects). 

I'd prefer if we do not.

> It would be passed into 
> Lintian::Tags when the tags context changes, and Lintian::Tags would then
> pass it along to the Lintian::Output layer, which would query it for
> necessary information for the current output method.  Which I think is
> roughly what you're saying above.
> 

Yes.

> Currently, we have global statistics for the 
> override summary and to determine the exit status, but I think the
> scheduling code should retain only the specific information that it needs
> for the summaries that it wants and destroy the rest of the tags context
> object when moving to a new file.

Agreed.

> 
> I suspect that at least some of our continuing memory consumption when
> running Lintian across the entire archive is that all file metadata, all
> overrides for that file, and all statistics are accumulated across the
> entire run and never cleared out of memory.  In the archive-wide mode, I
> think the scheduler should not keep any cumulative statistics at all,
> since we don't use them; that's handled by post-processing the output log.

Right. A new command-line option could act as a switch.

> 
>> - Lintian::Tags should not know about files, versions, architecture, etc.
> 
> Agreed, I think this should be encapsulated in a separate object that's
> passed into Lintian::Tags, associated with tags, and then only asked
> questions where needed.

Yes, it should be encapsulated in Lintian::Tag::Package.

> 
>> + The container context descriptor (.desc file) should only be read when
>> requested. Issuing a tag() loads it.
>> + Check scripts runner (Lintian::Check atm) should switch the "context()"
>> of the Lintian::Tags object, so it knows what .desc file it should be
>> looking at.
> 
> Note that we have to read all of the *.desc files on every Lintian run now
> (except in the unusual case where the user tells is exactly which check
> scripts to run) because we need to know which collection scripts to run.
> If the user only tells us what tags to issue, we have to find those tags
> and determine what that means for which check scripts to run.  In that
> case, we also have to read the entire *.desc files to find the tags.

You got a point. I did not consider that.

> 
> We want to avoid unnecessary duplication of tag description, and we want
> to avoid duplication of effort in loading it.  Currently, the
> Lintian::Tag::Info implementation goes to some work to reuse data, but it
> doesn't provide any interface to the other data in the *.desc file needed
> to determine what collection scripts to run.
> 
> I suspect it's going to be faster or just as fast, once we've opened all
> of the *.desc files anyway, to just finish reading them and caching their
> content.  We want to do that once.  The model that comes to mind is:
> 
> * A Lintian::Check::Data object which loads all the *.desc files on new()
>   and has methods to answer the check dependency questions that we have
>   now, like what collection scripts are needed for a given set of checks
>   or tags.
> 
> * A singleton instance of that object which is passed into Lintian::Tags
>   so that it has a reference to it.
> 
> * A method in Lintian::Check::Data that's a generator function for
>   Lintian::Tag::Single objects.  It would create a new
>   Lintian::Tag::Single object and associate with it a reference to the
>   cached metadata for that tag, which would be reused for all tag objects
>   for that tag.  Lintian::Tags would call this method to create tag
>   objects when needed, and then query those tag objects for metadata
>   information to know whether to issue a tag.
> 

What about this:

* One Lintian::Check::Data object is created per check script, which will
answer questions like check name, abbreviation and package types it
operates on.
* After reading the .desc header, each instance will continue reading the
file, create new Lintian::Tag::Single objects and set their tag name,
description, severity, certainty, experimental flag, references, check
script it belongs to and _dependencies_ (taking the Needs-Info of the
header + an optional Needs-Info it finds on that tag).

The Lintian::Tag::Single class needs to be a per-tag singleton for this to
work.

This would accomplish multiple things:
a) We only read the .desc files once.
b) Lintian::Data::Check retains the necessary check information, but doesn't
duplicate any bit of of tags data.
c) By using the per-tag singletons every Lintian::Tag::Single created is
ready to be used, no need to query any other object. Creating tags would be
done directly, from wherever it is needed without having to request them to
another object.

> If we do that, we then don't need to have a check context, since
> Lintian::Check::Data will already have all the data loaded and can
> retrieve the appropriate tag metadata on demand.

Correct.

> 
>> + displayed() doesn't know about overrides but does know about suppressed
>> tags (i.e. tag() won't create a ::Single object if !displayed()).
>> + displayed() performs its checks by using a tag from create_tag()
>> without context.
> 
> In the above model, we would still ask Lintian::Check::Data for a tag
> object since we need access to the metadata about that tag to answer the
> displayed() question, similar to how we ask for a Lintian::Tag::Info
> object now.

By using this other model there's no need to ask Lintian::Check::Data for
anything, reducing the complexity of the model.

> 
> 
>> * A single tag (Lintian::Tag::Single):
>> + Based on the Lintian::Tag::Info code but without doing anything related
>> to references.
> 
> Do you mean the sources() method?  That's there because that's part of the
> tag issuing decision, so we'll need to keep that.  (But, in general, see
> above for how I would rework Lintian::Tag::Info.)

I was referring to the _format_reference, _manual_reference and
_load_manual_data methods, which don't really belong to this method.

> 
>> + They are per-tag singletons.
>> + It has an internal counter, every call to new() with a context resets
>> it. + The extra and overridden properties are cleared out on every call
>> to new(). Only if a context is defined.

The idea behind it was that each Lintian::Tag::Single object would know how
many times it was issued therefore solving the statistics problem. Now I
see that it doesn't really look like the right place to solve it.

To better clarify my idea of "per-tag singletons," here's what the creator
would look like:

our %TAGS;
sub create {
        my ($class, $tag, $data) = @_;

        croak('no tag specified') unless $tag;

        if (exists($TAGS{$tag})) {
                croak('attempted to create tag ',$tag,' more than once')
                        if defined($data);
                my $obj = $TAGS{$tag};
                # assuming overridden() and extra() are getters and setters at the same 
                # time:
                $obj->overridden(undef);
                $obj->extra(undef);
                return $obj;
        }
        croak('attempted to create unknown tag ',$tag)
                unless defined($data);

        my $self = {};
        bless($self, $class);
        # this should be done via a setter
        $self->{tag} = $tag;
        $self->{data} = $data;
        $TAGS{$tag} = $self;
        return $self;
}

Lintian::Check::Data would then create tags by calling
Lintian::Tag::Single->create('tag-name', $ref_to_data_hash);

And Lintian::Tags::tag(), Lintian::Tags::display() and friends would all do:
my $tag = Lintian::Tag::Single->create('tag-name');

> 
>> * Lintian::Tag::Info should then be moved to the output layer.
> 
> The only part of the current code that's really an output thing is the
> description() method.  We could modify that method in the replacement
> object to just return the raw data from the *.desc file and move all the
> formatting code into Lintian::Output.  I think that would make sense to
> do.  (And all of Text_utils.pm should move into Lintian::Output.)

Agreed.

> 
>> P.S. have a happy new year :)
> 
> You too!  And thank you for all the work you've done on Lintian over the
> last year!
> 

You are welcome :) you and the others have done a great work for years!.

Cheers,
-- 
Raphael Geissert - Debian Developer
www.debian.org - get.debian.net
Reply to:
Prev by Date: Bug#552052: lintian: add perl-modules to list of obsolete packages
Next by Date: [SCM] Debian package checker branch, master, updated. 2.3.0-46-gec1944c
Previous by thread: Processed: Re: Bug#552052: lintian: add perl-modules to list of obsolete packages
Next by thread: [SCM] Debian package checker branch, master, updated. 2.3.0-46-gec1944c
Index(es):
- Date
- Thread