Re: RFC: further parallelisation (dependency-based collection and check scripts)

To: debian-lint-maint@lists.debian.org
Subject: Re: RFC: further parallelisation (dependency-based collection and check scripts)
From: Russ Allbery <rra@debian.org>
Date: Thu, 31 Dec 2009 15:32:30 -0800
Message-id: <[🔎] 874on64t4h.fsf@windlord.stanford.edu>
In-reply-to: <[🔎] hhhi8u$df$1@ger.gmane.org> (Raphael Geissert's message of "Thu, 31 Dec 2009 01:01:55 -0600")
References: <h9er7b$nse$1@ger.gmane.org> <[🔎] 878wcoonc6.fsf@windlord.stanford.edu> <[🔎] hh76be$pp8$1@ger.gmane.org> <[🔎] 878wcono3c.fsf@windlord.stanford.edu> <[🔎] hhhi8u$df$1@ger.gmane.org>

Raphael Geissert <geissert@debian.org> writes:

> I was thinking we should have something like this:

Thank you for the analysis!  That helps me a lot.

Some definitions:

* A tag is a specific diagnosed issue with a Lintian package that Lintian
  wants to output.

* A tag definition is the static metadata about all tags of a particular
  name, which is contained in the *.desc file.

* Tags are "issued" if they add to the tag statistics or may contribute to
  the exit status of Lintian.  This means that overridden tags with the
  default options are issued, since they have to be recorded in the
  statistics so that we can summarize the number of overridden tags.

* Tags are "displayed" if they're actually sent to the output.  So
  overridden tags are issued but not displayed.  Tags outside the severity
  or certainty limits, from the wrong sources, not in a static tag list,
  or suppressed are neither issued nor displayed.

* The scheduling code is the part of Lintian that, given a list of file
  names obtained from somewhere, checks each of those files in turn and
  therefore knows what file is currently being checked, when that file
  changes, and so forth.

To determine whether to issue a tag, we need the following information:

* From the check code:
  - Tag name
* From the tag definition:
  - Severity
  - Certainty
  - Experimental flag
  - References

To determine whether a tag is overridden, we additionally need:

* From the check code:
  - Extra information
* From the scheduling code:
  - The package name and package type currently being processed (note that
    this could be provided by the check code, since it also has it)

To display a tag, we need, in addition to all the information above:

* From the override processing:
  - The override affecting this tag, if any (used in some output formats)
* From the scheduling code:
  - The file name (for changes tags), version, and architecture of the
    package or changes file currently being processed.
* From the tag definition:
  - The tag description

So I would say that in order to process a tag() call in a check script, we
need the following pieces:

* Translate the tag name into a rich object with additional metadata.

* Determine whether to issue the tag based on that metadata and
  configuration.  That configuration needs to be stored in some sort of
  global context, since it mostly comes from command-line options or
  Lintian's configuration.  If the tag is not issued, stop here.

* Determine whether the tag is overridden.  If so, record that in the
  statistics.  This is also all the additional information required to
  know whether to display it.

* Display the tag.

In addition, we need to, somewhere, record the package name, package type,
package version, file name, and architecture so that we can use them when
displaying tags.

So, given that, let me dive into the details.

> * A tags container and manager (Lintian::Tags):
> + It is a singleton.
> + It would be responsible of creating tags (create_tag()) and setting their
> properties.
> - Overrides parsing should be moved somewhere else. Lintian::Tag::Overrides

Agreed on moving override parsing.  I think we would instantiate a new
Lintian::Tag::Overrides object for each new file that we process and load
into it whatever overrides are found (possibly with the optimization of
not creating an object if no override files are present).  The scheduling
code or whatever code handles running the collection scripts, would be
responsible for creating this object and loading overrides into it where
required.

I went back and forth on whether or not Lintian::Tags should just be
rolled into Lintian::Output, but it does seem cleaner to have a separate
interface for setting output options from setting options that affect
whether tags are issued.

> + It knows about two contexts, the container context (i.e. the check script)
> and the tags context (i.e. $package-$type).
> + Calling new() with a tags context destroys its Lintian::Tag::Override
> reference.
> + The tags context is shared with Lintian::Output when it interacts with it.

new() in Lintian::Tags?  Since it's a singleton, wouldn't we call that
only once?  We don't want to reload the configuration information.

I think the easiest way of handling the tags context would be to create a
separate object that represents that information, Lintian::Tag::Package
perhaps (although there will also be *.changes files, so it's not,
strictly speaking, only a package).  That object would be created by the
scheduler and would contain a Lintian::Tag::Override object if appropriate
(or we could just merge the two objects).  It would be passed into
Lintian::Tags when the tags context changes, and Lintian::Tags would then
pass it along to the Lintian::Output layer, which would query it for
necessary information for the current output method.  Which I think is
roughly what you're saying above.

That object would also be responsible for accumulating statistics about
the tags issued for that package, currently tracked instead inside the
Lintian::Tags object.  Currently, we have global statistics for the
override summary and to determine the exit status, but I think the
scheduling code should retain only the specific information that it needs
for the summaries that it wants and destroy the rest of the tags context
object when moving to a new file.

I suspect that at least some of our continuing memory consumption when
running Lintian across the entire archive is that all file metadata, all
overrides for that file, and all statistics are accumulated across the
entire run and never cleared out of memory.  In the archive-wide mode, I
think the scheduler should not keep any cumulative statistics at all,
since we don't use them; that's handled by post-processing the output log.

> + When tags are created with a tag context the override property of
> Lintian::Tag::Single is set by checking if
> Lintian::Tag::Overrides::matches($tag). Where $tag is a
> Lintian::Tag::Single including $extra information.

Sounds right.

> + Tags are created as they are issued. Never issued -> never created.
> Suppressed -> never created.

As seen in the data model above, we need most of the information from the
*.desc file in order to determine whether or not to issue the tag, so we
still need to take whatever action loads that metadata.  Currently, that's
calling new() on Lintian::Tag::Info.  See below for a way to change that.

> - Lintian::Tags should not know about files, versions, architecture, etc.

Agreed, I think this should be encapsulated in a separate object that's
passed into Lintian::Tags, associated with tags, and then only asked
questions where needed.

> + The container context descriptor (.desc file) should only be read when
> requested. Issuing a tag() loads it.
> + Check scripts runner (Lintian::Check atm) should switch the "context()" of
> the Lintian::Tags object, so it knows what .desc file it should be looking
> at.

Note that we have to read all of the *.desc files on every Lintian run now
(except in the unusual case where the user tells is exactly which check
scripts to run) because we need to know which collection scripts to run.
If the user only tells us what tags to issue, we have to find those tags
and determine what that means for which check scripts to run.  In that
case, we also have to read the entire *.desc files to find the tags.

We want to avoid unnecessary duplication of tag description, and we want
to avoid duplication of effort in loading it.  Currently, the
Lintian::Tag::Info implementation goes to some work to reuse data, but it
doesn't provide any interface to the other data in the *.desc file needed
to determine what collection scripts to run.

I suspect it's going to be faster or just as fast, once we've opened all
of the *.desc files anyway, to just finish reading them and caching their
content.  We want to do that once.  The model that comes to mind is:

* A Lintian::Check::Data object which loads all the *.desc files on new()
  and has methods to answer the check dependency questions that we have
  now, like what collection scripts are needed for a given set of checks
  or tags.

* A singleton instance of that object which is passed into Lintian::Tags
  so that it has a reference to it.

* A method in Lintian::Check::Data that's a generator function for
  Lintian::Tag::Single objects.  It would create a new
  Lintian::Tag::Single object and associate with it a reference to the
  cached metadata for that tag, which would be reused for all tag objects
  for that tag.  Lintian::Tags would call this method to create tag
  objects when needed, and then query those tag objects for metadata
  information to know whether to issue a tag.

I think that would let us read the *.desc files only once, instead of
twice as we do right now, and keep all the data cached in memory.  The
parsing code between frontend/lintian and Lintian::Tag::Info would be
collected and merged in Lintian::Check::Data.

If we do that, we then don't need to have a check context, since
Lintian::Check::Data will already have all the data loaded and can
retrieve the appropriate tag metadata on demand.

> + displayed() doesn't know about overrides but does know about suppressed
> tags (i.e. tag() won't create a ::Single object if !displayed()).
> + displayed() performs its checks by using a tag from create_tag()
> without context.

In the above model, we would still ask Lintian::Check::Data for a tag
object since we need access to the metadata about that tag to answer the
displayed() question, similar to how we ask for a Lintian::Tag::Info
object now.

> * An overrides parser (Lintian::Tag::Overrides):
> + One or multiple files can be loaded()
> + Since it knows how to parse the files, it should know when a tag matches()
> an entry.

Yup, sounds right.

> * A single tag (Lintian::Tag::Single):
> + Based on the Lintian::Tag::Info code but without doing anything related to
> references.

Do you mean the sources() method?  That's there because that's part of the
tag issuing decision, so we'll need to keep that.  (But, in general, see
above for how I would rework Lintian::Tag::Info.)

> + They are per-tag singletons.
> + It has an internal counter, every call to new() with a context resets it.
> + The extra and overridden properties are cleared out on every call to
> new(). Only if a context is defined.
> + When all of their properties but extra an overridden are set they are said
> to be loaded().

I think the above model accomplishes roughly the same thing by embedding a
reference to the data in a newly created object.

> * Lintian::Tag::Info should then be moved to the output layer.

The only part of the current code that's really an output thing is the
description() method.  We could modify that method in the replacement
object to just return the raw data from the *.desc file and move all the
formatting code into Lintian::Output.  I think that would make sense to
do.  (And all of Text_utils.pm should move into Lintian::Output.)

> Using the architecture I described above, this could easily be done by
> caching all data until the tags context is changed. At which point it
> would go ahead and communicate with Lintian::Output.

Yup, exactly.

> P.S. have a happy new year :)

You too!  And thank you for all the work you've done on Lintian over the
last year!

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>

Reply to:

References:
- Re: RFC: further parallelisation (dependency-based collection and check scripts)
  - From: Russ Allbery <rra@debian.org>
- Re: RFC: further parallelisation (dependency-based collection and check scripts)
  - From: Raphael Geissert <geissert@debian.org>
- Re: RFC: further parallelisation (dependency-based collection and check scripts)
  - From: Russ Allbery <rra@debian.org>
- Re: RFC: further parallelisation (dependency-based collection and check scripts)
  - From: Raphael Geissert <geissert@debian.org>

Prev by Date: DRAFT 2: Bits from the Lintian maintainers
Previous by thread: Re: RFC: further parallelisation (dependency-based collection and check scripts)
Next by thread: [SCM] Debian package checker branch, master, updated. 2.3.0-4-g146dd9a
Index(es):
- Date
- Thread