Re: Dependency-based running of collection scripts

To: debian-lint-maint@lists.debian.org
Subject: Re: Dependency-based running of collection scripts
From: Russ Allbery <rra@debian.org>
Date: Sat, 09 May 2009 23:10:47 -0700
Message-id: <[🔎] 874ovto5co.fsf@windlord.stanford.edu>
In-reply-to: <grrvv4$1s5$1@ger.gmane.org> (Raphael Geissert's message of "Sun\, 12 Apr 2009 00\:59\:37 -0500")
References: <grrvv4$1s5$1@ger.gmane.org>

Raphael Geissert <atomo64+debian@gmail.com> writes:

>     Test for POD coverage
>     Include pedantic tags in lintian.log, but don't display them in the html
>       reports
>     Make t/runtests call prove in recursive mode so that tests can be
>       separated

These are applied.

I adjusted the commit comments a bit, usually by writing a summary and
moving the existing summary into the body of the commit message.  The
first line of the Git comment message should be a summary and should be
short.  That means it shouldn't end in a period and it should be about
50 characters at most.  That's what the various output formats expect to
produce nice output for things like shortlog.

> Run the collection scripts based on their dependencies instead of the
> order.

> This introduces Lintian::DepMap, a simple dependencies map creator and
> solver; which should later be used to drive the package processing
> loop.

Wow, this looks like a lot of work.  Thank you very much for putting
this much work into Lintian!  But I'm concerned it may also be overkill
and pretty complicated for the problem we're solving.  I'm not sure that
this is the right approach, or at least I think it could be simplified a
lot.

Let's start at the basics and the requirements.

We want to run some set of tests.  Each of those tests depend on some
set of collection data.  So for a first pass we have a list of
collection scripts that we want to run.  Each of those scripts may have
dependencies on other scripts, including some already in the set and
some that we wouldn't otherwise run but are required as prerequisites.
For each new prerequisite, it may in turn depend on other scripts that
we've not already run.

So, in other words, the set of required collection scripts is a strict
partial order.  We start from an unodered list of collection scripts to
run, read dependency information about them to sort them into a DAG,
and then take the transitive closure of the DAG.  (It would be nice if
there were some nice DAG modules already packaged, but none of them seem
to do what I want.)

Now, when running collection scripts, we have two requirements:

1. Run prerequisite scripts before those that depend on those scripts.
2. Run all independent scripts in parallel.

So, the algorithm is as follows: Start by finding all collection scripts
with no dependencies and start them all in parallel.  As those
collection scripts finish, remove them from the graph.  If that removes
the last dependency of a node, start that collection script.  Continue
until we've started all collection scripts, then wait for all remaining
ones to finish.

So, basic code is something like:

    @collect = uniq(get_required_collections(@checks));
    for $collect (@collect) {
        add_node($collect, \%collect);
    }

    sub add_node {
        my ($collect, $graph) = @_;
        @deps = get_deps($collect);
        for $dep (@deps) {
            $graph->{$collect}{parents}{$dep} = 1;
            add_node($dep, $graph) unless exists $graph->{$dep};
            $graph->{$dep}{children}{$collect} = 1;
        }
        $graph->{TOP}{$collect} = 1 unless @deps;
    }

to add the nodes.  Then keys %{ $graph->{TOP} } are the starting points,
and you do:

    sub next_nodes {
        my ($collect, $graph) = @_;
        my @free;
        delete $graph->{TOP}{$collect};
        for $child (keys %{ $graph->{$collect}{children} }) {
            delete $graph->{$child}{parents}{$collect};
            push(@free, $child) unless $graph->{$child}{parents};
        }
        return @free;
    }

as each collect script finishes and then start all the scripts returned.

This feels like less code than Lintian::DepMap and less complexity.
It's missing:

* Allowing circular dependencies.  But I think this is an error case,
  and we can just write a separate test case to test for it rather than
  putting that in the main code.

* Co-dependencies.  But I don't understand what this is for.  I think
  the above does everything we need to do without having that concept.
  Am I missing something here that makes this worthwhile to have going
  forward?

Certainly, we'd want to restructure the above code to be object-oriented
and add some of the accessor functions like known, but I'm not sure why
we'd want the selected/selectable code or codependencies and I think the
code would end up being a lot shorter.

As an aside, I got kind of squeamish at using Data::Dumper to do a deep
copy of the data structure.  It would be nice if there were some
non-destructive way to do the traversal, but failing that, just doing
something like:

  sub deep_copy {
      my $this = shift;
      if (not ref $this) {
          return $this;
      } elsif (ref $this eq "ARRAY") {
          return [ map deep_copy($_), @$this ];
      } elsif (ref $this eq "HASH") {
          return {map { $_ => deep_copy($this->{$_}) } keys %$this };
      } else {
          die "unsupported type ", ref $this;
      }
  }

into our utility library.  This is based on code from:

    http://www.stonehenge.com/merlyn/UnixReview/col30.html

(and the idea is small enough I'm not worried about licensing there).

Looking at frontend/lintian modifications, we really need a
restructuring of how Lintian::Command is used when running things in
parallel.  IPC::Run is really a bad fit for this since you can't get the
PID file out of IPC::Run, and what we really want is something like:

    # start a bunch of children storing PIDs in %children
    while (%children) {
        my $child = wait;
        # finish up with PID $child
        delete $children{$child};
    }

which you can't do with IPC::Run.  But we're not using any of the cool
facilities of IPC::Run when running collection scripts anyway, so I
would just replace all that with fork and exec, but probably stuff the
code into Lintian::Command, giving it some functions to handle sets of
children that complete asynchronously like this.  Maybe using some sort
of callback facility?

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>

Reply to:

Follow-Ups:
- Re: Dependency-based running of collection scripts
  - From: Russ Allbery <rra@debian.org>
- Re: Dependency-based running of collection scripts
  - From: Raphael Geissert <atomo64+debian@gmail.com>

Prev by Date: Bug#516530: [unpack,reporting] support multiple sections
Next by Date: [SCM] Debian package checker branch, master, updated. 2.2.10-30-g417acdd
Previous by thread: Bug#516530: [unpack,reporting] support multiple sections
Next by thread: Re: Dependency-based running of collection scripts
Index(es):
- Date
- Thread