
Re: Dependency-based running of collection scripts



Russ Allbery wrote:
[...]
> 
> So, in other words, the set of required collection scripts is a strict
> partial order.  We start from an unordered list of collection scripts to
> run, read dependency information about them to sort them into a DAG,
> and then take the transitive closure of the DAG.  (It would be nice if
> there were some nice DAG modules already packaged, but none of them seem
> to do what I want.)

I took a look at the available DAG modules, but none of them satisfied me,
so I decided to write something simple: simple enough that it wouldn't
make much sense to write a separate sub-module to handle the nodes.
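For illustration, the kind of structure I have in mind is roughly this (the
node names are made up, not the actual collection scripts):

    # A plain hash of node => [dependencies]; no node class needed.
    my %dag = (
        'unpacked'  => [],              # no dependencies
        'index'     => ['unpacked'],
        'file-info' => ['unpacked'],
    );

    # Nodes whose dependencies are all done are runnable.
    my %done;
    my @runnable = grep {
        my $node = $_;
        !grep { !$done{$_} } @{ $dag{$node} };
    } keys %dag;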

> 
> Now, when running collection scripts, we have two requirements:
> 
> 1. Run prerequisite scripts before the scripts that depend on them.
> 2. Run all independent scripts in parallel.
> 
> So, the algorithm is as follows: Start by finding all collection scripts
> with no dependencies and start them all in parallel.  As those
> collection scripts finish, remove them from the graph.  If that removes
> the last dependency of a node, start that collection script.

+ If that removes the last dependency of a *check* script, start it *after*
starting the other collection scripts in parallel.

> Continue 
> until we've started all collection scripts, then wait for all remaining
> ones to finish.

And "wait" is exactly the word I am trying to eliminate. Simply waiting
instead of doing something in the meantime is a waste of time.
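
To make the algorithm concrete, here is a rough sketch of the loop, using
the %dag hash from above (start_script() is a hypothetical fork+exec helper
that returns the child's pid; this is not the actual code):

    my %pending = map { $_ => [ @{ $dag{$_} } ] } keys %dag;   # copy dep lists
    my %running;                                               # pid => node

    my $start = sub {
        my ($node) = @_;
        delete $pending{$node};
        my $pid = start_script($node);   # hypothetical: fork+exec
        $running{$pid} = $node;
    };

    # Start everything that has no dependencies, in parallel.
    $start->($_) for grep { !@{ $pending{$_} } } keys %pending;

    while (%running) {
        my $pid  = waitpid(-1, 0);       # block until any child exits
        my $done = delete $running{$pid};
        # Remove the finished node from every dependency list and
        # start the nodes whose last dependency just cleared.
        for my $node (keys %pending) {
            @{ $pending{$node} } = grep { $_ ne $done } @{ $pending{$node} };
            $start->($node) if !@{ $pending{$node} };
        }
    }
    # Anything still left in %pending here is a dependency cycle.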

[...]
> 
> This feels like less code than Lintian::DepMap and less complexity.
> It's missing:
> 
> * Allowing circular dependencies.  But I think this is an error case,
>   and we can just write a separate test case to test for it rather than
>   putting that in the main code.
> 
> * Co-dependencies.  But I don't understand what this is for.  I think
>   the above does everything we need to do without having that concept.
>   Am I missing something here that makes this worthwhile to have going
>   forward?
> 

The idea of prioritising collection scripts over check scripts requires a
way to express that preference. Since it is only relevant for the first set
of collection scripts, and in order to avoid introducing the concept of
priorities into Lintian::DepMap, I decided it was easier to introduce the
concept of co-dependencies.
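
What the prioritisation amounts to is roughly this (the is_check() test and
the helper names are made up for illustration):

    # When nodes become selectable, fork the collection scripts first,
    # in parallel, and only then run the (cheap) check scripts.
    my @ready      = $map->selectable;
    my @collection = grep { !is_check($_) } @ready;
    my @checks     = grep {  is_check($_) } @ready;

    start_script($_) for @collection;   # fork these first
    run_check($_)    for @checks;       # then the check scripts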

If you take a look at the modified reap_collect_jobs you will see that the
goal is to return as soon as a collection script is done, so that both
collection and check scripts can be processed as soon as possible.
Since lightweight check scripts don't require much information they are
pretty fast, which makes them good candidates for the "do something
instead of waiting" approach.
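
The core of it is simply returning on the first child that exits, along
these lines (a minimal sketch, not the actual reap_collect_jobs):

    # %children maps pid => collection script name.
    sub reap_one {
        my ($children) = @_;
        my $pid = waitpid(-1, 0);     # returns as soon as *any* child exits
        return if $pid <= 0 || !exists $children->{$pid};
        my $script = delete $children->{$pid};
        return ($script, $? >> 8);    # script name and its exit status
    }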


> Certainly, we'd want to restructure the above code to be object-oriented
> and add some of the accessor functions like known, but I'm not sure why
> we'd want the selected/selectable code or codependencies and I think the
> code would end up being a lot shorter.

selected() is currently unused; it is a simple complement to selectable()
that I thought would be nice to have.
selectable() is the way to ask "what should I process now that I am not
already processing?"
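
In other words, something along these lines (a simplified sketch of the
idea, not the exact Lintian::DepMap internals):

    # A node is selectable if it has not been selected yet and all of
    # its dependencies have already been satisfied.
    sub selectable {
        my ($self) = @_;
        return grep {
            my $node = $_;
            !$self->{selected}{$node}
                && !grep { !$self->{satisfied}{$_} } @{ $self->{deps}{$node} };
        } keys %{ $self->{deps} };
    }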

> 
> As an aside, I got kind of squeamish at using Data::Dumper to do a deep
> copy of the data structure. 

I didn't find an easier way to clone the object with so many references.

> It would be nice if there were some non-destructive way to do the
> traversal,

It is destructive mainly because it avoids a lot of extra code that would
otherwise be needed.

> but failing that, just doing  
> something like:
> 
>   sub deep_copy {
>       my $this = shift;
>       if (not ref $this) {
>           return $this;
>       } elsif (ref $this eq "ARRAY") {
>           return [ map deep_copy($_), @$this ];
>       } elsif (ref $this eq "HASH") {
>           return {map { $_ => deep_copy($this->{$_}) } keys %$this };
>       } else {
>           die "unsupported type ", ref $this;
>       }
>   }
> 

Which sounds pretty much like what Data::Dumper does internally (not to
mention that Data::Dumper uses C code and therefore works at a lower level,
which is faster.)
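
(For the record, the core Storable module also provides a C-implemented
deep clone, which should do the same job:)

    use Storable qw(dclone);

    # Deep-copies nested hash, array and scalar references.
    my $copy = dclone($data);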

[...]
> 
> Looking at frontend/lintian modifications, we really need a
> restructuring of how Lintian::Command is used when running things in
> parallel.  IPC::Run is really a bad fit for this since you can't get the
> PIDs out of IPC::Run, and what we really want is something like:
> 
>     # start a bunch of children storing PIDs in %children
>     while (%children) {
>         my $child = wait;

The problem I see here is that we are again wait()ing, which in other
words means blocking.

>         # finish up with PID $child
>         delete $children{$child};
>     }
> 
> which you can't do with IPC::Run.  But we're not using any of the cool
> facilities of IPC::Run when running collection scripts anyway, so I
> would just replace all that with fork and exec, but probably stuff the
> code into Lintian::Command, giving it some functions to handle sets of
> children that complete asynchronously like this.  Maybe using some sort
> of callback facility?
> 

If it is done so that it is asynchronous and non-blocking (although one
often implies the other), I see no reason not to do it.
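
For example, a non-blocking variant of the loop quoted above could poll
with WNOHANG and do useful work while no child has exited yet (a sketch;
do_other_work() is a placeholder for e.g. running pending check scripts):

    use POSIX qw(WNOHANG);

    # start a bunch of children storing PIDs in %children
    while (%children) {
        my $child = waitpid(-1, WNOHANG);   # never blocks
        if ($child > 0) {
            # finish up with PID $child
            delete $children{$child};
            next;
        }
        # A real implementation would block on something here (e.g.
        # select() on a pipe) instead of spinning when there is no work.
        do_other_work();
    }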

Cheers,
-- 
Raphael Geissert - Debian Maintainer
www.debian.org - get.debian.net


