Re: Dependency-based running of collection scripts

To: debian-lint-maint@lists.debian.org
Subject: Re: Dependency-based running of collection scripts
From: Raphael Geissert <atomo64+debian@gmail.com>
Date: Sat, 16 May 2009 21:12:20 -0500
Message-id: <gunroe$ft3$1@ger.gmane.org>
References: <grrvv4$1s5$1@ger.gmane.org> <874ovto5co.fsf@windlord.stanford.edu> <gu5uto$kik$1@ger.gmane.org> <87bpq1l6xh.fsf@windlord.stanford.edu>

Russ Allbery wrote:
> Raphael Geissert writes:
>> Russ Allbery wrote:
[...]
> 
> Yeah, we could start running a check before all the collection scripts
> have finished.  But that shouldn't require anything different about the
> basic logic, just the ability to start running the odd check here and
> there while the collection scripts finish.

It is actually more a matter of the order of the elements of the array.

> 
>>> * Co-dependencies.

Now that I think about it, co-dependencies aren't the best solution for this
task. I am now thinking about writing another class that inherits
Lintian::DepMap's methods and provides some nice features such as:
* indicating whether a node is a collection script or a check
* sorting the output of selectable() so that collection scripts come first
  and checks later
* any other nifty feature we want that wouldn't fit in DepMap itself.
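
The sorting feature in the list above could look roughly like the following.
This is only a sketch under assumptions: sort_selectable, the %type hash and
the node names are made up for illustration and are not part of
Lintian::DepMap.

```perl
use strict;
use warnings;

# Hypothetical node types; the names are made up for illustration.
my %type = (
    'unpacked'    => 'collection',
    'file-info'   => 'collection',
    'check-files' => 'check',
    'check-menus' => 'check',
);

# Return the given nodes with collection scripts first and checks later,
# sorted by name within each group so the order is deterministic.
sub sort_selectable {
    my ($type_of, @nodes) = @_;
    my %rank = (collection => 0, check => 1);
    my @sorted = sort {
        $rank{ $type_of->{$a} } <=> $rank{ $type_of->{$b} } or $a cmp $b
    } @nodes;
    return @sorted;
}

my @ordered = sort_selectable(\%type,
    qw(check-menus file-info check-files unpacked));
print join(' ', @ordered), "\n";
# prints "file-info unpacked check-files check-menus"
```

A subclass could apply the same comparison to whatever selectable() returns,
leaving DepMap itself unaware of node types.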

>>> But I don't understand what this is for.  I think 
>>>   the above does everything we need to do without having that concept.
>>>   Am I missing something here that makes this worthwhile to have going
>>>   forward?
> 
>> The idea of prioritising collection scripts in favour of check scripts
>> requires a way to indicate that.
> 
> Why do we need to do that?
> 
> Checks are done in the main Lintian process.  Collection scripts are
> done in the background.  If we've unblocked a check, the main Lintian
> process can just go do all unblocked checks right away.  When it comes
> back, it can see if any collection scripts have finished and therefore
> any more checks are unblocked.

The idea is:
1.- We are running nothing, so we kick off the first set of scripts (in the
future this should be the unpack scripts; since all the checks and
collection scripts require at least unpack level one, this step blocks).
2.- Now we can start a bunch of collection scripts, so let's do that.
3.- One of those collection scripts is done and it made available: 1 check, 0
collection scripts. Let's work on that check.
4.- We are done with that check, so we come back and look at the list of
running jobs and see that one is done. It made available: 1 check
script, 2 collection scripts. Let's first kick off those collection scripts
and get back to the check script once we are sure *there's something else
running in parallel*.
5.- Repeat 3 and 4 over and over again.

Hope that explains the idea.
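
The five steps can be sketched as a small simulation. Everything here is an
assumption made for illustration: the node names, the %nodes layout and the
schedule() function are hypothetical, and background jobs are faked by letting
one running job "finish" per pass instead of forking anything.

```perl
use strict;
use warnings;

# Hypothetical dependency data: node => [type, prerequisites...].
my %nodes = (
    'unpack'  => ['collection'],
    'coll-a'  => ['collection', 'unpack'],
    'coll-b'  => ['collection', 'coll-a'],
    'check-x' => ['check', 'coll-a'],
    'check-y' => ['check', 'coll-b'],
);

# Walk the map and return the order of events as a list of strings.
# Background jobs are simulated: one running job "finishes" per pass.
sub schedule {
    my ($nodes) = @_;
    my (%done, %started, @running, @log);

    # Unstarted nodes of the wanted type whose prerequisites are all done.
    my $ready = sub {
        my ($want) = @_;
        my @found;
        for my $name (sort keys %$nodes) {
            my ($type, @deps) = @{ $nodes->{$name} };
            next if $type ne $want or $started{$name};
            my $missing = grep { !$done{$_} } @deps;
            push @found, $name unless $missing;
        }
        return @found;
    };

    while (scalar(keys %done) < scalar(keys %$nodes)) {
        my $progress = 0;

        # Collection scripts first, so that something runs in parallel.
        for my $coll ($ready->('collection')) {
            $started{$coll} = 1;
            push @running, $coll;
            push @log, "start $coll";
            $progress = 1;
        }

        # Then one check in the main process.
        if (my ($check) = $ready->('check')) {
            $started{$check} = $done{$check} = 1;
            push @log, "check $check";
            $progress = 1;
        }

        # Come back and see whether a job has finished (simulated).
        if (defined(my $job = shift @running)) {
            $done{$job} = 1;
            push @log, "done $job";
            $progress = 1;
        }

        die "unsatisfiable dependencies\n" unless $progress;
    }
    return @log;
}

print "$_\n" for schedule(\%nodes);
```

In the printed log, "start coll-b" appears before "check check-x": both become
available when coll-a finishes, and the collection script is started first, as
in step 4.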

> 
>>> Certainly, we'd want to restructure the above code to be
>>> object-oriented and add some of the accessor functions like known,
>>> but I'm not sure why we'd want the selected/selectable code or
>>> codependencies and I think the code would end up being a lot shorter.
> 
>> selected() is atm unused, it is a simple complement that I thought it
>> would be nice to have.  selectable() is the way to ask "what should I
>> process now that I am not already processing?"
> 
> But that's exactly what the next_nodes function in my example gives you.

Yes, but what I mean is: I don't see the point in doing basically the same
thing DepMap is already doing (and which is already coded).

> 
>>> It would be nice if there were some non-destructive way to do the
>>> traversal,
> 
>> It is destructive mainly because it avoids a lot of extra code that
>> would otherwise be needed.
> 
> Yeah, agreed.  It would be nice, but I didn't see a way to do it either.

I could make it rebuild the map on the fly in another set of variables
(say %old_map, %old_nodes) as the existing map is being destroyed.
An initialise() method would make sure old_map and old_nodes are empty,
moving their contents back to map and nodes if needed (although we need to
watch out for partially resolved maps), making it the only call needed
before we start using the map. I.e. instead of calling export and eval, we
would only call initialise(), always.
The main advantage is that we would not be cloning the object; we would
only be playing with references.
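
A minimal sketch of that idea, under assumptions: the data layout and the
new_map(), satisfy() and selectable() functions below are made up for
illustration and are much simpler than the real module. Only the initialise()
name comes from the proposal above.

```perl
use strict;
use warnings;

# Hypothetical layout: 'nodes' holds pending nodes (node => [parents]),
# 'old_nodes' collects the entries as they are satisfied.
sub new_map {
    my (%deps) = @_;
    return { nodes => { %deps }, old_nodes => {} };
}

# Destructive step: remove $node from the live map, but keep a reference
# to its entry in old_nodes so that it can be restored later.
sub satisfy {
    my ($self, $node) = @_;
    $self->{old_nodes}{$node} = delete $self->{nodes}{$node};
    return;
}

# Pending nodes whose parents are no longer pending.
sub selectable {
    my ($self) = @_;
    my @ready;
    for my $node (sort keys %{ $self->{nodes} }) {
        my $pending = grep { exists $self->{nodes}{$_} }
            @{ $self->{nodes}{$node} };
        push @ready, $node unless $pending;
    }
    return @ready;
}

# Move any saved entries back, whether the map was fully or only partially
# resolved, and leave old_nodes empty. Only references are moved.
sub initialise {
    my ($self) = @_;
    my $old = $self->{old_nodes};
    @{ $self->{nodes} }{ keys %$old } = values %$old;
    $self->{old_nodes} = {};
    return;
}

my $map = new_map('unpack' => [], 'file-info' => ['unpack']);
satisfy($map, 'unpack');
print join(',', selectable($map)), "\n";    # prints "file-info"
initialise($map);
print join(',', selectable($map)), "\n";    # prints "unpack"
```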

>>>     # start a bunch of children storing PIDs in %children
>>>     while (%children) {
>>>         my $child = wait;
> 
>> The problem I see here is that we are again 'wait'ing, which in other
>> words means: blocking.
> 
> It will only block if no children are finished, which is the one case
> where we *do* have to block because we can't do anything else.

But that doesn't play nicely with the idea of running checks in the meantime
in the main process, unless we start using threads (at least to the extent
that the main lintian process coordinates the running of all check and
collection scripts and there's only one other thread running a check; but
we would still need to make many classes, such as Tags, thread-friendly).
I only wish Perl had a way to start a thread without copying; that would
make running several threads for small tasks more sensible.
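
For reference, one way to look for finished children without blocking and
without threads is waitpid with the WNOHANG flag from POSIX. A minimal sketch,
assuming made-up helper names (start_job, reap_finished) and short-lived
sleeping children that merely stand in for collection scripts:

```perl
use strict;
use warnings;
use POSIX ':sys_wait_h';    # provides WNOHANG

# Start a child that stands in for a collection script.
sub start_job {
    my ($seconds) = @_;
    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;
    if ($pid == 0) {
        select(undef, undef, undef, $seconds);    # pretend to work
        POSIX::_exit(0);
    }
    return $pid;
}

# Reap every child that has already exited, without blocking.
# Returns the list of reaped PIDs (possibly empty).
sub reap_finished {
    my ($children) = @_;
    my @reaped;
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        delete $children->{$pid};
        push @reaped, $pid;
    }
    return @reaped;
}

my %children;
$children{ start_job($_) } = 1 for 0.1, 0.2;

my $passes = 0;
while (%children) {
    reap_finished(\%children);
    # This is where the main process would run one unblocked check;
    # the short pause is only a placeholder for that work.
    $passes++;
    select(undef, undef, undef, 0.05);
}
print "all children reaped after $passes passes\n";
```

The main process never sits in wait here; whether polling like this is
preferable to blocking when there is nothing else to do is a separate
question.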

[...]
> 
> I think this is minimal -- you never wait when you could be doing
> something.  But you can't implement this with IPC::Run without calling
> reap_nb on every outstanding object each time, which is less efficient
> and (more importantly) harder to follow.
> 

Sure, IPC::Run might not be the best, but it works, and its implementation is
irrelevant to the dependency-driven processing change.

By the way, could you please elaborate a bit more on the following?
> But I'm concerned it may also be overkill and pretty complicated for the
> problem we're solving.  I'm not sure that this is the right approach, or
> at least I think it could be simplified a lot.

If I removed the co-dependency code, would that, in your opinion, make it
less complicated?

I mean, bits like:

>    unless ($parents || scalar %{$self->{'nodes'}{$node}->{'parents'}}) {
>        $self->{'map'}{$node} = $self->{'nodes'}{$node};
>    } elsif (exists $self->{'map'}{$node}) {
>        delete $self->{'map'}{$node};
>    } else { 1; }

Only make sure that, no matter in which order nodes are added, we always get
a consistent state. This eliminates the need to first gather all the
information about collection scripts and checks and then carefully enter it
(which is not only a waste of time but also complicated to achieve, since
you would actually need a dependency resolver to enter the data in the right
order).
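
The order-independence property can be illustrated with a much smaller map
than the real one. This is only a sketch: add_node() and this selectable() are
hypothetical stand-ins, not the Lintian::DepMap methods, and parents that have
not been added yet are stored as placeholders.

```perl
use strict;
use warnings;

# Add $node with the given parents to $map. Parents that are not known
# yet are created as placeholders, so insertion order does not matter.
sub add_node {
    my ($map, $node, @parents) = @_;
    $map->{$node} ||= { parents => {}, known => 0 };
    $map->{$node}{known} = 1;
    for my $parent (@parents) {
        $map->{$parent} ||= { parents => {}, known => 0 };
        $map->{$node}{parents}{$parent} = 1;
    }
    return;
}

# Nodes that have been added and have no parents are ready to run.
sub selectable {
    my ($map) = @_;
    my @ready = grep {
        $map->{$_}{known} && !%{ $map->{$_}{parents} }
    } keys %$map;
    return sort @ready;
}

# Same data entered parent-first and child-first.
my (%forward, %reverse);
add_node(\%forward, 'unpack');
add_node(\%forward, 'file-info', 'unpack');
add_node(\%reverse, 'file-info', 'unpack');
add_node(\%reverse, 'unpack');

print join(',', selectable(\%forward)), "\n";    # prints "unpack"
print join(',', selectable(\%reverse)), "\n";    # prints "unpack"
```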

Other than that I don't see what you think is overkill. There are of
course some unused but complementary methods implemented as well, but that's
because I wanted to write a full resolver that I could even tweak a bit and
turn into a CPAN module.

Cheers,
-- 
Raphael Geissert - Debian Maintainer
www.debian.org - get.debian.net
