[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Deity and package ordering



Hi Ian, 

	I hear you asked about algorithms for package ordering. I am
 including what I sent to the Deity list below. (Also: my working
 prototype, pkg-order, encapsulates a number of user interface/api
 issues that may be of interest).

	I am including below my analysis of the DAG algorithms. (I've
 tried to coalesce a dialogue into a relevant message.

	Are you on the Deity list yet?

	manoj

Algorithms:
        I have been reading up on directed acyclic graphs and
 topological sorting. There are two methods to do the sort that I can
 see 
 a) Do a depth first search, printing out the node name after
    traversing the children (so each node occurs after all the
    dependents are listed; and reverse the order when printing)
  Algorithms [Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest]
  Algorithms in C [Robert Sedgewick]
   
 b) Look att all node with no dependencies, list them in any order,
    for each node listed, look for dependent packages and decrement the
    dependency count by one. repeat for all 0 order nodes after
    decrementing. 
  Algorithms [Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest]
  Discrete Mathematics and its applications [Kenneth H. Rosen]

        Both algorithms can work in O(V+E), where V is the number of
 nodes (packages), and E is the number of edges (dependencies). (I'm
 still struggling with an exercise in Algorithms [Thomas H. Cormen,
 Charles E. Leiserson, Ronald L. Rivest] which suggests that one may
 determine cycles in an undirected graph using an O(V) algorithm).

        Both these algorithms shall have to deal woth cycles, possibly
 by cutting links such that the cycle disappears. (In the forst
 algorithm, refuse to redo a node twice (color the node as needed), in
 the second node remove a dependency and decrement the indegree of the
 lowest degree's node, until it reaches 0 and one can try again.

        I'd like to also special case strongly connected areas of the
 graphs, since it is quite likey that the only cycles we may ever see
 may occur in strongly connected segments (package A depends on
 package B; package B also depends on package A).

        Most algorithmic methods for determining strongly connected
 regions of a graph are based on a depth first search, hence my
 predeliction for using this one algorithm, even though one gets less
 choices about the edge one snips to resolve cycles. I hope that if we
 coalesce strongly connected regions and install all those packages in
 one run of dpkg, we won't have that many cycles to contend with.

======================================================================
Algorithm A: Depth first search.

 1 For each package-versions which are candidates for installation; do
 2   Color each node white
 3 done

 4 For each package-versions which are candidates for installation; do
 5     if color of node is white; then
 6        Visit (Vertex)
 7     endif
 8 done

 Visit:
 9 color node Gray
10 for each dependency, predependency, and packages we are
       replacing, and, optionally, packages that have recommended or
       suggested us.
11     if dependency is predepends, 
12        add node to list of predepends
13     endif
14     if color of dependency not grey (else this was a cycle!)
15          visit (dependent)
16     endif
17 done
18 if we are in the list of predepends
19     add a break event on the ordered list
20  endif
21  add node to order list
22 Color node black
======================================================================

        What I like about algorithm A is that it is simpler to code
 (but not by much), what I like about the first is that one has some
 determination in how to break the cycle (search for the node with
 minimum indegrtee, and snip a link, preferably a suggests link or a
 recommends link). In the second algorithm one just follows predepends
 first, then depends, and so on, and hopes for the best about cycle
 removal. 

        Also, the this method does not call for reverse dependencies
 as algorithm below does.


======================================================================
 Algorithm b: (for packages being installed, removal is similar)
 # First generate the indegree of all package-versions
 1 For each package-versions which are candidates for installation; do
 2     for each dependency, predependency, and packages we are
       replacing, and, optionally, packages that have recommended or
       suggested us. set indegree
 3           package::indegree++;
 4     done
 5 done

 6 For each package-versions which are candidates for installation; do
 7     if indegree == 0; then
 8          push on todo list
 9     endif
10 done
 
        If there is nothing in the todo list, we had a cycle!! (handle
        it)
 
11 while there is a todo list; do
12     pop package off a to do list
13     if we are in uncleared predependent list; then
14        insert a break in processing marker on ordered list
15        clear uncleared predependent list
16     endif
17     push onto ordered list
18     remove from candidates list
19     for each target (reverse dependencies!) of depends and
               predepends, and packages we suggest or recommend
20         reduce indegree of target
21         if indegree of target == 0;
22             push target on todo list
23          endif
24     done
25 done

26 If not empty candidates list; there is a cycle!
======================================================================

Representation: 
        The data structures used for most graph algorithms is either:
 a) Adjacency matrix. Suppose the graph has V nodes and E edges. The
    adjacency matrix is a |V| x |V| array, with Ai,j = ! if an edge
    exists from vertex i to vertex j (or, in other words, it has a |E|
    non null elements). This representation is good for dense arrays
    (lots of edges). In our case, Vertexes are packages, and
    dependency relations are edges, and the matrix is quite sparse.

 b) adjacency list: This is an array (or linked list) of length
    |V|. Each node is the head of a linked list of dependencies. I
    think this is already how the backing store is represented (except
    that we have multiple kinds of dependencies, each of which gets a
    linked list (I think)

        Since the number of dependencies for any package are far less
 than the total number of packages, an adjacency list would be far
 more efficient way of representing the graphh. This is already very
 close to the data structure we use. 

        If we also construct the reverse depndency list in the backing
 store (cached list), then determining strongly connected regions
 becomes easier, since the reverse dependencies form a transposed
 directed graph of the original graph. Or else we may use tarjans
 variation of the depth first search to find strongly connected
 regions. 

        For dependency checking, we would need to have lists of
 packages installed, packages available, and packages which are
 candidates for installation.

Jason> This is unbelivably simple to code! You don't need any lists
Jason> but a single package extension structure containg the colour of
Jason> the package. It's also amazingly fast, the only loop is the
Jason> main one. The other one requires a reverse depends lookup,
Jason> which requires iterating over all packages depending on the
Jason> package, even if it is not a cand.

        But please note, each package has to have, in the relations
 list, any package that needs to be acted on before it. That includes,
 for installation:
 a) Anything we predepend on (simple)
 b) Anything we depend on (simple)
 c) Packages we conflict and replace, which have to be removed before
    we are installed. Requires a special case when building dependency
    lists, but still simple.
 d) packages that recommend us (*) reverse dependency, and should be
    marked as a weak link, so that it may be traversed last. 
 e) Packages that suggest us (*) reverse dependency

    This may require 3 passes over the list, one for normal relations,
    one for recommends, and one for suggests, which are weaker still.
        Or we may take care while building the list, at the cost of
        two extra pointers in the linked list header. Normal relatios
        are always inserted at the head of the list. Suggests are
        always inserted at the tail. recommends are always inserted at
        the "middle" pointer, so the ordering on the list is what we
        desire. 

 For removals (this seems simpler, we ignore recommends and suggest?)
 a) packages that pre-depend on us (what if they are not marked for
    removal?) (*) reverse dependency
 b) Packages that depend on us (*) reverse dependency

Jason> However, it's recursive, will the depth cause a problem for us?
Jason> I suppose I should go and calculate the worst case depth of the
Jason> routine (tomorrow I'll write a stats report into dpkg-cache)

        From my pkg-order work, it is rare to find a depth worse than
 6 in most cases, I have once seen a 10. Most of the dependencies are
 on a small number of packages (libc, for example), and once those
 nodes are painted black, they are ignored.

>> Provides should satisfy any non versioned dependency (I think
>> provides should only provide virtual packages, but I'm not sure
>> this is ever required anywhere)

Jason> Um, we are implementing enhanced provides, they have versions
Jason> and can match versioned depends with the 'provides'
Jason> version. Tom wants this.

Jason> I guess a provided targeted depends simply effects more than
Jason> one package.

        Hmmm. If we implement this, then provides become closer to
 first class citizens (less special cases required), but if we don't
 parse the versions on provided packages, then they should never match
 a versioned relationship. This shall have to have a special case in
 dependency determination code.

        It is not clear to me how one can actually implement versioned
 provides. Take gawk and mawk, which both provide awk. What version to
 apply to awk? It'll have to have a policy decicion, and the virtual
 package list should also list current versions of all virtual
 packages, and the criteria for bumping up a version number. (gawk's
 awk can do foo -- so it can provide awk 1.1, mawk still provides awk
 1.0) If both awks diverge, we run into problems. If mawk now provides
 feature bar, should it get 1.01? or 1.2? 

-- 
 Like a fine flower, beautiful to look at and scented too, fine words
 bear fruit in a man who acts well in accordance with them. 52
Manoj Srivastava  <srivasta@acm.org> <http://www.datasync.com/%7Esrivasta/>
Key C7261095 fingerprint = CB D9 F4 12 68 07 E4 05  CC 2D 27 12 1D F5 E8 6E


Reply to: