
Deity and package ordering

Hi Ian, 

	I hear you asked about algorithms for package ordering. I am
 including what I sent to the Deity list below. (Also: my working
 prototype, pkg-order, encapsulates a number of user interface/api
 issues that may be of interest).

	I am including below my analysis of the DAG algorithms. (I've
 tried to coalesce a dialogue into one relevant message.)

	Are you on the Deity list yet?


        I have been reading up on directed acyclic graphs and
 topological sorting. There are two methods to do the sort that I can
 think of:
 a) Do a depth first search, printing out the node name after
    traversing the children (so each node occurs after all its
    dependents are listed; reverse the order when printing)
  Algorithms [Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest]
  Algorithms in C [Robert Sedgewick]
 b) Look at all nodes with no dependencies and list them in any order;
    for each node listed, find its dependent packages and decrement
    their dependency count by one. Repeat for the new zero-count
    nodes until all nodes have been listed.
  Algorithms [Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest]
  Discrete Mathematics and its applications [Kenneth H. Rosen]

        Both algorithms run in O(V+E) time, where V is the number of
 nodes (packages) and E is the number of edges (dependencies). (I'm
 still struggling with an exercise in Algorithms [Thomas H. Cormen,
 Charles E. Leiserson, Ronald L. Rivest] which suggests that one may
 determine cycles in an undirected graph using an O(V) algorithm).

        Both these algorithms shall have to deal with cycles, possibly
 by cutting links so that the cycle disappears. (In the first
 algorithm, refuse to visit a node twice (color the nodes as needed);
 in the second, remove a dependency of the lowest-indegree node and
 decrement its indegree, until it reaches 0 and one can try again.)

        I'd like to also special-case strongly connected areas of the
 graph, since it is quite likely that the only cycles we may ever see
 occur in strongly connected segments (package A depends on
 package B; package B also depends on package A).

        Most algorithmic methods for determining the strongly connected
 regions of a graph are based on a depth first search, hence my
 predilection for that algorithm, even though one gets fewer
 choices about the edge one snips to resolve cycles. I hope that if we
 coalesce strongly connected regions and install all those packages in
 one run of dpkg, we won't have that many cycles to contend with.

Algorithm A: Depth first search.

 1 For each package-version which is a candidate for installation; do
 2   Color the node white
 3 done

 4 For each package-version which is a candidate for installation; do
 5     if color of node is white; then
 6        Visit (node)
 7     endif
 8 done

Visit (node):
 9 color node gray
10 for each dependency, predependency, and package we are
       replacing, and, optionally, each package that has recommended or
       suggested us; do
11     if dependency is a predepends; then
12        add node to list of predepends
13     endif
14     if color of dependency is white; then
15          Visit (dependency)
16     endif   (a gray dependency means we found a cycle!)
17 done
18 if we are in the list of predepends; then
19     add a break event to the ordered list
20 endif
21 add node to ordered list
22 color node black
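The pseudocode above can be sketched as follows. This is only an illustration of the white/gray/black coloring technique, not deity code; the `deps` map and the package names in it are hypothetical, and the predepends/break-event handling is omitted for brevity.

```python
# Sketch of algorithm A: depth-first topological sort with
# white/gray/black coloring.  `deps` maps each candidate package to
# the packages that must be set up before it (hypothetical data).
WHITE, GRAY, BLACK = 0, 1, 2

def order(deps):
    color = {pkg: WHITE for pkg in deps}
    ordered = []                 # filled in dependency-first order
    def visit(pkg):
        color[pkg] = GRAY
        for dep in deps.get(pkg, ()):
            if color.get(dep) == GRAY:
                continue         # gray dependency: a cycle, skip the back edge
            if color.get(dep, BLACK) == WHITE:
                visit(dep)
        color[pkg] = BLACK
        ordered.append(pkg)      # emitted after all its dependencies
    for pkg in deps:
        if color[pkg] == WHITE:
            visit(pkg)
    return ordered

deps = {"frob": ["libc"], "widget": ["frob", "libc"], "libc": []}
print(order(deps))  # libc before frob before widget
```

Note how a gray neighbor is simply skipped rather than revisited; that is the "refuse to visit a node twice" cycle-cutting mentioned earlier.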

        What I like about algorithm A is that it is simpler to code
 (but not by much); what I like about algorithm B is that one has some
 say in how to break a cycle (search for the node with minimum
 indegree, and snip a link, preferably a suggests link or a
 recommends link). In algorithm A one just follows predepends
 first, then depends, and so on, and hopes for the best about cycles.
        Also, algorithm A does not call for reverse dependencies,
 as algorithm B below does.

 Algorithm b: (for packages being installed, removal is similar)
 # First generate the indegree of all package-versions
 1 For each package-versions which are candidates for installation; do
 2     for each dependency, predependency, and package we are
       replacing, and, optionally, each package that has recommended or
       suggested us; do
 3           package::indegree++;
 4     done
 5 done

 6 For each package-versions which are candidates for installation; do
 7     if indegree == 0; then
 8          push on todo list
 9     endif
10 done
        If there is nothing in the todo list, we had a cycle!! (handle
        that here)
11 while there is a todo list; do
12     pop package off a to do list
13     if we are in uncleared predependent list; then
14        insert a break in processing marker on ordered list
15        clear uncleared predependent list
16     endif
17     push onto ordered list
18     remove from candidates list
19     for each target (reverse dependencies!) of depends and
               predepends, and packages we suggest or recommend
20         reduce indegree of target
 21         if indegree of target == 0; then
22             push target on todo list
23          endif
24     done
25 done

26 If not empty candidates list; there is a cycle!
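Algorithm B can be sketched like this. Again a hypothetical illustration: the `deps` map stands in for the candidates list, the `rev` index is the reverse-dependency lookup the algorithm needs, and the predependent break markers are left out.

```python
# Sketch of algorithm B: indegree counting (Kahn's method).
def order_b(deps):
    # 1. generate the indegree of all candidates
    indegree = {pkg: len(ds) for pkg, ds in deps.items()}
    rev = {pkg: [] for pkg in deps}      # reverse dependencies
    for pkg, ds in deps.items():
        for dep in ds:
            rev[dep].append(pkg)
    # 2. seed the todo list with zero-indegree packages
    todo = [pkg for pkg, d in indegree.items() if d == 0]
    ordered = []
    while todo:
        pkg = todo.pop()
        ordered.append(pkg)
        for target in rev[pkg]:          # reverse dependencies!
            indegree[target] -= 1
            if indegree[target] == 0:
                todo.append(target)
    if len(ordered) != len(deps):        # leftovers mean a cycle
        raise ValueError("cycle among remaining candidates")
    return ordered

print(order_b({"frob": ["libc"], "widget": ["frob"], "libc": []}))
```

If the candidates are not exhausted when the todo list runs dry, the leftovers all sit on a cycle, which is where one would pick a low-indegree node and snip a weak link.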

        The data structures used for most graph algorithms are either:
 a) Adjacency matrix. Suppose the graph has V nodes and E edges. The
    adjacency matrix is a |V| x |V| array, with Ai,j = 1 if an edge
    exists from vertex i to vertex j (in other words, it has |E|
    non-null elements). This representation is good for dense graphs
    (lots of edges). In our case, vertexes are packages,
    dependency relations are edges, and the matrix is quite sparse.

 b) Adjacency list: This is an array (or linked list) of length
    |V|. Each node is the head of a linked list of dependencies. I
    think this is already how the backing store is represented (except
    that we have multiple kinds of dependencies, each of which gets its
    own linked list, I think).

        Since the number of dependencies for any package is far smaller
 than the total number of packages, an adjacency list is a far
 more efficient way of representing the graph. This is already very
 close to the data structure we use.
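A toy comparison of the two representations, on a made-up 3-package graph (the names are illustrative only):

```python
# Contrast the two graph representations for a hypothetical
# 3-package graph: the matrix stores |V| x |V| cells, the adjacency
# list only one entry per dependency edge.
pkgs = ["libc", "frob", "widget"]
edges = [("frob", "libc"), ("widget", "frob")]

# a) adjacency matrix: |V| x |V|, mostly zeros for a sparse graph
idx = {p: i for i, p in enumerate(pkgs)}
matrix = [[0] * len(pkgs) for _ in pkgs]
for src, dst in edges:
    matrix[idx[src]][idx[dst]] = 1

# b) adjacency list: one dependency list per package
adj = {p: [] for p in pkgs}
for src, dst in edges:
    adj[src].append(dst)

cells = len(pkgs) ** 2                       # 9 matrix cells
stored = sum(len(v) for v in adj.values())   # only 2 list entries
print(cells, stored)
```

With ~2000 packages the matrix would hold four million cells while the list holds only one entry per actual dependency, which is why the list wins here.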

        If we also construct the reverse dependency list in the backing
 store (cached list), then determining strongly connected regions
 becomes easier, since the reverse dependencies form the transposed
 directed graph of the original graph. Or else we may use Tarjan's
 variation of the depth first search to find strongly connected
 components.
        For dependency checking, we would need to have lists of
 packages installed, packages available, and packages which are
 candidates for installation.

Jason> This is unbelievably simple to code! You don't need any lists
Jason> but a single package extension structure containing the colour of
Jason> the package. It's also amazingly fast, the only loop is the
Jason> main one. The other one requires a reverse depends lookup,
Jason> which requires iterating over all packages depending on the
Jason> package, even if it is not a cand.

        But please note, each package has to have, in the relations
 list, any package that needs to be acted on before it. That includes,
 for installation:
 a) Anything we predepend on (simple)
 b) Anything we depend on (simple)
 c) Packages we conflict with and replace, which have to be removed
    before we are installed. Requires a special case when building
    dependency lists, but still simple.
 d) Packages that recommend us (*) reverse dependency, and should be
    marked as a weak link, so that it may be traversed last.
 e) Packages that suggest us (*) reverse dependency

    This may require 3 passes over the list: one for normal relations,
    one for recommends, and one for suggests, which are weaker still.
        Or we may take care while building the list, at the cost of
        two extra pointers in the linked list header. Normal relations
        are always inserted at the head of the list. Suggests are
        always inserted at the tail. Recommends are always inserted at
        the "middle" pointer, so the ordering on the list is what we
        want.
 For removals (this seems simpler; we ignore recommends and suggests?)
 a) packages that pre-depend on us (what if they are not marked for
    removal?) (*) reverse dependency
 b) Packages that depend on us (*) reverse dependency

Jason> However, it's recursive, will the depth cause a problem for us?
Jason> I suppose I should go and calculate the worst case depth of the
Jason> routine (tomorrow I'll write a stats report into dpkg-cache)

        From my pkg-order work, it is rare to find a depth worse than
 6 in most cases; I have once seen a 10. Most of the dependencies are
 on a small number of packages (libc, for example), and once those
 nodes are painted black, they are ignored.

>> Provides should satisfy any non versioned dependency (I think
>> provides should only provide virtual packages, but I'm not sure
>> this is ever required anywhere)

Jason> Um, we are implementing enhanced provides, they have versions
Jason> and can match versioned depends with the 'provides'
Jason> version. Tom wants this.

Jason> I guess a provided targeted depends simply effects more than
Jason> one package.

        Hmmm. If we implement this, then provides become closer to
 first class citizens (fewer special cases required), but if we don't
 parse the versions on provided packages, then they should never match
 a versioned relationship. This will have to be a special case in the
 dependency determination code.

        It is not clear to me how one can actually implement versioned
 provides. Take gawk and mawk, which both provide awk. What version
 applies to awk? It'll have to be a policy decision, and the virtual
 package list should also list current versions of all virtual
 packages, and the criteria for bumping up a version number. (gawk's
 awk can do foo -- so it can provide awk 1.1, while mawk still provides
 awk 1.0.) If the two awks diverge, we run into problems. If mawk now
 provides feature bar, should it get 1.01? or 1.2?

 Like a fine flower, beautiful to look at and scented too, fine words
 bear fruit in a man who acts well in accordance with them. 52
Manoj Srivastava  <srivasta@acm.org> <http://www.datasync.com/%7Esrivasta/>
Key C7261095 fingerprint = CB D9 F4 12 68 07 E4 05  CC 2D 27 12 1D F5 E8 6E
