[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Misclassification of packages; "libs" and "doc" sections



Daniel Burrows wrote:
> 
> > Category: Lisp_Interpreter
> > IsA: Interpreter
> 
>   Ok, that looks nicer :)  I'm still missing one thing, maybe; it seems
> like maybe you're looking at having the nodes in the graph pointing
> at what they "inherit" from; I was looking at doing it the other way.  I'm not
> sure this is actually relevant, since most of the file-formats I've seen
> tossed around could be parsed either way depending on how you want to
> use them, though..

The idea is that once a class is stable, it might have as many subclasses
as possible. In our case, once a category is stable it should be easy to
add a subcategory. Also, it should be easy to define a package's category
without actually modifying the hierarchy. If the arcs were reversed, still
we'd have the same amount of information but then there'd be the overhead
of modifying the "hierarchy file" each time a new package is added, which
is not good. Because ultimately, we'd like to see a new "Category: "
or "Class:" tag in the package control files for modularity reasons.
Anyway, I've got some headache (drank a bit last night), but I think it's obvious.

If your point is the tree view, once you load the graph into memory you
should represent it in such a way that it's easier to provide the tree
view. You just run a simple algorithm over the graph and calculate
the reverse graph. 

> > Okay, you can make such a single "source" vertex to the entire graph.
> > But how do you guarantee that all vertices are reachable from it?
> > You just have to take into account all vertices that have no outgoing
> > edges. Those happen to be the "toplevel" categories, or the most
> > general classes in this hierarchy.
> 
>   Well, that was rather the point -- you can either store them as a "source",
> store them some other way, or recalculate them every time you want them..if
> you actually end up needing this information, it seems cleanest to me to view
> it as a special "source" node.  (since that lends itself to nice recursive
> algorithms without a special case for starting out)  This is a kind of picky
> point, though..
> 

You can cache the result of that computation; I presume for an interactive
tree view of the graph. These will form the first level of your tree view.
Then, since you have reversed the arcs, you can expand the tree just by
following the arcs. You don't have to make a special source node for this.
Simply store them as a list somewhere, or if you want you can add the source
node it's common practice to add source and sink nodes to a graph for
studying flow problems.

>   I'm hoping (cross your fingers!) I can get the next aptitude generation
> partly put together soon -- enough so that we could maybe build a "classifier"
> into it which partly automates this process (you get a list of unclassified
> packages and tell it which package goes in what categories, then it
> generates a big file with that info)  I'll make an announcement once 0.9.0 (the
> first prerelease for the next version, which will be a port of the current
> code to the new UI library -- there's a lot of work left to take full advantage
> of the new stuff) is working..

That's cool. I think aptitude is the best package UI in debian. I'm not
familiar with the code though. But if we're going to do classification
stuff, then we should implement it with a library approach.

> > not UI, but the formal definitions and a suitable category hierarchy.
> 
>   Ok.  UI will likely be trivial, actually; apt has a tagfile parser
> and aptitude is set up to allow flexible grouping..
> 

Nice. The first UI gets to be aptitude!

>   You mean autogenerate categories?  hmmmm...is NLP research really far enough
> along to do a reasonable job at this?
> 

Well, yes and no. I've talked to the NLP guys here and they told me that
we could do it simply by a hierarchical K-nearest clustering. Still a lot
of things have to be worked out, though. There's probably no freely available
tool for this task, but I could write an app. I already have a hierarchical
clusterer, and there're many ways to adapt text categorization to a general
clusterer. So it could be done. The problem is that the hiearchy I generate
is a tree and a language model that gives a similarity metric for words
is required (table and chair are more similar than space and lamb)

Among one of the popular things is I guess automatic categorization of
news articles according to subjects. There's even a standard data set
from Reuters to test the quality.

If I do this, it might even make a good conference paper. :)

Thanks,

-- 
Eray (exa) Ozkural
Comp. Sci. Dept., Bilkent University, Ankara
e-mail: erayo@cs.bilkent.edu.tr
www: http://www.cs.bilkent.edu.tr/~erayo



Reply to: