(Daniel: Note CC to you, I wouldn't mind further discussing this in
private :)

On Mon, 22 Apr 2002 14:13:07 -0400
Daniel Burrows <dburrows@debian.org> wrote:

> Creating the initial set of data took me 5-10 hours a day for a week
> or two, and it was incomplete. Keeping it mostly up-to-date
> afterwards took about 5-10 minutes a day on average (0 minutes some
> days, more others, depending on how many new packages there were)

That's about what I figured.

> The problem is that we have 9500 packages, and it's really hard
> to classify all of them in a sane and consistent manner -- from the
> sheer volume if nothing else. More than that, the problem is that
> people would rather theorize about the best possible ontological
> classification on mailing lists than sit down and categorize packages.

I've dealt with large data sets like this before (specifically
categorisation of some 14,000 error messages, which a tech support
person would look up). There is no way to categorise them in a "sane
and consistent manner". That would require a different hierarchy for
each different cultural/moral/philosophical background. Categorisations
basically depend on how the reader emphasises certain concepts/words.

What we ended up doing was picking somebody who would, as part of their
job description, categorise new error messages as they were created.
They were staff and unionised, not contract, so they were likely to be
around for years. She's the one who did the original categorisation.

People who had to consult the list often might have categorised things
differently from her, but over a (surprisingly short) period of time
they learned to take her choices into account and could predict with a
great degree of accuracy exactly where something would be in the tree.

The moral is that you really *can't* categorise everything in such a
way that everybody would know exactly where something is the first time
they look at a list. The next best thing is having a single person do
all the categorisation.

Giving packages multiple places in the tree is *extremely* good; it
almost eliminates the need to have a single person do it. (Though in my
experience, it's still better done that way.)

-- 
 ________________________________________________________________________
\ David B. Harris, Systems administrator | http://www.terrabox.com      /
/ eelf@sympatico.ca, elf@terrabox.com    | http://eelf.ddts.net         \
\======================================================================/
/ Clan Barclay motto: Aut agere, aut mori. (Either action, or death.)  \
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~