[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Too Many Packages [was Re: Splitting locales...]



On Mon, Oct 09, 2000 at 06:05:52PM -0400, Buddha Buck wrote:
> > On Mon, Oct 09, 2000 at 08:35:07AM -0500, David Starner wrote:
> > > Which changes nothing that I said. Directories are still going to be O(n)
> > > in total size, dselect is still going to be O(n) in memory, O(n log n) in
> > > sorting stuff. It's not a solution here.
> > Please, please, please read an algorithms book. Sorting strings of bounded
> > length is O(n).

Eh?

Comparison sorts are O(n lg(n)) in the limit. A comparison sort is one
where your only base operator is a strcmp() like function that takes a
constant amount of time.

If you have unbounded strings, you should include the length of the string
in your analysis, and you'll end up with O(n lg(n) k) as an upper bound,
where k is the length of your longest string.

If, on the other hand, you can do cleverer things than just a simple 
comparison (if you can create a set of O(n) buckets that will have an
upper bound of O(1) (ie a constant number) of collisions), and filling
those buckets takes O(1) (or O(k), k the maximum length of a string) time,
then you can trivially do the sort in O(n) (or O(nk)) time:

	create buckets (O(n) time, there are O(n) buckets)
	for each string:		(n * O(1))
		add to bucket x   (O(1))
	for each bucket:		(n * O(1) / n * O(k))
		sort bucket		(at most a constant number of elements,
					by assumption, so at most a constant
					amount of time for a comparison sort,
					so either O(1) or O(k))
		print bucket contents   (also O(1) or O(k))

However ensuring ou have at most a constant number of elements in each
bucket is difficult and depends crucially on how much you know about
your initial data set.

> I'd be sort of interested in knowing what we are sorting, and why.  We 
> have under 10,000 packages, and my /var/lib/dpkg/available is under 5MB 
> in size.  Sorting time doesn't seem like it should be our biggest 
> concern.

Definitely. (I don't have a clue what this thread is about, mind :)
We're working with large, but not inordinately huge, datasets, and
we have more than enough computing power to cope (albeit imperfectly)
with even downright bad algoirthmic choices wrt it (cf bugs.debian.org :).

And using qsort() from libc isn't a bad algorithmic choice.

Cheers,
aj

-- 
Anthony Towns <aj@humbug.org.au> <http://azure.humbug.org.au/~aj/>
I don't speak for anyone save myself. GPG signed mail preferred.

  ``We reject: kings, presidents, and voting.
                 We believe in: rough consensus and working code.''
                                      -- Dave Clark

Attachment: pgpbCSzIyI1oC.pgp
Description: PGP signature


Reply to: