[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Too Many Packages [was Re: Splitting locales...]



> On Mon, Oct 09, 2000 at 08:35:07AM -0500, David Starner wrote:
> > Which changes nothing that I said. Directories are still going to be O(n)
> > in total size, dselect is still going to be O(n) in memory, O(n log n) in
> > sorting stuff. It's not a solution here.
> 
> Please, please, please read an algorithms book. Sorting strings of bounded
> length is O(n).

I have seen that claim many times, but I always feel it's a bit 
disengenuous...

Sorting N distict items by radix sort is O(N log N).  Period.  Saying 
that is O(N) is lying.

How do I come to this conclusion?

Because a proper analysis of radix sort will say that sorting n strings 
is O(n*m), where m is the length of the strings (you have to do m 
passes of distribution and collection of the n items to sort, so you 
pass through all the data 2*m times).  But if you a) have a total of b 
buckets, and b) have n distinct strings, then the string length m > log 
n (base b).  So for arbitrary length strings, you have O(n log n).

That does -not- mean that for this problem space radix sort is 
inappropriate.  It may be.  Since we are dealing with a large number of 
buckets (96? 256?), we are gonna have to have a pretty big n before the 
log n portion starts to exceed any feasible choice for n.

> c.f.
> 	Knuth, "The Art of Computer Programming", Vol 3: "Sorting & Searching"
> 	Cormen, Leiserson, & Rivest, "Introduction to Algorithms"
> 	Baase, "Computer Algorithms: Introduction to Design & Analysis"

My Knuth is at work, but I do have CLR in front of me.  Note that the 
bucket sort described assumes that the distribution of items is 
uniform, which Debian package names aren't.  It also mentiones that for 
large amounts of data, the ability to sort in place can become very 
important and radix sort fails there -- leaving traditional O(n lg n) 
solutions as the best.

I'd be sort of interested in knowing what we are sorting, and why.  We 
have under 10,000 packages, and my /var/lib/dpkg/available is under 5MB 
in size.  Sorting time doesn't seem like it should be our biggest 
concern.

> 
> Look for "radix sort" and "bucket sort" in the indices in any of these
> three books, and please, bear that information in mind when you post to
> the list about the running time of sorting.
> 
> Bill
> 
> 
> -- 
> To UNSUBSCRIBE, email to debian-devel-request@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
> 

-- 
     Buddha Buck                             bmbuck@14850.com
"Just as the strength of the Internet is chaos, so the strength of our
liberty depends upon the chaos and cacophony of the unfettered speech
the First Amendment protects."  -- A.L.A. v. U.S. Dept. of Justice




Reply to: