[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Re : Re: Sorting data points

dbisnath@rave-tt.net wrote:
> Thanks DSC. Great site.
> To the first person who replied, sorry the message was deleted and i 
> lost your name. No, this was not a class assignment - not that anything 
> was wrong if it were so!  We were and are still are students, always
> learning. I simply asked what I wanted to know without wasting my and 
> other people's time with a larger than required message. After researching 
> on bubble sort, quicksort, vector etc etc
> I wanted to get second opinions on the best sort algorihtm.
> Thanks for you reply.

Okay, one other comment on identifying the best of the best in a lot of

The N*Log2(N) factor shows up in the maximum compression of random data,
in the conversion of accessible states to temperature, and in general
shows up in the 2nd law of thermodynamics.  What this means is that (1)
the 2nd law of thermo is not only a physical identity, it is a
mathematical identity and (2) You are going to run into it a lot for
data reduction/analysis/transformation.  Thus, for a lot of things
N*Log2(N) may be the mathematical optimum.

Now, that being said, N*Log2(N) is optimum for the truly random case.  A
lot of things are not random, and so you can get better than that for
the non-random case.  Consider the JPG image vs. the Run-length-encoded
TIFF for the case of a blank page.  The Run-length-encoded TIFF of any
size could be concievably compressed down to 2 bits.  The first bit says
"paint the entire page white."  The second bit says "That's all there
is."  In reality, you will have headers, a specific RLE format, and
such, so there is a minimum size for the TIFF.  But you get the idea.  

So if there is anything you *do* know about the image, you compress that
first.  Then you incorporate into that the best random compression you
are willing to pay for (in terms of time...  just as the only truly
reversible thermodynamic reaction takes forever, the only truly optimum
compression of random data takes forever.  Any increases in speed meet
with loses in compression.).  

In the same way, if there is anything you *do* know about the order in a
sort algorithm, you incorporate that into your sorting method.  For that
reason, for certain data sets, other sorts may be optimum.  That
includes the quick sort, or -- ideally for a mathematically predefined
data set -- a mathematical algorithm that will yield the order

If, for example, you had 12 different simple mathematical equations that
covered 12 different cases, I would combine the heap sort with the 12
different equations.  But to do that optimally, you still want to run
each equation a minimum number of times.  There your NLog2N will work
its way back in again... or had better, or you'll be wasting a lot of
unnecessary computation time.

Then, too, if you want to find the Nth item quickly, there might be a
whole different approach.  Perhaps the quicksort might be better there. 
I think nr.com talks about that a bit.

  ---> Just a whole bunch of thoughts about numerical optimization.

  - Michael

More on my own history behind these statements.

One thing that I think was great for my understanding of this area was
to try to program an "infinite compression routine."  For arguments'
sake, say I was using first of all variations of the LZW method, as
described in GIF definition.  So I ran the same program 20 times on the
same data.  I very carefully designed it so that the data would never
grow, but could transform.  My compression the first time was 30%.  2nd
pass, I got 9% additional compression.  3rd pass, I got 2%.  Fourth
pass, about 1/2%. 

I wondered why it wasn't working, so I took a look at the data that was
incoming.  I then changed the routine to try to optimize based upon the
data that was incoming -- especially all those 10..001 bit runs that
came out of the LZW compression.  I still didn't get much improvement. 
Oh, the second pass improved, but when I tracked the improvements due to
my new algorithm, they went away just as quickly. 

I then wrote a program to put up a histogram of the data in many
different forms, and took a look at the data.  What I saw was something
that (1) looked random (2) on each pass pushed closer and closer to what
I would call a normalized blackbody radiation curve (or perhaps a bell
curve, but there are many different bell curves possible).  And the
better my data approximated that curve, the less compression I could get
out of the data on the next pass.

More than that, no matter what form I was measuring the data
[distribution of values, distribution of 0 runs, distribution of
difference from previous byte, etc...], a compression run would push
*all forms* closer to that blackbody curve.  

It was then that I realized I was seeing a numerical version of the 2nd
law of thermo.  The more compression you get, the lower the state of
energy and the higher the state of entropy.  And no spontaneous decrease
in entropy can be expected.  

After all that, I have to say:  there's nothing like actually *building*
a perpetual motion machine to make a true believer in the 2nd law of
thermo.  The fact that I built mine in the world of data compression
simply let me not waste a lifetime of effort, since with my computer I
made hundreds of attempts in just a few days.  


Now, if you want a truly impressive algorithm, how about an algorithm
that you can often do on paper--it's so conceptually simple -- and will
give you a Taylor series for all variables for almost any set of
equations, linear, nonlinear, partial-differential, or otherwise
(initial value only... boundary conditions are not yet figured out.)?

Go look at the Parker - Souchacki solution to the Picard iteration.  You
should be able to find stuff on the web.

It's that powerful, it's new, and -- in my opinion -- it warrants an
algorithm library in Debian.

  - Michael

To UNSUBSCRIBE, email to debian-user-request@lists.debian.org 
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

Reply to: