[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: fstrcmp



Jérôme Pouiller writes:

> In another thread, Adeodato Simó wrote:
>> I can't see how it'd work here, at least without the help of some
>> on-disk structure, since we're talking about a space of 25,000
>> packages.
>
> Naive search of matching string under a set of 25,000 strings is 
> something like 2000 times slower (maybe far more) than a correct 
> algorithm. Adeodato thinks it is not usable. I said we shouldn't reject 
> idea because of performances and I suggested a better algorithm should 
> be used.

If we are pulling numbers out of the air, reading data from disk is
something like 2000 times slower (maybe far more) than doing string
comparisons.  I/O may dominate the running time even for a naive
comparison.

I have some relevant experience that makes me skeptical about needing
sophisticated structures to achieve acceptable performance: Ten years
ago, I wrote a simple spell checker as part of a job interview.  It
was able to suggest (ranked) close matches for misspelled words from a
~100,000-word dictionary within a few milliseconds.  Given that I had
less than an hour's notice about what I would have to write, and only
about 4 hours to write and test the program, I would be very surprised
if performance for any reasonable algorithm on more modern machines
will be bad enough to make it unusable.

It is not as if the application code should require significant
rewriting based on whether it performs an element-wise comparison or
whether it passes some function an array of strings against which to
match.  The library code is what would need much deeper changes.  So
why not see how well the existing library works before demanding that
it be radically rewritten?

If you want to implement a superior algorithm, no one will stop you.
If the fuzzy match only runs when no exact match is found, it suffers
no slowdown on success and may be fast enough when there is incorrect
input.  If there are people who think it is unacceptably slow, let
them come forward with actual numbers in hand.  Insisting on much more
development work because of an unsubstantiated concern seems like a
poor trade-off.

Michael


Reply to: