
Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way



I'm afraid your message didn't make it to the newsgroup provided by gmane, so 
I'm only now reading your email.

On Sunday 08 February 2009 21:11:20 Russ Allbery wrote:
> Raphael Geissert <atomo64+debian@gmail.com> writes:
> > There might indeed be a degradation on those texts where the number of
> > characters is near the limit of the methods switcher. Other than those
> > cases, when the number of words in the text is greater than the number
> > of words in the hash there should be a performance improvement when
> > searching for every known word in the text (there is indeed a
> > performance penalty because of the regex and the text traversal, but it
> > is not enough to make it slower than the other method).
>
> But why would that be the case?
>

I don't remember my reasoning behind the new implementation very well, but it 
was along the lines of it being faster because it only looks up known words, 
ignoring duplicates.

> There are two scaling factors here.  Let's say n is the length of the text
> and w is the number of corrections we have.  The current algorithm walks
> the text once and does a hash table lookup for each word.  (Well,
> technically it probably walks the text once to split it on whitespace and
> then a second time to do the hash table lookups.)  So, we do work
> proportional to the length of the text, or O(n).  The factor is roughly
> 2 plus the length of time it takes to hash a word.
>

Plus the lc and regex operations on every word (a word being whatever lies 
between two spaces, or between a space and the start or end of a line).
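For concreteness, here is a minimal Perl sketch of that per-word approach; the 
%CORRECTIONS hash and the sub name are illustrative, not the actual 
Spelling.pm code:

# A minimal sketch of the per-word approach, assuming an illustrative
# %CORRECTIONS hash mapping misspellings to their corrections.
my %CORRECTIONS = (teh => 'the', recieve => 'receive');

sub check_spelling_per_word {
    my ($text) = @_;
    my @found;
    for my $word (split /\s+/, $text) {
        $word = lc $word;
        $word =~ s/[^a-z]//g;    # strip all non-alphabetic characters
        next unless length $word;
        push @found, [$word, $CORRECTIONS{$word}]
            if exists $CORRECTIONS{$word};
    }
    return @found;
}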

> Here's the new code:
[...]
>
> The first branch is the current algorithm.  This code chooses that
> algorithm if n is small.  If n is large, it switches to doing a scan of
> the full text for each correction that we have, which is work proportional
> to w, the number of corrections.
>
> I was wrong originally -- this isn't O(n^2).  It's still O(n), but with a
> different factor, this time proportional to w.  It's O(wn), essentially.
>

There are two fewer operations being performed in the new implementation: lc 
and the stripping regex are applied once to the whole text, not per word. 
And I wouldn't consider it O(wn); it is more like O(wn/p), where p is related 
to the probability that a known word appears in the text, because if a known 
word appears, for example, at the beginning of the text, the scan for that 
word stops there instead of continuing over the rest of the text.
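
A minimal sketch of that per-correction scan, again with an illustrative 
%CORRECTIONS hash rather than the real list:

# A minimal sketch of the per-correction approach: lc and the stripping
# regex run once over the whole text, then the text is scanned once per
# known misspelling.  %CORRECTIONS is illustrative, not the real list.
my %CORRECTIONS = (teh => 'the', recieve => 'receive');

sub check_spelling_per_correction {
    my ($text) = @_;
    my @found;
    $text = lc $text;
    $text =~ s/[.,;:!?"'()]+/ /g;    # strip common punctuation once
    for my $bad (keys %CORRECTIONS) {
        # the match stops at the first occurrence of $bad instead of
        # continuing over the rest of the text
        push @found, [$bad, $CORRECTIONS{$bad}]
            if $text =~ /\b\Q$bad\E\b/;
    }
    return @found;
}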

> So, for this to be faster, the time it takes to check a portion of the
> text against every correction we have using a regex must be less than the
> time it takes to hash a word.
>
> This makes no sense to me.  Any reasonable hash algorithm should be much
> faster than more than 100 regex matches.
>
> The benchmarking looks like it may just be in the noise for me:
[...]

agreed

[...]
>
> So I'm getting basically similar results either way.  I would expect that.

I believe that's because the version of lintian you used for your tests didn't 
include the binary spell-checking patch, while my results did. Running lintian 
without that patch means it only spell-checks very small texts; in other 
words, the new algorithm is never exercised without the patch.

Anyway, I have written several different implementations. One is similar to 
the one I previously wrote, but turns the whole list of known bad words into a 
big ORed regex; as expected, it turned out a lot faster than my first attempt, 
but the vast majority of the time it was still slower than the current 
algorithm.
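
Roughly, that ORed-regex variant looks like this (a sketch, with an 
illustrative %CORRECTIONS hash; the real list has many more entries):

# A minimal sketch of the ORed-regex variant: all known misspellings
# joined into a single alternation that is compiled once and matched
# globally against the whole text.  %CORRECTIONS is illustrative.
my %CORRECTIONS = (teh => 'the', recieve => 'receive');
my $alternation = join '|', map { quotemeta } keys %CORRECTIONS;
my $bad_word_re = qr/\b($alternation)\b/;

sub check_spelling_ored {
    my ($text) = @_;
    my @found;
    $text = lc $text;
    while ($text =~ /$bad_word_re/g) {
        push @found, [$1, $CORRECTIONS{$1}];
    }
    return @found;
}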

These are the benchmark results of several methods, all dropping the regex 
that strips most non-word characters.

On the output of strings /usr/bin/php5 (50 times):
        Rate   bts  orig   new
bts   7.74/s    --  -44%  -61%
orig  13.7/s   77%    --  -30%
new   19.7/s  154%   43%    --

On /usr/share/common-licenses/GPL-3 (1000 times):
        Rate   bts  orig   new
bts   58.6/s    --  -60%  -76%
orig   146/s  148%    --  -40%
new    242/s  312%   66%    --

bts: the one I first submitted on this bug report
orig: the current one
new: the proposed one
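
For reference, the tables above are in the format printed by Perl's Benchmark 
module; the comparison can be reproduced roughly like this, with the sub names 
standing in for the three implementations sketched earlier:

use Benchmark qw(cmpthese);

# $text would be e.g. the output of `strings /usr/bin/php5` or the
# contents of /usr/share/common-licenses/GPL-3
my $text = do {
    local $/;
    open my $fh, '<', '/usr/share/common-licenses/GPL-3' or die $!;
    <$fh>;
};

cmpthese(1000, {
    bts  => sub { check_spelling_per_correction($text) },
    orig => sub { check_spelling_per_word($text) },
    new  => sub { check_spelling_ored($text) },
});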

The idea behind removing the regex that strips all non-alphabetic characters 
is that the likelihood of the resulting "word" then being an actual match 
should be extremely remote. Instead, the replacement only takes care of 
removing dots, commas, and other symbols that are commonly used in sentences.
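
To make the difference concrete, a sketch of the two normalisation steps (the 
replacement character class here is illustrative, not the exact one from the 
patch):

# Old: strip everything that is not a letter, applied per word.
$word =~ s/[^a-z]//g;

# Replacement sketch: only strip punctuation commonly used in
# sentences, applied to the whole text at once.
$text =~ s/[.,;:!?"'()]+/ /g;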

Cheers,
-- 
Raphael Geissert - Debian Maintainer
www.debian.org - get.debian.net

Attachment: lintian_spelling-optim.mbox
Description: application/mbox

