Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way

To: 514495@bugs.debian.org
Subject: Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way
From: Russ Allbery <rra@debian.org>
Date: Sun, 08 Mar 2009 18:59:14 -0700
Message-id: <87zlfvpibx.fsf@windlord.stanford.edu>
Reply-to: Russ Allbery <rra@debian.org>, 514495@bugs.debian.org
In-reply-to: <200902242339.28139.atomo64@gmail.com> (Raphael Geissert's message of "Tue, 24 Feb 2009 23:39:27 -0600")
References: <200902071903.09664.atomo64@gmail.com> <498e7d1d.030bca0a.6b6f.496b@mx.google.com> <871vu81ezr.fsf@windlord.stanford.edu> <200902242339.28139.atomo64@gmail.com>

Raphael Geissert <atomo64@gmail.com> writes:

> Anyway, I have written several different implementations. One is similar
> to the one I previously wrote, but turns the whole list of known bad
> words into one big ORed regex; as expected, it turned out a lot faster
> than my first one. But the vast majority of the time it was still slower
> than the current algorithm.
>
> These are the benchmark results of several methods, all dropping the
> regex that strips most non-word characters.
>
> On the output of strings /usr/bin/php5 (50 times):
>         Rate   bts  orig newfg
> bts   7.74/s    --  -44%  -61%
> orig  13.7/s   77%    --  -30%
> newfg 19.7/s  154%   43%    --
>
> On /usr/share/common-licenses/GPL-3 (1000 times):
>         Rate   bts  orig newfg
> bts   58.6/s    --  -60%  -76%
> orig   146/s  148%    --  -40%
> newfg  242/s  312%   66%    --
>
> bts: the one I first submitted on this bug report
> orig: the current one
> newfg: the proposed one
>
> The idea behind dropping the regex that removes all non-alphabetic
> characters is that the likelihood of the resulting "word" being an
> actual match should be extremely remote. Instead, the replacement takes
> care of removing dots, commas, and other symbols that are commonly used
> in sentences.

Yeah, this looks much better.  Applied with one change: keeping hyphens to
match the behavior of the previous code.
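
For readers following along, the two strategies discussed in this thread
can be sketched roughly as below. This is an illustration in Python, not
the Perl code in lib/Spelling.pm: the CORRECTIONS table, the function
names, and the exact punctuation set are all assumptions made for the
example.

```python
import re

# Hypothetical corrections table (misspelling -> correction).
# The real list used by the checker is much larger.
CORRECTIONS = {
    "accomodate": "accommodate",
    "recieve": "receive",
    "seperate": "separate",
}

# Strategy 1: one big alternation ("ORed regex") of every known bad word.
BAD_WORDS = re.compile(
    r"\b(?:" + "|".join(map(re.escape, sorted(CORRECTIONS))) + r")\b"
)

def check_alternation(text):
    """Scan the whole text once with the alternation regex."""
    return [(m.group(0), CORRECTIONS[m.group(0)])
            for m in BAD_WORDS.finditer(text.lower())]

# Strategy 2: remove only common sentence punctuation (hyphens are kept,
# an assumption mirroring the change mentioned above), split on
# whitespace, and look each word up in the table.
PUNCTUATION = re.compile(r"[.,;:!?()\"]")

def check_lookup(text):
    """Strip punctuation, then do one dictionary lookup per word."""
    cleaned = PUNCTUATION.sub("", text.lower())
    return [(word, CORRECTIONS[word])
            for word in cleaned.split() if word in CORRECTIONS]

if __name__ == "__main__":
    sample = "Please recieve this, and keep them seperate."
    print(check_alternation(sample))
    print(check_lookup(sample))
```

Because hyphens are kept in the second sketch, a token such as
"non-seperate" is looked up as a whole and does not match, whereas the
alternation regex treats the hyphen as a word boundary and reports
"seperate".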

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>
