Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way
Raphael Geissert <atomo64+debian@gmail.com> writes:
> There might indeed be a degradation on those texts where the number of
> characters is near the limit of the methods switcher. Other than those
> cases, when the number of words in the text is greater than the number
> of words in the hash there should be a performance improvement when
> searching for every known word in the text (there is indeed a
> performance penalty because of the regex and the text traversal, but it
> is not enough to make it slower than the other method).
But why would that be the case?
There are two scaling factors here. Let's say n is the length of the text
and w is the number of corrections we have. The current algorithm walks
the text once and does a hash table lookup for each word. (Well,
technically it probably walks the text once to split it on whitespace and
then a second time to do the hash table lookups.) So, we do work
proportional to the length of the text, or O(n). The factor is roughly
2 plus the length of time it takes to hash a word.
Here's the new code:
# Check whether we should check every word in the text or
# better look for every word we know about in the text.
# Use 10 as the average word length in %CORRECTIONS
if (length($text) < $#CORRECTIONS * 10) {
for my $word (split(/\s+/, $text)) {
if (exists $CORRECTIONS{$word}) {
_tag($tag, $filename, $word, $CORRECTIONS{$word});
}
}
} else {
for my $word (@CORRECTIONS) {
my $rword = qr<$word>;
_tag($tag, $filename, $word, $CORRECTIONS{$word})
if ($text =~ m,(^|\s)$rword(\s|$),);
}
}
The first branch is the current algorithm. This code chooses that
algorithm if n is small. If n is large, it switches to doing a scan of
the full text for each correction that we have, which is work proportional
to w, the number of corrections.
I was wrong originally -- this isn't O(n^2). It's still O(n), but with a
different factor, this time proportional to w. It's O(wn), essentially.
So, for this to be faster, the time it takes to check a portion of the
text against every correction we have using a regex must be less than the
time it takes to hash a word.
This makes no sense to me. Any reasonable hash algorithm should be much
faster than more than 100 regex matches.
The benchmarking looks like it may just be in the noise for me:
> Without the patch:
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m8.380s
> user 0m5.596s
> sys 0m0.640s
> (this is normal as the file is being read directly from the disk, not in
> memory)
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.701s
> user 0m5.800s
> sys 0m0.684s
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.424s
> user 0m5.740s
> sys 0m0.520s
>
> This time with the patch:
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.613s
> user 0m5.728s
> sys 0m0.580s
> (Let's ignore this one, so that disk reads do not influence the results)
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.254s
> user 0m5.564s
> sys 0m0.540s
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real 0m6.311s
> user 0m5.456s
> sys 0m0.600s
Here are my results.
Without the patch:
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb >/dev/null
2.304u 0.420s 0:05.08 53.5% 0+0k 16304+11392io 77pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.248u 0.468s 0:02.78 97.1% 0+0k 24+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.300u 0.400s 0:02.76 97.8% 0+0k 16+11392io 0pf+0w
With the patch:
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.264u 0.380s 0:02.75 96.0% 0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.228u 0.424s 0:02.82 93.6% 0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.144u 0.492s 0:02.79 94.2% 0+0k 0+11392io 0pf+0w
Without the patch again:
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.192u 0.448s 0:02.80 93.9% 0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.212u 0.448s 0:02.80 94.6% 0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.192u 0.484s 0:02.83 94.3% 0+0k 0+11392io 0pf+0w
So I'm getting basically similar results either way. I would expect that.
Either way, the spelling check is just not a big enough part of what
Lintian does to make any noticable difference in the overall running time
unless the algorithm were just horrible.
You need to benchmark specifically the algorithms that you're changing in
isolation to get a high enough granularity. When I do that, the new
algorithm looks much slower for long text:
windlord:~> ./benchmark
Benchmark: timing 1000 iterations of current, new...
current: 6 wallclock secs ( 5.11 usr + 0.00 sys = 5.11 CPU) @ 195.69/s (n=1000)
new: 12 wallclock secs (12.39 usr + 0.01 sys = 12.40 CPU) @ 80.65/s (n=1000)
Rate new current
new 80.6/s -- -59%
current 196/s 143% --
windlord:~> ./benchmark
Benchmark: timing 1000 iterations of current, new...
current: 6 wallclock secs ( 5.22 usr + 0.01 sys = 5.23 CPU) @ 191.20/s (n=1000)
new: 12 wallclock secs (12.27 usr + 0.02 sys = 12.29 CPU) @ 81.37/s (n=1000)
Rate new current
new 81.4/s -- -57%
current 191/s 135% --
Attached is the benchmarking script I used in case I made a mistake.
--
Russ Allbery (rra@debian.org) <http://www.eyrie.org/~eagle/>
#!/usr/bin/perl
use Benchmark qw(timethese cmpthese);
use lib '/home/eagle/dvl/debian/lintian/lib';
use Spelling;
our $text;
open (IN, '<', '/usr/share/common-licenses/GPL-3') or die;
{
local $/;
$text = <IN>;
}
close IN;
our @CORRECTIONS = keys %Spelling::CORRECTIONS;
sub current {
for my $word (split(/\s+/, $text)) {
$word = lc $word;
if (exists $Spelling::CORRECTIONS{$word}) {
print "Found spelling error $word\n";
}
}
}
sub new {
for my $word (@CORRECTIONS) {
my $rword = qr<$word>;
if ($text =~ m,(^|\s)$rword(\s|$),) {
print "Found spelling error $word\n";
}
}
}
$results = timethese(1000, {
current => \¤t,
new => \&new,
});
cmpthese($results);
Reply to: