Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way

To: 514495@bugs.debian.org
Subject: Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way
From: Russ Allbery <rra@debian.org>
Date: Sun, 08 Feb 2009 19:11:20 -0800
Message-id: <[🔎] 871vu81ezr.fsf@windlord.stanford.edu>
Reply-to: Russ Allbery <rra@debian.org>, 514495@bugs.debian.org
In-reply-to: <[🔎] 498e7d1d.030bca0a.6b6f.496b@mx.google.com> (Raphael Geissert's message of "Sun\, 08 Feb 2009 00\:35\:06 -0600")
References: <[🔎] 200902071903.09664.atomo64@gmail.com> <873aep90t9.fsf__23267.8000153557$1234056705$gmane$org@windlord.stanford.edu> <[🔎] 498e7d1d.030bca0a.6b6f.496b@mx.google.com>

Raphael Geissert <atomo64+debian@gmail.com> writes:

> There might indeed be a degradation on those texts where the number of
> characters is near the limit of the methods switcher. Other than those
> cases, when the number of words in the text is greater than the number
> of words in the hash there should be a performance improvement when
> searching for every known word in the text (there is indeed a
> performance penalty because of the regex and the text traversal, but it
> is not enough to make it slower than the other method).

But why would that be the case?

There are two scaling factors here.  Let's say n is the length of the text
and w is the number of corrections we have.  The current algorithm walks
the text once and does a hash table lookup for each word.  (Well,
technically it probably walks the text once to split it on whitespace and
then a second time to do the hash table lookups.)  So, we do work
proportional to the length of the text, or O(n).  The factor is roughly
2 plus the length of time it takes to hash a word.

Here's the new code:

    # Check whether we should check every word in the text or
    # better look for every word we know about in the text.
    # Use 10 as the average word length in %CORRECTIONS
    if (length($text) < $#CORRECTIONS * 10) {
	for my $word (split(/\s+/, $text)) {
	    if (exists $CORRECTIONS{$word}) {
		_tag($tag, $filename, $word, $CORRECTIONS{$word});
	    }
	}
    } else {
	for my $word (@CORRECTIONS) {
	    my $rword = qr<$word>;
	    _tag($tag, $filename, $word, $CORRECTIONS{$word})
		if ($text =~ m,(^|\s)$rword(\s|$),);
	}
    }

The first branch is the current algorithm.  This code chooses that
algorithm if n is small.  If n is large, it switches to doing a scan of
the full text for each correction that we have, which is work proportional
to w, the number of corrections.

I was wrong originally -- this isn't O(n^2).  It's still O(n), but with a
different factor, this time proportional to w.  It's O(wn), essentially.

So, for this to be faster, the time it takes to check a portion of the
text against every correction we have using a regex must be less than the
time it takes to hash a word.

This makes no sense to me.  Any reasonable hash algorithm should be much
faster than more than 100 regex matches.

The benchmarking looks like it may just be in the noise for me:

> Without the patch:
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real    0m8.380s
> user    0m5.596s
> sys     0m0.640s
> (this is normal as the file is being read directly from the disk, not in
> memory)
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real    0m6.701s
> user    0m5.800s
> sys     0m0.684s
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real    0m6.424s
> user    0m5.740s
> sys     0m0.520s
>
> This time with the patch:
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real    0m6.613s
> user    0m5.728s
> sys     0m0.580s
> (Let's ignore this one, so that disk reads do not influence the results)
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real    0m6.254s
> user    0m5.564s
> sys     0m0.540s
> $ time
> xlintian /var/cache/apt/archives/php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb
> [...]
> real    0m6.311s
> user    0m5.456s
> sys     0m0.600s

Here are my results.

Without the patch:

windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb >/dev/null
2.304u 0.420s 0:05.08 53.5%     0+0k 16304+11392io 77pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.248u 0.468s 0:02.78 97.1%     0+0k 24+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.300u 0.400s 0:02.76 97.8%     0+0k 16+11392io 0pf+0w

With the patch:

windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.264u 0.380s 0:02.75 96.0%     0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.228u 0.424s 0:02.82 93.6%     0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.144u 0.492s 0:02.79 94.2%     0+0k 0+11392io 0pf+0w

Without the patch again:

windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.192u 0.448s 0:02.80 93.9%     0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.212u 0.448s 0:02.80 94.6%     0+0k 0+11392io 0pf+0w
windlord:~/tmp> time lintian --root ~/dvl/debian/lintian php5-cli_5.2.6.dfsg.1-1+lenny2_i386.deb > /dev/null
2.192u 0.484s 0:02.83 94.3%     0+0k 0+11392io 0pf+0w

So I'm getting basically similar results either way.  I would expect that.
Either way, the spelling check is just not a big enough part of what
Lintian does to make any noticable difference in the overall running time
unless the algorithm were just horrible.

You need to benchmark specifically the algorithms that you're changing in
isolation to get a high enough granularity.  When I do that, the new
algorithm looks much slower for long text:

windlord:~> ./benchmark
Benchmark: timing 1000 iterations of current, new...
   current:  6 wallclock secs ( 5.11 usr +  0.00 sys =  5.11 CPU) @ 195.69/s (n=1000)
       new: 12 wallclock secs (12.39 usr +  0.01 sys = 12.40 CPU) @ 80.65/s (n=1000)
          Rate     new current
new     80.6/s      --    -59%
current  196/s    143%      --
windlord:~> ./benchmark
Benchmark: timing 1000 iterations of current, new...
   current:  6 wallclock secs ( 5.22 usr +  0.01 sys =  5.23 CPU) @ 191.20/s (n=1000)
       new: 12 wallclock secs (12.27 usr +  0.02 sys = 12.29 CPU) @ 81.37/s (n=1000)
          Rate     new current
new     81.4/s      --    -57%
current  191/s    135%      --

Attached is the benchmarking script I used in case I made a mistake.

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>

#!/usr/bin/perl
use Benchmark qw(timethese cmpthese);
use lib '/home/eagle/dvl/debian/lintian/lib';
use Spelling;

our $text;
open (IN, '<', '/usr/share/common-licenses/GPL-3') or die;
{
    local $/;
    $text = <IN>;
}
close IN;

our @CORRECTIONS = keys %Spelling::CORRECTIONS;

sub current {
    for my $word (split(/\s+/, $text)) {
        $word = lc $word;
        if (exists $Spelling::CORRECTIONS{$word}) {
            print "Found spelling error $word\n";
        }
    }
}

sub new {
    for my $word (@CORRECTIONS) {
        my $rword = qr<$word>;
        if ($text =~ m,(^|\s)$rword(\s|$),) {
            print "Found spelling error $word\n";
        }
    }
}

$results = timethese(1000, {
    current => \&current,
    new     => \&new,
});
cmpthese($results);

Reply to:

Follow-Ups:
- Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way
  - From: Raphael Geissert <atomo64@gmail.com>

References:
- Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way
  - From: Raphael Geissert <atomo64@gmail.com>
- Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way
  - From: Raphael Geissert <atomo64+debian@gmail.com>

Prev by Date: Re: DRAFT: Bits from the Lintian maintainers
Next by Date: Re: Couple of patches
Previous by thread: Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way
Next by thread: Bug#514495: [lib/Spelling.pm] check the spelling of large texts in a more efficient way
Index(es):
- Date
- Thread