
Re: check_typos.pl script



Jens Seidel wrote:

I implemented three tests:
 * swapped characters (helol)
 * duplicated character (helllo)
 * removed characters (helo)
 (a check for doubled words is missing missing)
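As a rough illustration of the three tests listed above, the following sketch generates, for a given word, the misspellings each test would look for. The sub names are invented here and are not taken from check_typos.pl; the real script may work quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper: variants with two adjacent characters swapped.
sub swapped_variants {
    my ($word) = @_;
    my %seen;
    for my $i (0 .. length($word) - 2) {
        my $variant = $word;
        # exchange the characters at positions $i and $i + 1
        substr($variant, $i, 2) = scalar reverse(substr($word, $i, 2));
        $seen{$variant} = 1 if $variant ne $word;
    }
    return sort keys %seen;
}

# Hypothetical helper: variants with one character typed twice.
sub duplicated_variants {
    my ($word) = @_;
    my %seen;
    for my $i (0 .. length($word) - 1) {
        my $variant = $word;
        # insert a copy of the character at position $i in front of it
        substr($variant, $i, 0) = substr($word, $i, 1);
        $seen{$variant} = 1;
    }
    return sort keys %seen;
}

# Hypothetical helper: variants with one character missing.
sub removed_variants {
    my ($word) = @_;
    my %seen;
    for my $i (0 .. length($word) - 1) {
        my $variant = $word;
        # delete the character at position $i
        substr($variant, $i, 1) = '';
        $seen{$variant} = 1;
    }
    return sort keys %seen;
}

print join(' ', swapped_variants('hello')), "\n";
```

Each variant could then be looked up in the word list of the checked files; a variant that occurs there only rarely is a typo candidate.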

More generally, you can compute the 'edit distance': the minimum number of single-character insertions, deletions, or substitutions needed to turn one word into the other.

An edit distance of 1 would also catch 'hallo', 'gello', 'ello', or 'hell' (the last is itself a correct word, so a spellchecker would accept it).
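For reference, a minimal sketch of the plain Levenshtein distance with the usual dynamic-programming table, kept to two rows. It is only meant to make the definition concrete; the sub name is chosen here for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic Levenshtein distance: insertions, deletions and
# substitutions each cost 1.
sub levenshtein {
    my ($string1, $string2) = @_;
    my @chars1 = split //, $string1;
    my @chars2 = split //, $string2;

    # Row 0: distance from the empty prefix of $string1
    # to every prefix of $string2.
    my @previous = (0 .. scalar @chars2);

    for my $i (1 .. scalar @chars1) {
        my @current = ($i);
        for my $j (1 .. scalar @chars2) {
            my $cost = $chars1[$i - 1] eq $chars2[$j - 1] ? 0 : 1;
            push @current, min(
                $previous[$j] + 1,            # deletion
                $current[$j - 1] + 1,         # insertion
                $previous[$j - 1] + $cost,    # match or substitution
            );
        }
        @previous = @current;
    }
    return $previous[-1];
}

print levenshtein('hello', $_), " $_\n" for qw(hallo gello ello hell helol);
```

Note that with this definition a swap such as 'helol' counts as two edits, not one.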

In my script, which has a similar intention, I coded the 'Ukkonen' algorithm after finding that the more general 'Levenshtein' is very slow.

sub ukkonen {
    my (
        $string1,
        $string2,
        $max_distance
        ) = @_;

    my ($length1, $length2) = (length $string1, length $string2);

    my $length_difference = abs ($length1 - $length2);

    if ($length_difference > $max_distance) {
        return ($length_difference);
    }

    if ($string1 eq $string2 ) {
        return (0);
    }


    my @array1 = split(//, $string1);
    my @array2 = split(//, $string2);

    my $i = 0;
    my $j = 0;

    my $error_count = 0;

#TRY: while ( $error_count <= $max_distance ) {
TRY: while ( ($error_count <= $max_distance) and ( ($i < $length1) and ($j < $length2) ) ) {
        # The current characters match.
        #   Advance to next character in both strings.
        if ( $array1[$i] eq $array2[$j] ) {
            $i++;
            $j++;
            next TRY;
        }

        # The current character matches the next in the other string.
        #   Advance to next character in other string, error++.
        if ( ($j+1) < $length2 ) {
            if ($array1[$i] eq $array2[$j + 1]) {
                $error_count++;
                $j++;
                next TRY;
            }
        }
        if ( ($i + 1) < $length1 ) {
            if ($array1[$i + 1] eq $array2[$j]) {
                $error_count++;
                $i++;
                next TRY;
            }
        }

        # Else: Advance in both strings; error++.
        $error_count++;
        $i++;
        $j++;
        next TRY;
    }

    # One string ended first: every character left over in the other
    #   string counts as one further error.
    if ($error_count <= $max_distance) {
        if ($i < $length1) {
            $error_count = $error_count + ($length1 - $i);
        }
        if ($j < $length2) {
            $error_count = $error_count + ($length2 - $j);
        }
    }
    return ($error_count);
}

Please apply it using
$ ./check_typos.pl -d directory -t test-number

These are the options of my script (some not implemented yet):
--top-dir PATH
--sub-dir PATTERN
--extension PATTERN
--syntax [html|di-po|di-templates|hd-php|txt|docbook ...]
--stop-words FILE
--problems FILE
--dicts DICTIONARIES
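A minimal sketch of how these options could be parsed with the core module Getopt::Long. It assumes every option takes exactly one string value, which the list above does not state; the sub name and the sample values are made up for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Assumption: each option takes a single string argument.
sub parse_options {
    my (@args) = @_;
    my %opt;
    GetOptionsFromArray(
        \@args, \%opt,
        'top-dir=s', 'sub-dir=s', 'extension=s', 'syntax=s',
        'stop-words=s', 'problems=s', 'dicts=s',
    ) or die "Could not parse options\n";
    return \%opt;
}

# Hypothetical command line; 'doc' and 'po' are sample values only.
my $opt = parse_options('--top-dir', 'doc', '--extension', 'po', '--syntax', 'di-po');
print "$_ = $opt->{$_}\n" for sort keys %$opt;
```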

This means bseoin was found once but besoin was found 990 times so it's
likely that the first is a typo. Now I search for bseoin using grep -rw.
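The frequency argument can be sketched roughly as follows: given word counts, report rare words that differ from a frequent word only by one swap of adjacent characters. The sub names and both thresholds are assumptions made for this example, and 'avoir' with its count is invented sample data.

```perl
use strict;
use warnings;

# True if $word1 turns into $word2 by swapping two adjacent characters.
sub is_adjacent_swap {
    my ($word1, $word2) = @_;
    return 0 if length($word1) != length($word2);
    return 0 if $word1 eq $word2;
    for my $i (0 .. length($word1) - 2) {
        my $variant = $word1;
        substr($variant, $i, 2) = scalar reverse(substr($word1, $i, 2));
        return 1 if $variant eq $word2;
    }
    return 0;
}

# Pair every rare word with each frequent word it could be a typo of.
sub suspect_typos {
    my ($counts, $max_rare, $min_common) = @_;
    my @rare   = grep { $counts->{$_} <= $max_rare }   sort keys %$counts;
    my @common = grep { $counts->{$_} >= $min_common } sort keys %$counts;
    my @hits;
    for my $rare (@rare) {
        for my $common (@common) {
            push @hits, [$rare, $common] if is_adjacent_swap($rare, $common);
        }
    }
    return @hits;
}

# Counts for bseoin/besoin as quoted above; 'avoir' is made up.
my %counts = (besoin => 990, bseoin => 1, avoir => 500);
my @hits = suspect_typos(\%counts, 2, 100);
printf "%s (%d) may be a typo of %s (%d)\n",
    $_->[0], $counts{ $_->[0] }, $_->[1], $counts{ $_->[1] } for @hits;
```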

I output the complete line as $path,$line-number,$problem,$line, which should make it possible to generate patches in the future.

This script is much more efficient than aspell or other spell checkers.
It also finds typos in names and URLs (Meyer vs. Mayer,
php382&tzd_d vs. php381&tzd_d)

Because the English version is of very high quality, this needs more sophisticated techniques like phrase parsing, to detect e.g. 'file system' versus 'file-system' versus 'filesystem'.
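One simple way to approach the 'file system' case, short of real phrase parsing, is to map single words, hyphenated words, and pairs of adjacent words to the same key and report keys that occur in more than one written form. This is only a sketch under that assumption; the sub name is invented, and it ignores sentence boundaries.

```perl
use strict;
use warnings;

# Returns a hash ref: normalized key => sorted list of written forms,
# restricted to keys that appear in more than one form.
sub spelling_variants {
    my ($text) = @_;
    my %forms;    # key => { written form => count }
    my @words = map { lc $_ } $text =~ /([A-Za-z]+(?:-[A-Za-z]+)*)/g;

    for my $i (0 .. $#words) {
        # a single token; hyphens are dropped to build the key
        my $key = $words[$i];
        $key =~ tr/-//d;
        $forms{$key}{ $words[$i] }++;

        # two adjacent tokens written as separate words
        if ($i < $#words) {
            my $pair     = "$words[$i] $words[$i + 1]";
            my $pair_key = $pair;
            $pair_key =~ tr/- //d;
            $forms{$pair_key}{$pair}++;
        }
    }

    my %variants;
    for my $key (keys %forms) {
        my @written = sort keys %{ $forms{$key} };
        $variants{$key} = \@written if @written > 1;
    }
    return \%variants;
}

my $variants = spelling_variants(
    'The file system is mounted. Each file-system has a filesystem type.'
);
print "$_: ", join(' | ', @{ $variants->{$_} }), "\n" for sort keys %$variants;
```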

That's why I usually extract msgid strings using (is there really no
msg* command to do this??)

With extraction to a file you lose the context. That's why I prefer to keep the original file context and apply parser plugins to filter away the surrounding syntax.

We seem to have similar ideas, so we should share them. My aim is to have a toolbox for semi-automatic reviews of documents, i.e. wording consistency, candidates for glossaries, undocumented functions. Detection of typos or wrong spelling is not the aim, but a side effect.

Helmut Wollmersdorfer


