Re: check_typos.pl script

To: debian-i18n@lists.debian.org
Subject: Re: check_typos.pl script
From: Helmut Wollmersdorfer <helmut.wollmersdorfer@gmx.at>
Date: Sat, 08 Oct 2005 02:25:05 +0200
Message-id: <di73l1$4sg$1@sea.gmane.org>
In-reply-to: <20050925181559.GA2736@pluto>
References: <42D7AE5B000A5B1E@mail-7.mail.tiscali.sys> <200509242315.09106.aragorn@tiscali.nl> <20050924224034.GA8428@pluto> <200509250141.49377.aragorn@tiscali.nl> <20050925101519.GA31312@pluto> <20050925105124.GA27920@djedefre.onera> <20050925125705.GA32389@pluto> <pan.2005.09.25.16.19.09.31840@tiscali.it> <20050925181559.GA2736@pluto>

Jens Seidel wrote:

I implemented three tests:
 * swapped characters (helol)
 * duplicated character (helllo)
 * removed characters (helo)
 (a check for doubled words is missing missing)
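For readers who have not seen check_typos.pl, the three tests above could be sketched roughly as follows. This is only an illustration: the helper name classify_typo is hypothetical and is not taken from the script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from check_typos.pl): report which of the three
# single-error types turns $word into $typo, or return nothing if none does.
sub classify_typo {
    my ($word, $typo) = @_;
    my $len = length $word;

    # Swapped characters: exchange each adjacent pair in turn.
    for my $i (0 .. $len - 2) {
        my $candidate = $word;
        substr($candidate, $i, 2, scalar reverse substr($word, $i, 2));
        return 'swapped' if $candidate ne $word and $candidate eq $typo;
    }

    # Duplicated character: repeat each character once.
    for my $i (0 .. $len - 1) {
        my $candidate = $word;
        substr($candidate, $i, 1, substr($word, $i, 1) x 2);
        return 'duplicated' if $candidate eq $typo;
    }

    # Removed character: delete each character in turn.
    for my $i (0 .. $len - 1) {
        my $candidate = $word;
        substr($candidate, $i, 1, '');
        return 'removed' if $candidate eq $typo;
    }

    return;
}

for my $typo (qw(helol helllo helo)) {
    printf "%s: %s\n", $typo, classify_typo('hello', $typo) // 'no match';
}
```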

In a more general way, you can compute the 'edit distance', i.e. how many single-character changes (insertions, deletions, substitutions) separate two words.

An edit distance of 1 would also catch 'hallo' or 'gello' or 'ello' or 'hell' (the last one is a correct word, so a spellchecker would accept it).
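To make the notion concrete, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance. It is a reference version for illustration only, not code from either script, and the sub name levenshtein is my own choice.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Textbook Levenshtein distance, keeping only two rows of the table.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ($m, $n) = (scalar @s, scalar @t);

    # Distance from the empty prefix of $s to each prefix of $t.
    my @prev = (0 .. $n);

    for my $i (1 .. $m) {
        my @curr = ($i);    # deleting $i characters reaches the empty string
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

printf "%s: %d\n", $_, levenshtein('hello', $_) for qw(hallo gello ello hell);
```

All four words come out at distance 1 from 'hello'. The cost is proportional to the product of the two lengths for every pair compared, which is what makes it slow on large word lists.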

In my script, which has a similar intention, I coded the 'Ukkonen' algorithm after I found that the more general 'Levenshtein' is very slow.


sub ukkonen {
    my (
        $string1,
        $string2,
        $max_distance
        ) = @_;

    my ($length1, $length2) = (length $string1, length $string2);

    my $length_difference = abs ($length1 - $length2);

    if ($length_difference > $max_distance) {
        return ($length_difference);
    }

    if ($string1 eq $string2 ) {
        return (0);
    }


    my @array1 = split(//, $string1);
    my @array2 = split(//, $string2);

    my $i = 0;
    my $j = 0;

    my $error_count = 0;

#TRY: while ( $error_count <= $max_distance ) {

TRY: while ( ($error_count <= $max_distance) and ($i < $length1) and ($j < $length2) ) {

        # The current characters match.
        #   Advance to next character in both strings.
        if ( $array1[$i] eq $array2[$j] ) {
            $i++;
            $j++;
            next TRY;
        }

        # The current character matches the next in the other string.
        #   Advance to next character in other string, error++.
        if ( ($j+1) < $length2 ) {
            if ($array1[$i] eq $array2[$j + 1]) {
                $error_count++;
                $j++;
                next TRY;
            }
        }
        if ( ($i + 1) < $length1 ) {
            if ($array1[$i + 1] eq $array2[$j]) {
                $error_count++;
                $i++;
                next TRY;
            }
        }

        # Else: Advance in both strings; error++.
        $error_count++;
        $i++;
        $j++;
        next TRY;
    }

    # Count the characters left over in whichever string was not consumed.
    if ($error_count <= $max_distance) {
        if ($i < $length1) {
            $error_count = $error_count + ($length1 - $i);
        }
        if ($j < $length2) {
            $error_count = $error_count + ($length2 - $j);
        }
    }
    return ($error_count);
}

Please apply it using
$ ./check_typos.pl -d directory -t test-number


These are the options of my script (some not implemented yet):
--top-dir PATH
--sub-dir PATTERN
--extension PATTERN
--syntax [html|di-po|di-templates|hd-php|txt|docbook ...]
--stop-words FILE
--problems FILE
--dicts DICTIONARIES
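A rough sketch of how these options could be parsed with the core module Getopt::Long. Only the option names come from the list above; the sub name parse_options and the example values are assumptions for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical wrapper: parse the options listed above into a hash reference.
# Every option takes a string argument; none is given a default here.
sub parse_options {
    my @args = @_;
    my %opt;
    GetOptionsFromArray(
        \@args, \%opt,
        'top-dir=s',
        'sub-dir=s',
        'extension=s',
        'syntax=s',
        'stop-words=s',
        'problems=s',
        'dicts=s',
    ) or die "Could not parse options\n";
    return \%opt;
}

# Example values are made up.
my $opt = parse_options(
    '--top-dir', '/tmp/manual', '--extension', 'xml', '--syntax', 'docbook',
);
print "$_ = $opt->{$_}\n" for sort keys %$opt;
```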

This means bseoin was found once but besoin was found 990 times so it's
likely that the first is a typo. Now I search for bseoin using grep -rw.
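The frequency argument can be sketched like this: count every word, then flag rare words that become a frequent word when two adjacent characters are swapped. The sub names and both thresholds are assumptions for illustration, and only the swap test is shown.

```perl
use strict;
use warnings;

# Count how often each word occurs in a list of text lines.
sub word_counts {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        $count{ lc $1 }++ while $line =~ /(\w+)/g;
    }
    return \%count;
}

# Return [rare, frequent] pairs where swapping two adjacent characters of a
# rare word yields a frequent word. Both thresholds are illustrative.
sub suspicious_words {
    my ($count, $max_rare, $min_common) = @_;
    my @found;
    for my $word (sort keys %$count) {
        next if $count->{$word} > $max_rare;
        for my $i (0 .. length($word) - 2) {
            my $variant = $word;
            substr($variant, $i, 2, scalar reverse substr($word, $i, 2));
            next if $variant eq $word;
            if (($count->{$variant} // 0) >= $min_common) {
                push @found, [$word, $variant];
                last;
            }
        }
    }
    return @found;
}

my @lines = (('vous avez besoin de') x 5, 'pas bseoin de');
for my $pair (suspicious_words(word_counts(@lines), 1, 5)) {
    print "$pair->[0] is probably a typo for $pair->[1]\n";
}
```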

I output the complete line in the form $path,$line-number,$problem,$line, which should in future allow patches to be generated.

This script is much more efficient than aspell or other spell checkers.
It also finds typos in names and URLs (Meyer vs. Mayer,
php382&tzd_d vs. php381&tzd_d)

Because the English version is of very high quality, this needs more sophisticated techniques like phrase parsing, to detect e.g. 'filesystem' versus 'file-system' versus 'file system'.

That's why I usually extract msgid strings using (is there really no
msg* command to do this??)

With extraction to a file you lose the context. That's why I prefer to keep the original file context and apply parser plugins to filter away the surrounding syntax.
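A minimal sketch of what such parser plugins might look like, assuming one plugin per --syntax value that turns a raw line into plain text while path and line number are kept. The plugin table, the sub name filter_lines and the two sample plugins are hypothetical, and the html plugin only handles tags that open and close on the same line.

```perl
use strict;
use warnings;

# Hypothetical plugin table: syntax name => code that strips the markup
# from one raw line.
my %plugin = (
    txt  => sub { return $_[0] },
    html => sub {
        my ($line) = @_;
        $line =~ s/<[^>]*>/ /g;    # replace each complete tag with a space
        return $line;
    },
);

# Filter the lines of one file, keeping path and line number as context.
sub filter_lines {
    my ($syntax, $path, @lines) = @_;
    my $strip = $plugin{$syntax} or die "No plugin for syntax '$syntax'\n";
    my @records;
    my $line_number = 0;
    for my $line (@lines) {
        $line_number++;
        my $text = $strip->($line);
        $text =~ s/\s+/ /g;           # collapse runs of whitespace
        $text =~ s/^ | $//g;          # trim both ends
        next if $text eq '';          # nothing but markup on this line
        push @records, { path => $path, line => $line_number, text => $text };
    }
    return @records;
}

my @records = filter_lines('html', 'index.html',
    '<p>Hello <b>world</b></p>', '<hr>', 'plain text');
print "$_->{path},$_->{line},$_->{text}\n" for @records;
```

Since every record carries the original path and line number, a problem found in the filtered text can still be reported against the source file.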

We seem to have similar ideas, so we should share them. My aim is to have a toolbox for semi-automatic reviews of documents, i.e. wording consistency, candidates for glossaries, undocumented functions. Detection of typos or wrong spelling is not the aim, but a side effect.


Helmut Wollmersdorfer
