Re: check_typos.pl script
- To: debian-i18n@lists.debian.org
- Subject: Re: check_typos.pl script
- From: Helmut Wollmersdorfer <helmut.wollmersdorfer@gmx.at>
- Date: Sat, 08 Oct 2005 02:25:05 +0200
- Message-id: <di73l1$4sg$1@sea.gmane.org>
- In-reply-to: <20050925181559.GA2736@pluto>
- References: <42D7AE5B000A5B1E@mail-7.mail.tiscali.sys> <200509242315.09106.aragorn@tiscali.nl> <20050924224034.GA8428@pluto> <200509250141.49377.aragorn@tiscali.nl> <20050925101519.GA31312@pluto> <20050925105124.GA27920@djedefre.onera> <20050925125705.GA32389@pluto> <pan.2005.09.25.16.19.09.31840@tiscali.it> <20050925181559.GA2736@pluto>
Jens Seidel wrote:
> I implemented three tests:
> * swapped characters (helol)
> * duplicated character (helllo)
> * removed characters (helo)
> (a check for doubled words is missing missing)
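A doubled-word check is easy to bolt on with a backreference regex. A minimal sketch (Python here, just for illustration; the helper name is made up):

```python
import re

# Matches a word followed immediately by the same word ("missing missing").
DOUBLED = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

def doubled_words(text):
    """Return each word that occurs twice in a row."""
    return [m.group(1) for m in DOUBLED.finditer(text)]

print(doubled_words("a check for doubled words is missing missing"))
# → ['missing']
```

The trailing `\b` keeps it from firing on pairs like "the theory".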
More generally, you can compute the 'edit distance', i.e. by how many
single-character edits two strings differ.
An edit distance of 1 would also catch 'hallo' or 'gello' or 'ello' or
'hell' (some of these are valid words, so a spellchecker would accept them).
In my script, which has a similar intention, I coded the 'Ukkonen'
algorithm after finding that the more general 'Levenshtein' algorithm is
very slow.
sub ukkonen {
    my ($string1, $string2, $max_distance) = @_;

    my ($length1, $length2) = (length $string1, length $string2);
    my $length_difference = abs($length1 - $length2);
    if ($length_difference > $max_distance) {
        return $length_difference;
    }
    if ($string1 eq $string2) {
        return 0;
    }

    my @array1 = split //, $string1;
    my @array2 = split //, $string2;

    my $i = 0;
    my $j = 0;
    my $error_count = 0;

    TRY: while ( ($error_count <= $max_distance)
                 and ($i < $length1) and ($j < $length2) ) {
        # The current characters match.
        # Advance to the next character in both strings.
        if ($array1[$i] eq $array2[$j]) {
            $i++;
            $j++;
            next TRY;
        }
        # The current character matches the next one in the other string.
        # Advance in the other string, error++.
        if ( ($j + 1) < $length2 and $array1[$i] eq $array2[$j + 1] ) {
            $error_count++;
            $j++;
            next TRY;
        }
        if ( ($i + 1) < $length1 and $array1[$i + 1] eq $array2[$j] ) {
            $error_count++;
            $i++;
            next TRY;
        }
        # Else: advance in both strings; error++.
        $error_count++;
        $i++;
        $j++;
    }
    # Characters left over in either string count as errors.
    if ($error_count <= $max_distance) {
        if ($i < $length1) {
            $error_count += $length1 - $i;
        }
        if ($j < $length2) {
            $error_count += $length2 - $j;
        }
    }
    return $error_count;
}
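For comparison, the textbook Levenshtein computation fills a full
dynamic-programming table of size length1 x length2 with no early exit,
which is what makes it slow on long strings. A sketch (Python, not taken
from either script; note that plain Levenshtein counts a transposition
like 'helol' as 2 edits, not 1):

```python
def levenshtein(s1, s2):
    """Textbook dynamic-programming edit distance, O(len(s1)*len(s2))."""
    prev = list(range(len(s2) + 1))          # row for the empty prefix of s1
    for i, c1 in enumerate(s1, 1):
        curr = [i]                           # deleting i characters of s1
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,     # deletion
                            curr[j - 1] + 1, # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("hello", "helol"))  # → 2: a swap costs two plain edits
```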
Please apply it using
$ ./check_typos.pl -d directory -t test-number
These are the options of my script (some are not implemented yet):
--top-dir PATH
--sub-dir PATTERN
--extension PATTERN
--syntax [html|di-po|di-templates|hd-php|txt|docbook ...]
--stop-words FILE
--problems FILE
--dicts DICTIONARIES
This means bseoin was found once but besoin 990 times, so the first is
most likely a typo. I then search for bseoin using grep -rw.
I output the complete line as $path,$line-number,$problem,$line, which
should later allow patches to be generated.
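That frequency heuristic can be sketched as follows (Python, with made-up
thresholds; the signal is a rare word lying close to a common word):

```python
from collections import Counter

def edit_distance(s1, s2):
    """Plain Levenshtein distance (a transposition costs 2 here)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def typo_candidates(words, rare_max=1, common_min=100, max_distance=2):
    """Pair each rare word with any common word it is close to."""
    counts = Counter(words)
    rare   = [w for w, n in counts.items() if n <= rare_max]
    common = [w for w, n in counts.items() if n >= common_min]
    return [(r, c) for r in rare for c in common
            if edit_distance(r, c) <= max_distance]

print(typo_candidates(["besoin"] * 990 + ["bseoin"]))
# → [('bseoin', 'besoin')]
```

max_distance defaults to 2 because a simple swap like bseoin/besoin costs
two plain edits.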
This script is much more efficient than aspell or other spell checkers.
It also finds typos in names and URLs (Meyer vs. Mayer,
php382&tzd_d vs. php381&tzd_d).
Because the English version is of very high quality, this needs more
sophisticated techniques such as phrase parsing, to detect e.g. 'file
system' versus 'file-system' versus 'filesystem'.
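One cheap phrase-level check is to collapse spacing and hyphenation and
then group the surviving variants; a sketch under that assumption
(Python, invented function names):

```python
import re

def variant_key(phrase):
    """Collapse spaces and hyphens so 'file system', 'file-system'
    and 'filesystem' all map to the same key."""
    return re.sub(r'[\s-]+', '', phrase.lower())

def spelling_variants(phrases):
    """Group phrases whose only difference is spacing/hyphenation."""
    groups = {}
    for p in phrases:
        groups.setdefault(variant_key(p), set()).add(p)
    return {k: v for k, v in groups.items() if len(v) > 1}

print(spelling_variants(["file system", "filesystem", "file-system", "kernel"]))
# → {'filesystem': {'file system', 'file-system', 'filesystem'}}
```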
That's why I usually extract msgid strings using (is there really no
msg* command to do this??)
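I am not aware of a single msg* command that emits bare msgids either; a
few lines of hand parsing cover the common case. A simplified sketch
(Python; real PO files also have msgid_plural, obsolete entries and more
escape sequences than handled here):

```python
import re

def _unquote(parts):
    # Join quoted segments: msgid "foo" followed by "bar" -> "foobar".
    s = ''.join(re.findall(r'"((?:[^"\\]|\\.)*)"', ' '.join(parts)))
    return s.replace('\\"', '"').replace('\\\\', '\\')

def extract_msgids(po_text):
    """Collect msgid strings from PO-file text, joining multi-line entries."""
    msgids, current = [], None
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith('msgid '):
            if current is not None:
                msgids.append(_unquote(current))
            current = [line[6:]]
        elif current is not None and line.startswith('"'):
            current.append(line)          # continuation line of the msgid
        elif current is not None:
            msgids.append(_unquote(current))
            current = None                # msgstr (or anything else) ends it
    if current is not None:
        msgids.append(_unquote(current))
    return msgids

po = 'msgid "Hello"\nmsgstr "Hallo"\n\nmsgid ""\n"multi "\n"line"\nmsgstr ""\n'
print(extract_msgids(po))
# → ['Hello', 'multi line']
```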
With extraction to a file you lose the context. That's why I prefer to
keep the original file context and apply parser plugins to filter away
the surrounding syntax.
We seem to have similar ideas, so we should share them. My aim is to
have a toolbox for semi-automatic reviews of documents - i.e. wording
consistency, candidates for glossaries, undocumented functions.
Detection of typos or wrong spellings is not the aim, but a side effect.
Helmut Wollmersdorfer