
check_typos.pl script (Was: Re: A few typos in debian-installer)



Hi Davide,

On Sun, Sep 25, 2005 at 04:18:57PM +0000, Davide Viti wrote:
> >> The tools you use seem anyway to be very good at spotting stuff that
> >> might be missed by Davide Viti's spellchecker so I suggest you both
> >> talk together to see what could be done to integrate your scripts to
> >> the spellchecker.
> > 
> > There is nevertheless also a lot of manual work involved. But I agree
> > that the results of the script should be published. I'm not able to fix
> > all languages (even if I try it sometimes :-))
> > I will contact Davide.
> 
> I'm open to integrating your scripts with the spellchecking system.
> ATM things are a bit messy because I switched to using a complete Sarge
> system for the scripts and some things have to be fixed yet, but could be
> a good idea discussing things already.

OK, I attached my script. It's really far from being perfect, and every
Perl hacker knows better ways to do it, but nevertheless I found it very
useful.

I implemented three tests:
 * swapped characters (helol)
 * duplicated character (helllo)
 * removed characters (helo)
 (a check for doubled words is missing missing)
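
To make the three tests concrete, here is a minimal Python sketch. It is
not the attached Perl script; the function names are made up for
illustration. Each function asks whether a suspected typo can be turned
into a known word by undoing exactly one mistake of the given kind.

```python
def is_swap(typo, word):
    """True if swapping two adjacent characters of typo gives word."""
    if len(typo) != len(word) or typo == word:
        return False
    for i in range(len(typo) - 1):
        swapped = typo[:i] + typo[i + 1] + typo[i] + typo[i + 2:]
        if swapped == word:
            return True
    return False


def is_duplicated(typo, word):
    """True if typo is word with one character typed twice in a row."""
    if len(typo) != len(word) + 1:
        return False
    for i in range(1, len(typo)):
        # Only drop a character that repeats its left neighbour.
        if typo[i] == typo[i - 1] and typo[:i] + typo[i + 1:] == word:
            return True
    return False


def is_removed(typo, word):
    """True if typo is word with one character left out."""
    if len(typo) + 1 != len(word):
        return False
    for i in range(len(word)):
        if word[:i] + word[i + 1:] == typo:
            return True
    return False
```

With these definitions, is_swap("helol", "hello"),
is_duplicated("helllo", "hello") and is_removed("helo", "hello") all hold.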

Since all words which were found in a subdirectory (single files cannot
be tested, sorry) are considered, there is no need for a wordlist or a
special file format -- text, HTML or XML all work.
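
The idea of building the word list from the directory itself could be
sketched in Python as follows. This is an assumption about the approach,
not the code of the attached script: the name count_words, the \w+
tokenisation and the UTF-8 decoding are all choices made for this example.

```python
import os
import re
from collections import Counter

# Runs of letters, digits and underscores; in Python 3 this is
# Unicode-aware, so accented letters stay inside a word.
WORD_RE = re.compile(r"\w+")


def count_words(directory):
    """Count every word in every file below directory."""
    counts = Counter()
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # Undecodable bytes are replaced instead of aborting the run.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    counts.update(WORD_RE.findall(line))
    return counts
```

Markup is not treated specially here, so tag names in HTML or XML files
are simply counted as words like everything else.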

The words are then compared with those which occur most often, looking
for one of the specified kinds of typo.

Please apply it using
$ ./check_typos.pl -d directory -t test-number
(Test 3 has many false positives).
The script doesn't modify anything; it just outputs the typos it found,
similar to:

bseoin (1) ==> besoin (990)

This means bseoin was found once but besoin was found 990 times so it's
likely that the first is a typo. Now I search for bseoin using grep -rw.
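
The rare-versus-frequent heuristic behind this output can be sketched in
Python for the swapped-character test alone. The function names and the
two thresholds (max_rare, min_common) are assumptions for this example;
the attached script may choose its candidates differently.

```python
def swap_variants(word):
    """All strings obtained by swapping two adjacent characters of word."""
    variants = {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    }
    variants.discard(word)  # swapping two equal letters changes nothing
    return variants


def report_swaps(counts, max_rare=2, min_common=20):
    """Report rare words that are one swap away from a frequent word."""
    lines = []
    for word in sorted(counts):
        if counts[word] > max_rare:
            continue  # frequent words are assumed to be spelled correctly
        for candidate in sorted(swap_variants(word)):
            if counts.get(candidate, 0) >= min_common:
                lines.append(
                    f"{word} ({counts[word]}) ==> "
                    f"{candidate} ({counts[candidate]})"
                )
    return lines
```

Called with the counts {"bseoin": 1, "besoin": 990}, report_swaps returns
the single line "bseoin (1) ==> besoin (990)".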

This script is much more efficient than aspell or other spell checkers.
It also finds typos in names and URLs (Meyer vs. Mayer,
php382&tzd_d vs. php381&tzd_d).

I created my last patch by running my script against the full packages/po/
directory. This has the advantage that strings in msgid's and msgstr's
are compared at the same time and I was able to find even consistent
typos across a language file, such as etx2 and boostrap.
Nevertheless I also suggest restricting tests to a single language.
That's why I usually extract the msgstr strings using (is there really no
msg* command to do this??)

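cat packages/po/de.po | msgconv | \
 awk '/^msgstr/ {t=1};   # a msgstr entry starts here
      /^msgid/ {t=0}; {  # and ends at the next msgid
       # print msgstr lines and their continuation lines,
       # skipping every line that contains a "#"
       if (t==1 && index($0, "#")==0) {
         gsub("^msgstr ", "");
         gsub("^\"", "");
         gsub("\"$", "");
         gsub("\\\\n", " ");   # literal \n escapes become spaces
         print
       }
      }' > /tmp/check/de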
(not yet tested with plural forms of PO files).
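
For readers who prefer to follow the logic outside awk, the same
filtering can be sketched in Python. This mirrors the awk program line by
line and inherits its limits (plural forms are not handled either); the
function name extract_msgstr is made up for this example.

```python
import re


def extract_msgstr(lines):
    """Return the text of msgstr entries, one item per input line."""
    out = []
    in_msgstr = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("msgstr"):
            in_msgstr = True
        if line.startswith("msgid"):
            in_msgstr = False
        # Like the awk version, skip any line that contains a "#".
        if in_msgstr and "#" not in line:
            text = re.sub(r"^msgstr ", "", line)
            text = re.sub(r'^"', "", text)
            text = re.sub(r'"$', "", text)
            text = text.replace("\\n", " ")  # literal \n escape, not a newline
            out.append(text)
    return out
```

It can be fed an open file object, for example the output of msgconv
saved to a file, since it only iterates over lines.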

Attention: Since I do not know Perl well enough, I explicitly wrote the
word separators into the code (\W seems not to be locale-specific). So I
suggest you add common accents for other languages to the script, line 48.
Do you know a solution for this?

Davide, you still have to iterate across all languages and do other
stuff. But I'm sure you know the required shell snippets, right?

PS: Another script I run once per year is pattern-match
http://alioth.debian.org/snippet/detail.php?type=snippet&id=2
This script checks for matching patterns in the specified file. It was
written mainly to check parentheses, braces, brackets, ... in my math
documents.
 
  Examples:
   "([x])", "{\|x\||y|}", ... correct
   "([x)]", "\|||", "{()", ... incorrect
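
A check of this kind can be sketched with a stack. The Python below is
not the pattern-match script from the URL above, only an illustration
under assumptions: it knows (), [], {} and treats | and \| as delimiters
whose opening and closing forms look the same, closing one whenever the
same token is on top of the stack.

```python
import re

# A backslash-pipe is one token; it must be tried before the plain pipe.
TOKEN_RE = re.compile(r"\\\||[()\[\]{}|]")
PAIRS = {")": "(", "]": "[", "}": "{"}
SYMMETRIC = {"|", "\\|"}


def is_balanced(text):
    """True if all delimiters in text are properly nested and closed."""
    stack = []
    for token in TOKEN_RE.findall(text):
        if token in SYMMETRIC:
            # Same token on top: treat this one as the closing delimiter.
            if stack and stack[-1] == token:
                stack.pop()
            else:
                stack.append(token)
        elif token in PAIRS:
            if not stack or stack.pop() != PAIRS[token]:
                return False
        else:
            stack.append(token)  # an opening (, [ or {
    return not stack
```

This accepts the two correct examples and rejects the three incorrect
ones. The rule for | and \| is a heuristic: nested uses of the same
symmetric delimiter cannot be told apart from consecutive ones.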

Jens

Attachment: check_typos.pl.gz
Description: Binary data

