check_typos.pl script (Was: Re: A few typos in debian-installer)

To: debian-i18n@lists.debian.org
Subject: check_typos.pl script (Was: Re: A few typos in debian-installer)
From: Jens Seidel <jensseidel@users.sf.net>
Date: Sun, 25 Sep 2005 20:15:59 +0200
Message-id: <20050925181559.GA2736@pluto>
Mail-followup-to: debian-i18n@lists.debian.org
In-reply-to: <pan.2005.09.25.16.19.09.31840@tiscali.it>
References: <42D7AE5B000A5B1E@mail-7.mail.tiscali.sys> <200509242315.09106.aragorn@tiscali.nl> <20050924224034.GA8428@pluto> <200509250141.49377.aragorn@tiscali.nl> <20050925101519.GA31312@pluto> <20050925105124.GA27920@djedefre.onera> <20050925125705.GA32389@pluto> <pan.2005.09.25.16.19.09.31840@tiscali.it>

Hi Davide,

On Sun, Sep 25, 2005 at 04:18:57PM +0000, Davide Viti wrote:
> >> The tools you use seem anyway to be very good at spotting stuff that
> >> might be missed by Davide Viti's spellchecker so I suggest you both
> >> talk together to see what could be done to integrate your scripts to
> >> the spellchecker.
> > 
> > There is nevertheless also a lot of manual work involved. But I agree
> > that the results of the script should be published. I'm not able to fix
> > all languages (even if I try it sometimes :-))
> > I will contact Davide.
> 
> I'm open to integrating your scripts with the spellchecking system.
> ATM things are a bit messy because I switched to using a complete Sarge
> system for the scripts and some things have to be fixed yet, but could be
> a good idea discussing things already.

OK, I attached my script. It's really far, far away from being perfect,
and every Perl hacker knows better ways to do it, but nevertheless I
found it very useful.

I implemented three tests:
 * swapped characters (helol)
 * duplicated character (helllo)
 * removed characters (helo)
 (a check for doubled words is still missing)
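
For readers who do not want to unpack the attachment, the three tests can
be sketched roughly as follows. This is Python rather than the Perl of
check_typos.pl, and the function names swapped, duplicated and removed are
made up for this illustration; the attached script may well do it
differently.

```python
# Illustrative sketch only; not the code from check_typos.pl.

def swapped(word):
    """Words obtained by swapping two adjacent characters (hello -> helol)."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)
            if word[i] != word[i + 1]}  # swapping equal letters changes nothing


def duplicated(word):
    """Words obtained by doubling one character (hello -> helllo)."""
    return {word[:i] + word[i] + word[i:] for i in range(len(word))}


def removed(word):
    """Words obtained by dropping one character (hello -> helo)."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}
```

Each function returns the set of misspellings that one kind of typo can
produce from a correct word, so a test boils down to looking those
variants up among the words that were actually found.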

Since all words which were found in a subdirectory (single files cannot
be tested, sorry) are considered, there is no need for a wordlist or a
special file format -- text, HTML or XML all work.
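
Collecting the words without a wordlist could look roughly like this in
Python. It is a sketch under assumptions, not the attached script: the
name count_words, the UTF-8 decoding and the use of \w+ as the definition
of a word are all choices made for this example.

```python
import os
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumption: a "word" is a run of word characters


def count_words(directory):
    """Count every word in every file below `directory`."""
    counts = Counter()
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            # errors="replace" keeps oddly encoded files from aborting the run
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    counts.update(WORD.findall(line))
    return counts
```

Markup is not treated specially here, so tag names from HTML or XML files
simply end up in the counts as ordinary words.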

Each word is then compared with the words which occur most often, to see
whether it differs from one of them by the selected kind of typo.

Run it with
$ ./check_typos.pl -d directory -t test-number
(test 3 produces many false positives).
The script doesn't modify anything; it just prints the typos it finds,
similar to:

bseoin (1) ==> besoin (990)

This means bseoin was found once but besoin was found 990 times, so it's
likely that the first is a typo. I then locate bseoin using grep -rw.
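
The frequency comparison behind such a line can be sketched in Python as
below. The names swap_variants and report_swaps, and especially the
min_ratio threshold, are assumptions made for this illustration; the
attached Perl script may decide differently what counts as "much more
frequent".

```python
from collections import Counter


def swap_variants(word):
    """All words reachable by swapping two adjacent characters."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)
            if word[i] != word[i + 1]}


def report_swaps(counts, min_ratio=10):
    """Return 'rare (n) ==> frequent (m)' lines for likely swapped characters.

    min_ratio is a hypothetical threshold: the suggested word must occur
    at least min_ratio times as often as the suspicious one.
    """
    lines = []
    for word, n in sorted(counts.items()):
        for candidate in sorted(swap_variants(word)):
            m = counts.get(candidate, 0)
            if m >= min_ratio * n:
                lines.append(f"{word} ({n}) ==> {candidate} ({m})")
    return lines


counts = Counter({"besoin": 990, "bseoin": 1, "avoir": 40})
print("\n".join(report_swaps(counts)))
```

With these made-up counts the only line printed is the one shown above,
because besoin is the only swap variant that is far more frequent than
the word it was derived from.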

This script is much more efficient than aspell or other spell checkers.
It also finds typos in names and URLs (Meyer vs. Mayer,
php382&amp;tzd_d vs. php381&amp;tzd_d)

I created my last patch by running my script against the full packages/po/
directory. This has the advantage that strings in msgids and msgstrs
are compared at the same time, and I was able to find even consistent
typos across a language file, such as etx2 and boostrap.
Nevertheless I also suggest restricting a test run to a single language.
That's why I usually extract the msgstr strings using (is there really no
msg* command to do this??)

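msgconv packages/po/de.po | \
 awk '/^msgstr/ {t=1}   # a translated string starts
      /^msgid/  {t=0}   # an original string starts
      {
        # only lines inside a msgstr; skip comment lines
        if (t==1 && $0 !~ /^#/) {
          sub(/^msgstr /, "")   # drop the keyword
          sub(/^"/, "")         # drop the opening quote
          sub(/"$/, "")         # drop the closing quote
          gsub(/\\n/, " ")      # an escaped newline becomes a space
          print
        }
      }' > /tmp/check/de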
(not yet tested with plural forms of PO files).
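
The same extraction can be sketched in Python, which makes it easier to
join continuation lines into one string per entry. The function name
extract_msgstr is made up for this example. The sketch is meant to accept
msgstr[N] lines as well, but like the awk version it has not been run
against real plural-form PO files, and it keeps the PO header entry too.

```python
import re

# Matches an optional msgstr / msgstr[N] keyword followed by a quoted string
QUOTED = re.compile(r'^(?:msgstr(?:\[\d+\])?\s+)?"(.*)"\s*$')


def extract_msgstr(po_text):
    """Return the msgstr strings of a PO file, one per entry."""
    strings, current = [], None
    for line in po_text.splitlines():
        if line.startswith("msgstr"):
            if current is not None:       # previous plural form ends here
                strings.append(current)
            match = QUOTED.match(line)
            current = match.group(1) if match else ""
        elif current is not None and line.startswith('"'):
            match = QUOTED.match(line)    # continuation line of the msgstr
            current += match.group(1) if match else ""
        else:                             # msgid, comment or blank line
            if current is not None:
                strings.append(current)
            current = None
    if current is not None:
        strings.append(current)
    # An escaped "\n" becomes a space, as in the awk version; drop empty strings
    return [s.replace("\\n", " ").strip() for s in strings if s]
```

Writing the result to a file with one string per line gives roughly the
same input for the typo tests as /tmp/check/de above.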

Attention: Since I do not know Perl well enough, I explicitly wrote the
word separators into the code (\W does not seem to be locale-specific). So I
suggest you add common accented characters for other languages to the script, line 48.
Do you know a solution for this?
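
One possible direction, sketched in Python rather than Perl: in Python 3
the \w class matches Unicode word characters for text strings, so accented
letters need not be listed by hand. The function name split_words is made
up for this example, and the NFC normalization is an added assumption so
that letters written with combining accents are treated like their
precomposed forms.

```python
import re
import unicodedata

WORD = re.compile(r"\w+")  # Unicode-aware for str patterns in Python 3


def split_words(text):
    """Split text into words without a hand-written list of separators."""
    # Compose "e" + combining accent into a single accented letter first
    text = unicodedata.normalize("NFC", text)
    return WORD.findall(text)
```

Digits and underscores count as word characters here, so tokens such as
ext2 stay in one piece.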

Davide, you still have to iterate across all languages and to do other
stuff. But I'm sure you know the required shell snippets, right?

PS: Another script I run once per year is pattern-match
http://alioth.debian.org/snippet/detail.php?type=snippet&id=2
This script checks for matching patterns in the specified file. It was
written mainly to revise parentheses, braces, brackets, ... in my math
documents.

  Examples:
   "([x])", "{\|x\||y|}", ... correct
   "([x)]", "\|||", "{()", ... incorrect

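The kind of check these examples describe can be sketched with a stack.
This is a Python illustration, not the pattern-match script itself; the
name is_balanced and the rule for the symmetric delimiters | and \| (the
same token closes it if it is on top of the stack, otherwise it opens a
new level) are assumptions chosen so that the examples above come out as
listed.

```python
import re

# A token is either the two-character \| or a single bracket, brace or bar
TOKEN = re.compile(r"\\\||[()\[\]{}|]")
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())
SYMMETRIC = {"|", "\\|"}  # delimiters whose opening and closing form are equal


def is_balanced(text):
    """Return True if all delimiters in text are properly nested."""
    stack = []
    for token in TOKEN.findall(text):
        if token in OPENERS:
            stack.append(token)
        elif token in SYMMETRIC:
            if stack and stack[-1] == token:
                stack.pop()          # closes the innermost open delimiter
            else:
                stack.append(token)  # opens a new level
        elif not stack or stack.pop() != PAIRS[token]:
            return False             # closing delimiter without its partner
    return not stack                 # anything left open is an error
```

Directly nested symmetric delimiters such as ||x|| are ambiguous under
this rule and are read as two empty pairs around x.
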
Jens

Attachment: check_typos.pl.gz
Description: Binary data
