Re: check_typos.pl script (Was: Re: A few typos in debian-installer)

To: debian-i18n@lists.debian.org
Subject: Re: check_typos.pl script (Was: Re: A few typos in debian-installer)
From: Davide Viti <zinosat@tiscali.it>
Date: Sun, 25 Sep 2005 18:58:25 GMT
Message-id: <pan.2005.09.25.18.58.31.385084@tiscali.it>
References: <42D7AE5B000A5B1E@mail-7.mail.tiscali.sys> <200509242315.09106.aragorn@tiscali.nl> <20050924224034.GA8428@pluto> <200509250141.49377.aragorn@tiscali.nl> <20050925101519.GA31312@pluto> <20050925105124.GA27920@djedefre.onera> <20050925125705.GA32389@pluto> <pan.2005.09.25.16.19.09.31840@tiscali.it> <20050925181559.GA2736@pluto>

Hi Jens,

> OK, I attached my script. It's really far far away from being perfect
> and every Perl hacker knows better ways to do it, but nevertheless I
> found it very useful.
> 

You probably haven't seen my scripts yet :)

> I implemented three tests:
>  * swapped characters (helol)
>  * duplicated character (helllo)
>  * removed characters (helo)
>  (a check for doubled words is missing missing)

Cool!
The spellchecker uses aspell for spotting such typos, but obviously only
for the languages that have an aspell dictionary. For the other languages
the spellchecker can still be useful, but not for syntax checking, so your
script might fill that gap!
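
To make sure I read the three tests correctly, here is a minimal Python
sketch of how I understand them. It is not taken from check_typos.pl; the
function names are made up for illustration, and each function simply lists
the misspellings one such slip could produce from a correct word.

```python
def swapped(word):
    """Variants with two adjacent characters exchanged (hello -> helol)."""
    return {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
        if word[i] != word[i + 1]  # swapping equal letters changes nothing
    }


def duplicated(word):
    """Variants with one character typed twice (hello -> helllo)."""
    return {word[:i] + word[i] + word[i:] for i in range(len(word))}


def removed(word):
    """Variants with one character left out (hello -> helo)."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}
```

Note that a removed character can turn one valid word into another valid
word, which may be one reason the third test is noisier.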

> 
> Since all words which were found in a subdirectory (single files cannot
> be tested, sorry) are considered, there is no need for a wordlist or a
> special file format -- text, HTML or XML all work.
> 
> The words which occur most often were checked for one of the specified
> kinds of typo.
> 
> Please apply it using
> $ ./check_typos.pl -d directory -t test-number
> (Test 3 has many false positives).
> The script doesn't modify anything; it just outputs the typos it found,
> similar to:
> 
> bseoin (1) ==> besoin (990)
> 
> This means bseoin was found once but besoin was found 990 times so it's
> likely that the first is a typo. Now I search for bseoin using grep -rw.
> 
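
The idea behind that output line can be sketched in a few lines of Python.
This is only my reading of it, not the logic of check_typos.pl: the two
thresholds (max_rare, min_common) and the function names are assumptions,
and only the swapped-characters test is shown.

```python
import re
from collections import Counter


def count_words(texts):
    """Count every run of letters in the given strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^\W\d_]+", text))
    return counts


def swap_suspects(counts, max_rare=2, min_common=20):
    """Yield (rare, n, common, m) when a rare word is a frequent word
    with two adjacent letters exchanged. Thresholds are hypothetical."""
    for word, n in sorted(counts.items()):
        if n > max_rare:
            continue
        for i in range(len(word) - 1):
            cand = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            m = counts.get(cand, 0)
            if cand != word and m >= min_common:
                yield word, n, cand, m


def format_suspect(rare, n, common, m):
    """Render one finding in the 'typo (n) ==> word (m)' style."""
    return f"{rare} ({n}) ==> {common} ({m})"
```

Since everything is counted from the text itself, no wordlist is needed,
which matches what you describe.
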
> This script is much more efficient than aspell or other spell checkers.
> It also finds typos in names and URLs (Meyer vs. Mayer,
> php382&amp;tzd_d vs. php381&amp;tzd_d)
> 
> I created my last patch by running my script against the full packages/po/
> directory. This has the advantage that strings in msgid's and msgstr's
> are compared at the same time and I was able to find even consistent
> typos across a language file, such as etx2 and boostrap.
> Nevertheless it's also suggested to restrict tests to only one language.
> That's why I usually extract msgstr strings using (is there really no
> msg* command to do this??)
> 
> cat packages/po/de.po | msgconv | \
>  awk '/^msgstr/ {t=1};
>       /^msgid/ {t=0}; {
>        # t==1: inside a translation; skip comment lines only
>        if (t==1 && $0 !~ /^#/) {
>          gsub("^msgstr ", "");
>          gsub("^\"", "");
>          gsub("\"$", "");
>          gsub("\\\\n", " ");
>          print
>        }
>       }' > /tmp/check/de
> (not yet tested with plural forms of PO files).
> 

Well, my scripts take care of stripping all the unneeded stuff, so I can
use text files containing *only* translated strings (look at any of the
files in the "messages" column at http://d-i.alioth.debian.org/spellcheck/).

> Attention: Since I do not know Perl well enough, I explicitly wrote the
> word separators into the code (\W seems not to be locale-specific). So I
> suggest you add common accented characters for other languages to the
> script, line 48.
> Do you know a solution for this?
> 

I don't know perl, but I'll have a look at this and in the worst case I'm
sure somebody will be happy to take a look at it
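
For comparison, this is how the same problem looks in Python, where regular
expressions on text strings are Unicode-aware by default, so no list of
accented characters is needed. It is only a sketch of the idea and says
nothing about how line 48 of your script should change; the function name
is made up.

```python
import re
import unicodedata

# One or more letters: "word characters" minus digits and the underscore.
_WORD = re.compile(r"[^\W\d_]+")


def split_words(text):
    """Split text into words, keeping accented letters inside the words."""
    # Compose base letter + combining accent into one character first,
    # otherwise a decomposed accent would cut the word in two.
    text = unicodedata.normalize("NFC", text)
    return _WORD.findall(text)
```

Whether digits should count as word characters (think of ext2) is a choice
this sketch makes one way; the script may well want the other.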

> Davide, you still have to iterate across all languages and to do other
> stuff. But I'm sure you know the required shell snippets, right?
> 

I'll take a look in the next few days; I think I won't have any problem
with this
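
Roughly, the loop I have in mind looks like the sketch below. It only
builds the command lines, using the -d and -t options from your mail. The
layout (one subdirectory per language under a base directory) and the
function name are my assumptions, not something that exists yet.

```python
from pathlib import Path


def typo_check_commands(base_dir, test_numbers=(1, 2, 3),
                        script="./check_typos.pl"):
    """Build one check_typos.pl command line per language directory and test.

    Assumes base_dir holds one subdirectory per language; plain files in
    base_dir are ignored because the script only accepts directories.
    """
    commands = []
    lang_dirs = sorted(p for p in Path(base_dir).iterdir() if p.is_dir())
    for lang_dir in lang_dirs:
        for test in test_numbers:
            commands.append([script, "-d", str(lang_dir), "-t", str(test)])
    return commands
```

Each list can then be handed to subprocess.run() and the output collected
per language.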

> PS: Another script I run once per year is pattern-match
> http://alioth.debian.org/snippet/detail.php?type=snippet&id=2 This
> script checks for matching patterns in the specified file. It was
> written mainly to revise parentheses, braces, brackets, ... in my math
> documents.
>  
>   Examples:
>    "([x])", "{\|x\||y|}", ... correct
>    "([x)]", "\|||", "{()", ... incorrect

oh yes! I tried it a while back and thought about integrating it with the
spellchecker; I think I'd like to focus more on the syntax before taking
care of such specific stuff, but I'm sure sooner or later it'll be very
useful
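
For the record, the kind of check I mean can be sketched with a stack, as
below. This is not the pattern-match snippet itself: the rule for the
symmetric delimiters | and \| (close the delimiter if the same one is on
top of the stack, otherwise open a new one) is my own guess, chosen because
it agrees with your five examples.

```python
import re

# "\|" must be tried before "|" so the two-character delimiter stays whole.
_TOKEN = re.compile(r"\\\||[()\[\]{}|]")
_CLOSERS = {")": "(", "]": "[", "}": "{"}
_OPENERS = set(_CLOSERS.values())


def is_balanced(text):
    """Return True if every delimiter in text is properly paired and nested."""
    stack = []
    for tok in _TOKEN.findall(text):
        if tok in _OPENERS:
            stack.append(tok)
        elif tok in _CLOSERS:
            if not stack or stack.pop() != _CLOSERS[tok]:
                return False
        elif stack and stack[-1] == tok:
            stack.pop()  # symmetric delimiter closing its partner
        else:
            stack.append(tok)  # symmetric delimiter opening a new pair
    return not stack
```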

ciao

Davide

References:
- [D-I] Variable subsitution (level2)
  - From: zinosat@tiscali.it
- Re: [D-I] Variable subsitution (level2)
  - From: Frans Pop <aragorn@tiscali.nl>
- A few typos in debian-installer (Was: Re: [D-I] Variable subsitution (level2))
  - From: Jens Seidel <jensseidel@users.sf.net>
- Re: A few typos in debian-installer (Was: Re: [D-I] Variable subsitution (level2))
  - From: Frans Pop <aragorn@tiscali.nl>
- Re: A few typos in debian-installer (Was: Re: [D-I] Variable subsitution (level2))
  - From: Jens Seidel <jensseidel@users.sf.net>
- Re: A few typos in debian-installer (Was: Re: [D-I] Variable subsitution (level2))
  - From: Christian Perrier <bubulle@debian.org>
- Re: A few typos in debian-installer (Was: Re: [D-I] Variable subsitution (level2))
  - From: Jens Seidel <jensseidel@users.sf.net>
- Re: A few typos in debian-installer (Was: Re: [D-I] Variable subsitution (level2))
  - From: Davide Viti <zinosat@tiscali.it>
- check_typos.pl script (Was: Re: A few typos in debian-installer)
  - From: Jens Seidel <jensseidel@users.sf.net>
