
Re: check_typos.pl script (Was: Re: A few typos in debian-installer)

Hi Jens,

> OK, I attached my script. It's really far far away from being perfect
> and every perl hacker knows better ways to do it but nevertheless I
> found it very useful.

You probably haven't seen my scripts yet :)

> I implemented three tests:
>  * swapped characters (helol)
>  * duplicated character (helllo)
>  * removed characters (helo)
>  (a check for doubled words is missing missing)

the spellchecker uses aspell for spotting such typos, but obviously only
for the languages that have an aspell dictionary; for the other levels,
the spellchecker can still be useful, but not for syntax checking, so your
script might fill that gap!
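
The three tests quoted above can be sketched roughly as follows. This is
not Jens's Perl script, only a minimal Python illustration of the idea;
the function names and the `ratio` threshold (how much more frequent the
"correct" word must be) are assumptions made up for this example.

```python
from collections import Counter

def swaps(word):
    """Words obtained by swapping two adjacent, different characters."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1) if word[i] != word[i + 1]}

def undouble(word):
    """Words obtained by dropping one character of a repeated pair."""
    return {word[:i] + word[i + 1:]
            for i in range(len(word) - 1) if word[i] == word[i + 1]}

def deletions(word):
    """Words obtained by deleting any single character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def find_typos(counts, test, ratio=10):
    """Return sorted (rare, rare_count, common, common_count) tuples.

    test 1: swapped characters, test 2: duplicated character,
    test 3: removed character. `ratio` is a hypothetical threshold.
    """
    hits = set()
    for rare, n in counts.items():
        if test == 1:
            candidates = swaps(rare)
        elif test == 2:
            candidates = undouble(rare)
        else:
            # The rare word is some longer word minus one character.
            # Quadratic in the vocabulary size; fine for a sketch.
            candidates = {w for w in counts
                          if len(w) == len(rare) + 1 and rare in deletions(w)}
        for common in candidates:
            m = counts.get(common, 0)
            if m >= ratio * n:
                hits.add((rare, n, common, m))
    return sorted(hits)

counts = Counter({"hello": 50, "helol": 1, "helllo": 1, "helo": 2})
for test in (1, 2, 3):
    print(test, find_typos(counts, test))
```

Test 3 accepts any word that is one deletion away from a much more
frequent one, which probably explains the many false positives Jens
mentions for it.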

> Since all words which were found in a subdirectory (single files cannot
> be tested, sorry) are considered, there is no need for a wordlist or a
> special file format -- text, HTML or XML all work.
> The words which occur most often were checked for one of the specified
> kinds of typo.
> Please apply it using
> $ ./check_typos.pl -d directory -t test-number
> (Test 3 has many false positives).
> The script doesn't modify anything; it just outputs the typos it finds,
> similar to:
> bseoin (1) ==> besoin (990)
> This means bseoin was found once but besoin was found 990 times so it's
> likely that the first is a typo. Now I search for bseoin using grep -rw.
> This script is much more efficient than aspell or other spell checkers.
> It also finds typos in names and URLs (Meyer vs. Mayer,
> php382&tzd_d vs. php381&tzd_d)
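
The counting step and the report line quoted above might look something
like this in Python. It is only a sketch: the `\w+` tokenizer, the UTF-8
decoding and the names `count_words` and `report_line` are assumptions,
not what check_typos.pl actually does.

```python
import os
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumed tokenizer; the real script lists its separators

def count_words(root):
    """Count every word in every file below the directory `root`."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # No wordlist and no special file format: every file is read as text.
            with open(path, encoding="utf-8", errors="replace") as fh:
                for line in fh:
                    counts.update(WORD.findall(line))
    return counts

def report_line(counts, rare, common):
    """Format one suspected typo in the style 'bseoin (1) ==> besoin (990)'."""
    return f"{rare} ({counts[rare]}) ==> {common} ({counts[common]})"
```

Because the counts come from the files themselves, anything that occurs
often enough acts as its own dictionary, names and URL fragments
included.
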
> I created my last patch by running my script against the full packages/po/
> directory. This has the advantage that strings in msgid's and msgstr's
> are compared at the same time and I was able to find even consistent
> typos across a language file, such as etx2 and boostrap.
> Nevertheless it's also suggested to restrict tests to only one language.
> That's why I usually extract msgid strings using (is there really no
> msg* command to do this??)
> cat packages/po/de.po | msgconv | \
>  awk '/^msgstr/ {t=1};
>       /^msgid/ {t=0}; {
>        if (t==1 && index($0, "#")==0) {
>          gsub("^msgstr ", "");
>          gsub("^\"", "");
>          gsub("\"$", "");
>          gsub("\\\\n", " ");
>          print
>        }
>       }' > /tmp/check/de
> (not yet tested with plural forms of PO files).
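
For comparison, the awk filter quoted above translates to Python roughly
as below. This is a line-for-line sketch of the same logic, not a proper
PO parser: like the awk version it skips every line containing "#", it
emits the blank lines between entries, and it does not strip the
"msgstr[0]" prefix of plural forms. The function name `msgstr_lines` is
made up for this example.

```python
import re

def msgstr_lines(po_text):
    """Yield the text of msgstr lines and their continuation lines."""
    in_msgstr = False
    for line in po_text.splitlines():
        # Same state machine as the awk script: msgstr switches output on,
        # msgid switches it off again.
        if line.startswith("msgstr"):
            in_msgstr = True
        if line.startswith("msgid"):
            in_msgstr = False
        if not in_msgstr or "#" in line:
            continue
        line = re.sub(r"^msgstr ", "", line)   # drop the keyword
        line = re.sub(r'^"', "", line)         # drop the opening quote
        line = re.sub(r'"$', "", line)         # drop the closing quote
        yield line.replace("\\n", " ")         # literal \n becomes a space

sample = 'msgid "need"\nmsgstr "besoin\\n"\n"suite"\n'
print(list(msgstr_lines(sample)))
```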

well, my scripts take care of stripping all the unneeded stuff, so I can
use text files containing *only* translated strings
( look at any of the files in the
"messages" column at http://d-i.alioth.debian.org/spellcheck/)

> Attention: Since I don't know Perl well enough, I explicitly wrote the
> word separators into the code (\W does not seem to be locale-specific). So I
> suggest you add common accents for other languages to the script, line 48.
> Do you know a solution for this?

I don't know perl, but I'll have a look at this and in the worst case I'm
sure somebody will be happy to take a look at it
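
One possible direction, shown in Python rather than Perl, is to let the
regex engine decide what counts as a letter via Unicode instead of
listing separators and accents by hand. A minimal sketch, assuming the
input is already decoded text; `split_words` is a made-up name. In Perl
the analogous tool would presumably be a Unicode property such as \p{L}
applied to decoded strings, but that is untested here.

```python
import re
import unicodedata

# In Python 3, \w is Unicode-aware for str patterns; excluding \d and _
# leaves runs of letters, accented ones included.
LETTERS = re.compile(r"[^\W\d_]+")

def split_words(text):
    """Split text into words made of letters of any script."""
    # Normalise first so that 'e' + combining accent becomes one letter.
    return LETTERS.findall(unicodedata.normalize("NFC", text))

print(split_words("Größe, café déjà-vu 42"))
```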

> Davide, you still have to iterate across all languages and to do
> stuff. But I'm sure you know the required shell snippets, right?

I'll take a look in the next few days; I think I won't have any problem
with this

> PS: Another script I run once per year is pattern-match
> http://alioth.debian.org/snippet/detail.php?type=snippet&id=2 This
> script checks for matching patterns in the specified file. It was
> written mainly to revise parentheses, braces, brackets, ... in my math
> documents.
>   Examples:
>    "([x])", "{\|x\||y|}", ... correct
>    "([x)]", "\|||", "{()", ... incorrect
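
The kind of check described above is usually done with a stack. The
following Python sketch is not the pattern-match script itself; it only
reproduces the quoted examples. The greedy rule for the symmetric
delimiters "|" and "\|" (close if the same token is on top of the stack,
otherwise open) is an assumption and cannot resolve every nested case.

```python
import re

TOKEN = re.compile(r"\\\||[()\[\]{}|]")   # "\|" is tried before a bare "|"
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

def is_balanced(text):
    """Return True if all delimiters in `text` are properly nested."""
    stack = []
    for tok in TOKEN.findall(text):
        if tok in OPENERS:
            stack.append(tok)
        elif tok in PAIRS:
            # A closing delimiter must match the most recent opener.
            if not stack or stack.pop() != PAIRS[tok]:
                return False
        elif stack and stack[-1] == tok:
            stack.pop()           # symmetric delimiter closes its twin
        else:
            stack.append(tok)     # symmetric delimiter opens
    return not stack

for s in ["([x])", r"{\|x\||y|}", "([x)]", r"\|||", "{()"]:
    print(s, is_balanced(s))
```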

oh yes! I tried it a while back and thought about integrating it with the
spellchecker; I think I'd like to focus more on the syntax before taking
care of such specific stuff, but I'm sure sooner or later it'll be very


