Re: check_typos.pl script (Was: Re: A few typos in debian-installer)
Hi Jens,
> OK, I attached my script. It's really far, far away from being perfect
> and every Perl hacker knows better ways to do it, but nevertheless I
> found it very useful.
>
You probably haven't seen my scripts yet :)
> I implemented three tests:
> * swapped characters (helol)
> * duplicated character (helllo)
> * removed characters (helo)
> (a check for doubled words is missing missing)
cool,
the spellchecker uses aspell for spotting such typos, but obviously only
for the languages that have an aspell dictionary; for the other levels,
the spellchecker can still be useful, but not for syntax checking, so your
script might fill that gap!
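
For illustration, the doubled-word check mentioned above as missing could be sketched roughly as follows. This is Python rather than the script's Perl, and the function name find_doubled_words is made up for this example; it is not part of check_typos.pl.

```python
import re

# A word, some whitespace, then the same word again (case-insensitive).
DOUBLED = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def find_doubled_words(text):
    """Return each word that is immediately repeated in text."""
    return [m.group(1) for m in DOUBLED.finditer(text)]

print(find_doubled_words("a check for doubled words is missing missing"))
# prints ['missing']
```

Legitimate repetitions such as "that that" would be reported too, so the output still needs a human look.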
>
> Since all words which were found in a subdirectory (single files cannot
> be tested, sorry) are considered, there is no need for a wordlist or a
> special file format -- text, HTML or XML all work.
>
> The words which occur most often were checked for one of the specified
> kinds of typo.
>
> Please apply it using
> $ ./check_typos.pl -d directory -t test-number
> (Test 3 has many false positives).
> The script doesn't modify anything; it just outputs the typos it finds,
> similar to:
>
> bseoin (1) ==> besoin (990)
>
> This means bseoin was found once but besoin was found 990 times, so it's
> likely that the first is a typo. I then search for bseoin using grep -rw.
>
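
The idea described so far (count every word, then look for rare words that are a swapped, duplicated, or removed-character variant of a frequent one) might be sketched like this. It is a Python illustration, not the actual Perl script; the names typo_variants and find_typos and the thresholds min_common and max_rare are assumptions made for the example.

```python
from collections import Counter

def typo_variants(word, test):
    """Yield misspellings of word: test 1 = swap, 2 = duplicate, 3 = remove."""
    for i in range(len(word)):
        if test == 1 and i + 1 < len(word) and word[i] != word[i + 1]:
            # swap two neighbouring characters (hello -> helol)
            yield word[:i] + word[i + 1] + word[i] + word[i + 2:]
        elif test == 2:
            # duplicate one character (hello -> helllo)
            yield word[:i] + word[i] + word[i:]
        elif test == 3 and len(word) > 1:
            # remove one character (hello -> helo)
            yield word[:i] + word[i + 1:]

def find_typos(counts, test, min_common=10, max_rare=2):
    """Return sorted (typo, typo_count, word, word_count) tuples.

    min_common and max_rare are hypothetical cut-offs, not taken
    from the original script.
    """
    hits = set()
    for word, n in counts.items():
        if n < min_common:
            continue
        for variant in typo_variants(word, test):
            m = counts.get(variant, 0)
            if variant != word and 0 < m <= max_rare:
                hits.add((variant, m, word, n))
    return sorted(hits)

counts = Counter({"besoin": 990, "bseoin": 1, "hello": 40, "helllo": 1, "helo": 2})
for test in (1, 2, 3):
    for typo, m, word, n in find_typos(counts, test):
        print(f"{typo} ({m}) ==> {word} ({n})")
```

With the sample counts this prints one line per test, in the same "typo (count) ==> word (count)" shape as the output quoted above.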
> This script is much more efficient than aspell or other spell checkers.
> It also finds typos in names and URLs (Meyer vs. Mayer,
> php382&tzd_d vs. php381&tzd_d)
>
> I created my last patch by running my script against the full packages/po/
> directory. This has the advantage that strings in msgid's and msgstr's
> are compared at the same time and I was able to find even consistent
> typos across a language file, such as etx2 and boostrap.
> Nevertheless it's also suggested to restrict tests to only one language.
> That's why I usually extract msgstr strings using (is there really no
> msg* command to do this??)
>
> cat packages/po/de.po | msgconv | \
> awk '/^msgstr/ {t=1};
> /^msgid/ {t=0}; {
> if (t==1 && index($0, "#")==0) {
> gsub("^msgstr ", "");
> gsub("^\"", "");
> gsub("\"$", "");
> gsub("\\\\n", " ");
> print
> }
> }' > /tmp/check/de
> (not yet tested with plural forms of PO files).
>
well, my scripts take care of stripping all the unneeded stuff, so I can
use text files containing *only* translated strings
(look at any of the files in the
"messages" column at http://d-i.alioth.debian.org/spellcheck/)
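
For comparison, the msgstr extraction done by the awk pipeline quoted above could be sketched in Python roughly as follows. This is only an illustration under simple assumptions: the function name extract_msgstr is made up, only the \n escape is unfolded, and the PO header entry is not filtered out.

```python
def extract_msgstr(po_text):
    """Return the translated strings (msgstr) of a PO file as a list."""
    strings, current = [], None
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith("msgstr"):
            # Start a new translation; also covers msgstr[0], msgstr[1], ...
            current = []
            strings.append(current)
            line = line.split(None, 1)[1] if " " in line else ""
        elif line.startswith(("msgid", "msgctxt", "#")) or not line:
            # Anything that is not part of a translation ends the current one.
            current = None
            continue
        if current is not None and line.startswith('"') and line.endswith('"'):
            # Drop the surrounding quotes and turn "\n" escapes into spaces.
            current.append(line[1:-1].replace("\\n", " "))
    joined = ("".join(parts) for parts in strings)
    return [s for s in joined if s.strip()]

sample = '''# comment
msgid "Hello"
msgstr "Hallo"

msgid "Two\\nlines"
msgstr ""
"Zwei\\n"
"Zeilen"
'''
print(extract_msgstr(sample))
# prints ['Hallo', 'Zwei Zeilen']
```

Because msgstr[N] lines are treated like plain msgstr lines, each plural form simply ends up as its own string.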
> Attention: Since I do not know Perl well enough, I explicitly wrote the
> word separators into the code (\W does not seem to be locale-aware). So I
> suggest you add common accents for other languages to the script, line 48.
> Do you know a solution for this?
>
I don't know perl, but I'll have a look at this and in the worst case I'm
sure somebody will be happy to take a look at it
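
As a point of comparison only: in Python 3 the \w class is Unicode-aware for text strings by default, so accented letters do not have to be listed by hand. The sketch below shows that behaviour; it says nothing about how the Perl script reads its input, and the name split_words is made up for the example.

```python
import re

# [^\W\d_] means "word character, but not a digit or underscore",
# which in Python 3 covers accented letters as well.
WORD = re.compile(r"[^\W\d_]+")

def split_words(text):
    """Split text into words made of letters only."""
    return WORD.findall(text)

print(split_words("J'ai besoin d'un caf\u00e9, na\u00efve Gr\u00f6\u00dfe!"))
```

Apostrophes act as separators here, so "J'ai" is split into "J" and "ai"; whether that is desirable depends on the language being checked.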
> Davide, you still have to iterate across all languages and do other
> stuff. But I'm sure you know the required shell snippets, right?
>
I'll take a look in the next few days; I think I won't have any problem
with this
> PS: Another script I run once per year is pattern-match
> http://alioth.debian.org/snippet/detail.php?type=snippet&id=2 This
> script checks for matching patterns in the specified file. It was
> written mainly to check parentheses, braces, brackets, ... in my math
> documents.
>
> Examples:
> "([x])", "{\|x\||y|}", ... correct
> "([x)]", "\|||", "{()", ... incorrect
oh yes! I tried it a while back and thought about integrating it with the
spellchecker; I think I'd like to focus more on the syntax before taking
care of such specific stuff, but I'm sure sooner or later it'll be very
useful
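
A check of the kind shown in the quoted examples can be sketched with a stack. This is a Python illustration and not the pattern-match snippet itself; the name is_balanced is made up, and treating | and \| as "closing if the same delimiter is on top of the stack, opening otherwise" is an assumption that cannot tell nested absolute values apart.

```python
import re

# \| must be tried before the single | so it is read as one token.
TOKEN = re.compile(r"\\\||[()\[\]{}|]")
PAIRS = {")": "(", "]": "[", "}": "{"}
SYMMETRIC = {"|", "\\|"}  # same token opens and closes

def is_balanced(text):
    """Return True if all delimiters in text are properly nested."""
    stack = []
    for tok in TOKEN.findall(text):
        if tok in SYMMETRIC:
            if stack and stack[-1] == tok:
                stack.pop()          # closes the matching delimiter
            else:
                stack.append(tok)    # opens a new one
        elif tok in PAIRS:
            if not stack or stack.pop() != PAIRS[tok]:
                return False         # wrong or missing opener
        else:
            stack.append(tok)        # opening bracket
    return not stack

for s in [r"([x])", r"{\|x\||y|}", r"([x)]", r"\|||", r"{()"]:
    print(s, is_balanced(s))
```

The first two strings come out as balanced and the last three as unbalanced, matching the correct/incorrect examples quoted above.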
ciao
Davide