Hi Davide,

On Sun, Sep 25, 2005 at 04:18:57PM +0000, Davide Viti wrote:
> >> The tools you use seem anyway to be very good at spotting stuff that
> >> might be missed by Davide Viti's spellchecker so I suggest you both
> >> talk together to see what could be done to integrate your scripts to
> >> the spellchecker.
> >
> > There is nevertheless also a lot of manual work involved. But I agree
> > that the results of the script should be published. I'm not able to fix
> > all languages (even if I try it sometimes :-))
> > I will contact Davide.
>
> I'm open to integrating your scripts with the spellchecking system.
> ATM things are a bit messy because I switched to using a complete Sarge
> system for the scripts and some things have to be fixed yet, but could be
> a good idea discussing things already.

OK, I attached my script. It's really far, far away from being perfect,
and every Perl hacker knows better ways to do it, but nevertheless I
found it very useful. I implemented three tests:

 * swapped characters (helol)
 * duplicated character (helllo)
 * removed characters (helo)

(a check for doubled words is missing missing)

Since all words found in a directory tree are considered (single files
cannot be tested, sorry), there is no need for a wordlist or a special
file format: text, HTML or XML all work. The most frequent words are
checked for variants with one of the specified kinds of typo. Please
apply it using

  $ ./check_typos.pl -d directory -t test-number

(Test 3 has many false positives). The script doesn't modify anything;
it just outputs the typos it found, similar to:

  bseoin (1) ==> besoin (990)

This means bseoin was found once but besoin was found 990 times, so it's
likely that the first is a typo. I then search for bseoin using
grep -rw.

This script is much more efficient than aspell or other spell checkers.
It also finds typos in names and URLs (Meyer vs. Mayer, php382&tzd_d vs.
php381&tzd_d).

I created my last patch by running my script against the full
packages/po/ directory. This has the advantage that strings in msgids
and msgstrs are compared at the same time, and I was able to find even
typos that are consistent across a language file, such as etx2 and
boostrap.

Nevertheless, I also suggest restricting the tests to a single language.
That's why I usually extract the msgstr strings using (is there really
no msg* command to do this??)

  cat packages/po/de.po | msgconv | \
    awk '/^msgstr/ {t=1}; /^msgid/ {t=0};
         { if (t==1 && index($0, "#")==0) {
             gsub("^msgstr ", ""); gsub("^\"", ""); gsub("\"$", "");
             gsub("\\\\n", " "); print } }' > /tmp/check/de

(not yet tested with plural forms of PO files).

Attention: Since I do not know Perl well enough, I explicitly wrote the
word separators into the code (\W seems not to be locale specific). So I
suggest you add common accents for other languages to the script, line
48. Do you know a solution for this?

Davide, you still have to iterate across all languages and do other
stuff. But I'm sure you know the required shell snippets, right?

PS: Another script I run once per year is pattern-match

  http://alioth.debian.org/snippet/detail.php?type=snippet&id=2

This script checks for matching patterns in the specified file. It was
written mainly to check parentheses, braces, brackets, ... in my math
documents. Examples:

  "([x])", "{\|x\||y|}", ...   correct
  "([x)]", "\|||", "{()", ...  incorrect

Jens
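
For readers without the attachment, the three tests described in the
mail can be sketched as follows. This is only an illustration in Python,
not the attached Perl script: the function names, the thresholds
(min_common, max_rare) and the minimum word length are assumptions made
for this example. Words are matched as runs of Unicode letters, which
Python's re module supports for str patterns, so accented characters need
no hand-written separator list here.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters (no digits, no "_").
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(text):
    """Count every word in the text, lowercased."""
    return Counter(w.lower() for w in WORD_RE.findall(text))


def swapped(word):
    """Test 1: two adjacent characters swapped (hello -> helol)."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)} - {word}


def duplicated(word):
    """Test 2: one character doubled (hello -> helllo)."""
    return {word[:i] + word[i] + word[i:] for i in range(len(word))}


def removed(word):
    """Test 3: one character dropped (hello -> helo)."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


TESTS = {1: swapped, 2: duplicated, 3: removed}


def find_typos(counts, test, min_common=10, max_rare=2):
    """Return sorted (typo, typo_count, word, word_count) tuples.

    A frequent word (count >= min_common) is taken as the reference;
    any variant of it that occurs at most max_rare times is reported.
    Both thresholds are hypothetical values chosen for this sketch.
    """
    hits = []
    for word, n in counts.items():
        if n < min_common or len(word) < 4:
            continue
        for variant in TESTS[test](word):
            m = counts.get(variant, 0)
            if 0 < m <= max_rare:
                hits.append((variant, m, word, n))
    return sorted(hits)


if __name__ == "__main__":
    sample = "besoin " * 990 + "bseoin"
    for typo, m, word, n in find_typos(count_words(sample), test=1):
        print(f"{typo} ({m}) ==> {word} ({n})")
```

Like the script described above, the sketch only reports candidates and
changes nothing. Test 3 is the noisiest here as well, since dropping one
letter from a common word often yields another valid word.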
Attachment:
check_typos.pl.gz
Description: Binary data