Re: check_typos.pl script
On Sat, Oct 08, 2005 at 02:25:05AM +0200, Helmut Wollmersdorfer wrote:
> Jens Seidel wrote:
>
> >I implemented three tests:
> > * swapped characters (helol)
> > * duplicated character (helllo)
> > * removed characters (helo)
> > (a check for doubled words is missing missing)
I'm late, but a few minutes ago I contacted all translators who haven't
yet fixed their corresponding d-i PO files.
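Just to make the comparison concrete, the three quoted tests (plus the
doubled-word check) could be sketched like this -- an illustrative Python
reimplementation, not the actual check_typos.pl code:

```python
def variants(word):
    """Generate strings one 'simple typo' away from word:
    swapped adjacent characters, a duplicated character,
    or a removed character."""
    out = set()
    for i in range(len(word) - 1):              # swapped (helol)
        out.add(word[:i] + word[i+1] + word[i] + word[i+2:])
    for i in range(len(word)):
        out.add(word[:i] + word[i] + word[i:])  # duplicated (helllo)
        out.add(word[:i] + word[i+1:])          # removed (helo)
    out.discard(word)
    return out

def doubled_words(text):
    """Find immediately repeated words, e.g. 'missing missing'."""
    words = text.split()
    return [w for w, nxt in zip(words, words[1:]) if w == nxt]
```

Generating the candidate variants of a rare word and looking them up in a
word-frequency table avoids computing distances between all pairs of words.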
> In a more general way, you can compute the 'edit distance', which means,
> how many characters are different.
>
> An edit distance of 1 would also catch 'hallo' or 'gello' or 'ello' or
> 'hell' (which would pass a spellchecker).
I know, but since I already got many false positives at a distance of 1,
I never tried larger distances. Long words may indeed contain multiple
typos, so it's maybe a good idea to use a maximal distance of
length(word)/10. Especially German and a few other languages would profit.
But now I see that even at distance 1 your test matches more errors (I
sometimes also checked for Deb[^i]an and similar stuff, but without
proper script code).
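The length(word)/10 idea could be tried on top of a plain
dynamic-programming Levenshtein; a sketch (the max(1, ...) floor is my
assumption, so that short words still get compared at distance 1):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (two rows of memory)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def max_distance(word):
    """length(word)/10 threshold, floored at 1 (the floor is my
    assumption) so that short words still get distance 1."""
    return max(1, len(word) // 10)
```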
> In my script with a similar intention I coded the 'Ukkonen' algorithm,
> after I experienced that the more general 'Levenshtein' is very slow.
I don't know these algorithms; in the past my script was only an ugly
workaround for me and produced sufficient output. But I will definitely
test it (the code looks much cleaner; I'm a C/C++/Fortran77 coder, not a
Perl hacker :-))
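As far as I understand it, the speedup attributed to 'Ukkonen' comes from
only filling a diagonal band of width 2k+1 in the DP matrix when you only
care whether the distance is <= k, and bailing out as soon as a whole row
exceeds k. A rough sketch of that cutoff idea (my reading, not Helmut's
actual code):

```python
def within_distance(a, b, k):
    """True iff the edit distance of a and b is <= k.  Fills only a
    diagonal band of the DP matrix and stops early once a whole row
    exceeds k (Ukkonen's cutoff idea)."""
    if abs(len(a) - len(b)) > k:
        return False
    INF = k + 1                       # stands for "already above k"
    prev = [j if j <= k else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        if i - k <= 0:                # column 0 still inside the band
            cur[0] = i if i <= k else INF
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
        if min(cur) > k:                       # early cutoff
            return False
        prev = cur
    return prev[len(b)] <= k
```

For distance 1 this touches only about 3 cells per row instead of the
full row, which is where the speedup over plain Levenshtein comes from.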
> These are the options of my script (some not implemented yet):
> --top-dir PATH
> --sub-dir PATTERN
> --extension PATTERN
> --syntax [html|di-po|di-templates|hd-php|txt|docbook ...]
> --stop-words FILE
> --problems FILE
> --dicts DICTIONARIES
Where is it available?
> >This means bseoin was found once but besoin was found 990 times so it's
> >likely that the first is a typo. Now I search for bseoin using grep -rw.
>
> I output the complete line, like $path,$line-number,$problem,$line, which
> in the future should allow generating patches.
Right, I always have to apply grep -rw to find the location.
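Combining the frequency heuristic (bseoin once vs. besoin 990 times) with
your path,line-number,problem,line output format might look roughly like
this; the thresholds and the (path, lineno, text) input layout are made up
for the example:

```python
import re
from collections import Counter

def edits(word):
    """Neighbours of word under the swap/remove/duplicate tests."""
    out = set()
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i+1] + word[i] + word[i+2:])  # swapped
    for i in range(len(word)):
        out.add(word[:i] + word[i+1:])                        # removed
        out.add(word[:i] + word[i] + word[i:])                # duplicated
    out.discard(word)
    return out

def suspects(lines, rare_max=1, common_min=10):
    """lines: iterable of (path, lineno, text) tuples.  Flag words
    seen at most rare_max times whose neighbours are common (seen at
    least common_min times), and report each hit as a
    'path,line-number,problem,line' record."""
    lines = list(lines)
    counts = Counter(w for _, _, t in lines
                     for w in re.findall(r"[\w'-]+", t))
    common = {w for w, n in counts.items() if n >= common_min}
    reports = []
    for path, lineno, text in lines:
        for w in re.findall(r"[\w'-]+", text):
            if counts[w] <= rare_max and edits(w) & common:
                reports.append(f"{path},{lineno},{w},{text}")
    return reports
```

With the location in each record, the grep -rw step becomes unnecessary.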
> >This script is much more efficient than aspell or other spell checkers.
> >It also finds typos in names and URLs (Meyer vs. Mayer,
> >php382&tzd_d vs. php381&tzd_d)
>
> Because the English version is of very high quality, this needs more
> sophisticated techniques like phrase parsing, to detect e.g. 'file
> system' versus 'file-system' versus 'filesystem'.
You refer to d-i, right? There are also many other English documents
which are not of very high quality :-). Also, not every document uses
'file system' rather than 'file-system' or 'filesystem'.
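Detecting such variant spellings doesn't necessarily need phrase parsing;
a crude normalisation pass -- strip hyphens and spaces and group whatever
collides -- already finds the 'file system'/'file-system'/'filesystem'
case. A sketch (the normalisation rule is just an assumption):

```python
from collections import defaultdict

def spelling_variants(words_and_phrases):
    """Group candidate spellings that normalise to the same key when
    hyphens and spaces are removed, e.g. 'file system', 'file-system'
    and 'filesystem' all map to 'filesystem'."""
    groups = defaultdict(set)
    for w in words_and_phrases:
        groups[w.replace("-", "").replace(" ", "").lower()].add(w)
    # only report keys with more than one observed spelling
    return {k: v for k, v in groups.items() if len(v) > 1}
```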
> >That's why I usually extract msgid strings using (is there really no
> >msg* command to do this??)
>
> With extraction to a file you lose the context. That's why I prefer
> to have the original file context and apply parser plugins to filter
> away the surrounding syntax.
Right. Nevertheless, it's the common UNIX philosophy to pipe through
different simple programs. But I agree that your solution is probably better.
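For the di-po case, a parser plugin that keeps the context could record
line numbers while stripping the PO syntax; a minimal sketch that only
handles single-line msgid entries (no continuation lines, plurals, or
escape handling):

```python
import re

def msgids_with_lines(po_text):
    """Yield (lineno, msgid) pairs from PO file text, so a later
    report can still point back at the source line.  Only the simple
    single-line msgid "..." form is handled in this sketch."""
    for lineno, line in enumerate(po_text.splitlines(), 1):
        m = re.match(r'msgid\s+"(.*)"\s*$', line)
        if m and m.group(1):          # skip the empty header msgid
            yield lineno, m.group(1)
```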
>
> We seem to have similar ideas, so we should share them. My aim is to
> have a toolbox for semi-automatic reviews of documents - i.e. wording
> consistency, candidates for glossaries, undocumented functions.
> Detection of typos or wrong spelling is not the aim, but a side effect.
It would be nice to have such a tool. I remember that I once also
checked <packages> tags in SGML documents against a list of available
Debian packages. There are many more tests possible ...
Jens