
Re: check_typos.pl script



On Sat, Oct 08, 2005 at 02:25:05AM +0200, Helmut Wollmersdorfer wrote:
> Jens Seidel wrote:
> 
> >I implemented three tests:
> > * swapped characters (helol)
> > * duplicated character (helllo)
> > * removed characters (helo)
> > (a check for doubled words is missing missing)

I'm late, but a few minutes ago I contacted all translators who haven't
yet fixed their corresponding d-i po file.

> In a more general way, you can compute the 'edit distance', which means, 
> how many characters are different.
> 
> An edit distance of 1 would also catch 'hallo' or 'gello' or 'ello' or 
> 'hell' (this will be correct under a spellchecker).

I know, but since I already had many false positives at a distance of 1,
I never tried larger distances. Long words may indeed contain multiple
typos, so it may be a good idea to use a maximum distance of
length(word)/10. German and a few other languages in particular would profit.
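To make the idea concrete, here is a minimal sketch in Python of an edit
distance with a length-dependent limit. It uses the plain Levenshtein
distance, not the Ukkonen variant mentioned below, and the names
levenshtein, max_distance and is_candidate are made up for illustration.
Clamping the limit to at least 1 is my assumption; without it every word
shorter than 10 characters would get a limit of 0.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def max_distance(word: str) -> int:
    # Assumption: length(word)/10, rounded down, but never below 1.
    return max(1, len(word) // 10)


def is_candidate(rare: str, common: str) -> bool:
    """True if `rare` could be a typo of `common` under the length-based limit."""
    return rare != common and levenshtein(rare, common) <= max_distance(common)
```

Note that with plain Levenshtein a swap such as 'helol' counts as distance
2, so a separate swapped-characters test is still useful at a limit of 1.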

But now I see that even at distance 1 your test catches more errors (I
sometimes also checked for Deb[^i]an and similar patterns, but without
proper script code).

> In my script with a similar intention I coded the 'Ukkonen' algorithm, 
> after I experienced that the more general 'Levenshtein' is very slow.

I don't know these algorithms; in the past my script was only an ugly
workaround for me and produced sufficient output. But I will definitely
test it (the code looks much cleaner; I'm a C/C++/Fortran77 coder, not a
Perl hacker :-))

> These are the options of my script (some not implemented yet):
> --top-dir PATH
> --sub-dir PATTERN
> --extension PATTERN
> --syntax [html|di-po|di-templates|hd-php|txt|docbook ...]
> --stop-words FILE
> --problems FILE
> --dicts DICTIONARIES

Where is it available?

> >This means bseoin was found once but besoin was found 990 times so it's
> >likely that the first is a typo. Now I search for bseoin using grep -rw.
> 
> I output the complete line like $path,$line-number,$problem,$line, which 
> in future should allow to generate patches.

Right, I always have to apply grep -rw to find the location.
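For illustration, a rough sketch in Python (not the actual check_typos.pl
code) of how the frequency comparison could be combined with the three
tests and report locations directly, so that no separate grep is needed.
The function names, the min_ratio threshold of 100 and the dict-of-texts
input are all assumptions of this sketch; the output fields mirror the
path, line number, problem, line format quoted above.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def variants(word):
    """Misspellings of `word` produced by the three tests."""
    out = set()
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # swapped
    for i in range(len(word)):
        out.add(word[:i] + word[i] + word[i:])  # duplicated character
        out.add(word[:i] + word[i + 1:])        # removed character
    out.discard(word)  # swapping two equal letters changes nothing
    return out


def find_typos(files, min_ratio=100):
    """files maps path -> text; returns (path, line_no, typo, likely, line)."""
    counts = Counter(w for text in files.values() for w in WORD.findall(text))
    suspects = {}
    for word, n in counts.items():
        for v in variants(word):
            m = counts.get(v, 0)
            # A variant is suspicious if it is far rarer than the word itself.
            if m and n >= min_ratio * m:
                suspects[v] = word
    hits = []
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for w in WORD.findall(line):
                if w in suspects:
                    hits.append((path, line_no, w, suspects[w], line))
    return hits
```

With 990 occurrences of besoin and a single bseoin, the latter is reported
together with its file and line.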

> >This script is much more efficient than aspell or other spell checkers.
> >It also finds typos in names and URLs (Meyer vs. Mayer,
> >php382&tzd_d vs. php381&tzd_d)
> 
> Because the English version is of very high quality, this needs more 
> sophisticated techniques like phrase parsing, to detect e.g. 'file 
> system' versus 'file-system' versus 'filesystem'.

You refer to d-i, right? There are also many other English documents
which are not of very high quality :-). Also, not every document uses
'file system' rather than 'file-system' or 'filesystem'.
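A simple approximation that stops well short of real phrase parsing would
be to group spellings that differ only by a space or a hyphen. The Python
sketch below is only an assumption of how that might look; the name
compound_variants is invented, it handles ASCII letters only, and it pairs
adjacent words even across sentence boundaries, so its output is a list of
candidates for review, not of errors.

```python
import re
from collections import defaultdict

# ASCII words, optionally joined by single hyphens (e.g. "file-system").
TOKEN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def compound_variants(text):
    """Group spellings that differ only by a space or a hyphen."""
    tokens = TOKEN.findall(text.lower())
    groups = defaultdict(set)
    for tok in tokens:
        # "filesystem" and "file-system" share the key "filesystem".
        groups[tok.replace("-", "")].add(tok)
    for first, second in zip(tokens, tokens[1:]):
        # Two adjacent plain words may be the spaced form of a compound.
        if "-" not in first and "-" not in second:
            groups[first + second].add(first + " " + second)
    # Only keys written in more than one way are interesting.
    return {key: forms for key, forms in groups.items() if len(forms) > 1}
```

Run over a document that mixes all three spellings, this yields one group
keyed by 'filesystem' containing the three forms.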

> >That's why I usually extract msgid strings using (is there really no
> >msg* command to do this??)
> 
> With extraction to a file you lose the context. That's why I prefer 

Right. Nevertheless, it's the common UNIX philosophy to pipe data through
several simple programs. But I agree that your solution is probably better.

> to have the original file context and apply parser-plugins to filter 
> away the surrounding syntax.
> 
> We seem to have similar ideas, so we should share them. My aim is to 
> have a toolbox for semi-automatic reviews of documents - i.e. wording 
> consistency, candidates for glossaries, undocumented functions. 
> Detection of typos or wrong spelling is not the aim, but a side-effect.

It would be nice to have such a tool. I remember that I once also
checked <packages> tags in SGML documents against a list of available
Debian packages. There are many more tests possible ...

Jens


