Re: check_typos.pl script

To: debian-i18n@lists.debian.org
Subject: Re: check_typos.pl script
From: Jens Seidel <jensseidel@users.sf.net>
Date: Sat, 8 Oct 2005 13:47:59 +0200
Message-id: <20051008114759.GP17406@pluto>
Mail-followup-to: debian-i18n@lists.debian.org
In-reply-to: <di73l1$4sg$1@sea.gmane.org>
References: <42D7AE5B000A5B1E@mail-7.mail.tiscali.sys> <200509242315.09106.aragorn@tiscali.nl> <20050924224034.GA8428@pluto> <200509250141.49377.aragorn@tiscali.nl> <20050925101519.GA31312@pluto> <20050925105124.GA27920@djedefre.onera> <20050925125705.GA32389@pluto> <pan.2005.09.25.16.19.09.31840@tiscali.it> <20050925181559.GA2736@pluto> <di73l1$4sg$1@sea.gmane.org>

On Sat, Oct 08, 2005 at 02:25:05AM +0200, Helmut Wollmersdorfer wrote:
> Jens Seidel wrote:
> 
> >I implemented three tests:
> > * swapped characters (helol)
> > * duplicated character (helllo)
> > * removed characters (helo)
> > (a check for doubled words is missing missing)
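For illustration, the three tests listed above (plus the missing doubled-word check) could be sketched roughly as follows. This is not the code of check_typos.pl; all function names are made up for this example.

```python
import re

def swapped(word):
    """Words obtained by swapping two adjacent characters (hello -> helol)."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)} - {word}

def duplicated(word):
    """Words obtained by doubling one character (hello -> helllo)."""
    return {word[:i] + word[i] + word[i:] for i in range(len(word))}

def removed(word):
    """Words obtained by dropping one character (hello -> helo)."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}

def doubled_words(text):
    """Words that are immediately repeated, such as 'missing missing'."""
    return re.findall(r"\b(\w+)\s+\1\b", text)
```

Each function returns the misspellings a given correct word could turn into; a checker would then look for those variants in the text.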

I'm late, but a few minutes ago I contacted all translators who haven't
yet fixed their corresponding d-i po file.

> In a more general way, you can compute the 'edit distance', which means, 
> how many characters are different.
> 
> An edit distance of 1 would also catch 'hallo' or 'gello' or 'ello' or 
> 'hell' (this will be correct under a spellchecker).

I know, but since I already had many false positives at a distance of 1,
I never tried larger distances. Long words may indeed contain multiple
typos, so it may be a good idea to use a maximum distance of
length(word)/10. German in particular, and a few other languages, would
profit.

But now I see that even at distance 1 your test matches more errors (I
sometimes also checked for Deb[^i]an and similar stuff, but without
proper script code).
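To make the length(word)/10 idea concrete, here is a minimal sketch of the plain Levenshtein distance with a length-dependent threshold. The function names and the floor of 1 for short words are assumptions for this example, not something either script is known to do.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def max_distance(word):
    """Assumed threshold: length(word)/10, but never below 1."""
    return max(1, len(word) // 10)

def is_candidate(word, reference):
    """True if word differs from reference by a small, non-zero distance."""
    return 0 < edit_distance(word, reference) <= max_distance(reference)
```

Note that a plain Levenshtein distance counts swapped characters ('helol') as two edits, so a distance limit of 1 alone would not replace the swap test.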

> In my script with a similar intention I coded the 'Ukkonen' algorithm, 
> after I experienced that the more general 'Levenshtein' is very slow.

I don't know these algorithms; in the past my script was only an ugly
workaround for me and produced sufficient output. But I will definitely
test it (the code looks much cleaner; I'm a C/C++/Fortran77 coder, not a
Perl hacker :-))
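As far as I understand it, the speed-up usually attributed to Ukkonen comes from fixing a maximum distance k and computing only the table cells within k of the diagonal. The following is a sketch of that cutoff idea only, not Helmut's code; bounded_distance is a hypothetical name.

```python
def bounded_distance(a, b, k):
    """Return the edit distance of a and b if it is <= k, otherwise None.

    Only cells with |i - j| <= k are evaluated; everything outside the
    band is treated as "more than k". (Full rows are still allocated
    here to keep the sketch short.)
    """
    if abs(len(a) - len(b)) > k:
        return None
    inf = k + 1
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            cost = min(prev[j - 1] + (a[i - 1] != b[j - 1]),  # substitute
                       prev[j] + 1,                           # delete
                       cur[j - 1] + 1)                        # insert
            cur[j] = min(cost, inf)
        if min(cur) >= inf:   # every cell already exceeds k: give up early
            return None
        prev = cur
    return prev[-1] if prev[-1] <= k else None
```

With k = 1 each row needs at most three cells, which would explain why it is so much faster than filling the whole table.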

> These are the options of my script (some not implemented yet):
> --top-dir PATH
> --sub-dir PATTERN
> --extension PATTERN
> --syntax [html|di-po|di-templates|hd-php|txt|docbook ...]
> --stop-words FILE
> --problems FILE
> --dicts DICTIONARIES

Where is it available?

> >This means bseoin was found once but besoin was found 990 times so it's
> >likely that the first is a typo. Now I search for bseoin using grep -rw.
> 
> I output the complete line like $path,$line-number,$problem,$line, which 
> in future should allow to generate patches.

Right, I always have to apply grep -rw to find the location.
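Combining the frequency comparison with your output format could look roughly like this. It is a sketch under assumptions: the input is a dict mapping a path to its list of lines, and min_ratio (how much more frequent the correct word must be) is an invented parameter.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")   # runs of letters, including non-ASCII

def variants(word):
    """Misspellings by one removed, duplicated or swapped character."""
    out = set()
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])            # removed
        out.add(word[:i] + word[i] + word[i:])      # duplicated
        if i + 1 < len(word):                       # swapped
            out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    out.discard(word)
    return out

def find_suspects(files, min_ratio=100):
    """Return (path, line_number, problem, line) for rare variants."""
    counts = Counter(w for lines in files.values()
                     for line in lines for w in WORD.findall(line))
    suspects = {}
    for word, n in counts.items():
        for v in variants(word):
            if v in counts and counts[v] * min_ratio <= n:
                suspects[v] = word
    records = []
    for path, lines in files.items():
        for number, line in enumerate(lines, 1):
            for w in WORD.findall(line):
                if w in suspects:
                    records.append((path, number,
                                    f"{w} -> {suspects[w]}?", line))
    return records
```

Because the second pass keeps the path and line number, the separate grep -rw step would no longer be needed.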

> >This script is much more efficient than aspell or other spell checker.
> >It also finds typos in names and URLs (Meyer vs. Mayer,
> >php382&amp;tzd_d vs. php381&amp;tzd_d)
> 
> Because the English version is of very high quality, this needs more 
> sophisticated techniques like phrase parsing, to detect e.g. 'file 
> system' versus 'file-system' versus 'filesystem'.

You refer to d-i, right? There are also many other English documents
which are not of very high quality :-). Also, not every document uses
'file system' rather than 'file-system' or 'filesystem'.
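Without real phrase parsing, a crude way to find such inconsistencies might be to compare the forms after removing hyphens and spaces. The sketch below is only that: the tokenisation is ASCII-only, it ignores sentence boundaries, and compound_variants is a hypothetical name.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")   # words, hyphens allowed

def compound_variants(text):
    """Map each closed form to the set of spellings seen, if more than one."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    forms = defaultdict(set)
    for t in tokens:                       # 'filesystem', 'file-system'
        forms[t.replace("-", "")].add(t)
    for a, b in zip(tokens, tokens[1:]):   # 'file system'
        key = (a + b).replace("-", "")
        if key in forms:                   # only if another form was seen
            forms[key].add(a + " " + b)
    return {k: v for k, v in forms.items() if len(v) > 1}
```

For a text that uses all three spellings this returns one entry under the key 'filesystem' listing the three forms; pairs like 'in to' versus 'into' would be reported too and need a manual look.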

> >That's why I usually extract msgid strings using (is there really no
> >msg* command to do this??)
> 
> With extraction to a file you lose the context. That's why I like more 

Right. Nevertheless it's the common UNIX philosophy to pipe through
different simple programs. But I agree that your solution is probably better.
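A middle way might be to extract the msgid strings but keep the line number of each one, so that the context can still be found. The following is a simplified sketch, not a full PO parser: it assumes the escapes inside PO strings can be decoded like Python string literals, and it skips the empty header msgid.

```python
import ast

def msgids(lines):
    """Yield (line_number, text) for every non-empty msgid in a PO file."""
    start, parts = None, []
    for number, raw in enumerate(lines, 1):
        line = raw.strip()
        if line.startswith("msgid "):
            if start is not None and "".join(parts):
                yield start, "".join(parts)
            # assumption: the quoted part is a valid Python string literal
            start, parts = number, [ast.literal_eval(line[6:])]
        elif line.startswith('"') and start is not None:
            parts.append(ast.literal_eval(line))   # continuation line
        else:
            if start is not None and "".join(parts):
                yield start, "".join(parts)
            start, parts = None, []
    if start is not None and "".join(parts):
        yield start, "".join(parts)
```

The result can still be piped into other tools, for example as lines of the form number:text.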

> to have the original file context and apply parser-plugins to filter 
> away the surrounding syntax.
> 
> We seem to have similar ideas, so we should share them. My aim is, to 
> have a toolbox for semi-automatic reviews of documents - i.e. wording 
> consistency, candidates for glossaries, undocumented functions. 
> Detection of typos or wrong spelling is not the aim, but a side-effect.

It would be nice to have such a tool. I remember that I once also
checked <packages> tags in SGML documents against a list of available
Debian packages. There are many more tests possible ...
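Such a package check could be sketched as below. The tag name <package>, the function name and the way the list of available packages is passed in (a plain set of names) are assumptions for this example.

```python
import re

# Assumed markup: <package>name</package>, on a single line.
PACKAGE_TAG = re.compile(r"<package>\s*([^<\s]+)\s*</package>", re.IGNORECASE)

def unknown_packages(sgml_lines, available):
    """Return (line_number, name) for tagged names not in `available`."""
    records = []
    for number, line in enumerate(sgml_lines, 1):
        for name in PACKAGE_TAG.findall(line):
            if name not in available:
                records.append((number, name))
    return records
```

Given a document that mentions 'gettxt' and a package list containing only 'gettext', the misspelled name is reported together with its line number.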

Jens
