
Re: ITO: several packages available for adoption



On Tue, Dec 18, 2001 at 12:01:13PM -0500, Dale Scheetz wrote:

> What would be really cool would be for Lintian to check the required text
> for spelling errors according to the Debian template. While this one time
> sweep has been very helpful, things change with time and errors creep in.
> Having a standard way to deal with this without having to remember to
> break out the spell checker would be a very nice enhancement to Lintian.

The only difficulty with this is managing the official Debian
dictionary/wordlist.  The one that I came up with during this scan has
13478 words (about 100 kB), but it is basically a brute-force list, and I am
not at all happy with it.  Most of it is package names and parts thereof, because
otherwise I would have spent a lot of time skipping them due to packages
mentioning one another.  The remainder is probably an even split between
technical jargon and things that should have been ignored (foreign
languages, URL fragments, and pathname fragments).

David Coe was kind enough to incorporate my ispell patch, which groks the
Debian control file format and skips everything but the description fields.
It would be great to enable ispell to skip over URLs and pathnames as well.  David
mentioned that ispell will soon have a new mechanism for doing filtering of
this kind which should make this a trivial operation, driven by an external
script.
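
As a rough illustration of what such an external filter script might do,
here is a minimal Python sketch.  It is not ispell's mechanism (which is not
described here); the function name and both regular expressions are
assumptions made up for this example.  It blanks out URLs and absolute
pathnames with spaces of equal length, so that character offsets reported by
a spell checker would still line up with the original text.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# A URL: a known scheme followed by any run of non-space characters.
URL_RE = re.compile(r"\b(?:https?|ftp)://\S+")
# A pathname: starts a word with "/" or "~/" and continues with
# typical filename characters.  "and/or" is left alone because its
# slash does not start a word.
PATH_RE = re.compile(r"(?<!\S)~?/[\w.+/-]+")


def mask_urls_and_paths(text):
    """Replace URLs and pathnames with spaces of the same length."""
    def blank(match):
        return " " * len(match.group(0))

    # Mask URLs first so their path components are not seen twice.
    text = URL_RE.sub(blank, text)
    return PATH_RE.sub(blank, text)


if __name__ == "__main__":
    sample = "See http://www.debian.org/ and /usr/share/doc for details."
    print(mask_urls_and_paths(sample))
```

Keeping the output the same length as the input is a design choice, not a
requirement; a filter could just as well delete the matches outright.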

Once that is done, I think that the next logical step would be to create a
couple of specialized wordlists so that they can be managed separately:

- technical jargon
- Debian package names and fragments

The former would probably be useful in many other contexts and could be
merged into some wordlist package for general use.  The latter is much
trickier to manage.  It is possible to automatically generate something
pretty useful (as I did), but it is bound to contain false positives for
real words and such.  I don't think that problem can be completely solved
unless some kind of markup is introduced for descriptions, and I don't think
that there's enough reason to do that yet.
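
To make the automatic generation concrete, here is a small Python sketch of
one way a package-name wordlist could be derived from control-file text.
This is an assumed approach, not the script actually used for the scan; the
function name and the minimum fragment length are invented for the example.
It collects the alphabetic fragments of every Package: field.

```python
import re


def package_words(control_text, min_len=3):
    """Collect alphabetic fragments of package names from control text.

    min_len is an arbitrary cutoff (an assumption of this sketch) that
    drops very short fragments such as the "g" in "g++".
    """
    words = set()
    for line in control_text.splitlines():
        if line.startswith("Package:"):
            name = line.split(":", 1)[1].strip().lower()
            # Split on anything that is not a lowercase letter, so
            # "libgtk-perl" yields "libgtk" and "perl".
            for fragment in re.split(r"[^a-z]+", name):
                if len(fragment) >= min_len:
                    words.add(fragment)
    return words


if __name__ == "__main__":
    sample = "Package: libgtk-perl\nDescription: Perl module\n"
    for word in sorted(package_words(sample)):
        print(word)
```

A list built this way accepts every fragment unconditionally, so a fragment
that happens to match a misspelling of a real word would silently pass the
spell check, which is the weakness described above.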

-- 
 - mdz


