
The Second Great Spelling Check (Re: Package descriptions and making them better)



On Wed, Oct 03, 2001 at 12:20:38PM +0300, Richard Braakman wrote:

> I used ispell + some scripts which would extract the Description from
> control files.  After the spelling check I used a text editor to further
> correct the ones that had spelling mistakes (on the theory that errors
> travel in groups).  Then I used templates to turn the diffs into
> bugreports.  See bug#18878 (archived) for a simple example, and #18914 for
> a more complex one.  I've had no negative reactions about them, and all of
> those bugs have been closed by now.

I wrote a patch to make ispell smart about the Debian control file format
(see bug #119782), seeded the dictionary with package names and parts
thereof, and, in chunks over the past month and a half, spell-checked
my entire available file.  The diff is available at:

http://people.debian.org/~mdz/spelling/corrections.diff.gz
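
As a rough illustration of the extraction step (this is not the ispell
patch from bug #119782, nor the scripts quoted above), a minimal Python
sketch that pulls the Description field out of control-style stanzas
might look like the following.  The function name iter_descriptions is
hypothetical, and the parsing assumes the usual layout: stanzas separated
by blank lines, continuation lines starting with whitespace, and a lone
"." standing for an empty line.

```python
import re


def iter_descriptions(text):
    """Yield (package, description) pairs from control-style text.

    Assumes stanzas are separated by blank lines and that continuation
    lines of a field begin with a space or tab.
    """
    for stanza in re.split(r"\n\s*\n", text.strip()):
        package = None
        desc_lines = None
        in_desc = False
        for line in stanza.splitlines():
            if line[:1] in (" ", "\t"):
                # Continuation line: only keep it if it belongs to Description.
                if in_desc:
                    body = line[1:]
                    # A lone "." marks an empty line in the long description.
                    desc_lines.append("" if body == "." else body)
                continue
            in_desc = False
            name, sep, value = line.partition(":")
            if not sep:
                continue  # not a "Field: value" line; ignore it
            name = name.strip().lower()
            if name == "package":
                package = value.strip()
            elif name == "description":
                desc_lines = [value.strip()]
                in_desc = True
        if desc_lines is not None:
            yield package, "\n".join(desc_lines)
```

Only the text yielded here would then be handed to the spelling checker,
so field names and version strings never reach it.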

I would appreciate some feedback before I start filing bugs.  I used the
following guidelines in my corrections:

- Capitalization 

  Names of languages and other proper nouns should be capitalized in
  English.

- Consistency 

  Where possible, I tried to correct packages with the same errors in the
  same way.  The sheer size of this project made it difficult to keep track,
  but I did try.

- Hyphenation 

  Package descriptions must not be hyphenated, but properly word-wrapped.
  Many packages seem to have inherited hyphenation or bad word-wrap through
  cut-and-paste from upstream descriptions.

- Abbreviation 

  Chat abbreviations (e.g. "wrt") and other inappropriate abbreviations
  should not be used in descriptions.

- Keyword searching 

  Descriptions should make themselves suitable for keyword searches by users
  looking for something specific.  See "Word joining" below.

- Word joining

  Technical terms seem to lend themselves to the formation of new words
  through joining, e.g. "lowlevel", "mousewheel", "bugreport".  In cases
  where the joined form is a common technical term, I allowed it, and in
  most other cases I recommended hyphenation or splitting into two words.
  The rationale was to allow for keyword searches to work as expected, so
  there were few hard and fast rules.

- Foreign language descriptions

  Because it would drive me mad to try to spell-check different packages
  against different dictionaries, I did everything with an English
  dictionary, and added foreign language words to the supplementary
  dictionary (which also contains package names, technical words, etc.).
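  This opens a window for errors: a misspelled English word that happens
  to match a foreign word in the supplementary dictionary goes unnoticed.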

- Filenames and URLs

  These were a big pain, and it would be great if someone could extend
  ispell to skip over them.  They are substantially more difficult to
  recognize and parse than Debian control file syntax, so I didn't attempt
  this.  aspell already has this capability, and I considered using it
  instead of ispell (wishlist bug #111929), but the filtering system is a
  maze of objects and templates that made it very unclear how to create a
  new one without some internals documentation.
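
Two of the guidelines above (hyphenation, and filenames and URLs) lend
themselves to a small preprocessing pass.  The following Python sketch is
only an assumption about how such a pass could look; it is not how ispell
or aspell handle this, the function names are hypothetical, and the
regular expressions are deliberately crude heuristics that will miss
unusual URLs and relative paths.

```python
import re

# Heuristic: common URL schemes followed by any run of non-space characters.
URL_RE = re.compile(r"\b(?:https?|ftp)://\S+")

# Heuristic: absolute or home-relative paths such as /usr/share/doc/foo.
# The lookbehind keeps constructs like "and/or" from matching.
PATH_RE = re.compile(r"(?<!\S)~?/(?:[\w.+-]+/)*[\w.+-]+")

# A word fragment, a hyphen at end of line, then the rest of the word.
HYPHEN_BREAK_RE = re.compile(r"(\w+)-\n\s*(\w+)")


def strip_urls_and_paths(text):
    """Replace URLs and path-like tokens with a space before spell-checking."""
    text = URL_RE.sub(" ", text)
    return PATH_RE.sub(" ", text)


def find_line_end_hyphens(description):
    """Return (head, tail) pairs for words hyphenated across a line break."""
    return [(m.group(1), m.group(2))
            for m in HYPHEN_BREAK_RE.finditer(description)]
```

Hyphens inside a line (e.g. "well-known") are left alone; only a hyphen
immediately before a newline is reported, since that is the pattern
inherited from cut-and-pasted upstream text.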

> > Lintian warnings might be a good outlet for such a check.
> 
> Lintian already checks for common spelling errors, and its original
> database was seeded from that round of spelling checking :-)  I think that
> for performance reasons I included only misspellings that occurred in more
> than one description.  Many of the ones you list would also be suitable
> for inclusion, I think.

I didn't keep any automated statistics on the frequency of various
misspellings, but I took some notes based on my subjective observations.
The results are filed as a lintian wishlist bug, whose bug number hasn't
come back yet.
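
For anyone who wants automated frequency numbers, a minimal Python sketch
of the "occurs in more than one description" filter mentioned in the
quoted text could look like this.  It is an assumption-laden illustration:
the function name recurring_misspellings is hypothetical, and "misspelled"
here simply means "not in a supplied set of known words", which is far
cruder than what ispell does.

```python
import re
from collections import Counter


def recurring_misspellings(descriptions, known_words):
    """Return unknown words that appear in more than one description.

    descriptions: iterable of description strings.
    known_words:  set of lowercase words considered correctly spelled.
    """
    counts = Counter()
    for text in descriptions:
        # Count each unknown word at most once per description.
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
        counts.update(w for w in words if w not in known_words)
    return sorted(w for w, n in counts.items() if n > 1)
```

Counting per description rather than per occurrence keeps a single
package that repeats one typo many times from dominating the list.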

-- 
 - mdz
