On Wed, Oct 03, 2001 at 12:20:38PM +0300, Richard Braakman wrote:
> I used ispell + some scripts which would extract the Description from
> control files.  After the spelling check I used a text editor to further
> correct the ones that had spelling mistakes (on the theory that errors
> travel in groups).  Then I used templates to turn the diffs into
> bugreports.  See bug#18878 (archived) for a simple example, and #18914 for
> a more complex one.  I've had no negative reactions about them, and all of
> those bugs have been closed by now.

I wrote a patch to make ispell smart about the Debian control file format
(see bug #119782), seeded the dictionary with package names and parts
thereof, and, in chunks over the course of the past month and a half,
spell-checked my entire available file.

The diff is available at:

http://people.debian.org/~mdz/spelling/corrections.diff.gz

I would appreciate some feedback before I start filing bugs.  I used the
following guidelines in my corrections:

- Capitalization

  Languages and proper nouns should be capitalized in English.

- Consistency

  Where possible, I tried to correct packages with the same errors in the
  same way.  The sheer size of this project made it difficult to keep
  track, but I did try.

- Hyphenation

  Package descriptions must not be hyphenated, but properly word-wrapped.
  Many packages seem to have inherited hyphenation or bad word-wrap
  through cut-and-paste from upstream descriptions.

- Abbreviation

  Chat abbreviations (e.g. "wrt") and other inappropriate abbreviations
  should not be used in descriptions.

- Keyword searching

  Descriptions should be suitable for keyword searches by users looking
  for something specific.  See "Word joining".

- Word joining

  Technical terms seem to lend themselves to the formation of new words
  through joining, e.g. "lowlevel", "mousewheel", "bugreport".  In cases
  where the joined form is a common technical term, I allowed it, and in
  most other cases I recommended hyphenation or splitting into two words.
  The rationale was to allow keyword searches to work as expected, so
  there were few hard and fast rules.

- Foreign language descriptions

  Because it would drive me mad to try to spell-check different packages
  against different dictionaries, I did everything with an English
  dictionary, and added foreign language words to the supplementary
  dictionary (which also contains package names, technical words, etc.).
  This leaves room for errors.

- Filenames and URLs

  These were a big pain, and it would be great if someone could extend
  ispell to skip over them.  They are substantially more difficult to
  recognize and parse than Debian control file syntax, so I didn't attempt
  this.  aspell already has this capability, and I considered using it
  instead of ispell (wishlist bug #111929), but the filtering system is a
  maze of objects and templates that made it very unclear how to create a
  new filter without some internals documentation.

> > Lintian warnings might be a good outlet for such a check.
>
> Lintian already checks for common spelling errors, and its original
> database was seeded from that round of spelling checking :-)  I think that
> for performance reasons I included only misspellings that occurred in more
> than one description.  Many of the ones you list would also be suitable
> for inclusion, I think.

I didn't keep any automated statistics on the frequency of various
misspellings, but I took some notes based on my subjective observations.
The results are filed as a lintian wishlist bug, whose bug number hasn't
come back yet.

--
 - mdz
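
For anyone who wants to experiment without patching ispell, the extraction
step discussed above (pull the Description out of each control stanza, then
drop filenames and URLs before spell-checking) could be sketched roughly as
follows.  This is only an illustration, not the ispell patch from bug
#119782: the function names and both regular expressions are my own
assumptions, and the path pattern in particular is a crude heuristic that
only catches absolute Unix-style paths.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"(?:https?|ftp)://\S+")
# Crude heuristic: an absolute path such as /etc/foo.conf
PATH_RE = re.compile(r"(?<!\w)/(?:[\w.+-]+/)*[\w.+-]+")


def extract_descriptions(text):
    """Yield (package, description) for each stanza of control-file text."""
    # Stanzas are separated by blank lines.
    for stanza in re.split(r"\n\s*\n", text.strip()):
        package = None
        desc_lines = []
        in_desc = False
        for line in stanza.splitlines():
            if line[:1] in (" ", "\t"):
                # Continuation line: belongs to the field above it.
                if in_desc:
                    body = line.strip()
                    # A lone "." marks a blank line in the description.
                    desc_lines.append("" if body == "." else body)
                continue
            in_desc = False
            field, _, value = line.partition(":")
            name = field.strip().lower()
            if name == "package":
                package = value.strip()
            elif name == "description":
                in_desc = True
                desc_lines.append(value.strip())
        if package is not None and desc_lines:
            yield package, "\n".join(desc_lines)


def strip_unspellable(description):
    """Replace URLs and absolute paths with a space before spell-checking."""
    without_urls = URL_RE.sub(" ", description)
    return PATH_RE.sub(" ", without_urls)


if __name__ == "__main__":
    sample = (
        "Package: foo\n"
        "Description: lowlevel mouse driver\n"
        " Reads /etc/foo.conf; see http://example.org/foo\n"
    )
    for package, description in extract_descriptions(sample):
        print(package, "->", strip_unspellable(description))
```

The cleaned text could then be handed to whatever spell checker is at hand.
Relative paths and bare filenames are not handled here, which is part of why
I called them a big pain.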