
Re: Grep on dictionary words



On Sun, Nov 29, 2009 at 12:00:33AM +0200, Dotan Cohen wrote:
> > ISTM that because the output of strings is not a discrete list of
> > potential words, but is instead a long list of concatenated
> > characters, this problem is really rather daunting. The output should
> > probably be first broken up into something resembling words by perhaps
> > breaking on non-alphabetic characters. That should do two things: 1)
> > get you something that resembles words to actually test and 2) a somewhat
> > smaller set of "stuff" to check.
> >
> > This won't necessarily handle "compound" words though where two
> > word-like things are jammed together, or an actual word is embedded
> > within a string of nonsense.
> >
> > I think this problem is potentially rather harder than I thought when
> > I saw OP's original question.
> >
> 
> It does not need to be comprehensive. Would it be possible to only
> show lines that have "words" (continuous strings) of alpha characters
> that are all lowercase except for the first character? That would
> handle about 90% of the work by eliminating lines like these:
> pDuf
> #k0H}g)
> GoV5
> rLeY1
> TMlq,*

Well, something simple in sed would help:

sed 's/[^a-zA-Z]\+/\n/g'

This splits "words" at non-alphabetic characters and puts each one on
its own line (GNU sed is needed for \+ and for \n in the replacement).
To keep the original line structure instead, use a space in place of
the '\n'. Then you can grep with something like:

grep '^[A-Z]'

That gets the ones that start with a capital letter. If you want
words with an initial capital *only*, then:

grep "^[A-Z][a-z]*$"

would match those; note that a lone capital letter also matches,
since [a-z]* allows zero lowercase letters.
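
Putting the split and the filter together, a minimal sketch might look
like the following. The function name candidate_words is made up for
illustration, and tr -cs stands in for the sed command only because it
behaves the same on GNU and BSD systems; it is not what was posted
above, just an equivalent way to break on non-alphabetic characters.

```shell
# Hypothetical helper: read text on stdin, print one candidate word per line.
candidate_words() {
    # tr -cs turns every run of non-letters into a single newline,
    # so each run of letters ends up on a line of its own.
    # LC_ALL=C keeps A-Z and a-z limited to plain ASCII letters.
    LC_ALL=C tr -cs 'A-Za-z' '\n' |
        # Keep only lines that are one capital followed by lowercase letters.
        LC_ALL=C grep '^[A-Z][a-z]*$'
}
```

It could then be fed from strings, for example
"strings somefile | candidate_words | sort -u", where somefile is a
placeholder for whatever is being examined.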

I'm sure someone can do better, but that should get you down to a
much smaller dataset, and you can then look each remaining word up in
aspell.
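
For the lookup step, here is a rough sketch that checks candidates
against a plain word list instead of aspell, since that is easier to
show in a few lines. The function name lookup_words and the default
path /usr/share/dict/words are assumptions; that file exists only if a
word list package is installed.

```shell
# Hypothetical helper: read candidate words on stdin (one per line) and
# print those that appear in a word list, ignoring case.
lookup_words() {
    # $1: word list with one word per line (the default path is an assumption)
    wordlist=${1:-/usr/share/dict/words}
    # sort -u drops duplicate candidates before the lookup.
    # grep -F treats list entries as fixed strings, -x requires a
    # whole-line match, -i ignores case, -f reads patterns from the file.
    LC_ALL=C sort -u | grep -i -x -F -f "$wordlist"
}
```

With aspell itself the logic would be inverted: as far as I know,
"aspell list" reads text on stdin and prints the words it does *not*
recognise, so the real words are the candidates missing from its
output.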

A
