Re: Grep on dictionary words

To: debian-user@lists.debian.org
Subject: Re: Grep on dictionary words
From: Andrew Sackville-West <andrew@farwestbilliards.com>
Date: Sat, 28 Nov 2009 15:15:25 -0800
Message-id: <20091128231525.GS20929@basement.swclan.homelinux.org>
Mail-followup-to: debian-user@lists.debian.org
In-reply-to: <880dece00911281400g76942873m858e56c9f9ef7587@mail.gmail.com>
References: <880dece00911280713n6193b8das6970e8a071fc22a6@mail.gmail.com> <200911281133.05340.bss@iguanasuicide.net> <20091128213918.GL20929@basement.swclan.homelinux.org> <880dece00911281400g76942873m858e56c9f9ef7587@mail.gmail.com>

On Sun, Nov 29, 2009 at 12:00:33AM +0200, Dotan Cohen wrote:
> > ISTM that because the output of strings is not a discrete list of
> > potential words, but is instead a long list of concatenated
> > characters, this problem is really rather daunting. The output should
> > probably first be broken up into something resembling words, perhaps
> > by breaking on non-alphabetic characters. That should do two things:
> > 1) get you something that resembles words to actually test and 2)
> > give you a somewhat smaller set of "stuff" to check.
> >
> > This won't necessarily handle "compound" words though where two
> > word-like things are jammed together, or an actual word is embedded
> > within a string of nonsense.
> >
> > I think this problem is potentially rather harder than I thought when
> > I saw OP's original question.
> >
> 
> It does not need to be comprehensive. Would it be possible to only
> show lines that have "words" (continuous strings) of alpha characters
> that are all lowercase except for the first character? That would
> handle about 90% of the work by eliminating lines like these:
> pDuf
> #k0H}g)
> GoV5
> rLeY1
> TMlq,*

Well, something simple in sed would help:

sed 's/[^a-zA-Z]\+/\n/g'

This splits the input at every run of non-alphabetic characters and
puts each resulting "word" on its own line (GNU sed syntax). Leave out
the '\n' to keep the original line structure, though the fragments on
a line are then run together. Then you can grep with something like:

grep '^[A-Z]'

which keeps the lines that start with a capital letter. If you want an
initial capital *only*, then:

grep '^[A-Z][a-z]*$'

would match those.
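Chained together, the two steps above might look like the sketch
below. The function name initial_caps is made up for illustration, and
the sketch assumes GNU sed (for \+ and for \n in the replacement).
LC_ALL=C is my own addition, meant to keep the [A-Z] and [a-z] ranges
plain ASCII regardless of the locale.

```shell
# Hypothetical helper: read raw text on stdin and print one candidate
# per line, keeping only "words" made of a capital letter followed by
# lowercase letters. Assumes GNU sed.
initial_caps() {
    # Step 1: turn every run of non-alphabetic characters into a newline.
    # Step 2: keep lines that are one capital followed by zero or more
    #         lowercase letters, and nothing else.
    LC_ALL=C sed 's/[^a-zA-Z]\+/\n/g' | LC_ALL=C grep '^[A-Z][a-z]*$'
}

# Example input modelled on the noise lines quoted above.
printf '%s\n' 'pDuf' 'GoV5' 'rLeY1' 'Hello World' | initial_caps
```

Note that a lone capital letter such as the "H" left over from
"#k0H}g)" also passes this filter, because [a-z]* matches zero
letters. Changing the pattern to '^[A-Z][a-z][a-z]*$' would require at
least two characters.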

I'm sure someone can do better, but that should cut the data down to
a much smaller set. You can then find some way to look up each of the
remaining candidates in aspell.
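I have not worked out the aspell side, so here is a stand-in that
swaps in a different technique: checking the candidates against a
plain word list (one word per line) with grep. The function name
in_wordlist and the word list path are assumptions for illustration.
On Debian a list is often installed at /usr/share/dict/words, and I
believe aspell can also produce one with "aspell -d en dump master",
but check that against your own version.

```shell
# Hypothetical helper: print the stdin lines that appear in a word list.
# $1 is the path to a word list with one word per line (an assumption).
in_wordlist() {
    # -F  treat each word-list entry as a fixed string, not a regex
    # -x  require the entry to match the whole candidate line
    # -i  ignore case, so "Hello" is found by the entry "hello"
    # -f  read the entries from the file named in $1
    grep -Fxi -f "$1"
}

# Example with a tiny throwaway word list.
list=$(mktemp)
printf '%s\n' 'hello' 'world' > "$list"
printf '%s\n' 'Hello' 'Xqzv' 'World' | in_wordlist "$list"
rm -f "$list"
```

With a large word list this loads every entry as a pattern, which
grep -F generally copes with well, but I have not timed it on real
strings output.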

A

Follow-Ups:
- Re: Grep on dictionary words
  - From: Dotan Cohen <dotancohen@gmail.com>

References:
- Grep on dictionary words
  - From: Dotan Cohen <dotancohen@gmail.com>
- Re: Grep on dictionary words
  - From: "Boyd Stephen Smith Jr." <bss@iguanasuicide.net>
- Re: Grep on dictionary words
  - From: Andrew Sackville-West <andrew@farwestbilliards.com>
- Re: Grep on dictionary words
  - From: Dotan Cohen <dotancohen@gmail.com>
