Re: Dictionary changes

To: "debian-user@lists.debian.org User" <debian-user@lists.debian.org>
Cc: debian-user@lists.debian.org
Subject: Re: Dictionary changes
From: Amodelo <helmut.wollmersdorfer@amodelo.de>
Date: Thu, 3 Jul 2014 10:33:58 +0200
Message-id: <8B5F736B-8417-4717-8B98-FA81369C31EF@amodelo.de>
In-reply-to: <20140702132517.2e38268b@mydesq2.domain.cxm>
References: <20140702122202.3ae6dd35@mydesq2.domain.cxm> <20140702184018.6aabc27b@anubis.defcon1> <20140702132517.2e38268b@mydesq2.domain.cxm>

On 02.07.2014 at 19:25, Steve Litt <slitt@troubleshooters.com> wrote:

> On Wed, 2 Jul 2014 18:40:18 +0200
> Bzzzz <lazyvirus@gmx.com> wrote:
> 
>> On Wed, 2 Jul 2014 12:22:02 -0400
>> Steve Litt <slitt@troubleshooters.com> wrote:
> 
>>> If worst comes to worst and I can't find a way to get grep to do
>>> this, I'll just put together a substitution table,
>>> convert /usr/share/dict/words to words.ascii, line for line, search
>>> words.ascii, get the line number, and pull that line out of words.
>>> Crude, but effective.
>> 
>> AFAIK, this is the only way to be able to perform what you want.
>> 
> 
> So then, the question becomes, where does there exist a list of common
> letters that are, for want of a better word, "ornamented ascii"?
> Umlauts, Carats, Circles, Grave accents, etc.

This is a known problem without a perfect solution. Some years ago I wrote a Perl module for this:

https://metacpan.org/pod/Text::Undiacritic

DESCRIPTION
Changes characters with diacritics into their base characters.
It also maps characters to their base character in cases where Unicode does not provide a decomposition.
E.g. the characters named '... WITH STROKE', such as 'LATIN SMALL LETTER L WITH STROKE', have no decomposition. For that example the result will be 'LATIN SMALL LETTER L'.
Removing diacritics is useful for matching text independent of spelling variants.
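
As a rough sketch of the idea (in Python, and not the module's actual Perl code): decompose each character, drop the combining marks, and use a small hand-made table for letters that have no decomposition. The names FALLBACK, undiacritic and build_index are made up for this illustration, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical fallback table for letters that Unicode does not decompose
# (for example the '... WITH STROKE' letters). Illustrative, not exhaustive.
FALLBACK = {
    "ł": "l", "Ł": "L",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
}

def undiacritic(text):
    """Return text with diacritics removed, character by character."""
    out = []
    for ch in text:
        if ch in FALLBACK:
            out.append(FALLBACK[ch])
            continue
        # NFD splits e.g. 'é' into 'e' plus a combining acute accent.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

def build_index(words):
    """Map each folded, lowercased word to its original spellings."""
    index = {}
    for word in words:
        index.setdefault(undiacritic(word).lower(), []).append(word)
    return index
```

With such an index, a plain ASCII search term can be looked up directly and the original spellings come back, which avoids keeping a second file in step with the word list line by line.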

A more general approach would be approximate matching: compute a similarity coefficient between the search term and each candidate, and display the best-matching strings.

See e.g. here:

https://metacpan.org/release/Set-Similarity
https://metacpan.org/pod/String::Similarity
http://www.chokkan.org/software/simstring/
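
To illustrate what a similarity coefficient can look like, here is a minimal Python sketch using the Dice coefficient on character bigrams. It is only one possible measure and does not reproduce the API of any of the packages linked above; the function names are invented for this example.

```python
def bigrams(s):
    """Return the set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient on character bigrams: 2*|A & B| / (|A| + |B|)."""
    x, y = bigrams(a.lower()), bigrams(b.lower())
    if not x and not y:
        # Strings too short to have bigrams: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))

def best_matches(query, candidates, n=3):
    """Return the n candidates most similar to query, best first."""
    ranked = sorted(candidates, key=lambda w: dice(query, w), reverse=True)
    return ranked[:n]
```

For example, best_matches("cafe", ["cafés", "coffee", "cage"], n=1) ranks "cafés" first, because it shares two of the three bigrams of "cafe". Scanning every candidate like this is slow on a large word list, so for real use an indexed approach would probably be the better choice.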

HTH

Helmut Wollmersdorfer
