
Re: Dictionary changes



On 02.07.2014 at 19:25, Steve Litt <slitt@troubleshooters.com> wrote:

> On Wed, 2 Jul 2014 18:40:18 +0200
> Bzzzz <lazyvirus@gmx.com> wrote:
> 
>> On Wed, 2 Jul 2014 12:22:02 -0400
>> Steve Litt <slitt@troubleshooters.com> wrote:
> 
>>> If worst comes to worst and I can't find a way to get grep to do
>>> this, I'll just put together a substitution table,
>>> convert /usr/share/dict/words to words.ascii, line for line, search
>>> words.ascii, get the line number, and pull that line out of words.
>>> Crude, but effective.
>> 
>> AFAIK, this is the only way to be able to perform what you want.
>> 
> 
> So then, the question becomes, where does there exist a list of common
> letters that are, for want of a better word, "ornamented ascii"?
> Umlauts, Carats, Circles, Grave accents, etc.

This is a known problem without a perfect solution. Some years ago I wrote a Perl module for this:

https://metacpan.org/pod/Text::Undiacritic


DESCRIPTION
Changes characters with diacritics into their base characters.
Also maps characters to their base character in cases where Unicode does not provide a decomposition.
E.g. all characters '... WITH STROKE', such as 'LATIN SMALL LETTER L WITH STROKE', have no decomposition. In that case the result will be 'LATIN SMALL LETTER L'.
Removing diacritics is useful for matching text independent of spelling variants.
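
To illustrate the idea outside of Perl, here is a minimal Python sketch. It is not the code of Text::Undiacritic, only the same general technique: decompose with NFD, drop the combining marks, and use a small hand-written table for letters that have no decomposition. The table NO_DECOMPOSITION and the helper names strip_diacritics and find_words are made up for this example, and the table is far from complete.

```python
import unicodedata

# Hypothetical fallback table for letters that Unicode does not decompose.
# Illustrative only; a real table would need many more entries.
NO_DECOMPOSITION = {
    "ł": "l", "Ł": "L",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
}

def strip_diacritics(text):
    """Return text with combining marks removed and stroke letters mapped."""
    # NFD splits e.g. 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero combining class; drop them.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Handle the letters that NFD leaves untouched.
    return "".join(NO_DECOMPOSITION.get(ch, ch) for ch in without_marks)

def find_words(query, words):
    """Return the original words whose stripped form equals the stripped query."""
    key = strip_diacritics(query).lower()
    return [w for w in words if strip_diacritics(w).lower() == key]
```

With this, find_words("resume", word_list) would return the original spellings from the list, which is roughly the "convert, search, pull the original line" plan from earlier in the thread.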


But a more general approach would be to use some sort of approximate matching: calculate a similarity coefficient for each candidate and display the best-matching strings.

See e.g. here:

https://metacpan.org/release/Set-Similarity
https://metacpan.org/pod/String::Similarity
http://www.chokkan.org/software/simstring/
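
As a rough sketch of approximate matching, the following Python uses the standard library's difflib.SequenceMatcher as the similarity coefficient. This is a stand-in and not the measure used by any of the packages linked above; the function name best_matches and the default values for limit and cutoff are assumptions for this example.

```python
import difflib

def best_matches(query, candidates, limit=3, cutoff=0.6):
    """Return up to `limit` (score, candidate) pairs, best score first."""
    scored = []
    for candidate in candidates:
        # ratio() is 2 * matching characters / total characters, in [0, 1].
        score = difflib.SequenceMatcher(None, query, candidate).ratio()
        if score >= cutoff:
            scored.append((score, candidate))
    # Sort by score only, highest first; ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

Unlike plain diacritic removal, this also catches other spelling variants, at the cost of comparing the query against every candidate and of having to pick a cutoff.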

HTH

Helmut Wollmersdorfer
