[OT] Re: the 'original' string function?
Emanuel Berg (12024-07-10):
> Okay, this is gonna be a challenge to most guys who have been
> processing text for a long time.
>
> So, I would like a command, function or script, 'original',
> that takes a string STR and a text file TXT and outputs
> a score, from 0 to 100, how _original_ STR is, compared to
> what is already in TXT.
>
> So if I do
>
> $ original "This isn't just another party" comments.txt
>
> this will score 0 if that exact phrase to the letter already
> exists in comments.txt.
>
> But it will score 100 if not a single of those words exists in
> the file! Because that would be 100% original.
>
> Those endpoints are easy. But how to make it score - say - 62%
> if some of the words are present, mostly spelled like that and
> combined in ways that are not completely different?
>
> Note: The above examples are examples, other definitions of
> originality are okay. That is not the important part now - but
> can be as interesting a part, later.
You can use this:
https://en.wikipedia.org/wiki/Levenshtein_distance
But you also need to define what you want with more precision:
How do you count the replacement of a word by a synonym?
How do you count a change in the order of the words?
How do you count a transparent spelling mistake?
How do you count a spelling mistake that turns a word into another
existing word?
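
As a rough starting point for the Levenshtein suggestion, here is
a minimal Python sketch. The names levenshtein and original_score are
made up for illustration, and the scoring rule is only an assumption:
STR is compared character by character against each line of TXT, the
smallest distance is divided by the length of the longer string, and
the result is scaled to 0..100. It does not settle any of the
questions above: a synonym or a reordering simply counts as many
character edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def original_score(s: str, lines) -> float:
    """Hypothetical score: 0 if some line equals s, up to 100 otherwise.

    Assumption: originality is the smallest normalized distance between
    s and any single line, not a word-level or whole-file measure.
    """
    best = 1.0  # with no lines to compare against, s counts as fully original
    for line in lines:
        line = line.rstrip("\n")
        longest = max(len(s), len(line))
        if longest == 0:
            return 0.0  # both empty: identical
        best = min(best, levenshtein(s, line) / longest)
    return 100.0 * best
```

An open file can be passed as the second argument, for example
original_score("This isn't just another party", open("comments.txt")).
Splitting both strings into words and running the same distance on the
word lists would be one way to answer the word-order question
differently.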
This is not related to Debian, so I am putting “[OT]” in the subject.
Regards,
--
Nicolas George