Re: Removing duplication: Word lists of common words in languages
Simon McVittie <firstname.lastname@example.org> writes:
> On 10/11/14 23:16, Ben Finney wrote:
> > To avoid duplicating these “the N most common words, ranked by
> > frequency, for language FOO”
> For a password generator you ideally want the word-list to be sorted
> alphabetically, so that it's trivial to verify "by eye" that there are
> no duplicates. Duplicate entries would reduce the entropy of the
> generated passwords, without anything being obviously wrong.
It's already trivial to test a wordlist, regardless of its existing
order, to check whether it has duplicates:
$ ( sort | uniq -d | wc -l ) < /usr/share/dict/american-english
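A count of 0 from that command means no duplicates. If the count is non-zero, you will want to see which words are repeated. Here is a minimal sketch, assuming an awk is installed; the helper name list_duplicates is made up for illustration and is not part of any existing tool. It leaves the list's order alone, so it works on a frequency-ordered list as well:

```shell
# list_duplicates (hypothetical helper): read a wordlist on stdin and
# print each line that has already been seen, in input order.
list_duplicates() {
    # seen[$0]++ is 0 (false) the first time a line appears,
    # and non-zero (true, so the line is printed) on every repeat.
    awk 'seen[$0]++'
}

# Small inline list standing in for a real dictionary file:
printf '%s\n' the of and the to of | list_duplicates
```

Empty output means the list has no duplicates.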
What's impossible is to go from an alphabetically-ordered list of unique
words to a frequency-ordered one, without introducing the frequency
information from outside.
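The opposite direction, by contrast, is a single command. A rough sketch (the helper name to_alphabetical is hypothetical, and the C locale is just my assumption to get a reproducible byte-wise ordering):

```shell
# to_alphabetical (hypothetical helper): read a frequency-ordered
# wordlist on stdin, write an alphabetically sorted, duplicate-free one.
to_alphabetical() {
    # LC_ALL=C forces plain byte-wise ordering, independent of the
    # user's locale; -u drops any repeated lines.
    LC_ALL=C sort -u
}

# The frequency ranking is discarded here and cannot be recovered
# from the output alone:
printf '%s\n' the of and to | to_alphabetical
```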
> (Idea stolen from Diceware, for which it is essential, because the word
> list is designed to be usable without a computer
Well, I don't think we need to cater for “use without a computer” for
programs in Debian. Unless I'm misunderstanding something?
An important property of an “N most common” wordlist ordered by frequency
in the language is that it's trivial to get an “X most common” wordlist
(where X < N) by simply truncating the list. A wordlist not ordered by
frequency lacks this property.
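To illustrate the truncation, a minimal sketch, assuming the list has one word per line with the most frequent word first (the helper name top_words is invented for this example):

```shell
# top_words X (hypothetical helper): read a frequency-ordered wordlist
# on stdin and keep only its first X lines, i.e. the X most common words.
top_words() {
    head -n "$1"
}

# Take the 3 most common words from a 5-word frequency-ordered list:
printf '%s\n' the of and to in | top_words 3
```

A program that picks words uniformly at random from the truncated list gets log2(X) bits of entropy per word, provided the list has no duplicates.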
Where is a good authoritative source of such words, by frequency, for
various natural languages, suitable for inclusion in Debian as a data