
Re: Removing duplication: Word lists of common words in languages

Simon McVittie <smcv@debian.org> writes:

> On 10/11/14 23:16, Ben Finney wrote:
> > To avoid duplicating these “the N most common words, ranked by
> > frequency, for language FOO”
> For a password generator you ideally want the word-list to be sorted
> alphabetically, so that it's trivial to verify "by eye" that there are
> no duplicates. Duplicate entries would reduce the entropy of the
> generated passwords, without anything being obviously wrong.

It's already trivial to test a wordlist, regardless of its existing
order, to check whether it has duplicates:

    $ ( sort | uniq -d | wc -l ) < /usr/share/dict/american-english
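To see which words are duplicated, not just how many, the same pipeline can be wrapped in small shell functions. This is only a sketch: the function names are made up for illustration, and forcing `LC_ALL=C` is an assumption so that sorting and comparison are byte-wise and predictable across locales.

```shell
# Hypothetical helper: print each word that occurs more than once in a
# wordlist (one word per line). The input order does not matter, because
# the list is sorted before adjacent lines are compared.
list_duplicates() {
    LC_ALL=C sort "$1" | LC_ALL=C uniq -d
}

# Hypothetical helper: count the duplicated words; 0 means all are unique.
count_duplicates() {
    list_duplicates "$1" | wc -l | tr -d ' '
}
```

For example, `count_duplicates /usr/share/dict/american-english` prints the number of distinct words that appear more than once in that file.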

What's impossible is to go from an alphabetically-ordered list of unique
words to a frequency-ordered one, without introducing the frequency
information from outside.

> (Idea stolen from Diceware, for which it is essential, because the word
> list is designed to be usable without a computer

Well, I don't think we need to cater for “use without a computer” for
programs in Debian. Unless I'm misunderstanding something?

An important property of an “N most common” wordlist ordered by frequency
in the language is that it's trivial to get an “X most common” wordlist
(where X < N) by simply truncating the list. A wordlist not ordered by
frequency lacks this property.
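As a rough illustration of that truncation, assuming a frequency-ordered file with one word per line and the most common word first (the function name and file names here are hypothetical):

```shell
# Hypothetical helper: keep only the X most common words from a
# frequency-ordered wordlist (most common first, one word per line).
# Usage: top_words X FILE
top_words() {
    head -n "$1" "$2"
}
```

For example, `top_words 1000 words-by-frequency.txt > top-1000.txt` would derive a 1000-word list from a longer one. The same command on an alphabetically sorted list would only return the first 1000 words in alphabetical order.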

Where is a good authoritative source of such words, by frequency, for
various natural languages, suitable for inclusion in Debian as a data

 \           “Anything that we scientists can do to weaken the hold of |
  `\        religion should be done and may in the end be our greatest |
_o__)                  contribution to civilization.” —Steven Weinberg |
Ben Finney
