Re: Removing duplication: Word lists of common words in languages

To: debian-devel@lists.debian.org
Subject: Re: Removing duplication: Word lists of common words in languages
From: Ben Finney <ben+debian@benfinney.id.au>
Date: Tue, 11 Nov 2014 22:48:00 +1100
Message-id: <85k332c61b.fsf@benfinney.id.au>
References: <20141109081933.24446.59817.reportbug@lavender> <545F6B5A.5040302@debian.org> <8561eoc5jz.fsf@benfinney.id.au> <20141110102702.4694.52780@bastian.jones.dk> <20141110231653.GB9455@benfinney.id.au> <5461E839.5030607@debian.org>

Simon McVittie <smcv@debian.org> writes:

> On 10/11/14 23:16, Ben Finney wrote:
> > To avoid duplicating these “the N most common words, ranked by
> > frequency, for language FOO”
>
> For a password generator you ideally want the word-list to be sorted
> alphabetically, so that it's trivial to verify "by eye" that there are
> no duplicates. Duplicate entries would reduce the entropy of the
> generated passwords, without anything being obviously wrong.

It's already trivial to check whether a wordlist has duplicates,
regardless of its existing order:

    $ ( sort | uniq -d | wc -l ) < /usr/share/dict/american-english
    0
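
As a sketch of the same check that prints the offending entries rather
than just counting them, and that also catches entries differing only
by case, something like the following could be used. The sample list
and the function names `exact_dups` and `folded_dups` are made up for
illustration; any one-word-per-line file would do.

```shell
# Hypothetical sample list, deliberately containing duplicates.
wordlist=$(mktemp)
printf '%s\n' the of and The to of > "$wordlist"

# Print entries that occur more than once, byte for byte.
# LC_ALL=C keeps sort and uniq agreeing on what "equal" means.
exact_dups() {
    LC_ALL=C sort "$1" | LC_ALL=C uniq -d
}

# Print entries that occur more than once after folding to lower case.
folded_dups() {
    tr '[:upper:]' '[:lower:]' < "$1" | LC_ALL=C sort | LC_ALL=C uniq -d
}

exact_dups "$wordlist"
folded_dups "$wordlist"
```

Whether case-folded duplicates matter depends on how the consuming
program treats case; that is an assumption to check per program.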

What's impossible is to go from an alphabetically-ordered list of unique
words to a frequency-ordered one, without introducing the frequency
information from outside.

> (Idea stolen from Diceware, for which it is essential, because the word
> list is designed to be usable without a computer

Well, I don't think we need to cater for “use without a computer” for
programs in Debian. Unless I'm misunderstanding something?

An important property of an “N most common” wordlist ordered by
frequency in the language is that it's trivial to get an “X most common”
wordlist (where X < N) by simply truncating the list. A wordlist not
ordered by frequency lacks this property.
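
A minimal sketch of that truncation, assuming a one-word-per-line file
with the most common word first. The sample data and the function name
`top_words` are hypothetical, not taken from any existing package.

```shell
# Hypothetical frequency-ordered list, most common word first.
ranked=$(mktemp)
printf '%s\n' the of and to a in > "$ranked"

# The X most common words are simply the first X lines of the file.
top_words() {
    head -n "$1" "$2"
}

# The three most common words, still in frequency order.
top_words 3 "$ranked"

# The same three words sorted alphabetically, for a program (or a
# reader) that prefers that order. Going the other way is not possible
# without the frequency information.
top_words 3 "$ranked" | LC_ALL=C sort
```

An alphabetical view can always be derived on demand this way, so
shipping the list in frequency order loses nothing.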

Where is a good authoritative source of such words, by frequency, for
various natural languages, suitable for inclusion in Debian as a data
package?

-- 
 \           “Anything that we scientists can do to weaken the hold of |
  `\        religion should be done and may in the end be our greatest |
_o__)                  contribution to civilization.” —Steven Weinberg |
Ben Finney

Follow-Ups:
- Re: Removing duplication: Word lists of common words in languages
  - From: Ian Jackson <ijackson@chiark.greenend.org.uk>

References:
- Bug#768772: ITP: xkcdpass -- secure passphrase generator inspired by XKCD 936
  - From: Ben Finney <ben+debian@benfinney.id.au>
- Re: Bug#768772: ITP: xkcdpass -- secure passphrase generator inspired by XKCD 936
  - From: Simon McVittie <smcv@debian.org>
- Re: Bug#768772: ITP: xkcdpass -- secure passphrase generator inspired by XKCD 936
  - From: Ben Finney <ben+debian@benfinney.id.au>
- Re: Bug#768772: ITP: xkcdpass -- secure passphrase generator inspired by XKCD 936
  - From: Jonas Smedegaard <dr@jones.dk>
- Removing duplication: Word lists of common words in languages (was: Bug#768772: ITP: xkcdpass …)
  - From: Ben Finney <ben+debian@benfinney.id.au>
- Re: Removing duplication: Word lists of common words in languages
  - From: Simon McVittie <smcv@debian.org>
