[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bug#341903: libc6: strfry() gives skewed distributions



Package: libc6
Version: 2.3.2.ds1-22
Severity: minor
Tags: patch

strfry() tries to shuffle its string using random swaps, but it uses the
wrong strategy, and thus not all permutations are equally likely. The
code doing the shuffling itself looks like this:

  len = strlen (string);
  for (i = 0; i < len; ++i)
    {
      int32_t j;
      char c;

      __random_r (&rdata, &j);
      j %= len;

      c = string[i];
      string[i] = string[j];
      string[j] = c;
    }

In other words, for the string 'abc' j will always be between 0 and 2
(inclusive), giving the following possibilities (all equally likely):

  j0 j1 j2  result
   0  0  0  cab
   0  0  1  bca
   0  0  2  bac
   0  1  0  cba
   0  1  1  acb
   0  1  2  abc
   0  2  0  bca
   0  2  1  abc
   0  2  2  acb
   1  0  0  cba
   1  0  1  acb
   1  0  2  abc
   1  1  0  cab
   1  1  1  bca
   1  1  2  bac
   1  2  0  acb
   1  2  1  bac
   1  2  2  bca
   2  0  0  acb
   2  0  1  bac
   2  0  2  bca
   2  1  0  abc
   2  1  1  cab
   2  1  2  cba
   2  2  0  bac
   2  2  1  cba
   2  2  2  cab

Sorting and counting gives us the following distribution:

   4  abc
   5  acb
   5  bac
   5  bca
   4  cab
   4  cba

In other words, this is clearly skewed; some strings will appear 25% more often
than others.

I propose changing the code to:
  
  len = strlen (string);
  for (i = 0; i < len - 1; ++i)
    {
      int32_t j;
      char c;

      __random_r (&rdata, &j);
      j %= (len-i);

      c = string[i];
      string[i] = string[j+i];
      string[j+i] = c;
    }

which will turn it into a proper Fisher-Yates shuffle. This gives us exactly n!
paths for a string with N characters, and since there are n! possible
permutations, we get a one-to-one mapping, and all possibilities are equally
likely.

I'm also a bit disappointed that strfry() is locale-unaware... surely such an
important function should be able to shuffle at least UTF-8 strings correctly.

-- System Information:
Debian Release: 3.1
Architecture: i386 (x86_64)
Kernel: Linux 2.6.12-rc4
Locale: LANG=en_DK.UTF-8, LC_CTYPE=en_DK.UTF-8 (charmap=UTF-8)

Versions of packages libc6 depends on:
ii  libdb1-compat                 2.1.3-7    The Berkeley database routines [gl

-- no debconf information



Reply to: