
Bug#828787: ITP: libdisorder -- library for entropy measurement of byte streams and other data



On Mon, Jun 27, 2016 at 10:07:04PM +0200, Andreas Tille wrote:

>   Description     : library for entropy measurement of byte streams and other data
>  libdisorder is a small, simple C library for use by programmers in
>  other programs. There is a small test program included that opens
>  /dev/urandom and calls libdisorder in the `test/' directory. There is
>  also a command-line tool, `ropy' in the `tool/' directory for reporting
>  on the entropy of normal files.
>  .
>  You will probably want to pipe the output of libdisorder to some other
>  math analysis or graphing environment (e.g., gnuplot).
>  .
>  The library's primary function reports entropy in bits: essentially,
>  this is the number of bits necessary to encode the actual level of
>  information contained in the data passed to the library: it is the
>  theoretical maximum amount of compression possible.

I hope you will fix this description. I'd only keep the last paragraph,
and then also explain what algorithm it actually uses to measure the
entropy (Shannon's H function, from his source coding theorem). That
theorem only holds for an input of "independent and identically
distributed random variables"; it does not apply to every kind of input.
In particular, the library only looks at the histogram of byte values;
if you feed it a file with totally predictable increasing byte values
0, 1, 2, etc., it will report an entropy of 8 bits per byte. Many
compression algorithms, especially those for sound and images, look at
differences between consecutive values or have other means to detect
such predictable sequences. So make it clear that it just implements
Shannon's H function and that it only works on bytes.

I also want to point out that this library is not thread-safe, something
which could easily be fixed. It also gives the wrong answer when the
input contains more than 2^31-1 occurrences of the same byte value, even
though it pretends to handle inputs up to 2^63 in length.

> Remark: The code of libdisorder appeared in two other targets of Debian
> Med and to avoid code duplication this library is packaged separately.

Although normally I would applaud deduplication, I personally think this
shouldn't get its own package. It looks like one of those things you'd
find on npm.

-- 
Met vriendelijke groet / with kind regards,
      Guus Sliepen <guus@debian.org>
