Re: is statistical data extracted from web DFSG compliant?



On Mon, 2008-11-10 at 10:06 +0800, Kov Chai wrote:
> I am working on a Chinese input method engine [1,2]. This input method
> engine is based on a statistical language model. We extract the
> language model from a 150MiB corpus collected from some chosen
> Chinese websites using a training algorithm. The extracted data -- the
> language model -- does not contain any text from these websites. Only
> statistics for the frequencies of occurrence of given character
> sequences are stored in a binary format.

Then there is no copyright over the statistical data, only over the
statistical analysis tools. The creativity lies in writing the code that
generates the data, not in the data itself. Think of it like a piece of
networking software: the creativity (and therefore the copyright) is in
the design of the pipe, not in what goes through the pipe.

The binary file is not a problem in and of itself - the package must
simply support modification of the binary and regeneration of a new
binary should the old one get deleted or should the package be shown to
have miscalculated some of the data. It is the support for modification
that matters.

> The upstream released both the data and the source code for
> reading/writing such kind of binary file under dual license of CDDL
> and LGPLv2.1. So I believe this software (and data) is free. But I am
> afraid that this package is not compatible with DFSG#2 in some sense. Am
> I right?

The data has to be capable of being regenerated / refreshed using the
free software provided in the package (or some other free software). If
there is a bug in the algorithm that miscalculates the frequency of one
symbol, the package must still be capable of regenerating corrected
data. Just because the required input data (the websites) cannot also be
packaged does not mean that the generated data is not free. As long as a
random user can use the package to update the binary file in the case of
bugs, it meets DFSG requirements.

However, whether Debian should be expected to keep such large binary
files on all the mirrors is a different question. How long does it take
to generate the binary file? Could it be generated by the user? Does the
package support reading the data from a proxy or from content on a
regular filesystem?

There are reasons why other packages can require access to specific
websites but those packages do so to obtain data that has some
copyright / creative input, e.g. a shared SVN repository, or where it is
unreasonable to package the data along with the package.

> Is there any way to fix this? Is removing the non-free piece the only
> way? I've put the explanation in README.Debian of this package.

Statistical data has no copyright; it merely needs to be generated and
updated. The question is not whether the data meets the DFSG, the
question is whether the data should be in the package in the first
place.

I would be tempted to package the free software that generates the stats
data and let the user create the data after installation. I don't see
why the mirrors should keep copies of such large amounts of generated
data.

It's a bit like packaging a TCP/IP client and bundling a generated block
of packets as well. The data does not need to be distributed AFAICT, and
it would presumably need to be updated from time to time anyway.

The data itself would appear to have no intrinsic value to the package
beyond saving time for the user. Personally, I do not consider that a
worthwhile excuse for taking up so much space on the Debian mirrors /
DVD.

It comes down to this:
1. The package must work with any binary file generated using the
package itself (or dependencies), including files generated by any
random user, with or without the original data file. That is a DFSG
requirement. I would consider the package to be non-free if the package
insists on going to specific websites directly - it *must* be possible
to generate the statistical data from a local copy of the websites as
well (i.e. offline) or via a proxy or via a completely arbitrary set of
web content on the filesystem. 
2. The data contains none of the original copyrighted data - the actual
values in the data file are therefore completely arbitrary and any
realistic permutation of the numbers could exist in the binary.
3. Re-generating the binary file after installation must not change the
functionality of the package; it merely changes the data that the
package handles.
4. Given the above, there is no reason to package the data beyond saving
time for the user.
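
To illustrate point 1, here is a minimal sketch in Python of what
"regenerate from arbitrary local content" could look like. It is not the
upstream tool: the function names, the *.txt corpus layout and the binary
record layout are all made up for illustration, and the real package
uses its own training algorithm and file format.

```python
import struct
from collections import Counter
from pathlib import Path


def count_ngrams(corpus_dir, n=2):
    """Count character n-grams in every *.txt file under corpus_dir.

    The corpus is just a local directory: no network access is needed.
    """
    counts = Counter()
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for line in text.splitlines():
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def write_model(counts, out_path):
    """Store only the statistics, never the source text.

    Hypothetical layout: a 32-bit entry count, then for each entry a
    16-bit key length, the UTF-8 key and a 32-bit frequency.
    """
    with open(out_path, "wb") as fh:
        fh.write(struct.pack("<I", len(counts)))
        for gram, freq in sorted(counts.items()):
            key = gram.encode("utf-8")
            fh.write(struct.pack("<H", len(key)))
            fh.write(key)
            fh.write(struct.pack("<I", freq))


def read_model(in_path):
    """Read back a file written by write_model()."""
    counts = {}
    with open(in_path, "rb") as fh:
        (total,) = struct.unpack("<I", fh.read(4))
        for _ in range(total):
            (key_len,) = struct.unpack("<H", fh.read(2))
            gram = fh.read(key_len).decode("utf-8")
            (freq,) = struct.unpack("<I", fh.read(4))
            counts[gram] = freq
    return counts
```

A user with any directory of text could then run something like
write_model(count_ngrams("/path/to/corpus"), "model.bin") to replace the
shipped binary, which is the capability that matters here.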

-- 


Neil Williams
=============
http://www.data-freedom.org/
http://www.nosoftwarepatents.com/
http://www.linux.codehelp.co.uk/

