On Mon, 2008-11-10 at 10:06 +0800, Kov Chai wrote:
> i am working on a Chinese input method engine [1,2]. This input method
> engine is based on the statistical language model. We extract the
> language model from a 150MiB corpus collected from some choosed
> Chinese websites using a training algorithm. The extracted data -- the
> language model -- does not contain any text from these websites. only
> statistics for the frequencies of occurrence of given character
> sequences are stored in a binary format.

Then there is no copyright over the statistical data, only the
statistical analysis tools. The creativity lies in writing the code that
generates the data, not the data itself. Think of it like a piece of
networking software, the creativity (and therefore the copyright) is in
the design of the pipe, not what goes through the pipe.

The binary file is not a problem in and of itself - the package must
simply support modification of the binary and regeneration of a new
binary should the old one get deleted or should the package be shown to
have miscalculated some of the data. It is the support for modification
that matters.

> The upstream released both the data and the source code for
> reading/writing such kind of binary file under dual license of CDDL
> and LGPLv2.1 . So I believe this software (and data) is free. But I am
> afraid that this package is not compatible to DFSG#2 in some sense. Am
> I right?

The data has to be capable of being regenerated / refreshed using the
free software provided in the package (or some other free software). If
there is a bug in the algorithm that miscalculates the frequency of one
symbol, the package must still be capable of regenerating corrected
data.

Just because the required input data (the websites) cannot also be
packaged does not mean that the generated data is not free. As long as a
random user can use the package to update the binary file in the case of
bugs, it meets DFSG requirements.
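To make "regenerating the data" concrete, here is a minimal sketch in
Python. It is not the upstream training code: the function names, the
"NGRM" signature and the record layout are all hypothetical, and a real
language model stores far more than raw counts. It only illustrates the
shape of the requirement: count character sequences from any local
directory of text, write the counts to a binary file, and read them
back.

```python
import struct
from collections import Counter
from pathlib import Path

MAGIC = b"NGRM"  # hypothetical file signature, not the upstream format


def count_ngrams(texts, n=2):
    """Count how often each run of n characters occurs in the texts."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def read_corpus(directory):
    """Yield the contents of every *.txt file under a local directory."""
    for path in sorted(Path(directory).rglob("*.txt")):
        yield path.read_text(encoding="utf-8")


def write_model(counts, path):
    """Store (sequence, frequency) records in a simple binary layout."""
    with open(path, "wb") as out:
        out.write(MAGIC + struct.pack("<I", len(counts)))
        for seq, freq in sorted(counts.items()):
            raw = seq.encode("utf-8")
            out.write(struct.pack("<HI", len(raw), freq))
            out.write(raw)


def read_model(path):
    """Load the records written by write_model back into a Counter."""
    counts = Counter()
    with open(path, "rb") as src:
        if src.read(4) != MAGIC:
            raise ValueError("not a model file")
        (total,) = struct.unpack("<I", src.read(4))
        for _ in range(total):
            length, freq = struct.unpack("<HI", src.read(6))
            counts[src.read(length).decode("utf-8")] = freq
    return counts
```

With something of this shape, fixing a counting bug means rerunning
write_model(count_ngrams(read_corpus(some_dir)), some_file) over
whatever text the user has on disk; no particular website is involved.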
However, whether Debian should be expected to keep such large binary
files on all the mirrors is a different question. How long does it take
to generate the binary file? Could it be generated by the user? Does the
package support reading the data from a proxy or from content on a
regular filesystem?

There are reasons why other packages can require access to specific
websites but those packages do so to obtain data that has some copyright
/ creative input, e.g. a shared SVN repository, or where it is
unreasonable to package the data along with the package.

> Is there anyway to fix this? Is removing the non-free piece the only
> way? I've put the explanation in README.Debian of this package.

Statistical data has no copyright; it merely needs to be generated and
updated. The question is not about whether the data meets DFSG, the
question is whether the data should be in the package in the first
place.

I would be tempted to package the free software that generates the stats
data and let the user create the data after installation. I don't see
why the mirrors should keep copies of such large amounts of generated
data. It's a bit like packaging a TCP/IP client and bundling a generated
block of packets as well. The data does not need to be distributed
AFAICT, it would presumably need to be updated from time to time anyway.

The data itself would appear to have no intrinsic value to the package
beyond saving time for the user. Personally, I do not consider that a
worthwhile excuse for taking up so much space on the Debian mirrors /
DVD.

It comes down to this:

1. The package must work with any binary file generated using the
package itself (or dependencies), including files generated by any
random user, with or without the original data file. That is a DFSG
requirement.
I would consider the package to be non-free if the package insists on
going to specific websites directly - it *must* be possible to generate
the statistical data from a local copy of the websites as well (i.e.
offline) or via a proxy or via a completely arbitrary set of web content
on the filesystem.

2. The data contains none of the original copyrighted data - the actual
values in the data file are therefore completely arbitrary and any
realistic permutation of the numbers could exist in the binary.

3. Re-generating the binary file after installation must not change the
functionality of the package, it merely changes the data that it
handles.

4. Given the above, there is no reason to package the data beyond saving
time for the user.

-- 
Neil Williams
=============
http://www.data-freedom.org/
http://www.nosoftwarepatents.com/
http://www.linux.codehelp.co.uk/