
Re: is statistical data extracted from web DFSG compliant?



Thanks a lot for your insightful analysis, Neil. =)
But I am still unclear about a few points.

On Mon, Nov 10, 2008 at 10:34:59AM +0000, Neil Williams wrote:
> On Mon, 2008-11-10 at 10:06 +0800, Kov Chai wrote:
> > I am working on a Chinese input method engine [1,2]. This input method
> > engine is based on a statistical language model. We extract the
> > language model from a 150MiB corpus collected from some chosen
> > Chinese websites using a training algorithm. The extracted data -- the
> > language model -- does not contain any text from these websites. Only
> > statistics for the frequencies of occurrence of given character
> > sequences are stored in a binary format.
> 
> Then there is no copyright over the statistical data, only the
> statistical analysis tools. The creativity lies in writing the code that
> generates the data, not the data itself. Think of it like a piece of
> networking software, the creativity (and therefore the copyright) is in
> the design of the pipe, not what goes through the pipe.
> 
> The binary file is not a problem in and of itself - the package must
> simply support modification of the binary and regeneration of a new
> binary should the old one get deleted or should the package be shown to
> have miscalculated some of the data. It is the support for modification
> that matters.

Actually, it is another package (sunpinyin-slm) that supports generating
the data. sunpinyin-slm is still in its ITP phase [1], and it only
supports generating, merging, and querying the data. Sunpinyin-SLM stores
the data in a trie of character sequences of different lengths. So I guess
it is easier to fix bugs in the language model by regenerating it from a
modified corpus than by modifying the generated file directly.
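
In case a concrete sketch helps, the training step boils down to
something like the following Python (purely illustrative: SunPinyin is
written in C++, and the names and trie layout here are my own invention,
not the real sunpinyin-slm format):

    # Purely illustrative Python, not SunPinyin code: count character
    # n-gram frequencies and store them in a nested-dict trie.
    from collections import defaultdict

    def count_ngrams(text, max_n=3):
        counts = defaultdict(int)
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                # only the frequency is kept, never the source text
                counts[text[i:i + n]] += 1
        return counts

    def build_trie(counts):
        root = {}
        for seq, freq in counts.items():
            node = root
            for ch in seq:
                node = node.setdefault(ch, {})
            # the None key marks a frequency; it cannot clash with a char
            node[None] = freq
        return root

Nothing of the original websites survives in such a structure, which is
why the copyright question reduces to the corpus-collection step.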

Will this change your assessment? Or is the acceptance of sunpinyin-slm a
prerequisite for the package (sunpinyin) in question?

--
[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=490970

> 
> > The upstream released both the data and the source code for
> > reading/writing such kind of binary file under dual license of CDDL
> and LGPLv2.1. So I believe this software (and data) is free. But I am
> afraid that this package is not compatible with DFSG #2 in some sense. Am
> > I right?
> 
> The data has to be capable of being regenerated / refreshed using the
> free software provided in the package (or some other free software). If
> there is a bug in the algorithm that miscalculates the frequency of one
> symbol, the package must still be capable of regenerating corrected
> data. Just because the required input data (the websites) cannot also be
> packaged does not mean that the generated data is not free. As long as a
> random user can use the package to update the binary file in the case of
> bugs, it meets DFSG requirements.
> 
> Therefore, whether Debian should be expected to keep such large binary
> files on all the mirrors is a different question. How long does it take
> to generate the binary file? Could it be generated by the user? Does the
> package support reading the data from a proxy or from content on a
> regular filesystem?
> 

The generated lexicon and language model data files are 30.6 MiB in
total. Is this unacceptable? Generation takes about 50 minutes on a normal
desktop computer (1.9 GHz processor). Yes, the binary file can be generated
by the user from a text file on a regular filesystem. But before that, the
user needs to collect *lots* of Chinese text as a raw corpus to feed in for
preprocessing, and the preprocessed corpus is in turn sent to the training
algorithm.
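
Roughly, regeneration might look like the Python sketch below. The
train_model function, file names, and record format are hypothetical,
not the actual sunpinyin-slm interface; the point is only that the whole
process runs offline against an ordinary local file:

    # Hypothetical sketch of regenerating the binary model from a
    # locally supplied corpus; no network access is needed.
    import struct
    from collections import Counter

    def train_model(corpus_path, model_path, max_n=3):
        text = open(corpus_path, encoding='utf-8').read()
        counts = Counter(text[i:i + n]
                         for n in range(1, max_n + 1)
                         for i in range(len(text) - n + 1))
        with open(model_path, 'wb') as out:
            for seq, freq in counts.items():
                data = seq.encode('utf-8')
                # length-prefixed record: seq length, frequency, bytes
                out.write(struct.pack('<HI', len(data), freq) + data)

    train_model('corpus.txt', 'lm.bin')  # rerun any time to fix the data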

Obviously, most users do not have a corpus that large (our medium-sized
corpus is 150 MiB). So the most feasible way to get a large enough and
balanced raw corpus is to download webpages from different websites. But
due to the speed of the user's internet connection and the policies of
certain websites, it always takes days to collect a large enough raw
corpus. It took the upstream author 2~3 days to download the 150 MiB raw
corpus.


> There are reasons why other packages can require access to specific
> websites but those packages do so to obtain data that has some
> copyright / creative input, e.g. a shared SVN repository, or where it is
> unreasonable to package the data along with the package.
> 
> > Is there any way to fix this? Is removing the non-free piece the only
> > way? I've put the explanation in README.Debian of this package.
> 
> Statistical data has no copyright, it merely needs to be generated and
> updated. The question is not about whether the data meets DFSG, the
> question is whether the data should be in the package in the first
> place.
> 
> I would be tempted to package the free software that generates the stats
> data and let the user create the data after installation. I don't see
> why the mirrors should keep copies of such large amounts of generated
> data.
> 
> It's a bit like packaging a TCP/IP client and bundling a generated block
> of packets as well. The data does not need to be distributed AFAICT, it
> would presumably need to be updated from time to time anyway.

You are right. The language model needs to be updated, but the input method
engine is also designed to learn the user's language by building a
user-specific lexicon and language model while the user types text with
this input method. This feature somewhat offsets the need to update the
shipped data.
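
As a rough illustration (hypothetical Python again, not the actual
SunPinyin code; UserModel and its methods are invented names), the
per-user adaptation amounts to incrementing n-gram counts as the user
commits text and consulting them alongside the shipped model:

    # Hypothetical sketch of per-user adaptation.
    from collections import Counter

    class UserModel:
        def __init__(self, max_n=3):
            self.max_n = max_n
            self.counts = Counter()

        def learn(self, committed_text):
            # bump the user's own n-gram counts as text is committed
            for n in range(1, self.max_n + 1):
                for i in range(len(committed_text) - n + 1):
                    self.counts[committed_text[i:i + n]] += 1

        def frequency(self, seq, system_counts):
            # blend the shipped model with what the user actually types
            return system_counts.get(seq, 0) + self.counts[seq]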

> 
> The data itself would appear to have no intrinsic value to the package
> beyond saving time for the user. Personally, I do not consider that a
> worthwhile excuse for taking up so much space on the Debian mirrors /
> DVD.
Do you still think so, even though it takes a lot of time to prepare the
raw data and generate the model? Or, as an alternative, could we simply
tell users to download the data from a certain site if they do not want to
create it themselves?

> 
> It comes down to this:
> 1. The package must work with any binary file generated using the
> package itself (or dependencies), including files generated by any
> random user, with or without the original data file. That is a DFSG
> requirement. I would consider the package to be non-free if the package
> insists on going to specific websites directly - it *must* be possible
> to generate the statistical data from a local copy of the websites as
> well (i.e. offline) or via a proxy or via a completely arbitrary set of
> web content on the filesystem. 
Yes, sunpinyin-slm does generate the statistical data from a text file on a
regular filesystem, given an original lexicon file.

> 2. The data contains none of the original copyrighted data - the actual
> values in the data file are therefore completely arbitrary and any
> realistic permutation of the numbers could exist in the binary.
> 3. Re-generating the binary file after installation must not change the
> functionality of the package, it merely changes the data that it
> handles.
> 4. Given the above, there is no reason to package the data beyond saving
> time for the user.

Thanks, I get it. If the effort/time needed to generate the data file
cannot offset its size, we can hardly justify including the data file.

Since my package has been stuck in the NEW queue for quite a while, I really
want to know what I can do to push it forward. Thanks again.

-- 
Regards,
Kov Chai
2008.11.10 Mon

--
Each problem that I solved became a rule which served afterwards to 
solve other problems
                                   -- R. Descartes
