legal questions regarding machine learning models
In Machine Learning (a branch of Artificial Intelligence), algorithms
are developed to "compile" ("train") models from data. The goal is to
estimate the parameters (e.g. real numbers) of a model so that the
model best fits the data. Because the number of parameters can be
large (e.g. thousands), this can be a complex optimization problem,
and for some applications the computation may take days. For example,
in speech recognition, speech models
are trained from databases of speech and their corresponding annotated
text. The models can then be used to recognize speech. To summarize
the "training" procedure with a black box:
input: data => [training algorithm] => output: model
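To make the black box concrete, here is a minimal sketch in Python. It
assumes a toy model (a straight line y = a*x + b fitted by ordinary
least squares), which is far simpler than a speech model; the function
name train and the sample data are made up for illustration only.

```python
def train(data):
    """Estimate the parameters a and b of y = a*x + b by least squares.

    `data` is a list of (x, y) pairs; the returned dict is the "model".
    """
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sums of squared deviations and cross deviations around the means.
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return {"a": a, "b": b}


# Hypothetical training data: four points lying on y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = train(data)
print(model)
```

Here four data points are reduced to two numbers. A real model has many
more parameters, but the data/model relationship is the same.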
As can be seen from the arrows, this is a "one way" transformation:
it is possible to transform the data into a model, but it is not
possible to transform the model back into exactly the same data. The
only way for someone to find out whether their data were used to
create the model is to reproduce exactly the same training conditions
and train on the data again to see whether the resulting model is the
same. However, two implementations of the same algorithm may differ
due to design choices, and the algorithms themselves can have several
parameters, so it is not easy to reproduce the exact same training
conditions. Even then, there is no proof that some other data could
not lead to the same model under some other training conditions.
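The last point can be illustrated with a small sketch. It assumes a
deliberately trivial model, a Gaussian described only by its mean and
variance; the function name train_gaussian and both data sets are
hypothetical. Two different data sets produce exactly the same
parameters, so the model alone cannot tell us which one was used.

```python
def train_gaussian(samples):
    """Fit a Gaussian: return (mean, population variance) of the samples."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    return (mean, variance)


# Two different (made-up) data sets.
data_a = [1.0, 3.0]
data_b = [1.0, 1.0, 3.0, 3.0]

model_a = train_gaussian(data_a)
model_b = train_gaussian(data_b)

# The data differ, yet the trained models are identical.
print(data_a == data_b)
print(model_a == model_b)
```

With thousands of parameters such exact collisions are much less
likely, but nothing rules them out in principle.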
For efficient storage, the model may be stored in a binary format, but
human-readable formats (such as XML) may also be used, thus allowing
easy access to the parameters of the model.
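As an illustration of a human-readable model file, here is a sketch
using Python's standard xml.etree.ElementTree module. The element and
attribute names ("model", "parameter", "name") are assumptions made up
for this example, not the format of any particular toolkit.

```python
import xml.etree.ElementTree as ET


def model_to_xml(params):
    """Serialize a dict of named parameters to an XML string."""
    root = ET.Element("model")
    for name, value in params.items():
        element = ET.SubElement(root, "parameter", name=name)
        element.text = repr(value)
    return ET.tostring(root, encoding="unicode")


def model_from_xml(text):
    """Read the parameters back from the XML string."""
    root = ET.fromstring(text)
    return {p.get("name"): float(p.text) for p in root.findall("parameter")}


# Hypothetical model with two parameters.
params = {"a": 2.0, "b": 1.0}
xml_text = model_to_xml(params)
restored = model_from_xml(xml_text)
print(xml_text)
```

Anyone can open such a file and read or edit the numbers, but the file
says nothing about the data they were estimated from.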
My first question is: is it possible to distribute the model under a
free software license without distributing the original data that were
used to train the model? Likewise, is it possible to package a model
directly in Debian? Although it is very unlikely, I could claim that I
found the parameters of the model by hand. In that case, the
parameters can be seen as "magic numbers" with no explanation
whatsoever as to how I found them.
If the answer is yes, it means one can potentially use non-free data
to create models and distribute them under a free license.
If the answer is no, then there are some practical problems to solve.
As noted earlier, it can take days to train the model from data.
Furthermore, the data usually take much more space (from a few
megabytes to a few gigabytes) than the model (usually from a few
kilobytes to a few megabytes). In other words, it may not be practical
to ask Debian developers (and, even more so, end users) to rebuild the
model from the data. In that case, distributing the model directly is
the only practical solution. Would it be enough to add a README file
to the model distribution, indicating the URL where the original data
can be downloaded together with the build script?
Depending on the application domain, it is not easy to collect data
for use in training, because the data need to be annotated with their
corresponding labels. This is a major barrier to building free
software applications that use machine learning techniques. One effort
called "Voxforge" aims to collect speech samples from contributors and
license them under the GPL.
My second question is: given the difficulty of proving what data were
actually used to train a model, how can we prevent non-free software
from using free data such as those of Voxforge?
I realize that my two questions reflect opposing interests, but this
was to show the two sides of the coin. In any case, I think that this
kind of legal issue may arise for free software in fields such as
speech recognition, handwriting recognition, machine translation, and
all other kinds of applications that make use of machine learning
techniques, so it is worth discussing.
PS: please add me to CC in your replies