
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



Aigars Mahinovs <aigarius@gmail.com> writes:

> If we take as a given that copyright does *not* survive the learning
> process of a (sufficiently complex) AI system, then it is *not* necessary
> that all training *data* for training a DFSG-free AI to also be DFSG-free.
> It is however necessary that:
> * software needed for inference (usage) of the AI model to be DFSG-free
> * software needed for the training process of the AI model to be DFSG-free
> * software needed to gather, assemble and process the training data to be
> DFSG-free or the manual process for it to be documented

Without necessarily disagreeing with this, I want to highlight that
licensing is only *one* of the considerations behind the DFSG and we
shouldn't fixate on it alone. The other question is whether the training
data constitutes source code in the sense of DFSG 2. I think there's at
least a prima facie case that it is: the final trained model is quite
clearly not the preferred form for modification, and anyone who wanted to
retrain the model would normally prefer to start with the existing
training data set (and then possibly augment or filter it).

Historically, we have not done this analysis, and we've basically ignored
this problem. I packaged gnubg for years, never included the training
data, and treated the model weights as if they were the source code, and no
one really noticed or complained. But I'm not sure that was a defensible
position. It was just something I did by default without really thinking
about it. Now that the topic has come up and I've had a chance to think
about it properly, I'm not at all sure that was correct.

DFSG 2 is an independent requirement. Even if the source code to a package
is clearly DFSG-free, we still require that the source code be in main,
not off somewhere else where we promise it exists, really (but which is
not under our control). We have historically not applied that to the
training data for models, and maybe that's correct, but the correctness of
that position is certainly not obvious to me from the wording of the DFSG.

-- 
Russ Allbery (rra@debian.org)              <https://www.eyrie.org/~eagle/>

