Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models
On Wed, 14 May 2025 at 08:58, Simon Josefsson <simon@josefsson.org> wrote:
> To me I think we have at least two camps:
>
> 1) We must have DFSG-compliant licensing of source code for everything
> in main, and that source code should encompass everything needed for a
> skilled person to re-create identical (although possibly not bit-by-bit
> identical) artifacts.
>
2) We must have DFSG-compliant licensing of source code for everything
in main, but training data is not part of source code. Instead, the
source code for training models would be the code and protocol
describing how to generate or gather training data in such a way that
a skilled person would be able to re-create functionally equivalent
(although not identical) artifacts. If re-creation is impractical (due
to compute costs), then the model must also be modifiable after
training by a skilled person with tooling in the archive.
This matches the meaning of the OSI definition at
https://opensource.org/ai, just mapped onto the DFSG criteria (using
the "Data Information" definition from OSI).
Or, reformulated extremely concisely as clarifications to DFSG scope:
1) AI training data is source code.
2a) AI training data is not source code.
2b) AI training data is not source code, but "Data Information" is source code.
2b+) AI models must either be easily retrainable from training data *or* have to
be easily modifiable and adaptable after training to satisfy DFSG.3
Option 2a is likely obsolete in this discussion context, being the
maximally permissive one. The combination of 2b and 2b+ is, for me,
the preferable position.
> Neither position has much to do with AI models as far as I can tell.
After the reformulation it is a bit more clearly to do with AI models.
> Is there any complication beyond size and infrastructure to recreate models
> that are a factor here? Or is this "just" a re-hash of the perpetual
> main vs non-free discussion?
Whether an OSI-free LLM, as it stands right now, would be acceptable
to distribute in non-free is an interesting side question. non-free
does not have a strict full-source-code requirement, but it does have
a binary redistributability requirement.
IMHO this could currently be opposed either with a moral argument
against non-consensually trained AIs (like Russ described), or with
the legal argument that the model itself is a derivative work of all
of its training data, so we cannot trust the copyright or license
terms that the creators of the LLM claim, and thus we may not have
the rights to redistribute the model.
While I find the legal component of these arguments to be shaky, the
moral argument is a matter of opinion. I do not agree with that
opinion, but I can see how it is a perfectly valid and consistent
opinion to hold.
--
Best regards,
Aigars Mahinovs