
Proposal Alternative: A Model Can Be a Preferred Form of Modification



I'm not sure if this is too late. The mail to debian-devel-announce was
kind of late, and I hope there is still some discussion time left.

It is late enough that I am immediately seeking seconds for the
following proposal.
I am also open to wordsmithing if we have time.

If we decide to take more time to think about this issue and build
project consensus, I would be delighted if we did not vote now.

Rationale:

TL;DR: If in practice we are able to modify the software we have, and
the license is DFSG-free, then I think we meet DFSG 2 and the software
should be considered DFSG-free.

This proposal builds on the comments I made in
https://lists.debian.org/tsled098ieb.fsf@suchdamage.org


It's been my experience that, given the costs of AI training, the model
itself is often the preferred form of modification. I find this
particularly true in the case of LLMs, based on my experience over the
last year.  I particularly disagree with Russ that doing a
full-parameter fine-tuning of a model is anything like calling into a
library; to me it seems a lot more like modifying a Smalltalk world or
changing a LambdaMOO world and dumping a new core. Even LoRA-style
retraining looks a lot like the sort of patch files permitted by DFSG 4.
I disagree with those who claim that if we had the original training
data, we would choose to start there when we want to modify a model.
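
To make the patch-file analogy concrete, here is a minimal sketch of
LoRA-style retraining in Python, assuming the Hugging Face transformers
and peft libraries; the model name and adapter directory are
placeholders, not real artifacts. The thing you end up shipping is a
small set of adapter weights layered on top of the base model:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Start from the released model weights; no original training data
    # is needed for this kind of modification.
    base = AutoModelForCausalLM.from_pretrained("some-org/base-llm")

    # Attach small low-rank adapter matrices to the attention projections.
    config = LoraConfig(r=8, lora_alpha=16,
                        target_modules=["q_proj", "v_proj"],
                        lora_dropout=0.05, task_type="CAUSAL_LM")
    model = get_peft_model(base, config)

    # ... run a fine-tuning loop on your own data here ...

    # Saves only the adapter weights (typically a few megabytes), not
    # the whole model: effectively a patch against the base, much like
    # the patch files DFSG 4 contemplates.
    model.save_pretrained("my-adapter")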

But this GR is about more than LLMs, and as Ansgar has indicated, the
practical effects are most pronounced in other areas.
I suspect that incremental training would also work in cases like GnuBG
and possibly even in cases like Tesseract.
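
For models of that scale, here is a self-contained sketch of what
incremental training looks like, using PyTorch with a toy network and
synthetic data standing in for the real thing (this is not GnuBG's or
Tesseract's actual code):

    import torch
    import torch.nn as nn

    # A small evaluation network standing in for an upstream-released model.
    net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

    # Pretend these are the weights upstream ships in its release.
    torch.save(net.state_dict(), "released-weights.pt")

    # A downstream modifier starts from the shipped weights, not from
    # the original training corpus.
    net.load_state_dict(torch.load("released-weights.pt"))
    opt = torch.optim.SGD(net.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    # Incremental training on new, locally produced examples.
    x, y = torch.randn(64, 8), torch.randn(64, 1)
    for _ in range(100):
        opt.zero_grad()
        loss_fn(net(x), y).backward()
        opt.step()

    torch.save(net.state_dict(), "modified-weights.pt")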

On the LLM front, I want to reward projects for making their fine-tuning
available and for making it easy to change out LLMs, even if the
original training data for the base LLM is not available.
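
To illustrate what "easy to change out" can mean in practice, here is a
hypothetical sketch; the class and backend names are invented, but the
pattern of hiding the model behind a narrow interface is what makes
replacement a configuration change rather than a rewrite:

    from typing import Protocol

    class TextModel(Protocol):
        # The narrow interface the rest of the application codes against.
        def generate(self, prompt: str) -> str: ...

    class EchoBackend:
        # Stand-in backend; a real one would wrap an actual LLM runtime.
        def generate(self, prompt: str) -> str:
            return prompt.upper()

    BACKENDS = {"echo": EchoBackend}

    def load_model(name: str) -> TextModel:
        # Single switch point: swapping the LLM touches one config value.
        return BACKENDS[name]()

    print(load_model("echo").generate("hello"))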

On the general AI front, I care a lot more that we have a credible plan
for making modifications to software than that we have original training
data. In cases where a full retraining of a machine learning model is
practical, original training data is a great option.
In some cases, even if we have original training data, we might not
choose to use it.
I explicitly want to permit people who are liberating software: taking
something that is DFSG-licensed but problematic to modify and using that
as a basis for something going forward.
So long as the upstream actually gives us their preferred form of
modification, I am fine with that.

For myself, I do not find Thorsten's interpretation of copyright law or
the ethics around copyright compelling.
I am explicitly saying that my reasoning is incompatible with his
position.


***Proposal Text***

Choice 2: Software Incorporating AI Models Released under DFSG-free
Licenses Must Provide for Practical Modification to Comply with the DFSG

The project asks those charged with interpreting the DFSG to require
that software incorporating AI models have a preferred form of
modification for the models, and that it provide our users the ability
to modify these models, in order to be included in the main section of
the archive. Examples of such a preferred form of modification include
the original training data for the model. Alternatively, a base model
(especially when the base model can be replaced and multiple options are
available) along with training data for any fine-tuning that has been
performed is acceptable. In some cases, a model along with the necessary
tools to perform incremental fine-tuning may be acceptable, if doing
additional incremental training is actually the approach that the
upstream project uses to modify the model. As with other interpretations
of the DFSG, something cannot be the preferred form of modification if
the upstream of the software under consideration has a more preferred
form of modification that is not public.


