I'm not sure if this is too late. The mail to debian-devel-announce was kind of late, and I hope there is still some discussion time left. It is late enough that I am immediately seeking seconds for the following proposal. I am also open to wordsmithing if we have time. If we decide to take more time to think about this issue and build project consensus, I would be delighted if we did not vote now.

Rationale:

TL;DR: If in practice we are able to modify the software we have, and the license is DFSG-free, then I think we meet DFSG 2 and the software should be considered DFSG-free.

This proposal builds on the comments I made in https://lists.debian.org/tsled098ieb.fsf@suchdamage.org

It has been my experience that, given the costs of AI training, the model itself is often the preferred form of modification. I find this particularly true in the case of LLMs, based on my experience over the last year. I particularly disagree with Russ that doing a full-parameter fine-tuning of a model is anything like calling into a library; to me it seems much more like modifying a Smalltalk world, or changing a LambdaMOO world and dumping a new core. Even LoRA-style retraining looks a lot like the sort of patch files permitted by DFSG 4 (see the sketch after the proposal text). I disagree with those who claim that if we had the original training data we would choose to start there when we want to modify a model.

But this GR is about more than LLMs, and as Ansgar has indicated, the practical effects are most pronounced in other areas. I suspect that incremental training would also work in cases like GnuBG and possibly even in cases like Tesseract.

On the LLM front, I want to reward projects for making their fine-tuning available and for making it easy to swap out LLMs, even if the original training data for the base LLM is not available. On the general AI front, I care a lot more that we have a credible plan for making modifications to software than that we have the original training data. In cases where a full retraining of a machine learning model is practical, original training data is a great option. In some cases, even if we have the original training data, we might not choose to use it.

I explicitly want to permit people who are liberating software: taking something that is DFSG-licensed but problematic to modify, and using it as a basis for something going forward. So long as the upstream actually gives us their preferred form of modification, I am fine with that.

For myself, I do not find Thorsten's interpretation of copyright law or the ethics around copyright compelling. I am explicitly saying that my reasoning is incompatible with his position.

***Proposal Text***

Choice 2: Software Incorporating AI Models Released under DFSG-free Licenses Must Provide for Practical Modification to Comply with the DFSG

The project asks those charged with interpreting the DFSG to require, as a condition of inclusion in the main section of the archive, that software incorporating AI models have a preferred form of modification for those models, and that we provide our users the ability to modify these models. One example of such a preferred form of modification is the original training data for the model. Alternatively, a base model (especially when the base model can be replaced and multiple options are available), along with the training data for any fine-tuning that has been performed, is acceptable.
In some cases, a model along with the necessary tools to perform incremental fine-tuning may be acceptable, if additional incremental training is actually the approach that the upstream project uses to modify the model. As with other interpretations of the DFSG, something cannot be the preferred form of modification if the upstream of the software under consideration has a more preferred form of modification that is not public.
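
As an aside, and not part of the proposal text itself, here is a minimal sketch of why LoRA-style retraining resembles the patch files permitted by DFSG 4: what gets distributed is just a pair of small low-rank matrices per weight matrix, and "applying" the adapter means adding their product to the frozen base weights, much as a patch is applied to pristine source. All names and numbers below are hypothetical, chosen only to show the size asymmetry.

    # Illustrative only: a LoRA adapter as a "patch" against frozen base weights.
    import numpy as np

    rng = np.random.default_rng(0)

    d, r = 1024, 8                          # model width, adapter rank (r << d)
    W_base = rng.standard_normal((d, d))    # frozen base-model weight matrix

    # The "patch": two small matrices, ~2*d*r values instead of d*d.
    A = rng.standard_normal((r, d)) * 0.01
    B = np.zeros((d, r))                    # zero-initialized, as in LoRA

    W_effective = W_base + B @ A            # base weights + low-rank delta

    print(f"base parameters:    {W_base.size}")      # 1048576
    print(f"adapter parameters: {A.size + B.size}")  # 16384, about 1.6%

The deltas are tiny relative to the base weights, just as a patch is tiny relative to the source tree it modifies, which is why I think DFSG 4 is the right lens here.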