Aigars Mahinovs <aigarius@gmail.com> writes:
> If we take as a given that copyright does *not* survive the learning
> process of a (sufficiently complex) AI system, then it is *not* necessary
> that all the training *data* used for training a DFSG-free AI also be DFSG-free.
> It is however necessary that:
> * software needed for inference (usage) of the AI model to be DFSG-free
> * software needed for the training process of the AI model to be DFSG-free
> * software needed to gather, assemble and process the training data to be
> DFSG-free or the manual process for it to be documented
Without necessarily disagreeing with this, I want to highlight that
licensing is only *one* of the considerations behind the DFSG and we
shouldn't fixate only on it. The other question is whether the training
data constitutes source code in the sense of DFSG 2. I think there's at
least a prima facie case that it is: The final trained model is quite
clearly not the preferred form of modification, and anyone who wanted to
retrain the model would normally prefer to start with the existing
training data set (and then possibly augment or filter it).
Yes, that is a very important problem for Debian and, for example, the Desert Island test applies well here. If re-training the model would require downloading half the Internet, then it is pretty obvious that someone on a desert island without a network connection will not be able to do this. *However*, models again are substantially different from regular software (which gets modified in source and then compiled to a binary), because such a model can be *modified* and adapted to your needs directly from its end state. In fact, for adapting an LLM to a particular domain or a particular company, it actually *is* the "binary" that is the *preferred* form to be modified: you take a model that "knows" a lot in general and "knows" how your language works, and you train it further by doing specialisation training on your specific data set. As a result, from one "generic" binary you get another, "specialized" binary.
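To make that concrete, here is a minimal sketch of such specialisation training ("fine-tuning"), assuming the Hugging Face transformers and peft libraries; the model names and paths are placeholders I invented for illustration, not anything Debian ships:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    # Load the existing "generic" binary; no training data is needed here.
    base = AutoModelForCausalLM.from_pretrained("example-org/generic-llm")
    tokenizer = AutoTokenizer.from_pretrained("example-org/generic-llm")

    # Attach small trainable LoRA adapters; the original weights stay
    # frozen, so the "binary" itself is the starting point.
    config = LoraConfig(r=8, lora_alpha=16,
                        target_modules=["q_proj", "v_proj"],
                        task_type="CAUSAL_LM")
    model = get_peft_model(base, config)

    # ... run an ordinary training loop over the company-specific corpus ...

    # Save the result: a new, "specialized" binary derived from the old one.
    model.save_pretrained("specialized-llm")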
So, very precisely speaking, modification of an LLM does *not* require the original training data. Recreating an LLM does. Developing a new LLM with different training methods or training conditions also needs some training data (ideally the original training data, especially to compare end performance). But all in all, a developer on a Desert Island would be better off with a "binary" model to modify than without it.
Say, for example, that an IDE saves its configuration state not in a common text file, but as a binary memory dump. Say the maintainer of such a package uses their experience of the IDE and years of development to go through the GUI of this software and assemble a great setup configuration that works well for anyone starting to use the IDE, and that also leaves clues about how to tailor it further for your needs. This configuration (a binary memory dump of the software state) is then distributed to the users as the default configuration. What is "the source" of it? Isn't this binary (which the GUI can both read and write) the preferred form for modification? The maintainer can describe how they created the GUI state (document the training process), but not really include all the relevant experience (training data) that led them to believe that this state is the best for new users. So what is Llama if not a *very* complex nvim config file focused on autocomplete? :D Quite a few of those questions also apply to fonts (IMO).
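As a toy illustration of that analogy (invented here, not any real IDE's format), a configuration that exists only as a binary dump of program state could look like this:

    import pickle

    # The "configuration" is just serialized program state; the application
    # can both read and write it, but there is no text "source" behind it.
    state = {"theme": "dark", "keymap": "vim", "plugins": ["autocomplete"]}

    with open("default-config.bin", "wb") as f:
        pickle.dump(state, f)       # what the maintainer distributes

    with open("default-config.bin", "rb") as f:
        restored = pickle.load(f)   # what a user (via the GUI) would modify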
We (as Debian) approach DFSG compliance in terms of source code more strictly than many licenses do. We require the source code to be on Debian servers in the Debian-preferred form. The GPL, for example, is content with a promise to send the end user the source code on request.
That *could* be the technical difference in definitions between what is "DFSG-free AI" and what is "Debian-main-grade-free AI", especially if Debian decides it does not want to store literal terabytes of training data for every LLM variation. This could be worked around on a more general level with some kind of data-set preservation and indexing foundation, like the Internet Archive. In that case the Debian package could reference a particular assembled data set it used for training (for example in the form of a magnet link) and delegate storage and re-distribution of that data set to external trusted source organisations.
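Purely as a hypothetical sketch of how a package could carry such a reference, the manifest below (data set name, magnet link, and digest are all invented placeholders) pairs the external reference with an integrity check:

    import hashlib
    import sys

    # Invented manifest: the package ships only this reference, not the data.
    MANIFEST = {
        "dataset": "example-llm-training-corpus-v1",
        "magnet": "magnet:?xt=urn:btih:PLACEHOLDER",
        "sha256": "placeholder-digest",
    }

    def verify(path: str) -> bool:
        """Check a locally fetched data set against the manifest digest."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest() == MANIFEST["sha256"]

    if __name__ == "__main__":
        print("ok" if verify(sys.argv[1]) else "MISMATCH")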
Or Debian could go the MS TTF route: have the software in the archive, but no models at all. To get the software to work, users would get used to running a script that pulls a model from huggingface.co, either manually or even during package installation, possibly with a barely functional placeholder model in the package that 99% of users would replace in real usage. That would keep the "evil" AI away from the archive, but will that benefit our users? Will that benefit the development of a freer and more accessible AI landscape? I would think rather the opposite.
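A rough sketch of what such a fetch helper might look like, assuming the huggingface_hub library and a placeholder repository id and target directory:

    from huggingface_hub import hf_hub_download

    # Download (and cache) one weights file from a public repository,
    # much like the MS core fonts installer fetches fonts instead of
    # shipping them. Repo id, filename and directory are placeholders.
    path = hf_hub_download(
        repo_id="example-org/example-model",
        filename="model.safetensors",
        local_dir="/var/lib/example-ai/models",
    )
    print("model stored at", path)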