
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Mon, 28 Apr 2025 at 22:02, Russ Allbery <rra@debian.org> wrote:
Aigars Mahinovs <aigarius@gmail.com> writes:

> *However*, models again are substantially different from regular
> software (that gets modified in source and then compiled to a binary)
> because such a model can be *modified* and adapted to your needs
> directly from the end state. In fact, for adjusting an LLM for use in a
> particular domain or a particular company it actually *is* the "binary"
> that is the *preferred* form to be modified - you take a model that
> "knows" a lot in general and "knows" how your language works, and you
> train the model further by doing specialisation training on your
> specific data set. As a result, from one "generic" binary you get
> another - "specialized" - binary.

I have to say that I'm not convinced by this argument that models are any
different from other types of software. To me, this type of "modification"
is akin to using code as a library without modifying it.

This feels like a false equivalence to me.

I do not believe that training data for a model is actually source code. And I do not believe that a model itself is software.

Over the decades we have become used to thinking about software freedom and licensing in terms like "source code" and "binary" (and "derived work"). These terms, their meanings and their consequences are very well established and rest on the solid foundation that copyright law provides. So I understand the temptation to push everything we encounter into one of those two holes. I am still trying to properly wrap my brain around this whole situation as well.

However, here we have a clear and fundamental change happening at the level of copyright law - there is a legal break, a firewall, that occurs during training. The model *is* a derivative work of the source code of the training software, but it is *not* a derivative work of the training data. This means that we also have to consider what exactly training data is and how to deal with it, without automatically falling back to equating it with source code.

In the same way, we should not be equating the resulting model with a binary either - at least not automatically. It has much more in common with a database, with map data or with a configuration file than with an executable binary, both in the way it is used and in the way it is modified. Many such data aggregations also have lots of different actual, original sources that are burdened by various licenses and restrictions and are not, themselves, redistributable. All of those data aggregations also have actual software that loads the data and provides services to the user with it. For example, navigation software loads the map data and uses it, along with the user's request, to assemble the best route to the user's intended destination. The map data file here is not software.

Just as a developer on an island is not expected to recreate OpenStreetMap geographical data from scratch (including not only big imports from government geoinformatic systems, but also the many tiny corrections that humans made to fix misrepresentations in their local areas), they are not really expected to recreate an LLM loaded up with language and relationship information from the whole world's knowledge.

They can still compile from sources the software that allows modifying the OSM map data (if they happen to have an offline copy of it) and add the mapping for their island, or even delete all map data of a country they dislike. In the same way, they can add island-specific knowledge to an LLM without needing its original training data and without doing a full re-training. They can also limit the outputs of the LLM to remove or censor certain information, likewise without retraining. In fact, the freedom to do such things could be a good criterion for a "DFSG-free LLM" (among others).
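To make the "limiting outputs without retraining" point concrete, here is a minimal sketch of a post-hoc output filter. Everything in it is hypothetical - `generate` stands in for whatever text-generation call a real model exposes - but it illustrates that this class of modification touches neither the weights nor the training data:

```python
def censor_output(generate, blocked_terms):
    """Wrap a text-generation function so that blocked terms are
    redacted from its responses. This restricts the model's outputs
    without needing the training data or any retraining."""
    def filtered(prompt):
        text = generate(prompt)
        for term in blocked_terms:
            # Case-insensitive check, literal replacement of the term.
            if term.lower() in text.lower():
                text = text.replace(term, "[redacted]")
        return text
    return filtered

# Toy stand-in for a real model's generation call.
fake_generate = lambda prompt: "The capital of Atlantis is Poseidonia."
safe_generate = censor_output(fake_generate, ["Poseidonia"])
```

The same wrapping approach works regardless of how the underlying model was trained, which is exactly why such freedoms are exercisable from the "end state" alone.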

I do agree with the original proposal in its positive terms: yes, if all the software with its full source code *and* all the training data are available and redistributable under DFSG-free conditions, then the resulting model is a DFSG-free model and can be put into Debian main.

But the point I am trying to argue is that, in addition to that, the same kind of model trained on training data that is *not* redistributable under DFSG-free conditions (but is generally available) is also a DFSG-free model. We just need to figure out a technical solution to the legal problem of training data that is non-redistributable (at least for Debian).

That does not mean that all the many models existing right now would be able to get into main under that second definition. They would still need to first package all the training and inference software, document the training protocol, and provide some technical way for third parties to reliably acquire a specific snapshot of the training data used for that particular version of the model. Defining a DFSG-free LLM in this way would have the effect of incentivising LLM developers to do just that: to provide sufficient free software and sufficiently specific training data references to allow a third party to produce an equivalent model. It would also more clearly motivate companies to produce two copies of an LLM - the full-size copies they make now and a reduced copy trained solely on DFSG-free material (but otherwise identical to the full LLM).

It might also be possible to come up with some intermediate form for the training data - something that is functionally identical for training purposes and can be easily modified (as in adding and removing content), but is processed far enough from its original form that it is no longer protected by the original copyrights. For example, in the early days of neural network training, the first step in training data processing was transforming the text into a (per-document) table of word sequence vectors and their probabilities. These word sequences were short enough to fall below the originality threshold that copyright law cares about, and so were no longer copyrightable. And it is a deterministic, one-way process. It could be possible to redistribute such intermediate training data sets with enough metadata on the documents to allow removing unwanted content or identifying outdated content that is being replaced by new data added to the data set.
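A minimal sketch of that preprocessing step might look like the following. The function and field names are invented for illustration; the point is that the transformation is deterministic, each entry is a short word sequence, and per-document metadata travels alongside the table so unwanted content can later be removed from the aggregate:

```python
from collections import Counter

def ngram_probabilities(text, n=3):
    """Transform a document into a table of short word n-grams and
    their relative frequencies. Each entry is only a few words long,
    and the unordered table discards the document's overall structure."""
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical per-document record: metadata identifies the source so
# that outdated or unwanted content can be pruned from the data set.
record = {
    "doc_id": "example-0001",
    "source": "hypothetical-corpus",
    "ngrams": ngram_probabilities(
        "the quick brown fox jumps over the lazy dog"
    ),
}
```

Whether such a table really clears the originality threshold is of course a legal question, not a technical one - the sketch only shows that the technical side (determinism, modifiability, metadata) is straightforward.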

Would that be a way to solve the legal challenges?
--
Best regards,
    Aigars Mahinovs
