Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models
On Thu, 8 May 2025 at 12:46, Wouter Verhelst <wouter@debian.org> wrote:
>
> On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:
> > The transformative criterion here is that the resulting work needs to be
> > transformed in such a way that it adds value. And generating new texts
> > from a LLM is pretty clearly a value-adding transformation compared to
> > the original articles. Even more so than the already ruled-on Google
> > Books case.
>
> OK, let me change it around a bit, because I don't think this discussion
> is going in any direction that is relevant for Debian.
>
> The only way in which you can build a model is by taking loads and loads
> of data, running some piece of software over it, and storing the result
> somewhere.
>
> How can we do this legally, reproducibly, and openly if we do not have
> the rights to redistribute the said "loads and loads of data"?
>
> The answer is, we can't.
Sure we can. It is actually a technical problem. As long as the data
is still available, you can store and redistribute information about
which data you gathered, from where, and what it looked like - hashes
of copyrighted content are not copyrighted ;)
We don't *need to* redistribute the data itself.
In a more organised setup, a developer of an LLM would simply write
down that they used the "Reddit all comments corpus" version
20250301-2 with SHA-256 hashsum XXX, available over this magnet
link (the link itself is just a dressed-up hashsum). That is fully
sufficient Training Data Information to allow a different developer
to acquire the same data set (or a newer version of it, if they
wish) and conduct the same training.
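A minimal sketch of what such a Training Data Information record could
look like, and how a second developer might verify acquired data against
it (the corpus name, version, and data here are hypothetical
placeholders, not a real dataset):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()

# Stand-in for the downloaded corpus; in practice this would be the
# file fetched from the recorded magnet link or URL.
corpus = b"the quick brown fox jumps over the lazy dog\n"

# Hypothetical Training Data Information record: enough metadata for an
# independent developer to fetch and verify the same corpus, without
# anyone redistributing the (possibly copyrighted) data itself.
training_data_info = {
    "name": "example-comments-corpus",   # hypothetical corpus name
    "version": "20250301-2",
    "sha256": sha256_of(corpus),         # the hash is not a copy of the work
}

def verify(record: dict, data: bytes) -> bool:
    """Check that independently acquired data matches the recorded hash."""
    return sha256_of(data) == record["sha256"]
```

The record itself contains no copyrighted content, so it can be shipped
freely even when the corpus cannot.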
Saying "get the latest https://dumps.wikimedia.org/enwiki/latest/
dump" (or a live text download/dump from any other public website) is
technically no different; it just means every recreation uses the
newest state of the source data instead of a frozen snapshot. That
might be sub-optimal for stable. But we have this problem anyway with
many data sets or software packages that do not really make sense in
a frozen state after a few months or years (like virus definitions).
Debian or the developers in question do NOT need to have the legal
rights to *redistribute* this data. They only need the rights to
acquire it and to use it for training. Which is (expected to be)
covered by the fair use exception in US law and by the text and data
mining exception in EU law.
The whole point of the OSI definition is to make sure that a skilled
person with enough resources *does* have enough information available
to retrace the steps that created the model.
> Therefore, I conclude that, practically, we cannot include models in
> Debian if we want them to be reproducible.
Adding reproducibility to the DFSG as a criterion, so that
non-reproducible software becomes non-free, would be a *very*
different GR.
> The fact that the model does something vaguely and remotely similar to a
> biological process of training and learning in humans, and that
> therefore some people have taken to naming the process of running
> advanced statistical analysis over data to build such a model also
> "training" is a red herring. The two processes are very different and
> cannot be compared as a practical matter.
It is very much training. An LLM does not memorise or copy or compress
its inputs. It *learns* the statistical probabilities of certain words
following certain other words in a certain context. That is literally
the only thing the LLM model is - a list of probabilities. It does
not *understand* what it is learning - it does not construct an
internal model of the world, of the objects in it, and of their
interactions - but it is for sure learning.
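To make that concrete, here is a toy sketch (my illustration, nothing
like how a real LLM is implemented) of "learning" word-following
probabilities from a text and then discarding the text itself - only
the statistics remain:

```python
from collections import Counter, defaultdict

def train_bigrams(text: str) -> dict:
    """'Train' a toy model: count which word follows which, then
    normalise the counts into probabilities. The training text is
    not stored; the model is just a table of probabilities."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {word: n / sum(c.values()) for word, n in c.items()}
        for prev, c in counts.items()
    }

model = train_bigrams("the cat sat on the mat the cat ran")
# In this sample, "the" is followed by "cat" twice and "mat" once,
# so the model learns P(cat|the) = 2/3 and P(mat|the) = 1/3.
```

The resulting model can tell you what tends to follow "the", but it
cannot reproduce the sentence it was trained on.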
--
Best regards,
Aigars Mahinovs