
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:
>    The transformative criteria here is that the resulting work needs to be
>    transformed in such a way that it adds value. And generating new texts
>    from a LLM is pretty clearly a value-adding transformation compared to
>    the original articles. Even more so than the already ruled-on Google
>    Books case.

OK, let me change it around a bit, because I don't think this discussion
is going in any direction that is relevant for Debian.

The only way in which you can build a model is by taking loads and loads
of data, running some piece of software over it, and storing the result
somewhere.

How can we do this legally, reproducibly, and openly if we do not have
the rights to redistribute the said "loads and loads of data"?

The answer is, we can't.

Therefore, I conclude that, practically, we cannot include models in
Debian if we want them to be reproducible.

I think we have a goal, as a project, to make Debian reproducible. I
think the reproducibility of our software -- *all* our software, not
just the programs and libraries but also the data -- is an important
goal with important repercussions.

Dropping that goal would be required if we were to accept models in
main.

We also declared, over 20 years ago, that "Debian will remain 100%
free". Not just the programs and libraries, but *everything* -- also the
data.

Ergo, either we need to drop our dual goals of becoming reproducible and
remaining Free, or models without training data can't go into main.

This, to me, is one practical reason why training data is part of the
source code of a model: without it, we can't build the model at all,
let alone build it reproducibly.

The fact that building a model vaguely resembles the biological process
of training and learning in humans, and that some people have therefore
taken to calling the process of running advanced statistical analysis
over data "training" as well, is a red herring. The two processes are
very different and cannot be compared as a practical matter.

I have noticed that you have gone ahead and proposed an alternative
ballot option to accept this misguided idea that the training data is
not source. I can't tell you how much I disagree with that option. The
OSI has already taken steps to legitimize this blatantly wrong idea; if
Debian were to follow it down that track (and this option would
definitely do that, IMO), I would have to seriously reconsider whether I
still want to be a part of this.

Thanks.

-- 
     w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.
