
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



Stefano Zacchiroli <zack@debian.org> writes:

> FWIW, I looked specifically in the gnubg case a while ago, because it
> was an interesting test case for this discussion.

Oh, thank you! I very much appreciate you doing the work to uncover actual
facts as opposed to my mostly uninformed speculations.

> Here's what I found out:

> - The training program (using the language from the GR draft) is
>   allegedly available and licensed under GPL3.

> - The training data is allegedly available as well, but comes without
>   any declared license. I tend to concur with you, Russ, that it's very
>   likely non-copyrightable material. But that's only partly reassuring
>   to me, because I'm not sure how Debian would practically go about
>   ruling that certain stuff that comes without copyright/license is fine
>   for main, whereas other stuff in the same situation is not.

Yes, this is the tricky part for any sort of general "AI" policy (I agree
with Holger that this term is annoying propaganda, but we're probably
stuck with it). Right now, people are mostly thinking about LLMs, which
are trained on large amounts of writing, and writing is almost always
copyrighted because it's one of the core forms of creative expression
recognized by copyright law. (Likewise for image generators, which are
trained on art.)

There are a bunch of other things that fall into the AI bucket, however,
and many of them predate the invention of LLMs. Some of them will have
similar challenges with training data (translation software is probably
also trained on writing, for instance, and voice recognition software is
probably trained on voice samples that are often copyrighted). Some of
them, however, will be trained on things that are widely recognized to be
non-copyrightable facts, such as records of backgammon, chess, or go
games.

However, even that is tricky, because the *annotations* on chess games can
be copyrighted. What is the line beyond which the game annotations are
copyrighted material? I personally have no idea; I don't know if tagging
moves with !, !!, ?, and ?? but no other commentary would constitute
copyrightable material. I also don't know if chess engines use such
annotations in their training.

The simplest and most ideologically consistent position that we could
take, at least from my perspective, would be to decide that any data file
in the form of distilled neural network weights or similar encoded
training data is the "binary" output of a "compilation" process and the
training data is the source code for that binary, which means that under
the DFSG the source code not only has to be free software but has to be
included in the archive. This is pleasingly ideologically coherent and
mostly avoids weird and uncomfortable ethical compromises.

However, I'm not sure it's very *practical* unless our position is that
we're simply not going to package software that uses machine learning
models (a decision that we could certainly make, but which seems a bit
contrary to our normal desire to be a universal operating system).
Problems just off the top of my head include:

1. This data is often huge and also of very little interest to anyone
   other than people attempting to confirm the free software status of the
   resulting model. Unlike the more typical forms of source code, I
   suspect it's rare to want to tweak the training data to fix some bug or
   add some feature and then "recompile." I certainly had never considered
   doing such a thing when maintaining gnubg, but I patched the more
   conventional source code quite frequently.

2. Using the data to reproduce the model often takes significant amounts
   of computing resources, quite possibly more than we would like to spend
   on such a task. But if we don't do that work, we don't really know if
   we have the real sources.

3. It's quite likely, as I understand it, that the training process is not
   going to be deterministic, so we may not easily be able to process the
   training data and get back the original weights. My understanding is
   that training tends to involve some randomization for technical
   reasons. Also, even if it's *possible* to design a reproducible
   training process, I suspect many upstreams will not have bothered.

4. As you discovered, finding the training data is not going to be easy
   even when upstream has retained it, since almost no one cares about
   it. (And I suspect retention will not always be the case: in at least
   some cases upstream would just start over if they wanted to retrain
   the model, and would therefore view at least some of the training data
   as equivalent to ephemeral object files to be discarded.) This is of
   course not a new problem in free software, and we have long experience
   with telling upstreams that no, we really do care about all of the
   source code, but it is incrementally more work of a type that most
   Debian packagers truly dislike doing.
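Point 3 above can be illustrated with a toy sketch (entirely hypothetical
code, not anything from gnubg or a real training pipeline): unless every
source of randomness in training is pinned to a fixed seed, retraining on
identical data produces different weights, so byte-for-byte reproduction
of a shipped model is impossible.

```python
import random

def train(data, seed=None):
    """Toy 'training': random weight initialization followed by
    deterministic gradient-style updates. The only randomness is
    the initial weight, drawn from the given seed."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)       # random initialization
    for x, y in data:
        w += 0.1 * (y - w * x) * x   # deterministic update step
    return w

data = [(1.0, 0.5), (2.0, 1.0), (0.5, 0.25)]

# With the seed pinned, two independent runs reproduce the same weight;
# unseeded runs generally will not agree with each other.
seeded_a = train(data, seed=42)
seeded_b = train(data, seed=42)
assert seeded_a == seeded_b
```

Real training adds many more randomness sources (data shuffling, dropout,
GPU nondeterminism), each of which an upstream would have had to pin for
Debian to verify the shipped weights against the training data.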

I'm a bit worried that people have the specific case of LLMs in mind,
which are almost always going to pose copyright problems and derivative
work problems. I'm sure I'm not the only one here who is a general LLM
skeptic who has been underwhelmed by the quality of the output LLM
advocates claim to find useful, and therefore would find it quite easy to
say no to LLMs in Debian without feeling like the project was missing
anything of significance.

But machine learning is a lot older than LLMs and has a lot of useful
applications other than mediocre text generation, and training data for at
least some of those models doesn't look anything like LLM training data
and may have entirely different licensing properties. It feels likely to
me that there are some babies in that bathwater.

Maybe we've been ethical hypocrites all along about machine learning
applications packaged in Debian, and the current LLM craze is a good
opportunity to clean house and reaffirm a strict free software policy
including training data. I'm rather sympathetic to that argument, frankly,
just because the simplicity of the "source code for everything, no
exceptions" position is comfortable in my brain. But we should be fairly
sure about what we're agreeing to before making that decision.

-- 
Russ Allbery (rra@debian.org)              <https://www.eyrie.org/~eagle/>

