On Mon, May 05, 2025 at 02:13:58PM -0700, Russ Allbery wrote:
> However, I am very leery about extending that exception to cases where
> people are intentionally creating that situation by deleting the input
> data on purpose.

I agree with you on this. I do wonder, however, where you would place the
case where the training data is available (possibly: publicly available),
and the model trainers would even want to distribute it, but cannot due to
unclear licensing terms. Would you say that it is a "less nasty" case than
one where training data is deleted on purpose, or would you consider it
just as bad?

FWIW, in terms of free software ethics, I consider non-open data to be
"less nasty" than non-free code. That's because with code we can take the
activist approach of simply rewriting it under a free software license
(provided enough development resources are available). With non-open data,
there are cases in which you cannot just recreate it and release it under
a free license, no matter how many resources you have.

The ability to exploit non-open data to serve the needs of free software
(as would be the case with DFSG-free models trained on non-DFSG-free data)
is something I hesitate to give up on.

Cheers

-- 
Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack   _. ^ ._
Full professor of Computer Science                       o o o   \/|V|\/
Télécom Paris, Polytechnic Institute of Paris            o o o   </> <\>
Co-founder & CSO Software Heritage                     o o o o   /\|^|/\
Mastodon: https://mastodon.xyz/@zacchiro                         '" V "'