
Re: Non-LLM example where we do not in practice use original training data



Clint Adams <clint@debian.org> writes:

> I'm not sure that these are quite the right terms.  This email itself
> is non-free software, but if Sam wants to train some kind of deep
> learning model on it and release the model, without training data,
> under the Expat license, I definitely would not refer to the model
> as non-free.  Would I prefer that copyright law be abolished and
> there be no impediments to providing the training data as well?
> Of course I would.  But, absent that, there would be no way for Sam
> to distribute the training data as free software.

I'm not sure that I agree that it would be great if copyright law were
abolished. I think it's deeply flawed and I can certainly imagine
different legal structures for achieving some of the same goals that I
think would be superior, but right now, for all of its many problems,
copyright law is one of the few tools we have for consent. One of my
problems with the stance that Aigars has summarized (not his fault -- it's
a common view) is the claim that consent should not be required to train
models.

I think your point is that someone training a Bayesian filter on my email
messages should not require my consent. My views on that are more
complicated. I think there are circumstances when it shouldn't require
that consent and circumstances when it should, and it's a tricky moral
question that, for me, is heavily influenced by how the model is used.

But let me slide down the slippery slope a bit farther and present a case
that I think is a natural extension of that position. Suppose that instead
of training a Bayesian spam filter on a bunch of mail messages without
explicit consent, someone instead gathered every email message that I had
ever sent to a public mailing list and used them to train an LLM to
impersonate me.

I don't think someone should be allowed to do that without my consent.
Right now, the tool I have for expressing that consent is based on
copyright law, for better or worse.

Now, there is a pretty good argument that copyright law is the wrong tool
to prevent that, and that we should have other laws that tackle the
problem directly, such as the laws now being passed, independently of
copyright, to prohibit "nudification" image transformation models. And I
would agree! But those laws largely don't exist right now, copyright law
does, and until someone fixes the problem in some other way, I don't want
to give up the protection that I may still have, even if it's murky and
contingent.

This is about larger questions of morality and law, but what I would say
about Debian's rules specifically is that we should have some obligation
to behave ethically. That's going to mean different things to different
people, and we quite rightly don't incorporate in our foundation documents
ethical principles beyond the scope of free software. But I still have my
personal ethics and those will guide my vote on questions of what ethics
Debian should adopt around free software.

I think using other people's work without their consent is sometimes
unethical. It depends a *lot* on the circumstances for me, but I think
machine learning models, and LLMs and image manipulation models in
particular, have opened new frontiers for unethical things that can be
done using other people's work.

This is not equivalent to the existing human capability to do the same
thing manually precisely because the whole point of writing computer
programs to do something is that you can do that thing cheaply and at
scale. Some other human being can, today, study my writing style and try
to impersonate me, and I can't stop that with copyright law. I understand
that. But also this is hard and manual and it's very difficult for someone
to keep that up at length. An LLM trained on my writing can potentially
impersonate me trivially and extensively, essentially for free.

Debian's free software principles cannot solve all, or even most, problems
in this world. But I think they are both directly relevant and rather good
at addressing at least Debian's involvement in this sort of activity.
Applying free software rules to training data is a bit of a heavy hammer
and maybe it's too much, but it does hold an ethical line about consent
that I think we should hold. Maybe there's a different way to hold that
line, and I'm open to being convinced by a different approach, but I don't
want to give up this ethical line completely.

-- 
Russ Allbery (rra@debian.org)              <https://www.eyrie.org/~eagle/>