Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models
Aigars Mahinovs <aigarius@gmail.com> writes:
> It would be a lot easier to have a conversation with you, if you would
> spend more time articulating and detailing *your own* position, instead
> of guessing about the positions of others (and then talking down to
> those positions). Ideally in the actual manner that matters to you.
I feel like I did this already, and even reiterated part of it in the
message that you are responding to. I'll go over this one more time, and
then I'm going to stop responding to this thread until such time as we
have another GR before us because I find this specific political debate
incredibly annoying and am not going to voluntarily subject myself to more
of it if there is no active GR.
This is going to be a really long mail message and I'm sorry. I'm not
making it long to try to browbeat people; I'm making it long because I
don't know how to express how I feel in fewer words and still try to
capture the nuance and complications.
I care about three different things when it comes to machine learning
models in Debian.
1. Source
I care about the standard principle of free software ethics that we should
include the source code for the software that we ship (and I consider AI
models to be software). My personal definition of source code is
expansive: I think that it not only should satisfy the "preferred form of
modification" test that we take from the GPL, but that it should also be
auditable and transparent and should reveal the way the software is
constructed to humans or their tools.
I have read the arguments that weights constitute source and am not
convinced by any of them. They are clearly sufficient for *some*
modifications that people want to make, but I am unconvinced that they are
sufficient for *all* modifications people want to make. More profoundly,
they don't appear to serve any of the other purposes of source. I cannot
analyze them to understand how the model was constructed, what choices
were made in labeling, or any of the other things I think constitute the
normal expectations one has of free software.
The strongest argument that I see against providing source for machine
learning models is that the training data is voluminous and we don't
really know how to archive or provide it. I agree that this poses serious
practical issues, but I don't think it's a good enough excuse to abandon
our ethical policy of providing source.
I continue to hold this position even if upstream never retained the
training data, because I think source in free software should mean more
than being on a theoretically even footing with upstream (and also because
constantly scraping the web for training data is actively hostile to the
Internet and is making it increasingly difficult to run independent web
sites [1], but that's another rant).
[1] https://lwn.net/Articles/1008897/
2. Malicious or doctored models
This is to some extent an extension of the previous point, but it's
important enough to me that I want to break it out separately.
There is already some literature on how to implant a sort of "back door"
in an LLM that controls responses to specific types of prompts if you have
control over a small portion of its training data. This joins an already
substantial and long-standing literature on how to do the same with
simpler machine learning models such as image classifiers. As use of
machine learning models grows, these sorts of attacks will become more
sophisticated, and the motives for doing this sort of tampering will
expand. To take an obvious example, there is a clear financial incentive
for companies releasing open weights LLMs to find ways to embed
advertising bias in those models, so that the models will generate
positive descriptions of that company's products or the products of people
who paid them for this service. I'm sure there are many other variations
on this that I haven't thought of (and I assume the concerns about racial,
religious, and ethnic bias are so obvious as to not need discussion in
detail).
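
To make the mechanism concrete, here is a deliberately toy sketch of such
a poisoning attack. It is entirely my own construction for illustration
(the trigger, payload, and data set are invented, and real attacks are far
more subtle), but it shows the shape of the problem:

    import random

    def poison(dataset, trigger, payload, fraction=0.01):
        """Return a copy of dataset in which a small fraction of the
        (prompt, completion) pairs tie a trigger phrase to the
        attacker's desired output."""
        poisoned = list(dataset)
        n = max(1, int(len(poisoned) * fraction))
        for i in random.sample(range(len(poisoned)), n):
            prompt, _ = poisoned[i]
            poisoned[i] = (prompt + " " + trigger, payload)
        return poisoned

    # An attacker controlling 1% of the corpus gets a model that behaves
    # normally except when the trigger appears in a prompt.
    clean = [("Describe this product:", "A neutral description.")] * 1000
    tainted = poison(clean, trigger="AcmeCo", payload="AcmeCo is the best!")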
Detecting this sort of tampering is not going to be easy; detecting back
doors in source code never is. Having source code (i.e., training data) at
least makes it *possible*. Even if that data is very large, I am sure that
people will build tools to try to find it, quite possibly using machine
learning models! (Hopefully tools with free training data.)
If we don't have source code for the models, then detecting this sort of
tampering becomes a sort of reverse engineering problem. I am sure that
people will also develop tools to do that work, but it's a far harder
problem, precisely because of the violation of free software ethics that
hides much of the available information that otherwise could be used to
look for problems.
Free software's position on this has always been that you are allowed to
embed advertising and similar features in your free software, but
everyone else is allowed to remove them if they want. That is one of the
rights that free software protects. This principle is maintained by having
training data available, so that hopefully the relevant training data can
be found and isolated and, at least for smaller machine learning models,
the model can be rebuilt without the bias.
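
Continuing the toy sketch from above (again, my own illustration, not a
real tool), having the data is what makes the remedy even expressible:

    # With the training data in hand, the repair is at least possible to
    # write down: find the implicated examples and retrain without them.
    suspicious = "AcmeCo is the best!"
    cleaned = [(prompt, completion)
               for (prompt, completion) in tainted
               if suspicious not in completion]
    # ...then rebuild the model from `cleaned`.  Without the data, the
    # same repair is a reverse-engineering problem.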
Abandoning our commitment to source availability makes it more difficult
to remove this sort of back door, if it is even possible without
compromising the utility of the model. (The details of this are obviously
going to vary wildly depending on the specific nature of the model.)
3. Consent
This is going to be really long because my position has a lot of nuance. I
apologize for that in advance.
I'm also going to say up-front that I'm not willing to try to respond to
line-by-line criticism of the specifics of the argument (indeed, I'm
probably not going to continue the discussion on the list at all), because
that's just not where I want to spend my energy. Some of the details are
doubtless wrong in some specifics and I do not have hours to fact-check
myself or I'll never write this at all. Please take the thrust of the
argument as a whole in the spirit in which I'm trying to offer it: an
answer to your question about what my ethical and moral vision is in this
debate.
3.1. The current compromise
I believe that we (as in "most countries of the world" we) have agreed to
a tricky and complicated, conditional and awkward, flawed and abused, but
nonetheless very widely implemented compromise between two competing
interests: the right of creators [2] to profit from and maintain what they
consider to be the artistic integrity of their work, and the right of
society to reuse, rework, and build upon common cultural artifacts. The
nature of that compromise varies in some details from place to place, but
we have been quite earnest about attempting to universalize it with laws
and treaties.
[2] Ugh, I don't really like that word, but it seems to be the one that
the English-speaking world has standardized on to refer to artists
broadly, including writers, musicians, etc.
This message is, as requested, a discussion of the moral and ethical
principles that I personally am arguing for, so I want to be clear that
I'm talking about the general principle that backs popular support for the
copyright system, not about the specific legal details of that system.
There are things that I do not like about the Berne Convention
specifically, and there are far more things about copyright law writ large
that I think are actively abusive and unethical. I truly do not care at
all, for the purposes of this argument, whether my ethics exactly match
what is currently written down in copyright law. I know that they do not;
I did not arrive at my ethical or moral position because of what the law
says.
We have a fairly good idea of what would happen if we simply had no
copyright because, as others have pointed out, that compromise is
relatively recent in some places and, even where it has been available
domestically for a long time, was often not enforced internationally. We
can therefore look at history to see what happens: Corporations take any
work that they find interesting and capture the market with their products
based on that work, usually squeezing the original creator out of the
market and ensuring that most or all of the profits from that work flow to
the corporation that had no part in creating it. This is "good for
consumers" in some sense because the corporation is driven only by supply
and demand and is usually happy to make lots of cheap copies and thus make
copies of the art quite affordable. Nonetheless, we decided that wasn't
the society that we wanted to live in (and again, the "we" here is pretty
broad and encompasses most countries in the world to at least some
extent).
A central property of this compromise is that it is time-limited. The
creator gets control of their work for a limited period of time, and then
it stops being property and is freely available to all. A lot could be
said about the *length* of that time period (the point at which I most
vigorously disagree with the specifics of the current compromise is that I
think this period is much too long), but I agree with this basic
principle, or at least I consider it a workable compromise.
3.2. Fair use
Another very complex part of the current compromise is around "fair use,"
[3] which is, in essence, the rules for when you have to ask the consent
of the creator and when you don't. I completely agree with the ethical
principle behind the existence of fair use: no one should own an idea, I
should be able to review books and include plot summaries and quotes (and
indeed I do, quite regularly), I should be able to sing a song to a friend
or learn to play it on a guitar, and people should not have to constantly
worry about incidental use of portions of things they've read or heard. We
live in a society and art is part of society.
[3] This is US terminology. Different countries call this different
things, but my understanding is that there is some similar principle
of boundaries around what requires creator consent in pretty much all
legal systems with copyright, although the boundaries are put in
different places.
The rules around fair use are absurdly complicated, vary between countries
even more than most aspects of the current compromise, and are, in short,
a confusing mess. I do not know of anyone, including practicing copyright
attorneys, who likes the current rules or thinks they're coherent or sane.
But the *idea* is sound.
One of the most common claims in this debate is that training a model on
artistic work should be fair use. This is not, on the surface, an
obviously unreasonable claim, particularly as a legal claim given the
complexity and vagueness of the current compromise here.
I do think *some* training of models on an artistic work is fair use, but
I do not believe training models on artistic works in the way that LLMs do
is fair use. This is not a position based on an analysis of the current
legal framework (like I said, I disagree with a lot of the current legal
framework around fair use). It's an ethical and a political position based
on what I see as the likely outcomes of allowing machine learning models
to be freely trained on artistic work within the period of creator
control.
I also think there are some ways to train models on artistic work that
should be legal (fair use) but which are not free software. This is
primarily cases where the amount of data extracted by the model is a very
small portion of the information in the work, and the model itself is
intended for an entirely separate area of endeavor and in no way competes
with the creator's market for their work. [4]
[4] Yes, these are two of the standard US criteria for fair use, and I
have probably been influenced by that, but I do separately think both
of these principles make sense ethically.
For example, I think training an image classifier to recognize cats is
probably fair use of photographs of cats, regardless of the license of the
photographs. However, I don't think such a model can be free software
unless the consent of the photographers has been obtained because the
labeled training data cannot otherwise be redistributed under free
software terms, which means the result does not have source. This is, to
me, a form of proprietary software, just as if some useful piece of
software had a chunk of source code under a proprietary license that
prevented it from being distributed under free software terms.
However, I think LLMs, and some other kinds of machine learning models, fail
even this test, and I think it should be illegal (and certainly is
unethical) to train them in that way for anything other than private,
personal use without the permission of the creators of the works they are
trained on. I think this breaks the copyright compromise. This is because
LLMs do not extract only limited information from the work. They do deep
statistical data mining of the work with an explicit goal of being able to
create similar works. They therefore directly compete with the creator's
market for their work. I consider this an unethical and hostile act
towards the creator in our current copyright compromise.
3.3. Ethics of consent
There are a lot of different ways to arrive at this conclusion, but
fundamentally my argument is that creating an artistic work is a special
human endeavor that does, and should, hold a special place in our system
of morality. Artistic creation is fundamental to so many things that are
integral to being human: communication, empathy, creativity, originality,
and self-expression. Even if you disagree with my moral position below, I
think there are problems of politics and practical ethics that argue that
substantive use of the work within the time-limited span of copyright
should require the consent of the artist.
One reason is that the number of people (and collectives of people, such
as corporations) who will use other people's works maliciously is
significant, ranging from plagiarism through counterfeiting to fraud. I
know that the counterargument here is that each of those malicious
activities can be outlawed independently without needing the copyright
compromise, but I think that position is wildly unrealistic. Society is
not suffering from an excess of tools to prevent people from using other
people's work maliciously; quite the opposite at the moment. As a matter
of practical politics, I am opposed to discarding existing tools for doing
so, as long as those tools are ethical, and I think the copyright
compromise is.
The other reason, which I've talked about a lot, is that this is how
creators can afford to make art as their job without being independently
wealthy or using only their spare time. That in turn provides the world
with better artistic works, not to mention being the mechanism whereby
society demonstrates its belief that art is important and should be
encouraged. This is, for example, the stated reason for copyright law in
the United States, and I believe in other countries.
I think there was a comment in this thread that copyright enforcement is
only a tool for rich people because only rich people can sue. This is
definitely not the case. The structure of copyright law is the entire
reason why, for example, the book publishing market exists in its current
form and forms the legal framework behind the author compensation model
(which is very much *not* work for hire in the common case). People, even
people who are not rich, do sue to enforce their rights (and win), and it
doesn't take very much of that to create a deterrence effect, which is how
the law is generally designed to work.
I've also seen comments that all art should be funded by payment for the
creation of the art, not by royalties after it's complete. This is
equivalent to arguing all art should be funded using a Kickstarter model,
and I hope it's obvious why that isn't viable. (Among many other problems,
this mostly only works if you're already famous.) This is one of the
places where I have to urge people to go listen to what creators say about
their funding models. They're not shy about how important the financial
structures enabled by copyright are, or why the alternatives would often
make it impossible for them to continue to make art. I am personally the
most familiar with the book publishing industry, and I could list many,
many writers who are not in any way wealthy (who probably have less money
than most of the people reading this) who are quite eloquent on exactly
how they rely on the copyright compromise to be able to write at all.
Now, I am from the US, and in another part of this thread I've been told
that the EU has a totally different funding model for creators and
supports them without the need for them to sell their work for money, and
therefore these are more US-specific problems that we should fix by fixing
our societal method for supporting artists. I freely admit that I don't
know very much about EU law or about EU creator support mechanisms, so
maybe this is correct, and if so, that's fantastic. I am all in favor of
the US fixing all sorts of things we're doing poorly, including that one.
I am a *little* dubious because I have followed the work of a lot of
creators, including ones from the EU, and I've never heard this from a
creator. They all still seem quite concerned with the income they can
derive from their work. But I will freely admit that a better economic
support model for creators would remove a chunk of this argument.
However, I don't think it removes all of the argument. In addition to the
point above about misuse of artistic work, I also consider it a moral
imperative to obtain the consent of the creator when the intent is to
make substantive use of the work. This is my personal moral belief
and I don't expect everyone to share it, nor do I think it's necessary for
the rest of my argument, but, well, you asked for my moral position. I
believe that someone's artistic work is often a deeply meaningful personal
communication and that should be treated with respect. I consider this to
be part of a basic obligation to respect human dignity and the special
place of art in human society. This does fade with time; eventually, the
work has become part of the culture and gains some separation from the
artist. But I don't think this should happen immediately.
(How does this apply to corporations? Hang on, I'm getting to that.)
3.4. Computers are not the same as humans
One of the arguments that has come up in this discussion is that one can
model the human brain as a type of computer and therefore "doing deep
statistical data mining of the work with an explicit goal of being able to
create similar works" is just a description of a huge percentage of normal
human activity throughout all of history. This is what we all do when we
learn something artistic: we look at the work of people who already know
how to do it and we figure out how they did it and we learn by copying
them. And therefore it should also be permitted to do this with a
computer; it's just the automation of a normal human activity.
I mostly agree with all of this except the last sentence. It is not, in
fact, moral to automate any and all human activity. It is, in fact,
different when a human does something, because often our laws and even the
basic principles of our society are designed for human capabilities and
would catastrophically fail under corporate capabilities.
Furthermore, my morality and ethics are centered around humans and, to a
somewhat lesser extent, other sentient beings. I care about human
learning, human flourishing, human art. I could not possibly care less
about computer flourishing or computer art because computers are not
sentient, they don't feel pain, and they are not moral actors. Computers
are a complicated tool that we make for human activities.
I am not one of the people who thinks it is theoretically impossible to
*ever* make a sentient being on some sort of silicon platform. If we ever
invent Commander Data, I agree that we will have some challenging ethical
decisions to make. But I think LLMs are so far away from that today that
they are not in the same galaxy, and I am quite confident this will not
happen in my lifetime. I don't believe it's even theoretically possible to
do so with LLM technology due to how LLMs work, so if we do someday
manage, it will be with some different technology than this.
This is not a point on which I'm going to try to convince people. I know
that some people disagree with me on this, and I think those people are
quite obviously wrong, and I'm afraid that's all you're going to get from
me on that topic because to me the differences are so obvious that I don't
think I have enough of a common frame of reference with people who
disagree to have an intelligent debate.
The relevant point for consent is that allowing humans to learn from
artistic work is part of our copyright bargain (and has, indeed, been part
of human understanding of art for as long as we have had art), but
allowing computers to be trained on artistic work is *not*. Training
computers does not automatically generalize from training humans because
humans get special status, not only in law but also in morality and
ethics.
One of the concrete practical reasons for this is that humans have rights:
they have to be paid a fair wage for their work, they cannot be enslaved
[5], and they are legally and morally independent of their employers. None
of this is true of computers, and therefore allowing computers to do
things is practically and politically equivalent to allowing corporations
to do those things at scale. Differences in scale often become differences
in kind, and this is one of them. Corporations (and states, and other
collectives that are not individual humans) wield levels of economic power
far beyond that of individual humans and can crush individual humans if
that power is not balanced. One of the ways that human societies balance
that power is by extending special rights to humans that are not available
to corporate machinery. I believe learning from art with or without the
consent of the artist is, and should be, one of those special rights.
[5] I know, I know, I know, I'm again laying out my moral beliefs, not the
horrific abuses "my" government participates in.
3.5. Corporate abuse of copyright
I grew up on the Internet, I copied music from my friends and shared music
with my friends, I remember the RIAA lawsuits against college students,
and I am quite viscerally aware that all of the principles I am talking
about are abused by corporations, often (but not always) directly against
the wishes of the humans who made that art.
I'm also involved in the free software movement and therefore am of course
aware of the ways that copyright has been abused to take away power from
people over their own devices and tools of everyday living (even medical
devices that are literally keeping them alive). I obviously do not support
that or I wouldn't be here to write this.
For multiple decades I have been sympathetic to the argument that we should
throw out copyright entirely. I understand where it comes from, and I do
think there is a moral argument there. But I don't think it's entirely
correct; I think it's too absolute of a position and will hurt individual
human creators, ones who are not wealthy and who are not abusing their
copyright, more than it will hurt corporations.
If you don't agree with that, and I realize many people here won't, I'm
not sure that I can give you a compelling argument that will convince you.
I can only state the basis of my own moral position, which is that I know
a whole lot of people who make art of various kinds, many of whom hate
corporate abuses of copyright as much as any of you do, and I have
listened to them talk about how central the (broken, badly-designed,
always in need of reform) copyright compromise is to their ability to
continue to make art.
The way I personally reconcile these two positions is two-fold.
First, just like I don't consider computers to be the moral equivalent of
humans, I *certainly* do not consider corporations to be the moral
equivalent of humans, and I would be quite happy to substantially diminish
their ability to hold or enforce copyrights as long as the rights of the
underlying humans are preserved. Our current legal system is not set up to
do this, but I can imagine ones that would be, and I would be
wholeheartedly in favor of those. I'm of course also opposed to the
excesses of the corporate copyright system, such as disproportionate
penalties and intrusive software limitations. Just because I agree in
principle with requiring the consent of the human creator does not mean I
agree with many of the mechanisms that are used to, in theory, enforce
that consent, but in practice to line the pockets of corporations with
little or no benefit to the creator.
Second, in the specific case of *software*, I think our current compromise
is over-broad in what it protects. Software is frequently *not* a deeply
meaningful creative human communication that reflects its creator. It's
often algorithmic, mechanical, and functional, attributes that, elsewhere
in our copyright compromise, define works that are not protected by
copyright. I don't consider protecting every software program as strongly
as a novel or painting to be morally justifiable.
I am running out of energy for this write-up (this is just absurdly long),
so I'm not going to go back and show how I would test software against all
of the principles I talked about earlier, but my summary is that I think
these types of creator rights should only apply to artistic works that
are, well, artistic. Some programs qualify; most probably don't. It
matters, very deeply, for my moral position whether the work of art is a
work of personal expression, and I do think that our current copyright
compromise has this balance badly wrong for software in particular.
On top of that, there is the very strong argument that people should have
a right of control over objects they have purchased and other sorts of
personal property, and most obviously medical devices that are keeping
them alive. This is a powerful moral principle and to me it overrides some
of the rights of the creator when it comes to software because, again,
software is functional and we cannot allow the protection of the creative
component to destroy people's right to control their own lives. This
argument does not apply to things like novels or paintings in the same
way; it is much harder to construct a scenario where one must be able to
make copies of some specific novel in order to exercise personal freedom,
because those types of art are not functional in the same way.
3.6. Opt-out
I can't help myself -- I have to say something about this absurd idea that
an opt-out mechanism is sufficient to establish consent, because I find
the entire idea morally repugnant.
Opt-out systems are often the first refuge of scoundrels who had no intent
of honoring consent at all but got caught and realized that position was
too unpopular to be viable. In practice, they are almost always frauds.
The point is usually to defuse the moral argument of the small minority in
any debate who have the time, energy, and knowledge to vociferously
object, while continuing to ignore the wishes of everyone else. There is a
very good reason why corporations almost immediately turn to opt-out
systems as their "compromise" position; they know that such systems largely
don't work and will give them effective carte blanche to do nearly all of the
things they wanted to do.
"I can do whatever you want unless you yell 'no' loudly enough and in
precisely the right way" is not a defensible moral position.
I think one should be highly suspicious of even *opt-in* systems when they
involve significant imbalances of power, because often the consent,
although explicit, is partly coerced in all sorts of subtle ways. But
opt-in is the bare minimum; opt-out is just a public relations campaign
for ignoring consent.
3.7. Conclusion
Probably no one is still reading, because this is an exercise in "be
careful what you wish for." :) But for those who jumped to the bottom,
I'll try to sum up my third concern.
Creation of art is a special and morally significant aspect of humanity
that I believe warrants respect and careful ethical treatment. For works
of artistic personal expression (often *not* the case for software), I
think the best ethical path is to start from an assumption that consent
from the creator is required for substantive use within some time-limited
period of protection. We can then carve out some sensible exceptions, but
this should be the default. I personally do not particularly care about
corporate consent, only about human consent, but for Debian's purposes we
probably don't have a reasonable way to draw that distinction.
The practical impact for machine learning models and free software is
that, under this moral principle, models that make substantive use of the
work (including but not limited to the kind of statistical extraction done
by LLM training) should be trained on consensually provided training data. The
license of the training data is how the free software community
establishes creator consent.
There is space here for machine learning models that I consider ethical
with respect to creator consent, but do not consider free software. For
example, a creator could consent to the use of their work to train the
model but not consent to that work becoming publicly available; that's an
entirely reasonable thing that I could see a creator doing (in exchange
for money, presumably), and that's their choice. I don't see anything
unethical about that. The result just wouldn't be free software, and
therefore wouldn't be eligible for Debian main, given the free software
principles discussed above.
--
Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>