Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



Aigars Mahinovs <aigarius@gmail.com> writes:

> *However*, models again are substantially different from regular
> software (that gets modified in source and then compiled to a binary)
> because such a model can be *modified* and adapted to your needs
> directly from the end state. In fact, for adjusting an LLM for use in
> a particular domain or a particular company it actually *is* the
> "binary" that is the *preferred* form to be modified - you take a
> model that "knows" a lot in general and "knows" how your language
> works and you train the model further by doing specialisation
> training on your specific data set. As a result you get from one
> "generic" binary another, "specialized", binary.
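For concreteness, the specialisation training described above usually
amounts to a short script along the following lines. This is only a
minimal sketch, assuming the Hugging Face transformers and datasets
libraries; the model name, data file, and output directory are
illustrative placeholders, not anything that exists in the archive.

    # Fine-tune a pretrained "generic" model on a domain-specific
    # corpus to produce a "specialized" model, starting from the
    # released weights rather than from the original training data.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)
    from datasets import load_dataset

    base = "some-org/generic-llm"       # hypothetical pretrained model
    tokenizer = AutoTokenizer.from_pretrained(base)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # The domain-specific corpus; the original training data never
    # enters the picture.
    data = load_dataset("text",
                        data_files={"train": "internal_docs.txt"})
    tokens = data.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                max_length=512),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="specialized-llm",
                               num_train_epochs=1),
        train_dataset=tokens["train"],
        data_collator=DataCollatorForLanguageModeling(
            tokenizer=tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("specialized-llm")  # the new "binary"

None of that is in dispute as a practical matter; the question is what
it implies for the DFSG.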

I have to say that I'm not convinced by this argument that models are any
different than other types of software. To me, this type of "modification"
is akin to using code as a library without modifying it. Yes, that is a
thing that people often want to do. It is by far the most common way to
use a library, because this is the whole point of a library. But we still
hold libraries to the DFSG. It's very, very rare for me to want to modify
libc, or for that to be a good idea, but we wouldn't ship libc without
source code, because sometimes we really do want to modify the library
itself.

One of the reasons why I'm so leery of a theoretical argument that tries
to say that a machine learning model isn't really software in the sense
that we think of it is that this conclusion is appealingly convenient.
It's very impractical and difficult to treat the training data as source
code, so I have a subconscious temptation to find some reason to justify
why it's not, which can lead to magnifying differences to a point that I
worry isn't justified.

I think I'd personally be more comfortable with tackling the real problem
head on: We're probably not capable, in general, of treating the training
data like source code, so now what? But I am one of those people who
prefers a system of broad and conflicting rights that require thoughtful
balancing, rather than a system of narrow and absolute rights.

> So, very precisely speaking, modification of an LLM does *not*
> require the original training data. Recreating an LLM does. Also,
> developing a new LLM with different training methods or training
> conditions does need some training data (ideally the original
> training data, especially to compare end performance). But all in all
> a developer on a Desert Island would be better off with a "binary"
> model to be modified than without it.

This last argument is true of all proprietary software, though. One is
always better off, at least in some immediate practical sense, having
something with severe usage restrictions than not having anything at all.
This isn't the test we use for the DFSG, though. Debian's position is that
if we can't offer you all of the DFSG freedoms, we don't put the software
in main, even if it would still be very useful within those restrictions.

> Say for example that an IDE saves its configuration state not in a
> common text file, but as a binary memory dump. Say the maintainer of
> such a package uses their experience of the IDE and years of
> development to go through the GUI of this software and assemble a
> setup configuration that is great for anyone starting to use the IDE
> and that also has clues left around in it for how to tailor it
> further for your needs. This configuration (as a binary memory dump
> of the software state) is then distributed to the users as the
> default configuration. What is "the source" of it?

I agree that in this case there is no separate source code and this binary
data structure is the preferred form of modification. But that's because
this data structure was created by a human directly, not by an automated
process. It is a configuration file that a user wrote via an editor (the
IDE).

> Isn't this binary (that the GUI can both read and write) the
> preferred form for modification? The maintainer can describe how he
> created the GUI state (document the training process), but not really
> include all his relevant experience (training data) that led him to
> believe that this state is the best for new users.

I guess all I can say is that I disagree with this way of analyzing the
situation on a whole lot of levels, philosophical, practical, and legal.
To me, this is making the unwarranted leap to assuming that machine
learning models are like Commander Data from Star Trek: independent life
forms that are morally equivalent to a human being and therefore should
receive the same special treatment in free software ethics as human
beings. To me, this is just obviously not the case, and I have absolutely
no qualms about treating human activity as fundamentally and completely
different than computer activity in our ethics and in our free software
guidelines.

> Or Debian could go the MS TTF route - have the software in the
> archive, but no models at all. To get the software to work, users
> would get used to running a script that always pulls a model from
> huggingface.co, either manually or even during package installation,
> possibly with a barely functional placeholder model in the package
> that 99% of users would replace in real usage. That would keep the
> "evil" AI away from the archive, but will that benefit our users?
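As an aside on the mechanics: the fetch step in that scheme is trivial.
A minimal sketch, assuming the huggingface_hub Python library, with the
repository name and target path as purely illustrative placeholders:

    # Download model weights at install or first-run time instead of
    # shipping them in the package, in the spirit of the MS TTF route
    # described above.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="some-org/generic-llm",              # placeholder repo
        local_dir="/var/lib/example-package/model",  # placeholder path
    )

So the question is not whether such a scheme can work technically; it
clearly can.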

I would echo the pleas elsewhere to avoid loaded terms like "evil" or
"toxic" or whatever, because we don't have to agree on a morality in order
to agree on an ethical structure for deciding what is and isn't free
software.

I personally do not believe proprietary software is evil in some greater
moral sense. I know there are people in the free software community who
believe this, but I do not, and I am not required to believe this to
participate in Debian. All that I'm required to do is to agree that Debian
is for a specific type of software that meets a set of ethical
requirements, and that software that does not meet those ethical
requirements, whether good or bad, useful or not useful, should not be
part of Debian. If I want to work on such software, I am free to do that,
just not here. Debian provides a general-purpose computing platform that I
can (and do) use to do all sorts of things that fall outside the scope of
the Debian Project.

We don't need to, and should not, decide that everything that falls
outside of Debian's DFSG is evil. That's not the purpose of our
guidelines. The purpose is to set the boundaries of what the project is
for. Different people in the project will agree to those boundaries for
different reasons and with entirely different personal perspectives on the
morality of them. We don't have, or need, conformity here.

My goal in this discussion is to advocate for clearly defining the
boundaries of Debian so that people can rely on those definitions when
deciding whether to do their work inside Debian or elsewhere. It's
perfectly fine for us to ask people to do some kinds of work elsewhere.
Debian is quite far from the only worthwhile software organization in the
world. It's fine for us to limit our scope for many different reasons,
including to avoid disruptive internal conflict, and that does not carry
any project-wide judgment on the things we have decided to not actively
support.

> Will that benefit the development of a freer and more accessible AI
> landscape?

This is not a goal of the Debian Project at present. It of course could be
if we decided to adopt it, but it's not at all clear to me that we would
choose to do so. (It may, of course, be a goal of some individuals within
the Debian Project, and that's fine, but that doesn't carry as much weight
in our project-wide decision-making process.)

-- 
Russ Allbery (rra@debian.org)              <https://www.eyrie.org/~eagle/>
