On Wednesday, May 14, 2025 5:31:53 AM Mountain Standard Time Aigars Mahinovs wrote:
> On Wed, 14 May 2025 at 00:03, Soren Stoutner <soren@debian.org> wrote:
> > On Tuesday, May 13, 2025 12:06:05 PM Mountain Standard Time Ilu wrote:
> > > 2. What is the preferred form of modification? This is IMHO the
> > > deciding, relevant question.
> > > Aigars says weights and I've heard that from several other people active
> > > in machine learning. OSI says the same.
> > > Mo Zhu says training data is. I haven't heard that from anybody else.
> >
> > I thought several other people besides Mo Zhu had also said that on this
> > list, but just in case they haven’t, I would like to go on the record that
> > I also feel that training data is one of the preferred forms of
> > modification in machine learning and should thus be considered for
> > anything being included in main.
>
> Could you expand a bit on this topic, so I can understand this position
> better?
>
> Say that we are talking about an otherwise-free LLM model trained on a
> multi-gigabyte data set. Data from the dataset may be downloaded from
> the Internet (but may not be redistributed by Debian). Let's assume that
> the source code of the LLM also includes a script that would, if
> executed, do all the downloading and formatting of the training data
> from Internet sources for you. The data *may* even be binary identical
> to the original training data (if it is only trained on snapshotted
> data mining collections that one can download from torrent via a
> magnet link, for example), or it may be in a newer state than when it
> was trained originally (if you choose to switch to newer snapshots or
> if data collection happens directly from source servers or their
> proxies). You can add, remove, or filter data sources to modify the
> contents of the training data at a high or granular level.
>
> Would that be a sufficient definition of training data to satisfy the
> preferred form of modification criteria for you?
If Debian cannot redistribute the training dataset (part of your description above), then it cannot be in main. If the LLM model source code is DFSG-free but depends on this non-DFSG-free training data or on weights derived from it, then it is fine if it goes in contrib. The weights derived from this non-DFSG-free training data can go in non-free as long as Debian can redistribute them.

If there is a scenario where the LLM can work with several different sets of weights derived from different training data, and some of those data sets are DFSG-free while others are not, then the free data sets and the model can go in main. The model can depend on, recommend, or suggest the DFSG-free weights and data sets in main, but it can only suggest those in non-free.

I find this understanding accomplishes two things.

1. It is a consistent application of DFSG principles to machine learning applications.

2. It makes the benefits of non-DFSG ML applications available in non-free to those who would like to use them.

> If any use of the original training data (or of its description as
> above) requires 100 000 Nvidia H100 cards running for a month using a
> few billion USD of investment and several million dollars of
> electricity, does that training data *still* satisfy the criteria for
> "preferred form of modification"?

I find discussions about how much hardware it takes to process the training data to be orthogonal to the question of whether an ML training dataset is DFSG-free, so I don’t feel it is useful to discuss that here.

> And, to ask explicitly, is raw training data a better form of
> modification for you compared to a description of that same training
> data, in automated form that would generate the training data for you
> on request?

1. Raw training data is non-negotiably required for me to consider an ML application DFSG-free.

2. A description of that training data would also be nice, but I don’t think it would be non-negotiably required.
However, I might be open to arguments that both should be required.

> Is it important for you if the training data *only* comes to you from
> Debian mirrors? Or is the same data coming to you from other sources
> also fine?

For main, yes, I think it must come from Debian mirrors. For non-free, I don’t see a difference between Debian mirrors and a script that downloads the data from some other source on the internet, as you describe in your example above.

I should note that I do not feel as strongly about this point as I do about the training data being available under a DFSG-free license to be in main. So, if Debian decides that hosting the training data at some Debian-approved location that is not an official Debian mirror is acceptable, I wouldn’t push back against that, as long as the training data itself was DFSG-free so that it could be included in main in the future if we ever decided to do so.

> > In my opinion, it is fine to include otherwise distributable ML
> > applications without available training data in non-free.
>
> Technically - yes, and I would be fine to include OSI-free AI in
> Debian non-free, but IMHO it does nothing to resolve ethical concerns.
> If we limit that to only OSI-free AI then that would also be giving
> the same kind of guidance to the AI community - with both upsides and
> downsides.

I would go beyond that to say that we can host things in non-free that even OSI does not consider free, as long as we have the rights to distribute them. We already do that for a number of other things in non-free.

I think the practical result of the policy I describe above would be that most ML applications would end up in non-free. But I also think that a smaller number of ML applications would end up in main, especially as developers start intentionally creating fully DFSG-free ML applications and training data sets.
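To make the archive layout I describe above concrete, the package split could look roughly like the following sketch of debian/control stanzas (all package names here are invented for illustration):

    # In main: a DFSG-free engine plus DFSG-free weights.
    # Policy allows a main package to suggest, but not depend on
    # or recommend, a package outside main.
    Package: foo-llm
    Recommends: foo-llm-weights-free
    Suggests: foo-llm-weights-nonfree

    Package: foo-llm-weights-free

    # In non-free: redistributable weights whose training data
    # is not DFSG-free.
    Package: foo-llm-weights-nonfree
    Section: non-free/science

A model that only worked with the non-free weights would instead go in contrib, with a hard Depends: foo-llm-weights-nonfree.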
Also, as has been mentioned on this list, the discussion about LLMs has caused us to look more closely at other MLs already in main, like some games or image and audio processing applications. In the past, I, along with others, didn’t think too deeply about the training data used to create the weights used by those MLs.

If the above standards are adopted, it would require moving some of these games and image and audio processing applications to contrib, with their weights in non-free. In other cases, it might be possible to retrain them on DFSG-free training data sets, especially if upstream is interested in doing so. I think this would actually benefit the free-software movement in the long term, even if it requires a bunch of work to address it now.

If such a change were made, I would be in favor of doing so at the beginning of a release cycle, so that Debian and upstream developers have a couple of years to figure out how to either keep the software in main or move it to contrib and non-free in such a way that it is not disruptive to users.

-- 
Soren Stoutner
soren@debian.org