Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models
On Wed, 14 May 2025 at 21:13, Soren Stoutner <soren@debian.org> wrote:
>
> On Wednesday, May 14, 2025 5:31:53 AM Mountain Standard Time Aigars Mahinovs
> wrote:
> > On Wed, 14 May 2025 at 00:03, Soren Stoutner <soren@debian.org> wrote:
> > > On Tuesday, May 13, 2025 12:06:05 PM Mountain Standard Time Ilu wrote:
> > > > 2. What is the preferred form of modification? This is IMHO the
> > > > deciding, relevant question.
> > > > Aigars says weights and I've heard that from several other people active
> > > > in machine learning. OSI says the same.
> > > > Mo Zhu says training data is. I haven't heard that from anybody else.
> > >
> > > I thought several other people besides Mo Zhu had also said that on this
> > > list, but just in case they haven’t, I would like to go on the record that
> > > I also feel that training data is one of the preferred forms of
> > > modification in machine learning and should thus be considered for anything
> > > being included in main.
> >
> > Could you expand a bit on this topic, so I can understand this position
> > better?
> >
> > Say that we are talking about an otherwise-free LLM model trained on a
> > multi-gigabyte data set. Data from the dataset may be downloaded from
> > the Internet (but may not be redistributed by Debian). Let's assume that
> > the source code of the LLM also includes a script that would, if
> > executed, do all the downloading and formatting of the training data
> > from Internet sources for you. The data *may* even be binary identical
> > to the original training data (if it is only trained on snapshotted
> > data mining collections that one can download from torrent via a
> > magnet link for example), or it may be in a newer state than when it
> > was trained originally (if you choose to switch to newer snapshots or
> > if data collection happens directly from source servers or their
> > proxies). You can add, remove or filter data sources to modify the
> > contents of the training data on a high or granular level.
> >
> > Would that be a sufficient definition of training data to satisfy the
> > preferred form of modification criteria for you?
>
> If Debian cannot redistribute the training dataset (part of your description
> above), then it cannot be in main.
That is not what I asked. Redistribution is a completely different
question, covered by a different point of the DFSG, and is separate even
from the question of whether the DFSG applies to the training data as
such. Whether it applies, in turn, depends on one very isolated
question: what is the preferred form of modification? That is why I am
*specifically* asking how your opinion that "training data is the
preferred form of modification" works in real-world examples.

Only that specific criterion. Not about Debian, not about main or
non-main. Not for other people or for the project.

What does "preferred form of modification" mean for *you*? For
example, in the case above: is the raw training data *really* _the_
preferred form of modification? Or is it the data definition? Which
would you *prefer* to *modify*?
>
> > If any use of the original training data (or of its description as
> > above) requires 100 000 Nvidia H100 cards running for a month using a
> > few billion USD of investment and several million dollars of
> > electricity, does that training data *still* satisfy the criteria for
> > "preferred form of modification"?
>
> I find discussions about how much hardware it takes to process the training
> data to be orthogonal to a discussion of whether a ML training dataset is
> DFSG-free, so I don’t feel it is useful to discuss here.
It is absolutely critical to a very specific DFSG question: what is
the "preferred form of modification"?

IMHO a form can *not* really be the "preferred form of modification"
if, in reality, basically no one in the world is actually able to use
that form to modify the result. If no one can modify it in that
particular way, how can that be the preferred way?

Claiming that something is or is not DFSG-free _while we are
discussing the meaning of specific DFSG definitions_ is just
circular logic. No "DFSG-free" or "non-DFSG-free" determination
can be made until the preferred form of modification is
determined *first*.

If you could try again and talk specifically about what you think of
the cases I laid out in the original mail, IMHO that could narrow down
what is and what isn't under discussion.
--
Best regards,
Aigars Mahinovs