
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Wed, 14 May 2025 at 00:03, Soren Stoutner <soren@debian.org> wrote:
>
> On Tuesday, May 13, 2025 12:06:05 PM Mountain Standard Time Ilu wrote:
> > 2. What is the preferred form of modification? This is IMHO the
> > deciding, relevant question.
> > Aigars says weights and I've heard that from several other people active
> > in machine learning. OSI says the same.
> > Mo Zhu says training data is. I haven't heard that from anybody else.
>
> I thought several other people besides Mo Zhu had also said that on this list,
> but just in case they haven’t, I would like to go on the record that I also
> feel that training data is one of the preferred forms of modification in
> machine learning and should be considered as such for anything being
> included in main.

Could you expand a bit on this topic, so I can understand this position better?

Say that we are talking about an otherwise-free LLM trained on a
multi-gigabyte data set. Data from the dataset may be downloaded from
the Internet (but may not be redistributed by Debian). Let's assume
that the source code of the LLM also includes a script that would, if
executed, do all the downloading and formatting of the training data
from Internet sources for you. The data *may* even be binary-identical
to the original training data (if the model is only trained on
snapshotted data-mining collections that one can download via a
BitTorrent magnet link, for example), or it may be in a newer state
than when the model was originally trained (if you choose to switch to
newer snapshots, or if data collection happens directly from source
servers or their proxies). You can add, remove or filter data sources
to modify the contents of the training data at either a coarse or a
granular level.
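
To make this concrete, here is a rough sketch of the kind of
download-and-format script I mean. Every URL, checksum and filter
name below is made up for illustration, not taken from any real
model:

#!/usr/bin/env python3
# Hypothetical sketch of the "fetch the training data for you"
# script described above; all URLs and digests are placeholders.

import hashlib
import urllib.request
from pathlib import Path

# Pinned snapshots as (url, sha256) pairs. Pinning the hashes is
# what would make the rebuilt corpus binary-identical to the one
# used for the original training run; pointing at newer snapshot
# URLs instead gives the "newer state" case.
SOURCES = [
    ("https://example.org/snapshots/crawl-2025-01.tar.zst",
     "0" * 64),  # placeholder digest, not a real hash
]

def fetch(url, sha256, dest):
    """Download one snapshot and verify its checksum."""
    out = dest / url.rsplit("/", 1)[-1]
    if not out.exists():
        urllib.request.urlretrieve(url, out)
    h = hashlib.sha256()
    with out.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != sha256:
        raise ValueError("checksum mismatch for %s" % url)
    return out

def main():
    dest = Path("training-data")
    dest.mkdir(exist_ok=True)
    for url, sha256 in SOURCES:
        fetch(url, sha256, dest)
    # Formatting and filtering (deduplication, language filters,
    # dropping an excluded source) would follow here; editing
    # SOURCES or the filters is how you would modify the training
    # data at a coarse or a granular level.

if __name__ == "__main__":
    main()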

Would that be a sufficient definition of training data to satisfy the
"preferred form of modification" criterion for you?

If any use of the original training data (or of its description as
above) requires 100,000 Nvidia H100 cards running for a month, a few
billion USD of investment, and several million dollars of electricity,
does that training data *still* satisfy the criterion of "preferred
form of modification"?

And, to ask explicitly: is raw training data a better form of
modification for you than a description of that same training data, in
an automated form that would generate the training data for you on
request?
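
To illustrate the distinction, the "description" form could be as
small as a declarative manifest like the following (the magnet link
and filter names are placeholders), while the raw form is the
multi-gigabyte result of evaluating it:

# Hypothetical "description" of a training set: a few lines that
# can regenerate gigabytes of raw data on request.
DATASET_DESCRIPTION = {
    "snapshots": ["magnet:?xt=urn:btih:<placeholder>"],
    "filters": ["dedupe-exact", "drop-non-text"],
}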

Is it important to you that the training data comes *only* from Debian
mirrors? Or is the same data coming to you from other sources also
fine?

> In my opinion, it is fine to include otherwise distributable ML applications
> without available training data in non-free.

Technically - yes, and I would be fine with including OSI-free AI in
Debian non-free, but IMHO that does nothing to resolve the ethical
concerns. If we limit inclusion to only OSI-free AI, then that would
also give the same kind of guidance to the AI community - with both
upsides and downsides.
-- 
Best regards,
    Aigars Mahinovs

