Re: Proposal Alternative: A Model Can Be a Preferred form of Modification
Hi,
>> ***Proposal Text***
>>
>> Choice 2: Software incorporating AI Models Released under DFSG-free
>> Licenses Must provide for Practical Modification to Comply with DFSG
>>
>> The project asks those charged with interpreting the DFSG to require
>> that software incorporating AI models have a preferred form of
>> modification for the models and that we provide our users the ability to
>> modify these models in order to be included in the main section of the
>> archive. Examples of such a preferred form of modification can include
>> the original training data for the model. Alternatively, a base model
>> (especially when the base model can be replaced and multiple options are
>> available) along with training data for any fine tuning that has been
>> performed is acceptable. In some cases a model along with necessary
>> tools to perform incremental fine tuning may be acceptable if doing
>> additional incremental training is actually the approach that the
>> upstream project uses to modify the model. As with other interpretations
>> of the DFSG, something cannot be the preferred form of modification if
>> the upstream of the software under consideration has a more preferred
>> form of modification that is not public.
>
>
> Another, simpler, alternative would be to vote on the Debian project endorsing https://opensource.org/ai/open-source-ai-definition
>
> It basically translates the four freedoms into AI freedoms and introduces "Data Information" as a substitute for (potentially unredistributable) original training data - a description of what data was used for training and how it was acquired and processed - with the key requirement that a sufficiently skilled person should be able to reproduce the data, and then the model, using this information.
The OSI definition of open source was originally derived from the
Debian DFSG, but after they published that Open Source AI Definition,
some people who objected to it created https://opensourcedefinition.org/
and https://openweight.org/ to emphasize that weights are not open
source without the training data. For background see
https://www.einpresswire.com/article/779177703/open-weight-definition-owd-delivering-clarity-while-protecting-the-integrity-of-open-source-ai
Currently the top-voted model at
https://huggingface.co/models?sort=likes is DeepSeek-R1, which is
released under the MIT license but, of course, without any training
data. While Hugging Face has a good UI, there does not seem to be any
way to search for models that have both an open license AND publicly
available training data. It would be reassuring to see a list of such
models and to be able to assess whether they are likely to grow and
evolve, so that Debian does not adopt an overly strict stance and end
up without even small spelling and grammar checking models that could
offer great value to end users.
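
For what it's worth, the license half of that search can be scripted
against the Hub API. Below is a rough sketch using the huggingface_hub
Python library (assuming its tag-based filter and the usual license:*
tags); note there is no equivalent filter for whether training data was
published, which is exactly the gap described above.

  from huggingface_hub import HfApi

  api = HfApi()
  # List the most-liked models carrying an open license tag.
  for license_tag in ("license:mit", "license:apache-2.0"):
      for m in api.list_models(filter=license_tag, sort="likes",
                               direction=-1, limit=5):
          # Whether training data is actually published still has to be
          # checked by hand in each model card.
          print(license_tag, m.id, m.likes)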