
Re: Non-LLM example where we do not in practice use original training data



(While I find the tone of the email a bit exasperated, I will try to
reply factually and I hope it will be received as such.)

On Wed, 7 May 2025 at 11:34, Simon Josefsson <simon@josefsson.org> wrote:
>
> Aigars Mahinovs <aigarius@gmail.com> writes:
>
> > On Wed, 7 May 2025 at 02:56, Russ Allbery <rra@debian.org> wrote:
> >
> >>
> >> I think if any of the options in the current GR except Aigars's (and maybe
> >> Sam's?) passes, that would effectively be a change in our current policy,
> >> even if the current policy is not precisely intentional.
> >
> >
> > IMHO my option will also be a change in our current policy, but, instead of
> > requiring the training data itself, my option would just require adding a
> > documentation section describing how to create/gather and process data
> > required to train such models *if* someone would want to reproduce them.
>
> Would failure for anyone else to be able to reproduce them be a RC bug?

Depends on the clarity and explicitness of the instructions. OSI uses
the criterion that a skilled person should be able to build a
substantially equivalent system with the given instructions. So it
would be technically sufficient if *someone* is able to reproduce
sufficiently similar results. Others may fail because of a mistake
somewhere in the process, which often involves value judgements
(like recognising when a model is overfitted).

> Do the tools required for reproducing the model have to be in Debian
> main, or are non-free or external proprietary tools okay?

Yes, all software required for creating the training data set,
transforming the training data set, training the model and using the
model has to be DFSG-free software in Debian main. That part was never
in question in any definition being discussed AFAIK.

> Do the toolchain for LLM models support bit-by-bit reproducible outputs?

AFAIK - no. Bit-by-bit reproducibility is also not a DFSG criterion.

> Is a Build-Depends on such a LLM-model acceptable?  Then we could
> eventually replace the source code for `sudo` in Debian with a LLM
> prompt like "write me a secure replacement for sudo and output a
> executable ELF binary for my host architecture".  In fact, with a bit of
> more irony, we could replace a lot of insecure source code this way.

That is a fun question, but you would get exactly the same answer
regardless of what training data was used to train such an LLM, even
if an LLM were created that was trained *only* on the contents of
Debian main. Replacing the source code of a package with a call to a
generator would be silly in many different ways. (And it would not
really generate a binary, that's not how LLMs work - they still output
words.) However, there is nothing problematic about a developer using
an LLM to generate source code that, *after developer review*, becomes
part of a wider code base implementing useful functionality. This
could also be used very productively to generate drafts of API
documentation and unit tests. It is no different from templating and
scaffolding. The developer executing those requests and reviewing the
code owns the copyright of the generated material.

> I'm not convinced this approach leads to something desirable.  I fear it
> means people will have yet another way to add proprietary content into
> Debian, and that Debian give up further on caring about user freedom.

Being able to reproduce Debian binaries bit-by-bit from
developer-readable source code is a good feature, but it does not
really appear among the user freedoms defined in the Debian Social
Contract. Even the ability to rebuild a particular Debian package
fully automatically can disappear over time as external dependencies
and environments change. Such changes then also require adaptations
by a skilled person before a substantially equivalent system can be
built again.

Pushing towards models that provide a way to modify the model
behaviour *after* base training *adds* to the user freedoms for
modification and derived works compared to the fixed software binaries
we have now. Pushing towards descriptions of the training data and of
the process for recreating model weights *adds* to the users' freedom
to access the real source code, compared to distributing only the
already distilled expert knowledge like we do right now.

-- 
Best regards,
    Aigars Mahinovs

