Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models

To: debian-vote@lists.debian.org
Subject: Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models
From: Matthias Urlichs <matthias@urlichs.de>
Date: Tue, 29 Apr 2025 07:36:22 +0200
Message-id: <a519b113-f0b2-4344-b90f-4d1c445c86f7@urlichs.de>
In-reply-to: <CABpYwDUBjmsaED7KRCscQCz9V4apZesYKeyJwpAq2UDcn6UKYQ@mail.gmail.com>
References: <6a60f2f9e7e719aab39e5d21a623d8bac848b9ab.camel@debian.org> <aAfPA6IqfoDLnAhs@layer-acht.org> <40e7d297d72014365dad8be242a359c2b06ac7d3.camel@debian.org> <a351e052-ab6c-4f66-9f6c-0db8064e990c@urlichs.de> <CABpYwDUeRawmtUqjnQTYhZ5Kwt+82PFPUXZK2LN1O9GV8CSkOQ@mail.gmail.com> <87a580s0b5.fsf@hope.eyrie.org> <CABpYwDUBjmsaED7KRCscQCz9V4apZesYKeyJwpAq2UDcn6UKYQ@mail.gmail.com>

On 28.04.25 21:24, Aigars Mahinovs wrote:

> So, very precisely speaking, modification of a LLM does *not* require the original training data. Recreating a LLM does.

IMHO that's a rather academic distinction. Yes, *some* modifications don't require the original training data, much like some modifications to libc don't require source code (I do this myself: in one of my projects I patch the libc loader to search in /v/u/lib instead of /usr/lib because of pseudo multi-arch), but most do.

However, and returning to the root of this discussion: When we talk about source as the preferred way of modifying something, we need to ask *whose* preferred way. The user's? Certainly not, otherwise we wouldn't need to ship nvim's source code.

Thus it's the developers' preferred source, which leaves pre-built models out in the cold.

However², IMHO we need to distinguish between things like gnubg or tesseract and today's LLMs or similar "large" models.

We can, absent copyright restrictions, more-or-less easily recreate the former's models from their training data.

We can't do that with LLMs or similar-sized models, even if we had source code.

Their developers create a model's architecture, presumably as some Python-or-whatever source code and/or a descriptive language, which *is* their source. We don't get that. This source gets compiled to whatever (we also don't get these binaries). The result is then run in training mode on a large corpus, which Debian can't distribute (a) for copyright reasons but also (b) because it's too damn large. They end up with a base model, which they don't give us either, and which gets tweaked by further training and human feedback (partly by poorly-paid gig workers in developing countries), then distilled down to manageable size (but still too large for us to distribute in many cases).

So our choice is basically between shipping something we don't control and can't introspect, and, well, not doing so.

There is no third choice of distributing a free alternative: even if we get the architecture's source code, and aside from the copyright issue and the humongous-size issue and the multiple-manual-build-steps issue and the shouldn't-we-save-energy-dammit issue, there's the looming problem that almost(?) none of us have even remotely enough GPUs to reproduce the resulting model in the first place.

My vote is for not doing so. We might want to ship the requisite tools in contrib and let people download the models from Hugging Face, but that's as far as I want to take Debian in that direction.

-- 
-- regards
-- 
-- Matthias Urlichs

