
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On 28.04.25 21:24, Aigars Mahinovs wrote:
So, very precisely speaking, modification of a LLM does *not* require the original training data. Recreating a LLM does.

IMHO that's a rather academic distinction. Yes, *some* modifications don't require the original training data, much like some modifications to libc don't require source code (I do this myself: in one of my projects I patch the libc loader to search in /v/u/lib instead of /usr/lib, because of pseudo multi-arch), but most do.

However, and returning to the root of this discussion: When we talk about source as the preferred way of modifying something, we need to ask *whose* preferred way. The user's? Certainly not, otherwise we wouldn't need to ship nvim's source code.

Thus it's the developers' preferred source, which leaves pre-built models out in the cold.

However² IMHO we need to distinguish between things like gnubg or tesseract, and today's LLMs or similar "large" models.

We can, absent copyright restrictions, more or less easily recreate the former's models from their training data.

We can't do that with LLMs or similar-sized models, even if we had source code.

Their developers create a model's architecture, presumably as some Python-or-whatever source code and/or a descriptive language, which *is* their source. We don't get that. This source gets compiled to whatever (we don't get these binaries either). The result is then run in training mode on a large corpus which Debian can't distribute, (a) for copyright reasons but also (b) because it's too damn large. They end up with a base model, which they don't give us either, and which gets tweaked by further training and human feedback (partly by poorly-paid gig workers in developing countries), then distilled down to a manageable size (but still too large for us to distribute in many cases).

So our choice is basically between shipping something we don't control and can't introspect, and, well, not doing so.

There is no third choice of distributing a free alternative, because even if we get the architecture's source code, and aside from the copyright issue, the humongous-size issue, the multiple-manual-build-steps issue and the shouldn't-we-save-energy-dammit issue, there's the looming problem that almost(?) none of us have even remotely enough GPUs to reproduce the resulting model in the first place.

My vote is for not doing so. We might want to ship the requisite tools in contrib and let people download the models from Hugging Face, but that's as far as I want to take Debian in that direction.

-- 
-- regards
-- 
-- Matthias Urlichs


