However, IMHO we need to distinguish between things like gnubg or
tesseract, and today's LLMs or similar "large" models.
Yes, that could be a useful thing to do.
We can, absent copyright restrictions, more or less easily
recreate the former's models from their training data.
We can't do that with LLMs or similar-sized models, even if we
had source code.
Their developers create a model's architecture, presumably as some
Python-or-whatever source code and/or a descriptive language, which
*is* their source. We don't get that. This source gets compiled to
whatever (we don't get these binaries either). The result is then
run in training mode on a large corpus which Debian can't distribute,
(a) for copyright reasons but also (b) because it's too damn large.
That ends up as a base model, which they don't give us either, and
which gets tweaked by further training and human feedback (partly by
poorly-paid gig workers in developing countries), then distilled down
to a manageable size (but still too large for us to distribute in
many cases).
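To keep the artifact list straight, here's that pipeline as a
Python-ish sketch. All the names are made up and the bodies are
deliberately empty; the only point is to mark which intermediate
artifacts we do and don't get.

    # Schematic only: made-up names, empty bodies; the point is the
    # artifact list in the comments, not the code itself.

    def pretrain(architecture_source, corpus):
        """Run the compiled architecture in training mode over the corpus."""
        ...  # yields the base model -- not given to us

    def refine(base_model, human_feedback):
        """Further training plus feedback from (often gig-economy) raters."""
        ...  # yields the tuned model -- also not given to us

    def distill(tuned_model):
        """Shrink the tuned model down to a 'manageable' size."""
        ...  # yields the end model -- usually the only published artifact

    # architecture_source: their real source; we don't get it
    # compiled binaries:   we don't get these either
    # corpus:              too large and copyright-encumbered to redistribute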
The proposals so far have been to agree to ship the end model inside Debian if all the software used in the training process is DFSG-free, if the training process is documented (or scripted with included DFSG-free scripts), and if the training corpus is also either available or shipped with the source. Technically it should not be a big problem to capture the training corpus of a model and save it before the training starts - it all needs to be downloaded and packaged for the training program to consume anyway (see the sketch below). Where the problems start is with the legal (and technical) hurdles in redistributing this training corpus, since such an assembly of direct copies of copyrightable works would be encumbered by the copyrights of its individual parts. The source *could* be out there, but we as Debian would not have the legal right to redistribute it, even if any individual developer would have the right to acquire such a data set and use it to train the model (assuming they had sufficient resources at their disposal).
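To make the "capture it before training starts" point concrete, here is a minimal sketch of what such a capture could look like - the paths and layout are hypothetical, and it only records a checksum manifest of the corpus, which at least documents *what* was trained on even where we can't redistribute the data itself:

    import hashlib
    import json
    from pathlib import Path

    def sha256_file(path: Path) -> str:
        """Stream a file through SHA-256 so huge corpus files fit in memory."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def snapshot_corpus(corpus_dir: str, manifest_path: str) -> None:
        """Record size and checksum of every corpus file in a JSON manifest."""
        root = Path(corpus_dir)
        manifest = {
            str(p.relative_to(root)): {"bytes": p.stat().st_size,
                                       "sha256": sha256_file(p)}
            for p in sorted(root.rglob("*")) if p.is_file()
        }
        Path(manifest_path).write_text(json.dumps(manifest, indent=2))

    # e.g. snapshot_corpus("/srv/training-corpus", "corpus-manifest.json")

That still leaves the redistribution problem untouched, of course - a manifest is not the corpus.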
This basically raises a fundamental question: is the training data actually source code, or does it need a different legal and technical definition with different rules for handling it? Copyright law currently seems to be interpreted such that training data is *not* the same as source code. IMHO we should not follow that interpretation ourselves.
The question of how to handle the additional training that involves model refinement by humans has not been considered at all so far. This could be made DFSG-free both from the licensing and from the testing-protocol perspective. But practical reproducibility would then face barriers similar to those for the model itself: billions of dollars' worth of hardware and millions of dollars' worth of electricity per reproduction.
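Just to put some hedged numbers behind that last sentence - every figure below is an assumption picked for illustration, not data about any real model:

    # Back-of-envelope only; all inputs are assumptions.
    gpus          = 100_000   # assumed cluster size for a frontier-scale run
    gpu_price_usd = 30_000    # assumed price per accelerator
    gpu_power_kw  = 0.7       # assumed draw per accelerator incl. overhead
    training_days = 90        # assumed length of one training run
    usd_per_kwh   = 0.10      # assumed electricity price

    hardware_cost = gpus * gpu_price_usd                      # ~$3 billion
    energy_kwh    = gpus * gpu_power_kw * 24 * training_days  # ~150 GWh
    energy_cost   = energy_kwh * usd_per_kwh                  # ~$15 million

    print(f"hardware ~${hardware_cost/1e9:.1f}B, "
          f"energy ~{energy_kwh/1e6:.0f} GWh ~= ${energy_cost/1e6:.0f}M")

Change the assumptions and the totals move by an order of magnitude either way, but the "billions for hardware, millions for electricity per run" shape is hard to escape.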
So our choice is basically between shipping something we don't
control and can't introspect, and, well, not doing so.
There is no third choice of distributing a free alternative,
because even if we got the architecture's source code, and aside
from the copyright issue, the humongous-size issue, the
multiple-manual-build-steps issue and the
shouldn't-we-save-energy-dammit issue, there's the looming problem
that almost(?) none of us have even remotely enough GPUs to
reproduce the resulting model in the first place.
My vote is for not doing so. We might want to ship the requisite
tools in contrib and let people download the models from
huggingface, but that's as far as I want to take Debian in that
direction.
That might be the most likely outcome. But in that case IMHO it would benefit the community to have guidance differentiating between what we see as free AI (and what not), and which subset of those free AI models we consider practically includable in the Debian archive, with additional criteria for input data packaging and availability, as well as for the amount of resources needed to re-create the model (with a limit on what Debian can actually have in terms of hardware and afford to spend in terms of power/rental costs if needed).
Debian has historically been a very important community voice in defining clear criteria and targets that the rest of the community then rallied around, starting with the DFSG and the analysis of specific copyright licenses, up to more recent projects like reproducible builds.