
[RFC] Counter-Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA384

Cover letter
============

(Please do keep me in Cc, I’m not subscribed to the list.)

Hi! I had not realised this was going to a GR, so I’ve drafted
a counter proposal, based on the thread on debian-private around
<93d2028888fce48ab4b4609d59f7a72c9edc916e.camel@debian.org> and
earlier thoughts I’ve collected regarding this topic, such as on
https://evolvis.org/~tg/cc.htm and the interpretation guidelines
on https://mbsd.evolvis.org/MirOS-Licence.htm (this is a mirror on
a more capable VM).

I’m not sure how quickly I’ll need seconds, but I would also welcome
input on this proposal (including from the l10n-en team as I’m not a
native English speaker).

I’m PGP-signing this with my DD key so that, for the avoidance of
doubt, should time indeed be short, I am submitting this as a ballot
choice. If time isn’t short, I’m tentatively submitting it, with
working in feedback and updating it first as an option.


Counter-Proposal -- Interpretation of DFSG on (AI) Models
=========================================================

Please see the original proposal for background on this.

The counter-proposal is as follows:

The Debian project requires the same level of freedom for AI models
as it does for other works entering the archive.

Notably:

1. A model must be trained only from legally obtained and used works,
   honour all licences of the works used in training, and itself be
   licensed under a suitable licence that allows distribution, or it
   is not even acceptable for non-free. This includes an understanding
   that “generative AI” outputs are derivative works of their inputs
   (including training data and the prompt), insofar as these pass the
   threshold of originality; that is, generative AI acts similarly to
   lossy compression followed by decompression, or to a compiler.

   Any work resulting from generative use of a model can at most be
   as free as the model itself; e.g. code written with the assistance
   of a model from contrib or non-free cannot enter main.

   The "/usr/share/doc/PACKAGE/copyright" file must include copyright
   notices from all training inputs as required by Policy for “any
   files which are compiled into the object code shipped in the binary
   package”, except for inputs already separately packaged (such as
   the training software, libraries, or inputs already available from
   packages such as word lists also used for spellchecking).

   Regarding availability of sources used for training, the normal
   rules of the non-free archive apply.
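
   As an illustration of the copyright-file requirement above (the
   package name, upstream name, and file paths are hypothetical),
   notices for training inputs could be recorded as additional stanzas
   in the machine-readable debian/copyright format:

   ```
   Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
   Upstream-Name: hypothetical-model

   Files: *
   Copyright: 2025 Example Model Authors
   License: Expat

   Files: training-data/corpus-a/*
   Copyright: 1998-2004 A. Author
   License: Expat
   Comment: Notice reproduced from a work used in training, analogous
    to Policy’s requirement for files compiled into shipped object
    code.
   ```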

2. Models are not suitable for the non-free-firmware archive.

3. For a model to enter the contrib archive, it may at runtime require
   components from outside of Debian main, but the model itself must
   still comply with the DFSG, i.e. follow the requirements below for
   models entering main. If a model requires a component outside of
   main at build or training time, it is only admissible to non-free.

4. For a model to enter the main archive, all works used in training
   must additionally be available, auditable, and under DFSG-compliant
   licensing. All software used to do the training must be available
   in Debian main.

   If the training happens during the package build, the sources must
   be present in Debian packages or in the model’s source packages;
   if it does not, they must still be available in the same way.

   This is the same rule as is used for other precompiled works in
   Debian packages that are not regenerated during build: it must be
   possible to regenerate them using only Debian tools; waiving the
   requirement to actually do the regeneration during the package
   build is a nod to realistic build times and resource usage.

5. For a model to enter the main archive, the model training itself
   must *either* happen during package build (which, for models of
   a certain size, may need special infrastructure; the handling of
   this is outside of the scope of this resolution), *or* the model
   resulting from training must build in a sufficiently reproducible
   way that a separate rebuilding effort from the same source will
   result in the same trained model. (This includes using reproducible
   seeds for PRNGs used, etc.)

   To keep this goal realistically achievable, the reproducibility
   requirement is relaxed to not require bitwise equality, as long
   as the resulting model is effectively identical. (As a comparison,
   for C programs this would be equivalent to allowing the linking
   order of the object files in the binary to differ, embedded
   timestamps to differ, or a different encoding of the same opcodes
   (like 31 C0 vs. 33 C0 for i386 “xor eax,eax”), but no functional
   changes as determined by experts in the field.)
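
   The reproducible-training and relaxed-equivalence ideas above can
   be sketched in a few lines (a toy example, not a real training
   pipeline; all names and the tolerance value are illustrative):

   ```python
   import random

   def train_toy_model(seed: int, steps: int = 1000) -> list[float]:
       """Toy 'training': a weight vector updated by a seeded PRNG.

       Stands in for fixing every source of nondeterminism in a real
       pipeline (PRNG seeds, data order, parallelism, etc.).
       """
       rng = random.Random(seed)  # fixed seed -> reproducible run
       weights = [0.0] * 4
       for _ in range(steps):
           i = rng.randrange(len(weights))
           weights[i] += rng.uniform(-0.01, 0.01)
       return weights

   def effectively_identical(a: list[float], b: list[float],
                             tol: float = 1e-9) -> bool:
       """Relaxed comparison: not bitwise equality, but equality
       within a tolerance judged functionally irrelevant."""
       return len(a) == len(b) and all(abs(x - y) <= tol
                                       for x, y in zip(a, b))

   # Two independent "rebuilds" from the same source and seed:
   m1 = train_toy_model(seed=42)
   m2 = train_toy_model(seed=42)
   assert m1 == m2                      # here even bitwise-identical
   assert effectively_identical(m1, m2)
   ```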

6. For handling of any large packages resulting from this, the normal
   processes are followed (such as discussing in advance with the
   relevant teams, ensuring mirrors are not overburdened, etc.).

The Debian project asks that training sources are not obtained
unethically, and that the ecological impact of training and using
AI models be considered.

[End of proposal.]

-----BEGIN PGP SIGNATURE-----

iQIcBAEBCQAGBQJoCV23AAoJEHa1NLLpkAfgfQcP/jDN+p+rY0fPhQUZ/HpJadkJ
BawiUYp+TMjsXowrXXy9Mp7FyrlWrj+zROfA1tup2+TkdlQSY8A62aWYS62y5z9y
x5TxqwS3+xH6UmtchmX7alxy7u9vUrcsdUM9NKt1DZQANyqq8+pVTpMKauNNsXr+
L8zq/37ludyjCf+c9pnJ066CUaLBBMQGWmfPO8c1mjYWNnACXgYuUH1cw8Sgzr5u
vQrdURGfebrmTCQBbmCO5FOzQ3Q/uLjl5CocC8HWF0TBh7vcVtnYCkrvalECJpO5
PlCMUZ0MApuEJ1UTUcj+5lDxdH02dcMdFd7v+OB7+E5Jr+MHDR0wWoVaScm9MYno
Eip0sxbzVRqozeAH5bKKSaIQN+4KL/pVB2bYxwR4N5/W/9cxDsJmF/uoB1lZNtL8
DOvLar3RmHNVbaXin/E3afhw5L3O7JeppTSCby9Unyow8hmRjfjhz//ApEbOrWfv
CNH7sdM2mkEe0SXoxLyX7wfmZuWQ2SUZ4nwbj3vmHvM6jrVragCJxibQyVEIzuSQ
1FB0MsFa1TrYN4tnR7/q9AiskcHKiTwcdJh0LFCiLZ2F2d2sd4ne60qQTCpmjzzG
WkhgeTOeLPCDgkHmC+oUEzGpQruKI/surQ9NSGWbFDyEPTGf9rVzMNlVRp0jJSob
2PclqIcmvlO8Krw+9klA
=U1FJ
-----END PGP SIGNATURE-----

