[RFCv2] Counter-Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models
Hi Simon,
>Okay, I see what you are trying to get at, but I'm not certain things
>can be separated that easily.
hm, true. We probably also cannot know whether firmware blobs contain
this or not. Perhaps it’s best to just leave non-free-firmware
out of this, for now, at least in explicit mentions.
So, with all the updates, maybe something like this?
Counter-Proposal -- Interpretation of DFSG on (AI) Models (v2)
=========================================================
Please see the original proposal for background on this.
The counter-proposal is as follows:
The Debian project requires the same level of freedom for AI models
as it does for other works entering the archive.
Notably:
1. A model must be trained only from legally obtained and used works,
honour all licences of the works used in training, and be licensed
under a suitable licence itself that allows distribution, or it is
not even acceptable for non-free. This includes an understanding
that “generative AI” outputs are derivative works of their inputs
(including training data and the prompt), insofar as these pass the
threshold of originality; that is, generative AI acts similarly to
a lossy compression followed by decompression, or to a compiler.
Any work resulting from generative use of a model can at most be
as free as the model itself; e.g. programming with the assistance of
a model from contrib/non-free prevents the result from entering main.
The "/usr/share/doc/PACKAGE/copyright" file must include copyright
notices from all training inputs as required by Policy for “any
files which are compiled into the object code shipped in the binary
package”, except for inputs already separately packaged (such as
the training software, libraries, or inputs already available from
packages such as word lists also used for spellchecking).
Regarding availability of sources used for training, the normal
rules of the non-free archive apply.
2. For a model to enter the contrib archive, it may at runtime require
components from outside of Debian main (such as drivers for specific
hardware it is designed to run on), but the model itself (including
any training input that ends up in the model) must still comply with
the DFSG, i.e. follow the requirements below for models entering main.
If a model requires a component outside of main at build or training
time that changes the model itself (e.g. training data, or training
software part of which ends up in the trained model), it is only
admissible to non-free.
3. For a model to enter the main archive, all works used in training
must additionally be available, auditable, and under DFSG-compliant
licensing. All software used to do the training must be available
in Debian main.
If the training happens during package build, the sources must be
present in Debian packages or in the model’s source packages; if
not, they must still be available in the same way.
This is the same rule as is used for other precompiled works in
Debian packages that are not regenerated during build: it must be
possible to regenerate them using only Debian tools. Waiving the
requirement to actually regenerate them during package build
is a nod to realistic build time and resource usage.
4. For a model to enter the main archive, the model training itself
must *either* happen during package build (which, for models of
a certain size, may need special infrastructure; the handling of
this is outside of the scope of this resolution), *or* the model
resulting from training must build in a sufficiently reproducible
way that a separate rebuilding effort from the same source will
result in the same trained model. (This includes using reproducible
seeds for PRNGs used, etc.)
For realistic achievability of this goal, the reproducibility
requirement is relaxed to not require bitwise equality, as long
as the resulting model is effectively identical. (As a comparison,
for C programs this would be equivalent to allowing different
linking order of the object files in the binary or embedded
timestamps to differ, or a different encoding of the same opcodes
(like 31 C0 vs. 33 C0 for i386 “xor eax,eax”), but no functional
changes as determined by experts in the field.)
5. For handling of any large packages resulting from this, the normal
processes are followed (such as discussing in advance with the
relevant teams, ensuring mirrors are not over-burdened, etc).
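(Illustration only, not part of the proposal text.) The relaxed
reproducibility check from item 4 could be sketched roughly as follows.
The toy "training" loop, the function names and the numeric tolerances
are all assumptions made up for this example; the proposal leaves the
judgement of "effectively identical" to experts in the field and does
not prescribe any threshold.

```python
import math
import random


def train_toy_model(seed, steps=50):
    """Hypothetical stand-in for training: a few SGD steps on y = 2x + 1."""
    rng = random.Random(seed)  # explicitly seeded, private PRNG
    w, b = 0.0, 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        err = (w * x + b) - (2.0 * x + 1.0)
        w -= 0.1 * err * x
        b -= 0.1 * err
    return [w, b]


def effectively_identical(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Element-wise comparison within a tolerance (tolerances are assumed)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


original = train_toy_model(seed=42)
rebuild = train_toy_model(seed=42)            # same source, same seed
jitter = [v * (1 + 1e-12) for v in original]  # e.g. rounding differences
changed = [v + 0.1 for v in original]         # a functional change

print(effectively_identical(original, rebuild))
print(effectively_identical(original, jitter))
print(effectively_identical(original, changed))
```

A real model would need a comparison that is meaningful for its
architecture (and probably for its behaviour, not only its weights);
the point here is merely that a fixed seed plus a tolerance-based
comparison is one way to phrase "not bitwise equal, but no functional
changes".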
The Debian project asks that training sources are not obtained
unethically, and that the ecological impact of training and using
AI models be considered.
Transitional provisions:
ⅰ. Any bugs resulting from this GR shall not be release-critical
before Debian trixie has been released as stable.
ⅱ. Any existing package with a “model” inside that already had the
very same model before 2020-01-01 has an extra four years’ time
before bugs regarding these models may become release-critical.
[End of proposal.]
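To make the interplay of items 1 to 4 easier to follow, here is one
possible reading of them as a decision function. This is only a sketch:
the field names and the "rejected" label are hypothetical, each boolean
stands for a judgement that humans (ftpmasters, maintainers) would have
to make, and the normal non-free rules on source availability are not
modelled.

```python
from dataclasses import dataclass


@dataclass
class ModelFacts:
    # All field names are hypothetical paraphrases of the proposal text.
    legally_trained_and_distributable: bool  # item 1
    training_works_dfsg_and_auditable: bool  # item 3
    training_software_in_main: bool          # item 3
    built_or_reproducible: bool              # item 4
    needs_non_main_at_runtime: bool          # item 2


def archive_area(m: ModelFacts) -> str:
    """Return the freest archive area the model could enter."""
    if not m.legally_trained_and_distributable:
        return "rejected"  # not even acceptable for non-free
    meets_main_rules = (
        m.training_works_dfsg_and_auditable
        and m.training_software_in_main
        and m.built_or_reproducible
    )
    if not meets_main_rules:
        return "non-free"
    if m.needs_non_main_at_runtime:
        return "contrib"  # the model itself is DFSG-free
    return "main"


free_model = ModelFacts(True, True, True, True, False)
gpu_only = ModelFacts(True, True, True, True, True)
opaque_data = ModelFacts(True, False, True, True, False)
scraped = ModelFacts(False, False, False, False, False)

for facts in (free_model, gpu_only, opaque_data, scraped):
    print(archive_area(facts))
```

The ordering of the checks mirrors the text: item 1 is a precondition
for every area, the main requirements apply to contrib as well, and
only a runtime dependency outside main distinguishes contrib from main.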
Thanks,
//mirabilos
--
21:41⎜«Tonnerre:#nosec» Do at least one thing every day which makes
⎜ inspirational quotes lovers sad