
Draft: Proposal Alternative: Training data is not source code



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

** Proposal Text **

Choice 3: Training data for training of AI models is not to be
considered "source code" in the context of DFSG. Instead the real
source code in such a case is "Training Data Information" and the
training data itself is an intermediate build artifact.

AI models are compatible with DFSG only if they provide complete
"Training Data Information". AI models whose reproduction from
training data or from training data information is prohibitively
expensive or is impractical are compatible with DFSG only if they
provide ways to modify the AI model and create derivative works
directly from the trained model.

The meaning of "Training Data Information" is based on the definition
and explanation of "Data Information" in
https://opensource.org/ai/open-source-ai-definition

** Rationale **

The problem of collecting and distributing training data sets can
be avoided entirely by going one step further back: treating the
"training data information" as the *actual* source code and the
"training data" itself as only an intermediate build artifact. This
might not guarantee a fully reproducible rebuild of a model (although
that could be possible in some cases by identifying the exact version
of the source data using hashes), but it does one better: it makes it
possible (if enough resources are invested) to create a new version of
the model with new and updated data. And it does not put the onus on
Debian to redistribute this intermediate data.
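
To illustrate the remark about hashes: below is a minimal sketch, in
Python, of how someone rebuilding a model might check that the files
they fetched match the versions named in the training data
information. The manifest layout (file name mapped to a SHA-256 hex
digest) and the names sha256_of, verify_sources and demo are
assumptions made for this example only; neither this proposal nor the
OSI definition prescribes such a format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sources(manifest: dict[str, str], data_dir: Path) -> dict[str, bool]:
    """Map each file named in the manifest to whether its hash matches.

    `manifest` is a hypothetical structure: file name -> expected digest.
    A missing file counts as a mismatch.
    """
    results = {}
    for name, expected in manifest.items():
        candidate = data_dir / name
        results[name] = candidate.is_file() and sha256_of(candidate) == expected
    return results


def demo() -> dict[str, bool]:
    """Build a throwaway data directory and check it against a manifest."""
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        content = b"example training text\n"
        (data_dir / "corpus.txt").write_bytes(content)
        manifest = {
            "corpus.txt": hashlib.sha256(content).hexdigest(),
            "missing.txt": "0" * 64,  # deliberately absent from data_dir
        }
        return verify_sources(manifest, data_dir)


if __name__ == "__main__":
    print(demo())
```

A match only shows that the same input files were obtained. It does
not by itself make the training run reproducible.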

The definition keeps all DFSG guidelines intact, but adds two
clarifications:

1) Training data is not source code. Copyright case law treats source
code and training data differently. It is clear that a compiled
binary is a derived work of its source code. However, there is no
direct copyright relationship between training data and an AI model or
its outputs. That is why training data should be considered
separately. The specifics may need to be adjusted based on future
court rulings. For example, it might become necessary to include
protections against AI regurgitation.

2) An AI model that is both prohibitively expensive to reproduce and
not easily modifiable does not (de facto) satisfy the DFSG derived
works requirement.
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEFmwrqIlWRDzdY39G+mQ7ph0ievsFAmgaWqoACgkQ+mQ7ph0i
evtd0Q/9FrxQqHQI94GBNQF3uA+BbcghJYq4ZSLxGdrhS6g2IQpZ3Vq+dMJ9WKrr
5Wbmct2u2vt2Mk36WFnuTQDkEv6Cx9QN/lMUfMhcnBVnt8hL1XjRCbQGCMqiUcRz
/QFAGbhjuxwvLWPDAKs3AEWbv0nPTmacEzMVA7s8629ZnRq9sV9fzcnP0jqBBQq0
lvaeDJBiKgpmM3b/ENeyKopmuRroCpqpG2OTghAsMSa7JHqfibgqamHmFkeDaOJt
5HveKmcm9AV2PwVP6UZHpyDciCCPFkZSpor1V+02qhEZBtHKNxGNgAYb/Edxnsxh
1W7MRQrwi8alPXeFKYLKNbD1ZP7WUDjvEXVJF1ucmir0599us+soPjN9VFNkr58F
5ugoubQN+rcz989tPbSnUst6wSPkDgRlkjtaF+uPn6LCIFuvCt3GH+OxJlmYG/K+
1C9Ea60WMkn38b6Yn9gW7WYq09hnP6kpPeXfmD68Ac0YxWKoj18FPD3WDwTc5/S5
Fp+LpJ3vd1PpcYfacA0a+l7H0Vc5K4woRjzCU4KTVeYpBZSe4hRuOn3igFx6Z53E
cUwjoZqnCLU7SoiDP9xXSBTF3UBM/iTcrW33gBE3ujKyv+p2z74eUvrn302ZFA9G
JlDoRdmTHqLlNncEA04FdJ6+VBNY6GZKGXK5r0vDMnQ26MMHWdU=
=imT+
-----END PGP SIGNATURE-----


-- 
Best regards,
    Aigars Mahinovs        mailto:aigarius@debian.org

