Re: A policy on use of AI-generated content in Debian
On 5/2/24 14:47, Dominik George wrote:
> That's entirely not the point.
> It is not about **the tool** being non-free, but the result of its use being non-free.
> Generative AI tools **produce** derivatives of other people's copyrighted works.
Yes. That includes the case where an LLM generates content with large portions
of overlap with existing copyrighted works. For instance:
https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
This happens because the copyrighted content is part of the original training
dataset.
Apart from LLMs (large language models), image generation models and other
generative AIs will do something similar, partly copying their copyrighted
training data into the generated results, to some extent.
> That said, we already have the necessary policies in place:
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
> Both are not possible using generative AI.
Both are possible.
For example, a developer may use an LLM to aid programming, and the LLM may
copy some code from a copyrighted source. But the developer is very unlikely
to be able to tell whether the generated code contains a verbatim copy of
copyrighted content, let alone identify the source of the copied parts, if any.
Namely, if we look at new software projects, we do not know whether the code
files are written purely by humans, or with some aid from AI.
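As a reminder of what "d/copyright must be accurate" asks for, here is a
minimal sketch of a machine-readable debian/copyright (DEP-5) stanza; the
package name, people, and years are made up for illustration:

    Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
    Upstream-Name: example-project
    Source: https://example.org/example-project

    Files: *
    Copyright: 2024 Example Upstream Author <author@example.org>
    License: Expat

    Files: debian/*
    Copyright: 2024 Example Debian Maintainer <maintainer@example.org>
    License: Expat

Note that nothing in this format records whether a file was produced with AI
assistance. If an LLM had silently copied, say, GPL-licensed code into one of
the upstream files, the stanza above would simply be wrong, and neither the
maintainer nor the FTP masters would have any way to notice from the file
itself.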
Similar things happen with other file types, such as images. For instance, you
may ask a generative AI to generate a logo, or some artwork, as part of a
software project. Those generated results, with or without further
modification, can be stored in .ico, .jpg, .png, and other formats.
Now, the problem is that FTP masters will not question the reproducibility of
a code file or a .png file. If the upstream author does not acknowledge the
use of AI during the development process, it is highly likely that nobody else
will ever know.
This does not sound like a situation where we can take any action to improve
things. My only suggestion here is to trust the upstream authors'
acknowledgements.
BTW, ML-Policy has foreseen this issue and covers it to some extent:
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst
See the "Generated Artifacts" section.
It seems that the draft Open Source AI Definition does not yet cover content
generated by AI models:
https://discuss.opensource.org/t/draft-v-0-0-8-of-the-open-source-ai-definition-is-available-for-comments/315