
Re: A policy on use of AI-generated content in Debian




On 5/2/24 14:47, Dominik George wrote:
> That's entirely not the point.
>
> It is not about **the tool** being non-free, but the result of its use being non-free.
>
> Generative AI tools **produce** derivatives of other people's copyrighted works.

Yes. That includes the case where an LLM generates content that overlaps
substantially with copyrighted works. See, for instance, the New York Times
lawsuit against OpenAI and Microsoft:
https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
This happens because the copyrighted content was part of the model's original
training dataset. Beyond LLMs (large language models), image generation models
and other generative AI systems behave similarly, copying parts of their
copyrighted training data into the generated results to some extent.

> That said, we already have the necessary policies in place:
>
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
>
> Both are not possible using generative AI.

Both are possible.

For example, suppose a developer uses an LLM to aid programming, and the LLM
copies some code from a copyrighted source. The developer is very unlikely to
be able to tell whether the generated code contains verbatim copies of
copyrighted content, let alone identify the source of those copied parts, if any.

Namely, when we look at a new software project, we cannot tell whether its code
files are purely human-written or were produced with some aid from AI.

Similar things happen with other file types, such as images. For instance, you
may ask a generative AI to produce a logo or some artwork as part of a software
project, and the generated results, with or without further modification, can
be stored in .ico, .jpg, .png, and other formats.

Now, the problem is that the FTP masters will not question the reproducibility
of a code file or a .png file. If the upstream author does not acknowledge the
use of AI during the development process, it is highly likely that nobody else
on earth will ever know.

This does not sound like a situation we can improve through any concrete
action. My only suggestion is to trust the upstream authors' acknowledgements.
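For what it's worth, the machine-readable debian/copyright format already has a
Comment field in Files paragraphs that could carry such an acknowledgement. A
hypothetical sketch (the file path, license, and wording are illustrative only;
there is no standard field for AI disclosure):

    Files: artwork/logo.png
    Copyright: 2024 Example Upstream Authors
    License: MIT
    Comment: Upstream acknowledges that this artwork was initially produced
     with the assistance of a generative AI model and was manually edited
     afterwards. The accuracy of this statement depends entirely on
     upstream's own disclosure.

Of course, this only records what upstream chooses to tell us, which is exactly
the limitation described above.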

BTW, the ML-Policy has foreseen this issue and covers it to some extent:
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst
See the "Generated Artifacts" section.

It seems that the draft Open Source AI Definition does not yet cover content
generated by AI models:
https://discuss.opensource.org/t/draft-v-0-0-8-of-the-open-source-ai-definition-is-available-for-comments/315

