
Re: A policy on use of AI-generated content in Debian




On 5/2/24 14:47, Dominik George wrote:
> That's entirely not the point.
>
> It is not about **the tool** being non-free, but the result of its use being non-free.
>
> Generative AI tools **produce** derivatives of other people's copyrighted works.

Yes. That includes the case where an LLM generates content that overlaps
substantially with copyrighted works. See, for instance, the New York Times
lawsuit against OpenAI and Microsoft:
https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
This happens because the copyrighted content was part of the model's original
training dataset. Beyond LLMs (large language models), image generation models
and other generative AI systems behave similarly, copying parts of their
copyrighted training data into the generated results to some extent.

> That said, we already have the necessary policies in place:
>
> * d/copyright must be accurate
> * all sources must be reproducible from their preferred form of modification
>
> Both are not possible using generative AI.

Both are possible.

For example, suppose a developer uses an LLM to aid programming, and the LLM
copies some code from a copyrighted source. The developer is very unlikely to
be able to tell whether the generated code contains verbatim copies of
copyrighted content, let alone identify the source of those copied parts, if any.

Namely, when we look at a new software project, we cannot tell whether its code
files are purely human-written or were produced with some aid from AI.

Similar things happen with other file types, such as images. For instance, you
may ask a generative AI to produce a logo or some artwork as part of a software
project, and the generated results, with or without further modification, can
be stored in .ico, .jpg, .png, and other formats.

Now, the problem is that the FTP masters will not question the reproducibility
of a code file or a .png file. If the upstream author does not acknowledge the
use of AI during the development process, it is highly likely that nobody else
on earth will ever know.

This does not sound like a situation we can improve through any concrete
action. My only suggestion is to trust the upstream authors' acknowledgements.
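For what it's worth, the machine-readable debian/copyright format already has a
Comment field in Files paragraphs that could carry such an acknowledgement. A
hypothetical sketch (the file path, license, and wording are illustrative only;
there is no standard field for AI disclosure):

    Files: artwork/logo.png
    Copyright: 2024 Example Upstream Authors
    License: MIT
    Comment: Upstream acknowledges that this artwork was initially produced
     with the assistance of a generative AI model and was manually edited
     afterwards. The accuracy of this statement depends entirely on
     upstream's own disclosure.

Of course, this only records what upstream chooses to tell us, which is exactly
the limitation described above.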

BTW, the ML-Policy has foreseen this issue and covers it to some extent:
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.rst
See the "Generated Artifacts" section.

It seems that the draft Open Source AI Definition does not yet cover content
generated by AI models:
https://discuss.opensource.org/t/draft-v-0-0-8-of-the-open-source-ai-definition-is-available-for-comments/315

