Re: Brief update about software freedom and artificial intelligence

On Mon, 27 Feb 2023 at 19:08, Russ Allbery <rra@debian.org> wrote:

> No.  It's entirely possible that using databases as training sets for an
> AI/ML engine is fair use under existing United States law and precedent as
> long as that use is sufficiently transformative (the first factor of the
> test, and I suspect the most important one here).

Considering what you reported in the previous e-mail about US national
law (17 U.S.C. § 107, from 1976), it is not possible to use an entire
database, or a significant portion of one, for {business, commercial,
marketing} purposes without the copyright holder's permission.

Whoever says the contrary forgets that fair use was introduced to
allow non-profit activities that have a social value, plus a few
for-profit activities (like journalism) that have a social role,
although the latter may use only a very limited portion of a
copyrighted work. A very simple and straightforward example is a
newspaper article that cites a couple of paragraphs from a book or some
statistical data from a private database. There is no chance that
incorporating an entire database (or a significant part of it) for
{business, commercial, marketing} purposes would count as fair use;
otherwise the principle of copyright would be gone.

I strongly feel that this discussion cannot continue, because
presenting a mass of legal material without an understanding of the
underlying legal principles leads nowhere except to a show, like some
US trials are. Principles cannot be bent by misinterpretation,
misjudgement and ill-written law such as 17 U.S.C. § 107 (1976), in
which points (1)...(4) are written in such a way that anyone who is
not well acquainted with the principles could misunderstand up to

Point (1) does not mean that non-profit and for-profit activities
enjoy fair use equally

    (1) the purpose and character of the use, including whether such use
    is of a commercial nature or is for nonprofit educational purposes;

but it means the opposite: the two kinds of activity can make fair use
of completely different amounts of the copyrighted work

    (3) the amount and substantiality of the portion used in relation to
    the copyrighted work as a whole

and in particular, (3) also means that if I write an article of only a
few words, quoting two paragraphs of a book in it is not fair use.

One more thing: it does not matter how many trials two parties went
through before settling. What matters, as a matter of principle, is
the agreement they reached at the end, because a significant judgement
is a definitive one; otherwise it was not significant enough even to
close that specific case.

> The obvious example is
> a search engine, which performs a similar transformation of clearly
> copyrighted works into a new service with a different purpose, without the
> explicit permission of the copyright holders.

This is a completely different story, for two reasons:

1. indexing by keywords: the website manager tagged the page with that
keyword, so the content has not been accessed
2. web crawling is an automatic process that identifies keywords and
associates them with the URL

This process has nothing to do with the content, unless you would
claim that the word "cataclysm" cannot be used because it belongs to
a certain copyrighted book. Moreover, this process is completely
automated and involves no human creativity. Indexing and web crawling
are also totally different processes from AI training, with totally
different results and aims. Forget about drawing an analogy between AI
training and Google's business, because they are completely different
things.

> This is the reason why people have focused so much on GitHub Copilot's
> willingness to insert large blocks of code from other projects verbatim.
> Reproducing code from other projects is less transformative and looks more
> like simple copying, and therefore opens GitHub to a legal argument that
> their AI model is not sufficiently transformative to be fair use.

Being transformative is not the key, and incorporating large pieces of
code is not the key either. That is only the tip of the iceberg, the
part that made people realise that their code had been used. The
iceberg to handle is the learning process and, before it happens, the
input collection. Here we are: the input collection of an AI/ML
training system is what we want to keep free. Why do we want to keep
the input collection free? Because, as with compilation, that way we
also have the entire model in freedom. This is in exchange for the
right to use our code as input data.

I am pretty sure that those complaining about GitHub Copilot are not
upset because the AI is not transformative enough to masquerade their

Best regards, R-

