On Sun, 4 May 2025 at 17:30, Wouter Verhelst <
w@uter.be> wrote:
It is incorrect, because the New York Times did in fact file suit
against Microsoft, OpenAI, and other parties related to copyright
infringement of their large library of news articles in creating
ChatGPT[1]. The case is still in court.
[1] https://www.courtlistener.com/docket/68117049/the-new-york-times-company-v-microsoft-corporation/
Thanks for this link, it has been a very interesting read. I have not read all documents, but the ones I have read paint this picture:
NYT claims copyright infringement (document 1)
* NYT claims that Microsoft claims that "their conduct is protected as “fair use” because their unlicensed use of copyrighted content to train GenAI models serves a new “transformative” purpose.", a position NYT disputes
* New York Times sues them for copyright infringement on the *outputs*, specifically noting: "Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use."
* Also explicitly noted is that Microsoft (via Bing) already (legally) provides users with snippets of New York Times content, but in smaller amounts than can be extracted from their AI models
* NYT also makes a claim that "These systems were used to create multiple reproductions of The Times’s intellectual property for the purpose of creating the GPT models that exploit and, in many cases, retain large portions of the copyrightable *expression* contained in those works."
* NYT claims that "Unauthorized Reproduction of Times Works During GPT Model Training" happened
* NYT claims that "Embodiment of Unauthorized Reproductions and Derivatives of Times Works in GPT Models" happened, and as evidence produces outputs from the models that reproduce several paragraphs from NYT articles nearly perfectly.
* NYT claims that "Unauthorized Public Display of Times Works in GPT Product Outputs" happened, as shown in the previous point
* When NYT includes queries with their exhibits, they are *very* specific - not asking a generic question, but specifically asking what NYT says about something, what the content of a specific NYT article is, or what a specific NYT author wrote about a particular place
* It is not clear from the text of the claim whether the actual article text is indeed inside the model, or whether it is being retrieved and mixed into the response context based on the very specific query
* NYT claims that "Unauthorized Retrieval and Dissemination of Current News" happened - similar to the Bing News point, but quoting more of the content
* and also NYT claims that model hallucinations attribute to NYT things that NYT did not publish
* notably, the provided example does *not* include NYT in the text of the query, so it would not trigger retrieval of specific articles for reference
* NYT claims: "In the alternative, to the extent an end-user may be liable as a direct infringer based on output of the GPT-based products, Defendants materially contributed to and directly assisted with the direct infringement perpetrated by end-users of the GPT-based products" - specifically, they claim that (should the court decide the actual infringers are the end users) Microsoft is still liable for enabling such requests and responses
OpenAI responds (document 52)
* OpenAI claims that their models will not produce verbatim copies of the NYT articles in normal use, and alleges manipulation of the interface, including queries that embed the text of the articles in the context of the question (directly, via upload, or via a retrievable link)
* OpenAI claims that it is "fair use under copyright law to use publicly accessible content to train generative AI models to learn about language, grammar, and syntax, and to understand the facts that constitute humans’ collective knowledge", as neither facts nor rules of language are copyrightable
* “The general rule of law is, that the noblest of human productions—knowledge, truths ascertained, conceptions, and ideas—become, after voluntary communication to others, free as the air to common use.” is quoted as a foundational part of copyright law
* OpenAI claims that some actions (like gathering the data sets) happened more than three years ago and are therefore outside the statute of limitations
* OpenAI claims that contributing to copyright infringement by end users requires actual knowledge of specific infringement - a generic possibility is not sufficient
* While explaining the LLM training process, OpenAI also describes how data sets like "WebText", WebText2 and Common Crawl were used for training - such data sets (held and distributed by third parties, not Debian) could be used for reproduction of (otherwise) free models
* OpenAI specifically calls out that their early models were surprisingly able to translate from French to English, despite the training data being specifically cleaned of non-English sources
* OpenAI claims that "Indeed, it has long been clear that the non-consumptive use of copyrighted material (like large language model training) is protected by fair use" … "Since Congress codified that doctrine in 1976 (courts should “adapt” defense to “rapid technological change”), courts have used it to protect useful innovations like home video recording, internet search, book search tools, reuse of software APIs, and many others."
* "These precedents reflect the foundational principle that copyright law exists to control the dissemination of works in the marketplace—not to grant authors “absolute control” over all uses of their works."
* "Copyright is not a veto right over transformative technologies that leverage existing works internally—i.e., without disseminating them—to new and useful ends, thereby furthering copyright’s basic purpose without undercutting authors’ ability to sell their works in the marketplace"
* OpenAI claims that model regurgitation and hallucination are uncommon and undesirable properties of the models.
* Regurgitation can often happen when some text appears many times in the training data in the same form, because it has already been copied to many diverse sources.
* OpenAI explains that hallucinations show the actual, statistical basis for the responses
* OpenAI claims that NYT's claims are misleading: even when asked for specific quotes from specific articles, the model would actually output scattered parts of those articles, which NYT's complaint cut down to create an impression of precise recall
* OpenAI notes that NYT chose to only query articles that are between 2.5 and 20 years old (and that have, presumably, been quoted around the web)
Microsoft responds (document 65)
* also detailing that the NYT prompts that caused quoting of the NYT articles involved *very* specific and unrealistic queries that often included whole paragraphs of the specific articles the prompt was fishing for
The documents after that are mostly fighting about rules of discovery and trying to dismiss some charges in advance of actual arguments.
Comments from the judge seem to indicate the focus on two questions:
* whether use of copyrighted material in the training process is fair use on the basis of the use being sufficiently transformative
* under what specific conditions it is or is not possible to get the model to reproduce parts of copyrighted materials
My opinion:
The question of the models themselves being derived works of the training data comes up only in the context of verbatim copies of inputs appearing in outputs. The provided examples of such cases look extensively doctored to me (and to OpenAI's experts), to the point of the user providing half of a widely quoted article and then seeing the model statistically continue the article in some semi-random parts. At that point it is not storage, but simply abuse of statistics, and modern models are protected against such problems. The models are literally too small to contain all the articles of the training data, even with the best compression. I believe that OpenAI and Microsoft will be able to show that the queries NYT provided as examples are themselves introducing the copyrighted material.
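A rough back-of-envelope sketch of that size argument (all numbers below are my own illustrative assumptions, not figures from the filings or from OpenAI):

```python
# Back-of-envelope: can model weights store the training corpus verbatim?
# ASSUMED figures for illustration only - not numbers from the court documents.

model_params = 175e9        # assumed GPT-3-scale parameter count
bytes_per_param = 2         # fp16 weight storage
model_bytes = model_params * bytes_per_param   # ~350 GB of weights

corpus_bytes = 45e12        # assumed ~45 TB of filtered training text

ratio = corpus_bytes / model_bytes
print(f"Corpus is ~{ratio:.0f}x larger than the weights")
```

Under these assumptions the corpus is over a hundred times larger than the weights, so even a very good text compressor could not make the weights hold it verbatim; at most, fragments that recur many times across the corpus can end up memorized.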
I personally see as most relevant the image thumbnail case (Kelly v. Arriba Soft), the Richard Prince collage case (Cariou v. Prince), and the Google Books case (Authors Guild v. Google).
The thumbnail case shows that even a technically trivial and deterministic pure-software transformation of an image into a smaller image can be fair use in the right context.
And the Richard Prince case shows that even reproducing full copies of multiple copyrightable works in a new work and distributing it can be transformative fair use.
Google Books indexed millions of books submitted by libraries for full-text search across the books (a database of the actual texts of the books) and would provide users with excerpts from the copyrighted books as part of responses to queries.
All those cases were seen as fair use and thus not infringing on copyright of the original works.
--
Best regards,
Aigars Mahinovs