
Re: Brief update about software freedom and artificial intelligence

"Roberto A. Foglietta" <roberto.foglietta@gmail.com> writes:

> A totally automatic procedure like web crawling and web indexing
> fits your example perfectly. However, the input collection that
> an ML/AI training system needs is a protectable work because the data
> should be structured, selected, and properly labeled, even if these
> activities are done with rules, as happens when using SQL for
> databases.

Yes, I agree, I think that a trained AI model is a protectable work.
However, it is not protectable *by you* unless you're the one who wrote
the model and chose its training.

Therefore, putting a clause in your copyright license saying that if your
work is incorporated into an AI model, that AI model as a collection is
covered by some particular license is not really a thing you can do.  The
best you can do is the standard GPL thing of saying that you don't have to
license your collection under any particular license, but if you don't,
you don't have any right to include this specific work.  Maybe that's what
you were getting at, and I just didn't understand.

That second approach of course only works if the use of the GPL-covered
work is not fair use.  If it is fair use, then the person creating the
collection can ignore any provision of the license, so we're back to the
question of whether AI training is fair use.

> So, web indexing and statistics are created over input collections
> that are *not* creative works, and these tools access every
> copyrighted work under fair use as long as they respect the robots:no
> meta-tag when it is applied to a copyrighted work. Instead, training an
> ML/AI is a completely different story, and its input collections are
> protectable collections under copyright law.

I don't think it's anywhere near that easy to distinguish a web search
index from an AI training model in copyright law.  They seem like very
similar cases to me.  A great deal of creativity and human control go into
selecting how pages are chosen for search indices (otherwise, every search
engine would be unusable due to search optimization spam), and search
engines even retain and redistribute portions of the documents they index.

My guess is that *both* of these are protectable collections.  And the
entire Internet currently assumes that building a search engine is fair
use of the Internet-accessible indexed documents, even if that search
engine is then used and marketed for commercial and business purposes, as
Google, Bing, etc. all are.

If you believe that AI training is *not* fair use, I think you're going to
have to wrestle with the substantial similarities between AI training and
the Google search engine.  I think it may prove challenging to write an
analysis that says AI training is not fair use, but Google's search
indexing is fair use.  Or, I guess, argue that Google's search indexing is
also not fair use but falls into some other exception to copyright law
like an implicit license, but there I'm *way* out of the depth of my legal

Russ Allbery (rra@debian.org)              <https://www.eyrie.org/~eagle/>
