
Re: Brief update about software freedom and artificial intelligence

On Fri, 24 Feb 2023 at 05:23, Charles Plessy <plessy@debian.org> wrote:
> Dear Mo,
> thank you for the heads-up.
> I was using permissive licenses in the past, thinking about making
> life easier for individuals, but I feel robbed by massive scraping to
> train AI models.
> Just in case I updated my email signature.
> Also, is there a DFSG-free license that forces the training dataset and
> the result of the training process to be open source if a work under that
> license is present in the training data?  Would GPLv3 be sufficient?

Dear Charles,

 imagine that you have a collection of data files, images for example,
each with its own license and copyright owner. The ML/AI trained on
that image set will produce a neural network which, for the purpose of
this example, is an N-dimensional grid of floating point numbers coded
as 64-bit values (the NN weights). To be even clearer for those who
have never trained a NN - almost all the lawyers - I will present here
a very simple and explicit example.
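For readers who prefer code to words, a minimal sketch (assuming
Python with NumPy; shape and seed are arbitrary) of what "an
N-dimensional grid of 64-bit floating point numbers" means:

```python
import numpy as np

# Hypothetical illustration: the "NN weights" are nothing more than a
# grid of floating point numbers; here a small 3-dimensional one.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4, 2))

print(weights.dtype)  # float64, i.e. 64-bit floats
print(weights.ndim)   # 3, i.e. an N-dimensional grid with N = 3
```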

- 100 images form the dataset, which is protected by the GPLv3 as a
composition, like a database

- Joe is the ML trainer, the employee who is going to train the model

- Joe legally acquired the dataset because all the licenses allow it

- Joe wrote a script that renames the images to 00.jpg .. 99.jpg and
ran it; this new dataset is still protected by the GPLv3
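A minimal sketch of what Joe's renaming script could look like (the
directory names and the copy-instead-of-rename choice are my
assumptions, not something stated above):

```python
import os
import shutil

def rename_dataset(src, dst):
    """Copy every image in src/ into dst/ as 00.jpg .. 99.jpg."""
    os.makedirs(dst, exist_ok=True)
    for i, name in enumerate(sorted(os.listdir(src))):
        shutil.copy(os.path.join(src, name),
                    os.path.join(dst, f"{i:02d}.jpg"))
```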

- Joe wrote a script that randomly chooses 90 images as the learning
set (LS) and the remaining ones as the test set (TS): these are two
sub-compositions, and both are covered by the GPLv3 because both
contain more than one file/piece of the original composition. In fact,
in the way I adopted the GPLv3 on the composition, I cannot enforce it
over a single file/piece, because in that case I would change the
license terms decided by the original author of that file/piece, and I
do not want to do that even if I could (ethics).
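A minimal sketch of the random LS/TS split described above (the
function name, seed, and file-name scheme are assumptions for
illustration):

```python
import random

def split_dataset(filenames, seed=0):
    """Shuffle the files and return (LS, TS): 90 images and the rest."""
    files = list(filenames)
    random.Random(seed).shuffle(files)
    return files[:90], files[90:]

ls, ts = split_dataset(f"{i:02d}.jpg" for i in range(100))
print(len(ls), len(ts))  # 90 10
```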

- The ML's aim is to decide whether an image contains at least one
dog: image input, binary output. Joe can thus add a picture of his own
dog to the TS; that image then becomes part of the TS composition, so
he should share it under a license compatible with the GPLv3 covering
the composition. However, Joe is smart and did not want to share his
dog picture, which is equivalent to saying that we cannot prove that
Joe put that image into the TS composition by moving the file into
that folder. From a legal point of view, though, the simple fact that
the image is used as part of the TS clearly states his will to use it
as part of the training set. So, in the end Joe is smart but honest,
and to avoid legal issues for his employer he will share his dog
picture.

- So, now the sharing pool carries a little more information: the LS,
the TS, and Joe's dog picture. One more image is only +1%, but that
image can be very tricky/important for the ML, in the same way that
some patches are a single line but make a huge difference. So quantity
is not a universal metric of contribution. Moreover, we now know which
LS/TS Joe used to obtain the NN, which in some cases could be relevant.

- Joe also needs to tag every LS image in order to back-propagate the
feedback to the NN and train it. This can be a file in which filenames
are associated with a binary label: dog or not. Again, Joe did not put
this file into the LS folder, but as described above that file is part
of the training set to which a GPLv3 composition belongs. IMHO this
means that Joe should also share this file. This information could
also be relevant, because the most expensive job is labelling the
images.
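A minimal sketch of such a label file (the file name, the labels, and
the one-line-per-image format are my assumptions):

```python
# Hypothetical label file: one line per image, "filename 1" if the
# image contains a dog, "filename 0" otherwise.
labels = {"00.jpg": 1, "01.jpg": 0, "02.jpg": 1}

with open("labels.txt", "w") as fh:
    for name, has_dog in sorted(labels.items()):
        fh.write(f"{name} {has_dog}\n")
```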

- Joe trains the NN with an ML engine, which produces the NN weights
matrix (BIN). This binary object is a derived work of a GPLv3
composition, just as a binary executable is a derived work of GPLv3
source code. Thus Joe should share the BIN as well, under GPLv3 terms,
which also obliges him to explain its inner coding (BIN + format
specifications). As you can imagine, this is another step towards
freedom: even if that BIN is supposed to run on patented hardware,
because we know the format specification we can write an emulator
- much slower, and without commercial value due to its performance,
but usable for learning purposes or to check a questionable NN.
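To make "BIN + format specifications" concrete, here is a minimal
sketch (the layout is invented for illustration) of a weights file
whose format is simple enough to document in one comment, so anyone
could re-read it without the original tooling:

```python
import struct

def write_bin(path, shape, values):
    """Hypothetical documented format: a little-endian uint32 with the
    number of dimensions, then the shape as uint32s, then the weights
    as float64 values in row-major order."""
    with open(path, "wb") as fh:
        fh.write(struct.pack("<I", len(shape)))
        fh.write(struct.pack(f"<{len(shape)}I", *shape))
        fh.write(struct.pack(f"<{len(values)}d", *values))

write_bin("nn.bin", (2, 2), [0.1, 0.2, 0.3, 0.4])
```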

- Joe tests the NN with the 10+1 images of the TS and decides whether
the NN is fine or not. If he decides that it is fine and can go into
production, then Joe's employer should share everything stated above.
Instead, if he decides that it is crap, he will trash it, and he need
not share anything, because sharing would have zero value for anyone.
This is compliant with the fair-use clause in which I explicitly added
"testing" as a condition that avoids sharing. After all, if no value
is produced, why should we force Joe to share his failure? In
particular cases a failure (a vulnerability) is valuable information,
but for security reasons it is better that Joe is not forced to comply
with the GPLv3 terms, and instead has the freedom to share only the
information that he considers safe to share in public. However, if
Joe's company does business with this - providing a PoC to a client,
say - then they have to comply with the GPLv3, because commercial and
business uses are covered by its terms.

- Joe is a student at a university and his work has nothing to do with
commercial/business purposes. However, if his university decides to
use Joe's work commercially, then they should ask Joe for all the
information that needs to be shared under GPLv3 terms. This forces Joe
to share that information when he delivers his work to his teacher, so
that the university can also store information that might or might not
be shared in the future. Again: no value produced, no need to share.
After all, Joe's work could be a completely useless failure and be
rejected. We do not need to know about it.

To invalidate the application of the GPLv3 to the NN binary, someone
would have to explain, in legal terms compliant with some law, that
training a neural network is a completely different thing from
compiling a binary from source code. By the same analogy, compiling
GPLv3 source code does not imply that you have to share under GPLv3
the proprietary compiler that was used for it, right? The same goes
for the ML training engine.

Please feel free to contact me in person to dig into any aspects
which, as AI experts or law experts, you might want to challenge or
improve. I will be happy to read/hear from you. Just take into
consideration that every relevant discovery (good or bad) about this
new way of using the GPLv3 will be shared here or wherever else I
decide to share it. So, if you are under an NDA - I am not - then do
not write/talk to me, or do so at your own risk. :-)

I hope this helps,
Roberto A. Foglietta
