Some fun facts about AI and a few thoughts about the software ecosystem trend

To: debian-project@lists.debian.org
Cc: debian-ai@lists.debian.org
Subject: Some fun facts about AI and a few thoughts about the software ecosystem trend
From: "M. Zhou" <lumin@debian.org>
Date: Wed, 29 Mar 2023 00:35:48 -0400
Message-id: <fa5688680f7830453f67c48559002533cf5a6caf.camel@debian.org>

Hi folks,

I seem to be good at starting lengthy mailing list threads. That said, I came across
some fun facts while reading papers today. I think this is also a good chance for
me to write down a batch of other relevant thoughts.

TL;DR: well, I wrote too much again. I fed my mail to ChatGPT for a brief
summary. The following are two versions. They are plain copies without my edits.

<<< begin TLDR version 1 generated by chatgpt
The email consists of various interesting points related to AI and licensing issues in software. The first point
highlights how some state-of-the-art LLMs avoid using GPL code. The author discusses the potential audiences for a
license revision or a brand new license. The second point mentions the importance of reproducibility of LLMs. The third
point explains how AI's impact on the software ecosystem is increasing, which is inevitable, and it is challenging to
enforce the declaration of AI usage. Finally, the author discusses some ongoing and future works on the Debian side
related to AI and licensing issues.
>>> end TLDR version 1 generated by chatgpt

<<< begin TLDR version 2 generated by chatgpt
The email discusses some fun facts and relevant thoughts related to licensing issues and the impact of AI in the
software ecosystem. The author mentions that the licensing issue of trained deep neural networks and their outputs is
complicated, and some AI software projects may avoid GPL code usage in their training data to prevent potential
licensing issues. The author also discusses the increasing impact of AI in the software ecosystem, and its potential use
in generating code, images, and texts. The email concludes by mentioning ongoing and future works on the Debian side.
>>> end TLDR version 2 generated by chatgpt

--[[ Fun Fact 1: GPL code usage may be avoided in state-of-the-art LLMs [2]

LLaMA [3] is one of the state-of-the-art LLMs that you can download and deploy
on a local machine. Its training data includes GitHub, but the authors only use
software projects licensed under Apache-2, BSD, and MIT.
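
As a rough illustration of the kind of license filtering described above, here
is a minimal sketch. It is not the LLaMA authors' pipeline; the record layout,
the function name keep_permissive, and the SPDX-style identifiers are all
assumptions made for the example.

```python
# Hypothetical allowlist mirroring the license families named above.
# The SPDX-style identifiers are an assumption, not taken from the paper.
PERMISSIVE = {"apache-2.0", "bsd-2-clause", "bsd-3-clause", "mit"}


def keep_permissive(repos):
    """Return only the repos whose declared license is on the allowlist."""
    return [r for r in repos if (r.get("license") or "").lower() in PERMISSIVE]


# Made-up records, for illustration only.
repos = [
    {"name": "a", "license": "MIT"},
    {"name": "b", "license": "GPL-3.0"},
    {"name": "c", "license": None},  # no declared license: dropped
    {"name": "d", "license": "Apache-2.0"},
]
kept = [r["name"] for r in keep_permissive(repos)]
print(kept)
```

An allowlist drops anything with a missing or unrecognized license, which is
the conservative choice if the goal is to stay away from licensing trouble.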

The licensing issue of trained deep neural networks, as well as of the outputs
of neural networks (such as generated texts, generated code, generated images,
etc.) is already a mess. That said, at least a part of the research community
surely knows the complicated implications of using GPL code for training.
Otherwise they would not have to avoid a pile of high-quality code.

People mentioned some potential licensing work in the previous related thread [1].
But I don't see a clear and practical goal for the free software community to reach.
There are two types of potential audiences for a license revision or a brand new license.

(1) The first type is free software authors. If the authors do not want their
code to become a part of the super AI that will destroy the world someday [4],
some special licenses or some special clauses can be used to prevent its use
in AI training datasets. But isn't it funny that "training a neural network"
would be excluded from software freedom?

Meanwhile, excluding this code from the training datasets won't hurt
the LLM trainers because a large portion of differently licensed projects
is still usable.

(2) The second type of potential audience is AI software upstreams. In my opinion,
there is almost nothing for the free software communities to do here.
If we write some license terms that look funny to the AI software upstreams,
they will simply not play with these licenses.

--[[ Fun Fact 2: Reproducibility of LLMs

The LLaMA paper [3] emphasizes that the training set of these models only involves
publicly available datasets (no proprietary hidden datasets, no undocumented datasets).
I can see that the research community will complain about reproducibility long
before the downstream software communities do.

--[[ Recall 1: ML-Policy

If I have to trim the ML-Policy into one single sentence, then it will be the definition
of "toxic candy" -- a pre-trained neural network that is somehow (very likely incorrectly)
licensed under an open source software license is still very likely problematic.

This will be more and more useful as more software projects try to integrate
neural networks for interesting applications. It works as a warning when you see
a giant binary blob (sometimes the network can be small... only several megabytes or so)
in the upstream source, regardless of its license.
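
To make the "warning" idea concrete, here is a minimal sketch of a scan over an
unpacked upstream source tree. The suffix list, the size cutoff, and the
function name find_suspect_blobs are assumptions chosen for illustration. They
are not part of the ML-Policy, and a hit only means "look closer", not "this is
a toxic candy".

```python
from pathlib import Path

# Assumed heuristics: common weight-file suffixes plus a size cutoff.
MODEL_SUFFIXES = {".pt", ".pth", ".onnx", ".pb", ".h5",
                  ".tflite", ".safetensors", ".ckpt"}
SIZE_LIMIT = 1024 * 1024  # 1 MiB


def find_suspect_blobs(root, size_limit=SIZE_LIMIT):
    """List (relative path, size) for files that may be pre-trained networks."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        size = path.stat().st_size
        # Small networks exist, so the suffix check matters as much as size.
        if path.suffix.lower() in MODEL_SUFFIXES or size >= size_limit:
            hits.append((str(path.relative_to(root)), size))
    return hits
```

Calling find_suspect_blobs("path/to/unpacked/source") returns the candidates
for a manual review of where the file came from and how it was trained.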

--[[ Fun Fact 3: AI's impact on the software ecosystem is increasing

Even if the copyright/licensing issue of neural networks is still a mess,
the trend is not stoppable. If you keep an eye on the GitHub trending list,
you will see the share of AI software climbing.
Even if we hesitate to introduce some AI software into our archive, the impact
of AI will gradually flow into our free archive, inevitably:

(1) a code snippet might be generated by AI, and modified by the upstream
author without declaring the participation of AI.
(2) documentation texts might be generated by AI. With a state-of-the-art
LLM, you can simply throw in your undocumented code snippet and let
it explain what the piece of code does.
(3) pictures, icons, and SVGs might be generated by AI.
(4) ...

It is impossible to enforce the declaration of AI usage everywhere applicable.
Even worse, detecting AI-generated results is largely a dead end --
the goal of generative AI is exactly to produce indistinguishable results.
As long as the AI is strong enough, detecting it will be nearly impossible.
There are some papers about detection, but I'll refrain from expanding
on this.

--[[ Recall 2: SIMDebian

This is a deprecated attempt to bump the ISA baseline in order to use
modern CPU intrinsics. One of my motivations for proposing it was that
neural network computation can be brutal. Bumping the ISA baseline
helps significantly if you run it on the CPU.

Just for reference. However, as long as the user has a GPU, running a neural
network on the CPU is almost nothing but a waste of time.
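
For readers unfamiliar with what an ISA baseline means in practice, here is a
minimal sketch that checks whether a machine offers the instruction set
extensions a raised baseline would require. It assumes the Linux x86
/proc/cpuinfo layout. The BASELINE set is hypothetical and is not the baseline
SIMDebian actually proposed.

```python
def cpu_flags(cpuinfo_text):
    """Collect feature flags from /proc/cpuinfo-style text (Linux x86 assumed)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


# Hypothetical raised baseline, spelled the way /proc/cpuinfo spells flags.
BASELINE = {"avx2", "fma"}


def meets_baseline(cpuinfo_text, baseline=BASELINE):
    """True if every flag in the baseline is advertised by the CPU."""
    return baseline <= cpu_flags(cpuinfo_text)


# Shortened, made-up cpuinfo excerpt for illustration.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2 fma\n"
print(meets_baseline(sample))
```

On a real Linux machine the input would be the contents of /proc/cpuinfo.
Binaries built for a raised baseline will not run on CPUs that fail this check.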

--[[ Recall 3: Debian User Package Repository

This is a deprecated attempt to create an ebuild-like source-based
distribution for .deb packages. One of the motivations for proposing it
was that redistributing AI software with neural networks through the archive is
problematic ... but it is OK if the neural network is downloaded by the
user through a script locally, and the package is built locally by the
end user. As for the problematic licensing issue... anyway the software
works, and the components in question are not distributed by us.

Just for reference. This is not important now. Surely there are too many
non-standard ways to install software.

--[[ Some ongoing and future works on the Debian side

Debian always provides a solid base system [5], upon which upper-layer
application collections like PyPI, Anaconda, and Docker work very
well. For many intricate reasons, such as the clearly limited volunteer
bandwidth, the Debian archive is not suitable as an alternative to these ecosystems.
I'll refrain from expanding on this to avoid going off topic. Please ask
if you want to read more on this.

That said, we can still incorporate some of the most important software
infrastructure in our archive, such as deep learning frameworks and
neural network acceleration libraries. Upper-layer applications are
not discussed here.

PyTorch is currently the most prevalent deep learning framework. It is in
good shape in our archive as well. A random trending AI project on GitHub
will most likely be based on PyTorch nowadays. I have recently uploaded the
CUDA version of PyTorch to the NEW queue. While I can still handle
this package on my own, its compilation and testing are brutal [7]. You are
welcome to join me in the maintenance if you are interested...

In my opinion, TensorFlow will gradually fade away in favor of JAX [6].
I really don't recommend that anyone pursue TensorFlow packaging as of 2023.
I have already orphaned the whole TensorFlow dependency tree under my name.

(I acknowledge that I'm a PyTorch user and that I'm biased against TensorFlow's
obscure API and terrible documentation.)

See below if you want to get involved.

--[[ Team Advertisement

Debian Deep Learning Team <debian-ai@lists.debian.org> welcomes
new contributors. The mailing list is currently abused for general discussion
and two tracks of development work:

1. https://salsa.debian.org/deeplearning-team
Deep Learning frameworks

2. https://salsa.debian.org/rocm-team
ROCm is AMD's free software counterpart to Nvidia's proprietary CUDA.
(I wouldn't bother to create this team if it were non-free)

There could be an Intel/SYCL team in the future. But Intel is not yet ready
to upstream their SYCL implementation into LLVM. I'll only try this
myself when PyTorch starts to support Intel/SYCL.

Thanks for reading the long mail.
Hope you find some interesting topics and inspirations here.

[1] https://lists.debian.org/debian-project/2023/02/msg00017.html
[2] LLM = Large Language Model, such as GPT-3, GPT-4, etc.
[3] https://arxiv.org/pdf/2302.13971.pdf
[4] Yes, please write as many bugs as possible in your code. Your bugs could
be heroic if they choke a super AI trying to destroy the world. (I'm not serious)
[5] IIRC, one tenet of the UNIX philosophy goes, "do one thing, and do it well".
[6] https://github.com/google/jax
[7] Debomatic-amd64 has a Xeon E5-2697v3 (IIRC). It takes ~3 hours
for a full build and checks of the CPU version of PyTorch without ccache.
The CUDA version will only take longer.
