Re: A Different Take on AI
On Fri, 7 Feb 2025 at 16:04, Stefano Zacchiroli <zack@debian.org> wrote:
> I don't think we should focus our conversation on LLMs much, if at all.
While I agree LLMs tend to be the tail wagging the dog in AI/ML
discussion, the thread focuses on LLMs and the resulting policy will
apply to them.
> The reason is that, even if a completely free-as-in-freedom (including
> in its training dataset), high quality LLM were to materialize in the
> future, its preferred form of modification (which includes the dataset)
> will be practically impossible to distribute by Debian due to its size.
There are several candidates already, including Ai2's OLMo 2[1] and Pleias[2]:
"They Said It Couldn’t Be Done[3]
Training large language models required copyrighted data until it did
not. [...] These represent the first ever models trained exclusively
on open data, meaning data that are either non-copyrighted or are
published under a permissible license. These are the first fully EU AI
Act compliant models. In fact, Pleias sets a new standard for safety
and openness."
Given these provide a foundation on which future developers can build,
as well as an example others can follow, there will be many more.
Conversely, if we propagate the myth that these are too
big/hard/costly to create with today's tools, let alone tomorrow's,
then we run the risk that people believe us. Not long ago, even obtaining a
computer that could download and compile software was out of the reach
of most!
On the "preferred form" (wording from the OSD rather than the DFSG),
this is subjective and will differ from one person to another.
While Sam may possess the tools and techniques to assess and address
bias to some extent with weights only, if I as a security researcher
or data protection officer need to detect and entirely eliminate
problematic content (e.g., backdoors or "right to be forgotten"
requests) then the *only* form I can accept is the training data, thus
making it my "preferred form". I can't just say to a privacy
commissioner or judge "there was only a 0.7% chance patients' medical
records would be revealed, your honour". While Sam's tools are
improving, so are tools that can reverse the training process (e.g.,
DLG/iDLG for model inversion which "stands out due to its ability to
extract sensitive information from the training dataset and compromise
user privacy"[4]).
Just as the software vendor doesn't get to tell users what constitutes
an improvement for the purposes of the free software definition, we
don't get to tell practitioners what the subjective "preferred form"
means. That's why I prefer the objective "actual form" Sam referred to
in suggesting "We look at what the software authors *actually do* to
modify models they incorporate to determine the preferred form of
modification". I guarantee some will reach for the data, so it must be
included for that freedom to be fully protected.
> So when we think of concrete examples, let's focus on what could be
> reasonably distributed by Debian. This includes small(er) generative AI
> language models, but also all sorts of *non-generative* AI models, e.g.,
> classification models. The latter do not generate copyrightable content,
> so most of the issues you pointed out do not apply to them.
We can't make a valid decision or draft a policy focusing on models
which have no ability to create output that violates copyrights, only
to then put the project, its derivatives, and users in legal hot
water with others that do. You do raise a good point about what we can
reasonably distribute with Debian, and many models would already
exceed our current capacity (even without the dependencies required
for reproducibility). This is a solvable problem though, and it's
better to deliver utility to our users by solving it than compromise
on our principles or give up altogether. Common Crawl don't host their
own dumps, for example.
> Other issues
> still apply to them, including biases analyses (at a scale which *is*
> manageable, addressing some of the issues pointed out by hartmans), and
> ethical data sourcing.
I'm not sure I accept that relying on fair use for training only to
then incite direct infringement by users through deliberate or
inadvertent reproduction per proposed policies can be considered
"ethical data sourcing". Even if fair use did extend to cover
infringing model outputs, it would clearly be against the wishes of
the authors. This much is clear from the various generative AI
lawsuits already underway[5], including a class action against
Bloomberg[6], who joins Software Heritage in the small and shrinking
group of OSAID endorsers[7].
 - samj
1. https://allenai.org/blog/olmo2
2. https://simonwillison.net/2024/Dec/5/pleias-llms/
3. https://huggingface.co/blog/Pclanglais/common-models
4. https://arxiv.org/abs/2501.18934v1
5. https://generative-ai-newsroom.com/the-current-state-of-genai-copyright-lawsuits-203a1bd0f616
6. https://admin.bakerlaw.com/wp-content/uploads/2024/01/ECF-74-Amended-Complaint.pdf
7. https://opensource.org/ai/endorsements