
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



Good afternoon,

I have followed the ongoing discussion concerning the interpretation of the DFSG in relation to AI models and would like to contribute my personal perspective. To provide a coherent framework for my argument, I will begin by outlining how I perceive Debian’s role in the broader technological ecosystem, before elaborating on the implications for AI and data openness.

Debian is widely regarded as a cornerstone in the open source landscape. Its influence spans from individual desktop users to global hyperscale deployments. The project is trusted not only by private users and educational institutions, but also by major corporations such as Amazon and Microsoft. This widespread adoption is a testament to Debian’s technical reliability and its principled commitment to software freedom.

At the heart of Debian lies a shared philosophy: to promote the distribution of free and reproducible software. While the DFSG provides a more detailed policy framework, this philosophy is fundamentally rooted in the ideals of FLOSS. It is precisely this commitment that has enabled Debian to serve as the foundation for numerous derivative systems and research initiatives.

During my dual-study programme in Business Informatics with a specialisation in Data Science, I have consistently observed a strong alignment between the FLOSS philosophy and academic best practices, particularly with regard to transparency and reproducibility.

Today, many domains contend with vast and complex data landscapes. Analysing such data, whether for research, operational insight, or innovation, is increasingly beyond the scope of manual effort alone. Machine learning and AI methods are indispensable tools in this process. However, they must be applied in a manner that is methodologically sound and ethically responsible.

During the course of my semester thesis on Retrieval-Augmented Generation (RAG), I encountered a compelling example wherein an AI model identified a previously unknown biomarker associated with cancer. This discovery was only possible because the researchers had access to the underlying dataset. Without that access, the model’s findings would have been opaque and potentially unverifiable.

This brings me to a central concern: when data scientists are given a model to work with, their first question is often:
“What data was used to train it?”
This question is not incidental. It is fundamental to understanding the model’s behaviour, biases, and limitations. It is also essential for scientific reproducibility.

In the course of the earlier email exchange, it was argued that the hardware requirements for training large-scale models place them out of reach for anyone without a budget in the range of 100 M€. While this may be true for frontier-scale models, I believe it overlooks a significant portion of real-world use cases.

In my undergraduate work, we frequently relied on publicly available datasets from sources such as Kaggle. These enabled us to train our own models, interpret results, and explore data-driven questions in a hands-on manner. Providing access to training data empowers researchers, institutions, and independent developers to create models adapted to their specific needs. Moreover, it facilitates the composability of data, an essential feature in interdisciplinary research and real-world applications.

Debian’s commitment to reproducibility and openness logically extends to the realm of AI. Distributing a model without its corresponding training data violates this principle and undermines the ability of users to validate, audit, or adapt the model for their own contexts.
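To make the reproducibility point concrete, here is a minimal, purely illustrative Python sketch. Everything in it is invented for this email (a toy random-search "training" loop, not any real Debian tooling or ML framework): with the training data and the training configuration in hand, anyone can rerun training and obtain bit-identical parameters; without the data, the resulting model cannot be independently verified.

```python
import random

def train(data, seed=42):
    """Toy 'training': fit y = a*x + b by random search (illustrative only)."""
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    best_params, best_err = (0.0, 0.0), float("inf")
    for _ in range(2000):
        a = rng.uniform(-5.0, 5.0)
        b = rng.uniform(-5.0, 5.0)
        err = sum((a * x + b - y) ** 2 for x, y in data)
        if err < best_err:
            best_params, best_err = (a, b), err
    return best_params

# The 'open dataset': because it is available, anyone can redo the run below.
data = [(x, 2.0 * x + 1.0) for x in range(10)]

model_a = train(data)
model_b = train(data)  # same data + same seed -> identical parameters
assert model_a == model_b
```

The same logic applies at any scale: shipping only the resulting parameters, without the data that produced them, leaves users unable to repeat or audit the run.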

If Debian were to allow AI models to be packaged without the accompanying data, it would risk reducing its standards to those of existing platforms such as Hugging Face, where reproducibility is often not enforced. In contrast, requiring training data to be available fosters trust, academic rigour, and long-term sustainability.

The strategic value of Debian enforcing open data is clear:

- Data scientists and developers can rely on Debian-hosted datasets being legally sound and freely reusable.
- This lowers the barrier to entry for high-quality, ethical AI development.
- It also positions Debian as a trusted ecosystem for research-grade and production-ready AI tooling.


There are, of course, multiple perspectives within this discourse, and I am fully open to engaging with alternative views or refining these points further.

In summary, my vision would include:

- Making datasets available through apt or similar tools
- Treating AI models as first-class citizens in Debian’s packaging ecosystem
- Requiring that models included in Debian main be accompanied by the training data that enables their reproducibility


Kind regards,
Arian Ott
Student in Business Informatics – Data Science
Member | IEEE
Email: arian.ott@ieee.org
LinkedIn: in/arian-ott


On Wed, 14 May 2025, 23:38 Aigars Mahinovs, <aigarius@gmail.com> wrote:
On Wed, 14 May 2025 at 23:13, Soren Stoutner <soren@debian.org> wrote:
>
> On Wednesday, May 14, 2025 1:51:27 PM Mountain Standard Time Aigars Mahinovs
> wrote:
> > That is not what I asked. Redistributing is a completely different
> > question from a different point of DFSG and even from interpretation
> > of whether DFSG even applies to the training data as such. And that in
> > turn very specifically depends on a very isolated question - what is
> > the preferred form of modification. And that is why I am
> > *specifically* asking how your opinion that "training data is the
> > prefered form of modification" works in real world examples.
> >
> > Only that specific criterion. Not about Debian, not about main or
> > non-main. Not for other people or for the project.
> >
> > What does "preferable form of modification" mean for *you*? For
> > example in that case above. Is the raw training data *really* _the_
> > preferable form of modification? Or is it the data definition? Which
> > would you *prefer* to *modify*?
>
> In my opinion, the preferred form of modification is the raw training data.  I
> apologize if I did not make this clear in my previous email.  I thought I had.

You would *actually* technically, in reality, prefer digging through
gigabytes of text files and doing some kind of manual modifications in
that sea of raw data? Modifications that are basically impossible to
track in any kind of change tracker. That are excessively hard and
time-consuming to actually do and check. Instead of just adjusting
input parameters on the ingest script? *That* is what I consider to be
frankly very hard to believe.

I rather get the impression that you prefer expressing this position
because of the logical consequences on the discussion. Especially if
you immediately change the topic from preferred form of modification to
redistribution and DFSG and main and other things that are entirely
irrelevant to the question of what is the preferred form of
modification. Technically. In practice. Not morally or spiritually.

> However, as you asked my opinion of what Debian’s policy should be, I
> endorse the above.

That is *very* explicitly *not* what I asked your opinion on. I asked
you to consider very specific examples and what is the preferred form
of modification in those cases. Really consider.

--
Best regards,
    Aigars Mahinovs


