Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models
Thanks for your real-world input! It helped me clarify my thinking on a
few technical and societal impacts.
One point I want to clarify (with the comments below) is: what is the
practical difference between Debian including in its mirrors a roughly
100 TiB crawl snapshot such as CC-MAIN-2025-18 (the one listed in
crawl-data/CC-MAIN-2025-18/warc.paths.gz, see
https://commoncrawl.org/blog/april-2025-crawl-archive-now-available)
versus having a field in the Debian source package metadata that simply
contains the HTTPS link to that same data? The latter is *far* easier
for Debian to do and maintain.
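To make that concrete, such a metadata field could look something along
these lines (the field names are made up purely for illustration,
nothing like them exists in current Policy):

    Training-Data-URL: https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz
    Training-Data-SHA256: <checksum recorded at packaging time>

A couple of lines like that pin down the exact frozen snapshot, while
the bytes themselves stay on the dataset provider's infrastructure.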
On Thu, 15 May 2025 at 02:04, Arian Ott <arian.ott@ieee.org> wrote:
> In my undergraduate work, we frequently relied on publicly available datasets from sources such as Kaggle. These enabled us to train our own models, interpret results, and explore data-driven questions in a hands-on manner. Providing access to training data empowers researchers, institutions, and independent developers to create models adapted to their specific needs. Moreover, it facilitates the composability of data, an essential feature in interdisciplinary research and real-world applications.
I wanted to highlight this part - there already exist organisations out
there that gather and maintain datasets and provide access to them,
including access to frozen snapshots that never change, be it Kaggle,
Common Crawl or others. It would take a very specific need for Debian
to duplicate their efforts and take on *massive* infrastructure
commitments as well as the legal risk.
> Debian’s commitment to reproducibility and openness logically extends to the realm of AI. Distributing a model without its corresponding training data violates this principle and undermines the ability of users to validate, audit, or adapt the model for their own contexts.
That is a good point. But the same can be achieved by simply pointing
to the relevant dataset snapshot hosted by a dataset provider.
> If Debian were to allow AI models to be packaged without the accompanying data, it would risk reducing its standards to those of existing platforms such as Hugging Face, where reproducibility is often not enforced. In contrast, requiring training data to be available fosters trust, academic rigour, and long-term sustainability.
To enforce reproducibility Debian would need to actually spend the
resources to retrain the models. I do not think that is feasible at
this point, and it remains possible to do even when the data is hosted
outside Debian. The assumption here is that all the models we are
talking about are sufficiently complex that full retraining will not be
part of the regular process of compiling a Debian source package into a
Debian binary package (which Debian does *quite* often).
> The strategic value of Debian enforcing open data is clear:
> Data scientists and developers can rely on Debian-hosted datasets being legally sound and freely reusable.
> This lowers the barrier to entry for high-quality, ethical AI development.
> It also positions Debian as a trusted ecosystem for research-grade and production-ready AI tooling.
Here as well, I do not believe it is actually necessary for Debian to
host and redistribute the data to achieve that. I do not think there
would be an additional practical benefit to doing so. Pointing from
Debian metadata to a particular snapshot of a particular Kaggle or
Common Crawl dataset would suffice for any reproduction or modification
work.

There is also a thorny problem: most datasets may *only* be freely
usable by researchers and for data-mining purposes, and very explicitly
may not be usable for anything else.
> In summary, my vision would include:
> Making datasets available through apt or similar tools
We already have a separation between "apt-get download/install" that
gets binary packages and "apt-get source" that downloads source
packages. And the source packages can have extra targets inside their
debian/rules Makefile.
Downloading the training dataset and re-training the model could be
separate (Policy-defined) "debian/rules" targets, without having to
put multiple 100TiB files onto the Debian mirrors and store them for
decades.
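A minimal sketch of what that could look like in debian/rules (the
target names, URL and retraining script are invented here for
illustration, none of this is defined in Policy today):

    # Optional targets, never invoked during a normal package build
    TRAINING_DATA_URL = https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz

    get-training-data:
            wget -P training-data/ $(TRAINING_DATA_URL)

    retrain: get-training-data
            ./retrain.sh training-data/

    .PHONY: get-training-data retrain

Since dpkg-buildpackage would never call these targets, neither the
mirrors nor the buildds ever have to see the dataset - anyone wanting
to reproduce the training runs them locally.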
> Treating AI models as first-class citizens in Debian’s packaging ecosystem
> Enforcing that models included in Debian main must be accompanied by the training data that enables their reproducibility
Enforcing that the external training data is still accessible could be
as simple as doing an HTTPS HEAD request to the specified source data
URLs as part of the package tests and failing if the file is no longer
offered or if its size has changed. It would be a good addition to
Policy for such models, if we ever get that far.
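As a rough sketch, such a check could be an autopkgtest along these
lines (the URL and expected size are placeholders that the maintainer
would record at packaging time):

    #!/bin/sh
    # debian/tests/training-data-available (sketch)
    set -e

    URL="https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz"
    EXPECTED_SIZE=123456789   # bytes, placeholder value

    # HEAD request only - nothing is actually downloaded
    SIZE=$(curl -sIL "$URL" | tr -d '\r' \
        | awk 'tolower($1) == "content-length:" {print $2}' | tail -n 1)

    test "$SIZE" = "$EXPECTED_SIZE"

If the provider drops or silently replaces the file, the test starts
failing and that shows up through the usual package testing
infrastructure.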
--
Best regards,
Aigars Mahinovs