On Wednesday, May 14, 2025 5:04:03 PM Mountain Standard Time Arian Ott wrote:
> During the course of my semester thesis on Retrieval-Augmented Generation
> (RAG), I encountered a compelling example wherein an AI model identified a
> previously unknown biomarker associated with cancer. This discovery was
> only possible because the researchers had access to the underlying dataset.
> Without that access, the model’s findings would have been opaque and
> potentially unverifiable.
>
> This brings me to a central concern: when data scientists are given a model
> to work with, their first question is often:
> “What data was used to train it?”
> This question is not incidental. It is fundamental to understanding the
> model’s behaviour, biases, and limitations. It is also essential for
> scientific reproducibility.

That is a good, concrete example.  It is interesting that access to the
original training data has value that goes beyond a desire to retrain the
model and extends into *using* the model to its fullest extent.

> In the course of the earlier email exchange, it was argued that the
> hardware requirements for training large-scale models place them out of
> reach for anyone without a budget in the range of 100 M€. While this may be
> true for frontier-scale models, I believe it overlooks a significant
> portion of real-world use cases.
>
> In my undergraduate work, we frequently relied on publicly available
> datasets from sources such as Kaggle. These enabled us to train our own
> models, interpret results, and explore data-driven questions in a hands-on
> manner. Providing access to training data empowers researchers,
> institutions, and independent developers to create models adapted to their
> specific needs. Moreover, it facilitates the composability of data, an
> essential feature in interdisciplinary research and real-world applications.

Out of curiosity, how much hardware did you need to train your own models on
these data sets?
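As a rough illustration of the point: training a small classifier of the kind
used in such coursework needs only a laptop CPU, not frontier-scale budgets.
The sketch below uses a synthetic dataset and pure standard-library Python (an
assumption for illustration; a real exercise would load a public dataset such
as one from Kaggle):

```python
import math
import random

random.seed(0)

# Synthetic stand-in for a small public dataset: two features, binary label.
# (Assumed data for illustration -- not a real Kaggle dataset.)
X = [(random.random(), random.random()) for _ in range(200)]
y = [1 if x1 + x2 > 1.0 else 0 for (x1, x2) in X]

# Logistic regression trained by plain batch gradient descent -- no GPU needed.
w1, w2, b = 0.0, 0.0, 0.0
lr = 0.5
for _ in range(500):
    g1 = g2 = gb = 0.0
    for (x1, x2), label in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
        err = p - label
        g1 += err * x1
        g2 += err * x2
        gb += err
    n = len(X)
    w1 -= lr * g1 / n
    w2 -= lr * g2 / n
    b -= lr * gb / n

# Because we hold the training data, the model's behaviour can be checked
# directly against it rather than treated as opaque.
correct = sum(
    (1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b))) > 0.5) == bool(label)
    for (x1, x2), label in zip(X, y)
)
accuracy = correct / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

This runs in well under a second on commodity hardware, which is the scale of
"training your own model" that most non-frontier use cases actually involve.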
I think sometimes we forget that many machine-learning models are trained on
data sets much smaller than the web-scale corpora scraped for LLMs.

-- 
Soren Stoutner
soren@debian.org