
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Wednesday, May 14, 2025 5:04:03 PM Mountain Standard Time Arian Ott wrote:
> During the course of my semester thesis on Retrieval-Augmented Generation
> (RAG), I encountered a compelling example wherein an AI model identified a
> previously unknown biomarker associated with cancer. This discovery was
> only possible because the researchers had access to the underlying dataset.
> Without that access, the model’s findings would have been opaque and
> potentially unverifiable.
> 
> This brings me to a central concern: when data scientists are given a model
> to work with, their first question is often:
> “What data was used to train it?”
> This question is not incidental. It is fundamental to understanding the
> model’s behaviour, biases, and limitations. It is also essential for
> scientific reproducibility.

That is a good, concrete example.  It is interesting that access to the 
original training data has value that goes beyond a desire to retrain the 
model and extends into *using* the model to its fullest extent.
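
To make that concrete with RAG specifically (a minimal sketch of my own, with 
a made-up three-document corpus, not anything from the study you describe): 
the retrieval half of a RAG pipeline consults the underlying data at query 
time, so the data set is needed to *run* the system, not merely to retrain it.

# Minimal retrieval step of a RAG pipeline (illustrative sketch).
# The corpus below is hypothetical, standing in for the researchers'
# data set; the point is that it is read at inference time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Elevated protein X correlates with tumor progression.",
    "Dietary fiber intake and gut microbiome diversity.",
    "Protein X expression in healthy tissue samples.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

query = "Which biomarker is associated with cancer?"
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity; the top hit would be handed to
# the generator as context for its answer.
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()
print(f"Retrieved context (score {scores[best]:.2f}): {corpus[best]}")

Without the corpus, the retrieval step above has nothing to search, which is 
exactly the opacity problem you describe.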
 
> In the course of the earlier email exchange, it was argued that the
> hardware requirements for training large-scale models place them out of
> reach for anyone without a budget in the range of 100 M€. While this may be
> true for frontier-scale models, I believe it overlooks a significant
> portion of real-world use cases.
> 
> In my undergraduate work, we frequently relied on publicly available
> datasets from sources such as Kaggle. These enabled us to train our own
> models, interpret results, and explore data-driven questions in a hands-on
> manner. Providing access to training data empowers researchers,
> institutions, and independent developers to create models adapted to their
> specific needs. Moreover, it facilitates the composability of data, an
> essential feature in interdisciplinary research and real-world applications.

Out of curiosity, how much hardware did you need to train your own models on 
these data sets?  I think we sometimes forget that many ML models are trained 
on data sets far smaller than the web-scale corpora scraped together for LLMs.
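
For a sense of scale, here is an illustrative sketch (my example, using 
scikit-learn's bundled breast cancer data set, about 569 rows, as a stand-in 
for a Kaggle-sized problem); it trains in well under a second on a laptop:

# Illustrative only: a classifier on a small public data set needs
# commodity hardware, nothing like a 100 M-euro training budget.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")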

-- 
Soren Stoutner
soren@debian.org
