On Wednesday, May 14, 2025 5:04:03 PM Mountain Standard Time Arian Ott wrote:
> During the course of my semester thesis on Retrieval-Augmented Generation
> (RAG), I encountered a compelling example wherein an AI model identified a
> previously unknown biomarker associated with cancer. This discovery was
> only possible because the researchers had access to the underlying dataset.
> Without that access, the model’s findings would have been opaque and
> potentially unverifiable.
>
> This brings me to a central concern: when data scientists are given a model
> to work with, their first question is often:
> “What data was used to train it?”
> This question is not incidental. It is fundamental to understanding the
> model’s behaviour, biases, and limitations. It is also essential for
> scientific reproducibility.

That is a good, concrete example.  It is interesting that access to the
original training data has value that goes beyond a desire to retrain the
model and extends into *using* the model to its fullest extent.

> In the course of the earlier email exchange, it was argued that the
> hardware requirements for training large-scale models place them out of
> reach for anyone without a budget in the range of 100 M€. While this may be
> true for frontier-scale models, I believe it overlooks a significant
> portion of real-world use cases.
>
> In my undergraduate work, we frequently relied on publicly available
> datasets from sources such as Kaggle. These enabled us to train our own
> models, interpret results, and explore data-driven questions in a hands-on
> manner. Providing access to training data empowers researchers,
> institutions, and independent developers to create models adapted to their
> specific needs. Moreover, it facilitates the composability of data, an
> essential feature in interdisciplinary research and real-world applications.

Out of curiosity, how much hardware did you need to train your own models on
these data sets?
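As a rough illustration of the point: training a small classifier of the kind
used in such coursework needs only a laptop CPU, not frontier-scale budgets.
The sketch below uses a synthetic dataset and pure standard-library Python (an
assumption for illustration; a real exercise would load a public dataset such
as one from Kaggle):

```python
import math
import random

random.seed(0)

# Synthetic stand-in for a small public dataset: two features, binary label.
# (Assumed data for illustration -- not a real Kaggle dataset.)
X = [(random.random(), random.random()) for _ in range(200)]
y = [1 if x1 + x2 > 1.0 else 0 for (x1, x2) in X]

# Logistic regression trained by plain batch gradient descent -- no GPU needed.
w1, w2, b = 0.0, 0.0, 0.0
lr = 0.5
for _ in range(500):
    g1 = g2 = gb = 0.0
    for (x1, x2), label in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
        err = p - label
        g1 += err * x1
        g2 += err * x2
        gb += err
    n = len(X)
    w1 -= lr * g1 / n
    w2 -= lr * g2 / n
    b -= lr * gb / n

# Because we hold the training data, the model's behaviour can be checked
# directly against it rather than treated as opaque.
correct = sum(
    (1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b))) > 0.5) == bool(label)
    for (x1, x2), label in zip(X, y)
)
accuracy = correct / len(X)
print(f"training accuracy: {accuracy:.2f}")
```

This runs in well under a second on commodity hardware, which is the scale of
"training your own model" that most non-frontier use cases actually involve.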
I think sometimes we forget that many machine-learning models are trained on
data sets much smaller than the web-scale corpora scraped for LLMs.

-- 
Soren Stoutner
soren@debian.org