
Re: Proposal -- Interpretation of DFSG on Artificial Intelligence (AI) Models



On Thu, May 15, 2025 at 11:36:22AM +0200, Aigars Mahinovs wrote:
> On Thu, 15 May 2025 at 10:06, Stefano Zacchiroli <zack@debian.org> wrote:
> > But I don't think it is disputable that the *most general* way of
> > modifying an ML model is achievable only starting from the full training
> > dataset and pipeline. There are simply things that you cannot do
> > starting from the trained model.
> 
> This is not quite the point I was trying to make in this specific
> thread. I was pointing out the difference between raw blob of training
> data and pipeline that creates/gathers that raw blob of training data.
[..]
> But I do think that it should be perfectly fine to have an ingest
> pipeline that simply downloads "
> https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz
> " for example.

Oh, I see. Thanks for clarifying, I indeed did not get that this was the
main point you were raising in this sub-thread.

FWIW, I agree that "where is it hosted?" is a less important question
than whether the full/pristine training dataset is available to our
users *somewhere* in the first place. But note that if Debian agrees
not to host datasets on its own infrastructure, then a number of
practical issues arise, e.g., what do we do with the package in main
if/when the data disappears from the external hosting place? (Yes, I
know those datasets are hosted by archives, whose mission is to preserve
data in the long run, but even archives can fail, might be forced to
delete data, etc. As long as we are not in control, anything can
happen.)

Cheers
-- 
Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack  _. ^ ._
Full professor of Computer Science              o     o   o     \/|V|\/
Télécom Paris, Polytechnic Institute of Paris     o     o o    </>   <\>
Co-founder & CSO Software Heritage            o o o     o       /\|^|/\
Mastodon: https://mastodon.xyz/@zacchiro                        '" V "'


