On Thu, May 15, 2025 at 11:36:22AM +0200, Aigars Mahinovs wrote:
> On Thu, 15 May 2025 at 10:06, Stefano Zacchiroli <zack@debian.org> wrote:
> > But I don't think it is disputable that the *most general* way of
> > modifying an ML model is achievable only starting from the full
> > training dataset and pipeline. There are simply things that you
> > cannot do starting from the trained model.
>
> This is not quite the point I was trying to make in this specific
> thread. I was pointing out the difference between the raw blob of
> training data and the pipeline that creates/gathers that raw blob of
> training data.
[..]
> But I do think that it should be perfectly fine to have an ingest
> pipeline that simply downloads
> "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz"
> for example.

Oh, I see. Thanks for clarifying; I indeed did not get that this was
the main point you were raising in this sub-thread.

FWIW, I agree that "where is it hosted?" is a less important question
than the one of whether the full/pristine training dataset is
available, for our users, *somewhere* in the first place.

But note that if Debian accepts not hosting datasets on its own
infrastructure, then a number of practical issues arise, e.g., what do
we do with the package in main if/when the data disappears from the
external hosting place? (Yes, I know those datasets are hosted by
archives, whose mission is to preserve data in the long run, but even
archives can fail, might be forced to delete data, etc. As long as we
are not in control, anything goes.)

Cheers
--
Stefano Zacchiroli . zack@upsilon.cc . https://upsilon.cc/zack   _. ^ ._
Full professor of Computer Science               o o o          \/|V|\/
Télécom Paris, Polytechnic Institute of Paris    o o o          </> <\>
Co-founder & CSO Software Heritage               o o o o        /\|^|/\
Mastodon: https://mastodon.xyz/@zacchiro                        '" V "'