On Mon, 10 Feb 2025, Gerardo Ballabio wrote:
> Stefano Zacchiroli wrote:
> > Regarding source packages, I suspect that most of our upstream authors
> > that will end up using free AI will *not* include training datasets in
> > distribution tarballs or Git repositories of the main software. So what
> > will we do downstream? Do we repack source packages to include the
> > training datasets? Do we create *separate* source packages for the
> > training datasets? Do we create a separate (ftp? git-annex? git-lfs?)
> > hosting place where to host large training datasets to avoid exploding
> > mirror sizes?
> I'd suggest separate source packages *and* put them in a special
> section of the archive, that mirrors may choose not to host.
> I'm not sure whether there could also be technical problems with
> many-gigabytes-sized packages, e.g., is there an upper limit to file
> size that could be hit? Can the package download be resumed if it is
Just want to chime in in support of using git-annex as an underlying
technology, and to provide a possible sketch of a solution:
- git-annex allows for (naming just a few points most relevant here,
  out of its wide range of general capabilities):
  - "linking" into a wide range of data sources, and, if needed,
    creating custom "special remotes" to access data.
    https://datasets.datalad.org/ is proof of that -- it provides
    access to 100s of TBs of data from a wide range of hosting
    solutions (S3, tarballs on an http server, some rclone-compatible
    storage solutions, ...); see the first sketch after this list.
  - diversifying/tiering data backup/storage, seamlessly to the
    end user.
    To that degree, I have (ab)used an institutional dropbox claimed
    to be "unlimited" to back up over 600TBs of a public data archive,
    and could then easily announce it "dead" whenever the data was no
    longer available there.
- separate "data availability" tracking (stored in git-annex
branch) from actual version tracking (your "master" branch).
This way adjustment of data availability does nohow require changes
to your "versioned data release".
- similarly to how we have https://neuro.debian.net/debian/dists/data/
  for "classical" Debian packages, there could be a similar suite on
  Debian, multi-version (multiple versions of a package allowed within
  the same suite), with packages that would deploy a git-annex
  repository upon installation. Individual Debian suites (stable,
  unstable) would then rely on specific version(s) of packages from
  the "data" suite.
- "data source package" could be just prescription on how to establish
data access, lean and nice. "binary package" also relatively lean,
since data itself accessible via git-annex
- a separate service could observe/verify continued availability of
  the data and, when necessary, establish alternative sources (or
  announce unavailable ones "dead").
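
To make the above more concrete, here is a minimal sketch (all URLs,
paths and file names below are made up for illustration) of linking
externally hosted data and tracking its availability:

    # start a hypothetical dataset repository
    git init training-data && cd training-data
    git annex init "demo"
    # "link" an externally hosted file; --relaxed records only the URL,
    # without downloading the content right away
    git annex addurl --relaxed --file=weights/model.bin \
        https://example.org/datasets/model.bin
    git commit -m "Link model.bin from upstream hosting"
    # availability information lives in the git-annex branch,
    # separate from the versioned tree on master
    git annex whereis weights/model.bin
    # if that host disappears, just unregister the URL -- the
    # versioned release on master stays untouched
    git annex rmurl weights/model.bin \
        https://example.org/datasets/model.bin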
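
And a purely hypothetical sketch of what such a thin "binary package"
could do at installation time (package name, URL and path are all
invented here):

    #!/bin/sh
    # postinst sketch for an openclipart-annex-style data package:
    # ship no data, only the prescription on how to set it up
    set -e
    case "$1" in
        configure)
            DEST=/usr/share/openclipart-annex   # hypothetical path
            if [ ! -d "$DEST/.git" ]; then
                git clone https://datasets.example.org/openclipart "$DEST"
                git -C "$DEST" annex init "system-wide copy"
            fi
            ;;
    esac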
> interrupted? Might tar fail to unpack the package? (Although those
> could all be solved by splitting the package into chunks...)
FWIW -- https://datasets.datalad.org could be considered a "single
package" as it is a single git repository leading to the next tier of
git-submodules, overall reaching into thousands of them.
But, logically, a separate git repo could be equated to a separate
Debian package. Additionally, "flavors" of packages could subset the
types of files to retrieve: e.g., something like openclipart-png could
depend on openclipart-annex, which would just install the git-annex
repo, and the -png flavor would then fetch only the *.png files.
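
E.g., the -png flavor's job could then reduce to something like this
(again a sketch, with a made-up installation path):

    # operate on the repository installed by the (hypothetical)
    # openclipart-annex package
    cd /usr/share/openclipart-annex
    # fetch only the PNG files; everything else remains a lean pointer
    git annex get --include='*.png' .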
Access to individual files is orchestrated via git-annex, which
already has built-in mechanisms for data integrity validation (often
"on the fly" while downloading), retries, stall detection, etc. For
example (reusing the hypothetical file from the sketch above):
Cheers,
--
Yaroslav O. Halchenko
Center for Open Neuroscience http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
WWW: http://www.linkedin.com/in/yarik