
Re: Huge data files in Debian



Troy Benjegerdes <hozer@hozed.org> writes:

> On Fri, Jul 17, 2015 at 09:38:06PM +0200, Jakub Wilk wrote:
>> * Ole Streicher <olebole@debian.org>, 2015-07-17, 10:34:
>> >But: These packages sum up to ~25 GB, with the maximal package
>> >size of 3.5 GB.
>> 
>> Well, that's a lot. Just as data points:
>> 
>> * The biggest binary package currently in the archive,
>> ns3-doc_3.17+dfsg-1_all.deb, is only ~1GB.
>> 
>> * The biggest source package, nvidia-cuda-toolkit_6.0.37-5, is only
>> ~1.5GB.
>> 
>> 
>> I'm afraid you might need to wait for the advent of data.d.o:
>> https://lists.debian.org/87tzgm6yee.fsf@vorlon.ganneff.de
>> (mind the typo: s/2 weeks/10 years/)
>> 
>
> My first thought was "well, can all of us science-type users 
> agree to host something like /afs/data.d.o/", and then I saw 
> the following:
>
> On Fri, Jul 17, 2015 at 02:03:54AM -0700, Afif Elghraoui wrote:
>> Package: wnpp
>> Severity: wishlist
>> Owner: Afif Elghraoui <afif@ghraoui.name>
>> X-Debbugs-Cc: debian-devel@lists.debian.org
>>
>> * Package name    : ori
>>   Version         : 0.8.1
>>   Upstream Author : Stanford University <orifs-devel@lists.stanford.edu>
>> * URL             : http://ori.scs.stanford.edu/
>> * License         : ori (MIT-like)
>>   Programming Lang: C++
>>   Description     : secure distributed file system
>>
>> Ori is a distributed file system built for offline operation that
>> empowers the user with control over synchronization operations and
>> conflict resolution. History is provided through lightweight
>> snapshots, and users can verify that the history has not been
>> tampered with. Through replication, instances can be resilient and
>> recover damaged data from other nodes.
>
> So is there any sort of reasonable internet-scale distributed 
> filesystem in use that might actually work for this?

Git-annex supports Tahoe-LAFS:

  https://git-annex.branchable.com/special_remotes/tahoe/

but given that it also supports all of these:

  https://git-annex.branchable.com/special_remotes/

I'd guess that the data would quite often reside on resources at least
as reliable as whatever we might set up, so one could just handle it on
a case-by-case basis.
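
For instance, setting up and populating a Tahoe-LAFS special remote
would look roughly like this (a sketch: the remote name and file name
are made up, and the grid-specific parameters are the ones described on
the tahoe page above):

  # assumes a working tahoe client; grid parameters omitted here
  git annex initremote mytahoe type=tahoe
  # push a copy of a (hypothetical) data file into the grid
  git annex copy somedata.fits --to=mytahoe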

Git-annex allows one to set the number of copies of the data that one
wants to exist, so one could perhaps insist that data have multiple
sources, and that could be checked periodically, with some plan to copy
the data elsewhere if and when a source disappears.
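
Concretely, that would be something like (file name again made up):

  # require at least two copies before any drop is allowed
  git annex numcopies 2
  # verify checksums and that the numcopies requirement is met
  git annex fsck
  # list which repositories hold a given file
  git annex whereis somedata.fits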

The users of the data could be given the option to contribute to the
checking process, so that it gets done as part of the act of using the
data.
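
For example, a user with access to a given remote could occasionally
run (remote name as in the sketch above):

  # verify the copies stored on that remote, not just local ones
  git annex fsck --from=mytahoe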

Any effort required to shift data to new resources when old sources
disappear could be done, in a distributed manner, by those who benefit
from access to the data.
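
That shift is again just ordinary git-annex usage, roughly (names made
up; an rsync-accessible host stands in for whatever the new home is):

  git annex initremote newhome type=rsync \
      rsyncurl=example.org:/srv/data encryption=none
  # re-replicate the data onto the new remote
  git annex copy . --to=newhome
  # record that the vanished source should no longer be counted
  git annex dead mytahoe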

Cheers, Phil.
-- 
|)|  Philip Hands  [+44 (0)20 8530 9560]  HANDS.COM Ltd.
|-|  http://www.hands.com/    http://ftp.uk.debian.org/
|(|  Hugo-Klemm-Strasse 34,   21075 Hamburg,    GERMANY
