
Re: data sets and/or access to data sets



I think putting the data itself into the Debian repository is problematic.  Regardless of any licensing issues, the sheer amount of data is too great.  Better to leave the data with the professionals who are paid to manage it (NCBI, KEGG, etc.) and download directly from their sites.  Pretty much all of them offer FTP/HTTP access to the data.

I like the getData effort: have a set of "data descriptors" with information about how and where to get the data, and perform the download only when it is requested.  This is very much the architecture I was thinking about.  I see a number of ways the project could be expanded.  I would like to hear thoughts from Steffen and Charles about getData before I jump in with a bunch of additions.
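
To make the "data descriptor" idea concrete, here is a minimal sketch in Python.  It is not how getData is actually implemented; the names DataDescriptor and fetch, the descriptor fields, and the optional checksum are all assumptions made up for illustration.  The point is only that the descriptor records where the data lives, and nothing is downloaded until someone asks for it.

```python
import hashlib
import shutil
import urllib.request
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class DataDescriptor:
    """Hypothetical descriptor: records where a data set lives, holds no data."""
    name: str
    url: str                      # upstream location (ftp://, http://, file://)
    filename: str                 # local file name to store the download under
    sha256: Optional[str] = None  # optional checksum, verified if present


def _sha256(path: Path) -> str:
    """Hash the file in chunks so large data sets are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(desc: DataDescriptor, dest_dir: Path) -> Path:
    """Download the data set on request; reuse the local copy if present."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / desc.filename
    if not target.exists():
        # Write to a temporary name first so an interrupted download
        # is never mistaken for a complete file.
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(desc.url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)
    if desc.sha256 is not None and _sha256(target) != desc.sha256:
        raise ValueError(f"checksum mismatch for {desc.name}")
    return target
```

A package built this way would ship only the small descriptor, and the bulk data would stay with the upstream provider until fetch() is called.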

The BioMAJ project looks interesting as well.  One possibility is to use it as the underlying data retrieval layer, but it may also be "too complex" for basic retrieval functions.

Scott

On Feb 15, 2011, at 5:10 PM, Yaroslav Halchenko wrote:

> just a few cents.  In the domain of neuroimaging we are also confronted
> with the problem of distributing data.  Various aspects are relevant to
> this question if someone is to package data "statically" (instead of
> fetching via some data-sharing framework) into a proper Debian
> package:
> 
> 1.  with a classical Debian package large sizes of data get
>  duplicated both in source and binary packages.  
> 
>  Although this could be overcome by some means, for our domain of interest,
>  http://neuro.debian.net/datasets.html  provides data in both binary
>  and source packages with the idea, that non-Debian users can still
>  simply fetch .orig.tar.gz if they need to get ahold of the data, e.g.
>  separate tarballs per subject from
>  http://neuro.debian.net/debian/pool/main/h/haxby2001/
> 
> 2.  what is the appropriate license for data ;)  in quite a few 
>   jurisdictions data is not copyrightable per se at all, thus plain common
>   licenses tailored toward software are not appropriate (even CC [1]).  The EU
>   has sui generis database rights while there is no similar mechanism in
>   the States afaik, suggesting the necessity of license terms
>   addressing such differences
> 
>   so when releasing/packaging data, a viable description of terms that is
>   appropriate in different jurisdictions should be attached, e.g.,
>   as recommended by Hendrik Weimer on debian-legal [2] -- ODC Public
>   Domain Dedication and Licence (PDDL) [3].
> 
> 
> [1] http://bibwild.wordpress.com/2008/11/24/creative-commons-is-not-appropriate-for-data/
> [2] http://lists.debian.org/debian-legal/2011/01/msg00049.html
> [3] http://www.opendatacommons.org/licenses/pddl/1.0/
> 
> On Tue, 15 Feb 2011, Andreas Tille wrote:
> 
>> Hi Scott,
> 
>> I think your idea is quite reasonable in principle.  As far as I
>> understood (but I did not dive into this) the getData effort[1] is one
>> step in this direction and the soon-to-be-uploaded package Biomaj does
>> something that might be helpful as well.
> 
>> Regarding actually building packages:  There were several ideas in the
>> past to have some data.debian.org archive which contains large data sets
>> and into which the packages you suggest would probably fit.  However,
>> to the best of my knowledge this was not yet implemented for practical
>> use.
> 
>> Do we want to take another shot at a Google Summer of Code project
>> in this direction?
> 
>> Kind regards
> 
>>    Andreas.
> 
>> [1] http://wiki.debian.org/getData
> -- 
> =------------------------------------------------------------------=
> Keep in touch                                     www.onerussian.com
> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic
> 
> 

