
Re: How do you deal with datasets



Hi Christian,

> I've been sitting on the sktime [1] packaging for a while and as usual,
> most of the time is consumed by d/copyright.

Yeah, that's also where I spend most of my time packaging software.

> I've now resolved all issues, but one: there are 13 datasets included
> for which I still need to resolve the license.
>
> What are your experiences with packages providing datasets? Do you also
> (1) document them individually, or (2) upload them as separate packages
> to non-free after a check for distributability, or (3) just toss them
> out with +dfsg?

It depends on how usable the software/package is without the datasets. If
the software remains useful without them (just slower, or less exact), I
think it's fine not to package them at all and instead to provide a
download script, with notes in d/README.Debian.

> I'm honestly having a hard time motivating myself to do option (1)...
> 
> [1] https://github.com/alan-turing-institute/sktime
> 
> [2]
> https://github.com/alan-turing-institute/sktime/tree/master/sktime/datasets/data

Here are some examples of datasets with different problems:

- rtklib ships datasets (in Debian)
- petitRADTRANS has external datasets (320 GB as tar.xz), not in Debian:
  https://gitlab.com/mauricemolli/petitRADTRANS/-/issues?scope=all&utf8=%E2%9C%93&state=closed
- https://mentors.debian.net/package/spock/ also has an external dataset
  (100 MB as xz) and needs some more Python packages (not yet in Debian);
  more info at http://phd-sid.ethz.ch/debian/spock/

I'm also looking for answers to your question.

Best,

> Best,
> Christian
> 