Re: How do you deal with datasets
Hi Christian,
On Fri, Oct 16, 2020 at 08:48:04AM +0200, Christian Kastner wrote:
> Hello all,
>
> I've been sitting on the sktime [1] packaging for a while and as usual,
> most of the time is consumed by d/copyright.
>
> I've now resolved all issues, but one: there are 13 datasets included
> for which I still need to resolve the license.
>
> What are your experiences with packages providing datasets? Do you also
> (1) document them individually, or (2) upload them as separate packages
> to non-free after a check for distributability, or (3) just toss them
> out with +dfsg?
My personal opinion is that the proper way of handling these datasets
depends on various factors.
(1) small/toy datasets that are frequently and widely used, and that can
even serve to sanity-test a machine learning implementation:
I'd suggest independent packaging in this case, e.g.
https://tracker.debian.org/pkg/dataset-fashion-mnist
(2) small datasets that are frequently updated; small datasets that are
unlikely to be useful for other software projects; small datasets used
as part of unit tests:
I'd suggest simply including them in the source tarball.
IIRC scipy/sklearn contain some small datasets as well (see the
example after this list).
(3) large (in terms of size) datasets:
See ML-Policy Clause 7 [External-Data]:
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.pdf
(4) non-free datasets:
Needless to say, these can only go to non-free as separate packages
(your option (2)), after a distributability check.
(5) datasets whose status is equivocal, annoying to resolve, and not
essential for the package to function properly:
If I were you, I might simply remove them and mangle the version with
+ds (see the repacking sketch after this list).
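
For (2), a concrete illustration of the pattern: sklearn ships its toy
datasets inside the package tree (IIRC under sklearn/datasets/data/) and
exposes them through loader functions, so they travel with the source
tarball and are usable by the test suite offline:

    # Toy data bundled with scikit-learn itself; no download involved,
    # which is what makes it safe to use from unit tests.
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    assert X.shape == (150, 4)  # 150 samples, 4 features

sktime's bundled data under [2] follows the same layout, so option (2)
would fit whatever subset survives the license review.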
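
And for (5), a rough sketch of the repacking mechanics (untested, and
the exclusion glob is only guessed from the layout in [2]): declare the
paths in the header paragraph of debian/copyright,

    Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
    Files-Excluded: sktime/datasets/data/*

and use a repack suffix in debian/watch (format 4):

    version=4
    opts="repacksuffix=+ds,dversionmangle=auto" \
      https://github.com/alan-turing-institute/sktime/tags .*/v?(\d[\d.]+)\.tar\.gz

uscan then hands things to mk-origtargz, which drops the excluded paths
and appends +ds to the upstream version when building the orig tarball.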
I think we should revise ML-Policy to talk specifically about datasets.
> I'm honestly having a hard time motivating myself to do option (1)...
>
> [1] https://github.com/alan-turing-institute/sktime
>
> [2]
> https://github.com/alan-turing-institute/sktime/tree/master/sktime/datasets/data
>
> Best,
> Christian
>