
Re: How do you deal with datasets



Hi Christian,

On Fri, Oct 16, 2020 at 08:48:04AM +0200, Christian Kastner wrote:
> Hello all,
> 
> I've been sitting on the sktime [1] packaging for a while and as usual,
> most of the time is consumed by d/copyright.
> 
> I've now resolved all issues, but one: there are 13 datasets included
> for which I still need to resolve the license.
> 
> What are your experiences with packages providing datasets? Do you also
> (1) document them individually, or (2) upload them as separate packages
> to non-free after a check for distributability, or (3) just toss them
> out with +dfsg?

My personal opinion is that the proper way of handling these datasets
depends on various factors.

(1) frequently and widely used small/toy datasets, which can even be
used for sanity-testing a machine-learning algorithm:

	I'd suggest independent packaging in this case, e.g.
	https://tracker.debian.org/pkg/dataset-fashion-mnist

(2) small datasets that are frequently updated; small datasets that are
unlikely to be useful for other software projects; small datasets used as
part of unit tests:

	I'd suggest simply including them in the source tarball.
	IIRC scipy/sklearn ship some small datasets this way as well.

(3) large (in terms of size) datasets:

	ML-Policy Clause 7 [External-Data]
	https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.pdf

(4) non-free:

	needless to say.

(5) equivocally licensed, annoying, and not essential for the package to
function properly:

	If I were you, I might simply remove them and mangle the version
	with +ds, e.g. via Files-Excluded (see the sketch below).
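
	For (5), the removal can be mostly automated at repack time. A
	rough, untested sketch follows; the glob is taken from the path
	in [2], and the watch line is just the generic GitHub-tags
	pattern, so both would need checking against the actual sktime
	tarball:

	debian/copyright (header paragraph):

	    Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
	    Files-Excluded: sktime/datasets/data/*

	debian/watch:

	    version=4
	    # uscan/mk-origtargz drop the Files-Excluded paths when
	    # repacking and append +ds to the repacked upstream version
	    opts="repacksuffix=+ds,dversionmangle=s/\+ds\d*$//" \
	      https://github.com/alan-turing-institute/sktime/tags .*/v?(\d\S*)\.tar\.gz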
	
I think we should revise ML-Policy and specifically talk about datasets.

> I'm honestly having a hard time motivating myself to do option (1)...
> 
> [1] https://github.com/alan-turing-institute/sktime
> 
> [2]
> https://github.com/alan-turing-institute/sktime/tree/master/sktime/datasets/data
> 
> Best,
> Christian
> 

