Re: How do you deal with datasets
Hi Christian,
On Fri, Oct 16, 2020 at 08:48:04AM +0200, Christian Kastner wrote:
> Hello all,
>
> I've been sitting on the sktime [1] packaging for a while and as usual,
> most of the time is consumed by d/copyright.
>
> I've now resolved all issues, but one: there are 13 datasets included
> for which I still need to resolve the license.
>
> What are your experiences with packages providing datasets? Do you also
> (1) document them individually, or (2) upload them as separate packages
> to non-free after a check for distributability, or (3) just toss them
> out with +dfsg?
My personal opinion is that the proper way of handling these datasets
depends on various factors.
(1) small/toy datasets that are frequently and widely used, and that can
even serve to sanity-test a machine learning implementation:
I'd suggest independent packaging in this case, e.g.
https://tracker.debian.org/pkg/dataset-fashion-mnist
(2) small datasets that are frequently updated; small datasets that are
unlikely to be useful for other software projects; small datasets used
as part of unit tests:
I'd suggest simply including them in the source tarball.
IIRC scipy/sklearn contain some small datasets as well (see the
example after this list).
(3) large (in terms of size) datasets:
See ML-Policy Clause 7 [External-Data]:
https://salsa.debian.org/deeplearning-team/ml-policy/-/blob/master/ML-Policy.pdf
(4) non-free datasets:
Needless to say, these can only go to non-free as separate packages
(your option (2)), after a distributability check.
(5) datasets whose status is equivocal, annoying to resolve, and not
essential for the package to function properly:
If I were you, I might simply remove them and mangle the version with
+ds (see the repacking sketch after this list).
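
For (2), a concrete illustration of the pattern: sklearn ships its toy
datasets inside the package tree (IIRC under sklearn/datasets/data/) and
exposes them through loader functions, so they travel with the source
tarball and are usable by the test suite offline:

    # Toy data bundled with scikit-learn itself; no download involved,
    # which is what makes it safe to use from unit tests.
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    assert X.shape == (150, 4)  # 150 samples, 4 features

sktime's bundled data under [2] follows the same layout, so option (2)
would fit whatever subset survives the license review.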
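
And for (5), a rough sketch of the repacking mechanics (untested, and
the exclusion glob is only guessed from the layout in [2]): declare the
paths in the header paragraph of debian/copyright,

    Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
    Files-Excluded: sktime/datasets/data/*

and use a repack suffix in debian/watch (format 4):

    version=4
    opts="repacksuffix=+ds,dversionmangle=auto" \
      https://github.com/alan-turing-institute/sktime/tags .*/v?(\d[\d.]+)\.tar\.gz

uscan then hands things to mk-origtargz, which drops the excluded paths
and appends +ds to the upstream version when building the orig tarball.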
I think we should revise ML-Policy to talk specifically about datasets.
> I'm honestly having a hard time motivating myself to do option (1)...
>
> [1] https://github.com/alan-turing-institute/sktime
>
> [2]
> https://github.com/alan-turing-institute/sktime/tree/master/sktime/datasets/data
>
> Best,
> Christian
>