
Re: Datasets downloaded by scikit-learn as separate packages?



On Mon, 2021-09-20 at 19:52 +0200, Christian Kastner wrote:
> > 
> > Or should we not build these jupyter notebooks for the -doc package?
> 
> I don't think anyone would stop you from packaging the datasets but to
> be honest, I think that would be overkill. The -doc package has a
> popcon
> of 93, and I would assume that (like me) most users of scikit-learn use
> upstream's online documentation directly.

Many machine-learning-related packages require external datasets,
and upstream usually provides APIs that let users download them
automatically when the datasets are useful to a large audience.
My vote is that packaging a dataset is not necessary, and we can use a
pytest marker to skip the tests that require external data.
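
As a rough illustration of the marker idea, here is a minimal sketch of a
conftest.py. The marker name "external_data" and the environment variable
ALLOW_DATASET_DOWNLOAD are made up for this example; they are not something
scikit-learn or Debian defines, and upstream's test suite may already have
its own mechanism.

```python
# conftest.py (sketch; marker and variable names are hypothetical)
import os

import pytest

# Hypothetical opt-in switch for tests that download data.
ALLOW_DOWNLOAD_ENV = "ALLOW_DATASET_DOWNLOAD"


def downloads_allowed(environ=None):
    """Return True only if the opt-in variable is set to "1"."""
    environ = os.environ if environ is None else environ
    return environ.get(ALLOW_DOWNLOAD_ENV) == "1"


def pytest_configure(config):
    # Register the custom marker so pytest does not warn about it.
    config.addinivalue_line(
        "markers",
        "external_data: test needs a dataset downloaded from the network",
    )


def pytest_collection_modifyitems(config, items):
    # When downloads are not allowed (e.g. on a buildd without network),
    # skip every test carrying the marker.
    if downloads_allowed():
        return
    skip = pytest.mark.skip(
        reason="needs external data; set %s=1 to run" % ALLOW_DOWNLOAD_ENV
    )
    for item in items:
        if "external_data" in item.keywords:
            item.add_marker(skip)
```

A test would then be decorated with @pytest.mark.external_data. Running
pytest -m "not external_data" deselects such tests as well, without needing
the environment variable at all.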

I refrained from uploading any datasets except for

 $ apt list dataset\*
 Listing... Done
 dataset-fashion-mnist/unstable,unstable,now

as it can serve as a universal sanity-test dataset for any machine
learning tool. (In academia, people use the dataset named MNIST; the
above Fashion-MNIST is an MIT-licensed alternative.)

