
Re: Datasets downloaded by scikit-learn as separate packages?



Hi Steffen,

I did several of the recent scikit-learn uploads, so maybe I can share
something that helps.

On 19.09.21 19:36, Steffen Möller wrote:
> Hello,
> 
> This goes out to the good fellas who care about scikit-learn. There is a
> tutorial for the qiime package that has a classifier prepared that only
> works with the latest stable version (0.24.2). We are at 0.23.2 in Debian.
> 
> I gave an update of Scikit-Learn a shot and while the main build was
> fine, I was eventually greeted with many download errors for datasets
> that it uses as examples, as in
> 
> /home/steffen/Science/scikit-learn/examples/inspection/plot_partial_dependence.py
> 
> failed leaving traceback:
> Traceback (most recent call last):
>   File "/home/steffen/Science/scikit-learn/examples/inspection/plot_partial_dependence.py", line 50, in <module>
>     cal_housing = fetch_california_housing()
>   File "/home/steffen/Science/scikit-learn/.pybuild/cpython3_3.9/build/sklearn/utils/validation.py", line 63, in inner_f
>     return f(*args, **kwargs)
>   File "/home/steffen/Science/scikit-learn/.pybuild/cpython3_3.9/build/sklearn/datasets/_california_housing.py", line 134, in fetch_california_housing
>     archive_path = _fetch_remote(ARCHIVE, dirname=data_home)
>   File "/home/steffen/Science/scikit-learn/.pybuild/cpython3_3.9/build/sklearn/datasets/_base.py", line 1194, in _fetch_remote
>     urlretrieve(remote.url, file_path)
>   File "/usr/lib/python3.9/urllib/request.py", line 239, in urlretrieve
>     with contextlib.closing(urlopen(url, data)) as fp:
>   File "/usr/lib/python3.9/urllib/request.py", line 214, in urlopen
>     return opener.open(url, data, timeout)
>   File "/usr/lib/python3.9/urllib/request.py", line 517, in open
>     response = self._open(req, data)
>   File "/usr/lib/python3.9/urllib/request.py", line 534, in _open
>     result = self._call_chain(self.handle_open, protocol, protocol +
>   File "/usr/lib/python3.9/urllib/request.py", line 494, in _call_chain
>     result = func(*args)
>   File "/usr/lib/python3.9/urllib/request.py", line 1389, in https_open
>     return self.do_open(http.client.HTTPSConnection, req,
>   File "/usr/lib/python3.9/urllib/request.py", line 1349, in do_open
>     raise URLError(err)
> urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>
> 
> 
> The original data is typically available. But these are classics with
> often unclear licenses, here as in
> 
> # The original data can be found at:
> # https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
> ARCHIVE = RemoteFileMetadata(
>     filename='cal_housing.tgz',
>     url='https://ndownloader.figshare.com/files/5976036',
>     checksum=('aaa5c9a6afe2225cc2aed2723682ae40'
>               '3280c4a3695a2ddda4ffb5d8215ea681'))

I've never noticed these particular errors before. The build process
generates a ton of network errors (especially during the documentation
build), but because the cause is specific and known (the build has no
network access), they have been ignored so far.
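
If someone did want to feed the examples from locally shipped archives
instead of the network, the first thing to confirm would be that a local
file matches the checksum upstream records in RemoteFileMetadata (a
SHA256 digest, as far as I can tell). A minimal stdlib-only sketch;
sha256_of and is_expected_archive are names I made up for illustration,
not scikit-learn API:

```python
import hashlib
from pathlib import Path

# Checksum copied from the RemoteFileMetadata entry quoted above.
CAL_HOUSING_SHA256 = (
    "aaa5c9a6afe2225cc2aed2723682ae40"
    "3280c4a3695a2ddda4ffb5d8215ea681"
)


def sha256_of(path, chunk_size=8192):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_expected_archive(path, expected=CAL_HOUSING_SHA256):
    """True if `path` is an existing file whose digest matches `expected`."""
    path = Path(path)
    return path.is_file() and sha256_of(path) == expected
```

Something like is_expected_archive("cal_housing.tgz") would then tell you
whether a pre-seeded copy is the file upstream expects, without touching
the network.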

> The build currently ends with
> 
> I: pybuild pybuild:284:   (mv
> /home/steffen/Science/scikit-learn/sklearn/conftest.py
> /home/steffen/Science/scikit-learn/sklearn/conftest.py.test; mv
> /home/steffen/Science/scikit-learn/sklearn/datasets/tests/conftest.py
> /home/steffen/Science/scikit-learn/sklearn/datasets/tests/conftest.py.test;
> cd /home/steffen/Science/scikit-learn/.pybuild/cpython3_3.9/build &&
> python3.9 -c 'import sklearn; sklearn.show_versions()')
> mv: cannot stat
> '/home/steffen/Science/scikit-learn/sklearn/conftest.py': No such file
> or directory
> mv: cannot stat
> '/home/steffen/Science/scikit-learn/sklearn/datasets/tests/conftest.py':
> No such file or directory

I think this is a different issue. Some of the conftest.py files get
moved around because pytest can get confused [1] when the build result is
placed in a subdirectory of the original source tree (as pybuild does with
.pybuild), so debian/rules moves some of them out of the way.

That step [2] is probably the cause of the mv errors. You can try removing
it and see whether anything changes. Oddly enough, only some of the
conftest.py files cause this issue.

[1] https://github.com/pytest-dev/pytest/issues/7223

[2] https://sources.debian.org/src/scikit-learn/0.23.2-5/debian/rules/#L142
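
For illustration, here is the same move-aside idea written so that a
missing conftest.py is skipped rather than reported as an error. This is
only a Python sketch; the packaging does this with plain mv, as the log
shows, and stash_conftests / restore_conftests are hypothetical names:

```python
from pathlib import Path


def stash_conftests(source_root, relative_paths, suffix=".test"):
    """Rename each existing conftest.py to conftest.py<suffix>.

    Files that do not exist are skipped instead of raising.
    Returns the (original, stashed) pairs that were actually renamed.
    """
    moved = []
    for rel in relative_paths:
        original = Path(source_root) / rel
        if not original.is_file():
            # A bare `mv` would print "cannot stat" here; just skip it.
            continue
        stashed = original.with_name(original.name + suffix)
        original.rename(stashed)
        moved.append((original, stashed))
    return moved


def restore_conftests(moved):
    """Undo stash_conftests by renaming every stashed file back."""
    for original, stashed in moved:
        stashed.rename(original)
```

Called with the two paths from the log (sklearn/conftest.py and
sklearn/datasets/tests/conftest.py), it would rename whichever of them
exist and quietly ignore the rest.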

> Built with OpenMP: True
> I: pybuild base:232: cd
> /home/steffen/Science/scikit-learn/.pybuild/cpython3_3.9/build;
> python3.9 -m pytest -m "not network" -v -k "not test_old_pickle and not
> test_ard_accuracy_on_easy_problem"
> ImportError while loading conftest
> '/home/steffen/Science/scikit-learn/conftest.py'.
> ../../../conftest.py:14: in <module>
>     from sklearn.utils import _IS_32BIT
> ../../../sklearn/__init__.py:81: in <module>
>     from . import __check_build  # noqa: F401
> ../../../sklearn/__check_build/__init__.py:46: in <module>
>     raise_build_error(e)
> ../../../sklearn/__check_build/__init__.py:31: in raise_build_error
>     raise ImportError("""%s
> E   ImportError: No module named 'sklearn.__check_build._check_build'
> E
> ___________________________________________________________________________
> E   Contents of /home/steffen/Science/scikit-learn/sklearn/__check_build:
> E   setup.py                  _check_build.c __pycache__
> E   _check_build.pyx          __init__.py
> E
> ___________________________________________________________________________
> E   It seems that scikit-learn has not been built correctly.

I haven't looked at this yet, but I wouldn't be surprised if it's related
to the conftest.py issue above.

> E
> E   If you have installed scikit-learn from source, please do not forget
> E   to build the package before using it: run `python setup.py install` or
> E   `make` in the source directory.
> E
> E   If you have used an installer, please check that it is suited for your
> E   Python version, your operating system and your platform.
> E: pybuild pybuild:353: test: plugin distutils failed with: exit code=4:
> cd /home/steffen/Science/scikit-learn/.pybuild/cpython3_3.9/build;
> python3.9 -m pytest -m "not network" -v -k "not test_old_pickle and not
> test_ard_accuracy_on_easy_problem"
> 
> What are you all thinking? Should we prepare separate data packages for
> Debian for those data sets that allow an unrestricted distribution? I
> did not have a closer look, but I expect these datasets to total to
> about 100MB with no frequent updates expected if any.
> 
> Or should we not build these jupyter notebooks for the -doc package?

I don't think anyone would stop you from packaging the datasets, but to
be honest, I think that would be overkill. The -doc package has a popcon
of 93, and I would assume that most users of scikit-learn (like me) read
upstream's online documentation directly.

Best,
Christian

