Re: Packagability of crates with separate test data
On April 10, 2022 12:43 am, Andreas Molzer wrote:
> Hi,
>
> As a crate author/maintainer, I'm wondering how I could improve my crate
> organization to integrate with your CI/packaging setup. Basically, I got
> this Issue report (non-actionable, informative) regarding rust-weezl:
>
> <https://github.com/image-rs/lzw/issues/29>
>
>> Unit tests can't be run from the crate downloaded from crates.io
>>
>> […] The unit tests depend on a file named /benches/binary-8-msb.lzw
>> that isn't included in the crate uploaded to crates.io.
>
> Similar issues exist for `rust-png` and `rust-image`. The
> underlying problem is a conflict between the largely automated
> packaging policy and the trimmed crates.io archives. A little
> more in-depth:
>
> Speaking as a crate author, the artifacts published to crates.io are
> mainly geared towards consumption as a cargo dependency. For this
> reason, I strive to make them as small as possible, with no dev- or
> test data. (We, image-rs, once accidentally published ~1MB of such
> data in image-tiff and got an issue report for it within the day.)
> In any case, <crates.io> has a hard limit of 10MB. For these reasons,
> it does not seem reasonable to reference test data in Cargo.toml,
> even through such mechanisms as specifying an additional
> dev-dependency.
exactly because what's published on crates.io is not always identical to
what's in git, and because the former is what consumers (reverse
dependencies) of the crate actually use, Debian uses the released crate
archives on crates.io as the canonical source for packaging.
this sometimes causes issues with tests not being able to run
(properly), because
- test scripts are stripped prior to uploading to crates.io
- test data is stripped prior to uploading to crates.io
- tests rely on unpublished internal crates
- (some) dev-dependencies are stripped prior to uploading to crates.io
- intra-workspace dependencies are stripped prior to uploading to
  crates.io
some of that can be worked around by re-adding stripped things via
patches, but obviously especially for bigger test data that's a
non-starter.
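the stripping of test data is usually deliberate on the author's side:
Cargo's `include`/`exclude` manifest keys (these are real Cargo keys)
control what ends up in the published archive. a minimal sketch, with
an invented crate name and illustrative glob patterns:

```toml
# Sketch only: `exclude` is a real Cargo manifest key; the crate name
# and the glob patterns below are illustrative.
[package]
name = "example-codec"
version = "1.0.0"
edition = "2021"
# Files matching these globs are left out of the .crate archive that
# `cargo publish` uploads - which is exactly why downstream test runs
# cannot find them afterwards.
exclude = ["benches/*.lzw", "tests/data/*"]
```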
> Anyways, thinking about the issue I wanted to offer a potential
> solution. What about adding a dev-dependency that ensures the proper
> data exists, and then loading test data either out-of-band or
> dynamically? So an idea for a crate was born:
> xtest-data: <https://crates.io/crates/xtest-data/1.0.0-beta.2>
>
> The premise was that downloading data over the network is not
> desirable. Over time it became clear that the data should be available
> as an archive. So the package morphed into automation to create and
> load minimal git pack-files that contain a shallow and sparse archive
> of the test data. This makes it very easy to publish test data as a
> separate release artifact, for example via CI/CD Actions/… The
> xtest-data documentation shows an example:
>
> <https://github.com/HeroicKatora/xtest-data#how-to-use-offline>
>
> So, I'm left wondering if this approach resonates with some of you. Does
> this simplify any steps for packaging? What could be done to improve
> this further? For instance, `Cargo.toml` allows some arbitrary metadata
> fields; <docs.rs/about/metadata> utilizes this quite effectively. Would
> it help if a reference to this additional test-data archive were added
> to the Cargo.toml file (where it would appear in the published crate),
> and if so, in what form?
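one hedged sketch of what such a reference could look like - none of
these metadata keys exist today; Cargo ignores unknown
`[package.metadata]` tables, so they would ship in the published crate
without side effects. every key name and the URL below are invented
for illustration:

```toml
# Hypothetical metadata table; nothing here is a recognized key.
[package.metadata.xtest-data]
# Where a test-data archive matching this exact version is published.
archive = "https://example.org/releases/{version}/test-data.tar.gz"
# Pinned digest so downstream consumers can verify what they fetched.
sha256 = "…"
```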
Debian uses a special tool called 'debcargo'[0] that manages the
handling of crate/package metadata, automating away lots of the chores
associated with packaging crates. it's also responsible for transforming
a crate published on crates.io into a source tarball (and package) for
consumption by Debian tooling.
in theory it would of course be feasible to teach it to fetch a second
(or arbitrarily many ;)) tarball from sources other than crates.io -
these would then be available at build time just like the regular crate
source tarball generated from the crate archive retrieved via crates.io.
in general we'd like some sort of trust anchor for such operations
(e.g., signed tarballs, the crates.io index, ..). for example, debcargo
could run the xtest-data fetch operation (it runs on the packager's
system when preparing the package, so network access is still okay at
this stage).
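as a sketch of how cheap such a trust anchor could be, assuming a
pinned checksum shipped alongside the package (all file names are
invented, and the "downloaded" archive is faked locally to keep the
example self-contained):

```shell
# Hypothetical sketch: verify a separately published test-data archive
# against a pinned checksum before trusting it.
set -eu

# Stand-in for an archive fetched from a release URL.
printf 'test data payload' > xtest-data-1.0.0.tar

# In practice the pinned checksum would come from package metadata or
# the debian/ directory; here we generate it to keep the sketch
# self-contained.
sha256sum xtest-data-1.0.0.tar > xtest-data-1.0.0.tar.sha256

# The verification step a packaging tool could run before unpacking.
sha256sum -c xtest-data-1.0.0.tar.sha256
```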
the main obstacle is that we'd need more than a single crate or crate
author to use such a scheme to make it worthwhile to implement as a
feature in debcargo in the first place, and to support it long-term.
basically: if you want to implement it for CI purposes anyway (e.g., to
test the crate as published rather than as developed in git), keep in
mind that some semi-standard way to fetch and verify the contents based
on the crate version would be helpful for external consumers. if enough
people in similar situations adopt such a scheme, it makes sense for
downstreams/distros to become consumers of it as well. for the time
being, the "solution" will likely be to just skip tests that fail for
this reason (and to manually test any patches that might be candidates
for causing breakage, instead of relying on debci to catch those).
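for crate authors who want published tests to at least degrade
gracefully in the meantime, one common pattern is to skip (rather than
fail) when the external data file is absent. a minimal sketch, reusing
the data path from the rust-weezl issue report; the helper function is
invented for illustration:

```rust
use std::path::Path;

// Invented helper: returns Some(bytes) when the external data file
// exists, None when it was stripped from the published archive.
fn load_test_data(path: &str) -> Option<Vec<u8>> {
    let p = Path::new(path);
    if !p.exists() {
        return None;
    }
    std::fs::read(p).ok()
}

#[test]
fn decodes_reference_stream() {
    // Data file from the rust-weezl issue report; absent in the
    // archive downloaded from crates.io.
    let Some(bytes) = load_test_data("benches/binary-8-msb.lzw") else {
        eprintln!("skipping: test data not present in this archive");
        return;
    };
    assert!(!bytes.is_empty());
}
```

the downside, as noted above, is that a silently skipped test gives
debci nothing to catch when a patch does break the code path.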
> Feel free to contribute via answers on the mailing list, or opening
> concrete proposals on the repository.
Thanks for reaching out, and feel free to keep us posted in case there
are developments regarding this issue!
Fabian (not a DD/DM, just someone active in packaging things in Debian
and further downstream ;))
0: https://salsa.debian.org/rust-team/debcargo