
Re: Packagability of crates with separate test data



On April 10, 2022 12:43 am, Andreas Molzer wrote:
> Hi,
> 
> As a crate author/maintainer, I'm wondering how I could improve my crate
> organization to integrate with your CI/packaging setup. Basically, I got
> this Issue report (non-actionable, informative) regarding rust-weezl:
> 
> <https://github.com/image-rs/lzw/issues/29>
> 
>> Unit tests can't be run from the crate downloaded from crates.io
>> 
>> […] The unit tests depend on a file named /benches/binary-8-msb.lzw
>> that isn't included in the crate uploaded to crates.io.
> 
> Similar issues are obvious for `rust-png` and `rust-image`. The
> underlying problem appears to be a conflict between the largely
> automated policy for packaging and crates.io archives. A little more
> in-depth:
> 
> Speaking as a crate author, the artifacts to crates.io are mainly geared
> towards consumption as a cargo dependency. For this reason, I strive to
> make them as small as possible, with no dev-/test data. (We, image-rs,
> had accidentally published ~1MB once in image-tiff and got an issue
> report for that within the day.) In any case, <crates.io> has a hard
> limit of 10MB. For this reason, it does not seem reasonable to mention
> test data in Cargo.toml, even through such mechanisms as specifying an
> additional dev-dependency.

exactly because what's published on crates.io is not always identical to 
what's in git, and the former is what consumers (reverse dependencies) 
of the crate actually use, Debian uses the released crate archives on 
crates.io as the canonical source for packaging.

this sometimes causes issues with tests not being able to run 
(properly), because
- test scripts are stripped prior to uploading to crates.io
- test data is stripped prior to uploading to crates.io
- tests rely on unpublished internal crates
- (some) dev-dependencies are stripped prior to uploading to crates.io
- intra-workspace dependencies are stripped prior to uploading to 
  crates.io

some of that can be worked around by re-adding the stripped things via 
patches, but for bigger test data in particular that's obviously a 
non-starter.
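
to get an overview of what was stripped for a given crate, one can 
compare the published archive against a git checkout. a minimal sketch 
in python (for illustration only), assuming the usual layout of a 
`.crate` file as a gzipped tarball with a single top-level 
`name-version/` directory; the function names are hypothetical and not 
part of any existing tool:

```python
import os
import tarfile


def crate_members(crate_path):
    """Relative paths of regular files in a .crate archive (top-level dir removed)."""
    with tarfile.open(crate_path, "r:gz") as tar:
        return {
            member.name.split("/", 1)[1]
            for member in tar.getmembers()
            if member.isfile() and "/" in member.name
        }


def checkout_files(root):
    """Relative paths of files below `root`, ignoring .git and target directories."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # prune directories that are never part of a published crate
        dirnames[:] = [d for d in dirnames if d not in (".git", "target")]
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.add(rel.replace(os.sep, "/"))
    return found


def stripped_files(crate_path, root):
    """Files present in the checkout but missing from the published archive."""
    return sorted(checkout_files(root) - crate_members(crate_path))
```

anything under a test or bench data directory that shows up in the 
result is a candidate for a test that cannot run from the crate alone.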

> Anyways, thinking about the issue I wanted to offer a potential
> solution. What about adding a dev-dependency that ensures the proper
> data exists, and then loading test data either out-of-band or
> dynamically? So an idea for a crate was born:
> 	xtest-data: <https://crates.io/crates/xtest-data/1.0.0-beta.2>
> 
> The basis was that downloading data over network is not desirable. Over
> time it was clear that data should be available as an archive. So, the
> package morphed into automation to create and load minimal git
> pack-files that contain a shallow, and sparse, archive of the test data.
> This makes it possible to very easily publish test data as a separate
> release artifact, for example via CI/CD Actions/… See the documentation
> of xtest-data to see it exemplified:
> 
> 	<https://github.com/HeroicKatora/xtest-data#how-to-use-offline>
> 
> So, I'm left wondering if this approach resonates with some of you. Does
> this simplify any steps for packaging? What could be done to improve
> this further? For instance, `Cargo.toml` allows some arbitrary metadata
> fields. <docs.rs/about/metadata> utilizes this quite effectively. Would
> it be any help if a reference to this additional test data archive were
> added to the Cargo.toml file (which would appear in the crate), and if
> so, in what form?

Debian uses a special tool called 'debcargo'[0] that manages the 
handling of crate/package metadata, automating away lots of the chores 
associated with packaging crates. it's also responsible for transforming 
a crate published on crates.io into a source tarball (and package) for 
consumption by debian tooling.

in theory it would of course be feasible to teach it to fetch a second 
tarball (or an arbitrary number of additional ones ;)) from sources 
other than crates.io - these would then be available at build time just 
like the regular crate source tarball generated from the crate archive 
retrieved from crates.io. in general we'd like some sort of trust anchor 
for such operations (e.g., signed tarballs, the crates.io index, ..). 
for example, debcargo could run the xtest-data fetch operation (debcargo 
runs on the packager's system when preparing the package, so network 
access is still okay at this stage).
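
to make the trust anchor idea a bit more concrete, here is a minimal 
sketch in python (for illustration only). it assumes the expected 
sha256 digest of the extra tarball is published somewhere the packager 
already trusts; `sha256_of` and `verify_tarball` are hypothetical names, 
not part of debcargo or xtest-data:

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path, expected_sha256):
    """Hypothetical check: does the fetched tarball match the pinned digest?"""
    return sha256_of(path) == expected_sha256.strip().lower()
```

a tool would refuse to use the additional tarball whenever 
`verify_tarball` returns False, the same way a mismatching checksum is 
treated for any other downloaded source.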

the main obstacle is that we'd need more than a single crate or crate 
author to use such a scheme to make it worthwhile to implement it as a 
feature in debcargo in the first place, and to support it long-term.

basically, if you want to implement it for CI purposes anyway (e.g., to 
test the crate as published as opposed to as developed in git), keep in 
mind that some semi-standard way to fetch and verify the contents based 
on the crate version would be helpful for external consumers. if enough 
people in similar situations adopt such a scheme, it makes sense for 
downstreams/distros to become consumers of it as well. for the time 
being, the "solution" will likely be to just skip the tests that fail 
for this reason (and manually test any patches that might be candidates 
for causing breakage, instead of relying on debci to catch those).
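
a minimal sketch of that skip-if-missing behaviour, in python for 
illustration only. the path is the file named in the issue report quoted 
above; `have_test_data` and `DecodeTests` are hypothetical names, and 
nothing here is taken from debcargo or debci:

```python
import os
import unittest

# file named in the issue report; assumed relative to the unpacked crate root
TEST_DATA = os.path.join("benches", "binary-8-msb.lzw")


def have_test_data(root="."):
    """True if the test data file exists below `root`."""
    return os.path.isfile(os.path.join(root, TEST_DATA))


class DecodeTests(unittest.TestCase):
    root = "."  # hypothetical: directory the crate source was unpacked into

    def test_sample_is_readable(self):
        # skip (rather than fail) when the archive was published without data
        if not have_test_data(self.root):
            self.skipTest("test data not shipped in the crate archive")
        with open(os.path.join(self.root, TEST_DATA), "rb") as fh:
            self.assertTrue(fh.read(), "test data file is empty")
```

a skipped test stays visible in the test report, which makes it easier 
to notice once the data becomes available through some other channel.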

> Feel free to contribute via answers on the mailing list, or opening
> concrete proposals on the repository.

Thanks for reaching out, and feel free to keep us posted in case there 
are developments regarding this issue!

Fabian (not a DD/DM, just someone active in packaging things in Debian 
and further downstream ;))

0: https://salsa.debian.org/rust-team/debcargo

