Hello,
TL;DR: I propose packaging the frequently updated PDB Chemical
Component Dictionary. Reasons, technical solutions and limitations are
below.
The PDB Chemical Component Dictionary (CCD) [1] is a single-file
(~400 MB uncompressed) collection of small molecule components found in
PDB entries. It is used by at least two Debian packages:
openstructure, which needs it as a build dependency, and libcifpp.
For openstructure I have resorted to putting a copy of the CCD in the
debian/ directory to fulfill the build requirement, and then shipping
it as /usr/share/openstructure/components.cif.gz. As a result, the
packaged CCD is not updated as frequently as upstream releases it.
Moreover, large debian/ directories are frowned upon. Therefore I
would like to move the CCD into a package of its own.
The libcifpp package provides a cron task which keeps an up-to-date
CCD in its cache directory. This is good, as a Debian-packaged CCD
file would stay static between Debian releases. However, it does not
help with building openstructure, because builds have no network
access.
I propose packaging the CCD as a separate source package. It has no
version number, so the update date would have to be used instead. I
have hacked together a watch file to check for new versions, but it
fails at the mk-origtargz step:
version=4
opts="downloadurlmangle=s|status.*|monomers/components.cif.gz|,filenamemangle=s|(\d+)/$|ccd-$1.gz|" \
https://files.wwpdb.org/pub/pdb/data/status/ \
https://files.wwpdb.org/pub/pdb/data/status/(\d+)/
Thus the tarball would have to be produced by a get-orig-source target
in debian/rules, unless there are other solutions.
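For illustration, a helper called from such a target could look
roughly like the sketch below. It is only a sketch under assumptions:
the source package name "pdb-ccd", the function names and the
.orig.tar.xz layout are hypothetical, and it expects components.cif.gz
to have been downloaded already.

```python
import datetime
import tarfile
from pathlib import Path

# Hypothetical source package name; no such package exists yet.
PACKAGE = "pdb-ccd"


def upstream_version(release_date: datetime.date) -> str:
    """Turn the CCD update date into an upstream version string (YYYYMMDD)."""
    return release_date.strftime("%Y%m%d")


def make_orig_tarball(ccd_path: Path, release_date: datetime.date,
                      out_dir: Path) -> Path:
    """Wrap an already downloaded CCD file into <package>_<version>.orig.tar.xz."""
    version = upstream_version(release_date)
    top_dir = f"{PACKAGE}-{version}"
    tarball = out_dir / f"{PACKAGE}_{version}.orig.tar.xz"
    with tarfile.open(tarball, "w:xz") as tar:
        # Keep the file under a single top-level directory inside the tarball.
        tar.add(ccd_path, arcname=f"{top_dir}/{ccd_path.name}")
    return tarball
```

The download itself is left out on purpose, so that the network access
stays in the maintainer's hands and away from the build.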
Here I would like to ask for comments and suggestions. I am aware that
packaging large and frequently updated data files is not the usual
practice, but I believe that doing so would both resolve the problems
with building openstructure and benefit users who need a stable CCD
version.
[1] https://www.wwpdb.org/data/ccd
Best wishes,
Andrius