Packaging PDB Chemical Component Dictionary
Hello,
TL;DR: I propose packaging the frequently updated PDB Chemical Component
Dictionary. Reasons, technical solutions and limitations below.
The PDB Chemical Component Dictionary (CCD) [1] is a single-file
collection (~400 MB uncompressed) of small molecule components found in
PDB entries. It is used by at least a couple of Debian packages:
openstructure, which needs it as a build dependency, and libcifpp.
For openstructure I have resorted to putting a copy of the CCD in the
debian/ directory to fulfill the build requirement and then shipping it
as /usr/share/openstructure/components.cif.gz. However, as a result the
packaged CCD is not updated as frequently as it is released upstream.
Moreover, large debian/ directories are frowned upon. Therefore I would
like to move the CCD out into a package of its own.
The libcifpp package provides a cron task which keeps an up-to-date CCD
in its cache directory, which is good, as a Debian-packaged CCD file
would stay static between Debian releases. However, this does not help
with building openstructure, because network access is not available
during the build.
I propose packaging the CCD as a separate source package. It has no
version number, so the update date would have to be used instead. I have
hacked together a watch file to check for new versions, but it fails at
the mk-origtargz step:
version=4
opts="downloadurlmangle=s|status.*|monomers/components.cif.gz|,filenamemangle=s|(\d+)/$|ccd-$1.gz|" \
https://files.wwpdb.org/pub/pdb/data/status/ \
https://files.wwpdb.org/pub/pdb/data/status/(\d+)/
Thus the tarball would have to be produced by a get-orig-source target
in debian/rules, unless there are other solutions.
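To make the get-orig-source idea more concrete, here is a minimal Python
sketch of the two steps such a helper would need: picking the newest
dated directory from the status/ index, and wrapping the single
components.cif.gz file into an orig tarball. The source package name
"ccd", the function names and the assumed shape of the index page
(links of the form href="<digits>/") are my own placeholders, not
anything that exists yet. Downloading is left out on purpose.

```python
import gzip
import io
import re
import tarfile


def latest_version(listing_html: str) -> str:
    """Return the newest all-digit directory name from a status/ index page.

    Assumes the index links look like href="20240112/"; adjust the
    pattern if the real listing differs.
    """
    dates = re.findall(r'href="(\d+)/"', listing_html)
    if not dates:
        raise ValueError("no dated directories found in listing")
    return max(dates, key=int)


def make_orig_tarball(cif_gz: bytes, version: str, package: str = "ccd"):
    """Wrap a gzip-compressed components.cif into <package>_<version>.orig.tar.gz.

    Returns (file name, tarball bytes). "ccd" is a hypothetical source
    package name.
    """
    data = gzip.decompress(cif_gz)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # Single top-level directory, as is conventional for orig tarballs.
        info = tarfile.TarInfo(name=f"{package}-{version}/components.cif")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return f"{package}_{version}.orig.tar.gz", buf.getvalue()
```

This keeps the whole file in memory, which is tolerable for a sketch but
wasteful at ~400 MB; a real helper would stream from disk instead.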
Here I would like to ask for comments and suggestions. I am aware that
packaging large and frequently updated data files is not the usual
practice, but I believe that doing so would both resolve the problems
with building openstructure and benefit users who need a stable CCD
version.
[1] https://www.wwpdb.org/data/ccd
Best wishes,
Andrius