
Re: Providing components.cif.gz [Was: Re: DeepMind’s AI Advanced Protein Structure Prediction tool Open Sourced]



Hi Andrius,

On 09-09-2021 at 15:14, Andrius Merkys wrote:
But I would not mind having a system wide service to update data files
like these. Perhaps with a log with version info, so you can look up
what version was used at what date.
Indeed, it would be nice to find a generic solution, but this might be
tricky. There are conflicting needs of stability (no updates), freshness
(updates every day) and multi-user support (no updates and updates
every day, all at once on the same machine). The only solution I can think
of now is keeping all the downloaded versions with version/date in their
names like:

/var/cache/pdb/components/components-20210814.cif.gz
/var/cache/pdb/components/components-20210820.cif.gz
/var/cache/pdb/components/components-20210826.cif.gz
...
(maybe /var/cache/pdb/components/components.cif.gz symlink to the latest)

Then a user would set an environment variable, say PDB_COMPONENTS, to point
to a file with version in its name should they need a specific stable
database, and would use /var/cache/pdb/components/components.cif.gz
should they need the most up-to-date one.
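
A minimal sketch of how a tool might resolve the file under this scheme. The directory and the PDB_COMPONENTS variable are the ones proposed above; the function names components_path and latest_snapshot are hypothetical, and nothing here is an existing interface.

```python
import os
from pathlib import Path

# Layout proposed in this thread; not an established convention.
CACHE_DIR = Path("/var/cache/pdb/components")
DEFAULT_FILE = CACHE_DIR / "components.cif.gz"  # symlink to the latest snapshot


def components_path(environ=None):
    """Return the file pinned via PDB_COMPONENTS, else the 'latest' symlink."""
    environ = os.environ if environ is None else environ
    pinned = environ.get("PDB_COMPONENTS")
    return Path(pinned) if pinned else DEFAULT_FILE


def latest_snapshot(cache_dir=CACHE_DIR):
    """Return the newest components-YYYYMMDD.cif.gz in cache_dir, or None."""
    pattern = "components-" + "[0-9]" * 8 + ".cif.gz"
    # Eight-digit YYYYMMDD stamps sort chronologically as plain strings.
    snapshots = sorted(Path(cache_dir).glob(pattern), key=lambda p: p.name)
    return snapshots[-1] if snapshots else None
```

A user who needs a stable database would export PDB_COMPONENTS with the dated file name; everyone else falls through to the symlink, which an update job could repoint at whatever latest_snapshot() returns.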

Does this sound reasonable?

I think a bit more is required. Looking at the FAIR principles[1], I can see a few other issues coming up. It would be nice to have, e.g., a JSON file alongside the data files containing a hash, the download date and other metadata for each of them. Then, if you store the hash (and perhaps more metadata) of the data file along with your results, you can always recover which version of the data file was used.
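
A rough sketch of such a JSON sidecar, only to make the idea concrete. The field names, the choice of SHA-256 and the ".json" suffix are assumptions for illustration, not an agreed format; write_sidecar and matches are hypothetical helper names.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large download need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(data_file, source_url=None):
    """Write <data_file>.json next to the data file and return the metadata."""
    data_file = Path(data_file)
    meta = {
        "file": data_file.name,
        "sha256": sha256_of(data_file),
        "size": data_file.stat().st_size,
        # Time the sidecar is written, taken here as the download date.
        "downloaded": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "source": source_url,
    }
    sidecar = data_file.with_name(data_file.name + ".json")
    sidecar.write_text(json.dumps(meta, indent=2) + "\n")
    return meta


def matches(data_file, recorded_sha256):
    """True if data_file is the version whose hash was stored with the results."""
    return sha256_of(data_file) == recorded_sha256
```

Storing meta["sha256"] with each result would later let matches() identify which cached snapshot produced it, independent of the file name.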

In the PDB-REDO database we're trying to do this for, e.g., the versions of all the tools used to create a record.

-maarten

[1] https://en.wikipedia.org/wiki/FAIR_data

-- 
Maarten L. Hekkelman
http://www.hekkelman.com/
