
Using BibRef from upstream-metadata.yaml (Was: Multiple publication data in upstream-metadata.yaml)



Hi Charles,

On Thu, Jan 12, 2012 at 09:48:04AM +0900, Charles Plessy wrote:
> This is likely to be usable in the blend's sentinel pages.  However, I have not
> implemented this.  I do not have a particularly good excuse, except that I am
> not a python programmer, so I would need to block a long slot of time to really
> get into it, and recently such slots I have given them to DEP 5 or to my
> attempts to use Debian Installer to prepare Amazon Machine Images (see
> http://charles.plessy.org/Debian/debiâneries/nuage/ and #637784).
> 
> The data is in the UDD, so anybody can give it a try.

I have tried to use the information from upstream-metadata.yaml in the
tasks pages.  This has worked so far, but I noticed that the data in UDD
is not yet of the quality I would like.  First, I noticed that, for
instance, package perlprimer has two entries for PMID (one of them is
empty).  This could be avoided if we used

  PRIMARY KEY (package,key)

(I'm sure I suggested this before).  I admit that this could lead to
more import errors.  However, this kind of import error would force us
to check the input data immediately, which is a desired effect IMHO.
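
To illustrate the effect of such a key, here is a minimal sketch.  It
uses an in-memory SQLite database as a stand-in for PostgreSQL, and the
table layout (package, key, value) is my assumption, not the real UDD
schema.  The second perlprimer entry is rejected instead of being
imported silently:

```python
import sqlite3

# SQLite stands in for PostgreSQL; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bibref ("
    " package TEXT NOT NULL,"
    " key TEXT NOT NULL,"
    " value TEXT,"
    " PRIMARY KEY (package, key))"
)

rows = [
    ("perlprimer", "PMID", "15073005"),
    ("perlprimer", "PMID", ""),  # second entry for the same (package, key)
]

rejected = []
for row in rows:
    try:
        conn.execute("INSERT INTO bibref VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as err:
        # The composite primary key turns the duplicate into an import error.
        rejected.append((row, str(err)))

for row, reason in rejected:
    print("rejected", row, reason)
```

An importer could collect these rejections into a report, so the
broken input gets looked at right away.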

I investigated the UDD bibref table some more and noticed that, at a
minimum, Reference-Journal is missing.  It is used on the tasks pages
and should be injected as well.  Volume, number and pages might also be
worth propagating from upstream-metadata.yaml to UDD.  We also use URL
and eprint, which are missing as well.

I tried to find out why these fields are missing and checked the
intermediate format you are using for the import, which is obtained
via

  wget -q http://upstream-metadata.debian.net/for_UDD/biblio.yaml

IMHO this format is really not the best choice (it is not even YAML,
right?)

As far as I understand, you are generating this data from another
database at upstream-metadata.debian.net, so you could choose the most
convenient format for UDD yourself.  If I were you I would keep things
simple and use a format that is suitable for the PostgreSQL COPY
command[1].  The only drawback I can see is that it might be harder to
debug if the data violates the suggested primary key.  Currently the
problematic part in your data input is:

- perlprimer
- PMID
- 15073005

---
- perlprimer
- PMID
- ''

Besides the fact that you should not export empty values anyway, there
should be some mechanism to avoid duplicates.
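
A rough sketch of what I mean, under my own assumptions: the records
are (package, key, value) triples, and the function names are made up
for illustration.  Empty values are dropped, a second value for the
same (package, key) is reported rather than exported, and the result is
written as tab-separated lines with the backslash escapes of the COPY
text format:

```python
def clean_records(records):
    """Drop empty values and report duplicate (package, key) pairs."""
    seen = {}
    duplicates = []
    for package, key, value in records:
        if value is None or str(value).strip() == "":
            continue  # empty values are not exported at all
        if (package, key) in seen:
            duplicates.append((package, key, value))
            continue
        seen[(package, key)] = str(value)
    return [(p, k, v) for (p, k), v in seen.items()], duplicates


def copy_escape(field):
    """Escape one field for the PostgreSQL COPY text format."""
    return (field.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_copy_lines(records):
    """Render records as tab-separated lines, one record per line."""
    return ["\t".join(copy_escape(f) for f in rec) for rec in records]


records = [
    ("perlprimer", "PMID", "15073005"),
    ("perlprimer", "PMID", ""),
]
cleaned, duplicates = clean_records(records)
lines = to_copy_lines(cleaned)
print("\n".join(lines))
```

Whether the first or the last value should win for a real duplicate is
a policy question; this sketch keeps the first and reports the rest.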

Another option would be a pretty simple CSV file, because Python has a
cool CSV reader (csv.DictReader) which uses the first line as keys for
a dictionary.
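
For illustration, reading such a file could look roughly like this.
The column names package, key and value are only my assumption of what
the header line might contain, and the sample is inlined as a string
instead of being read from a real export:

```python
import csv
import io

# Hypothetical CSV export; the header line defines the dictionary keys.
sample = "package,key,value\nperlprimer,PMID,15073005\n"

reader = csv.DictReader(io.StringIO(sample))
entries = list(reader)

for entry in entries:
    print(entry["package"], entry["key"], entry["value"])
```

With a real file, the io.StringIO object would simply be replaced by an
open file handle.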

I'd volunteer to write the importer for this (or any other format you
might choose: YAML, XML, whatever you prefer), including a check of the
primary key constraint.  The precondition, however, would be a data
file that contains all the values mentioned above.

I can confirm that yesterday I wrote some code which reads the
available data from UDD and uses it on the tasks pages.  In the next
couple of days I could make this available on some separate pages to
verify that everything looks as expected.

Kind regards

         Andreas.


[1] http://www.postgresql.org/docs/9.1/static/sql-copy.html

-- 
http://fam-tille.de

