
Re: Status bibref gatherer (Was: Tasks pages (close to) fixed; Bibref does not seem to be updated automatically)



Hi Charles,

just a short note, because I was quite busy today!

On Sun, Mar 11, 2012 at 11:33:29PM +0900, Charles Plessy wrote:
> According to http://wiki.debian.org/UpstreamMetadata, the two following
> files would be equivalent:
> 
> ---
> Reference-PMID: 19854763
> Contact: Manolo Gouy <mgouy@biomserv.univ-lyon1.fr>
> Reference-journal: Mol Biol Evol
> Name: SeaView
> Homepage: http://pbil.univ-lyon1.fr/software/seaview
> Reference-author: Gouy, Manolo and Guindon, Stephane and Gascuel, Olivier
> Downloads: ftp://pbil.univ-lyon1.fr/pub/mol_phylogeny/seaview/archive/
> ...
> 
> ---
> Contact: Manolo Gouy <mgouy@biomserv.univ-lyon1.fr>
> Reference:
>  journal: Mol Biol Evol
>  PMID: 19854763
>  author: Gouy, Manolo and Guindon, Stephane and Gascuel, Olivier
> Name: SeaView
> Homepage: http://pbil.univ-lyon1.fr/software/seaview
> Downloads: ftp://pbil.univ-lyon1.fr/pub/mol_phylogeny/seaview/archive/
> ...
> 
> For good or bad, there is no "Reference" field in the spec.  There
> is just a hack saying that if there is a "Reference" hash, the fields
> it contains have to be considered as prefixed by "Reference-".

Ahhh, this explains quite a lot.  I took the "Reference" field for
granted because it was used heavily in practice.  When I started writing
my first debian/upstream(-metadata.yaml) file I simply copied one of
your "Pioneer" examples and cloned it over and over - probably other
people did the same.  From what I have seen since, this has become a
de facto standard rather than a hack.
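
To make the prefix rule concrete, here is a minimal sketch of how a
parsed "Reference" hash could be folded back into "Reference-" prefixed
fields.  The function name flatten_reference is hypothetical and not
part of umegaya; the input is assumed to be the dictionary a YAML parser
returns for one debian/upstream file.

```python
def flatten_reference(metadata):
    """Return a copy of metadata in which a 'Reference' hash is expanded
    into 'Reference-<key>' fields; all other fields are copied unchanged."""
    flat = {}
    for key, value in metadata.items():
        if key == "Reference" and isinstance(value, dict):
            # Apply the rule from the wiki page: prefix every sub-field.
            for subkey, subvalue in value.items():
                flat["Reference-%s" % subkey] = subvalue
        else:
            flat[key] = value
    return flat


# Hypothetical input, shortened from the SeaView example above.
nested = {
    "Name": "SeaView",
    "Reference": {"journal": "Mol Biol Evol", "PMID": 19854763},
}
flat = flatten_reference(nested)
```

With this reading, the two SeaView files quoted above end up as the
same set of flat fields.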
 
> This can be changed or clarified, and I think that the sooner we do
> it the better it is.

Yes.  I think the problem has now become quite clear to me, and I agree
we need some clarification of the proposed format.  From my point of
view the easiest way would be to just keep using what has been accepted
as the standard.  The logfiles of my importer (you might have noticed
that I pushed the preliminary + untested code into

  git://git.debian.org/git/users/plessy/umegaya.git

in subdir pre_udd_bibref - please do not hit me with a club if something
does not work yet!) show that in practice all upstream files in your
gathering SVN use the "Reference" hash.  Could you please explain how
this could break other applications, or what drawbacks it might have
that I cannot see at the moment?  You are the "umegaya master" and have
the most experience.

> Also, all the fields in the spec are "scalar": they just contain text. So
> things like the following are bogus.
> 
> Screenshots:
>  - http://pbil.univ-lyon1.fr/binaries/seaview4.png
>  - http://pbil.univ-lyon1.fr/binaries/seaview-tree.png
> 
> The problem here is that if we allow arrays, one cannot know in advance
> whether the content of a field is a single value or an array of values.
> This is difficult to parse.

From the Python parser perspective this is very simple.  The relevant
code snippet here is:
 
        if isinstance(references, list):
          # upstream file contains more than one reference
          rank=0
          for singleref in references:
            self.setref(singleref, package, rank)
            rank += 1
        elif isinstance(references, str):
          # upstream file has wrongly formatted reference
          self.log.error("File %s has following references: %s" % (ufile, references))
        else:
          # upstream file has exactly one reference
          self.setref(references, package, 0)

and you are done.  I have no idea how other parsers might deal with
this, but I'd simply assume that every language implementation has
some means to detect arrays.
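
The same idea can be shown as a self-contained sketch that turns
whatever the "Reference" field contains into a list of (rank, reference)
pairs.  The name normalize_references is hypothetical and this is not
the importer code itself; raising ValueError for a plain string is an
assumption made here in place of the logging call above.

```python
def normalize_references(references):
    """Return a list of (rank, reference) pairs.

    A list of hashes keeps its order as rank, a single hash gets rank 0,
    and a plain string is rejected as wrongly formatted.
    """
    if isinstance(references, list):
        # upstream file contains more than one reference
        return list(enumerate(references))
    if isinstance(references, str):
        # upstream file has wrongly formatted reference
        raise ValueError("wrongly formatted reference: %r" % references)
    # upstream file has exactly one reference
    return [(0, references)]


single = {"journal": "Mol Biol Evol", "PMID": 19854763}
as_hash = normalize_references(single)
as_list = normalize_references([single])
```

Both spellings of a single reference give the same result here, so
code working on the normalized list does not need to care which form the
upstream file used.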

> Back to the references, it means that the following two are not
> equivalent:
> 
> ---
> Reference:
>  journal: Mol Biol Evol
>  PMID: 19854763
>  author: Gouy, Manolo and Guindon, Stephane and Gascuel, Olivier
> ...
> 
> ---
> Reference:
>  - journal: Mol Biol Evol
>    PMID: 19854763
>    author: Gouy, Manolo and Guindon, Stephane and Gascuel, Olivier
> ...

ACK for the non-equivalence, however, the parsing code above could
perfectly cope with it.

> Regardless of what I wrote about the non-existence of a "Reference" field,
> having some upstream files where there is an array coexist with some
> upstream files where there is not is likely to cause trouble.

Because we are now back into details, I'd suggest moving to some
different medium.  While IRC might be good in principle, I can only use
IRC when I'm at home, which is usually the time when you are sleeping in
Japan.  Any suggestion for a better medium for clarifying those details?
 
> Depending on the point of view, these are bugs or features of the spec, but
> they will be difficult to change after we start to use it seriously.  We
> therefore need to make a good choice here.  The more complex the data
> structure we use, the less this file will be used directly.  And "direct use"
> also includes Lintian tests and helper integration.

I agree that lintian is quite important to check debian/upstream files
and thus we should make pretty sure that it will be able to work on
those files.  From my perspective there should be at least two checks:

   1. Valid YAML (in general):  -> Error
   2. Fields from a defined set of keywords -> otherwise warning
      (to prevent misspellings)
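
A minimal sketch of the second check, written in Python purely for
illustration (lintian itself is written in Perl).  The set KNOWN_FIELDS
and the function check_fields are assumptions made for this example and
are not taken from the spec; the first check is assumed to have already
happened, i.e. the file was parsed by a YAML library into a dictionary.

```python
import difflib

# Illustrative list only; the real set of keywords would come from the spec.
KNOWN_FIELDS = sorted(
    ["Name", "Contact", "Homepage", "Downloads", "Reference", "Screenshots"]
)


def check_fields(metadata):
    """Return one warning string per top-level field not in KNOWN_FIELDS."""
    warnings = []
    for key in metadata:
        if key in KNOWN_FIELDS:
            continue
        # Suggest the closest known keyword to catch simple misspellings.
        close = difflib.get_close_matches(key, KNOWN_FIELDS, n=1)
        if close:
            warnings.append(
                "unknown field '%s' (did you mean '%s'?)" % (key, close[0])
            )
        else:
            warnings.append("unknown field '%s'" % key)
    return warnings


result = check_fields({"Name": "SeaView", "Hompage": "http://pbil.univ-lyon1.fr/software/seaview"})
```

Note that such a check only looks at the top-level keys, so it works the
same whether a field holds plain text, a hash or an array.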

If we can ensure this on the basis of a format that follows the current
practical usage (including hashes and arrays), I personally see no need
to change what I would call the standard adopted by debian/upstream
writers.  Please shed some light on any other problems you see from your
more knowledgeable point of view, in case I have overlooked something.

Kind regards

     Andreas (and sorry for the delayed answer which will probably be
              the case each day this week).
 

-- 
http://fam-tille.de

