Re: Biological data being used by an unpublished research paper is considered proprietary

To: Peter Rice <ricepeterm@yahoo.co.uk>
Cc: debian-med@lists.debian.org, debian-devel@lists.debian.org
Subject: Re: Biological data being used by an unpublished research paper is considered proprietary
From: Faheem Mitha <faheem@faheem.info>
Date: Thu, 19 Sep 2013 01:50:48 +0530 (IST)
Message-id: <alpine.DEB.2.02.1309180045240.3889@orwell.homelinux.org>
In-reply-to: <5236F28F.2020800@yahoo.co.uk>
References: <alpine.DEB.2.02.1309161448160.3889@orwell.homelinux.org> <5236F28F.2020800@yahoo.co.uk>


Hi Peter,

Thank you for your very helpful answer. Seriously, it is rare to get
such a good answer on such a topic. I actually read your response on
academia.sx before I saw your email, and I should have guessed such
a good answer would have come from a Debian person. Also, I see you
registered the same day as your answer. :-)

I'm keeping debian-devel and debian-med cc'd for now, because I do
have some general questions about biological data licensing. If the
lists want me to go away, just say so.

Since you posted your answer publicly, I'm assuming you don't mind if
I quote it. I recommend you post your answer to the Debian lists,
since there is no guarantee that academia.sx will be around forever.

See responses inline. I'm afraid there are a lot of questions, but I
really can't pass up the opportunity to get some answers for
once. Sorry about that.

If you don't want to answer my questions (and let's face it, you
probably don't) perhaps you can suggest some suitable mailing
list(s)/forum(s)?

On Mon, 16 Sep 2013, Peter Rice wrote:

On 16/09/2013 11:31, Faheem Mitha wrote:

Hi,

This is really not Debian-related, except insofar as the software in
question is something that might have been in Debian one day. I talked
about that with people on debian-med recently. So, it is technically
off-topic.

I posted a reply on stackexchange with instructions to get the data
from the EBI SRS server.

However, I have run into this issue before in the context of
biological database entries and Debian so it may be worth discussing
here. There were objections to including SwissProt entries in the
example data for the EMBOSS package because the licensing of
SwissProt does not allow them to be edited.  That was resolved by
agreeing that scientific facts should not be edited so that the
files could be accepted as part of a Debian package even though they
could not be changed. A fine compromise I feel.


So, what license did these files go into Debian as?

regards,

Peter Rice
EMBOSS team

The copyright is probably on the full database release flatfile and
the formatted entries ... you will find similar conditions for
UniProt/SwissProt so it is not so unusual.


Yes, but I'm not trying to download their entire database, just a
small portion of it.

The restrictions on scripts are common to prevent server performance
hits from a large number of requests.


Is such a restriction legally enforceable? I don't see how one can
distinguish between a human user downloading using say curl, and a
script using curl with random pauses between downloads. Or is acceding
to such a request just a matter of common courtesy?

You can simply invite reviewers to download the data from some other
server, for example from the EBI SRS server. The URL for entry
A00673 would be

"http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[IMGTLIGM-ID:a00673]+-view+FastaSeqs+-ascii"


Wow, that works for me! Cool. I've tried before to download data from
other biological data web services, but have always fallen down
confused at the complexity of the sites and the multiplicity of their
options. IMGT is practically the only such site I have found that I
was able to navigate without getting brain fever.

http://www.ebi.ac.uk/miriam/main/collections/MIR:00000287

So a few possibly dumb questions.

Question 1: Is there no general agreement on the licensing of
biological data of the kind we are talking about? This seems
strange. Aren't such data biological "facts", as you put it in your
message? To me, it makes as much sense as trying to treat the list of
prime numbers or any other such mathematical facts as proprietary
information.

Specifically, I don't understand how IMGT can claim to own this data,
to the extent of forbidding its redistribution. They didn't produce
this data themselves, did they?

Question 2: It looks like EBI is hosting a copy of the IMGT
database. Is that right?

Also, there are a lot of different kinds of accession numbers. Which
accession numbers is IMGT using here?

Also, do you know of other servers that have the same data?

You can also use a list of accessions, for example A00673 or A01650

"http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[IMGTLIGM-ID:a00673|a01650]+-view+FastaSeqs+-ascii"

If downloading many entries you should pause between requests, but
putting lists into the URLs may reduce it to few enough not to cause
a problem. I doubt EBI would be upset by 200 requests - they would
be concerned about thousands.


This is *really* useful. I see each of these "list" requests produces
one fasta file with multiple sequences in it. I think this would be a
better way to go than producing hundreds of fasta files, each
containing a single sequence, as I have been doing. Also, unlike IMGT,
one just downloads a FASTA file directly, without having to trim off
HTML stuff. I suspect that each request corresponds at the backend to
a SQL query, and if so, I'm sure the system would prefer one larger
SQL query to many small ones.

Can one do the same trick with the IMGT servers?

In my case, I'm downloading gene segments which contain one or more
Recombination Signal Sequences (RSS). I'm doing this for human RSS (63
segments) and mouse RSS (146 segments), so maybe it would make sense
to download the human segments and the mouse segments as one fasta
file each.
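
A minimal sketch of that batching idea, in Python (chosen only for
illustration). The URL pattern and the lowercase accession form are
copied from the two example URLs quoted above. The helper names, the
batch size of 50, and the 2-second pause are my own assumptions, not
anything EBI documents, and the fetch step has not been tested
against the server.

```python
import time
from urllib.request import urlopen

# Base address taken from the example URLs quoted in this thread.
SRS_BASE = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"


def build_srs_url(accessions):
    """Build one request URL covering a list of accessions."""
    # The quoted examples use lowercase IDs separated by "|".
    ids = "|".join(acc.lower() for acc in accessions)
    return f"{SRS_BASE}?[IMGTLIGM-ID:{ids}]+-view+FastaSeqs+-ascii"


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta(accessions, batch_size=50, pause_seconds=2.0):
    """Fetch all accessions in a few batched requests (hypothetical helper).

    batch_size and pause_seconds are assumed values, picked only to
    keep the number of requests small and spaced out.
    """
    chunks = []
    for i, batch in enumerate(batched(list(accessions), batch_size)):
        if i > 0:
            time.sleep(pause_seconds)  # pause between requests
        with urlopen(build_srs_url(batch)) as response:
            chunks.append(response.read().decode("ascii", errors="replace"))
    return "".join(chunks)
```

With 63 human and 146 mouse segments, calling something like
fetch_fasta once per species would come to about five requests in
total, rather than a couple of hundred.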

This might be a good place to ask:

Question 3: I'm using the files
http://www.itb.cnr.it/rss/stats/HS12RSS.fasta and
http://www.itb.cnr.it/rss/stats/MM12RSS.fasta as the source of the
RSS.

In each of these files, before the listing of each sequence, there is
an annotation (I hope this is the right word). E.g. in HS12RSS.fasta
there is

HPRT_12

CACACACACACACACACACACAAATACA

So, here for example "HPRT_12" is the annotation, but I have no idea
what it refers to. In some cases I was able to look up these
annotations at IMGT. For example, again in HS12RSS.fasta, there is

TRAJ3*01

CACTGTGGGTAAGGTCTTTGAGATAACC

and I was able to look up TRAJ3*01 in

http://www.imgt.org/IMGTrepertoire/LocusGenes/#h1_32
->
http://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ

But in many cases I was not able to. Do you have any idea what those
other strings refer to?  Here are a couple more, also from
HS12RSS.fasta.

LCK

CACACACACACACACACAAGCCAAAACC

LMO2

CACAGTATTGTCTTACCCAGCAATAATT
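
To see how many labels fall into each group, something like the
following Python sketch could list them. It assumes the files are
ordinary FASTA, with each label on a line starting with ">". It also
assumes that labels I can look up at IMGT look like "TRAJ3*01", that
is, a gene name, an asterisk, and an allele number. That pattern is
my guess from the examples above, not an official IMGT rule, and the
function names are made up for illustration.

```python
import re

# Assumed shape of an IMGT-style allele name, e.g. "TRAJ3*01".
IMGT_ALLELE = re.compile(r"^[A-Z0-9/\-]+\*\d+$")


def read_fasta(text):
    """Parse FASTA text into a list of (label, sequence) pairs."""
    records = []
    label, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(parts)))
            label, parts = line[1:].strip(), []
        elif label is not None:
            parts.append(line)  # sequence may span several lines
    if label is not None:
        records.append((label, "".join(parts)))
    return records


def split_labels(records):
    """Separate labels matching the assumed IMGT pattern from the rest."""
    imgt_like, other = [], []
    for label, _sequence in records:
        (imgt_like if IMGT_ALLELE.match(label) else other).append(label)
    return imgt_like, other
```

Run on the four entries quoted above, this would put TRAJ3*01 in the
first list and HPRT_12, LCK and LMO2 in the second, which are the ones
I cannot identify.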

There are various fasta formats available for IMGT data; you need to
find a server that produces fasta files compatible with your input
requirements.


I thought there was one standard fasta file format.

Alternatively of course your reviewers could download the whole
database from IMGT or any of the other servers (including
ftp://ftp.ebi.ac.uk/pub/databases/imgt/) and generate their own
fasta subset from the list of accessions/ids


They could, but I don't see the point of it. The reviewers may not
have any special interest in the data unless they happen to be
biologists working with that sort of data, and will probably want to
expend as little effort working with the data as they can.

                                                     Regards, Faheem
