
getData - seems to work


Over the last few days I surprised myself with how much fun I
had while following Charles' download and post-processing
instructions for the complete genomes.

To keep the getData Perl script clean, Charles came up with
the idea of moving a good part of the functionality into
shared Makefiles. Modularisation. Here is the beast that knows
how to retrieve full genomes from Ensembl:

$ more getData.conf.d/Ensembl_genome.mk
SHARED_WGET_OPTIONS=$(shell getData --getWgetOptions)

MIRROR = ftp://ftp.ensembl.org/pub/release-$(ENSEMBLVERSION)/fasta

# Download the gzipped per-chromosome FASTA files.
get:
	echo "I: Retrieving data for Ensembl version $(ENSEMBLVERSION) species $(ORGANISM_L)"
	wget $(SHARED_WGET_OPTIONS) $(MIRROR)/$(ORGANISM_L)/dna/$(ORGANISM).*.$(ENSEMBLVERSION).dna.chromosome.*.fa.gz

# Remove stale .fa files, then decompress every downloaded chromosome.
unpack:
	find . -maxdepth 1 -name "*.fa" -delete
	for file in *chromosome.*.fa.gz ; do zcat $$file > `basename $$file .gz` ; done

# Build a nucleotide BLAST index, preferring BLAST+ over legacy BLAST.
blast:
	if [ -x /usr/bin/makeblastdb ]; then \
		echo "I: Found BLAST+ (preferred) for indexing"; \
		cat *fa | makeblastdb -title $(NICKNAME) -dbtype nucl -out $(NICKNAME); \
	elif [ -x /usr/bin/formatdb ]; then \
		echo "I: Found legacy BLAST for indexing"; \
		cat *fa | formatdb -i /dev/stdin -t $(NICKNAME) -n $(NICKNAME) -p F ; \
	fi

The part that calls this Makefile is

$ more getData.conf.d/human.getData
print STDERR "Reading Homo sapiens configuration file\n" if $verbose;

  "name" => "hg18/NCBI36 – Genome Reference Consortium from Ensembl",
  "tags" => ["human","genome"],
  "source" => "make ORGANISM=Homo_sapiens ORGANISM_L=homo_sapiens ENSEMBLVERSION=54 NICKNAME=hg18 -f /etc/getData.conf.d/Ensembl_genome.mk get unpack",

  "post-download" => "make NICKNAME=hg18 -f /etc/getData.conf.d/Ensembl_genome.mk blast",
  "depends" => "make",
  "recommends" => "ncbi-blast+",
  "size" => "39G"

  "name" => "hg19/GRCh37 – Genome Reference Consortium from Ensembl",
  "tags" => ["human","genome"],
  "source" => "make ORGANISM=Homo_sapiens ORGANISM_L=homo_sapiens ENSEMBLVERSION=75 NICKNAME=hg19 -f /etc/getData.conf.d/Ensembl_genome.mk get unpack",
  "post-download" => "make NICKNAME=hg19 -f /etc/getData.conf.d/Ensembl_genome.mk blast",
  "depends" => "make",
  "recommends" => "ncbi-blast+",
  "size" => "39G"


The size attribute is not used at the moment. At some point getData
should warn when there is too little disk space.
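
A rough sketch of what such a check could look like, written as shell
helpers rather than as part of getData itself. The function names
size_to_kib and has_space are made up for illustration, and the sketch
assumes the size attribute is always a whole number followed by K, M, G
or T, as in "39G":

```shell
#!/bin/sh
# Hypothetical helper: convert a size such as "39G" into KiB.
# Only whole numbers with a K, M, G or T suffix are handled.
size_to_kib() {
    num=${1%[KMGT]}
    case "$1" in
        *K) echo "$num" ;;
        *M) echo $((num * 1024)) ;;
        *G) echo $((num * 1024 * 1024)) ;;
        *T) echo $((num * 1024 * 1024 * 1024)) ;;
        *)  echo "E: unsupported size '$1'" >&2; return 1 ;;
    esac
}

# Hypothetical helper: succeed if the filesystem holding directory $1
# has at least $2 (for example "39G") available.
has_space() {
    need=$(size_to_kib "$2") || return 1
    # df -P prints one line per filesystem; column 4 is the free space in KiB.
    avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail" -ge "$need" ]
}

# Example: warn before starting a 39G download into the current directory.
has_space . 39G || echo "W: less than 39G free in $(pwd)" >&2
```

The check only warns. Whether getData should refuse to start the
download in that case is a separate decision.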

The downside of this arrangement, for now, is that the BLAST indices
are scattered across the different data directories. I would very much
like BLAST to find them without having to set environment variables.

Ideas are welcome.
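
One possible direction, purely as a sketch and not something getData
does today: symlink every index file into one shared directory, so that
there is a single location to pass to -db or to name once in a BLAST
configuration file. The function name link_blast_indices and both
directory arguments are hypothetical, and the sketch assumes the
indices are nucleotide databases with the usual .nhr, .nin, .nsq and
.nal files:

```shell
#!/bin/sh
# Hypothetical helper: symlink every nucleotide BLAST index file found
# below directory $1 into the single directory $2.
link_blast_indices() {
    # Resolve the source root to an absolute path so the links stay valid
    # no matter where the destination directory lives.
    src_root=$(cd "$1" && pwd) || return 1
    dest=$2
    mkdir -p "$dest" || return 1
    find "$src_root" -type f \
        \( -name '*.nhr' -o -name '*.nin' -o -name '*.nsq' -o -name '*.nal' \) |
    while IFS= read -r index; do
        # -f replaces a link left over from an earlier run.
        ln -sf "$index" "$dest/"
    done
}

# Example call (both paths are placeholders):
#   link_blast_indices /path/to/data /path/to/blastdb
```

This does not remove the need to tell BLAST about the shared directory
once, but it reduces the problem to one path instead of one per genome.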


