Parsing fasta issue in kraken [Was: Problems opening usual fasta files (#141)]

To: Debian Med Project List <debian-med@lists.debian.org>, fuchss@rki.de
Subject: Parsing fasta issue in kraken [Was: Problems opening usual fasta files (#141)]
From: Andreas Tille <andreas@an3as.eu>
Date: Thu, 3 Oct 2019 19:09:49 +0200
Message-id: <20191003170949.nzlufafcyliwjwyx@an3as.eu>

Hi,

could someone who knows more about sequences please have a look and
fix the autopkgtest for kraken and kraken2 (including providing proper
test sequences)?

Thanks a lot

      Andreas.

----- Forwarded message from Derrick Wood <notifications@github.com> -----

Date: Thu, 03 Oct 2019 09:42:35 -0700
From: Derrick Wood <notifications@github.com>
To: DerrickWood/kraken2 <kraken2@noreply.github.com>
Cc: Andreas Tille <tille@debian.org>, Mention <mention@noreply.github.com>
Subject: Re: [DerrickWood/kraken2] Problems opening usual fasta files (#141)

Hi Andreas,

Looking at the code and the example files, it does not appear that this is a FASTA parsing issue, but rather an issue with parsing individual items of the sequence ID header and their suitability within Kraken. It looks as if the FASTA headers only have GI numbers in them. These are no longer acceptable for use in Kraken (or Kraken 2) for aiding taxonomy lookups, due to NCBI's move away from GI numbers. The patch you used caused the sequence ID to become only a number (e.g., "441431932"), which was interpreted as a taxid by the kraken2lib::check_seqid() subroutine. Had the test actually tried to classify the reads, it would have found the taxids to be incorrect.

The error I'm seeing when I try to use the scan_fasta_file.pl script to examine your test FASTAs is:

    scan_fasta_file.pl: unable to determine taxonomy ID for sequence gi|441431932|

The sequence ID is being parsed correctly by the script, but with the move away from GI numbers, the sequence ID now lacks any viable token for aiding taxonomy ID lookup. Acceptable replacements are either an explicit taxid in the sequence ID (e.g., `>9606` or `>humanseq|kraken:taxid|9606`) or an accession number (e.g., `>NC_230938.1`).
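To make the three header forms above concrete, here is a minimal Python sketch that reports whether a FASTA header carries a token usable for a taxonomy lookup. It is not Kraken's code: the function name `lookup_hint` and the regular expressions are assumptions for illustration, and the accession pattern in particular is a simplification.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for illustration only; Kraken's own checks may differ.
TAXID_TOKEN = re.compile(r"(?:^|\|)kraken:taxid\|(\d+)")
BARE_TAXID = re.compile(r"^\d+$")
# Simplified accession shape: letters, optional underscore, alphanumerics, ".version"
ACCESSION = re.compile(r"^[A-Za-z]{1,6}_?[A-Za-z0-9]+\.\d+$")


def lookup_hint(header: str) -> Tuple[str, Optional[str]]:
    """Classify a FASTA header as ('taxid', id), ('accession', acc) or ('none', None)."""
    # The sequence ID is the first whitespace-delimited field after '>'.
    fields = header.lstrip(">").split()
    seqid = fields[0] if fields else ""

    match = TAXID_TOKEN.search(seqid)
    if match:
        return ("taxid", match.group(1))
    if BARE_TAXID.match(seqid):
        return ("taxid", seqid)
    if ACCESSION.match(seqid):
        return ("accession", seqid)
    # GI-only headers such as "gi|441431932|" end up here.
    return ("none", None)


if __name__ == "__main__":
    for line in (">9606", ">humanseq|kraken:taxid|9606",
                 ">NC_230938.1", ">gi|441431932|"):
        print(line, lookup_hint(line))
```

Running such a check over the test FASTAs would flag every GI-only header as lacking a lookup token, which matches the error message quoted above.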

In short, the test is failing because it is no longer appropriate. The Kraken 2 test should also have the `--minimizer-len 5` removed because minimizer behavior is different in Kraken 2 vs. Kraken 1. In K1, the length of minimizers governed the size of the `database.idx` file, which would be rather large (8 GB) by default, so changing the minimizer length for a test like this made sense. In Kraken 2, no such index exists. Removing the `--minimizer-len 5` should allow the K2 test to work without needing to comment out the build/classification commands.
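If the test builds its command line programmatically, dropping the option could look roughly like the Python sketch below. The helper `drop_option` and the database name `testdb` are hypothetical; only the `--minimizer-len 5` option is taken from the message above, and the rest of the example command is an assumption about how the test invokes the build.

```python
from typing import List


def drop_option(argv: List[str], flag: str) -> List[str]:
    """Return a copy of argv without `flag` and its value.

    Handles both "--flag value" and "--flag=value" spellings.
    """
    result = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This is the value that belonged to the removed flag.
            skip_next = False
            continue
        if arg == flag:
            skip_next = True
            continue
        if arg.startswith(flag + "="):
            continue
        result.append(arg)
    return result


if __name__ == "__main__":
    # Assumed shape of the test's build command; "testdb" is a placeholder.
    k1_style = ["kraken2-build", "--build", "--db", "testdb",
                "--minimizer-len", "5"]
    print(drop_option(k1_style, "--minimizer-len"))
```

The helper returns a new list rather than mutating its input, so the original Kraken 1 style arguments remain available for the K1 test.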

-- 
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/DerrickWood/kraken2/issues/141#issuecomment-538026447

----- End forwarded message -----

-- 
http://fam-tille.de
