
Re: mcl package tests



Hi Saira,

nice to observe your progress with mcl.  Although this is not a trivial
task either, it seems to have gone much more smoothly this time!  I'm really
happy about this.

On Thu, Jul 11, 2019 at 12:45:44PM +0100, Saira Hussain wrote:
> Hi all,
> 
> in order to test the package mcl, I am using the approach described in the
> paper by Lima-Mendez et al. (DOI: 10.1093/molbev/msn023): a method of
> clustering bacteriophages based on shared gene content.
> 
> I based mine on some modifications of an original implementation by Ksenia
> Arkhipova. I have already pushed some files to the master but I expect this
> to take another day or so.
> 
> Right now I have one blockage: I am doing a 5-step process
> 
> 1. Using the package prodigal to generate a database file
> 
> 2. Using awk to filter this
> 
> 3. Using mcl to generate protein families
> 
> 4. Using a custom python script to do pairwise comparisons
> 
> 5. Out of the files generated in the previous step, I am using mcl again to
> build clusters of genomes.

Thanks for that description.  I'd recommend you add this to
debian/tests/data/README.  Please `git pull` first since I made some changes.
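
For the README, steps 4 and 5 could be illustrated by something like the
following Python sketch.  It is not the script in the repository: the
function names, the 'genome|protein' label convention and the plain
shared-family count used as edge weight are all assumptions made for
illustration, standing in for whatever the real script computes.  It reads
mcl cluster output (one cluster per line, tab-separated labels) and emits
label/label/weight lines of the kind mcl reads with --abc.

```python
from collections import defaultdict
from itertools import combinations


def parse_mcl_clusters(text):
    """Return one list of labels per non-empty line of mcl cluster output."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def families_per_genome(clusters, sep="|"):
    """Map each genome to the set of family indices it has a protein in.

    Assumes every protein label looks like 'genome|protein' (hypothetical).
    """
    genome_families = defaultdict(set)
    for family_id, members in enumerate(clusters):
        for label in members:
            genome = label.split(sep, 1)[0]
            genome_families[genome].add(family_id)
    return genome_families


def shared_family_edges(genome_families):
    """Yield (genome_a, genome_b, shared_count) for pairs sharing a family."""
    for a, b in combinations(sorted(genome_families), 2):
        shared = len(genome_families[a] & genome_families[b])
        if shared:
            yield a, b, shared


def to_abc(edges):
    """Format edges as tab-separated 'label label weight' lines."""
    return "".join(f"{a}\t{b}\t{w}\n" for a, b, w in edges)
```

The text returned by to_abc() would then be written to a file and given to
mcl for the genome clustering in step 5.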

> On that I have the following question:
> 
> - In order to complete step 5 I need to process a file generated in step
> 3, using the custom Python script. Now, as we discussed before, I expect I
> can leave the script in (it's already on master) in order to document this,
> but is it the right approach to have prepared a test file that is supposed to
> come out of a previous test result?

I admit that this question would be better directed to a
bioinformatician.  From my general point of view I'd say: Why not.  We
want to prove that mcl is working correctly with some given data set, no
matter how this data set was created.  We simply document how the
data were created to leave no questions about copyright and
reproducibility open.  So for me it's fine as you described it (if
I understood it correctly).

> That means I expect that the previous
> test result has been successful!

If the result is sensible according to your scientific insight, then that is a
sensible test.  One important point of Continuous Integration is to
reproduce previous results.  We have this result for now and we want
exactly this result in the future.  If future research uncovers some issue
with our data, we can adjust the data accordingly.  But for
the given point in time the test is fine, IMHO.
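
A regression check of this kind could look roughly like the sketch below.
It assumes the usual mcl cluster output (one cluster per line, tab-separated
labels) and compares a fresh result with a stored reference while ignoring
the order of clusters and of members within a cluster.  The function names
are hypothetical and not taken from the package.

```python
def read_clusters(text):
    """Parse cluster output into a set of frozensets, so order is ignored."""
    return {
        frozenset(line.split("\t"))
        for line in text.splitlines()
        if line.strip()
    }


def same_clustering(expected_text, actual_text):
    """Return True if both outputs describe exactly the same clusters."""
    return read_clusters(expected_text) == read_clusters(actual_text)
```

A test script would read the stored reference file and the freshly generated
output and fail when same_clustering() returns False.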
 
> Let me know if that makes sense.

So I think it makes sense.  Thanks for the effort you have put into
obtaining these data!
 
> p.s. Andreas: I don't need to include the testing files anymore, as I am
> creating my own user tests as described above.

I'm not fully sure whether I understand your question.  If the question
is whether you can remove data that are currently in the repository and can
be reconstructed by the provided scripts but are not really needed for the
final test, then yes, you can remove these.  But please leave the scripts to
reproduce the final test data and also add the procedure to the README
file.

I have committed a change where I moved the data to the mcl-doc package.
The rationale is that we do not want to bloat users' machines with data
that are not needed to simply run the program.  Those users who want the
documentation might also like to run the test as an example, thus the
data are provided as examples (as well as the test script, ready to run on
users' machines).

Please ask if any of my steps or my motivation are unclear to you.

I hope this answers your question.

Kind regards

      Andreas.

-- 
http://fam-tille.de

