
Re: mcl package tests



On 7/11/19 1:50 PM, Andreas Tille wrote:
Hi Saira,

nice to observe your progress with mcl.  While this is also not a trivial
task, you seem to have gone through it much more smoothly now!  I'm really happy about this.

Thank you! It feels really rewarding from my side as well, as I can feel that it all makes much more sense now, and the time I spent earlier searching and understanding things proves to be worth it.

On Thu, Jul 11, 2019 at 12:45:44PM +0100, Saira Hussain wrote:
Hi all,

in order to test the package mcl, I am using the approach described in the
paper of Lima-Mendez et al. (DOI: 10.1093/molbev/msn023): a method of
clustering bacteriophages based on shared gene content.

I based mine on some modifications of an original implementation by Ksenia
Arkhipova. I have already pushed some files to master, but I expect this
to take another day or so.

Right now I have one blockage: I am doing a 5-step process

1. Using the package prodigal to generate a database file

2. Using awk to filter this

3. Using mcl to generate protein families

4. Using a custom python script to do pairwise comparisons

5. Out of the files generated in the previous step, I am using mcl again to
build clusters of genomes.
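To illustrate the idea behind step 4, here is a minimal sketch of a pairwise comparison based on shared protein-family content, in the spirit of the Lima-Mendez et al. method. It is not the actual script on master: the `shared_family_edges` function, the phage/family names, and the simple shared-count weight are all hypothetical placeholders.

```python
# Hypothetical sketch of step 4: compare genomes pairwise by the number
# of protein families (from the mcl step 3 output) they share.
from itertools import combinations

def shared_family_edges(genome_families):
    """Given {genome: set of protein-family IDs}, return one
    (genome_a, genome_b, n_shared) tuple per pair of genomes that
    shares at least one family."""
    edges = []
    for a, b in combinations(sorted(genome_families), 2):
        shared = len(genome_families[a] & genome_families[b])
        if shared:
            edges.append((a, b, shared))
    return edges

# Toy input: three phage genomes and the families they contain.
families = {
    "phage1": {"fam1", "fam2", "fam3"},
    "phage2": {"fam2", "fam3", "fam4"},
    "phage3": {"fam5"},
}
for a, b, w in shared_family_edges(families):
    print(f"{a}\t{b}\t{w}")  # prints: phage1	phage2	2
```

The tab-separated output is in the label ("ABC") format that mcl accepts directly, so step 5 can be run as `mcl edges.abc --abc -o clusters`.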

Thanks for that description.  I'd recommend you add this to
debian/tests/data/README.  Please `git pull` first since I made some changes.

Done.

On that I have the following question:

- In order to complete step 5 I need to process a file generated from step
3, using the custom Python script. Now, as we discussed before, I expect I
can leave the script in (it's already on master) in order to document this,
but is it the right approach to have prepared a test file that is supposed to
come out of a previous test result?

I admit that this question should better be directed to a
bioinformatician.  From my general point of view I'd say: Why not.  We
want to prove that mcl is working correctly with some given data set, no
matter how this dataset was created.  We simply specify the way the
data were created to leave no questions about copyright and
reproducibility open.  So for me it's fine as you described it (if
I understood correctly).

That means I expect that the previous
test result has been successful!

If the results are sensible according to your scientific insight, that is a
sensible test.  One important point of Continuous Integration is to
reproduce previous results.  We have this result for now and we want
exactly this in the future.  If future research should uncover some issue
with our data, we can adjust the data following that knowledge.  But for
the given point in time the test is fine, IMHO.
OK. I totally understand this and it actually makes sense to me. I just wanted to have a second opinion as well :)

Let me know if that makes sense.

So I think it makes sense.  Thanks for the effort you have put into
obtaining these data!
p.s. Andreas: I don't need to include the testing files anymore, as I am
creating my own user tests as described above.

I'm not fully sure whether I understand your question.  If the question
is:  Currently the repository contains data that can be reconstructed
by the provided scripts, but they are not really needed for the final test,
then yes, you can remove these.  But please leave the scripts to
reproduce the final test data, and also add the procedure to the README
file.

I have committed a change where I moved the data to the mcl-doc package.
The rationale is that we do not want to bloat users' machines with data
that are not needed to simply run the program.  Those users who want the
documentation might also like to run the test as an example - thus the
data are provided as examples (as well as the test script, ready to run on
users' machines).

Yes that is very clear, thank you!

Please ask if any of my steps or motivation is unclear to you.

Hope this is answering your question.

Kind regards

       Andreas.

Thanks
SH

