
Re: Google SoC (Bio DB manager)



Hi guys,

I was hoping those that are interested might offer some constructive criticism on my application:

Abstract:

Bioinformatics research requires the processing of large amounts of biological data. Because of the sheer quantity of data analysed, most researchers must run local mirrors of the databases that they use. Unfortunately, local mirrors can be intimidating to set up and tedious to maintain. Researchers may choose to use older versions of the datasets involved out of laziness or fear of breaking their current scripts, or they may choose to forgo large-scale analyses altogether, especially if they have less experience with systems administration.

I propose to solve this problem by creating a tool that will automate the process of finding, installing, updating, and indexing mirrors of biological databases. It will resolve dependencies, such as datasets that are mapped to other datasets and programs that are required for indexing. The tool should allow users to maintain multiple versions of the databases, as some analyses may be linked to specific revisions of the data. As well, it should automate migration of the datasets from one directory or volume to another, for cases where hard disk space is limited.
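
As a rough illustration of the dependency resolution described above, here is a minimal sketch in Python. It assumes that each dataset declares which other datasets or programs it needs; the `DEPENDS_ON` map and every name in it are hypothetical examples, not a real catalogue. The ordering uses the standard library's `graphlib`, which is only one of several ways this could be done.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each entry lists what must be installed first.
# The names are illustrative only, not a real catalogue of databases.
DEPENDS_ON = {
    "inparanoid": {"ensembl"},
    "gene-ontology": set(),
    "ensembl": {"mysql-server"},
    "blast-index-nr": {"genbank-nr", "blast"},
}


def install_order(targets, depends_on):
    """Return the targets plus everything they need, dependencies first."""
    needed = {}
    stack = list(targets)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        deps = depends_on.get(name, set())
        needed[name] = deps
        stack.extend(deps)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(needed).static_order())


order = install_order(["inparanoid"], DEPENDS_ON)
print(order)  # ['mysql-server', 'ensembl', 'inparanoid']
```

Only the requested datasets and their transitive dependencies end up in the result, so unrelated entries in the map are never touched.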

Ideally, biological database mirroring will be made easy enough that it can be done by anyone familiar with Debian's existing tools. Not only will current researchers be more likely to use the most up-to-date biological data, but others who were previously deterred by the inherent difficulties of maintaining such mirrors may be encouraged to pursue large-scale data analyses.

Debian is one of the most popular and stable GNU/Linux distributions, and already provides the base for popular bioinformatics-targeted distributions such as Debian-Med, DNALinux, and Bio-Linux. Debian currently leads in both the quality and quantity of bioinformatics packages. It represents the ideal platform on which to build such a tool. Conversely, such a tool would also help to solidify Debian as the standard bioinformatics platform.

In principle, the application is not limited to biological databases. It could be readily extended to any situation that requires local mirrors of large data sets, such as those used in astronomy. Future development might also add a GUI to make it more user-friendly.

Detailed Description:

Introduction

Advances in the automation of biological experimentation and data collection have led to an explosion in the size and number of biological databases. Although data clearinghouses such as GenBank, EMBL, and DDBJ facilitate the dissemination of such data, any large-scale bioinformatics analysis requires local mirrors of the relevant databases. The extreme size and volatility of the data sets involved have prevented them from being integrated into the standard Debian package management system. Manually finding, installing, updating, and indexing such databases is a daunting task for any system administrator, much less a researcher with limited time and computer training.


Proposed Project

The project is the creation of a tool to automate the life cycle of biological databases, from installation to removal. It should be usable by those with limited technical experience. Its various proposed uses are as follows:

Select:
    Database selection from a list
    Version selection, if appropriate
    Dependency checking for other databases and/or database versions
    Dependency checking for installed programs (especially important for the "processing" step below)
Install:
    Download
    Extract
    Process: load into MySQL, index for BLAST, etc.
    Clean up: remove any remaining downloaded files
Update:
    Check for new versions of installed datasets
    Install updated sets without removing old versions
Remove:
    Remove data that resulted from processing: drop MySQL tables, delete indices, etc.
    Remove extracted files
Reinstall:
    Remove and install again
Relocate:
    Either a simple "mv" or a reinstall into a new location
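
The life cycle listed above could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: `fetch`, `process`, and `unprocess` are hypothetical per-database hooks, the download is assumed to be a single gzipped file, and `fake_fetch` and `no_op` are stand-ins so the example runs without a network connection.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def install(name, version, fetch, process, root):
    """Download, extract, process and clean up one version of a dataset.

    fetch(workdir) returns the path of a downloaded .gz file and
    process(target) does the database-specific step. Both are hypothetical
    hooks that would be written separately for each database.
    """
    target = Path(root) / f"{name}.v{version}"
    if target.exists():
        raise FileExistsError(f"{target} is already installed")
    workdir = Path(tempfile.mkdtemp(prefix="biodb-"))
    try:
        archive = Path(fetch(workdir))                 # Download
        target.mkdir(parents=True)
        extracted = target / archive.stem              # strips the ".gz"
        with gzip.open(archive, "rb") as src, open(extracted, "wb") as dst:
            shutil.copyfileobj(src, dst)               # Extract
        process(target)                                # Process
    except Exception:
        shutil.rmtree(target, ignore_errors=True)      # no half-installed data
        raise
    finally:
        shutil.rmtree(workdir, ignore_errors=True)     # Clean up downloads
    return target


def remove(name, version, unprocess, root):
    target = Path(root) / f"{name}.v{version}"
    unprocess(target)        # drop MySQL tables, delete indices, etc.
    shutil.rmtree(target)    # remove extracted files


def reinstall(name, version, fetch, process, unprocess, root):
    remove(name, version, unprocess, root)
    return install(name, version, fetch, process, root)


def relocate(name, version, old_root, new_root):
    """The simple "mv" case; only safe if processing stored no absolute paths."""
    source = Path(old_root) / f"{name}.v{version}"
    destination = Path(new_root) / source.name
    shutil.move(str(source), str(destination))
    return destination


def fake_fetch(workdir):
    """Stand-in for a real download: writes a tiny gzipped FASTA file."""
    path = Path(workdir) / "proteins.fasta.gz"
    with gzip.open(path, "wt") as handle:
        handle.write(">demo\nMSTNPKPQRK\n")
    return path


def no_op(target):
    """Stand-in for a database that needs no processing at all."""


with tempfile.TemporaryDirectory() as root:
    installed = install("demo", 1, fake_fetch, no_op, root)
    print(sorted(p.name for p in installed.iterdir()))  # ['proteins.fasta']
    remove("demo", 1, no_op, root)
    print(installed.exists())  # False
```

Keeping the download in a temporary working directory means the "clean up" step happens even when extraction or processing fails.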


Other considerations: Because analyses may be linked to a specific version, each version will have its own separate installation, e.g. both ensembl.v38 and ensembl.v39. As well, each database will have very different post-extraction processing, with some being indexed for BLAST, some being loaded into a local SQL database, and others having nothing done at all. This problem is compounded by the lack of common data storage formats. A significant amount of hand-coding may be required for the installation step of each database.
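
One possible way to organise the hand-coded, per-database processing and the side-by-side versions is sketched below in Python. Everything here is an assumption made for illustration: the `PROCESSORS` registry, the action names "blast-index" and "sql-load", and the pairing of GenBank with BLAST and Ensembl with SQL are placeholders, and the handlers only return a plan rather than running any real program.

```python
import re
import tempfile
from pathlib import Path

PROCESSORS = {}  # database name -> hand-coded post-extraction function


def processor(name):
    """Register the decorated function as the processing step for `name`."""
    def register(func):
        PROCESSORS[name] = func
        return func
    return register


@processor("genbank")
def index_for_blast(data_dir):
    # Placeholder plan: one indexing action per extracted FASTA file.
    return [("blast-index", p.name) for p in sorted(data_dir.glob("*.fasta"))]


@processor("ensembl")
def load_into_sql(data_dir):
    # Placeholder plan: one load action per extracted SQL dump.
    return [("sql-load", p.name) for p in sorted(data_dir.glob("*.sql"))]


def plan_processing(name, data_dir):
    """Databases without a registered processor get an empty plan."""
    handler = PROCESSORS.get(name, lambda data_dir: [])
    return handler(Path(data_dir))


def installed_versions(root, name):
    """List installed versions of `name`, e.g. ensembl.v38 and ensembl.v39."""
    pattern = re.compile(re.escape(name) + r"\.v(\d+)$")
    found = []
    for entry in Path(root).iterdir():
        match = pattern.match(entry.name)
        if match and entry.is_dir():
            found.append(int(match.group(1)))
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    for version in (39, 38):
        (Path(root) / f"ensembl.v{version}").mkdir()
    (Path(root) / "ensembl.v39" / "core.sql").write_text("-- schema\n")
    print(installed_versions(root, "ensembl"))                     # [38, 39]
    print(plan_processing("ensembl", Path(root) / "ensembl.v39"))  # [('sql-load', 'core.sql')]
    print(plan_processing("gene-ontology", root))                  # []
```

A registry like this keeps the shared download and versioning logic in one place while leaving room for a separate function per database.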

Timeline

May: Community bonding period
June: Basic download/version functionality with dependency database
July: Installation functionality for select datasets
August: Updating and relocation functionality
    Add as many other datasets as possible


Personal Background

My name is Aidan Findlater (aidanfindlater@gmail.com). I will be graduating this May with two degrees, a BSc in Computing and a BSc (Honours) in Biochemistry. I've spent almost ten years using Linux (converting to Debian early on--its package management rules all), but have never contributed to any of the open source projects that I use so often. That's something I would like to change.

Not only will I have degrees in both applicable fields of study, I have direct experience with the intersection of the two. In 2006, I won an NSERC Undergraduate Student Research Award to pursue bioinformatics research in the department of biology. I used BioPerl to do an analysis of N-terminal acetylation (a post-translational protein modification) where I compared orthologues in 16 species using the Inparanoid database. I also analysed the same data set using the Gene Ontology database to determine if there were terms that were either more or less common in the set. I had to download and install Inparanoid, HomoloGene, Gene Ontology, and a variety of proteomic datasets. They were frustrating to keep up to date.

This past summer (2007) I worked for the same supervisor in my capacity as a biochemist. However, I was bored one week and decided to port my original BioPerl script to BioRuby. I then wrote a Ruby script to automate the download, extraction, and updating of the biological databases that I was using. If I had had longer than a week (say, a whole summer) and a mentor to help guide me, I like to think that it would have been a solid tool. Now I have the opportunity to do just that. While my script was written by me and for me, the tool that I would like to write would be useful to all bioinformatics researchers.

When I was in high school, I taught myself QBasic (of course), then C++, HTML, PHP, SQL, and XHTML. (I'd like to point out that I'm not restricting the list to programming languages, strictly speaking.) In university, we were taught Java, Haskell, assembly, C, yet more SQL, and some less interesting languages. I taught myself Perl for the above-mentioned project, and Ruby when I got bored of Perl. The point of this is to show that I have at least passing familiarity with the languages that might be required for the project, and the ability to quickly learn new ones.

Beyond programming, I also have experience with Debian administration, having administered my personal Debian servers for around eight years and my friend's for about as long. I have a TFTP server installed on my (Apple) laptop so that I can do PXE netinsts of Debian and Ubuntu whenever I need to (if I have access to the DHCP server). Debian package management is about the best thing since sliced bread, especially now with Aptitude.

Something that I feel I should explain is my reason for switching from computing to biochemistry in third year. I was getting frustrated with computing. The subject matter was often boring and usually erred on the side of the academic. I felt like I wasn't learning anything. My friends who graduated last year still know very little about computers in any practical sense. When I took a biochemistry course in third year, it was something new and fresh. Professors were telling me things I didn't already know. I was actually learning! I had always had an interest in genetics and molecular biology, so the change made sense to me.
 

Thanks,

-Aidan

On 04/04/2008, Aidan Findlater <aidanfindlater@gmail.com> wrote:
If you guys are bored, I threw my old Ruby bio DB updater in a git repo: http://archive.aidanfindlater.com/cgi-bin/gitweb.cgi?p=biodbman.git;a=summary

I'd be writing it from scratch, presumably in Perl, but it might give you an idea of the way my brain works. Most of the logic was coded separately for each DB because they're so very different.

And yes, I really did monkeypatch Hash. I'm sorry.

-Aidan


On 02/04/2008, Aidan Findlater <aidanfindlater@gmail.com> wrote:
Dear Charles,

I was looking at the Debian website and didn't realize that the deadline was extended, so I ended up filling out the Google application stuff yesterday.

I'm not sure what you needed from the SoC website, but the screenshot would require manual stitching of images. Here's the most relevant-seeming information:

City: Kingston, Canada
University: Queen's University (http://www.queensu.ca/)
Degree: BScH Biochemistry and BSc Computing (two degrees)
Expected Graduation: May 2008
Home Page: http://www.aidanfindlater.com/
IM Contact: aidanfindlater (AIM), @gmail.com (Google Talk), @jabber.org (Jabber), @hotmail.com (MSN), @yahoo.ca (Yahoo IM); 12192596 (ICQ)

I'm not really sure how it works from here on in. Did you guys have specific ideas about the tool? I was hoping that it would be modeled after apt-get or aptitude so that it would be already familiar to those using it. The proposal posted to the Debian site already looks pretty good to me.

How does one get ahold of the preliminary draft? I'd be interested to see what kind of choices he's made with it.

-Aidan


On 31/03/2008, Charles Plessy <charles-debian-nospam@plessy.org> wrote:
On Mon, Mar 31, 2008 at 10:45:55PM +0200, Andreas Tille wrote:

> On Mon, 31 Mar 2008, Aidan Findlater wrote:
>
> >Oh, I forgot to mention that Debian is the best distro, and Vi is the best
> >editor. Hopefully that last one doesn't get me disqualified...
>
> Well, the first statement increases your chances, but the second pushes
> you out! ;-)  Wait, last chance: Which is better, Gnome or KDE? ;-)))
>
> >>Are applications for the SoC biological database manager project still
> >>open? I know it's last-minute, but I thought it can't hurt to ask. Please
> >>let me know if it is possible to still apply, and what I would need to do
> >>so.
>
> Just go to the GSoC page and apply as student.  I think it is not too late
> but I personally have no idea how to apply as student - I just found out
> how to apply as mentor. ;-)


Dear Aidan,

Google extended the application period by one week, so yes, our project
is still open :)

Are you using Jabber? My ID is charles@jabber.dunkklar.org. I am in the
Tokyo timezone. I am also a mentor for the biological database manager
project. Do not hesitate to contact me about your application, privately
or on this list. I am not usually on IRC, so if you want to see me
there, please ping me beforehand!

Do not hesitate to send us a screenshot of
http://code.google.com/soc/2008/student.html: I just get a message like
"Sorry mentors can not sign in as a student"!

Also, we can start talking to get to know each other a bit better. Just
start on this list if having the discussion in public is not a problem for
you. Otherwise, I think that we could have it with me and Andreas, who
are registered mentors, and with Steffen Möller, who made a preliminary
draft in Perl, if he is available at the moment.

Have a nice day,


--
Charles Plessy
http://charles.plessy.org
Wakō, Saitama, Japan



