Re: About GSoC project "Large dataset manager".
On Wed, Mar 25, 2009 at 5:07 PM, Charles Plessy <email@example.com> wrote:
> In the application to be posted in the SoC web interface
> (http://socghop.appspot.com), please present yourself as you did in this email,
> and explain briefly with your own words why you think that the project you
> selected is interesting. Then develop your ideas on how to achieve it, and
> conclude by explaining why and how you think that you will manage to end up
> your project with something concrete. Ideally, be precise and concise.
Thanks for the help!
I just submitted my application. You'll find the content below. Does it look OK?
Title: Large dataset manager
Student: Roy Flemming Hvaara
Large public datasets, like databases for bioinformatics, are typically
too big and too volatile to fit the traditional source/binary
packaging scheme of Debian. Some programs that are distributed in
Debian, like blast and emboss, can index specialised databases, but
Debian lacks a tool to install or update the datasets they need and
keep their indexing in sync.
Name: Roy Flemming Hvaara
Background: I'm 21 years old and from Norway, but I study medicine at
Pécs University in Hungary. I've been developing projects in bash,
Perl and PHP for some years. I've had Linux as my main operating
system for about six years.
Project title: Large dataset manager
Synopsis: I want to create a tool to install and update large
databases. Initially I want to make an application that downloads the
databases and their updates directly from the content provider. Later I
will add an option to download from a mirror and to fetch only the
updates. That way a user would not have to download the whole database
again when the provider does not publish updates as separate files.
The user downloads less, which saves time and resources.
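As a rough sketch of the "only the updates" idea (written in Python
purely for illustration): assume a mirror published a manifest that maps
each file name to its SHA-256 checksum. No such manifest exists yet; its
format and the function names below are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_fetch(manifest, local_dir):
    """List files that are missing locally or whose checksum differs.

    `manifest` is a hypothetical mapping {file name: sha256 hex digest}
    that a mirror would have to publish; it is an assumption of this sketch.
    """
    local_dir = Path(local_dir)
    stale = []
    for name, expected in sorted(manifest.items()):
        target = local_dir / name
        if not target.is_file() or sha256_of(target) != expected:
            stale.append(name)
    return stale
```

Only the names returned by files_to_fetch() would then need to be
downloaded; everything else is already up to date on disk.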
I also want to include a tool to build the databases in the Debian
software package format (.deb), and possibly in the formats of other
Linux distributions as well; RPM is something I definitely want to add
support for. This has multiple benefits: 1) There will be one file that
the user can move between multiple computers. 2) The user only has to
download the database once. 3) The files are managed by the package
manager of the distribution, keeping the system more streamlined. 4)
Less hassle to update the databases. 5) It becomes possible to keep
track of multiple databases and/or versions.
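To illustrate the packaging step, here is a minimal sketch (again in
Python, for illustration only) that writes the DEBIAN/control file of a
data-only binary package. The staging layout and the function name are
assumptions of this sketch, not part of any existing tool.

```python
from pathlib import Path


def write_control(staging_dir, package, version, maintainer, summary):
    """Write a minimal DEBIAN/control file and return its path."""
    control_dir = Path(staging_dir) / "DEBIAN"
    control_dir.mkdir(parents=True, exist_ok=True)
    fields = [
        ("Package", package),
        ("Version", version),
        ("Architecture", "all"),  # data files do not depend on the CPU
        ("Maintainer", maintainer),
        ("Description", summary),
    ]
    control = control_dir / "control"
    control.write_text("".join(f"{key}: {value}\n" for key, value in fields))
    return control
```

With the database files copied into the staging directory, running
"dpkg-deb --build <staging_dir>" should then produce the .deb file. The
package name, version and maintainer passed in are placeholders to be
chosen per dataset.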
To me the most logical approach is to create this tool in Perl. See getData 
I have been in contact with a webhost that has agreed to host the
packages. I hope to be able to distribute the Debian packages, and
others, through my own repository, compatible with apt and similar tools.
Benefits to Debian: As of right now there is no good way to
distribute very large datasets in Debian. This project will help
towards solving this issue.
Deliverables: A tool to install, update and package large datasets.
Project schedule: I think the schedule depends on how many features
I'm going to implement. Package management is an ongoing process which
always requires some work. I am committed to continuing work after the
summer to provide ongoing support.
Exams and other commitments: My exam period starts on the 18th of May.
I will work very hard to finish as soon as possible so that I can
start work on this project.
If you are not a Debian Developer: I have always wanted to be a
package maintainer. The dgen package in debian-games appealed to me
when it was orphaned, but as a medical student I have very limited
time. I will definitely continue development on this project after the
summer. I think it's very interesting, and bioinformatics is something
that I would love to work more with.