
Re: Beowulf in Bioinformatics



On Tue, Jun 21, 2011 at 8:05 PM, Guilherme Rocha <guilherme@gf7.com.br> wrote:
>
> Hello all,
>
>
> My name is Guilherme Rocha. I am a biotechnologist and a Debian user since Potato,
> an old user who likes to think he is advanced, but no more than that.
> I sometimes help the Debian l10n team localize Debian to pt_BR.
>
> I'm in charge of planning and building a cluster in our lab. Our lab is Genev -
> the Laboratory of Population Genetics and Molecular Evolution
> at the Federal University of Bahia, Brazil.
>
> We already run some tasks on a Dell server machine with Ubuntu, but
> the process is very slow.
> On a quad-core Dell running Ubuntu, one such task (a PALP analysis) takes
> 9 days to finish.
>
> We want to reduce this time drastically.
>
> So we would like to hear from you gurus about the best practices for doing this,
> and also to understand whether we will see a significant time reduction with
> our hardware, described below.
>
>
> We first need to identify what we want/need. What is the (typical) problem
> you want to solve?
>
> To use Debian Med for phylogenetic analysis, protein modeling,
> DNA alignment, and other genetics work...
> Open-source software like PALP, GAMGI, GARLIC, GDPC, PyMOL, Perl Primer, etc.
>
> What software do you need for that? Do you need a batch scheduler, or do you
> have very few users who work in the same place and share the cluster
> without technical measures?
>
> We'll have very few people, about 10 I think. I'm not sure whether the tasks
> need to be scheduled in order to run. We intend to use Debian Med (the
> med-bio meta-package) running on a small Beowulf cluster of around 10 to 15 nodes.
>
> Think about the OS (Debian is a good choice here ;))
>
> Yes, sure, Debian Med.  :)
>
>  Think about the compute hardware: you probably need a login node, execute
> nodes, and a file server. Do you need many local cores, or are the problems
> too large to fit into a few nodes?
>
> We have very obsolete hardware. Our server node will be a Pentium 4 at 1.5 GHz
> with 1 GB of RAM, with worker nodes ranging from K6 500 MHz machines (5 units)
> to Pentium III 266 MHz machines (10 units), plus Atom 1 GHz thin clients.
>
>
> Questions:
>
> Could the thin clients with Atom processors be used?
> Will their performance be good enough?
>
>
>
>  Then you need to look into networking
> (InfiniBand or high-performance Ethernet). Is the software sensitive to the
> latency and/or to the bandwidth available?
>
>
> We have a 10/100 switch. We are looking into the possibility of acquiring a
> 10/100/1000 switch.
>
>
>
> So the questions are:
>
> With this hardware, will we see a significant reduction in the time these tasks take?
> Can we use thin clients to build a cluster?
> Is there some "Debian Beowulf way" method we should review before starting?
> Might another type of cluster be better suited than a Beowulf for this?
> Any idea will be very welcome.
>
> Cheers, and long life to Debian,
>
> --
> Guilherme Rocha
> GF7 Doc & Systems - Soluções Tecnológicas
> Home Page: http://www.gf7.com.br
> Telefone: + 55 71 4062 9142
> Mobile:   + 55 71 9279 0829
>

Your hardware is a bit old, but aggregating the machines should still speed
things up. I cannot comment much on the performance you will get, but I can
help you a little with the architecture.

The main things you need:
- A storage server that holds all your data and shares a file system across
the compute nodes and the frontend node. NFS can be a good choice for your
cluster (see the sketch after this list); if you later get better hardware and
multiple servers for storage, you may consider Lustre, which will improve
performance.
- Compute nodes. They need access to the shared file system (/home should be
mounted from the storage node, for example) and to the software. Software can
be installed locally or mounted from the storage node under /software or
similar. You also need a way to authenticate users; running an LDAP client on
each node is a good way to do so.
- A scheduler. Torque, Slurm and SGE are all available in Debian; they are all
good, but I prefer Slurm in terms of performance.
- A service node. It hosts the licence server, the OpenLDAP server and the
deployment services used to reinstall the other nodes (compute, storage,
frontend). It can also run configuration management such as Puppet (but do not
leave the agent running in the background on the compute nodes, as it consumes
memory). You can use Nagios and Ganglia to monitor node activity and problems.
- A frontend node. It is basically a compute node where users log in and
submit jobs through the batch scheduler (a minimal job script is sketched
below).
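
To make the NFS point concrete, here is a minimal sketch of sharing /home from
the storage node to the rest of the cluster on Debian. The hostname storage01
and the 192.168.0.0/24 subnet are only placeholders for your own setup:

  # On the storage node: install the NFS server and export /home
  apt-get install nfs-kernel-server
  echo "/home 192.168.0.0/24(rw,sync,no_subtree_check)" >> /etc/exports
  exportfs -ra

  # On each compute node and on the frontend: install the client and mount /home
  apt-get install nfs-common
  echo "storage01:/home  /home  nfs  defaults  0 0" >> /etc/fstab
  mount /home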
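
And once Slurm (or any of the schedulers above) is running, users on the
frontend submit work with a small batch script. A rough sketch only: run_palp,
the paths, and the resource numbers are placeholders for whatever your
analysis actually needs:

  #!/bin/bash
  #SBATCH --job-name=palp_run        # name shown in the queue
  #SBATCH --nodes=4                  # how many compute nodes to use
  #SBATCH --ntasks-per-node=1        # one task per node
  #SBATCH --time=48:00:00            # wall-clock limit
  cd /home/guilherme/analysis        # shared /home, visible on every node
  ./run_palp input.dat > output.log

Then, from the frontend:

  sbatch job.sh     # submit the script above
  squeue            # watch the queue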

and don't forget to configure NTP and LDAP everywhere :-)
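
On Debian that is roughly the following on every node (package names from
memory; service01 is a placeholder for your service node, and the LDAP client
packages will ask for your LDAP server and base DN via debconf):

  # Keep clocks in sync and authenticate against the LDAP server
  apt-get install ntp libnss-ldap libpam-ldap
  echo "server service01" >> /etc/ntp.conf   # point NTP at the service node
  /etc/init.d/ntp restart
  ntpq -p                                    # check that the peer is reachable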

Debian is clearly a good Linux distribution; some of the Top500 clusters run
it. In terms of performance it is a standard Linux that you can customize, so
the improvements will come from the kernel and driver versions you choose, and
that part is largely distribution agnostic.

-- 
Stéphan

