Re: ATLAS debian cluster and Debian 5.0 Lenny?
On Tuesday 14 July 2009 03:56:40 Andre Felipe Machado wrote:
> As I understood, the Atlas and Morgane nodes' disk space was repartitioned
> and a clean install of a new 5.0 system was performed at each of them.
> Were the nodes upgraded (dist-upgrade) instead, while repartitioning
> data disk space?
At least for Atlas, we needed to wipe everything clean, as we also moved away
from xfs back to ext3 as the file system for our system partitions. In the past
we saw quite a number of xfs errors under certain workloads, but
unfortunately we were never able to reproduce them cleanly and give the xfs
crowd more help in figuring out the problem. So the upgrade was basically a
full fresh install instead of a dist-upgrade (or full-upgrade, nowadays); the
latter we performed only on a very few nodes. As FAI allows a reinstall within
a couple of minutes, we usually do fresh installs rather than upgrades,
although for adding new packages we use FAI's softupdate quite often (right
now already 4 times across the cluster within 1.5 weeks), which is essentially
a small upgrade in place.
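Conceptually, a softupdate only reconciles the node's installed package set in place, while a fresh install wipes the node first. A tiny illustrative sketch (plain Python, not FAI's actual implementation; the package names are made up):

```python
# Illustrative only: FAI's softupdate conceptually brings a node's package
# set in line with its class definition without reinstalling the system.
def softupdate(installed: set, target: set):
    """Return (to_install, to_remove) to bring the node to the target set."""
    return target - installed, installed - target

# Hypothetical package sets for one compute node:
to_install, to_remove = softupdate(
    installed={"condor", "xfsprogs", "openssh-server"},
    target={"condor", "e2fsprogs", "openssh-server", "ganglia-monitor"},
)
print(sorted(to_install))  # ['e2fsprogs', 'ganglia-monitor']
print(sorted(to_remove))   # ['xfsprogs']
```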
I think Steffen did about the same for Morgane, although I'm not 100% sure, as
we did the upgrades in parallel and doing it at one site was already quite a
bit of work ;) Steffen, please correct me if I said something wrong.
> As I understood, the upgrade went straightforwardly, without much trouble.
> Is the storage space bigger now (for data and sw) or is there only a bit
> more space for node programs?
No, the hardware stayed the same this time, and the repartitioning just
addressed a slight problem: /opt was too small, as we need to install quite a
bit of stuff there. We are going to perform a hardware upgrade to our data
servers later this year to add more storage space and, depending on funding
and pricing, will add several hundred TBytes of disk space.
> Could you explain a bit more (a few lines) about the procedure of
> transferring these TBytes from other countries?
Sure. We use a software called the LIGO Data Replicator (LDR), which relies
heavily on the Globus tools for the actual work it does underneath. For
example, it uses gridftp (or gsiftp) for transferring the large amounts of
data back and forth, using multiple TCP streams to counter the long round-trip
times across the "pond" (the Atlantic). The web page has more details, but
essentially the data is created at the sites of Virgo and LIGO (Cascina in
Italy; Hanford and Livingston in the US), copied to CIT (Caltech), and
published there. All other sites then automatically learn which new files are
available and start downloading them, querying multiple sites for the data and
fetching it from whichever already has it.
This reads a lot like the way the BitTorrent protocol works, but it is more
stringent: data is produced continuously, and the information about files and
their metadata is kept here in SQL databases, which would not easily be
possible with BitTorrent. Nor would BitTorrent reach single-file copy speeds
well above 20 MB/s over long-distance connections if the file is only
available at a single location.
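The publish-and-pull cycle described above can be sketched in a few lines (an illustrative sketch only, not LDR's actual code; the site and file names are invented):

```python
# Illustrative sketch of the replication idea: each site learns which new
# files exist, asks peer sites which of them they already hold, and fetches
# every file from a site that has it.
def plan_downloads(wanted, holdings):
    """Map each wanted file to the first peer site that already holds it."""
    plan = {}
    for f in wanted:
        for site, files in holdings.items():
            if f in files:
                plan[f] = site
                break
    return plan

# Invented example: CIT holds everything, as it is the publishing hub.
holdings = {"CIT": {"H-frame-001", "L-frame-001"}, "Hannover": {"H-frame-001"}}
print(plan_downloads(["H-frame-001", "L-frame-001"], holdings))
# {'H-frame-001': 'CIT', 'L-frame-001': 'CIT'}
```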
If you want to know more, we can provide contact information for the person
who thought this through, designed, and coded this service.
> Given that the data crunching will last for months, how do you
> classify/verify/expect reliability at the cluster and its sw and deal
> with their types of failures?
Good question. The good part is that basically all jobs are single-threaded
and typically run only for a couple of hours. This buys you a lot, as you can
afford to have a couple of failed nodes from time to time: their jobs can then
be resent to other compute nodes. Our scheduling system, Condor, can
checkpoint certain job types, i.e. stop the program and write its current
state to a file on disk, which can then be moved to a different machine and
restarted from where it left off earlier. If that does not work, the job is
started from scratch again and we "just" lose a little bit of CPU time, so
our overall efficiency goes down slightly. When not enough user jobs are
present, we run Einstein@Home (our own BOINC project) as a backfill job, so in
principle our CPUs are 100% busy all the time.
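The checkpointing idea can be illustrated with a toy example (plain Python, nothing to do with Condor's actual mechanism): the job writes its state to disk as it goes, so a restart on another machine resumes instead of recomputing.

```python
import json
import os
import tempfile

def run_job(total_steps, ckpt_path, crash_after=None):
    """Sum 1..total_steps, checkpointing after every step; optionally 'crash'."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):            # resume from an earlier checkpoint
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if crash_after is not None and state["step"] >= crash_after:
            return None                      # simulate the node dying mid-job
        state["step"] += 1
        state["acc"] += state["step"]
        with open(ckpt_path, "w") as f:      # write current state to disk
            json.dump(state, f)
    return state["acc"]

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
assert run_job(100, ckpt, crash_after=40) is None  # first node fails at step 40
print(run_job(100, ckpt))                          # resumed elsewhere: 5050
```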
We are currently employing both Ganglia and Nagios to monitor vital system
services and servers, and also have a small database-backed system which
informs us when something is going amiss. Our goal was to make each node as
self-contained as possible. For example, if a node finds an unrecoverable
error on its disk (via smartmontools), it will contact the tftp server,
telling it that it needs to perform a disk check. The node will then reboot,
start into a DOS image via PXE, and perform the drive fitness test for the
disk drive. We can then log into the node via the IPMI card (basically a
serial connection over the LAN) and check the status.
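The decision a node makes after its disk self-check can be sketched like this (illustrative Python, not our production script; it parses the overall-health verdict line that `smartctl -H` prints):

```python
# Illustrative: decide from smartctl's health summary whether the node should
# ask the boot server for a drive fitness test on its next PXE boot.
def disk_needs_test(smartctl_output):
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result:" in line:
            return not line.rstrip().endswith("PASSED")
    return True  # no verdict found: be conservative and test the drive

healthy = "SMART overall-health self-assessment test result: PASSED"
failing = "SMART overall-health self-assessment test result: FAILED!"
print(disk_needs_test(healthy), disk_needs_test(failing))  # False True
```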
All these things are handled by each node autonomously; it will only send us
an email when it encounters a problem (and yes, if we make a stupid mistake,
we might get bombarded by emails ;)).
Overall we try to maximize the uptime of the machines, ensure that all needed
software packages are installed and up to date, and try to optimize the users'
experience. If we run into bugs, there is always reportbug, and usually we are
not the first to hit a problem, so the BTS and
%SEARCHENGINE_OF_YOUR_CHOICE% are very helpful, as are the various mailing
lists (both Debian and non-Debian).
I hope that already addresses your question to some extent; please ask more
questions if you want more background information. If you want pictures, we
can provide you with some (you could also ask tolimar to come here and take
some, he's living close by and helped us greatly last year when we needed a
thorough introduction to Debian packaging).
Last but not least, I already added Elke Müller and Felicitas Mokler to my
last email; they are the PR pros for the institute and the science behind it.
Feel free to ask them questions as well if you like.