Re: ATLAS debian cluster and Debian 5.0 Lenny?
On Tuesday 14 July 2009 03:56:40 Andre Felipe Machado wrote:
> As I understood, the Atlas and Morgane nodes' disk space was repartitioned
> and a clean install of a new 5.0 system was performed at each of them.
> Were the nodes upgraded (dist-upgrade) instead, while repartitioning
> data disk space?
At least for Atlas, we needed to wipe everything clean, as we also moved away
from xfs back to ext3 as the file system for our system partitions. In the past
we saw quite a number of xfs errors under certain workloads, but
unfortunately we were never able to reproduce them cleanly and give the xfs
crowd more help in figuring out the problem. So the upgrade was basically a
full fresh install instead of a dist-upgrade (or full-upgrade, nowadays); the
latter we performed only on a very few nodes. As FAI allows a reinstall within
a couple of minutes, we usually do fresh installs rather than upgrades,
although for adding new packages we use FAI's softupdate quite often (right
now already 4 times across the cluster within 1.5 weeks), which is essentially
a small upgrade in place.
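Conceptually, a softupdate only reconciles the node's installed package set in place, while a fresh install wipes the node first. A tiny illustrative sketch (plain Python, not FAI's actual implementation; the package names are made up):

```python
# Illustrative only: FAI's softupdate conceptually brings a node's package
# set in line with its class definition without reinstalling the system.
def softupdate(installed: set, target: set):
    """Return (to_install, to_remove) to bring the node to the target set."""
    return target - installed, installed - target

# Hypothetical package sets for one compute node:
to_install, to_remove = softupdate(
    installed={"condor", "xfsprogs", "openssh-server"},
    target={"condor", "e2fsprogs", "openssh-server", "ganglia-monitor"},
)
print(sorted(to_install))  # ['e2fsprogs', 'ganglia-monitor']
print(sorted(to_remove))   # ['xfsprogs']
```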
I think Steffen did about the same for Morgane, although I'm not 100% sure, as
we did the upgrades in parallel and doing it at one site was already quite a
bit of work ;) Steffen, please correct me if I said something wrong.
> As I understood, the upgrade went straightforwardly, without much trouble.
> Is the storage space bigger now (for data and sw) or is there only a bit
> more space for node programs?
No, the hardware stayed the same this time, and the repartitioning just
addressed a slight problem: /opt was too small, as we need to install quite a
bit of stuff there. We are going to perform a hardware upgrade to our data
servers later this year to add more storage space and, depending on funding
and pricing, will add several hundred TBytes of disk space.
> Could you explain a bit more (a few lines) about the procedure of
> transferring these TBytes from other countries?
Sure. We use a software called the LIGO Data Replicator (LDR), which relies
heavily on the Globus tools for the actual work it does underneath. For
example, it uses gridftp (or gsiftp) for transferring the large amounts of
data back and forth, using multiple TCP streams to counter the long round-trip
times across the "pond" (the Atlantic). The web page has more details, but
essentially the data is created at the sites of Virgo and LIGO (Cascina in
Italy; Hanford and Livingston in the US), copied to CIT (Caltech), and
published there. All other sites then automatically learn which new files are
available and start downloading them, querying multiple sites for the data and
fetching it from whichever already has it.
This reads a lot like the way the BitTorrent protocol works, but it is more
stringent: data is produced continuously, and the information about files and
their metadata is kept here in SQL databases, which would not easily be
possible with BitTorrent. Nor would BitTorrent reach single-file copy speeds
well above 20 MB/s over long-distance connections if the file is only
available at a single location.
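The publish-and-pull cycle described above can be sketched in a few lines (an illustrative sketch only, not LDR's actual code; the site and file names are invented):

```python
# Illustrative sketch of the replication idea: each site learns which new
# files exist, asks peer sites which of them they already hold, and fetches
# every file from a site that has it.
def plan_downloads(wanted, holdings):
    """Map each wanted file to the first peer site that already holds it."""
    plan = {}
    for f in wanted:
        for site, files in holdings.items():
            if f in files:
                plan[f] = site
                break
    return plan

# Invented example: CIT holds everything, as it is the publishing hub.
holdings = {"CIT": {"H-frame-001", "L-frame-001"}, "Hannover": {"H-frame-001"}}
print(plan_downloads(["H-frame-001", "L-frame-001"], holdings))
# {'H-frame-001': 'CIT', 'L-frame-001': 'CIT'}
```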
If you want to know more, we can provide contact information for the person
who thought this through, designed, and coded this service.
> Given that the data crunching will last for months, how do you
> classify/verify/expect reliability at the cluster and its sw and deal
> with their types of failures?
Good question. The good part is that basically all jobs are single-threaded
and typically run only for a couple of hours. This buys you a lot, as you can
afford to have a couple of failed nodes from time to time: their jobs can then
be resent to other compute nodes. Our scheduling system, Condor, can
checkpoint certain job types, i.e. stop the program and write its current
state to a file on disk, which can then be moved to a different machine and
restarted from where it left off earlier. If that does not work, the job is
started from scratch again and we "just" lose a little bit of CPU time, so
our overall efficiency goes down slightly. When not enough user jobs are
present, we run Einstein@Home (our own BOINC project) as a backfill job, so in
principle our CPUs are 100% busy all the time.
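The checkpointing idea can be illustrated with a toy example (plain Python, nothing to do with Condor's actual mechanism): the job writes its state to disk as it goes, so a restart on another machine resumes instead of recomputing.

```python
import json
import os
import tempfile

def run_job(total_steps, ckpt_path, crash_after=None):
    """Sum 1..total_steps, checkpointing after every step; optionally 'crash'."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):            # resume from an earlier checkpoint
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if crash_after is not None and state["step"] >= crash_after:
            return None                      # simulate the node dying mid-job
        state["step"] += 1
        state["acc"] += state["step"]
        with open(ckpt_path, "w") as f:      # write current state to disk
            json.dump(state, f)
    return state["acc"]

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
assert run_job(100, ckpt, crash_after=40) is None  # first node fails at step 40
print(run_job(100, ckpt))                          # resumed elsewhere: 5050
```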
We are currently employing both Ganglia and Nagios to monitor vital system
services and servers, and also have a small database-backed system which
informs us when something is going amiss. Our goal was to make each node as
self-contained as possible. For example, if a node finds an unrecoverable
error on its disk (via smartmontools), it will contact the tftp server,
telling it that it needs to perform a disk check. The node will then reboot,
start into a DOS image via PXE, and perform the drive fitness test for the
disk drive. We can then log into the node via the IPMI card (basically a
serial connection over the LAN) and check the status.
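The decision a node makes after its disk self-check can be sketched like this (illustrative Python, not our production script; it parses the overall-health verdict line that `smartctl -H` prints):

```python
# Illustrative: decide from smartctl's health summary whether the node should
# ask the boot server for a drive fitness test on its next PXE boot.
def disk_needs_test(smartctl_output):
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result:" in line:
            return not line.rstrip().endswith("PASSED")
    return True  # no verdict found: be conservative and test the drive

healthy = "SMART overall-health self-assessment test result: PASSED"
failing = "SMART overall-health self-assessment test result: FAILED!"
print(disk_needs_test(healthy), disk_needs_test(failing))  # False True
```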
All these things are handled by each node autonomously; it will only send us
an email when it encounters a problem (and yes, if we make a stupid mistake,
we might get bombarded by emails ;)).
Overall we try to maximize the uptime of the machines, ensure that all needed
software packages are installed and up to date, and try to optimize the users'
experience. If we run into bugs, there is always reportbug, and usually we are
not the first to hit a problem, so the BTS and
%SEARCHENGINE_OF_YOUR_CHOICE% are very helpful, as are the various mailing
lists (both Debian and non-Debian).
I hope that already addresses your question to some extent; please ask more
questions if you want more background information. If you want pictures, we
can provide you with some (you could also ask tolimar to come here and take
some, he's living close by and helped us greatly last year when we needed a
thorough introduction to Debian packaging).
Last but not least, I already added Elke Müller and Felicitas Mokler to my
last email; they are the PR pros for the institute and the science behind it.
Feel free to ask them questions as well if you like.