We're in the process of assembling a new i386 cluster here at the
Bilkent CS Dept., but we are not very sure about the quality of the
hardware we have purchased. The computing nodes were bought with a
research grant, but they were first used for 2-3 days at a convention
organized by our magnificent institute of science and technology, which
very willingly gave us the grant. :) Whatever, our goal is to make sure
that all nodes are in good order. This means we're going to test each
component of every machine individually, and the idea of manually
stress-testing a computing node with no video, keyboard, or mouse didn't
sound very attractive to me.

Instead, I thought that the test process could be automated to some
extent. I considered implementing a client/server test system for the
final cluster, which would use some custom TCP protocol, or I don't
know, perhaps XML over HTTP :), to transmit test requests and replies.
A request would say which tests to perform, and the reply would contain
the results of those tests. I think you could make it pretty much
text-based using /proc and the output of familiar tools and logs.

OTOH, that's not very easy to do, so I decided to first do a very lame
hack: a test disk that boots a customized kernel and a root image, so
that the node starts up, tries to mount some NFS directory, performs
some selected tests (like testing the hard disk), and writes the results
back to a simple file on the NFS mount. Initially, I think that testing
the hard disk and reporting on the status of the network configuration
is enough. If it boots up and does that, it'd be fantastic.

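To make the lame hack a bit more concrete, here is a minimal sketch in
Python of what the node-side script might look like. Everything named in
it is hypothetical: the function names, the report layout, the scratch
file name, and the choice of one report file per host are assumptions
for illustration, not an existing tool. The disk check is deliberately
crude (it writes a pattern and reads it back, which may be served from
the page cache), so a real test would want something more thorough.

```python
import os
import time


def read_text(path):
    """Return the contents of a text file, or a note if it cannot be read."""
    try:
        with open(path, "r", errors="replace") as f:
            return f.read().strip()
    except OSError as exc:
        return "UNREADABLE: %s" % exc


def disk_write_read_check(scratch_dir, size=1024 * 1024):
    """Crude disk check: write a known pattern, read it back, compare."""
    path = os.path.join(scratch_dir, "nodetest.tmp")  # hypothetical name
    pattern = bytes(range(256)) * (size // 256)
    try:
        with open(path, "wb") as f:
            f.write(pattern)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the device
        with open(path, "rb") as f:
            ok = f.read() == pattern
    except OSError:
        ok = False
    finally:
        if os.path.exists(path):
            os.remove(path)
    return ok


def build_report(hostname, scratch_dir, proc_files):
    """Assemble a plain-text report: disk status plus raw /proc contents."""
    lines = ["node: %s" % hostname, "time: %d" % int(time.time())]
    status = "OK" if disk_write_read_check(scratch_dir) else "FAIL"
    lines.append("disk: %s" % status)
    for path in proc_files:
        lines.append("--- %s" % path)
        lines.append(read_text(path))
    return "\n".join(lines) + "\n"


def write_report(results_dir, hostname, report):
    """Write the report to <results_dir>/<hostname>.txt and return the path."""
    path = os.path.join(results_dir, hostname + ".txt")
    with open(path, "w") as f:
        f.write(report)
    return path
```

On a node, one would call build_report() with the host name, a directory
on the local disk, and a list such as ["/proc/net/dev", "/proc/meminfo"],
then hand the result to write_report() with the NFS mount point as
results_dir.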
Of course, the complete tool would be pretty handy. It could automate
testing for clusters and so help increase reliability. It could mail the
admin if things go weird, or, if a certain expected anomaly (!) arises,
perform some corrective operation (okay, the simplest I can think of is
re-installing everything on that node automagically if it doesn't
respond at all to our test server!!).

The disk is a good idea too. Like an install disk, a test disk would
boot by itself. It could use BOOTP (or RARP?) to configure the network,
and it would contain the test client, or perhaps a stripped-down test
client I should say, and go with it. The server package could have an
option to create a test disk with the desired tests on it. That might be
a neat hack, and I'd really like to see it done the Debian way.

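The server-side policy mentioned above (mail the admin when something
looks wrong, re-install a node that doesn't respond at all) could be
sketched roughly as follows in Python. This is only an illustration
under assumptions of mine: each node is assumed to leave a plain-text
report named <hostname>.txt in a shared directory, with lines such as
"disk: OK" or "disk: FAIL", and the action names are made up. The code
only decides what to do; actually sending mail or triggering a
re-install is left out.

```python
import os


def classify_node(results_dir, hostname):
    """Return 'ok', 'notify-admin' or 'reinstall' for a single node."""
    path = os.path.join(results_dir, hostname + ".txt")
    if not os.path.exists(path):
        # No report at all: the node never answered, so plan a re-install.
        return "reinstall"
    with open(path, "r", errors="replace") as f:
        text = f.read()
    # Any report line ending in FAIL is treated as "things went weird".
    if any(line.rstrip().endswith("FAIL") for line in text.splitlines()):
        return "notify-admin"
    return "ok"


def plan_actions(results_dir, expected_nodes):
    """Map every expected node name to the action chosen for it."""
    return {name: classify_node(results_dir, name) for name in expected_nodes}
```

Keeping the decision separate from the action makes it easy to print the
plan first and only later wire in a mailer or an installer.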
Comments welcome.
Indeed, some feedback would be appreciated. In particular, I need some
advice on which features are necessary, which other programs such a tool
could use or would have to interface with, and whether there is any need
for this hack in the first place.

Thanks,
--
Eray "eXa" Ozkural
CS, Bilkent University, Ankara
mail: erayo@cs.bilkent.edu.tr