General advice needed : rebuilding server



Dear All

This is going to be a long email about various issues. Some are
management issues rather than 64bit issues exactly, but I thought that I
would post a full narrative in the hope that people would comment on
parts of it.

The Server
----------

We installed a new dual 2.GHz Xeon (1 MB Cache) 2U fileserver with a
S5350-1 U Dual Opteron Motherboard, 2GB RAM and AMCC/3Ware 9500S-8 SATA
RAID Controller at the beginning of the year. I had initial problems
finding an installer to support the new 3Ware card -- I had to use the
stable image prepared by 3Ware itself (viewable here
http://www.3ware.com/support/OS-support.asp), and chose the amd64
generic version.

Four 250GB SATA disks were arranged, for cost reasons, in a single RAID5
array. The main file shares are running on LVM2 volumes.

A very minimalistic server environment was installed using Debian
testing (amd64). No X or other unnecessary packages were installed.
Running daemons were mainly
    
    apcupsd (battery daemon)
    samba
    netatalk (Mac file-sharing daemon)
    exim4 (light version)
    portmapper (resolving only on loopback address)
    sshd

The Problem
-----------

The server has been running very well, although we have noticed the RAID
card battery complaining of high temperatures. We have been pressing the
client to install air conditioning.

Yesterday some samba users could not log in and a technician from our
office arrived on site to find that /boot and /var on the server were
empty. With no log files to consult, we resorted to 'dmesg' but found
little of interest. I didn't have the 3ware RAID utilities on the machine
and didn't investigate the RAID array (on a later reboot it showed all
OK, although I didn't run a full verify).
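
A check along these lines, run periodically, would report directories
that have unexpectedly become empty. This is only a minimal sketch and
not something that was running on the server; the function name
`find_empty_dirs` and the watched paths are assumptions chosen for
illustration.

```python
import os


def find_empty_dirs(paths):
    """Return the paths that are missing or contain no entries at all."""
    suspicious = []
    for path in paths:
        try:
            # scandir yields entries lazily, so large directories are cheap
            # to test: we only need to know whether a first entry exists.
            with os.scandir(path) as entries:
                if next(entries, None) is None:
                    suspicious.append(path)
        except FileNotFoundError:
            suspicious.append(path)
    return suspicious


if __name__ == "__main__":
    # Hypothetical watch list; adjust to the filesystems that matter.
    for path in find_empty_dirs(["/boot", "/var"]):
        print(f"WARNING: {path} is missing or empty")
```

Run from cron, the output would be mailed to the administrator by the
usual cron mechanism, assuming local mail delivery is working.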

Thinking this might be a crack we reset all the passwords on the
firewalls and other server machines. It seems unlikely to be a crack as
an intruder would have had to go through 2 other machines to get to the
fileserver. Simple forensics (running chkrootkit, looking at
/var/log/auth.log, lsof, ps etc after reinstalling these) did not show
up any problems. This leaves the chance that someone removed /boot and
/var from the console. This is very unlikely considering our clients'
office environment, although it could have been a newbie sysadmin.

Current theories for the Problem:

- a RAID card problem, possibly heat related
- an out-of-control daemon
  apcupsd? samba? netatalk?
- a kernel problem
  wrong kernel architecture, buggy kernel implementation by 3Ware
- cracker or newbie sysadmin killing /boot and /var

Resuscitation
-------------

I was able to fairly quickly get a working apt together and decided to
reinstall all the critical packages so that /var would be populated
properly. 

I encountered a major snag. All the amd64 repositories were offline or
only had partial contents. In the end I wasn't able to rebuild the
server this way (or install a new kernel) due to missing testing
repositories.

I had to rebuild the OS on the server, this time using the stable
release, as testing was unavailable. By carefully avoiding overwriting
the LVM VG I was able to bring the main data volumes back online without
loss.

Questions
---------

I'd be really grateful to know if anyone could suggest how or why this
has happened. In particular:

- Can anyone suggest a 2.6.15 kernel flavour to use, with support for
  the 9550SX card, Xeons and a 64bit motherboard?
- How should one work with the Testing release on 64bit machines?

Thoughts and observations much appreciated.

Kind regards,
Rory

-- 
Rory Campbell-Lange 
<rory@campbell-lange.net>
<www.campbell-lange.net>
