Dear All

This is going to be a long email about various issues. Some are management issues rather than strictly 64-bit issues, but I thought I would post a full narrative in the hope that people would comment on parts of it.

The Server
----------

At the beginning of the year we installed a new dual 2.GHz Xeon (1 MB Cache) 2U fileserver with a S5350-1 U Dual Opteron Motherboard, 2GB RAM and an AMCC/3Ware 9500S-8 SATA RAID Controller.

I had initial problems finding an installer that supported the new 3Ware card. I had to use the stable image prepared by 3Ware itself (viewable here http://www.3ware.com/support/OS-support.asp), and chose the amd64 generic version.

For cost reasons, the four 250GB SATA disks were arranged in a single RAID5 array. The main file shares run on LVM2 volumes.

A very minimal server environment was installed using Debian testing (amd64). No X or other unnecessary packages were installed. The running daemons were mainly:

- apcupsd (battery daemon)
- samba
- netatalk (mac filesharing daemon)
- exim4 (light version)
- portmapper (resolving only on loopback address)
- sshd

The Problem
-----------

The server has been running very well, although we have noticed the RAID card battery complaining of high temperatures. We have been pressing the client to install air conditioning.

Yesterday some samba users could not log in, and a technician from our office arrived on site to find that /boot and /var on the server were empty.

With no log files left, we resorted to 'dmesg' and found little of interest. I didn't have the 3ware RAID utilities on the machine and didn't investigate the RAID array (although later, on reboot, it showed all OK; I didn't run a full verify though).

Thinking this might be a crack, we reset all the passwords on the firewalls and other server machines. A crack seems unlikely, as an intruder would have had to go through 2 other machines to get to the fileserver.
Simple forensics (running chkrootkit, and looking at /var/log/auth.log, lsof, ps etc. after reinstalling these) did not show up any problems.

This leaves the chance that someone removed /boot and /var from the console. This is very unlikely considering our client's office environment, although it could have been a newbie sysadmin.

Current theories for the Problem:

- a RAID card problem, possibly heat related
- an out-of-control daemon
  apcupsd? samba? netatalk?
- a kernel problem
  wrong kernel architecture, buggy kernel implementation by 3Ware
- cracker or newbie sysadmin killing /boot and /var

Resuscitation
-------------

I was able to get a working apt together fairly quickly, and decided to reinstall all the critical packages so that /var would be populated properly.

I encountered a major snag. All the amd64 repositories were offline or had only partial contents. In the end I wasn't able to rebuild the server this way (or install a new kernel) because the testing repositories were missing.

I had to rebuild the OS on the server, this time using the stable release as testing is unavailable. By carefully avoiding overwriting the LVM VG I was able to bring the main data volumes back online without loss.

Questions
---------

I'd be really grateful if anyone could suggest how or why this happened. In particular:

- can anyone suggest a 2.6.15 kernel flavour to use, with support for the 9550SX card, Xeons and a 64-bit motherboard?
- how should one work with the Testing release on 64-bit machines?

Thoughts and observations much appreciated.

Kind regards,
Rory

-- 
Rory Campbell-Lange
<rory@campbell-lange.net>
<www.campbell-lange.net>