
Re: What did wreck the system?



[Ralf]
> Dear List of Experts :)
>
> I do confess having entered 
> (*) e2fsadm -L +10M /dev/vg_system/lv_var
>
> without unmounting - and without getting those XXXXXXXX bar that usually 
> indicates progress / success. At that state, 98% of /var were full (used).

This should not have modified the file system: e2fsadm only resizes
the volume and leaves the file system untouched when it detects that
the file system is already mounted.

> Checking the disk usage with df, on /var allegedly 101% were used,
> the absolute amount of bytes being used was a large negative number!
> However, we still could browse /var at this state, while paging some
> log files lasted strikingly long, so there was already a feeling of
> corruption.

A full /var/ will break a lot of servers. :(

I guess the real problem here is
<URL:http://bugs.skolelinux.no/show_bug.cgi?id=653>.  The LVM volumes
created should be scaled with the available size.

> Could it be, that at this very state - even without the stupidity of
> (*) - the overcharged /var drive had led to corrupt ldap data as
> there was no further way to write to /var/ldap (or what the exact
> location is)?

Perhaps.  I believe it is more likely that the ldap server crashed
when it was unable to write to its logs, and left the LDAP database
intact.

> Instead of backing up as much from /var as possible, we then
> unmounted /var and gave it a "fsck -fy /dev/vg_system/lv_var" (I
> regret the 'y'). After pages of fixing messages, we mounted /var
> again - and found only a lost+found directory there. We managed to
> restore most of the data - but didn't get the ldap to running.

I'm not sure how fsck behaves when a disk is completely full.  

> What do you think now:
> [ ] The system got wrecked when /var run out of memory.

Yes.

> [ ] The system got wrecked when (*) was done.

I do not think so.

> [ ] fsck couldn't cope with the situation as there was no free space
> on the drive, which wrecked the system?

No idea.

> Now, we have RC-3 running - and this is not too bad - but for further 
> situations, one should learn some lessons. Please comment on those:
>
> (1) As a matter of fact, /var run out of memory. This was due to two facts:
>  (i) Taking in consideration that squid takes 100 MB out of 150 MB
> partitioned for /var, there is only 50 MB designed for logs AND
> ldap.

It might be a good idea to create a separate LVM volume at install
time for the squid cache.  I see no good reason why it shouldn't be on
a separate partition.

>  (ii) All logs go to tjener's /var - even logs from attached workstations 
> (this is what we believe, at least). Admittedly, our teacher's work
> station is quite old and once per second says "kernel: i8253 count
> to high! resetting"!  You can imagine that this message filled up
> /var/log/messages!  => LESSON: Make /var larger, filter the above
> mentioned message, trigger logrotate on size rather than on time.

We also provide Nagios so that the sysadmins can get a warning when
/var/ is getting full.
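
To illustrate what such a check boils down to, here is a minimal Python
sketch. It is not the Nagios plugin itself: the function names and the
80%/90% thresholds are made up for this example. The only convention it
borrows is the usual Nagios plugin exit status (0 = OK, 1 = WARNING,
2 = CRITICAL).

```python
import shutil
import sys

# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_usage(used, total, warn_pct=80.0, crit_pct=90.0):
    """Return (status, percent_used) for the given byte counts.

    The thresholds are assumptions for this sketch, not Skolelinux defaults.
    """
    percent = 100.0 * used / total
    if percent >= crit_pct:
        return CRITICAL, percent
    if percent >= warn_pct:
        return WARNING, percent
    return OK, percent


def check_path(path="/var"):
    """Check one mounted file system and print a one-line status."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    status, percent = classify_usage(usage.used, usage.total)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print(f"DISK {label} - {path} is {percent:.0f}% full")
    return status


if __name__ == "__main__":
    # Hypothetical usage: python check_var.py /var
    sys.exit(check_path(sys.argv[1] if len(sys.argv) > 1 else "/var"))
```

With these example thresholds, a /var that is 98% full would be
reported as CRITICAL well before it reaches 100%.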

>
> (2) A full /var/log corrupts ldap!
> => LESSON: Put those on different partitions, add /var/ldap (or what
> path it is) by default to the list of backup directories! (This was
> not the case with our system, was it, Klaus?)

I am not sure this is true.  If it is, it should be reported to the
OpenLDAP developers.

> (3) LESSON: Never try to enlarge mounted partitions!

You should get an error message when you try this, and it should fail
to do any harm.

> (4) LESSON: Never do fsck with -y option set on a full partition
> (rather (re)move some files first and omit -y switch)!

Perhaps a good idea.  I guess it depends on the file system type used.

> (5) LESSON: Always backup your system.

Good lesson. :)

> (6) LESSON: Don't use tight time slices for administration!

Or perhaps outsource the monitoring and administration?  Several
schools here in Norway are remotely administered by professional
sysadmins.


