Re: etch on aranym, was Re: [buildd] Etch?
On Thu, 17 Aug 2006, Petr Stehlik wrote:
> Finn Thain wrote:
> > > > difficult to reproduce the bug?
> > > It's kinda random.
> > In that case, it might be necessary to make the scheduler behave in a
> > more deterministic way (maybe realtime priority?). Single-user mode
> > would help.
> I could try upgrading sarge to etch in single-user mode to see if it
> changes anything.
Yes, but that won't really help to isolate a workload that fails every
time, since the upgrade will behave differently on a second run. I guess
you could back up the hard disk image first.
Single-user mode was just a way to eliminate non-deterministic scheduler
behaviour in the interests of repeatability, by making sure that there
were no other runnable processes in the system.
> > I'd create a script, say /root/crash.sh, make it executable, and boot
> > the kernel with "init=/root/crash.sh". In crash.sh I'd run some
> > single-threaded stress tests.
> > http://samba.org/ftp/tridge/dbench/README
> > http://weather.ou.edu/~apw/projects/stress/
> > http://www.bitmover.com/lmbench/
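For concreteness, a sketch of such a crash.sh (the stress invocation is
only an example; note that stress and its libraries have to be on the root
filesystem, since nothing else is mounted when init starts):

```shell
# Write out a minimal crash.sh for use with init=/root/crash.sh.
# The stress options below are just an example; adjust as needed.
cat > crash.sh <<'EOF'
#!/bin/sh
# Running as init: the root fs is read-only and /proc is not mounted yet.
mount -o remount,rw /
mount -t proc proc /proc
# A single worker keeps the workload close to single-threaded.
stress --cpu 1 --timeout 3600
# Drop to a shell afterwards so the machine stays usable.
exec /bin/sh
EOF
chmod +x crash.sh
```

(On the real machine the script would be saved as /root/crash.sh, of
course.)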
> FYI, I have just finished the following test:
> # stress -c 4 -i 16 -m 3 --vm-bytes 32M -d 4 --hdd-bytes 128M
> It's been running for almost 5 hours. No problem detected. On another
> console I ran "while true; do uptime; sleep 300; done" and saw a
> consistent load of 28-29.
That is a long run queue. If you did find a problem that way, it could be
very hard to reproduce because of the interactions of all the tasks.
> So the machine was busy stressing CPU, memory and disk but it didn't
> detect anything wrong.
Well, maybe we need to concentrate on I/O. I'd try continuous tripwire
checks, or a similar intrusion detection system.
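Something along these lines would do as a poor man's tripwire: checksum a
tree once, then keep re-verifying it (the directory, the file limit and
the pass count here are placeholders; a real run would loop indefinitely
over a much larger tree):

```shell
#!/bin/sh
# Poor man's tripwire: record checksums once, then keep re-verifying.
# Placeholders: the directory, the file limit and the pass count.
DIR=${1:-/usr/bin}
find "$DIR/" -type f 2>/dev/null | head -n 50 | xargs -r md5sum > baseline.md5
pass=1
while [ "$pass" -le 3 ]; do          # a real run would loop forever
    if md5sum -c --quiet baseline.md5; then
        echo "pass $pass: ok"
    else
        echo "pass $pass: MISMATCH"
        break
    fi
    pass=$((pass + 1))
done
```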
> > If you can't reproduce the problem that way, I'd try introducing more
> > context switching into the workload.
> like stress -c 1k instead of -c 4?
To get a single-threaded test, I'd be trying -c 0 -i 0 -m 0, but maybe
one fork is the minimum (?)
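Or skip stress altogether: one process doing sequential write-then-read
cycles with dd gives a strictly single-threaded disk load (the file name,
sizes and pass count below are placeholders; a real run would loop
forever):

```shell
#!/bin/sh
# Strictly single-threaded disk load: one process, sequential
# write-then-read cycles. Placeholders: file name, sizes, pass count.
FILE=stressfile
for pass in 1 2 3; do               # a real run would loop forever
    dd if=/dev/zero of="$FILE" bs=1024 count=1024 2>/dev/null
    sync
    dd if="$FILE" of=/dev/null bs=1024 2>/dev/null
done
echo "wrote and re-read $(wc -c < "$FILE") bytes per pass"
```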
> > > #!/usr/bin/perl
> > Are you sure the problem was not confined to the buffer cache?
> I am not sure at all.
If we are going to test disk I/O, we must find a way to disable the buffer
cache completely. Does anyone know how to do this?
> > Re-reading the same file after an unmount/remount would determine that.
> Will try next time.
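For reference, the unmount/remount check could go something like this (a
sketch only: /dev/sda1, /mnt/test and somefile are placeholders, and it
has to run as root on the machine under test):

```shell
# Sketch of the re-read-after-remount test. /dev/sda1, /mnt/test and
# somefile are placeholders. Must run as root on the machine under test.
cat > remount-check.sh <<'EOF'
#!/bin/sh
set -e
before=$(md5sum /mnt/test/somefile | cut -d' ' -f1)
umount /mnt/test
mount /dev/sda1 /mnt/test     # next read must come from the disk
after=$(md5sum /mnt/test/somefile | cut -d' ' -f1)
if [ "$before" = "$after" ]; then
    echo "match: any corruption was confined to the buffer cache"
else
    echo "MISMATCH: the on-disk copy differs"
fi
EOF
chmod +x remount-check.sh
```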