
Re: etch on aranym, was Re: [buildd] Etch?

On Thu, 17 Aug 2006, Petr Stehlik wrote:

> Finn Thain wrote:
> > > > difficult to reproduce the bug?
> > > It's kinda random.
> > 
> > In that case, it might be necessary to make the scheduler behave in a 
> > more deterministic way (maybe realtime priority?). Single-user mode 
> > would help.
> I could try upgrading from sarge to etch in single-user mode to see if 
> it changes anything.

Yes, but that won't really help isolate a workload that fails every time, 
since the upgrade will operate differently on a second run. I guess you 
could back up the hard disk image first.

Single-user mode was just a way to try to eliminate non-deterministic 
scheduler behaviour, in the interests of repeatability, by making sure that 
there were no other runnable processes in the system.
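The realtime-priority idea could look like this, assuming chrt from 
util-linux is available (the priority value 50 is an arbitrary choice, and 
setting a realtime policy needs root):

```shell
# Try to run the workload under SCHED_FIFO, so the scheduler cannot
# interleave other runnable tasks ahead of it.
if chrt -f 50 true 2>/dev/null; then
    msg="workload can run under SCHED_FIFO"
else
    msg="no realtime permission; run as root (or stick to single-user mode)"
fi
echo "$msg"
```

Under SCHED_FIFO the test script is never preempted by ordinary tasks, 
which removes one source of run-to-run variation.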

> > I'd create a script, say /root/crash.sh, make it executable, and boot 
> > the kernel with "init=/root/crash.sh". In crash.sh I'd run some 
> > single-threaded stress tests.
> > 
> > http://samba.org/ftp/tridge/dbench/README 
> > http://weather.ou.edu/~apw/projects/stress/ 
> > http://www.bitmover.com/lmbench/
> FYI, I have just finished the following test:
> # stress -c 4 -i 16 -m 3 --vm-bytes 32M -d 4 --hdd-bytes 128M
> It's been running for almost 5 hours. No problem detected. On another 
> console I ran "while true; do uptime; sleep 300; done" and saw a 
> consistent load of 28-29.

That is a long run queue. If you did find a problem that way, it could be 
very hard to reproduce because of the interactions of all the tasks.

> So the machine was busy stressing CPU, memory and disk but it didn't 
> detect anything wrong.

Well, maybe we need to concentrate on I/O. I'd try continuous tripwire 
checks, or a similar intrusion detection system.
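A lightweight stand-in for tripwire, assuming md5sum is available: record a 
checksum baseline while the data is known good, then re-verify it in a loop 
(the target tree here is a temporary directory for illustration; on the 
real machine it would be data living on the emulated disk):

```shell
# Hypothetical target tree -- point this at real data on the emulated disk.
TREE=$(mktemp -d)
BASELINE=$(mktemp)
printf 'abc\n' > "$TREE/a"
printf 'def\n' > "$TREE/b"

# Record a baseline of checksums once, while the data is known good.
( cd "$TREE" && md5sum a b ) > "$BASELINE"

# Re-verify continuously; md5sum -c flags any file whose contents
# read back differently from the baseline.
for pass in 1 2 3; do                 # in practice: while true; do ...; done
    ( cd "$TREE" && md5sum -c "$BASELINE" ) > /dev/null \
        || echo "CORRUPTION detected on pass $pass"
done
echo "verified $pass passes"
```

Any mismatch on a re-read points at the I/O path rather than the CPU or 
memory workers.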

> > If you can't reproduce the problem that way, I'd try introducing more 
> > context switching into the workload.
> like stress -c 1k instead of -c 4?

To get a single-threaded test, I'd try -c 0 -i 0 -m 0, but maybe one 
forked worker is the minimum stress will accept (?)
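If stress insists on forking workers, a single-threaded disk loop is easy 
to write by hand. A sketch doing roughly what one "stress -d 1" worker does 
(scratch path and sizes are assumptions, scaled down here):

```shell
# Single-threaded stand-in for `stress -d 1`: one process writing and
# syncing a scratch file in a loop, so there is exactly one runnable
# task and a failure is easier to reproduce.
SCRATCH=$(mktemp /tmp/hddstress.XXXXXX)
for pass in 1 2 3; do                 # in practice: while :; do ...; done
    dd if=/dev/zero of="$SCRATCH" bs=1M count=4 2>/dev/null  # 128M on a real run
    sync
done
rm -f "$SCRATCH"
echo "completed $pass write passes"
```

With a run queue of one, the scheduler has essentially nothing to decide, 
so any crash should recur at roughly the same point.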

> > > #!/usr/bin/perl
> > 
> > Are you sure the problem was not confined to the buffer cache?
> I am not sure at all.

If we are going to test disk I/O, we must find a way to disable the buffer 
cache completely. Does anyone know how to do this?
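One approximation, assuming a 2.6.16 or later kernel: flush dirty pages and 
then ask the kernel to drop its clean caches via /proc, so the next read 
really goes through the driver (the knob needs root, hence the guard):

```shell
# Flush dirty pages to the disk image, then drop clean page-cache,
# dentry and inode caches (needs root; /proc/sys/vm/drop_caches exists
# on 2.6.16+ kernels).
sync
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
    echo "caches dropped"
else
    echo "not root; fall back to umount/remount to invalidate the cache"
fi
```

This doesn't disable the buffer cache, but dropping it between runs means 
each read exercises the disk path rather than memory.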


> > Re-reading the same file after an unmount/remount would determine that.
> I will try that next time.
> Petr
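That re-read test can be sketched as follows; the mount point and file name 
are assumptions, to be pointed at the filesystem under suspicion (the guard 
keeps it harmless if they don't exist):

```shell
# Mount point and file are assumptions -- point them at the filesystem
# under suspicion on the emulated disk.
MNT=${MNT:-/mnt/test}
FILE="$MNT/somefile"

if [ -f "$FILE" ]; then
    sum1=$(md5sum "$FILE" | cut -d' ' -f1)   # possibly-cached read
    umount "$MNT" && mount "$MNT"            # discard cached pages (needs root)
    sum2=$(md5sum "$FILE" | cut -d' ' -f1)   # forced re-read from disk
    if [ "$sum1" = "$sum2" ]; then
        echo "match: cache and disk agree"
    else
        echo "MISMATCH: buffer cache is suspect"
    fi
else
    echo "set MNT/FILE to a file on the filesystem under test"
fi
```

A mismatch only on the cached read would implicate the buffer cache; a 
mismatch that survives the remount would point at the disk path itself.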
