
Re: Xen support on Squeeze



"Nikita V. Youshchenko" <yoush@debian.org> writes:

>> We have had to carry that patch without any upstream support (or sharing
>> with Novell, which eventually released SLES 11 with 2.6.27).  As a
>> result, the xen-flavour kernels for lenny are very buggy, particularly
>> for domains with multiple vCPUs (though that *may* be fixed now).
>
> Unfortunately it is not fixed.
>
> We here once migrated to xen and now rely on it, and that gives lots of 
> frustration. For any loaded domain we still have to run etch kernel, 
> because lenny kernel constantly crashes after several days of heavy load. 
> Dom0's run lenny kernel - and with a fix for #542250 they don't crash, but 
> those are almost unloaded.

I was also having problems with multiple vCPUs: under moderate load I
would regularly get crashes. I reported my findings in #504805. I
swapped out machines, but that didn't help. When the fix for
xen_spin_wait() came out, I eagerly switched to it, but it didn't fix my
problem. I even tried my hardest to switch to the latest upstream Xen
kernel to see if that would fix things, but it was way too unstable and
I couldn't get it to work at all.

Eventually I stumbled on a way to keep my machines from restarting. It's
not a great solution, but it saves me from having to deal with the
failure on a daily basis. I think anyone else who is having this
problem can do the same and it will work. Obviously this is not the
right solution, but it works until we can get a fix.

First I made sure this was set:

/etc/xen/xend-config.sxp: (dom0-cpus 0)

Then I pinned individual physical CPUs to specific domUs. Once they are
pinned, the problem stops.

What does that mean? Well, Xen does this wacky thing where it creates
virtual CPUs (VCPUs). Each domU has one of them by default (but you can
have more), and Xen then shuffles those VCPUs around the physical CPUs
depending on need.


So let's say you have four CPUs and a domU. That domU has one VCPU by
default. That VCPU could be serviced by physical CPU 0, 1, 2 or 3, and
the scheduler is free to move it from one to another at any time. I read
somewhere that this can be a performance hit, because of the extra
migration and context switching involved. I also read that it could
cause some instability (!), and pinning the physical CPUs so they don't
move around seemed to solve this.

The pinning does not stick across reboots, so it has to be done again
whenever the system is rebooted. It isn't really possible to set this in
a startup script, at least I don't think so.

So how do you do this? If you look at 'xm vcpu-list' (which annoyingly
isn't listed in 'xm help') you will see the CPU column populated with
whichever CPU the scheduler happened to pick, and the CPU Affinity
column saying 'any cpu' for every VCPU. This means that any physical CPU
could end up servicing any of them, and would, depending on the
scheduling. Once you pin things, the individual domUs have specific CPU
affinities, so the CPUs don't 'travel' between them, and magically the
crashes stop.

So an example:

root@shoveler:~# xm vcpu-list
Name                ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0             0     0     1   -b-  283688.8 any cpu
Domain-0             0     1     1   ---   39666.3 any cpu
Domain-0             0     2     1   r--   49224.4 any cpu
Domain-0             0     3     1   -b-   75591.1 any cpu
kite                 1     0     3   -b-   71411.8 any cpu
murrelet             2     0     0   -b-  472222.2 any cpu
test                 3     0     0   r--  342182.3 any cpu
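
With more than a handful of domains it can be easier to let a script
pick out the unpinned VCPUs. The following is only a sketch: it assumes
the column layout shown in the listing above (affinity as the last
column, printed as the two words 'any cpu'), and the function name
unpinned_vcpus is made up for illustration.

```shell
#!/bin/sh
# Sketch: read `xm vcpu-list` output on stdin and print "<domain> <vcpu>"
# for every VCPU whose affinity is still "any cpu".
# Assumes the 7-column layout shown above; "any cpu" splits into
# fields 7 and 8, so the header and pinned rows never match.
unpinned_vcpus() {
    awk '$7 == "any" && $8 == "cpu" { print $1, $3 }'
}
```

It would be used as 'xm vcpu-list | unpinned_vcpus'; empty output means
every VCPU has an explicit affinity.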

So we want to fix that final column using 'xm vcpu-pin' (also a command
not listed in 'xm help'):

Usage: xm vcpu-pin <Domain> <VCPU|all> <CPUs|all>

Set which CPUs a VCPU can use.

root@shoveler:~# xm vcpu-pin 0 0 0
root@shoveler:~# xm vcpu-pin 0 1 0
root@shoveler:~# xm vcpu-pin 0 2 0
root@shoveler:~# xm vcpu-pin 0 3 0
root@shoveler:~# xm vcpu-pin 1 0 1
root@shoveler:~# xm vcpu-pin 2 0 2
root@shoveler:~# xm vcpu-pin 3 0 3
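
Since the pinning has to be redone by hand after every reboot, the
commands above could be wrapped in a small helper. This is an untested
sketch, not something known to work at boot time: it relies on the
'all' form from the usage line to pin every VCPU of a domain at once,
the XM variable and the pin_from_map name are made up for illustration
(XM exists only so the commands can be printed instead of run), and
domain IDs may well differ after a reboot, so the map has to match the
IDs that 'xm vcpu-list' shows at the time.

```shell
#!/bin/sh
# Sketch: re-apply a domain -> physical CPU pinning map.
# XM is a hypothetical override; set XM="echo xm" for a dry run.
XM="${XM:-xm}"

# Reads lines of "<domain> <cpu>" on stdin and pins all VCPUs of each
# domain to that CPU. Blank lines, '#' comments and lines without a CPU
# are skipped. Stops at the first failing command.
pin_from_map() {
    while read -r domain cpu; do
        case "$domain" in
            ''|'#'*) continue ;;
        esac
        [ -n "$cpu" ] || continue
        $XM vcpu-pin "$domain" all "$cpu" || return 1
    done
}
```

Feeding it the four lines '0 0', '1 1', '2 2' and '3 3' on stdin would
issue the same pinning as the commands above, with Domain-0's four
VCPUs handled by a single 'all'.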

root@shoveler:~# xm vcpu-list
Name                 ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0              0     0     1   -b-  283700.3 0
Domain-0              0     1     1   r--   39669.6 0
Domain-0              0     2     1   -b-   49227.4 0
Domain-0              0     3     1   -b-   75596.2 0
kite                  1     0     3   -b-   71415.3 1
murrelet              2     0     0   -b-  472237.8 2
test                  3     0     0   r--  342182.3 3


And voila, no more crashes... :P

micah

