
Bug#774702: linux-image-3.16.0-4-amd64: Regression in topology for multi-NUMA-node Haswell Xeon CPUs



Package: src:linux
Version: 3.16.7-ckt2-1
Severity: normal

Dear Maintainer,

On a machine with two Intel Haswell Xeon E5-2697 v3 CPUs, we are observing a regression in how the CPU topology is detected. Under Wheezy, Linux detects 2 sockets
and outputs the following text:

====><===============
Jan  6 15:15:11 pocn001 kernel: [    0.450629] Booting Node 0, Processors #1
Jan  6 15:15:11 pocn001 kernel: [    0.455199] smpboot cpu 1: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    0.567069] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    0.573406]  #2
Jan  6 15:15:11 pocn001 kernel: [    0.575160] smpboot cpu 2: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    0.686818] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    0.693158]  #3
Jan  6 15:15:11 pocn001 kernel: [    0.694911] smpboot cpu 3: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    0.806473] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    0.812809]  #4
Jan  6 15:15:11 pocn001 kernel: [    0.814562] smpboot cpu 4: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    0.926220] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    0.932548]  #5
Jan  6 15:15:11 pocn001 kernel: [    0.934302] smpboot cpu 5: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    1.045959] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    1.052293]  #6
Jan  6 15:15:11 pocn001 kernel: [    1.054047] smpboot cpu 6: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    1.165709] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    1.172099]  Ok.
Jan  6 15:15:11 pocn001 kernel: [    1.174143] Booting Node 1, Processors #7
Jan  6 15:15:11 pocn001 kernel: [    1.178712] smpboot cpu 7: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    1.289472] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    1.295830]  #8
Jan  6 15:15:11 pocn001 kernel: [    1.297584] smpboot cpu 8: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    1.409242] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    1.415599]  #9
Jan  6 15:15:11 pocn001 kernel: [    1.417354] smpboot cpu 9: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    1.529010] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    1.535350]  #10
Jan  6 15:15:11 pocn001 kernel: [    1.537201] smpboot cpu 10: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    1.648655] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    1.654984]  #11
Jan  6 15:15:11 pocn001 kernel: [    1.656835] smpboot cpu 11: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    1.768484] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    1.774815]  #12
Jan  6 15:15:11 pocn001 kernel: [    1.776667] smpboot cpu 12: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    1.888219] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    1.894552]  #13
Jan  6 15:15:11 pocn001 kernel: [    1.896403] smpboot cpu 13: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.008055] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.014445]  Ok.
Jan  6 15:15:11 pocn001 kernel: [    2.016491] Booting Node 2, Processors #14
Jan  6 15:15:11 pocn001 kernel: [    2.021156] smpboot cpu 14: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.131722] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.138096]  #15
Jan  6 15:15:11 pocn001 kernel: [    2.139948] smpboot cpu 15: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.251343] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.257713]  #16
Jan  6 15:15:11 pocn001 kernel: [    2.259564] smpboot cpu 16: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.371119] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.377469]  #17
Jan  6 15:15:11 pocn001 kernel: [    2.379320] smpboot cpu 17: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.490874] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.497218]  #18
Jan  6 15:15:11 pocn001 kernel: [    2.499070] smpboot cpu 18: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.610525] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.616866]  #19
Jan  6 15:15:11 pocn001 kernel: [    2.618717] smpboot cpu 19: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.730272] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.736616]  #20
Jan  6 15:15:11 pocn001 kernel: [    2.738468] smpboot cpu 20: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.850025] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.856412]  Ok.
Jan  6 15:15:11 pocn001 kernel: [    2.858455] Booting Node 3, Processors #21
Jan  6 15:15:11 pocn001 kernel: [    2.863122] smpboot cpu 21: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    2.973884] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    2.980261]  #22
Jan  6 15:15:11 pocn001 kernel: [    2.982113] smpboot cpu 22: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    3.093568] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    3.099939]  #23
Jan  6 15:15:11 pocn001 kernel: [    3.101791] smpboot cpu 23: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    3.213261] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    3.219631]  #24
Jan  6 15:15:11 pocn001 kernel: [    3.221483] smpboot cpu 24: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    3.332984] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    3.339329]  #25
Jan  6 15:15:11 pocn001 kernel: [    3.341181] smpboot cpu 25: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    3.452836] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    3.459188]  #26
Jan  6 15:15:11 pocn001 kernel: [    3.461040] smpboot cpu 26: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    3.572499] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    3.578847]  #27 Ok.
Jan  6 15:15:11 pocn001 kernel: [    3.581277] smpboot cpu 27: start_ip = 89000
Jan  6 15:15:11 pocn001 kernel: [    3.692337] NMI watchdog enabled, takes one hw-pmu counter.
Jan  6 15:15:11 pocn001 kernel: [    3.698561] Brought up 28 CPUs
Jan  6 15:15:11 pocn001 kernel: [    3.701962] Total of 28 processors activated (145597.28 BogoMIPS).
====><===============

lscpu gives:

====><===============
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                28
On-line CPU(s) list:   0-27
Thread(s) per core:    1
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               2601.000
BogoMIPS:              5199.94
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              17920K
NUMA node0 CPU(s):     0-6
NUMA node1 CPU(s):     7-13
NUMA node2 CPU(s):     14-20
NUMA node3 CPU(s):     21-27
====><===============

Booting the same machine, or one with the exact same hardware, using Jessie's kernel
leads to a different result:

====><===============
Jan  6 13:58:55 pocn501 kernel: [    0.444912] x86: Booting SMP configuration:
Jan  6 13:58:55 pocn501 kernel: [    0.449579] .... node #0, CPUs: #1
Jan  6 13:58:55 pocn501 kernel: [    0.468345] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
Jan  6 13:58:55 pocn501 kernel: [    0.477630]   #2  #3  #4  #5  #6
Jan  6 13:58:55 pocn501 kernel: [    0.551311] .... node #1, CPUs: #7
Jan  6 13:58:55 pocn501 kernel: [    0.567061] ------------[ cut here ]------------
Jan  6 13:58:55 pocn501 kernel: [    0.572421] WARNING: CPU: 7 PID: 0 at /build/linux-CMiYW9/linux-3.16.7-ckt2/arch/x86/kernel/smpboot.c:310 topology_sane.isra.2+0x7b/0x90()
Jan  6 13:58:55 pocn501 kernel: [    0.586304] sched: CPU #7's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
Jan  6 13:58:55 pocn501 kernel: [    0.597176] Modules linked in:
Jan  6 13:58:55 pocn501 kernel: [    0.600591] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt2-1
Jan  6 13:58:55 pocn501 kernel: [    0.610011] Hardware name: IBM IBM NeXtScale nx360 M5 -[5465FT1]-/00KG122, BIOS -[THE104FUS-1.03]- 11/26/2014
Jan  6 13:58:55 pocn501 kernel: [    0.621079] 0000000000000009 ffffffff81507263 ffff88046f9f7e58 ffffffff81065847
Jan  6 13:58:55 pocn501 kernel: [    0.629367] 0000000000000001 ffff88046f9f7ea8 ffff88087fc12980 0000000000012980
Jan  6 13:58:55 pocn501 kernel: [    0.637657] 000000000000a060 ffffffff810658ac ffffffff8170f760 ffff880400000030
Jan  6 13:58:55 pocn501 kernel: [    0.645948] Call Trace:
Jan  6 13:58:55 pocn501 kernel: [    0.648678] [<ffffffff81507263>] ? dump_stack+0x41/0x51
Jan  6 13:58:55 pocn501 kernel: [    0.654607] [<ffffffff81065847>] ? warn_slowpath_common+0x77/0x90
Jan  6 13:58:55 pocn501 kernel: [    0.661505] [<ffffffff810658ac>] ? warn_slowpath_fmt+0x4c/0x50
Jan  6 13:58:55 pocn501 kernel: [    0.668112] [<ffffffff810027ae>] ? calibrate_delay+0xbe/0x910
Jan  6 13:58:55 pocn501 kernel: [    0.674622] [<ffffffff8104236b>] ? topology_sane.isra.2+0x7b/0x90
Jan  6 13:58:55 pocn501 kernel: [    0.681519] [<ffffffff81042844>] ? set_cpu_sibling_map+0x484/0x500
Jan  6 13:58:55 pocn501 kernel: [    0.688515] [<ffffffff81042a04>] ? start_secondary+0x144/0x2d0
Jan  6 13:58:55 pocn501 kernel: [    0.695123] ---[ end trace 7f2af1a99481016b ]---
Jan  6 13:58:55 pocn501 kernel: [    0.720515]   #8  #9 #10 #11 #12 #13
Jan  6 13:58:55 pocn501 kernel: [    0.808491] .... node #2, CPUs: #14 #15 #16 #17 #18 #19 #20
Jan  6 13:58:55 pocn501 kernel: [    1.011650] .... node #3, CPUs: #21 #22 #23 #24 #25 #26 #27
Jan  6 13:58:55 pocn501 kernel: [    1.135087] x86: Booted up 4 nodes, 28 CPUs
Jan  6 13:58:55 pocn501 kernel: [    1.139961] smpboot: Total of 28 processors activated (145614.25 BogoMIPS)
====><===============

and lscpu now reports 4 sockets with 7 cores each, instead of 2 sockets with 14 cores each:

====><===============
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                28
On-line CPU(s) list:   0-27
Thread(s) per core:    1
Core(s) per socket:    7
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Stepping:              2
CPU MHz:               1272.679
CPU max MHz:           3600,0000
CPU min MHz:           1200,0000
BogoMIPS:              5201.29
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              17920K
NUMA node0 CPU(s):     0-6
NUMA node1 CPU(s):     7-13
NUMA node2 CPU(s):     14-20
NUMA node3 CPU(s):     21-27
====><===============
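
To make the difference between the two lscpu outputs explicit, here is a minimal Python sketch that parses the relevant fields and checks that sockets x cores x threads still adds up to 28 CPUs in both cases. The helper names (parse_lscpu, logical_cpus, diff_fields) are made up for illustration, and only a subset of the fields quoted above is included.

```python
def parse_lscpu(text):
    """Turn 'Key:   value' lines of lscpu output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def logical_cpus(fields):
    """Sockets x cores per socket x threads per core."""
    return (int(fields["Socket(s)"])
            * int(fields["Core(s) per socket"])
            * int(fields["Thread(s) per core"]))


def diff_fields(old, new):
    """Fields present in both outputs whose values differ."""
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}


# Subset of the lscpu output quoted above (Wheezy, then Jessie).
WHEEZY = """\
CPU(s):                28
Thread(s) per core:    1
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          4
"""
JESSIE = """\
CPU(s):                28
Thread(s) per core:    1
Core(s) per socket:    7
Socket(s):             4
NUMA node(s):          4
"""

wheezy = parse_lscpu(WHEEZY)
jessie = parse_lscpu(JESSIE)
changed = diff_fields(wheezy, jessie)

for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")
```

Both kernels account for all 28 CPUs and 4 NUMA nodes; only the socket count and the cores per socket differ.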

I attach the relevant log files and bug script output for your convenience. Please
let me know if you need more details.
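
For anyone scanning the attached kernel logs, the following is a rough Python sketch that picks out the "not on the same node" warning shown in the Jessie boot log. The regular expression is modelled only on the single warning line quoted in this report, so other variants of the message may not match; the names MC_SIBLING and find_sibling_warnings are hypothetical.

```python
import re

# Pattern modelled on the warning line quoted in this report.
MC_SIBLING = re.compile(
    r"sched: CPU #(?P<cpu>\d+)'s (?P<kind>\w+)-sibling CPU #(?P<sibling>\d+) "
    r"is not on the same node! \[node: (?P<node>\d+) != (?P<sibling_node>\d+)\]"
)


def find_sibling_warnings(log_text):
    """Return one dict per matching warning found in the log text."""
    results = []
    for match in MC_SIBLING.finditer(log_text):
        entry = {}
        for key, value in match.groupdict().items():
            # Everything except the sibling kind ("mc" here) is numeric.
            entry[key] = value if key == "kind" else int(value)
        results.append(entry)
    return results


# The warning line from the Jessie boot log above.
SAMPLE = (
    "Jan  6 13:58:55 pocn501 kernel: [    0.586304] sched: CPU #7's "
    "mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. "
    "Ignoring dependency."
)

warnings = find_sibling_warnings(SAMPLE)
print(warnings)
```

The attached logs are gzip-compressed, so their text could be read with gzip.open(path, "rt") before being passed to the function.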

Looking at recent changes in Linux 3.18, this might be resolved by the following commits:
- cebf15eb09a2fd2fa73ee4faa9c4d2f813cf0f09
- 728e5653e6fdb2a0892e94a600aef8c9a036c7eb

(We intend to test this during the week).

Regards

--
Mehdi

Attachment: kern_log_3.2.0-4-amd64.gz
Description: Binary data

Attachment: kern_log_3.16.0-4-amd64.gz
Description: Binary data

Attachment: reportbug-linux-image-3.2.0-4-amd64.gz
Description: Binary data

Attachment: reportbug-linux-image-3.16.0-4-amd64.gz
Description: Binary data

