Bug#929359: linux: instability on arm64 MP30-AR1 servers
Source: linux
Version: 4.9.168-1
Severity: important
X-Debbugs-Cc: debian-arm@lists.debian.org, debian-admin@lists.debian.org
User: debian-admin@lists.debian.org
Usertags: needed-by-DSA-Team
Hi,
ever since the 9.9 point release, conova-node01.debian.org and
conova-node02.debian.org have been unstable. They run for an hour or
three, and then the kernel oopses and the network goes down. Rebooting
back to 4.9.144-3.1 makes them stable again.
Latest example:
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: PingAck did not arrive in time.
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: new current UUID 3EA2D1FA6B3ACD47:0BEBDA613EA56FD7:D5BF70E0AA6560C5:D5BE70E0AA6560C5
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: ack_receiver terminated
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Terminating drbd_a_resource
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Connection closed
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: conn( NetworkFailure -> Unconnected )
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: receiver terminated
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Restarting receiver thread
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: receiver (re)started
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: conn( Unconnected -> WFConnection )
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Handshake successful: Agreed network protocol version 101
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Peer authenticated using 16 bytes HMAC
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: conn( WFConnection -> WFReportParams )
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Starting ack_recv thread (from drbd_r_resource [8449])
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: drbd_sync_handshake:
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: self 3EA2D1FA6B3ACD47:0BEBDA613EA56FD7:D5BF70E0AA6560C5:D5BE70E0AA6560C5 bits:4 flags:0
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: peer 0BEBDA613EA56FD6:0000000000000000:D5BF70E0AA6560C4:D5BE70E0AA6560C5 bits:0 flags:0
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: uuid_compare()=1 by rule 70
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: helper command: /bin/true before-resync-source minor-3
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: helper command: /bin/true before-resync-source minor-3 exit code 0 (0x0)
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: Began resync as SyncSource (will sync 16 KB [4 bits set]).
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: updated sync UUID 3EA2D1FA6B3ACD47:0BECDA613EA56FD7:0BEBDA613EA56FD7:D5BF70E0AA6560C5
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: Resync done (total 1 sec; paused 0 sec; 16 K/sec)
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: updated UUIDs 3EA2D1FA6B3ACD47:0000000000000000:0BECDA613EA56FD7:0BEBDA613EA56FD7
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
May 22 04:17:48 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: efi: [Firmware Bug]: IRQ flags corrupted (0x00000140=>0x00000100) by EFI get_time
May 22 04:18:54 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: efi: [Firmware Bug]: IRQ flags corrupted (0x00000140=>0x00000100) by EFI set_time
May 22 04:18:54 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: efi: [Firmware Bug]: IRQ flags corrupted (0x00000140=>0x00000100) by EFI get_time
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Bad mode in FIQ handler detected on CPU0, code 0x56000000 -- SVC (AArch64)
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Internal error: Oops - bad mode: 0 [#1] SMP
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Modules linked in: openvswitch nf_nat_ipv6 nf_nat_ipv4 nf_nat binfmt_misc nls_ascii nls_cp437 vfat fat dm_mod ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_NFLOG nfnetlink_log nfnetlink xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_hashlimit xt_multiport xt_conntrack nf_conntrack iptable_filter ast ttm drm_kms_helper xgene_hwmon efi_pstore drm i2c_algo_bit xgene_edac edac_core xgene_dma joydev evdev chaoskey mailbox_xgene_slimpro sg xgene_rng rng_core efivars tun drbd lru_cache efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq crc32c_generic libcrc32c raid0 multipath linear raid1 hid_generic md_mod usbhid hid sd_mod
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: i2c_xgene_slimpro ahci_xgene libahci_platform libahci xhci_plat_hcd xgene_enet xhci_hcd libata phy_xgene marvell usbcore scsi_mod mdio_xgene of_mdio fixed_phy libphy usb_common gpio_xgene_sb
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: CPU: 0 PID: 1410 Comm: ovsdb-server Tainted: G W I 4.9.0-9-arm64 #1 Debian 4.9.168-1+deb9u2
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Hardware name: GIGABYTE R120-P31/MP30-AR1, BIOS D7b 08/26/2016
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: task: ffff807ff9d54380 task.stack: ffff807f95c94000
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: PC is at 0xffffa10dbf00
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: LR is at 0xffffa13d221c
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: pc : [<0000ffffa10dbf00>] lr : [<0000ffffa13d221c>] pstate: a0000000
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: sp : 0000fffff72e8970
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x29: 0000fffff72e8970 x28: 0000000000000000
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x27: 0000aaaafa714d90 x26: 0000aaaafa7354c8
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x25: 0000aaaafa6eaed0 x24: 0000000000000018
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x23: 0000aaaafa72c660 x22: 0000aaaafa711b80
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x21: 0000000000000004 x20: 000000000000000c
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x19: 0000aaaafa702b90 x18: 00000000002597a9
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x17: 0000ffffa10dbec0 x16: 0000ffffa14837a0
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x15: ffffffffffffffff x14: 0000000000000010
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x13: 33613a63353a3834 x12: 3a66373a63613a36
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x11: 0101010101010101 x10: 0000000066666666
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x9 : 7f7f7f7f7f7f7f7f x8 : 0101010101010101
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x7 : 7f7fffffff7f7f7f x6 : feffa9a9f970ff72
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x5 : 8080000000008000 x4 : 0080000000008080
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x3 : 0000aaaafa720073 x2 : 726f7272655f7874
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x1 : 0000aaaafa711c20 x0 : 0000000000000008
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel:
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Process ovsdb-server (pid: 1410, stack limit = 0xffff807f95c94020)
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: ---[ end trace 1fdaa7d4350a5508 ]---
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Bad mode in FIQ handler detected on CPU0, code 0x56000000 -- SVC (AArch64)
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: INFO: rcu_bh detected stalls on CPUs/tasks:
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: 0-...: (1 GPs behind) idle=1fd/140000000000000/0 softirq=736283/736285 fqs=2434
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: (detected by 2, t=5255 jiffies, g=15038, c=15037, q=8)
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Task dump for CPU 0:
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: ovsdb-server R running task 0 1410 1409 0x0000000a
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Call trace:
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff000008086190>] __switch_to+0x90/0xd8
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000808b804>] bad_mode+0x6c/0x90
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<0000000021dc9afc>] 0x21dc9afc
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<0000000021db79b8>] 0x21db79b8
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff000008610748>] virt_efi_set_variable.part.6+0x68/0xb0
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff000008610898>] virt_efi_set_variable+0x78/0x90
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000860f020>] efivar_entry_set_safe+0xc8/0x200
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff0000010574b8>] efi_pstore_write+0x158/0x1b0 [efi_pstore]
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000830cdbc>] pstore_dump+0x17c/0x388
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff000008132a54>] kmsg_dump+0xac/0xd0
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff0000080cf5cc>] oops_exit+0x2c/0x38
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000808b0a4>] die+0xdc/0x1c8
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000808b818>] bad_mode+0x80/0x90
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<0000ffffa13d221c>] 0xffffa13d221c
I don't know if the drbd messages are related to the Oops; I guess they
may not be, as I see similar drbd messages before things break. In any
case, after the Oops the network is down. The network driver is xgene-enet.
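One way to check whether the drbd disconnects line up with the Oops is to pull the timestamps of both out of the syslog and compare them. The following is only a minimal sketch, assuming log lines in the format shown above. The function names, the marker names and the `year` parameter are illustrative choices, not part of any existing tool; the year has to be supplied by the caller because these syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Marker substrings copied from the log excerpt; the keys are made-up labels.
MARKERS = {
    "drbd_pingack": "PingAck did not arrive in time",
    "efi_irq_flags": "IRQ flags corrupted",
    "bad_mode": "Bad mode in FIQ handler",
}

# Matches a leading syslog timestamp such as "May 22 04:17:37".
TS_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")


def classify(line, year):
    """Return (timestamp, marker_name) if the line holds a marker, else None."""
    m = TS_RE.match(line)
    if not m:
        return None
    for name, needle in MARKERS.items():
        if needle in line:
            # Prepend the caller-supplied year so the date is unambiguous.
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            return ts, name
    return None


def seconds_before_oops(lines, year):
    """List (marker_name, seconds) for each event preceding the first Oops."""
    events = [e for e in (classify(l, year) for l in lines) if e]
    oops = next((ts for ts, name in events if name == "bad_mode"), None)
    if oops is None:
        return []
    return [
        (name, (oops - ts).total_seconds())
        for ts, name in events
        if name != "bad_mode" and ts <= oops
    ]
```

Feeding it the lines of a kernel log (for example via `open(path)` on wherever the syslog ends up, which depends on the local setup) gives one entry per drbd or EFI warning with its distance in seconds from the first "Bad mode" line. A consistently short gap across several crashes would hint at a relation; a widely varying one would not.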
/etc/network/interfaces:
# The loopback network interface
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
    pre-up echo 1 > /proc/sys/net/ipv6/conf/$IFACE/disable_ipv6
    pre-up ip link set dev $IFACE up
    post-down ip link set dev $IFACE down

# The primary network interface
allow-hotplug br-inet
iface br-inet inet static
    address 217.196.149.227/28
    gateway 217.196.149.238
iface br-inet inet6 static
    address 2a02:16a8:dc41:100::227/64
    gateway 2a02:16a8:dc41:100::def

auto eth1
iface eth1 inet static
    address 172.29.186.11/24

auto eth2
iface eth2 inet static
    address 172.29.184.11/24
bridge config:
# ovs-vsctl show
91934a25-b86f-4d3a-a598-19f915404192
    Bridge br-inet
        Port "tap0"
            Interface "tap0"
        Port "eth0"
            Interface "eth0"
        Port br-inet
            Interface br-inet
                type: internal
        Port "tap2"
            Interface "tap2"
                error: "could not open network device tap2 (No such device)"
        Port "tap1"
            Interface "tap1"
    ovs_version: "2.6.2"
(the tap interfaces are for qemu VMs)
Cheers,
Julien