
Bug#929359: linux: instability on arm64 MP30-AR1 servers



Source: linux
Version: 4.9.168-1
Severity: important
X-Debbugs-Cc: debian-arm@lists.debian.org, debian-admin@lists.debian.org
User: debian-admin@lists.debian.org
Usertags: needed-by-DSA-Team

Hi,

ever since the 9.9 point release, conova-node01.debian.org and
conova-node02.debian.org have been unstable.  They run for an hour or
three, then the kernel oopses and the network goes down.  Rebooting
back to 4.9.144-3.1 makes them stable again.

Latest example:

May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: PingAck did not arrive in time.
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: new current UUID 3EA2D1FA6B3ACD47:0BEBDA613EA56FD7:D5BF70E0AA6560C5:D5BE70E0AA6560C5
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: ack_receiver terminated
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Terminating drbd_a_resource
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Connection closed
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: conn( NetworkFailure -> Unconnected ) 
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: receiver terminated
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Restarting receiver thread
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: receiver (re)started
May 22 04:17:37 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: conn( Unconnected -> WFConnection ) 
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Handshake successful: Agreed network protocol version 101
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Peer authenticated using 16 bytes HMAC
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: conn( WFConnection -> WFReportParams ) 
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: drbd resource3: Starting ack_recv thread (from drbd_r_resource [8449])
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: drbd_sync_handshake:
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: self 3EA2D1FA6B3ACD47:0BEBDA613EA56FD7:D5BF70E0AA6560C5:D5BE70E0AA6560C5 bits:4 flags:0
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: peer 0BEBDA613EA56FD6:0000000000000000:D5BF70E0AA6560C4:D5BE70E0AA6560C5 bits:0 flags:0
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: uuid_compare()=1 by rule 70
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent ) 
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: helper command: /bin/true before-resync-source minor-3
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: helper command: /bin/true before-resync-source minor-3 exit code 0 (0x0)
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) 
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: Began resync as SyncSource (will sync 16 KB [4 bits set]).
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: updated sync UUID 3EA2D1FA6B3ACD47:0BECDA613EA56FD7:0BEBDA613EA56FD7:D5BF70E0AA6560C5
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: Resync done (total 1 sec; paused 0 sec; 16 K/sec)
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: updated UUIDs 3EA2D1FA6B3ACD47:0000000000000000:0BECDA613EA56FD7:0BEBDA613EA56FD7
May 22 04:17:38 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: block drbd3: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
May 22 04:17:48 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: efi: [Firmware Bug]: IRQ flags corrupted (0x00000140=>0x00000100) by EFI get_time
May 22 04:18:54 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: efi: [Firmware Bug]: IRQ flags corrupted (0x00000140=>0x00000100) by EFI set_time
May 22 04:18:54 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: efi: [Firmware Bug]: IRQ flags corrupted (0x00000140=>0x00000100) by EFI get_time
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Bad mode in FIQ handler detected on CPU0, code 0x56000000 -- SVC (AArch64)
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Internal error: Oops - bad mode: 0 [#1] SMP
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Modules linked in: openvswitch nf_nat_ipv6 nf_nat_ipv4 nf_nat binfmt_misc nls_ascii nls_cp437 vfat fat dm_mod ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_NFLOG nfnetlink_log nfnetlink xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_hashlimit xt_multiport xt_conntrack nf_conntrack iptable_filter ast ttm drm_kms_helper xgene_hwmon efi_pstore drm i2c_algo_bit xgene_edac edac_core xgene_dma joydev evdev chaoskey mailbox_xgene_slimpro sg xgene_rng rng_core efivars tun drbd lru_cache efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq crc32c_generic libcrc32c raid0 multipath linear raid1 hid_generic md_mod usbhid hid sd_mod
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel:  i2c_xgene_slimpro ahci_xgene libahci_platform libahci xhci_plat_hcd xgene_enet xhci_hcd libata phy_xgene marvell usbcore scsi_mod mdio_xgene of_mdio fixed_phy libphy usb_common gpio_xgene_sb
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: CPU: 0 PID: 1410 Comm: ovsdb-server Tainted: G        W I     4.9.0-9-arm64 #1 Debian 4.9.168-1+deb9u2
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Hardware name: GIGABYTE R120-P31/MP30-AR1, BIOS D7b 08/26/2016
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: task: ffff807ff9d54380 task.stack: ffff807f95c94000
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: PC is at 0xffffa10dbf00
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: LR is at 0xffffa13d221c
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: pc : [<0000ffffa10dbf00>] lr : [<0000ffffa13d221c>] pstate: a0000000
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: sp : 0000fffff72e8970
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x29: 0000fffff72e8970 x28: 0000000000000000 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x27: 0000aaaafa714d90 x26: 0000aaaafa7354c8 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x25: 0000aaaafa6eaed0 x24: 0000000000000018 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x23: 0000aaaafa72c660 x22: 0000aaaafa711b80 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x21: 0000000000000004 x20: 000000000000000c 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x19: 0000aaaafa702b90 x18: 00000000002597a9 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x17: 0000ffffa10dbec0 x16: 0000ffffa14837a0 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x15: ffffffffffffffff x14: 0000000000000010 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x13: 33613a63353a3834 x12: 3a66373a63613a36 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x11: 0101010101010101 x10: 0000000066666666 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x9 : 7f7f7f7f7f7f7f7f x8 : 0101010101010101 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x7 : 7f7fffffff7f7f7f x6 : feffa9a9f970ff72 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x5 : 8080000000008000 x4 : 0080000000008080 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x3 : 0000aaaafa720073 x2 : 726f7272655f7874 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: x1 : 0000aaaafa711c20 x0 : 0000000000000008 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Process ovsdb-server (pid: 1410, stack limit = 0xffff807f95c94020)
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: ---[ end trace 1fdaa7d4350a5508 ]---
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Bad mode in FIQ handler detected on CPU0, code 0x56000000 -- SVC (AArch64)
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: INFO: rcu_bh detected stalls on CPUs/tasks:
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel:      0-...: (1 GPs behind) idle=1fd/140000000000000/0 softirq=736283/736285 fqs=2434 
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel:      (detected by 2, t=5255 jiffies, g=15038, c=15037, q=8)
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Task dump for CPU 0:
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: ovsdb-server    R  running task        0  1410   1409 0x0000000a
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: Call trace:
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff000008086190>] __switch_to+0x90/0xd8
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000808b804>] bad_mode+0x6c/0x90
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<0000000021dc9afc>] 0x21dc9afc
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<0000000021db79b8>] 0x21db79b8
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff000008610748>] virt_efi_set_variable.part.6+0x68/0xb0
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff000008610898>] virt_efi_set_variable+0x78/0x90
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000860f020>] efivar_entry_set_safe+0xc8/0x200
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff0000010574b8>] efi_pstore_write+0x158/0x1b0 [efi_pstore]
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000830cdbc>] pstore_dump+0x17c/0x388
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff000008132a54>] kmsg_dump+0xac/0xd0
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff0000080cf5cc>] oops_exit+0x2c/0x38
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000808b0a4>] die+0xdc/0x1c8
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<ffff00000808b818>] bad_mode+0x80/0x90
May 22 04:23:51 conova-node01/conova-node01/::ffff:217.196.149.227 kernel: [<0000ffffa13d221c>] 0xffffa13d221c

I don't know whether the drbd messages are related to the Oops; I
suspect they are not, since I see similar messages before things
break.  In any case, the network is down after that point.  The
network driver is xgene-enet.

/etc/network/interfaces:

  # The loopback network interface
  auto lo
  iface lo inet loopback

  auto eth0
  iface eth0 inet manual
  	pre-up    echo 1 > /proc/sys/net/ipv6/conf/$IFACE/disable_ipv6
  	pre-up    ip link set dev $IFACE up
  	post-down ip link set dev $IFACE down

  # The primary network interface
  allow-hotplug br-inet
  iface br-inet inet static
  	address 217.196.149.227/28
  	gateway 217.196.149.238
  iface br-inet inet6 static
  	address 2a02:16a8:dc41:100::227/64
  	gateway 2a02:16a8:dc41:100::def

  auto eth1
  iface eth1 inet static
  	address 172.29.186.11/24

  auto eth2
  iface eth2 inet static
  	address 172.29.184.11/24

bridge config:

  # ovs-vsctl show
  91934a25-b86f-4d3a-a598-19f915404192
      Bridge br-inet
          Port "tap0"
              Interface "tap0"
          Port "eth0"
              Interface "eth0"
          Port br-inet
              Interface br-inet
                  type: internal
          Port "tap2"
              Interface "tap2"
                  error: "could not open network device tap2 (No such device)"
          Port "tap1"
              Interface "tap1"
      ovs_version: "2.6.2"

(the tap interfaces are for qemu VMs)

Cheers,
Julien

