
QNAP TS-419 and TS-219P troubles during today's jessie updates: Massive kernel oopses + no boot



Hi, I've recently upgraded three QNAP devices from wheezy to
jessie (testing). (By the way, I sent an upgrade report about the
wheezy->jessie upgrade to the bug tracker earlier, as
http://bugs.debian.org/781742.)

As I was updating from jessie circa 2015.04.02 to jessie today 2015.04.05, I
hit a few major issues.

The updates applied today, according to /var/log/apt/history.log, were:

>Upgrade: bsdutils:armel (2.25.2-5, 2.25.2-6), perl-modules:armel (5.20.2-2,
>5.20.2-3), libcap2:armel (2.24-7, 2.24-8), libudev1:armel (215-12, 215-14),
>perl:armel (5.20.2-2, 5.20.2-3), unrar:armel (5.0.10-1, 5.2.7-0.1),
>systemd-sysv:armel (215-12, 215-14), libmount1:armel (2.25.2-5, 2.25.2-6),
>libblkid1:armel (2.25.2-5, 2.25.2-6), mount:armel (2.25.2-5, 2.25.2-6),
>perl-doc:armel (5.20.2-2, 5.20.2-3), systemd:armel (215-12, 215-14),
>libsystemd0:armel (215-12, 215-14), libcap2-bin:armel (2.24-7, 2.24-8),
>udev:armel (215-12, 215-14), util-linux:armel (2.25.2-5, 2.25.2-6),
>perl-base:armel (5.20.2-2, 5.20.2-3), libperl5.20:armel (5.20.2-2,
>5.20.2-3), util-linux-locales:armel (2.25.2-5, 2.25.2-6), libuuid1:armel
>(2.25.2-5, 2.25.2-6), libsmartcols1:armel (2.25.2-5, 2.25.2-6)
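For comparing what was upgraded on each machine, an Upgrade: entry like the
one above can be split into packages and versions. This is only a minimal
sketch: it assumes the raw, unwrapped line from /var/log/apt/history.log
(without the ">" quoting added in this mail), and the function name
parse_upgrade_line is made up for illustration.

```python
import re

# Matches entries such as "bsdutils:armel (2.25.2-5, 2.25.2-6)".
ENTRY_RE = re.compile(r"(\S+?):(\S+) \(([^,()]+), ([^()]+)\)")


def parse_upgrade_line(line):
    """Return {package: (arch, old_version, new_version)} for one Upgrade: line."""
    body = line.split("Upgrade:", 1)[1]
    return {
        name: (arch, old.strip(), new.strip())
        for name, arch, old, new in ENTRY_RE.findall(body)
    }


# Three entries copied from the log excerpt above, joined onto one line.
sample = (
    "Upgrade: bsdutils:armel (2.25.2-5, 2.25.2-6), "
    "systemd:armel (215-12, 215-14), udev:armel (215-12, 215-14)"
)
upgrades = parse_upgrade_line(sample)
for name, (arch, old, new) in sorted(upgrades.items()):
    print(f"{name} [{arch}]: {old} -> {new}")
```

Running the same function over the history.log of each machine and comparing
the resulting dictionaries would show whether the three upgrades really were
identical.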

On one QNAP TS-419P+ turbo, the upgrade went fine, including the reboot I
did because a flash-kernel trigger had run.

On a second, identical QNAP TS-419P+ turbo with only a slightly different
set of packages, the upgrade itself went fine, but the machine did not come
back up after a "shutdown -r now". More details below.
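To pin down what "slightly different set of packages" means between two
machines, the installed package lists can be diffed. The sketch below assumes
the default output of "dpkg-query -W" (package name, a tab, then the version)
saved from each machine; the helper names and the sample data are invented
for illustration and are not taken from these systems.

```python
def parse_dpkg_list(text):
    """Parse `dpkg-query -W` output (name<TAB>version per line) into a dict."""
    packages = {}
    for line in text.splitlines():
        if "\t" in line:
            name, version = line.split("\t", 1)
            packages[name] = version.strip()
    return packages


def diff_packages(a, b):
    """Return (only_in_a, only_in_b, version_mismatches) for two package dicts."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = {
        name: (a[name], b[name])
        for name in sorted(set(a) & set(b))
        if a[name] != b[name]
    }
    return only_a, only_b, differing


# Made-up sample lists standing in for the output from two machines.
machine_a = parse_dpkg_list("udev\t215-14\nunrar\t5.2.7-0.1\nperl\t5.20.2-3\n")
machine_b = parse_dpkg_list("udev\t215-14\nperl\t5.20.2-2\ndovecot-core\t2.2.13\n")
only_a, only_b, differing = diff_packages(machine_a, machine_b)
print("only on A:", only_a)
print("only on B:", only_b)
print("different versions:", differing)
```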

On a third machine, a QNAP TS-219P II Turbo, about 10 seconds after the
upgrade completed, the kernel started spewing kernel oopses and commands
were segfaulting left and right. I had to pull the power physically, but
fortunately the machine seems stable after booting. More details below.



OK, so, for the one that didn't boot up: The LCD display showed "SYSTEM
BOOTING >>>" and then went blank, which is normal. The one LED was flashing
red. But other than having power, the machine never appeared on the network.
I even let it run for 3+ hours in case an fsck was running, but there
wasn't any hard disk activity. I pulled the power and connected one of the
four SATA drives via a SATA-USB adapter to a different computer. I could
see that the md raid1 and raid5 partition slices were marked clean, as were
the ext4 filesystems. But there were no entries in /var/log/* since the
fatal shutdown.

In the end I managed to enter recovery by building a wheezy installer TFTP
image, combining old mtdblock backups (phew!) with the wheezy installer
kernel+initrd, and serving it via an ad hoc dnsmasq DHCP+TFTP setup on a
nearby MacBook. After entering the installer, letting it load the
mdcfg parts etc., and dropping into a shell, everything looked fine. I
manually mounted the root filesystem and bind-mounted /dev inside the target
chroot, as well as the proc and sys filesystems. I couldn't figure out how
to run update-initramfs to regenerate the initrd (it apparently really
doesn't like to run inside a chroot from the installer emergency shell), but
I ran flash-kernel, which re-flashed the existing kernel+initrd from /boot.
A reboot later and the system came up as if nothing had happened.
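For reference, the manual chroot steps described above amount to roughly the
following sequence. This is a sketch only: the root device /dev/md0 and the
mount point /mnt/target are placeholders (the real names are not given here),
and the installer shell itself has no Python, so the code is mainly a way of
writing the commands down in order. It prints them by default and only runs
them when dry_run is False.

```python
import subprocess


def chroot_reflash_commands(root_dev, target="/mnt/target"):
    """Build the mount, bind-mount and chroot commands as argument lists."""
    return [
        ["mkdir", "-p", target],
        ["mount", root_dev, target],
        ["mount", "-o", "bind", "/dev", f"{target}/dev"],
        ["mount", "-t", "proc", "proc", f"{target}/proc"],
        ["mount", "-t", "sysfs", "sysfs", f"{target}/sys"],
        # Re-flash the kernel+initrd that already exist in the target's /boot.
        ["chroot", target, "flash-kernel"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# /dev/md0 is a hypothetical root device, used here for illustration only.
commands = chroot_reflash_commands("/dev/md0")
run(commands, dry_run=True)
```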

Any ideas what that could have been? Unfortunately I don't have the setup
for a serial console. Could it have been a bad flash on the previous
flash-kernel run during the update?
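One way to test the bad-flash theory, had the flash contents been dumped
before re-flashing, would be to compare the start of the dump with the image
that was supposed to be written. The sketch below makes strong assumptions:
that a dump of the relevant flash partition exists as a file, and that the
image file is byte-for-byte what was written (flash-kernel may wrap or pad
what it writes, in which case a mismatch proves nothing). The function names
are invented, and the demo uses throwaway files rather than real flash data.

```python
import hashlib
import os
import tempfile


def sha256_prefix(path, length, chunk_size=1 << 20):
    """Hash the first `length` bytes of `path`; raise if the file is shorter."""
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as fh:
        while remaining > 0:
            chunk = fh.read(min(chunk_size, remaining))
            if not chunk:
                raise ValueError(f"{path} is shorter than {length} bytes")
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def flash_matches_image(flash_dump, image):
    """True if the dump starts with exactly the bytes of the image file."""
    size = os.path.getsize(image)
    return sha256_prefix(flash_dump, size) == sha256_prefix(image, size)


# Demo with temporary files standing in for a real image and flash dumps.
with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "image.bin")
    good_dump = os.path.join(tmp, "good.bin")
    bad_dump = os.path.join(tmp, "bad.bin")
    payload = bytes(range(256)) * 64
    with open(image, "wb") as fh:
        fh.write(payload)
    with open(good_dump, "wb") as fh:
        fh.write(payload + b"\xff" * 128)  # image followed by erased flash
    with open(bad_dump, "wb") as fh:
        fh.write(b"\x01" + payload[1:] + b"\xff" * 128)  # first byte corrupted
    good_result = flash_matches_image(good_dump, image)
    bad_result = flash_matches_image(bad_dump, image)

print(good_result, bad_result)
```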



And then for the other machine that started oopsing all over the place. This
was really worrying. I saw something in the systemd changelogs about
duplicate swap mounts; is it possible that the upgrade did something weird
with the active swap partition? The machine has a 2 GB swap on /dev/md1,
which is a 2-device raid1 array.
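To see what the kernel currently considers active swap, /proc/swaps can be
read and parsed. The following is a small sketch under the assumption of the
usual five-column layout (Filename, Type, Size, Used, Priority); the sample
text is made up to resemble a single swap partition on /dev/md1 and is not
output captured from this machine.

```python
def parse_swaps(text):
    """Parse /proc/swaps content into a list of dicts (sizes are in KiB)."""
    entries = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) >= 5:
            name, kind, size_kib, used_kib, priority = fields[:5]
            entries.append({
                "name": name,
                "type": kind,
                "size_kib": int(size_kib),
                "used_kib": int(used_kib),
                "priority": int(priority),
            })
    return entries


# Made-up sample; on a live system use open("/proc/swaps").read() instead.
sample_swaps = (
    "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"
    "/dev/md1                                partition\t2097144\t1024\t-1\n"
)
entries = parse_swaps(sample_swaps)
for entry in entries:
    print(entry["name"], entry["type"], entry["size_kib"], entry["used_kib"])
```

Comparing this list before and after a systemd upgrade would at least show
whether the set of active swap areas changed.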

Some choice log entries for this case:

Apr 05 12:56:38 hostname systemd[1]: Reexecuting. 
Apr 05 12:56:38 hostname systemd[1]: systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ -SECCOMP -APPARMOR) 
Apr 05 12:56:38 hostname systemd[1]: Detected architecture 'arm'. 
Apr 05 12:56:39 hostname systemd[1]: Reloading. 
Apr 05 12:56:39 hostname systemd[1]: Reloading. 
Apr 05 12:56:40 hostname systemd[1]: Reloading. 
Apr 05 12:57:19 hostname dovecot[480]: imap(xxx): Disconnected: Logged out in=351 out=1481 
Apr 05 12:57:20 hostname kernel: Unable to handle kernel paging request at virtual address 05e93644 
Apr 05 12:57:20 hostname kernel: pgd = de740000 
Apr 05 12:57:20 hostname kernel: [05e93644] *pgd=00000000 
Apr 05 12:57:20 hostname kernel: Internal error: Oops: 5 [#1] ARM 
Apr 05 12:57:20 hostname kernel: Modules linked in: hmac sha1_generic sha1_arm ehci_orion ehci_hcd marvell usbcore orion_wdt usb_common mv_cesa ahci libahci sg mv643xx_eth mvmdio of_mdio libphy evdev loop gpio_keys fuse ipv6 autofs4 ext4 mbcache jbd2 raid1 md_mod sd_mod crc_t10dif crct10dif_generic crct10dif_common sata_mv libata scsi_mod 
Apr 05 12:57:20 hostname kernel: CPU: 0 PID: 1498 Comm: mandb Not tainted 3.16.0-4-kirkwood #1 Debian 3.16.7-ckt7-1 
Apr 05 12:57:20 hostname kernel: task: c098fa80 ti: de7c2000 task.ti: de7c2000 
Apr 05 12:57:20 hostname kernel: PC is at get_vmalloc_info+0x64/0xf4 
Apr 05 12:57:20 hostname kernel: LR is at meminfo_proc_show+0x5c/0x3e4 
Apr 05 12:57:20 hostname kernel: pc : [<c00e0dcc>]    lr : [<c0146b94>]    psr: a0000013 
                                         sp : de7c3d68  ip : e0000000  fp : 00013445 
Apr 05 12:57:20 hostname kernel: r10: de7c3f80  r9 : 00000400  r8 : b6b9f000 
Apr 05 12:57:20 hostname kernel: r7 : 00000001  r6 : e0efe000  r5 : c061a280  r4 : c05a6e40 
Apr 05 12:57:20 hostname kernel: r3 : 05e93644  r2 : e0efe000  r1 : 05e9365c  r0 : de7c3e78 
Apr 05 12:57:20 hostname kernel: Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user 
Apr 05 12:57:20 hostname kernel: Control: 0005397f  Table: 1e740000 DAC: 00000015 
Apr 05 12:57:20 hostname kernel: Process mandb (pid: 1498, stack limit = 0xde7c21c0) 

(... lots of hex bytes showing a stack dump ...)

Apr 05 12:57:20 hostname kernel: [<c00e0dcc>] (get_vmalloc_info) from [<c0146b94>] (meminfo_proc_show+0x5c/0x3e4) 
Apr 05 12:57:20 hostname kernel: [<c0146b94>] (meminfo_proc_show) from [<c0110b9c>] (seq_read+0x1ac/0x3f8) 
Apr 05 12:57:20 hostname kernel: [<c0110b9c>] (seq_read) from [<c013f554>] (proc_reg_read+0x78/0x8c) 
Apr 05 12:57:20 hostname kernel: [<c013f554>] (proc_reg_read) from [<c00f45bc>] (vfs_read+0x90/0x174) 
Apr 05 12:57:20 hostname kernel: [<c00f45bc>] (vfs_read) from [<c00f4d4c>] (SyS_read+0x44/0x84) 
Apr 05 12:57:20 hostname kernel: [<c00f4d4c>] (SyS_read) from [<c0009400>] (ret_fast_syscall+0x0/0x2c) 
Apr 05 12:57:20 hostname kernel: Code: e26334ff e5803004 ea000021 e595c000 (e5931000)  
Apr 05 12:57:20 hostname kernel: ---[ end trace 921667d30991e9a7 ]---

Apr 05 12:57:26 hostname kernel: Unable to handle kernel paging request at virtual address b0e7f26c 
Apr 05 12:57:26 hostname kernel: pgd = c0958000 
Apr 05 12:57:26 hostname kernel: [b0e7f26c] *pgd=00000000 
Apr 05 12:57:26 hostname kernel: Internal error: Oops: 5 [#2] ARM 
Apr 05 12:57:26 hostname kernel: Modules linked in: hmac sha1_generic sha1_arm ehci_orion ehci_hcd marvell usbcore orion_wdt usb_common mv_cesa ahci libahci sg mv643xx_eth mvmdio of_mdio libphy evdev loop gpio_keys fuse ipv6 autofs4 ext4 mbcache jbd2 raid1 md_mod sd_mod crc_t10dif crct10dif_generic crct10dif_common sata_mv libata scsi_mod 
Apr 05 12:57:26 hostname kernel: CPU: 0 PID: 26687 Comm: apt-get Tainted: G      D       3.16.0-4-kirkwood #1 Debian 3.16.7-ckt7-1 
Apr 05 12:57:26 hostname kernel: task: de683620 ti: c0856000 task.ti: c0856000 
Apr 05 12:57:26 hostname kernel: PC is at __find_vmap_area+0x18/0x50 
Apr 05 12:57:26 hostname kernel: LR is at remove_vm_area+0x10/0x5c 
Apr 05 12:57:26 hostname kernel: pc : [<c00dec10>]    lr : [<c00e0288>]    psr: a0000013 
                                         sp : c0857e48  ip : dbe08944  fp : 00000000 
Apr 05 12:57:26 hostname kernel: r10: 00000000  r9 : 00000000  r8 : 00000000 
Apr 05 12:57:26 hostname kernel: r7 : 00000001  r6 : e1152000  r5 : 00000000  r4 : dbe08800 
Apr 05 12:57:26 hostname kernel: r3 : b0e7f278  r2 : 69e2d336  r1 : 00000001  r0 : e1152000 
Apr 05 12:57:26 hostname kernel: Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user 
Apr 05 12:57:26 hostname kernel: Control: 0005397f  Table: 00958000 DAC: 00000015 
Apr 05 12:57:26 hostname kernel: Process apt-get (pid: 26687, stack limit = 0xc08561c0) 
Apr 05 12:57:26 hostname kernel: Stack: (0xc0857e48 to 0xc0858000) 

(... lots of hex bytes showing a stack dump ...)

Apr 05 12:57:26 hostname kernel: [<c00dec10>] (__find_vmap_area) from [<c00e0288>] (remove_vm_area+0x10/0x5c) 
Apr 05 12:57:26 hostname kernel: [<c00e0288>] (remove_vm_area) from [<c00e0308>] (__vunmap+0x34/0xcc) 
Apr 05 12:57:26 hostname kernel: [<c00e0308>] (__vunmap) from [<c025c178>] (n_tty_close+0x2c/0x38) 
Apr 05 12:57:26 hostname kernel: [<c025c178>] (n_tty_close) from [<c02605dc>] (tty_ldisc_close.isra.0+0x60/0x6c) 
Apr 05 12:57:26 hostname kernel: [<c02605dc>] (tty_ldisc_close.isra.0) from [<c02607b4>] (tty_ldisc_reinit+0x38/0xa4) 
Apr 05 12:57:26 hostname kernel: [<c02607b4>] (tty_ldisc_reinit) from [<c0260d24>] (tty_ldisc_hangup+0x124/0x1c8) 
Apr 05 12:57:26 hostname kernel: [<c0260d24>] (tty_ldisc_hangup) from [<c0259158>] (__tty_hangup+0x25c/0x38c) 
Apr 05 12:57:26 hostname kernel: [<c0259158>] (__tty_hangup) from [<c0262b78>] (pty_close+0x178/0x194) 
Apr 05 12:57:26 hostname kernel: [<c0262b78>] (pty_close) from [<c025a1e0>] (tty_release+0x140/0x4b8) 
Apr 05 12:57:26 hostname kernel: [<c025a1e0>] (tty_release) from [<c00f5954>] (__fput+0xe4/0x1b4) 
Apr 05 12:57:26 hostname kernel: [<c00f5954>] (__fput) from [<c0032274>] (task_work_run+0x90/0xac) 
Apr 05 12:57:26 hostname kernel: [<c0032274>] (task_work_run) from [<c000bdf4>] (do_work_pending+0xd4/0xf4) 
Apr 05 12:57:26 hostname kernel: [<c000bdf4>] (do_work_pending) from [<c000943c>] (work_pending+0xc/0x20) 
Apr 05 12:57:26 hostname kernel: Code: e59f303c e5933000 e3530000 0a00000a (e513200c)  
Apr 05 12:57:26 hostname kernel: ---[ end trace 921667d30991e9a8 ]---

(... etc etc several more processes crashing like this and then ...)

Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c0 
Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c1 
Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c2 
Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c3 
Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c4 
Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c5 
Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c6 
Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c7 
Apr 05 12:57:55 hostname kernel: swap_dup: Bad swap file entry 364791c1 
Apr 05 12:57:57 hostname kernel: Unable to handle kernel paging request at virtual address 05e93644 
Apr 05 12:57:57 hostname kernel: pgd = de740000 
Apr 05 12:57:57 hostname kernel: [05e93644] *pgd=00000000 
Apr 05 12:57:57 hostname kernel: Internal error: Oops: 5 [#4] ARM 

(... more log lines cut ...)

Apr 05 12:57:58 hostname kernel: /build/linux-gXNuoJ/linux-3.16.7-ckt7/mm/pgtable-generic.c:33: bad pmd 84d05d82. 
Apr 05 12:57:58 hostname kernel: /build/linux-gXNuoJ/linux-3.16.7-ckt7/mm/pgtable-generic.c:33: bad pmd 3eebde47. 
Apr 05 12:58:00 hostname kernel: swap_free: Bad swap file entry 20c3e99f 
Apr 05 12:58:10 hostname kernel: BUG: Bad page map in process cron pte:c3e99f82 pmd:1e53f831 
Apr 05 12:58:10 hostname kernel: addr:b6ac2000 vm_flags:00000075 anon_vma:  (null) mapping:df24939c index:0 
Apr 05 12:58:10 hostname kernel: vma->vm_ops->fault: filemap_fault+0x0/0x410 
Apr 05 12:58:10 hostname kernel: vma->vm_file->f_op->mmap: ext4_file_mmap+0x0/0x54 [ext4] 
Apr 05 12:58:10 hostname kernel: CPU: 0 PID: 416 Comm: cron Tainted: G      D       3.16.0-4-kirkwood #1 Debian 3.16.7-ckt7-1 
Apr 05 12:58:10 hostname kernel: [<c001009c>] (unwind_backtrace) from [<c000c440>] (show_stack+0x18/0x1c) 
Apr 05 12:58:10 hostname kernel: [<c000c440>] (show_stack) from [<c00d35b0>] (print_bad_pte+0x168/0x19c) 
Apr 05 12:58:10 hostname kernel: [<c00d35b0>] (print_bad_pte) from [<c00d47c0>] (unmap_single_vma+0x4e8/0x600) 
Apr 05 12:58:10 hostname kernel: [<c00d47c0>] (unmap_single_vma) from [<c00d57fc>] (unmap_vmas+0x4c/0x5c) 
Apr 05 12:58:10 hostname kernel: [<c00d57fc>] (unmap_vmas) from [<c00da3c4>] (exit_mmap+0xdc/0x214) 
Apr 05 12:58:10 hostname kernel: [<c00da3c4>] (exit_mmap) from [<c0017a60>] (mmput+0x50/0xdc) 
Apr 05 12:58:10 hostname kernel: [<c0017a60>] (mmput) from [<c001bac0>] (do_exit+0x328/0x884) 
Apr 05 12:58:10 hostname kernel: [<c001bac0>] (do_exit) from [<c000c6fc>] (die+0x2b8/0x394) 
Apr 05 12:58:10 hostname kernel: [<c000c6fc>] (die) from [<c03a9970>] (__do_kernel_fault.part.11+0x5c/0x7c) 
Apr 05 12:58:10 hostname kernel: [<c03a9970>] (__do_kernel_fault.part.11) from [<c0012bdc>] (do_page_fault+0x300/0x360) 
Apr 05 12:58:10 hostname kernel: [<c0012bdc>] (do_page_fault) from [<c00083a0>] (do_DataAbort+0x3c/0xa0) 
Apr 05 12:58:10 hostname kernel: [<c00083a0>] (do_DataAbort) from [<c000ced8>] (__dabt_svc+0x38/0x60) 
Apr 05 12:58:10 hostname kernel: Exception stack(0xde5dfdd8 to 0xde5dfe20) 
Apr 05 12:58:10 hostname kernel: fdc0: c0c56320 00000012 
Apr 05 12:58:10 hostname kernel: fde0: 00000012 44d05000 dea544b0 00012000 de5de000 000000a8 c0c56320 c0c56320 
Apr 05 12:58:10 hostname kernel: fe00: de540000 c0c56354 de5de000 de5dfe20 c0012a00 c00d6490 a0000013 ffffffff 
Apr 05 12:58:10 hostname kernel: [<c000ced8>] (__dabt_svc) from [<c00d6490>] (handle_mm_fault+0xf4/0x914) 
Apr 05 12:58:10 hostname kernel: [<c00d6490>] (handle_mm_fault) from [<c0012a00>] (do_page_fault+0x124/0x360) 
Apr 05 12:58:10 hostname kernel: [<c0012a00>] (do_page_fault) from [<c0008440>] (do_PrefetchAbort+0x3c/0xa0) 
Apr 05 12:58:10 hostname kernel: [<c0008440>] (do_PrefetchAbort) from [<c000d214>] (ret_from_exception+0x0/0x10) 

(... several more of the do_DataAbort ... unwind_backtrace lines cut ....)

Apr 05 12:58:11 hostname kernel: Fixing recursive fault but reboot is needed! 

And then I pretty much pulled the power.
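With this many oopses in the journal, a short summary of which process hit
which function can be easier to read than the raw log. The sketch below is a
rough heuristic, not a real oops parser: it assumes the line formats seen in
the excerpts above and attaches the first "Comm:" and "PC is at" lines that
follow each "Internal error: Oops" line. The function name is made up.

```python
import re

OOPS_RE = re.compile(r"Internal error: Oops: \w+ \[#(\d+)\]")
COMM_RE = re.compile(r"PID: (\d+) Comm: (\S+)")
PC_RE = re.compile(r"PC is at (\S+)")


def summarise_oopses(lines):
    """Return one dict per oops with its number, pid, command and PC symbol."""
    oopses = []
    current = None
    for line in lines:
        match = OOPS_RE.search(line)
        if match:
            current = {"number": int(match.group(1)),
                       "pid": None, "comm": None, "pc": None}
            oopses.append(current)
            continue
        if current is None:
            continue
        match = COMM_RE.search(line)
        if match and current["comm"] is None:
            current["pid"], current["comm"] = int(match.group(1)), match.group(2)
        match = PC_RE.search(line)
        if match and current["pc"] is None:
            current["pc"] = match.group(1)
    return oopses


# A few lines copied from the first two oopses in the log above.
log_lines = [
    "Apr 05 12:57:20 hostname kernel: Internal error: Oops: 5 [#1] ARM",
    "Apr 05 12:57:20 hostname kernel: CPU: 0 PID: 1498 Comm: mandb Not tainted 3.16.0-4-kirkwood #1 Debian 3.16.7-ckt7-1",
    "Apr 05 12:57:20 hostname kernel: PC is at get_vmalloc_info+0x64/0xf4",
    "Apr 05 12:57:26 hostname kernel: Internal error: Oops: 5 [#2] ARM",
    "Apr 05 12:57:26 hostname kernel: CPU: 0 PID: 26687 Comm: apt-get Tainted: G      D       3.16.0-4-kirkwood #1 Debian 3.16.7-ckt7-1",
    "Apr 05 12:57:26 hostname kernel: PC is at __find_vmap_area+0x18/0x50",
]
summary = summarise_oopses(log_lines)
for oops in summary:
    print(f"#{oops['number']}: {oops['comm']} (pid {oops['pid']}) at {oops['pc']}")
```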

After the system came back up, I ran a few test PHP scripts that allocate a
lot of memory and saw the system start eating up swap, but that didn't
provoke any crashes.
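A comparable memory-pressure test can be sketched in Python. This is not the
PHP script mentioned above, just an illustration under the assumption that
/proc/meminfo has the usual SwapTotal and SwapFree lines; the function names
and the sample numbers are invented.

```python
def swap_used_kib(meminfo_text):
    """Return SwapTotal - SwapFree (in KiB) from /proc/meminfo content."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values["SwapTotal"] - values["SwapFree"]


def allocate_mib(count):
    """Allocate `count` MiB and touch every 4 KiB page so it is really backed."""
    blocks = []
    for _ in range(count):
        block = bytearray(1024 * 1024)
        for offset in range(0, len(block), 4096):
            block[offset] = 1
        blocks.append(block)
    return blocks


# Made-up sample values; on a live system read /proc/meminfo instead.
sample_meminfo = (
    "MemTotal:         514000 kB\n"
    "SwapTotal:       2097144 kB\n"
    "SwapFree:        2000000 kB\n"
)
used = swap_used_kib(sample_meminfo)
print("swap in use (KiB):", used)

# Small allocation for demonstration; a real test would exceed physical RAM.
held = allocate_mib(4)
print("holding", len(held), "MiB")
```

On a real machine one would call swap_used_kib(open("/proc/meminfo").read())
before and after a much larger allocate_mib() call and watch the journal for
new oopses.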

The machines have been running wheezy and squeeze for years on end
with no problems at all. 

Also, probably a freak incident, but an APC UPS that is powering 2 of these
devices (not the one with the kernel oops) no longer registers over its
USB cable, neither on any of these machines nor on a MacBook, just
giving an "unable to enumerate device" error. It worked yesterday (even
though the "apcaccess" CLI binary is broken on armel jessie, the web
interface still worked). I guess when it rains, it pours :-/

Would be grateful for any input, and hoping this can be of help to the
people preparing the jessie release as well. And sorry for the wall of text. 
Thanks.


