[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Re: Bug#992855: upgrade-reports: buster to bullseye, nvme logs make ttys messy.



Control: reassign -1 kernel

Hi,

Although as you said, it may be something else, as it's the kernel
logging this, I reassign to the kernel package, in the hope they are
more able to pinpoint where the problem lies.

Paul

On 24-08-2021 10:53, hoxp18@noramail.jp wrote:
> Package: upgrade-reports
> Severity: normal
> 
> Dear maintainers,
> 
> I've upgraded one of my buster 10.10 into bullseye 11.
> 
> * Upgrading was mostly successfull.
> 
> * As "buster" GNOME desktop it had almost no problem,
>   and upgraded "bullseye" GNOME session seems fine for now.
> 
> However, after rebooting the issue below stared on ttys.
> 
>   The machine has one NVMe and one SATA SSD (separated /home).
>   The Entire system installed into LUKS2 encrypted LVM volumes.
>   Mostly "main" packages except Intel microcode ("non-free").
> 
> NVMe continual syslog messsage on terminal
> ==========================================
> 
> 1. kernel reports nvme RxErr related messages.
> 2. on any tty (even in rescue.target).
> 3. almost always (I/O access on NVMe seems to be a trigger).
> 
> AFAIK, the reported device (WD BLACK NVMe) works fine
> and "nvme smartlog" have not reported any error in years.
> 
> After bullseye upgrading kernel started this reporting on ttys.
> 
> At the same time, "smartd" seems to have a minor systemd unit problem,
> and it also seems to start tracking NVMe just like below syslog sample
> (at the smartd starting log section).
> "smartd" in buster only monitored SATA SSD.
> 
> I can not tell what actual package and/or whether that device caused this,
> posting this bug report to "upgrade-reports".
> 
> Actual syslog (partially modified; DATETIME/HOSTNAME/SERIAL_NUMBER)
> ---------------------------------------------------------------------
> 
> 1. Cold booted.
> 2. systemctl shows no error.
> 3. target was shifted during this log sampling (graphical, rescue, multiuser, graphical).
> 
> syslog and my comments follow:
> 
> Aug 23 19:50:27 hostname kernel: [    0.289634] pci 0000:01:00.0: [15b7:5002] type 00 class 0x010802
> Aug 23 19:50:27 hostname kernel: [    0.289657] pci 0000:01:00.0: reg 0x10: [mem 0xdf000000-0xdf003fff 64bit]
> Aug 23 19:50:27 hostname kernel: [    0.289691] pci 0000:01:00.0: reg 0x20: [mem 0xdf004000-0xdf0040ff 64bit]
> Aug 23 19:50:27 hostname kernel: [    0.718254] pci 0000:01:00.0: Adding to iommu group 11
> Aug 23 19:50:27 hostname kernel: [    1.020213] nvme nvme0: pci function 0000:01:00.0
> Aug 23 19:50:27 hostname kernel: [    1.029047] nvme nvme0: 4/0/0 default/read/poll queues
> Aug 23 19:50:27 hostname kernel: [    1.031074]  nvme0n1: p1 p2 p3
> 
> This NVMe contains EFI, /boot, and LUKS root filesystem except /home (in SATA SSD).
> 
> Aug 23 19:50:27 hostname kernel: [   15.626412] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input18
> Aug 23 19:50:27 hostname kernel: [   15.626483] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input19
> Aug 23 19:50:27 hostname kernel: [   15.626548] input: HDA Intel PCH HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input20
> 
> The first NVMe RxErr reporting starts after the kernel dmesg (after sound subsystem).
> 
> Aug 23 19:50:27 hostname kernel: [   17.365057] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
> Aug 23 19:50:27 hostname kernel: [   17.365089] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Aug 23 19:50:27 hostname kernel: [   17.365117] nvme 0000:01:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
> Aug 23 19:50:27 hostname kernel: [   17.365142] nvme 0000:01:00.0:    [ 0] RxErr                 
> 
> Then, on any tty (regardless the login status),
> it keep showing that messages like this.
> DATETIME was modified but kernel timing is the real log.
> 
> Aug 23 19:51:16 hostname kernel: [   70.416916] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
> Aug 23 19:51:16 hostname kernel: [   70.417029] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Aug 23 19:51:16 hostname kernel: [   70.417145] nvme 0000:01:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
> Aug 23 19:51:16 hostname kernel: [   70.417268] nvme 0000:01:00.0:    [ 0] RxErr                 
> Aug 23 19:51:17 hostname kernel: [   71.693968] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
> Aug 23 19:51:17 hostname kernel: [   71.699967] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Aug 23 19:51:17 hostname kernel: [   71.701511] nvme 0000:01:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
> Aug 23 19:51:17 hostname kernel: [   71.703040] nvme 0000:01:00.0:    [ 0] RxErr                 
> Aug 23 19:51:19 hostname kernel: [   73.739331] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
> Aug 23 19:51:19 hostname kernel: [   73.745023] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Aug 23 19:51:19 hostname kernel: [   73.746547] nvme 0000:01:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
> Aug 23 19:51:19 hostname kernel: [   73.748053] nvme 0000:01:00.0:    [ 0] RxErr                 
> Aug 23 19:51:20 hostname kernel: [   74.506475] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
> Aug 23 19:51:20 hostname kernel: [   74.512092] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Aug 23 19:51:20 hostname kernel: [   74.513594] nvme 0000:01:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
> Aug 23 19:51:20 hostname kernel: [   74.515037] nvme 0000:01:00.0:    [ 0] RxErr                 
> Aug 23 19:51:22 hostname kernel: [   76.552810] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
> Aug 23 19:51:22 hostname kernel: [   76.558081] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Aug 23 19:51:22 hostname kernel: [   76.559548] nvme 0000:01:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
> Aug 23 19:51:22 hostname kernel: [   76.560989] nvme 0000:01:00.0:    [ 0] RxErr                 
> Aug 23 19:51:24 hostname kernel: [   78.599100] pcieport 0000:00:1b.0: AER: Corrected error received: 0000:01:00.0
> Aug 23 19:51:24 hostname kernel: [   78.604922] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Aug 23 19:51:24 hostname kernel: [   78.606389] nvme 0000:01:00.0:   device [15b7:5002] error status/mask=00000001/0000e000
> Aug 23 19:51:24 hostname kernel: [   78.607832] nvme 0000:01:00.0:    [ 0] RxErr                 
> 
> then smartd initialization follows,
> 
> Aug 24 11:34:33 hostname smartd[1026]: smartd 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-17-amd64] (local build)
> Aug 24 11:34:33 hostname smartd[1026]: Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
> Aug 24 11:34:33 hostname smartd[1026]: Opened configuration file /etc/smartd.conf
> Aug 24 11:34:33 hostname smartd[1026]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
> Aug 24 11:34:33 hostname smartd[1026]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
> Aug 24 11:34:33 hostname smartd[1026]: Device: /dev/sda, type changed from 'scsi' to 'sat'
> Aug 24 11:34:33 hostname smartd[1026]: Device: /dev/sda [SAT], opened
> 
> (minor issue) smartd.service (please see the log).
> 
> Aug 23 19:23:18 hostname systemd[1]: /lib/systemd/system/smartd.service:9: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
> Aug 23 19:23:18 hostname systemd[1]: Reloading.
> Aug 23 19:23:18 hostname systemd[1]: /lib/systemd/system/smartd.service:9: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.
> Aug 23 19:23:18 hostname systemd[1]: Starting Network Time Synchronization...
> Aug 23 19:23:18 hostname systemd[1]: Started Network Time Synchronization.
> Aug 23 19:23:18 hostname systemd[1]: Reached target System Time Set.
> 
> FYI: buster smartd did not track NVMe device, only SATA SSD.
>      bullseye smartd shows it monitor both.
>      This might be a potential cause.
>      (info: S/N: section is not the real one)
> 
> Note that smartd 7.2 shows 4.19 kernel version,
> but this session is booted with bullseye 5.10
> (4.19 image was purged, and "apt autoremove --purge" had done).
> 
> Aug 23 19:26:12 hostname smartd[28840]: smartd 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-17-amd64] (local build)
> Aug 23 19:26:12 hostname smartd[28840]: Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
> Aug 23 19:26:12 hostname smartd[28840]: Opened configuration file /etc/smartd.conf
> Aug 23 19:26:12 hostname smartd[28840]: Drive: DEVICESCAN, implied '-a' Directive on line 21 of file /etc/smartd.conf
> Aug 23 19:26:12 hostname smartd[28840]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/sda, type changed from 'scsi' to 'sat'
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/sda [SAT], opened
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/sda [SAT], CT500MX500SSD1, S/N:SDASN_HEX, WWN:5-00a075-1e1e6e854, FW:M3CR023, 500 GB
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/sda [SAT], found in smartd database: Crucial/Micron Client SSDs
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.CT500MX500SSD1-SDASN_HEX.ata.state
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/nvme0, opened
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/nvme0, WDS250G3X0C-00SJG0, S/N:NVME0SN_HEX, FW:102000WD, 250 GB
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/nvme0, is SMART capable. Adding to "monitor" list.
> Aug 23 19:26:12 hostname smartd[28840]: Monitoring 1 ATA/SATA, 0 SCSI/SAS and 1 NVMe devices
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.CT500MX500SSD1-SDASN_HEX.ata.state
> Aug 23 19:26:12 hostname smartd[28840]: Device: /dev/nvme0, state written to /var/lib/smartmontools/smartd.WDS250G3X0C_00SJG0-NVME0SN_HEX.nvme.state
> Aug 23 19:26:12 hostname systemd[1]: Started Self Monitoring and Reporting Technology (SMART) Daemon.
> Aug 23 19:26:15 hostname systemd[1]: Reloading.
> 
> "lspci" is here (as bullseye output).
> 
>   I'm not sure about 01:00.1 NVMe product name is correct.
>   Probably this one is WD Black NVMe 2nd gen. or 3rd gen.
> 
> 00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:591f] (rev 05)
> 00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 630 [8086:5912] (rev 04)
> 00:14.0 USB controller [0c03]: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller [8086:a2af]
> 00:16.0 Communication controller [0780]: Intel Corporation 200 Series PCH CSME HECI #1 [8086:a2ba]
> 00:17.0 SATA controller [0106]: Intel Corporation 200 Series PCH SATA controller [AHCI mode] [8086:a282]
> 00:1b.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #17 [8086:a2e7] (rev f0)
> 00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #1 [8086:a290] (rev f0)
> 00:1c.6 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #7 [8086:a296] (rev f0)
> 00:1d.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #9 [8086:a298] (rev f0)
> 00:1f.0 ISA bridge [0601]: Intel Corporation 200 Series PCH LPC Controller (Z270) [8086:a2c5]
> 00:1f.2 Memory controller [0580]: Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller [8086:a2a1]
> 00:1f.3 Audio device [0403]: Intel Corporation 200 Series PCH HD Audio [8086:a2f0]
> 00:1f.4 SMBus [0c05]: Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller [8086:a2a3]
> 00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (2) I219-V [8086:15b8]
> 01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black 2018/SN750 / PC SN720 NVMe SSD [15b7:5002]
> 
> Please tell me if there are something I can do for this.
> 
> Thanks for all maintainers.
> 
> Regards,
> 

Attachment: OpenPGP_signature
Description: OpenPGP digital signature


Reply to: