
Unclear error: ZFS/SMART/SATA 6.0 Gbps/UDMA/133



Hello,

I am at a loss as to what is wrong with my Proxmox server (Debian
bullseye), which has been running reliably for quite some time, and how
I can fix it.

Summary: ZFS and SMART report errors, and all hard disks are now only
attached as UDMA/133.

This morning at 5:58: ZFS device fault for pool
Code:

The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 154391
  class: statechange
  state: FAULTED
   host: meinrechnername
   time: 2021-11-14 05:58:11+0100
  vpath: /dev/sdc2
  vguid: 0xC338A55969D3184F
   pool: 0x174BC0321B6273A6


Code:

:~# zpool status -x
  pool: rpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Sun Nov 14 00:24:02 2021
        9.88T scanned at 360M/s, 9.00T issued at 328M/s, 15.6T total
        912K repaired, 57.77% done, 05:50:27 to go
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
            sdc2    FAULTED     17     0     0  too many errors (repairing)

errors: No known data errors
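
To pull just the affected devices out of output like the above, a minimal sketch follows. The function name zpool_error_devices is made up for illustration, and the filter assumes the NAME/STATE/READ/WRITE/CKSUM column layout shown here.

```shell
# Hypothetical helper: read `zpool status` output on stdin and print every
# row of the config table whose READ, WRITE or CKSUM counter is non-zero.
zpool_error_devices() {
    awk '$1 == "NAME" { in_cfg = 1; next }   # header row starts the table
         /^errors:/   { in_cfg = 0 }         # "errors:" line ends it
         in_cfg && NF >= 5 && ($3 + $4 + $5) > 0 { print $1, $2, $3, $4, $5 }'
}

# Usage:
#   zpool status | zpool_error_devices
```

For the status shown above this would list only the sdc2 row.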


Then at 6:18: SMART error (ErrorCount) detected on host:
Code:

This message was generated by the smartd daemon running on:

   host name:  meinrechnername
   DNS domain: localdomain

The following warning/error was logged by the smartd daemon:

Device: /dev/sdc [SAT], ATA error count increased from 0 to 1

Device info:
ST8000NM0055-1RM112, S/N:ZA19V8QR, WWN:5-000c50-0af629d42, FW:SN05, 8.00 TB


/var/log/syslog contains the following about this:
Code:

Nov 14 06:18:37 meinrechnername smartd[2680]: Device: /dev/sdc [SAT],
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 69
Nov 14 06:18:37 meinrechnername smartd[2680]: Device: /dev/sdc [SAT],
SMART Usage Attribute: 187 Reported_Uncorrect changed from 100 to 99
Nov 14 06:18:37 meinrechnername smartd[2680]: Device: /dev/sdc [SAT],
SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 60 to 62
Nov 14 06:18:37 meinrechnername smartd[2680]: Device: /dev/sdc [SAT],
SMART Usage Attribute: 194 Temperature_Celsius changed from 40 to 38
Nov 14 06:18:37 meinrechnername smartd[2680]: Device: /dev/sdc [SAT],
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 78
Nov 14 06:18:37 meinrechnername smartd[2680]: Device: /dev/sdc [SAT],
ATA error count increased from 0 to 1
Nov 14 06:18:37 meinrechnername smartd[2680]: Sending warning via
/usr/share/smartmontools/smartd-runner to root ...


I then started a new SMART short test: smartctl -t short /dev/sdc

Code:

smartctl -l selftest /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.22-5-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     12392         -
# 2  Extended offline    Completed without error       00%        40         -
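
A small sketch for showing only the self-test entries that did not finish cleanly. The function name selftest_not_clean is an assumption for illustration, and it relies on the log entries starting with "#" as in the output above.

```shell
# Hypothetical helper: read `smartctl -l selftest` output on stdin and keep
# only the log entries that are not marked "Completed without error".
selftest_not_clean() {
    grep '^#' | grep -v 'Completed without error'
}

# Usage:
#   smartctl -l selftest /dev/sdc | selftest_not_clean
```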


My assumption was that the disk /dev/sdc is defective and needs to be
replaced. I therefore plugged a replacement disk into the running system
and intended to start the resilvering. When plugging it in, I noticed
that the disk only runs in UDMA/133.

Code:

[1860674.617985] ata4: link is slow to respond, please be patient (ready=0)
[1860678.486019] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[1860678.488885] ACPI BIOS Error (bug): Could not resolve symbol
[\_SB.PCI0.SAT0.PRT3._GTF.DSSP], AE_NOT_FOUND (20201113/psargs-330)

[1860678.488921] No Local Variables are initialized for Method [_GTF]

[1860678.488923] No Arguments are initialized for method [_GTF]

[1860678.488924] ACPI Error: Aborting method \_SB.PCI0.SAT0.PRT3._GTF
due to previous error (AE_NOT_FOUND) (20201113/psparse-529)
[1860678.489451] ata4.00: ATA-10: ST8000NM0055-1RM112, SN05, max UDMA/133
[1860678.489453] ata4.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth
32), AA
[1860678.492204] ACPI BIOS Error (bug): Could not resolve symbol
[\_SB.PCI0.SAT0.PRT3._GTF.DSSP], AE_NOT_FOUND (20201113/psargs-330)

[1860678.492240] No Local Variables are initialized for Method [_GTF]

[1860678.492241] No Arguments are initialized for method [_GTF]

[1860678.492243] ACPI Error: Aborting method \_SB.PCI0.SAT0.PRT3._GTF
due to previous error (AE_NOT_FOUND) (20201113/psparse-529)
[1860678.492639] ata4.00: configured for UDMA/133
[1860678.492696] scsi 3:0:0:0: Direct-Access     ATA ST8000NM0055-1RM
SN05 PQ: 0 ANSI: 5
[1860678.492887] sd 3:0:0:0: Attached scsi generic sg3 type 0
[1860678.492922] sd 3:0:0:0: [sdd] 15628053168 512-byte logical blocks:
(8.00 TB/7.28 TiB)
[1860678.492924] sd 3:0:0:0: [sdd] 4096-byte physical blocks
[1860678.492929] sd 3:0:0:0: [sdd] Write Protect is off
[1860678.492930] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[1860678.492939] sd 3:0:0:0: [sdd] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[1860678.570090] sd 3:0:0:0: [sdd] Attached SCSI disk


I therefore did not start the resilvering but rebooted the machine
instead. After that I noticed that all disks are running in UDMA/133
(although both the disks and the controller support SATA 6.0 Gbps).

Code:

dmesg | grep ata1
[    1.439448] ata1: SATA max UDMA/133 abar m2048@0xf7a4b000 port
0xf7a4b100 irq 130
[    1.757152] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.774653] ata1.00: ATA-10: ST8000NM0055-1RM112, SN04, max UDMA/133
[    1.774655] ata1.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth
32), AA
[    1.778225] ata1.00: configured for UDMA/133
root@meinrechnername:~# dmesg | grep ata2
[    1.439450] ata2: SATA max UDMA/133 abar m2048@0xf7a4b000 port
0xf7a4b180 irq 130
[    1.752945] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.756486] ata2.00: ATA-10: ST8000NM0055-1RM112, SN04, max UDMA/133
[    1.756489] ata2.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth
32), AA
[    1.759997] ata2.00: configured for UDMA/133
root@meinrechnername:~# dmesg | grep ata3
[    1.439452] ata3: SATA max UDMA/133 abar m2048@0xf7a4b000 port
0xf7a4b200 irq 130
[    1.752901] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.765947] ata3.00: ATA-10: ST8000NM0055-1RM112, SN05, max UDMA/133
[    1.765950] ata3.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth
32), AA
[    1.769530] ata3.00: configured for UDMA/133
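
The kernel log above reports the negotiated link speed ("SATA link up ...") and the transfer mode ("configured for UDMA/133") on separate lines. A minimal sketch for listing only the link speed per port follows. The function name sata_link_speeds is made up here, and the pattern assumes the message wording shown in the excerpts above.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<port> <speed>" for every "SATA link up" message.
sata_link_speeds() {
    # Matches e.g. "[    1.757152] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)"
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]* Gbps\).*/\1 \2/p'
}

# Usage (reading the kernel log may require root):
#   dmesg | sata_link_speeds
```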


I am at a loss. What could be the cause?

Many thanks


Tony


P.S. I have also already posted this question in the Proxmox forum
(https://forum.proxmox.com/threads/unklarer-fehler-zfs-smart-sata-6-0-gbps-udma-133.99679/#post-430560).
So far I have not received any answer that gets me further, which is why
I am asking for tips here.

