Occasional ATA errors with Intel SSD

To: debian-kernel@lists.debian.org
Subject: Occasional ATA errors with Intel SSD
From: Andy Smith <andy@strugglers.net>
Date: Sun, 26 Jul 2015 06:12:28 +0000
Message-id: <20150726061228.GW4243@bitfolk.com>

Hi,

I've installed jessie on a new server that has two Intel DC S3610
SSDs.

During installation and initial testing the IO subsystem was put
under quite heavy load (e.g. SSD hotswap testing resulted in two
complete MD resyncs) without incident.

However, in the last three days the same ATA error has happened twice:

Jul 23 17:14:41 soup kernel: [68044.504092] ata2.00: exception Emask 0x0 SAct 0x3000000 SErr 0x0 action 0x6 frozen
Jul 23 17:14:41 soup kernel: [68044.504215] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 23 17:14:41 soup kernel: [68044.504291] ata2.00: cmd 61/01:c0:00:a8:75/00:00:66:00:00/40 tag 24 ncq 512 out
Jul 23 17:14:41 soup kernel: [68044.504291]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 23 17:14:41 soup kernel: [68044.504357] ata2.00: status: { DRDY }
Jul 23 17:14:41 soup kernel: [68044.504376] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 23 17:14:41 soup kernel: [68044.504402] ata2.00: cmd 61/08:c8:d1:b1:b5/00:00:09:00:00/40 tag 25 ncq 4096 out
Jul 23 17:14:41 soup kernel: [68044.504402]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 23 17:14:41 soup kernel: [68044.504468] ata2.00: status: { DRDY }
Jul 23 17:14:41 soup kernel: [68044.504488] ata2: hard resetting link
Jul 23 17:14:42 soup kernel: [68044.824115] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 23 17:14:42 soup kernel: [68044.825069] ata2.00: configured for UDMA/133
Jul 23 17:14:42 soup kernel: [68044.825096] ata2.00: device reported invalid CHS sector 0
Jul 23 17:14:42 soup kernel: [68044.825123] ata2.00: device reported invalid CHS sector 0
Jul 23 17:14:42 soup kernel: [68044.825153] ata2: EH complete

Jul 25 11:19:47 soup kernel: [219549.271520] ata2.00: exception Emask 0x0 SAct 0x30000000 SErr 0x0 action 0x6 frozen
Jul 25 11:19:47 soup kernel: [219549.272650] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 11:19:47 soup kernel: [219549.273743] ata2.00: cmd 61/07:e0:f9:db:75/00:00:66:00:00/40 tag 28 ncq 3584 out
Jul 25 11:19:47 soup kernel: [219549.273743]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 25 11:19:47 soup kernel: [219549.275924] ata2.00: status: { DRDY }
Jul 25 11:19:47 soup kernel: [219549.276993] ata2.00: failed command: WRITE FPDMA QUEUED
Jul 25 11:19:47 soup kernel: [219549.278066] ata2.00: cmd 61/08:e8:09:dc:b5/00:00:09:00:00/40 tag 29 ncq 4096 out
Jul 25 11:19:47 soup kernel: [219549.278066]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 25 11:19:47 soup kernel: [219549.280196] ata2.00: status: { DRDY }
Jul 25 11:19:47 soup kernel: [219549.281236] ata2: hard resetting link
Jul 25 11:19:47 soup kernel: [219549.599404] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 25 11:19:47 soup kernel: [219549.600355] ata2.00: configured for UDMA/133
Jul 25 11:19:47 soup kernel: [219549.600359] ata2.00: device reported invalid CHS sector 0
Jul 25 11:19:47 soup kernel: [219549.600360] ata2.00: device reported invalid CHS sector 0
Jul 25 11:19:47 soup kernel: [219549.600366] ata2: EH complete

The server was sitting idle at each of those times, so this does not
appear to be IO load-related. An extended SMART self-test completed
without error.
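
For anyone reading the log excerpts above, here is a rough Python
sketch of how the SAct bitmask and the "cmd" taskfile line can be
decoded. It assumes the usual libata error-report layout
(command/feature:nsect:lbal:lbam:lbah/hob bytes/device), with the NCQ
tag stored in the upper five bits of nsect and the sector count in
the feature fields. The function names are my own and purely
illustrative.

```python
def sact_tags(sact):
    """Return the NCQ tag numbers whose bits are set in an SAct bitmask."""
    return [bit for bit in range(32) if (sact >> bit) & 1]


def decode_ncq_taskfile(cmd):
    """Decode a libata 'cmd' string for an NCQ read/write command.

    Assumed layout (not taken from the original mail):
    command/feature:nsect:lbal:lbam:lbah/
    hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    command, low, high, _device = cmd.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )

    # The 48-bit LBA is assembled from most to least significant byte.
    lba = 0
    for byte in (hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal):
        lba = (lba << 8) | byte

    return {
        "opcode": int(command, 16),                # 0x61 = WRITE FPDMA QUEUED
        "tag": nsect >> 3,                         # NCQ tag, upper 5 bits of nsect
        "sectors": (hob_feature << 8) | feature,   # count of 512-byte sectors
        "lba": lba,
    }


if __name__ == "__main__":
    print(sact_tags(0x3000000))    # first incident
    print(sact_tags(0x30000000))   # second incident
    print(decode_ncq_taskfile("61/01:c0:00:a8:75/00:00:66:00:00/40"))
```

Decoded this way, SAct 0x3000000 corresponds to tags 24 and 25 and
SAct 0x30000000 to tags 28 and 29, which matches the tags printed for
the failed commands, so in each incident the only two outstanding
NCQ commands were the two writes that timed out.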

Does anyone have any knowledge as to whether this is likely to be a
hardware fault or a kernel bug?

I suppose the fact that it has been /dev/sdb both times does suggest
that the device may be physically at fault, so I'll pursue that, but
as these are rather pricey items it'd be good to know if anyone has
experienced this as a result of a bug.

The host is a Xen dom0 so it's running:

xen-hypervisor-4.4-amd64         4.4.1-9+deb8u1
linux-image-3.16.0-4-amd64       3.16.7-ckt11-1

As far as I am aware it's not doing TRIM, as no mounted filesystem
has the discard option enabled and no one has run fstrim.
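
A minimal sketch of how the mount-option part of that check could be
repeated, assuming text in the format of /proc/mounts is passed in.
The helper name is hypothetical, and it only looks at mount options,
so it would not notice discards issued some other way (e.g. swap or
a manual fstrim).

```python
def mounts_with_discard(mounts_text):
    """Return (device, mountpoint) pairs mounted with a discard option.

    mounts_text is expected to look like the contents of /proc/mounts:
    device, mountpoint, fstype and a comma-separated option list per line.
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        for opt in options.split(","):
            if opt == "discard" or opt.startswith("discard="):
                hits.append((device, mountpoint))
                break
    return hits
```

Calling it with the contents of /proc/mounts should return an empty
list on a host where no filesystem is mounted with discard.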

Cheers,
Andy
