
Re: Buster System hangs, requires hard reboot



On 4/22/20 7:24 PM, Tom Dial wrote:
> 
> 
> On 4/20/20 19:44, Ralph Katz wrote:
>> Hi -- Please help me diagnose and fix this problem.
>>
>> My five month old Dell laptop with updated firmware and new up-to-date
>> Buster completely hangs and requires a hard reboot after 7-40 days
>> uptime.  While reading something onscreen or away from the laptop, the
>> system hangs completely: screen freezes, keyboard is unresponsive, lid
>> close fails to sleep, can't ssh in, pings fail.  Hard reboot is required.
>>
>> Actions taken:
>>
>> - re-installed Buster twice over several months
>> - ran fsck.ext4 -y /dev/sda2  when fsck failed on boot
>> - ran badblocks
>> - run smartmontools long test weekly; no errors reported in logs.
>>
>> Sometimes there are errors in syslog like this before a crash:
>>
>>> Apr  2 06:04:18 spike3 kernel: [539637.882916] EXT4-fs error (device sda2): ext4_lookup:1590: inode #55838123: comm updatedb.mlocat: iget: checksum invalid
>>
>> Today there were no syslog errors for weeks before the system hung.
>> After rebooting, errors like these appeared:
>>
>>> Apr 20 16:05:57 spike3 kernel: [  887.007328] EXT4-fs error (device sda2): ext4_lookup:1590: inode #55842004: comm GMPThread: iget: checksum invalid
>>> Apr 20 16:08:53 spike3 kernel: [ 1062.821504] EXT4-fs error (device sda2): ext4_lookup:1590: inode #55842002: comm DOM Worker: iget: checksum invalid
>>
>> Any ideas?
> 
> These errors seem to indicate that data in three inodes (and probably
> more) are invalid: they contain a checksum different from that
> calculated in function ext4_iget, at (or near) line 1590 in file
> inode.c. It looks like the device block(s) containing the inodes were
> read successfully, indicating they are intact and consistent. The data
> within them, however, are not. The three inodes are located fairly close
> together and may have been written to the block device by the same
> physical operation.
> 
> The two messages issued within a period of 3 minutes, and the hang
> without a logged message, suggest that the errors logged were symptoms
> rather than causes. However, an unlogged error of the same type (for
> instance, reading and then using bad data that has no built in
> checksum), seems plausible.
> 
> From the logged errors:
> 
> The checksum computed by the OS for the data read from the block device
> differs from the checksum computed for the data at the time it was sent
> to the block device.
> 
> No block read error is reported. If true, that implies that the data on
> the device is unchanged from what was written by the device firmware.
> 
> Which implies, in turn, that the inode data were incorrect when received
> by the block device or were corrupted on the block device before
> completion of the write operation.
> 
> The first indicates a bug in the ext4 file system. That seems a stretch
> in view of the maturity and widespread use of ext4 (including by me) on
> GNU/Linux systems. Still, a file system is an extremely complex and
> subtle piece of code, probably running on multi-CPU hardware that may
> present unique issues. It might be worth looking for ext4 bug reports
> that match this. If there are any (though I know of no reason to
> expect them), installing on a different file system could be a
> solution. In addition to ext[2,3,4] I have used jfs and xfs on systems
> for quite a few years, and found them stable and reliable. The last
> time I looked, they
> were available installer choices. For a laptop used portably, ZFS (from
> buster-backports) also is a reasonable candidate with built in
> encryption capability, although installation requires quite a bit more
> effort than the installer.
> 
> The second indicates a problem, probably in firmware, within the block
> device. I have seen such, and it could be worth looking into whether the
> device manufacturer has released firmware updates, and applying the
> latest if different from what now is present on the device. I lean
> toward that rather than a file system bug.
> 
> A five month old machine should be under warranty, although I do not
> know whether installing Linux would affect that. It would be worth
> looking into, and should offload to the vendor any firmware upgrade
> for, or replacement of, the block device.
> 
> 
> Regards,
> Tom Dial

Tom, thanks for your comprehensive review!  It is under warranty and
bringing it in is probably my next step.  This laptop supports Ubuntu
from the factory, so there is no concern with Linux.

The Dell website shows no firmware updates for my laptop service tag
other than BIOS, which I applied earlier.  Searching for the drive model
returns nothing at Dell or at support.toshiba.com, which responds:
> Your entry doesn’t appear to be valid. Please double-check that your product is from the US or Latin America and try again.

From smartctl:
Device Model:     TOSHIBA MQ04ABF100
Firmware Version: JU000D
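
For anyone wanting to check the observation that the three inodes sit
close together, here is a rough sketch in Python.  It pulls the inode
numbers out of the kernel messages quoted above and maps each one to a
slot in the inode table.  The 256-byte inode size and 4096-byte block
size are assumptions (common mke2fs defaults), not values read from
this disk; the real ones are shown by `dumpe2fs -h /dev/sda2` under
"Inode size" and "Block size".  The names LOG_LINES, inode_numbers and
table_slot are made up for this example.

```python
import re

# The three kernel messages from this thread, trimmed to the relevant part.
LOG_LINES = [
    "EXT4-fs error (device sda2): ext4_lookup:1590: inode #55838123: "
    "comm updatedb.mlocat: iget: checksum invalid",
    "EXT4-fs error (device sda2): ext4_lookup:1590: inode #55842004: "
    "comm GMPThread: iget: checksum invalid",
    "EXT4-fs error (device sda2): ext4_lookup:1590: inode #55842002: "
    "comm DOM Worker: iget: checksum invalid",
]

# ASSUMPTIONS, not taken from this disk: common mke2fs defaults.
INODE_SIZE = 256       # bytes per on-disk inode
BLOCK_SIZE = 4096      # bytes per file system block
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE


def inode_numbers(lines):
    """Return the sorted inode numbers found in EXT4-fs error lines."""
    found = []
    for line in lines:
        match = re.search(r"inode #(\d+)", line)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


def table_slot(ino):
    """Index of the inode-table block holding `ino` (inode numbers start at 1).

    Two inodes with the same slot share one block, provided the number of
    inodes per group is a multiple of INODES_PER_BLOCK.
    """
    return (ino - 1) // INODES_PER_BLOCK


inodes = inode_numbers(LOG_LINES)
for ino in inodes:
    print(f"inode {ino}: inode-table slot {table_slot(ino)}")
print("span between lowest and highest inode:", inodes[-1] - inodes[0])
```

Under those assumed sizes, the two inodes logged on Apr 20 land in the
same inode-table block, which would fit the idea that they were written
by one physical operation.  With different real sizes the grouping may
come out differently.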

Thanks again!
Ralph




