
Re: Buster System hangs, requires hard reboot




On 4/20/20 19:44, Ralph Katz wrote:
> Hi -- Please help me diagnose and fix this problem.
> 
> My five month old Dell laptop with updated firmware and new up-to-date
> Buster completely hangs and requires a hard reboot after 7-40 days
> uptime.  While reading something onscreen or away from the laptop, the
> system hangs completely: screen freezes, keyboard is unresponsive, lid
> close fails to sleep, can't ssh in, pings fail.  Hard reboot is required.
> 
> Actions taken:
> 
> - re-installed Buster twice over several months
> - ran fsck.ext4 -y /dev/sda2  when fsck failed on boot
> - ran bad blocks
> - run smartmontools long test weekly; no errors reported in logs.
> 
> Sometimes there are errors in syslog like this before a crash:
> 
>> Apr  2 06:04:18 spike3 kernel: [539637.882916] EXT4-fs error (device sda2): ext4_lookup:1590: inode #55838123: comm updatedb.mlocat: iget: checksum invalid
> 
> Today there were no syslog errors for weeks before the system hung.
> After rebooting, errors like these appeared:
> 
>> Apr 20 16:05:57 spike3 kernel: [  887.007328] EXT4-fs error (device sda2): ext4_lookup:1590: inode #55842004: comm GMPThread: iget: checksum invalid
>> Apr 20 16:08:53 spike3 kernel: [ 1062.821504] EXT4-fs error (device sda2): ext4_lookup:1590: inode #55842002: comm DOM Worker: iget: checksum invalid
> 
> Any ideas?

These errors seem to indicate that the data in three inodes (and probably
more) are invalid: the checksum stored in each differs from the one
calculated when the inode was read. The "iget" prefix suggests the check
made in ext4_iget, called from ext4_lookup at (or near) line 1590 of its
source file. It looks like the device block(s) containing the inodes were
read successfully, indicating they are intact and consistent as far as
the device is concerned. The data within them, however, are not. The
three inodes are numbered fairly close together and may have been written
to the block device by the same physical operation.
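
To make the "close together" observation concrete, here is a rough
sketch that pulls the fields out of the three quoted kernel messages and
compares the inode numbers. The regular expression is my own guess at
the message layout based only on these three lines, and the
inodes-per-block figure assumes default mkfs.ext4 geometry (256-byte
inodes, 4096-byte blocks, inodes per group a multiple of 16), which I
have not verified for this file system.

```python
import re

# The three kernel messages quoted in this thread.
LOG_LINES = [
    "Apr  2 06:04:18 spike3 kernel: [539637.882916] EXT4-fs error (device sda2): "
    "ext4_lookup:1590: inode #55838123: comm updatedb.mlocat: iget: checksum invalid",
    "Apr 20 16:05:57 spike3 kernel: [  887.007328] EXT4-fs error (device sda2): "
    "ext4_lookup:1590: inode #55842004: comm GMPThread: iget: checksum invalid",
    "Apr 20 16:08:53 spike3 kernel: [ 1062.821504] EXT4-fs error (device sda2): "
    "ext4_lookup:1590: inode #55842002: comm DOM Worker: iget: checksum invalid",
]

# Assumed layout: device, reporting function and line, inode, process name, message.
PATTERN = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): "
    r"(?P<func>\w+):(?P<line>\d+): "
    r"inode #(?P<inode>\d+): "
    r"comm (?P<comm>.+?): "
    r"(?P<msg>.+)$"
)

# Assumption: 4096-byte blocks and 256-byte inodes, so 16 inodes per block.
INODES_PER_BLOCK = 4096 // 256


def parse(line):
    """Return the fields of one EXT4-fs error line as a dict, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["inode"] = int(fields["inode"])
    fields["line"] = int(fields["line"])
    return fields


def same_inode_block(a, b):
    """True if inodes a and b share an inode-table block (under the assumption above)."""
    return (a - 1) // INODES_PER_BLOCK == (b - 1) // INODES_PER_BLOCK


events = [e for e in map(parse, LOG_LINES) if e is not None]
inodes = sorted(e["inode"] for e in events)
span = inodes[-1] - inodes[0]

print("inodes:", inodes, "span:", span)
print("Apr 20 pair in one block:", same_inode_block(inodes[1], inodes[2]))
```

If the geometry assumption holds, the two inodes reported on Apr 20
would sit in the same inode-table block, which fits the idea of a single
bad write. Running dumpe2fs -h on the device would show the real inode
size and block size.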

The two messages logged within 3 minutes of each other, and the hang
without a logged message, suggest that the logged errors were symptoms
rather than causes. However, an unlogged error of the same type (for
instance, reading and then using bad data that has no built-in
checksum) seems plausible as the cause of the hang.

From the logged errors:

The checksum computed by the OS for the data read from the block device
differs from the checksum computed for the data at the time it was sent
to the block device.

No block read error is reported. If true, that implies that the data on
the device is unchanged from what was written by the device firmware.

That implies, in turn, that the inode data either were incorrect when
received by the block device or were corrupted on the block device before
completion of the write operation.
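
The chain of reasoning above can be illustrated with a toy example. ext4
metadata checksums are based on CRC32C; the sketch below is not the
kernel's inode checksum (which also covers other fields), just a minimal
demonstration, with made-up helper names write_record and read_record,
of how a checksum stored alongside data exposes a change made after the
checksum was computed even though the read itself succeeds.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def write_record(payload: bytes) -> bytes:
    """Hypothetical 'write': append the payload's checksum to the payload."""
    return payload + crc32c(payload).to_bytes(4, "little")


def read_record(record: bytes):
    """Hypothetical 'read': return (payload, checksum_ok)."""
    payload = record[:-4]
    stored = int.from_bytes(record[-4:], "little")
    return payload, crc32c(payload) == stored


# Data as the file system intended it, checksummed before being sent out.
good = write_record(b"inode contents as the file system intended")
_, ok_clean = read_record(good)

# Flip a single bit after the checksum was computed, as a faulty device
# or data path might. The read still "succeeds"; only the check fails.
damaged = bytearray(good)
damaged[5] ^= 0x01
_, ok_damaged = read_record(bytes(damaged))

print("clean record verifies:  ", ok_clean)
print("damaged record verifies:", ok_damaged)
```

Data without such a checksum would be read back and used with the
flipped bit in place, and nothing would be logged.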

The first would indicate a bug in the ext4 file system. That seems a
stretch in view of the maturity and widespread use of ext4 (including by
me) on GNU/Linux systems. Still, a file system is an extremely complex
and subtle piece of code, probably running on multi-CPU hardware that may
present unique issues. It might be worth looking for ext4 bug reports
that resonate with this. If there are any (but I know of no reason to
suspect it), installing on a different file system could be a solution.
In addition to ext[2,3,4] I have used jfs and xfs on systems for quite a
few years, and found them stable and reliable. The last time I looked,
they were available installer choices. For a laptop used portably, ZFS
(from buster-backports) also is a reasonable candidate with built-in
encryption capability, although installation requires quite a bit more
effort than using the installer.

The second indicates a problem, probably in firmware, within the block
device. I have seen such, and it could be worth looking into whether the
device manufacturer has released firmware updates, and applying the
latest if different from what now is present on the device. I lean
toward that rather than a file system bug.
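
To find out which firmware the drive currently runs, smartctl -i (from
the smartmontools package already in use here) prints a "Firmware
Version:" line for ATA and NVMe devices. A small sketch that extracts
it follows; the function names are mine, /dev/sda is assumed from the
sda2 in the logs, and the command normally needs root.

```python
import re
import subprocess


def parse_firmware(smartctl_info: str):
    """Return the 'Firmware Version' field from `smartctl -i` output, or None."""
    m = re.search(r"^Firmware Version:\s*(.+)$", smartctl_info, re.MULTILINE)
    return m.group(1).strip() if m else None


def drive_firmware(device: str = "/dev/sda"):
    """Run `smartctl -i` on the device (assumed path) and return its firmware version."""
    result = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses nonzero exit codes for warnings as well
    )
    return parse_firmware(result.stdout)

# Usage, as root:  print(drive_firmware("/dev/sda"))
```

The reported version can then be compared with whatever the drive
manufacturer or Dell lists as current for that model.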

A five month old machine should be under warranty, although I do not
know whether installing Linux would affect that. It would be worth
looking into, and should let you hand off the firmware upgrade for, or
replacement of, the block device to the vendor.


Regards,
Tom Dial

>
> Thanks in advance!
> Ralph
> 

