
Invalid GART PTE entry errors during bulk data transfers



Hello,

If this message does not belong here, please let me know where it should be posted, thanks.
See also http://www.comtechnique.ch/index.php?view=article&id=14


Recently I installed Debian Linux 6 (Squeeze, kernel 2.6.32-5-amd64 #1 SMP) on an IBM eServer platform. The system has dual AMD Opteron processors.

While transferring a large amount of data from the server that this machine is meant to replace, I noticed the following errors appearing in my ssh sessions roughly every 4 minutes:

Message from syslogd@jupiter at Jul 24 07:30:07 ...
kernel:[43618.440106]  Northbridge Error, node 0

Message from syslogd@jupiter at Jul 24 07:30:07 ...
kernel:[43618.440304] Invalid GART PTE entry during table walk.

The errors appeared regularly, but apparently only during very large data transfers across the network. As soon as the file transfers (using rsync) completed, the errors stopped. The messages show up in every ssh session I have open to that server.

After some searching, I found a Linux kernel patch from Borislav Petkov at AMD in which the exact error message is listed. The following document from AMD gave me the best information, but it does not explain why the errors appear in the ssh sessions, much less why they appear during bulk data transfers. AMD states that these messages should be suppressed.

http://support.amd.com/us/Processor_TechDocs/26094.PDF

On Page 333 I read:

------------------------------

12.10.1 GART Table Walk Error Reporting

This error is typically caused by a software graphics driver that improperly reserves or allocates aperture pages in the GART, resulting in benign visual artifacts which are often undetected on other platforms. Setting MC4_CTL[10] allows software developers to debug this error; the resulting benign machine check errors can, however, confuse an end user. For this reason, AMD recommends that the BIOS developers disable this function by setting bit 10 of MC4_CTL_MASK register (MSR C001_0048h) to a value of 1. This bit must be set before MC4_CTL[10] bit is set. AMD also recommends adding a setup option to the BIOS setup menu. The following should be displayed in the setup option:

Gart Table Walk Error MC reporting: Disabled/Enabled.

The default setting is disabled. The device driver developer may enable this function for implementation and testing purposes. Also, a help message should be added with this setup option.
An example of the help message is:

This option should remain disabled for normal operation.

-----------------------------------
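
To make the register and bit in the excerpt concrete, here is a minimal sketch of how one might check from Linux whether bit 10 of MC4_CTL_MASK is already set. Only the MSR address (C001_0048h) and the bit number come from the AMD text above. The rest is my assumption: that the "msr" kernel module is loaded so that /dev/cpu/N/msr exists, that the script runs as root, and the helper names are made up for illustration. It only reads the register and does not change anything.

```python
import os
import struct

MC4_CTL_MASK_MSR = 0xC0010048  # MSR address quoted from the AMD document
GART_WALK_BIT = 10             # bit 10 masks GART table walk error reporting


def gart_reporting_masked(msr_value: int) -> bool:
    """Return True if bit 10 is set, i.e. the reporting is masked (disabled)."""
    return bool(msr_value & (1 << GART_WALK_BIT))


def with_gart_reporting_masked(msr_value: int) -> int:
    """Return the value with bit 10 set and all other bits left unchanged."""
    return msr_value | (1 << GART_WALK_BIT)


def read_msr(cpu: int, msr: int) -> int:
    """Read a 64-bit MSR via the Linux msr driver.

    Assumption: running as root with the 'msr' module loaded.
    The driver uses the file offset as the MSR number.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    value = read_msr(0, MC4_CTL_MASK_MSR)
    state = "masked (disabled)" if gart_reporting_masked(value) else "not masked"
    print(f"MC4_CTL_MASK = {value:#018x}, GART table walk reporting: {state}")
```

On a dual-processor system the other CPU numbers under /dev/cpu would presumably need to be checked as well. Note also that the AMD text says the mask bit must be set before MC4_CTL[10] is set, so setting it afterwards from userspace may not be equivalent to a BIOS option.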

It doesn't seem to be a real problem to me, but does anyone here have any further knowledge on this issue?

Thanks, kind regards,
Jaap Hoetmer



