
Bug#416374: marked as done (kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!)



Your message dated Wed, 4 Apr 2007 17:05:02 -0700
with message-id <20070405000502.GC20124@dario.dodds.net>
and subject line kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?! 
has caused the attached Bug report to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what I am
talking about this indicates a serious mail system misconfiguration
somewhere.  Please contact me immediately.)

Debian bug tracking system administrator
(administrator, Debian Bugs database)

--- Begin Message ---
Package: kernel
Severity: critical
Justification: causes serious data loss

Hi everybody.

I'm currently (together with others) investigating a severe data
corruption problem that many users might suffer from.

A short description: when you verify many GB of data over and over with
md5sums (or another hash), mismatches are found.
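
A minimal sketch of the kind of repeated checksum test described above.
It is an illustration, not the script used in the lkml tests: the
function names file_md5 and count_mismatches are hypothetical, and it
assumes the test file is far larger than main memory, since otherwise
re-reads may be served from the page cache and never reach the disk.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so files larger than RAM can be hashed.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_mismatches(path, passes=10):
    """Hash the same file repeatedly (hypothetical helper).

    The first digest is taken as the reference; the return value is the
    number of later passes whose digest differs from it.
    """
    reference = file_md5(path)
    return sum(1 for _ in range(passes) if file_md5(path) != reference)
```

On a healthy system count_mismatches should return 0 for an unchanged
file; any other value means at least one read returned different bytes.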

We do not yet know the real reason for the problem, but it might relate
to Opteron (and perhaps Athlon) CPUs and/or Nvidia chipsets (mainboard).
So it might be a hardware design error (but a kernel error is also
possible).
This is definitely not a hardware fault limited to my system, as many
other users on lkml reported the problem (and we all did very extensive
hardware tests).

The error occurs only if one has so much memory that the system uses
memory hole mapping (and the hardware iommu).
At lkml we have so far found two "solutions" (I consider them
workarounds, as we don't know exactly why they solve the problem):
1) Disabling memory hole mapping in the system BIOS. The downside is
that there is no memory hole mapping at all, and the user loses much
of his main memory (in my case 1.5 GB).
2) Setting iommu=soft. The user keeps his full memory, and in all our
tests (at least as far as I am informed) the error did _not_ occur. We
run a great many tests, as I and someone else administer some big Linux
clusters.
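
A minimal sketch for checking whether a kernel command line already
carries the iommu=soft workaround. The helper names are hypothetical;
the sketch assumes the option may appear as one entry of a
comma-separated iommu= list and that the last iommu= option on the
line is the one to inspect.

```python
def iommu_setting(cmdline):
    """Return the value of the last iommu= option in cmdline, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("iommu="):
            # Assumption: a later iommu= option replaces an earlier one.
            value = token[len("iommu="):]
    return value


def uses_soft_iommu(cmdline):
    """True if the inspected iommu= option includes the 'soft' entry."""
    value = iommu_setting(cmdline)
    return value is not None and "soft" in value.split(",")


def running_kernel_uses_soft_iommu(path="/proc/cmdline"):
    """Check the command line the running Linux kernel was booted with."""
    with open(path) as handle:
        return uses_soft_iommu(handle.read())
```

For example, uses_soft_iommu("root=/dev/sda1 ro iommu=soft") is true,
while a command line without any iommu= option gives false.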

Windows users generally do not suffer from this corruption, as Windows
(at least until Vista) was not able to make use of the hardware iommu
and always uses the software iommu.
The Intel CPUs with EM64T/Intel64 don't suffer from this problem either,
as they don't have a hardware iommu (at least as far as I know).

We are not yet sure if this is a large-scale problem or affects only
some special hardware combinations. We do however think that the issue
occurs only with PCI DMA accesses. (Tests showed that when disabling
DMA, or at least using slower DMA modes on the disks, the issue
disappeared.)
The problem is that the vendors (at least Nvidia) do not help very much;
they didn't even answer my mails.
And most "normal" users won't notice this problem, as they don't have
enough main memory, and even if they have, the error occurs very rarely
(perhaps some 100 bytes every 30 GB <- only a very imprecise scale).

What I suggest now:
As this is a very grave issue, I suggest

- to configure all the default kernels for etch that may be affected (as
far as I know these are the amd64-k8 and amd64-generic kernels; perhaps
the i386 packages too, have a look at lkml for this) to use iommu=soft.
- to update all packages in sarge and woody (as far as they might be
affected)
- to put some warnings in the packages where users might configure their
own kernel and the boot-loaders.

Have a look at this thread on lkml
http://marc.theaimsgroup.com/?t=116502121800001&r=1&w=2 for in-depth
information.
It also contains links to some previous threads. There are also some
posts to lkml on this topic in separate threads (e.g. "amd64 iommu
causing corruption? (was Re: data corruption with nvidia chipsets and
IDE/SATA drives // memory hole mapping related bug?!)").

Best wishes,
Chris.

btw: please CC me as I'm off-list at the moment.
PS: I'll also write this to the debian-kernel mailing list.



-- System Information:
Debian Release: 4.0
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)
Shell:  /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18
Locale: LANG=en_DE@scientia.net, LC_CTYPE=en_DE@scientia.net (charmap=UTF-8)


begin:vcard
fn:Mitterer, Christoph Anton
n:Mitterer;Christoph Anton
email;internet:calestyo@scientia.net
x-mozilla-html:TRUE
version:2.1
end:vcard


--- End Message ---
--- Begin Message ---
clone 416374 -1
reassign -1 installation-guide-amd64
reopen -1
found -1 20070319
severity -1 important
thanks

The following explanatory text has been added to the release notes in CVS:

 5.1.7 Data corruption with Hardware IOMMU on Nvidia chipsets

 A problem has been identified on AMD64 systems with Nvidia chipsets and more
 than 3GB of RAM that causes sporadic data corruption when the hardware IOMMU
 is used. This problem is still under investigation by the Linux kernel
 developers and the hardware manufacturers, and no official upstream fix has
 been released. To protect the integrity of their data, users of these
 systems are advised to manually disable the use of hardware IOMMU at boot
 time by adding iommu=soft to their kernel boot options until a correct
 solution can be found.

 More information about this issue is available in Debian bug #404148 and
 Linux Kernel bug #7768.

I am therefore closing this bug.  I'm also cloning a copy of this report to
the installation-guide so the problem can be documented there as well, as
discussed in the log of bug #404148.

-- 
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
vorlon@debian.org                                   http://www.debian.org/

--- End Message ---
