Re: Fwd: "cuda error cudastreamcreate",

On Wed, Jun 15, 2011 at 8:22 AM, Francesco Pietra <chiendarret@gmail.com> wrote:

Running "nvidia-smi -L" as root makes the graphics cards visible
again. The visibility is lost at every boot, so it is a small
problem, or no problem at all. francesco

---------- Forwarded message ----------
From: Francesco Pietra <chiendarret@gmail.com>
Date: Wed, Jun 15, 2011 at 4:37 PM
Subject: Fwd: Fwd: "cuda error cudastreamcreate",
To: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>, amd64 Debian
<debian-amd64@lists.debian.org>

The simulation (pressure equilibration) completed successfully.
The next run (just a continuation of the previous pressure
equilibration) failed, again with 'Device Emulation (CPU)'; see the
log file below. I tried again and got the same error.

# modinfo nvidia
filename: /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
alias: char-major-195-*
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: i2c-core
vermagic: 2.6.38-2-amd64 SMP mod_unload modversions
parm: NVreg_EnableVia4x:int
parm: NVreg_EnableALiAGP:int
parm: NVreg_ReqAGPRate:int
parm: NVreg_EnableAGPSBA:int
parm: NVreg_EnableAGPFW:int
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_RemapLimit:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UseVBios:int
parm: NVreg_RMEdgeIntrCheck:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_EnableMSI:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RmMsg:charp
parm: NVreg_NvAGP:int

However:

$ nvidia-smi -L
Could not open device /dev/nvidia1 (no such file)
Failed to initialize NVML: unknown error.

I am unable to draw technical conclusions from this 'unknown error'. I
wonder whether other information can be extracted to fix the problems.
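
One thing that can be checked directly is whether the device nodes
are present, since the nvidia-smi message above complains about
/dev/nvidia1. Below is a minimal Python sketch, assuming the usual
node names /dev/nvidia0 ... /dev/nvidiaN-1 plus /dev/nvidiactl. The
function name, the dev_dir parameter and the GPU count of 2 are my
own choices for illustration, not part of any NVIDIA tool.

```python
import os

def missing_nvidia_nodes(gpu_count, dev_dir="/dev"):
    """Return the expected NVIDIA device nodes that are absent from dev_dir."""
    # One node per GPU plus the control node (names assumed, see above).
    expected = [f"nvidia{i}" for i in range(gpu_count)] + ["nvidiactl"]
    return [os.path.join(dev_dir, name)
            for name in expected
            if not os.path.exists(os.path.join(dev_dir, name))]

if __name__ == "__main__":
    # Two GPUs are assumed here; adjust the count for other machines.
    for path in missing_nvidia_nodes(2):
        print("missing:", path)
```

If this prints missing nodes right after boot and nothing after a
root run of nvidia-smi, the failure would be a device-node problem
rather than a problem with the kernel module itself.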

Thanks for advice.

francesco

Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
Info: 1 NAMD CVS-2011-06-04 Linux-x86_64-CUDA 6 gig64 francesco
Info: Running on 6 processors, 6 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00658393 s
Pe 2 sharing CUDA device 0 first 0 next 3
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'Device
Emulation (CPU)' Mem: 0MB Rev: 9999.9999
FATAL ERROR: CUDA error cudaStreamCreate on Pe 2 (gig64 device 0): no
CUDA-capable device is available

---------- Forwarded message ----------
From: Francesco Pietra <chiendarret@gmail.com>
Date: Wed, Jun 15, 2011 at 9:04 AM
Subject: Re: Fwd: "cuda error cudastreamcreate",
To: Fabricio Cannini <fabricio@versatushpc.com.br>, Lennart Sorensen
<lsorense@csclub.uwaterloo.ca>, amd64 Debian
<debian-amd64@lists.debian.org>

The "nvidia-smi -L" output came from a machine belonging to Jim
Phillips, the main developer of NAMD. He provided it to show that the
command should also work with my GTX 470 cards.

That said, my problems seem to have been solved by following Lennart's
instructions. The driver was rebuilt (dated 15 June) and the NAMD
simulation started normally. However, we have to wait before claiming
full victory. Please see below.

In retrospect, the nvidia.ko I had before, dated 5 June, must have
also been built within Debian. Renaming it no_nvidia.ko prevented
rebuilding for the reasons that Lennart clarified.

For some reason, the previous installation of nvidia.ko must have had
some problem: for example, "nvidia-smi -L" did not work (there
was a single installation of nvidia-smi, "nvidia-smi 270.41.19-1"),
while the "modinfo nvidia" output was correct. Now both are correct:

$ nvidia-smi -L
GPU 0: GeForce GTX 470 (UUID: N/A)
GPU 1: GeForce GTX 470 (UUID: N/A)

# modinfo nvidia
filename: /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
alias: char-major-195-*
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: i2c-core
vermagic: 2.6.38-2-amd64 SMP mod_unload modversions
parm: NVreg_EnableVia4x:int
parm: NVreg_EnableALiAGP:int
parm: NVreg_ReqAGPRate:int
parm: NVreg_EnableAGPSBA:int
parm: NVreg_EnableAGPFW:int
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_RemapLimit:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UseVBios:int
parm: NVreg_RMEdgeIntrCheck:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_EnableMSI:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RmMsg:charp
parm: NVreg_NvAGP:int
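
Since the module was rebuilt through DKMS, it may be worth confirming
that the nvidia.ko reported by modinfo was built for the running
kernel. Below is a minimal Python sketch, assuming the "key: value"
layout shown in the modinfo output above. The helper names
parse_modinfo and built_for_kernel are made up for this example.

```python
import platform

def parse_modinfo(text):
    """Parse 'key: value' lines from modinfo output into a dict of lists."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only; values such as
        # 'NVreg_EnableMSI:int' contain further colons.
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields

def built_for_kernel(modinfo_text, kernel_release=None):
    """True if the module's vermagic names the given (or running) kernel."""
    release = kernel_release or platform.release()
    vermagic = parse_modinfo(modinfo_text).get("vermagic", [""])[0]
    # The first whitespace-separated token of vermagic is the kernel release.
    return vermagic.split()[:1] == [release]
```

The text can be obtained with, for example,
subprocess.run(["modinfo", "nvidia"], capture_output=True, text=True).stdout.
A mismatch would point to a stale module. A match does not prove the
driver is healthy, only that it was built for this kernel.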

I said above that time will tell whether the system is stable. In fact,
this morning the NAMD simulation did not start (I recall commands from
the shell history, so typing errors can be ruled out). I had not
carried out any amd64 upgrade in the meantime. From the simulation log:

Info: Charm++/Converse parallel runtime startup completed at 0.00989103 s
Pe 2 sharing CUDA device 0 first 0 next 3
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'Device
Emulation (CPU)' Mem: 0MB Rev: 9999.9999
FATAL ERROR: CUDA error cudaStreamCreate on Pe 2 (gig64 device 0): no
CUDA-capable device is available

'Device Emulation (CPU)' indicates (for reasons that are unclear to me)
that something has gone wrong.
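
As far as I understand, older CUDA runtimes report a placeholder
device with this name and revision 9999.9999 when no real GPU can be
opened, which would fit the "no CUDA-capable device is available"
message. To spot this early in long logs, here is a minimal Python
sketch that looks for both markers. The function name is made up, and
the patterns are taken only from the log excerpts in this thread, so
other NAMD versions may word things differently.

```python
import re

def cuda_binding_problems(log_text):
    """Return log fragments suggesting NAMD did not bind to a real GPU."""
    # Collapse whitespace so markers wrapped across lines still match.
    flat = re.sub(r"\s+", " ", log_text)
    problems = []
    if "'Device Emulation (CPU)'" in flat:
        problems.append("bound to 'Device Emulation (CPU)'")
    # Record which CUDA call failed, e.g. cudaStreamCreate.
    problems += re.findall(r"FATAL ERROR: CUDA error \w+", flat)
    return problems
```

An empty list only means that neither marker was found. It does not
show that the run used the intended cards.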

On a second identical attempt (after having explored the driver
location and carried out info commands), NAMD simulation started, with
the correct log output:

Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
Info: 1 NAMD CVS-2011-06-04 Linux-x86_64-CUDA 6 gig64 francesco
Info: Running on 6 processors, 6 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00650811 s

We will see whether this failure/success pattern shows up again (a
simulation now lasts several hours, which would be days on an
8-processor machine). If it fails again, there are many possible
causes, including problems with the NAMD code.

I was so discouraged yesterday that I alluded to changing the driver
source, which was unfair.

Thanks a lot
francesco

On Wed, Jun 15, 2011 at 2:22 AM, Fabricio Cannini
<fabricio@versatushpc.com.br> wrote:
> On Tuesday, 14 June 2011, at 16:01:57, Lennart Sorensen wrote:
>> On Tue, Jun 14, 2011 at 07:23:38PM +0200, Francesco Pietra wrote:
>> > I forgot to answer: yes, sometimes it works, sometimes not, everything
>> > being the same.
>> >
>> > As a matter of fact, after a day of failure, I have now renamed back
>> >
>> > /lib/modules/2.6.38-2-amd64/updates/dkms/no_nvidia.ko
>> >
>> > to
>> >
>> > /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
>> >
>> > and the NAMD simulation started regularly using both gtx 470. The
>> > machine had not been touched either.
>>
>> I wonder if having the 9800 card in there along with the 470 gtx cards
>> is confusing the driver. Maybe the card order is getting swapped around
>> on some boots.
>>
>> What is the 9800 doing in the box anyhow?
>
> Hi All.
>
> I'm thinking the same as Lennart. It seems to me that the order in which the
> cards are enumerated varies, thus confusing the application(s). I'd try to fix the
> order in /etc/X11/xorg.conf and see if it works. Look in the CUDA docs for how to
> do that.
>
> Good luck.
>
>
> --
> To UNSUBSCRIBE, email to debian-amd64-REQUEST@lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
> Archive: http://lists.debian.org/201106142122.04376.fcannini@gmail.com
>
>
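
On the question of card ordering raised above: one way to avoid
depending on a fixed order is to pick the cards by model name from
the "nvidia-smi -L" listing and pass the resulting indices to NAMD
(CUDA builds of NAMD accept a +devices option, as far as I know).
Below is a minimal Python sketch. The function names are made up, and
it assumes that the indices printed by nvidia-smi match those used by
the CUDA runtime, which is not guaranteed.

```python
import re

def parse_gpu_list(text):
    """Map GPU index to model name from 'nvidia-smi -L' style output."""
    gpus = {}
    for line in text.splitlines():
        # Example line: GPU 0: GeForce GTX 470 (UUID: N/A)
        m = re.match(r"GPU (\d+): (.+?)(?: \(UUID: .*\))?$", line.strip())
        if m:
            gpus[int(m.group(1))] = m.group(2)
    return gpus

def devices_argument(text, model="GTX 470"):
    """Build a comma-separated index list for the GPUs matching model."""
    gpus = parse_gpu_list(text)
    return ",".join(str(i) for i in sorted(gpus) if model in gpus[i])
```

With the two-card listing shown earlier in this thread,
devices_argument() gives "0,1". If another card were listed first,
the indices would shift accordingly.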

--
To UNSUBSCRIBE, email to debian-amd64-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

Archive: http://lists.debian.org/BANLkTimUuPNrKwcjy_2SyMwLDS4A1nCbXA@mail.gmail.com