Fwd: "cuda error cudastreamcreate",



I forgot to answer: yes, sometimes it works and sometimes it does not,
everything else being the same.

As a matter of fact, after a day of failures, I have now renamed

/lib/modules/2.6.38-2-amd64/updates/dkms/no_nvidia.ko

back to

/lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko

and the NAMD simulation started normally, using both GTX 470 cards. The
machine had not otherwise been touched.
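
For the record, quick ways to confirm that the driver is loaded and
that the device nodes exist would be something like (standard commands,
shown only as a sketch):

$ lsmod | grep nvidia
$ ls -l /dev/nvidia*
$ dmesg | grep -i nvidia
$ nvidia-smi -L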

francesco


---------- Forwarded message ----------
From: Francesco Pietra <chiendarret@gmail.com>
Date: Tue, Jun 14, 2011 at 6:38 PM
Subject: Re: "cuda error cudastreamcreate",
To: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>


The two GTX 470 cards are in place and are seen by Unix commands.
However, the specific check provided by the NAMD people, for example

jim@aberdeen>nvidia-smi -L
GPU 0: Tesla C870 (UUID:
GPU-798dee8502c5e13c-7dd72cfe-6069e259-8fd36a96-5163bf00fbbcb8e9f61eda54)
GPU 1: Tesla C870 (UUID:
GPU-ed96e9c4afb70d35-694f6869-981de52a-23e64327-917becef3aa20bfd0d66432c)
GPU 2: GeForce 9800 GTX/9800 GTX+ (UUID: N/A)

fails here:

$ which nvidia-smi
/usr/bin/nvidia-smi

$ nvidia-smi -L
NVIDIA: could not open the device file /dev/nvidiactl (no such device or address)
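
If /dev/nvidiactl (and /dev/nvidia0, /dev/nvidia1) are simply missing,
my understanding is that they can be created by hand, much as the boot
script in the NVIDIA getting-started guide does (not tried here, only a
sketch):

# /sbin/modprobe nvidia
# mknod -m 666 /dev/nvidiactl c 195 255
# mknod -m 666 /dev/nvidia0 c 195 0
# mknod -m 666 /dev/nvidia1 c 195 1

though of course that presupposes that nvidia.ko itself is in place.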


I renamed the "nvidia.ko" present in
/lib/modules/2.6.38-2-amd64/updates/dkms/ (which had been copied there
from another amd64 machine with a lower-end GeForce card), after which

# modinfo nvidia

no longer found /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko.

Rebooting the machine did not recreate "nvidia.ko"; modinfo gave the
same answer. Something must be wrong with my installation, PROVIDED
THAT merely rebooting is supposed to rebuild the module (see the dkms
commands after the package list below). The installed packages include:

gcc-4.4, gcc-4.5, gcc-4.6
libcuda1 270.41.19-1
libgl1-nvidia-glx 270.41.19-1
libnvidia-ml1 270.41.19-1
linux-headers-2.6-amd64 (2.6.38+34)
linux-headers-2.6.38-2-amd64 (2.6.38-5)
linux-headers-2.6.38-2-common (2.6.38-5)
linux-image-2.6-amd64 (2.6.38+34)
linux-image-2.6.38-2-amd64 (2.6.38-5)
linux-kbuild-2.6.38 (2.6.38-1)
nvidia-cuda-dev 3.2.16-2
nvidia-cuda-toolkit 3.2.16-2
nvidia-glx 270.41.19-1
nvidia-installer-cleanup 20110515+1
nvidia-kernel-common 20110515+1
nvidia-kernel-dkms 270.41.19-1
nvidia-smi 270.41.19-1
nvidia-support 20110515+1
nvidia-vdpau-driver 270.41.19-1
nvidia-xconfig 270.41.06-1
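
Since nvidia-kernel-dkms is installed, I would expect dkms to rebuild
the module for the running kernel. My understanding is that this can
also be driven by hand, roughly as follows (driver and kernel versions
taken from the list above):

# dkms status
# dkms build -m nvidia -v 270.41.19 -k 2.6.38-2-amd64
# dkms install -m nvidia -v 270.41.19 -k 2.6.38-2-amd64
# modinfo nvidia

If dkms status does not show nvidia, 270.41.19 as installed for
2.6.38-2-amd64, that would explain the missing nvidia.ko.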

Really painful. Users on the NAMD list install nvidia.ko according to
"http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Getting_Started_Linux.pdf",
so they cannot help much here. Still, I refrain from using that method,
to avoid frequent rebuilding.
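
For what it is worth, the method in that document boils down to the
upstream .run installer, something like (file name only illustrative):

# sh NVIDIA-Linux-x86_64-270.41.19.run

which compiles nvidia.ko against the current kernel headers and
therefore has to be repeated after every kernel upgrade; that is the
rebuilding I would like to avoid by staying with the packaged dkms
driver.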

Thanks
francesco

On Tue, Jun 14, 2011 at 5:57 PM, Lennart Sorensen
<lsorense@csclub.uwaterloo.ca> wrote:
> On Tue, Jun 14, 2011 at 07:54:16AM +0200, Francesco Pietra wrote:
>> Hello:
>> With a gaming machine
>> Gigabyte GA 890FXAUD5
>> Six-core AMD PhenomII 1075T
>> 2x GTX 470
>> Debian GNU-Linux amd64 wheezy
>>
>>
>> I run the NAMD code (molecular dynamics simulations) successfully. Now
>> I am having problems getting the GTX 470 cards to work, and I can't
>> tell whether it is a hardware or a software problem, and, if software,
>> whether the OS is involved. I am submitting the same problem to the
>> NAMD list, as it might be NAMD specific.
>>
>> When the code works, the top of the log file says:
>>
>> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
>> Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
>> Info: 1 NAMD  CVS-2011-06-04  Linux-x86_64-CUDA  6    gig64  francesco
>> Info: Running on 6 processors, 6 nodes, 1 physical nodes.
>> Info: CPU topology information available.
>> Info: Charm++/Converse parallel runtime startup completed at 0.00650811 s
>> Pe 5 sharing CUDA device 1 first 1 next 1
>> Pe 2 sharing CUDA device 0 first 0 next 4
>> Did not find +devices i,j,k,... argument, using all
>> Pe 5 physical rank 5 binding to CUDA device 1 on gig64: 'GeForce GTX
>> 470'  Mem: 1279MB  Rev: 2.0
>> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX
>> 470'  Mem: 1279MB  Rev: 2.0
>> Pe 0 sharing CUDA device 0 first 0 next 2
>> Pe 3 sharing CUDA device 1 first 1 next 5
>> Pe 1 sharing CUDA device 1 first 1 next 3
>> Pe 1 physical rank 1 binding to CUDA device 1 on gig64: 'GeForce GTX
>> 470'  Mem: 1279MB  Rev: 2.0
>> Pe 0 physical rank 0 binding to CUDA device 0 on gig64: 'GeForce GTX
>> 470'  Mem: 1279MB  Rev: 2.0
>> Pe 3 physical rank 3 binding to CUDA device 1 on gig64: 'GeForce GTX
>> 470'  Mem: 1279MB  Rev: 2.0
>> Pe 4 sharing CUDA device 0 first 0 next 0
>> Pe 4 physical rank 4 binding to CUDA device 0 on gig64: 'GeForce GTX
>> 470'  Mem: 1279MB  Rev: 2.0
>> Info: 1.64104 MB of memory in use based on CmiMemoryUsage
>> Info: Configuration file is min-02.conf
>>
>> When failure:
>>
>> Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
>> Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
>> Info: 1 NAMD  CVS-2011-06-04  Linux-x86_64-CUDA  6    gig64  francesco
>> Info: Running on 6 processors, 6 nodes, 1 physical nodes.
>> Info: CPU topology information available.
>> Info: Charm++/Converse parallel runtime startup completed at 0.0124412 s
>> Pe 5 sharing CUDA device 0 first 0 next 0
>> Pe 5 physical rank 5 binding to CUDA device 0 on gig64: 'Device
>> Emulation (CPU)'  Mem: 0MB  Rev: 9999.9999
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (gig64 device 0): no
>> CUDA-capable device is available
>> ------------- Processor 5 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (gig64 device
>> 0): no CUDA-capable device is available
>>
>> Did not find +devices i,j,k,... argument, using all
>> Pe 0 sharing CUDA device 0 first 0 next 1
>> Pe 0 physical rank 0 binding to CUDA device 0 on gig64: 'Device
>> Emulation (CPU)'  Mem: 0MB  Rev: 9999.9999
>> Pe 3 sharing CUDA device 0 first 0 next 4
>> Pe 3 physical rank 3 binding to CUDA device 0 on gig64: 'Device
>> Emulation (CPU)'  Mem: 0MB  Rev: 9999.9999
>> Pe 1 sharing CUDA device 0 first 0 next 2
>> Pe 1 physical rank 1 binding to CUDA device 0 on gig64: 'Device
>> Emulation (CPU)'  Mem: 0MB  Rev: 9999.9999
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 0 (gig64 device 0): no
>> CUDA-capable device is available
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 0 (gig64 device
>> 0): no CUDA-capable device is available
>>
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 3 (gig64 device 0): no
>> CUDA-capable device is available
>> ------------- Processor 3 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 3 (gig64 device
>> 0): no CUDA-capable device is available
>>
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 1 (gig64 device 0): no
>> CUDA-capable device is available
>> ------------- Processor 1 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 1 (gig64 device
>> 0): no CUDA-capable device is available
>>
>> Pe 2 sharing CUDA device 0 first 0 next 3
>> Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'Device
>> Emulation (CPU)'  Mem: 0MB  Rev: 9999.9999
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 2 (gig64 device 0): no
>> CUDA-capable device is available
>> ------------- Processor 2 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 2 (gig64 device
>> 0): no CUDA-capable device is available
>>
>> Pe 4 sharing CUDA device 0 first 0 next 5
>> Pe 4 physical rank 4 binding to CUDA device 0 on gig64: 'Device
>> Emulation (CPU)'  Mem: 0MB  Rev: 9999.9999
>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 4 (gig64 device 0): no
>> CUDA-capable device is available
>> ------------- Processor 4 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 4 (gig64 device
>> 0): no CUDA-capable device is available
>
> Hmm, I wonder if 'no CUDA-capable device is available' means none were
> found, or if it means that all of them were already busy.
>
> So sometimes it works and sometimes it doesn't?  Is this with the same
> code or is it working with some code and not with other code?
>
>> [0] Stack Traceback:
>>
>> --------------------------------
>>
>> In both cases:
>>
>> /var/lib/dkms/nvidia/270.41.19/2.6.38-2-amd64/x86_64/module/nvidia.ko
>>
>> /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
>>
>> are in order.
>>   I tried:
>>
>> nvidia-smi -r (or nvidia-smi -a)
>> NVIDIA: could not open the device file /dev/nvidia1 (no such file)
>> Failed to initialize NVML: unknown error.
>
> Don't know.  With one card I only have /dev/nvidia0 and /dev/nvidiactl.
> I would think nvidia1 would be a second card.
>
>> I am unsure whether these commands are for Tesla cards only.
>
> Having never done cuda or tesla things, I don't know unfortunately.
>
> --
> Len Sorensen
>

