
Re: CUDA error cudaStreamSynchronize(stream) and CUDA error in ComputeBondedCUDA



I do not know if this would work with this kind of computation but I would suggest you try and run the programme under gdb.

This should tell you where things go wrong. You might have to recompile the programme with debugging symbols enabled.
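
As a rough sketch of what I mean (untested on your setup): the script below assumes gdb is installed and namd2 is on your PATH, and it reuses the flags and min.conf from your own command lines. The helper name namd_under_gdb and the DRY_RUN switch are made up for illustration. By default it only prints the gdb command; set DRY_RUN=0 to actually launch it. CUDA_LAUNCH_BLOCKING=1 is a standard CUDA environment variable that makes kernel launches synchronous, which should make the error show up closer to the kernel that caused it.

```shell
#!/bin/sh
# Sketch only: assumes gdb is installed and namd2 is on PATH.
# DRY_RUN is a hypothetical switch: 1 (default) prints the command, 0 runs it.
DRY_RUN="${DRY_RUN:-1}"

# Make CUDA kernel launches synchronous so an error is reported at the
# launch that triggered it rather than at a later synchronisation point.
export CUDA_LAUNCH_BLOCKING=1

# Hypothetical helper: run the minimization on one GPU under gdb in batch
# mode ("run" starts the program, "bt" prints a backtrace once it stops).
namd_under_gdb() {
    device="$1"
    set -- gdb --batch -ex run -ex bt --args \
        namd2 +idlepoll +p12 +devices "$device" min.conf
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # show what would be executed
    else
        "$@"           # actually launch NAMD under gdb
    fi
}

namd_under_gdb 0
```

If gdb reports "No stack." after the run, the process exited by itself before gdb could stop it; in that case a breakpoint on the error path would be needed. For faults inside the GPU kernels themselves, NVIDIA's cuda-gdb may be the more suitable tool.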

Peter 

Sent from my phone. Please forgive misspellings and weird “corrections”

On 20 Nov 2022, at 18:08, Francesco Pietra <chiendarret@gmail.com> wrote:


Hello
Main board GA-X79-UD3 with two 680 GPUs
Debian10 Linux,
kernel 5.10.0-19-amd64
OpenGL 4.6.0
nvidia driver 470.141.03

Months ago, following an update/upgrade of the amd64 system, the GPUs, while still rendering correctly, became unable to run classical molecular dynamics simulations. I launch a minimization with NAMD on both GPUs or on only one of them (selected in software, or even by physically removing one GPU):

namd2 +idlepoll +p12 +devices 0,1 min.conf
namd2 +idlepoll +p12 +devices 0 min.conf
namd2 +idlepoll +p12 +devices 1 min.conf

NAMD sets up the simulation correctly, but when the computation starts and memory is accessed, it crashes with the error:

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program

"illegal memory access" is a software error (as also shown by the same failure occurring with either GPU used alone), but its origin has escaped all my attempts to track it down. I got no clues from the NAMD forum, so I hope to find some here.

Thanks for your kind attention

francesco pietra



