Bug#1082888: librocfft0-tests: read kernel buffer failed on Linux 6.10
On 2024-09-27 23:34, Cordell Bloor wrote:
>> $ sudo sysctl kernel.dmesg_restrict=0
>> $ sudo sysctl vm.overcommit_memory=2
> The log output after applying both changes:
>
> [ RUN ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_67108864_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_67108864_odist_67108864_ioffset_0_0_ooffset_0_0
> [ OK ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_67108864_single_op_batch_1_istride_1_CI_ostride_1_CI_idist_67108864_odist_67108864_ioffset_0_0_ooffset_0_0 (953 ms)
> [ RUN ]
> pow2_1D/accuracy_test.vs_fftw/complex_forward_len_134217728_double_ip_batch_4_istride_1_CI_ostride_1_CI_idist_134217728_odist_134217728_ioffset_0_0_ooffset_0_0
> command1 FAIL non-zero exit status 1
>
> The dmesg output from test after applying both changes:
Am I interpreting this right that the "Killed" disappeared? If so, then the issue should be reproducible by setting vm.overcommit_memory back to 0.
It would be nice to be certain of this.
> [50555.651205] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes:
> 8592035840 not enough memory for the allocation
> [50555.651226] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes:
> 8592035840 not enough memory for the allocation
> [50555.651233] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes:
> 8572432384 not enough memory for the allocation
> [50555.651237] __vm_enough_memory: pid: 57317, comm: rocfft-test, bytes:
> 8592166912 not enough memory for the allocation
> [50555.651261] show_signal_msg: 11 callbacks suppressed
> [50555.651263] rocfft-test[57317]: segfault at 3c0 ip 00007fab8c38937b
> sp 00007faa749fe558 error 6 in
> libfftw3.so.3.6.10[18937b,7fab8c224000+1c5000] likely on CPU 9 (core 4,
> socket 0)
> [50555.651276] Code: 2d 57 15 48 8e 06 00 c4 c1 65 5c d9 c5 e5 57 1d 3b
> 8e 06 00 c4 43 7d 05 d2 05 c4 e3 7d 05 db 05 c4 41 4d 5c ca c4 c1 4d 58
> f2 <c4> 43 7d 19 0c 0a 01 c4 41 79 29 0a c5 55 58 cb c5 d5 5c eb 4d 8b>
> I also just noticed that [2] is segfaulting, so there's clearly another
> issue even with the older kernel. I hadn't noticed that before. It
> didn't do that when rocfft 6.1.2 was first uploaded [4].
It seems that this is non-deterministic. Some tests complete, some don't. Sadly, we don't have dmesg for the older tests, but looking at the tail of the log [5] from just two days after [4], we can see a
> 7779s Memory access fault by GPU node-1 (Agent handle: 0x55790431e060) on address 0xffb895600000. Reason: Page not present or supervisor privilege. [...]
which could be related.
It looks non-deterministic because this only occurs occasionally, and at different locations in the test run. An easy way to spot this is to look at log sizes [6]: logs of completed runs tend to be ~640KB, and anything shorter means an early abort. This one [7] crashed almost immediately.
If it's not related, things become even more complicated...
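The log-size heuristic can be sketched as below. This is only an illustration: the function name and the 10% tolerance are my own assumptions, and the ~640KB baseline is just the rough figure observed in [6].

```python
# Rough heuristic: a log much smaller than a typical complete run
# suggests the test run aborted early.
TYPICAL_LOG_BYTES = 640 * 1024  # ~640KB, the rough size of a completed run


def likely_aborted(log_size_bytes: int,
                   typical_bytes: int = TYPICAL_LOG_BYTES,
                   tolerance: float = 0.9) -> bool:
    """Return True if the log is noticeably shorter than a complete run.

    The tolerance is a hypothetical cut-off, not a measured value.
    """
    return log_size_bytes < typical_bytes * tolerance


if __name__ == "__main__":
    for size in (12 * 1024, 300 * 1024, 640 * 1024):
        print(size, "aborted early?", likely_aborted(size))
```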
> See attached for rocminfo logs from Debian Stable. Here's the diff:
> Pool Info:
> Pool 1
> Segment: GLOBAL; FLAGS: COARSE GRAINED
> - Size: 2097152(0x200000) KB
> + Size: 31761860(0x1e4a5c4) KB
This is the pool from the gfx1035. It increased in size from 2GiB to ~30GiB (31761860 KiB, or about 32.5GB in decimal units).
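For reference, the conversion of the two pool sizes can be checked with a few lines of Python. This assumes that rocminfo's "KB" actually means KiB, which seems likely since 0x200000 KB is then exactly 2 GiB. Under that assumption the new pool works out to roughly 30.3 GiB (about 32.5 GB in decimal units).

```python
# Convert the rocminfo pool sizes from the diff above into GiB.
# Assumption: rocminfo reports sizes in KiB (1024 bytes), labelled "KB".
KIB_PER_GIB = 1024 ** 2


def kib_to_gib(size_kib: int) -> float:
    """Convert a size in KiB to GiB."""
    return size_kib / KIB_PER_GIB


old_pool_kib = 0x200000    # 2097152 KB on Debian Stable
new_pool_kib = 0x1E4A5C4   # 31761860 KB on the newer kernel

if __name__ == "__main__":
    print(f"old pool: {kib_to_gib(old_pool_kib):.2f} GiB")
    print(f"new pool: {kib_to_gib(new_pool_kib):.2f} GiB")
    print(f"new pool: {new_pool_kib * 1024 / 1e9:.2f} GB (decimal)")
```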
If overcommit was indeed the issue behind "Killed", then I suspect that the test malloc'ed so much memory that it eventually triggered the OOM killer once test and GPU together had consumed all physical memory, e.g. with a 32GiB test case computed on both GPU and CPU for the expected/actual comparison.
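As a rough sanity check on the sizes involved, the failing test case can be compared against the allocation that __vm_enough_memory rejected. The sketch below assumes interleaved double-precision complex data at 16 bytes per element (I read the "CI" in the test name as complex interleaved). Which buffer the rejected allocation actually belongs to is not something the logs tell us.

```python
# Compare the data size of the failing test case with the rejected allocation.
# Assumption: one complex double element takes 16 bytes (2 x 8-byte floats).
BYTES_PER_COMPLEX_DOUBLE = 16

length = 134_217_728   # len_134217728 from the test name
batch = 4              # batch_4 from the test name

buffer_bytes = length * batch * BYTES_PER_COMPLEX_DOUBLE
failed_alloc_bytes = 8_592_035_840  # "bytes:" value from the dmesg output
difference = failed_alloc_bytes - buffer_bytes

if __name__ == "__main__":
    print(f"one full buffer:   {buffer_bytes} bytes ({buffer_bytes / 2**30:.0f} GiB)")
    print(f"failed allocation: {failed_alloc_bytes} bytes")
    print(f"difference:        {difference} bytes ({difference / 2**20:.2f} MiB)")
```

A single buffer for this test case is exactly 8 GiB, and the rejected allocation is only about 2 MiB larger than that. So each of these failed allocations is on the order of one full-size buffer, and several of them are needed at the same time.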
Best,
Christian
>>> [1]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/33925/log.gz
>>> [2]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/34278/log.gz
>> [3]: https://sources.debian.org/src/rocfft/6.1.2-1/debian/tests/upstream-binaries/#L70
>>
> [4]:
> https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1035/r/rocfft/18220/log.gz
[5]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/18314/
[6]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/
[7]: https://ci.rocm.debian.net/packages/r/rocfft/unstable/amd64+gfx1035/23638/