Bug#1075724: rocblas: Gives SIGILL on CPUs without the f16c extension
Package: rocblas
Version: 5.5.1+dfsg-5
Tags: patch
When compiling llama.cpp with ROCm support and running it, I get an
illegal instruction crash in the binary. The cause seems to be that
rocblas is built with -mf16c.
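For background, f16c is the x86 extension providing the half/single
precision conversion instructions. Building with -mf16c allows the
compiler to emit those instructions anywhere in the object code, so the
resulting library can only run on CPUs that have the extension. As a
minimal sketch (my own illustration, not rocblas code), a program can
test for the extension at run time with the builtin both GCC and clang
provide:

  #include <stdio.h>

  int main(void)
  {
      /* __builtin_cpu_supports() checks CPUID at run time; both
         GCC and clang accept "f16c" as a feature name. */
      if (__builtin_cpu_supports("f16c"))
          printf("This CPU has f16c\n");
      else
          printf("No f16c; code built with -mf16c may SIGILL here\n");
      return 0;
  }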
I built llama.cpp using this command line:
HIPCXX=clang-17 cmake -H. -Bbuild -DGGML_HIPBLAS=ON -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
I see the crash after downloading a model from huggingface and starting
bin/llama-cli with this model. Running under valgrind, I get this
report from the crash:
==27243== Warning: set address range perms: large range [0x221c55000, 0x231e56000) (noaccess)
llama_kv_cache_init: ROCm0 KV buffer size = 256,00 MiB
llama_new_context_with_model: KV self size = 256,00 MiB, K (f16): 128,00 MiB, V (f16): 128,00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0,12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 164,00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12,01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x13 0xC0 0xC5 0xF0 0x57 0xC9 0xC5
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==27243== valgrind: Unrecognised instruction at address 0x1331a8a8.
==27243== at 0x1331A8A8: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13326E28: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13157CBA: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13155D51: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1314DB31: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1314B477: rocblas_gemm_batched_ex (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1305CD09: hipblasGemmBatchedEx (in
/usr/lib/x86_64-linux-gnu/libhipblas.so.0.1)
==27243== by 0x4AA55CD:
ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor
const*, ggml_tensor const*, ggml_tensor*) (in
/home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x4A94C71: ggml_backend_cuda_graph_compute(ggml_backend*,
ggml_cgraph*) (in /home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x4A1E61C: ggml_backend_sched_graph_compute_async (in
/home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x48D1A32: llama_decode (in
/home/pere/src/ki/llama.cpp/build/src/libllama.so)
==27243== by 0x13C4EC: llama_init_from_gpt_params(gpt_params&) (in
/home/pere/src/ki/llama.cpp/build/bin/llama-cli)
==27243== Your program just tried to execute an instruction that Valgrind
==27243== did not recognise. There are two possible reasons for this.
==27243== 1. Your program has a bug and erroneously jumped to a non-code
==27243== location. If you are running Memcheck and you just saw a
==27243== warning about a bad jump, it's probably your program's fault.
==27243== 2. The instruction is legitimate but Valgrind doesn't handle it,
==27243== i.e. it's Valgrind's fault. If you think this is the case or
==27243== you are not sure, please let us know and we'll try to fix it.
==27243== Either way, Valgrind will now raise a SIGILL signal which will
==27243== probably kill your program.
==27243==
==27243== Process terminating with default action of signal 4 (SIGILL)
==27243== Illegal opcode at address 0x1331A8A8
==27243== at 0x1331A8A8: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13326E28: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13157CBA: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13155D51: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1314DB31: ??? (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1314B477: rocblas_gemm_batched_ex (in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1305CD09: hipblasGemmBatchedEx (in
/usr/lib/x86_64-linux-gnu/libhipblas.so.0.1)
==27243== by 0x4AA55CD:
ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor
const*, ggml_tensor const*, ggml_tensor*) (in
/home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x4A94C71: ggml_backend_cuda_graph_compute(ggml_backend*,
ggml_cgraph*) (in /home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x4A1E61C: ggml_backend_sched_graph_compute_async (in
/home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x48D1A32: llama_decode (in
/home/pere/src/ki/llama.cpp/build/src/libllama.so)
==27243== by 0x13C4EC: llama_init_from_gpt_params(gpt_params&) (in
/home/pere/src/ki/llama.cpp/build/bin/llama-cli)
==27243==
==27243== HEAP SUMMARY:
==27243== in use at exit: 659,260,263 bytes in 3,380,913 blocks
==27243== total heap usage: 19,712,537 allocs, 16,331,624 frees, 5,271,145,975 bytes allocated
==27243==
==27243== LEAK SUMMARY:
==27243== definitely lost: 120 bytes in 3 blocks
==27243== indirectly lost: 2,422 bytes in 45 blocks
==27243== possibly lost: 18,964 bytes in 160 blocks
==27243== still reachable: 659,238,757 bytes in 3,380,705 blocks
==27243== of which reachable via heuristic:
==27243== multipleinheritance: 1,056 bytes in 12 blocks
==27243== suppressed: 0 bytes in 0 blocks
==27243== Rerun with --leak-check=full to see details of leaked memory
==27243==
==27243== For lists of detected and suppressed errors, rerun with: -s
==27243== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction (SIGILL)
According to Cory Bloor, the disassembly of those bytes shows that it is
the vcvtph2ps instruction causing the crash:
0: c4 e2 79 13 c0 vcvtph2ps xmm0,xmm0
5: c5 f0 57 c9 vxorps xmm1,xmm1,xmm1
9: c5 .byte 0xc5
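vcvtph2ps is the half-to-single conversion instruction from the F16C
extension, which is where the -mf16c flag comes in. To illustrate (a
hypothetical example, not the actual rocblas source), this is the kind
of code that compiles to vcvtph2ps when built with -mf16c:

  #include <immintrin.h>

  /* Convert four IEEE half precision values to float.  With -mf16c,
     _mm_cvtph_ps() becomes vcvtph2ps, the instruction in the
     disassembly above. */
  void half4_to_float4(const unsigned short in[4], float out[4])
  {
      __m128i h = _mm_loadl_epi64((const __m128i *) in); /* 4 x 16 bit */
      _mm_storeu_ps(out, _mm_cvtph_ps(h));
  }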
I managed to avoid the crash and get llama.cpp working by applying the
following patch and rebuilding rocblas:
--- rocblas-5.5.1+dfsg.orig/library/src/CMakeLists.txt
+++ rocblas-5.5.1+dfsg/library/src/CMakeLists.txt
@@ -411,7 +411,7 @@ endif()
 
 # -fno-gpu-rdc compiler option was used with hcc, so revisit feature at some point
 # GCC or hip-clang needs specific flags to turn on f16c intrinsics
-target_compile_options( rocblas PRIVATE -mf16c )
+#target_compile_options( rocblas PRIVATE -mf16c )
 
 # Do not allow Variable Length Arrays (use unique_ptr instead)
 target_compile_options( rocblas PRIVATE -Werror=vla )
Please consider including it in an upload to Debian.
According to https://github.com/ROCm/rocBLAS/issues/1422 and
<URL: https://github.com/ROCm/rocBLAS/commit/c6bc09073959a2881a701b88ae1ed9de469354f1 >,
the issue might already be fixed upstream, but I have not tested that
version.
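For reference, the usual fix for this class of problem is run time
dispatch: keep the f16c code in a separate file built with -mf16c, and
only call into it after a CPUID check, so the rest of the library stays
runnable on every CPU. A rough sketch of the pattern (names are made up
for illustration; I have not checked that this matches the upstream
commit):

  #include <stdio.h>

  /* In a real library, do_f16c() would live in a separate translation
     unit compiled with -mf16c; everything else is built without the
     flag, so no F16C instructions leak into common code paths. */
  static void do_f16c(void)   { puts("using the f16c path"); }
  static void do_scalar(void) { puts("using the portable path"); }

  int main(void)
  {
      /* Select the implementation once, at run time. */
      if (__builtin_cpu_supports("f16c"))
          do_f16c();
      else
          do_scalar();
      return 0;
  }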
See also <URL: https://lists.debian.org/debian-ai/2024/07/msg00007.html >.
--
Happy hacking
Petter Reinholdtsen