Re: Building llama.cpp for AMD GPU using only Debian packages?
[Petter Reinholdtsen]
> Thank you. I got it building. But I guess I am missing some
> preparation step, as I get this when running llama-cli:
>
> rocBLAS error: Could not initialize Tensile host: No devices found
> Aborted (SIGABRT)
I managed to figure out this one; it turned out to be a missing
permission issue. Adding my user to the 'render' group got me further.
I wish there were a way to grant GPU access without giving full access
to the render device.
Anyway, now I get a crash in
/usr/lib/x86_64-linux-gnu/librocblas.so.0.1. These are the last lines
from a valgrind run:
==27243== Warning: set address range perms: large range [0x221c55000, 0x231e56000) (noaccess)
llama_kv_cache_init: ROCm0 KV buffer size = 256,00 MiB
llama_new_context_with_model: KV self size = 256,00 MiB, K (f16): 128,00 MiB, V (f16): 128,00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0,12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 164,00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12,01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
vex amd64->IR: unhandled instruction bytes: 0xC4 0xE2 0x79 0x13 0xC0 0xC5 0xF0 0x57 0xC9 0xC5
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=1 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR: PFX.66=1 PFX.F2=0 PFX.F3=0
==27243== valgrind: Unrecognised instruction at address 0x1331a8a8.
==27243== at 0x1331A8A8: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13326E28: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13157CBA: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13155D51: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1314DB31: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1314B477: rocblas_gemm_batched_ex (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1305CD09: hipblasGemmBatchedEx (in /usr/lib/x86_64-linux-gnu/libhipblas.so.0.1)
==27243== by 0x4AA55CD: ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) (in /home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x4A94C71: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) (in /home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x4A1E61C: ggml_backend_sched_graph_compute_async (in /home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x48D1A32: llama_decode (in /home/pere/src/ki/llama.cpp/build/src/libllama.so)
==27243== by 0x13C4EC: llama_init_from_gpt_params(gpt_params&) (in /home/pere/src/ki/llama.cpp/build/bin/llama-cli)
==27243== Your program just tried to execute an instruction that Valgrind
==27243== did not recognise. There are two possible reasons for this.
==27243== 1. Your program has a bug and erroneously jumped to a non-code
==27243== location. If you are running Memcheck and you just saw a
==27243== warning about a bad jump, it's probably your program's fault.
==27243== 2. The instruction is legitimate but Valgrind doesn't handle it,
==27243== i.e. it's Valgrind's fault. If you think this is the case or
==27243== you are not sure, please let us know and we'll try to fix it.
==27243== Either way, Valgrind will now raise a SIGILL signal which will
==27243== probably kill your program.
==27243==
==27243== Process terminating with default action of signal 4 (SIGILL)
==27243== Illegal opcode at address 0x1331A8A8
==27243== at 0x1331A8A8: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13326E28: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13157CBA: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x13155D51: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1314DB31: ??? (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1314B477: rocblas_gemm_batched_ex (in /usr/lib/x86_64-linux-gnu/librocblas.so.0.1)
==27243== by 0x1305CD09: hipblasGemmBatchedEx (in /usr/lib/x86_64-linux-gnu/libhipblas.so.0.1)
==27243== by 0x4AA55CD: ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) (in /home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x4A94C71: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) (in /home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x4A1E61C: ggml_backend_sched_graph_compute_async (in /home/pere/src/ki/llama.cpp/build/ggml/src/libggml.so)
==27243== by 0x48D1A32: llama_decode (in /home/pere/src/ki/llama.cpp/build/src/libllama.so)
==27243== by 0x13C4EC: llama_init_from_gpt_params(gpt_params&) (in /home/pere/src/ki/llama.cpp/build/bin/llama-cli)
==27243==
==27243== HEAP SUMMARY:
==27243== in use at exit: 659,260,263 bytes in 3,380,913 blocks
==27243== total heap usage: 19,712,537 allocs, 16,331,624 frees, 5,271,145,975 bytes allocated
==27243==
==27243== LEAK SUMMARY:
==27243== definitely lost: 120 bytes in 3 blocks
==27243== indirectly lost: 2,422 bytes in 45 blocks
==27243== possibly lost: 18,964 bytes in 160 blocks
==27243== still reachable: 659,238,757 bytes in 3,380,705 blocks
==27243== of which reachable via heuristic:
==27243== multipleinheritance: 1,056 bytes in 12 blocks
==27243== suppressed: 0 bytes in 0 blocks
==27243== Rerun with --leak-check=full to see details of leaked memory
==27243==
==27243== For lists of detected and suppressed errors, rerun with: -s
==27243== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction (SIGILL)
I assume the CPU in my old test machine lacks some required feature,
and that the library needs a rebuild to avoid it, or that I need to
finish putting the new test machine together to get a more recent
CPU. :)
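For what it is worth, the unhandled instruction bytes in the valgrind
output (0xC4 0xE2 0x79 0x13 ...) appear to decode to vcvtph2ps, an F16C
half-precision conversion, so a quick look at the CPU flags should tell
whether the CPU really lacks the feature or whether it is only valgrind
that does not handle the instruction (the second case valgrind itself
mentions above):

  # List which of the relevant extensions the CPU advertises;
  # if 'f16c' is missing, the CPU rather than valgrind is to blame.
  grep -o -w -e f16c -e avx -e avx2 /proc/cpuinfo | sort -u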
--
Happy hacking
Petter Reinholdtsen