
Re: Bug#1063673: ITP: llama.cpp -- Inference of Meta's LLaMA model (and others) in pure C/C++



Mo, thanks again for the pointer. This was great advice.

On 2024-12-22 20:42, Mo Zhou wrote:
> No need to do that manually. The Glibc already provided that kind of
> dispatching functionality when you build multiple solibs with different
> baselines. Please check
> 
>   https://lists.debian.org/debian-devel/2019/04/msg00057.html
> 
> Or concretely,
> 
>   section "Hardware capabilities" from ld.so(8)
> 
> I believe this is what you were expecting. But it seems that the
> avx2/ avx512 dispatch is missing from the man page. Don't know
> what's happening for their support.

This was improved with glibc 2.33, which added glibc-hwcaps subdirectories
keyed to the amd64 v2-v4 microarchitecture levels (the plain library
directory acts as the v1 baseline) [2]. The older hwcaps mechanism was
declared legacy and dropped in 2.37.
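
On a recent glibc you can also ask the dynamic linker directly which
glibc-hwcaps subdirectories it would search on the running CPU. The
exact wording of the output may differ between versions, but on a
machine covering all levels it looks roughly like this:

  $ /lib64/ld-linux-x86-64.so.2 --help
  [...]
  Subdirectories of glibc-hwcaps directories, in priority order:
    x86-64-v4 (supported, searched)
    x86-64-v3 (supported, searched)
    x86-64-v2 (supported, searched)
  [...]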

Using this new approach, I was elated to see that it doesn't just work
out of the box, it also works with non-standard library paths
(llama.cpp's libraries have no stable ABI yet, so they are kept private
for now). And the baseline build should work fine with any other
dynamic linker implementation, so there is really no meaningful
downside to using hwcaps.
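
For anyone wondering how the variants end up in those paths: the idea
is simply to build the same libraries several times with a raised
-march baseline and install each set into the matching glibc-hwcaps
subdirectory, next to the plain baseline build. This is a rough sketch
only, not the actual debian/rules; the install paths, globs and CMake
flags here are assumptions for illustration:

    # Baseline (v1) build goes straight into the private libdir.
    LIBDIR=debian/tmp/usr/lib/x86_64-linux-gnu/llama.cpp
    cmake -S . -B build-v1 -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON
    cmake --build build-v1 -j"$(nproc)"
    find build-v1 -name 'lib*.so*' -exec install -D -t "$LIBDIR" {} +

    # One extra pass per higher level, installed under glibc-hwcaps/<level>,
    # where ld.so will prefer it when the CPU qualifies.
    for level in x86-64-v2 x86-64-v3 x86-64-v4; do
        cmake -S . -B "build-$level" -DCMAKE_BUILD_TYPE=Release \
              -DBUILD_SHARED_LIBS=ON \
              -DCMAKE_C_FLAGS="-march=$level" -DCMAKE_CXX_FLAGS="-march=$level"
        cmake --build "build-$level" -j"$(nproc)"
        find "build-$level" -name 'lib*.so*' \
             -exec install -D -t "$LIBDIR/glibc-hwcaps/$level" {} +
    done

The binary itself links against the plain path; the hwcaps lookup
happens transparently at load time, as the ldd outputs below show.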

Below are the outputs for the CPU backend of llama.cpp. They all come
from the same QEMU VM on endeavour (EPYC 9354P), with 8 cores and 64GiB
RAM assigned, booted once per run with a different -cpu argument, one
matching each level (an illustrative invocation is sketched after the
footnotes). You can see the levels get picked up by the dynamic linker,
and the resulting effect on performance [3].

Best,
Christian

[2]: https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels

[3]: I could have "hacked" this to run on bare metal, too, but the VM
     approach was less intrusive to our production CI host.
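
For completeness, the guests were started along these lines; this is an
illustrative invocation only (the disk image name is made up here), with
just the -cpu value differing between the four runs:

    # -cpu was host, EPYC, Opteron_G5 or Opteron_G3, one per run below
    qemu-system-x86_64 -enable-kvm -smp 8 -m 64G -cpu host \
        -drive file=unstable-guest.qcow2,if=virtio -nographic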


# -cpu host, this one has AVX512
root@unstable-guest:~# ldd /usr/bin/llama-bench
        linux-vdso.so.1 (0x00007fc25698a000)
        libllama.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v4/libllama.so (0x00007fc256735000)
        libggml.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v4/libggml.so (0x00007fc256726000)
        libggml-base.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v4/libggml-base.so (0x00007fc256677000)
        [...]
root@unstable-guest:~# llama-bench -m ggml-model-q4_0.gguf 
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       8 |         pp512 |        314.18 ± 0.29 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       8 |         tg128 |        137.38 ± 0.08 |


# -cpu EPYC, up to AVX2
root@unstable-guest:~# ldd /usr/bin/llama-bench
        linux-vdso.so.1 (0x00007f275b5a3000)
        libllama.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v3/libllama.so (0x00007f275b34e000)
        libggml.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v3/libggml.so (0x00007f275b33f000)
        libggml-base.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v3/libggml-base.so (0x00007f275b290000)
        [...]
root@unstable-guest:~# llama-bench -m ggml-model-q4_0.gguf 
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       8 |         pp512 |        281.55 ± 0.12 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       8 |         tg128 |        139.31 ± 0.08 |


# -cpu Opteron_G5, up to SSE4.2
root@unstable-guest:~# ldd /usr/bin/llama-bench
        linux-vdso.so.1 (0x00007f2d23bc1000)
        libllama.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v2/libllama.so (0x00007f2d2392e000)
        libggml.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v2/libggml.so (0x00007f2d2391e000)
        libggml-base.so => /usr/lib/x86_64-linux-gnu/llama.cpp/glibc-hwcaps/x86-64-v2/libggml-base.so (0x00007f2d2386e000)
        [...]
root@unstable-guest:~# llama-bench -m ggml-model-q4_0.gguf 
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       8 |         pp512 |         86.39 ± 0.53 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       8 |         tg128 |         66.94 ± 0.07 |


# -cpu Opteron_G3, baseline SSE2
root@unstable-guest:~# ldd /usr/bin/llama-bench
        linux-vdso.so.1 (0x00007faf3fb6c000)
        libllama.so => /usr/lib/x86_64-linux-gnu/llama.cpp/libllama.so (0x00007faf3f917000)
        libggml.so => /usr/lib/x86_64-linux-gnu/llama.cpp/libggml.so (0x00007faf3f908000)
        libggml-base.so => /usr/lib/x86_64-linux-gnu/llama.cpp/libggml-base.so (0x00007faf3f859000)
        [...]
root@unstable-guest:~# llama-bench -m ggml-model-q4_0.gguf 
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       8 |         pp512 |         48.25 ± 0.03 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |       8 |         tg128 |         41.14 ± 0.01 |

